Pull AFS updates from Al Viro:
"Assorted AFS stuff - ended up in vfs.git since most of that consists
of David's AFS-related followups to Christoph's procfs series"
* 'afs-proc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
afs: Optimise callback breaking by not repeating volume lookup
afs: Display manually added cells in dynamic root mount
afs: Enable IPv6 DNS lookups
afs: Show all of a server's addresses in /proc/fs/afs/servers
afs: Handle CONFIG_PROC_FS=n
proc: Make inline name size calculation automatic
afs: Implement network namespacing
afs: Mark afs_net::ws_cell as __rcu and set using rcu functions
afs: Fix a Sparse warning in xdr_decode_AFSFetchStatus()
proc: Add a way to make network proc files writable
afs: Rearrange fs/afs/proc.c to remove remaining predeclarations.
afs: Rearrange fs/afs/proc.c to move the show routines up
afs: Rearrange fs/afs/proc.c by moving fops and open functions down
afs: Move /proc management functions to the end of the file
At the moment, afs_break_callbacks calls afs_break_one_callback() for each
separate FID it was given, and the latter looks up the volume individually
for each one.
However, this is inefficient if two or more FIDs have the same vid as we
could reuse the volume. This is complicated by cell aliasing whereby we
may have multiple cells sharing a volume and can therefore have multiple
callback interests for any particular volume ID.
At the moment afs_break_one_callback() scans the entire list of volumes
we're getting from a server and breaks the appropriate callback in every
matching volume, regardless of cell. This scan is done for every FID.
Optimise callback breaking by the following means:
(1) Sort the FID list by vid so that all FIDs belonging to the same volume
are clumped together.
This is done through the use of an indirection table as we cannot do
an insertion sort on the afs_callback_break array as we decode FIDs
into it as we subsequently also have to decode callback info into it
that corresponds by array index only.
We also don't really want to bubblesort afterwards if we can avoid it.
(2) Sort the server->cb_interests array by vid so that all the matching
volumes are grouped together. This permits the scan to stop after
finding a record that has a higher vid.
(3) When breaking FIDs, we try to keep server->cb_break_lock as long as
possible, caching the start point in the array for that volume group
as long as possible.
It might make sense to add another layer in that list and have a
refcounted volume ID anchor that has the matching interests attached
to it rather than being in the list. This would allow the lock to be
dropped without losing the cursor.
Signed-off-by: David Howells <dhowells@redhat.com>
Alter the dynroot mount so that cells created by manipulation of
/proc/fs/afs/cells and /proc/fs/afs/rootcell and by specification of a root
cell as a module parameter will cause directories for those cells to be
created in the dynamic root superblock for the network namespace[*].
To this end:
(1) Only one dynamic root superblock is now created per network namespace
and this is shared between all attempts to mount it. This makes it
easier to find the superblock to modify.
(2) When a dynamic root superblock is created, the list of cells is walked
and directories created for each cell already defined.
(3) When a new cell is added, if a dynamic root superblock exists, a
directory is created for it.
(4) When a cell is destroyed, the directory is removed.
(5) These directories are created by calling lookup_one_len() on the root
dir which automatically creates them if they don't exist.
[*] Inasmuch as network namespaces are currently supported here.
Signed-off-by: David Howells <dhowells@redhat.com>
Show all of a server's addresses in /proc/fs/afs/servers, placing the
second plus addresses on padded lines of their own. The current address is
marked with a star.
Signed-off-by: David Howells <dhowells@redhat.com>
The AFS filesystem depends at the moment on /proc for configuration and
also presents information that way - however, this causes a compilation
failure if procfs is disabled.
Fix it so that the procfs bits aren't compiled in if procfs is disabled.
This means that you can't configure the AFS filesystem directly, but it is
still usable provided that an up-to-date keyutils is installed to look up
cells by SRV or AFSDB DNS records.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
"This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
As Deepa writes:
'The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64
timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual replacement
becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions'
Thomas Gleixner adds:
'I think there is no point to drag that out for the next merge
window. The whole thing needs to be done in one go for the core
changes which means that you're going to play that catchup game
forever. Let's get over with it towards the end of the merge window'"
* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
pstore: Remove bogus format string definition
vfs: change inode times to use struct timespec64
pstore: Convert internal records to timespec64
udf: Simplify calls to udf_disk_stamp_to_time
fs: nfs: get rid of memcpys for inode times
ceph: make inode time prints to be long long
lustre: Use long long type to print inode time
fs: add timespec64_truncate()
Pull networking updates from David Miller:
1) Add Maglev hashing scheduler to IPVS, from Inju Song.
2) Lots of new TC subsystem tests from Roman Mashak.
3) Add TCP zero copy receive and fix delayed acks and autotuning with
SO_RCVLOWAT, from Eric Dumazet.
4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
Brouer.
5) Add ttl inherit support to vxlan, from Hangbin Liu.
6) Properly separate ipv6 routes into their logically independant
components. fib6_info for the routing table, and fib6_nh for sets of
nexthops, which thus can be shared. From David Ahern.
7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
messages from XDP programs. From Nikita V. Shirokov.
8) Lots of long overdue cleanups to the r8169 driver, from Heiner
Kallweit.
9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.
10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.
11) Plumb extack down into fib_rules, from Roopa Prabhu.
12) Add Flower classifier offload support to igb, from Vinicius Costa
Gomes.
13) Add UDP GSO support, from Willem de Bruijn.
14) Add documentation for eBPF helpers, from Quentin Monnet.
15) Add TLS tx offload to mlx5, from Ilya Lesokhin.
16) Allow applications to be given the number of bytes available to read
on a socket via a control message returned from recvmsg(), from
Soheil Hassas Yeganeh.
17) Add x86_32 eBPF JIT compiler, from Wang YanQing.
18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
From Björn Töpel.
19) Remove indirect load support from all of the BPF JITs and handle
these operations in the verifier by translating them into native BPF
instead. From Daniel Borkmann.
20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.
21) Allow XDP programs to do lookups in the main kernel routing tables
for forwarding. From David Ahern.
22) Allow drivers to store hardware state into an ELF section of kernel
dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.
23) Various RACK and loss detection improvements in TCP, from Yuchung
Cheng.
24) Add TCP SACK compression, from Eric Dumazet.
25) Add User Mode Helper support and basic bpfilter infrastructure, from
Alexei Starovoitov.
26) Support ports and protocol values in RTM_GETROUTE, from Roopa
Prabhu.
27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
Brouer.
28) Add lots of forwarding selftests, from Petr Machata.
29) Add generic network device failover driver, from Sridhar Samudrala.
* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
strparser: Add __strp_unpause and use it in ktls.
rxrpc: Fix terminal retransmission connection ID to include the channel
net: hns3: Optimize PF CMDQ interrupt switching process
net: hns3: Fix for VF mailbox receiving unknown message
net: hns3: Fix for VF mailbox cannot receiving PF response
bnx2x: use the right constant
Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
enic: fix UDP rss bits
netdev-FAQ: clarify DaveM's position for stable backports
rtnetlink: validate attributes in do_setlink()
mlxsw: Add extack messages for port_{un, }split failures
netdevsim: Add extack error message for devlink reload
devlink: Add extack to reload and port_{un, }split operations
net: metrics: add proper netlink validation
ipmr: fix error path when ipmr_new_table fails
ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
net: hns3: remove unused hclgevf_cfg_func_mta_filter
netfilter: provide udp*_lib_lookup for nf_tproxy
qed*: Utilize FW 8.37.2.0
...
Pull overflow updates from Kees Cook:
"This adds the new overflow checking helpers and adds them to the
2-factor argument allocators. And this adds the saturating size
helpers and does a treewide replacement for the struct_size() usage.
Additionally this adds the overflow testing modules to make sure
everything works.
I'm still working on the treewide replacements for allocators with
"simple" multiplied arguments:
*alloc(a * b, ...) -> *alloc_array(a, b, ...)
and
*zalloc(a * b, ...) -> *calloc(a, b, ...)
as well as the more complex cases, but that's separable from this
portion of the series. I expect to have the rest sent before -rc1
closes; there are a lot of messy cases to clean up.
Summary:
- Introduce arithmetic overflow test helper functions (Rasmus)
- Use overflow helpers in 2-factor allocators (Kees, Rasmus)
- Introduce overflow test module (Rasmus, Kees)
- Introduce saturating size helper functions (Matthew, Kees)
- Treewide use of struct_size() for allocators (Kees)"
* tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
treewide: Use struct_size() for devm_kmalloc() and friends
treewide: Use struct_size() for vmalloc()-family
treewide: Use struct_size() for kmalloc()-family
device: Use overflow helpers for devm_kmalloc()
mm: Use overflow helpers in kvmalloc()
mm: Use overflow helpers in kmalloc_array*()
test_overflow: Add memory allocation overflow tests
overflow.h: Add allocation size calculation helpers
test_overflow: Report test failures
test_overflow: macrofy some more, do more tests for free
lib: add runtime test of check_*_overflow functions
compiler.h: enable builtin overflow checkers and add fallback code
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
void *entry[];
};
instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
uses. It was done via automatic conversion with manual review for the
"CHECKME" non-standard cases noted below, using the following Coccinelle
script:
// pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
// sizeof *pkey_cache->table, GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@
- alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
// mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@
- alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
// Same pattern, but can't trivially locate the trailing element name,
// or variable name.
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
expression SOMETHING, COUNT, ELEMENT;
@@
- alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
+ alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)
Signed-off-by: Kees Cook <keescook@chromium.org>
Sometimes an in-progress call will stop responding on the fileserver when
the fileserver quietly cancels the call with an internally marked abort
(RX_CALL_DEAD), without sending an ABORT to the client.
This causes the client's call to eventually expire from lack of incoming
packets directed its way, which currently leads to it being cancelled
locally with ETIME. Note that it's not currently clear as to why this
happens as it's really hard to reproduce.
The rotation policy implement by kAFS, however, doesn't differentiate
between ETIME meaning we didn't get any response from the server and ETIME
meaning the call got cancelled mid-flow. The latter leads to an oops when
fetching data as the rotation partially resets the afs_read descriptor,
which can result in a cleared page pointer being dereferenced because that
page has already been filled.
Handle this by the following means:
(1) Set a flag on a call when we receive a packet for it.
(2) Store the highest packet serial number so far received for a call
(bearing in mind this may wrap).
(3) If, when the "not received anything recently" timeout expires on a
call, we've received at least one packet for a call and the connection
as a whole has received packets more recently than that call, then
cancel the call locally with ECONNRESET rather than ETIME.
This indicates that the call was definitely in progress on the server.
(4) In kAFS, if the rotation algorithm sees ECONNRESET rather than ETIME,
don't try the next server, but rather abort the call.
This avoids the oops as we don't try to reuse the afs_read struct.
Rather, as-yet ungotten pages will be reread at a later data.
Also:
(5) Add an rxrpc tracepoint to log detection of the call being reset.
Without this, I occasionally see an oops like the following:
general protection fault: 0000 [#1] SMP PTI
...
RIP: 0010:_copy_to_iter+0x204/0x310
RSP: 0018:ffff8800cae0f828 EFLAGS: 00010206
RAX: 0000000000000560 RBX: 0000000000000560 RCX: 0000000000000560
RDX: ffff8800cae0f968 RSI: ffff8800d58b3312 RDI: 0005080000000000
RBP: ffff8800cae0f968 R08: 0000000000000560 R09: ffff8800ca00f400
R10: ffff8800c36f28d4 R11: 00000000000008c4 R12: ffff8800cae0f958
R13: 0000000000000560 R14: ffff8800d58b3312 R15: 0000000000000560
FS: 00007fdaef108080(0000) GS:ffff8800ca680000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb28a8fa000 CR3: 00000000d2a76002 CR4: 00000000001606e0
Call Trace:
skb_copy_datagram_iter+0x14e/0x289
rxrpc_recvmsg_data.isra.0+0x6f3/0xf68
? trace_buffer_unlock_commit_regs+0x4f/0x89
rxrpc_kernel_recv_data+0x149/0x421
afs_extract_data+0x1e0/0x798
? afs_wait_for_call_to_complete+0xc9/0x52e
afs_deliver_fs_fetch_data+0x33a/0x5ab
afs_deliver_to_call+0x1ee/0x5e0
? afs_wait_for_call_to_complete+0xc9/0x52e
afs_wait_for_call_to_complete+0x12b/0x52e
? wake_up_q+0x54/0x54
afs_make_call+0x287/0x462
? afs_fs_fetch_data+0x3e6/0x3ed
? rcu_read_lock_sched_held+0x5d/0x63
afs_fs_fetch_data+0x3e6/0x3ed
afs_fetch_data+0xbb/0x14a
afs_readpages+0x317/0x40d
__do_page_cache_readahead+0x203/0x2ba
? ondemand_readahead+0x3a7/0x3c1
ondemand_readahead+0x3a7/0x3c1
generic_file_buffered_read+0x18b/0x62f
__vfs_read+0xdb/0xfe
vfs_read+0xb2/0x137
ksys_read+0x50/0x8c
do_syscall_64+0x7d/0x1a0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Note the weird value in RDI which is a result of trying to kmap() a NULL
page pointer.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement network namespacing within AFS, but don't yet let mounts occur
outside the init namespace. An additional patch will be required propagate
the network namespace across automounts.
Signed-off-by: David Howells <dhowells@redhat.com>
The afs_net::ws_cell member is sometimes used under RCU conditions from
within an seq-readlock. It isn't, however, marked __rcu and it isn't set
using the proper RCU barrier-imposing functions.
Fix this by annotating it with __rcu and using appropriate barriers to
make sure accesses are correctly ordered.
Without this, the code can produce the following warning:
>> fs/afs/proc.c:151:24: sparse: incompatible types in comparison expression (different address spaces)
Fixes: f044c8847b ("afs: Lay the groundwork for supporting network namespaces")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Sparse doesn't appear able to handle the conditionally-taken locks in
xdr_decode_AFSFetchStatus(), even though the lock and unlock are both
contingent on the same unvarying function argument.
Deal with this by interpolating a wrapper function that takes the lock if
needed and calls xdr_decode_AFSFetchStatus() on two separate branches, one
with the lock held and one without.
This allows Sparse to work out the locking.
Signed-off-by: David Howells <dhowells@redhat.com>
Rearrange fs/afs/proc.c to move the show routines up to the top of each
block so the order is show, iteration, ops, file ops, fops.
Signed-off-by: David Howells <dhowells@redhat.com>
In fs/afs/proc.c, move functions that create and remove /proc files to the
end of the source file as a first stage in getting rid of all the forward
declarations.
Signed-off-by: David Howells <dhowells@redhat.com>
In theory the AFS_VLSF_BACKVOL flag for a server in a vldb entry
would indicate the presence of a backup volume on that server.
In practice however, this flag is never set, and the presence of
a backup volume is implied by the entry having AFS_VLF_BACKEXISTS set,
for the server that hosts the read-write volume (has AFS_VLSF_RWVOL).
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Doing faccessat("/afs/some/directory", 0) triggers a BUG in the permissions
check code.
Fix this by just removing the BUG section. If no permissions are asked
for, just return okay if the file exists.
Also:
(1) Split up the directory check so that it has separate if-statements
rather than if-else-if (e.g. checking for MAY_EXEC shouldn't skip the
check for MAY_READ and MAY_WRITE).
(2) Check for MAY_CHDIR as MAY_EXEC.
Without the main fix, the following BUG may occur:
kernel BUG at fs/afs/security.c:386!
invalid opcode: 0000 [#1] SMP PTI
...
RIP: 0010:afs_permission+0x19d/0x1a0 [kafs]
...
Call Trace:
? inode_permission+0xbe/0x180
? do_faccessat+0xdc/0x270
? do_syscall_64+0x60/0x1f0
? entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fixes: 00d3b7a453 ("[AFS]: Add security support.")
Reported-by: Jonathan Billings <jsbillings@jsbillings.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Use remove_proc_subtree to remove the whole subtree on cleanup, and
unwind the registration loop into individual calls. Switch to use
proc_create_seq where applicable.
Signed-off-by: Christoph Hellwig <hch@lst.de>