Userspace cannot compile <asm/sembuf.h> due to some missing type
definitions. For example, building it for x86 fails as follows:
CC usr/include/asm/sembuf.h.s
In file included from <command-line>:32:0:
usr/include/asm/sembuf.h:17:20: error: field `sem_perm' has incomplete type
struct ipc64_perm sem_perm; /* permissions .. see ipc.h */
^~~~~~~~
usr/include/asm/sembuf.h:24:2: error: unknown type name `__kernel_time_t'
__kernel_time_t sem_otime; /* last semop time */
^~~~~~~~~~~~~~~
usr/include/asm/sembuf.h:25:2: error: unknown type name `__kernel_ulong_t'
__kernel_ulong_t __unused1;
^~~~~~~~~~~~~~~~
usr/include/asm/sembuf.h:26:2: error: unknown type name `__kernel_time_t'
__kernel_time_t sem_ctime; /* last change time */
^~~~~~~~~~~~~~~
usr/include/asm/sembuf.h:27:2: error: unknown type name `__kernel_ulong_t'
__kernel_ulong_t __unused2;
^~~~~~~~~~~~~~~~
usr/include/asm/sembuf.h:29:2: error: unknown type name `__kernel_ulong_t'
__kernel_ulong_t sem_nsems; /* no. of semaphores in array */
^~~~~~~~~~~~~~~~
usr/include/asm/sembuf.h:30:2: error: unknown type name `__kernel_ulong_t'
__kernel_ulong_t __unused3;
^~~~~~~~~~~~~~~~
usr/include/asm/sembuf.h:31:2: error: unknown type name `__kernel_ulong_t'
__kernel_ulong_t __unused4;
^~~~~~~~~~~~~~~~
It is just a matter of missing include directive.
Include <asm/ipcbuf.h> to make it self-contained, and add it to
the compile-test coverage.
Link: http://lkml.kernel.org/r/20191030063855.9989-3-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Userspace cannot compile <asm/msgbuf.h> due to some missing type
definitions. For example, building it for x86 fails as follows:
CC usr/include/asm/msgbuf.h.s
In file included from usr/include/asm/msgbuf.h:6:0,
from <command-line>:32:
usr/include/asm-generic/msgbuf.h:25:20: error: field `msg_perm' has incomplete type
struct ipc64_perm msg_perm;
^~~~~~~~
usr/include/asm-generic/msgbuf.h:27:2: error: unknown type name `__kernel_time_t'
__kernel_time_t msg_stime; /* last msgsnd time */
^~~~~~~~~~~~~~~
usr/include/asm-generic/msgbuf.h:28:2: error: unknown type name `__kernel_time_t'
__kernel_time_t msg_rtime; /* last msgrcv time */
^~~~~~~~~~~~~~~
usr/include/asm-generic/msgbuf.h:29:2: error: unknown type name `__kernel_time_t'
__kernel_time_t msg_ctime; /* last change time */
^~~~~~~~~~~~~~~
usr/include/asm-generic/msgbuf.h:41:2: error: unknown type name `__kernel_pid_t'
__kernel_pid_t msg_lspid; /* pid of last msgsnd */
^~~~~~~~~~~~~~
usr/include/asm-generic/msgbuf.h:42:2: error: unknown type name `__kernel_pid_t'
__kernel_pid_t msg_lrpid; /* last receive pid */
^~~~~~~~~~~~~~
It is just a matter of missing include directive.
Include <asm/ipcbuf.h> to make it self-contained, and add it to
the compile-test coverage.
Link: http://lkml.kernel.org/r/20191030063855.9989-2-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Userspace cannot compile <asm/ipcbuf.h> due to some missing type
definitions. For example, building it for x86 fails as follows:
CC usr/include/asm/ipcbuf.h.s
In file included from usr/include/asm/ipcbuf.h:1:0,
from <command-line>:32:
usr/include/asm-generic/ipcbuf.h:21:2: error: unknown type name `__kernel_key_t'
__kernel_key_t key;
^~~~~~~~~~~~~~
usr/include/asm-generic/ipcbuf.h:22:2: error: unknown type name `__kernel_uid32_t'
__kernel_uid32_t uid;
^~~~~~~~~~~~~~~~
usr/include/asm-generic/ipcbuf.h:23:2: error: unknown type name `__kernel_gid32_t'
__kernel_gid32_t gid;
^~~~~~~~~~~~~~~~
usr/include/asm-generic/ipcbuf.h:24:2: error: unknown type name `__kernel_uid32_t'
__kernel_uid32_t cuid;
^~~~~~~~~~~~~~~~
usr/include/asm-generic/ipcbuf.h:25:2: error: unknown type name `__kernel_gid32_t'
__kernel_gid32_t cgid;
^~~~~~~~~~~~~~~~
usr/include/asm-generic/ipcbuf.h:26:2: error: unknown type name `__kernel_mode_t'
__kernel_mode_t mode;
^~~~~~~~~~~~~~~
usr/include/asm-generic/ipcbuf.h:28:35: error: `__kernel_mode_t' undeclared here (not in a function)
unsigned char __pad1[4 - sizeof(__kernel_mode_t)];
^~~~~~~~~~~~~~~
usr/include/asm-generic/ipcbuf.h:31:2: error: unknown type name `__kernel_ulong_t'
__kernel_ulong_t __unused1;
^~~~~~~~~~~~~~~~
usr/include/asm-generic/ipcbuf.h:32:2: error: unknown type name `__kernel_ulong_t'
__kernel_ulong_t __unused2;
^~~~~~~~~~~~~~~~
It is just a matter of missing include directive.
Include <linux/posix_types.h> to make it self-contained, and add it to
the compile-test coverage.
Link: http://lkml.kernel.org/r/20191030063855.9989-1-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two structures based on time_t that conflict between libc and
kernel: timeval and timespec. Both are now renamed to __kernel_old_timeval
and __kernel_old_timespec.
For time_t, the old typedef is still __kernel_time_t. There is nothing
wrong with that name, but it would be nice to not use that going forward
as this type is used almost only in deprecated interfaces because of
the y2038 overflow.
In the IPC headers (msgbuf.h, sembuf.h, shmbuf.h), __kernel_time_t is only
used for the 64-bit variants, which are not deprecated.
Change these to a plain 'long', which is the same type as __kernel_time_t
on all 64-bit architectures anyway, to reduce the number of users of the
old type.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The 'struct timespec' definition can no longer be part of the uapi headers
because it conflicts with a a now incompatible libc definition. Also,
we really want to remove it in order to prevent new uses from creeping in.
The same namespace conflict exists with time_t, which should also be
removed. __kernel_time_t could be used safely, but adding 'old' in the
name makes it clearer that this should not be used for new interfaces.
Add a replacement __kernel_old_timespec structure and __kernel_old_time_t
along the lines of __kernel_old_timeval.
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
When a process expects no accesses to a certain memory range for a long
time, it could hint kernel that the pages can be reclaimed instantly but
data should be preserved for future use. This could reduce workingset
eviction so it ends up increasing performance.
This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
MADV_PAGEOUT can be used by a process to mark a memory range as not
expected to be used for a long time so that kernel reclaims *any LRU*
pages instantly. The hint can help kernel in deciding which pages to
evict proactively.
A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
intentionally because it's automatically bounded by PMD size. If PMD
size(e.g., 256) makes some trouble, we could fix it later by limit it to
SWAP_CLUSTER_MAX[1].
- man-page material
MADV_PAGEOUT (since Linux x.x)
Do not expect access in the near future so pages in the specified
regions could be reclaimed instantly regardless of memory pressure.
Thus, access in the range after successful operation could cause
major page fault but never lose the up-to-date contents unlike
MADV_DONTNEED. Pages belonging to a shared mapping are only processed
if a write access is allowed for the calling process.
MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
VM_PFNMAP pages.
[1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
[minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
- Background
The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot
start. While we continually try to improve the performance of cold
starts, hot starts will always be significantly less power hungry as well
as faster so we are trying to make hot start more likely than cold start.
To increase hot start, Android userspace manages the order that apps
should be killed in a process called ActivityManagerService.
ActivityManagerService tracks every Android app or service that the user
could be interacting with at any time and translates that into a ranked
list for lmkd(low memory killer daemon). They are likely to be killed by
lmkd if the system has to reclaim memory. In that sense they are similar
to entries in any other cache. Those apps are kept alive for
opportunistic performance improvements but those performance improvements
will vary based on the memory requirements of individual workloads.
- Problem
Naturally, cached apps were dominant consumers of memory on the system.
However, they were not significant consumers of swap even though they are
good candidate for swap. Under investigation, swapping out only begins
once the low zone watermark is hit and kswapd wakes up, but the overall
allocation rate in the system might trip lmkd thresholds and cause a
cached process to be killed(we measured performance swapping out vs.
zapping the memory by killing a process. Unsurprisingly, zapping is 10x
times faster even though we use zram which is much faster than real
storage) so kill from lmkd will often satisfy the high zone watermark,
resulting in very few pages actually being moved to swap.
- Approach
The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd by
reclaiming apps as soon as they entered the cached state. Additionally,
it could provide many chances for platform to use much information to
optimize memory efficiency.
To achieve the goal, the patchset introduce two new options for madvise.
One is MADV_COLD which will deactivate activated pages and the other is
MADV_PAGEOUT which will reclaim private pages instantly. These new
options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
ways to gain some free memory space. MADV_PAGEOUT is similar to
MADV_DONTNEED in a way that it hints the kernel that memory region is not
currently needed and should be reclaimed immediately; MADV_COLD is similar
to MADV_FREE in a way that it hints the kernel that memory region is not
currently needed and should be reclaimed when memory pressure rises.
This patch (of 5):
When a process expects no accesses to a certain memory range, it could
give a hint to kernel that the pages can be reclaimed when memory pressure
happens but data should be preserved for future use. This could reduce
workingset eviction so it ends up increasing performance.
This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help kernel in deciding which
pages to evict early during memory pressure.
It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
active file page -> inactive file LRU
active anon page -> inacdtive anon LRU
Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
LRU's head because MADV_COLD is a little bit different symantic.
MADV_FREE means it's okay to discard when the memory pressure because the
content of the page is *garbage* so freeing such pages is almost zero
overhead since we don't need to swap out and access afterward causes just
minor fault. Thus, it would make sense to put those freeable pages in
inactive file LRU to compete other used-once pages. It makes sense for
implmentaion point of view, too because it's not swapbacked memory any
longer until it would be re-dirtied. Even, it could give a bonus to make
them be reclaimed on swapless system. However, MADV_COLD doesn't mean
garbage so reclaiming them requires swap-out/in in the end so it's bigger
cost. Since we have designed VM LRU aging based on cost-model, anonymous
cold pages would be better to position inactive anon's LRU list, not file
LRU. Furthermore, it would help to avoid unnecessary scanning if system
doesn't have a swap device. Let's start simpler way without adding
complexity at this moment. However, keep in mind, too that it's a caveat
that workloads with a lot of pages cache are likely to ignore MADV_COLD on
anonymous memory because we rarely age anonymous LRU lists.
* man-page material
MADV_COLD (since Linux x.x)
Pages in the specified regions will be treated as less-recently-accessed
compared to pages in the system with similar access frequencies. In
contrast to MADV_FREE, the contents of the region are preserved regardless
of subsequent writes to pages.
MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
pages.
[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge more updates from Andrew Morton:
"VM:
- z3fold fixes and enhancements by Henry Burns and Vitaly Wool
- more accurate reclaimed slab caches calculations by Yafang Shao
- fix MAP_UNINITIALIZED UAPI symbol to not depend on config, by
Christoph Hellwig
- !CONFIG_MMU fixes by Christoph Hellwig
- new novmcoredd parameter to omit device dumps from vmcore, by
Kairui Song
- new test_meminit module for testing heap and pagealloc
initialization, by Alexander Potapenko
- ioremap improvements for huge mappings, by Anshuman Khandual
- generalize kprobe page fault handling, by Anshuman Khandual
- device-dax hotplug fixes and improvements, by Pavel Tatashin
- enable synchronous DAX fault on powerpc, by Aneesh Kumar K.V
- add pte_devmap() support for arm64, by Robin Murphy
- unify locked_vm accounting with a helper, by Daniel Jordan
- several misc fixes
core/lib:
- new typeof_member() macro including some users, by Alexey Dobriyan
- make BIT() and GENMASK() available in asm, by Masahiro Yamada
- changed LIST_POISON2 on x86_64 to 0xdead000000000122 for better
code generation, by Alexey Dobriyan
- rbtree code size optimizations, by Michel Lespinasse
- convert struct pid count to refcount_t, by Joel Fernandes
get_maintainer.pl:
- add --no-moderated switch to skip moderated ML's, by Joe Perches
misc:
- ptrace PTRACE_GET_SYSCALL_INFO interface
- coda updates
- gdb scripts, various"
[ Using merge message suggestion from Vlastimil Babka, with some editing - Linus ]
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits)
fs/select.c: use struct_size() in kmalloc()
mm: add account_locked_vm utility function
arm64: mm: implement pte_devmap support
mm: introduce ARCH_HAS_PTE_DEVMAP
mm: clean up is_device_*_page() definitions
mm/mmap: move common defines to mman-common.h
mm: move MAP_SYNC to asm-generic/mman-common.h
device-dax: "Hotremove" persistent memory that is used like normal RAM
mm/hotplug: make remove_memory() interface usable
device-dax: fix memory and resource leak if hotplug fails
include/linux/lz4.h: fix spelling and copy-paste errors in documentation
ipc/mqueue.c: only perform resource calculation if user valid
include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures
scripts/gdb: add helpers to find and list devices
scripts/gdb: add lx-genpd-summary command
drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
kernel/pid.c: convert struct pid count to refcount_t
drivers/rapidio/devices/rio_mport_cdev.c: NUL terminate some strings
select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining()
select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR
...
This lets us catch new architectures that implicitly make use of clone3
without setting __ARCH_WANT_SYS_CLONE3.
Failing on missing __ARCH_WANT_SYS_CLONE3 is a good indicator that they
either did not really want this syscall or haven't really thought about
whether it needs special treatment and just accidently included it in
their entrypoints by e.g. generating their syscall table automatically
via asm-generic/unistd.h
This patch has been compile-tested for the h8300 architecture which is
one of the architectures that does not yet implement clone3 and
generates its syscall table via asm-generic/unistd.h.
Signed-off-by: Christian Brauner <christian@brauner.io>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20190714192205.27190-3-christian@brauner.io
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <christian@brauner.io>
Pull networking updates from David Miller:
"Some highlights from this development cycle:
1) Big refactoring of ipv6 route and neigh handling to support
nexthop objects configurable as units from userspace. From David
Ahern.
2) Convert explored_states in BPF verifier into a hash table,
significantly decreased state held for programs with bpf2bpf
calls, from Alexei Starovoitov.
3) Implement bpf_send_signal() helper, from Yonghong Song.
4) Various classifier enhancements to mvpp2 driver, from Maxime
Chevallier.
5) Add aRFS support to hns3 driver, from Jian Shen.
6) Fix use after free in inet frags by allocating fqdirs dynamically
and reworking how rhashtable dismantle occurs, from Eric Dumazet.
7) Add act_ctinfo packet classifier action, from Kevin
Darbyshire-Bryant.
8) Add TFO key backup infrastructure, from Jason Baron.
9) Remove several old and unused ISDN drivers, from Arnd Bergmann.
10) Add devlink notifications for flash update status to mlxsw driver,
from Jiri Pirko.
11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski.
12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes.
13) Various enhancements to ipv6 flow label handling, from Eric
Dumazet and Willem de Bruijn.
14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van
der Merwe, and others.
15) Various improvements to axienet driver including converting it to
phylink, from Robert Hancock.
16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean.
17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana
Radulescu.
18) Add devlink health reporting to mlx5, from Moshe Shemesh.
19) Convert stmmac over to phylink, from Jose Abreu.
20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from
Shalom Toledo.
21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera.
22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel.
23) Track spill/fill of constants in BPF verifier, from Alexei
Starovoitov.
24) Support bounded loops in BPF, from Alexei Starovoitov.
25) Various page_pool API fixes and improvements, from Jesper Dangaard
Brouer.
26) Just like ipv4, support ref-countless ipv6 route handling. From
Wei Wang.
27) Support VLAN offloading in aquantia driver, from Igor Russkikh.
28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy.
29) Add flower GRE encap/decap support to nfp driver, from Pieter
Jansen van Vuuren.
30) Protect against stack overflow when using act_mirred, from John
Hurley.
31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen.
32) Use page_pool API in netsec driver, Ilias Apalodimas.
33) Add Google gve network driver, from Catherine Sullivan.
34) More indirect call avoidance, from Paolo Abeni.
35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan.
36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek.
37) Add MPLS manipulation actions to TC, from John Hurley.
38) Add sending a packet to connection tracking from TC actions, and
then allow flower classifier matching on conntrack state. From
Paul Blakey.
39) Netfilter hw offload support, from Pablo Neira Ayuso"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits)
net/mlx5e: Return in default case statement in tx_post_resync_params
mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync().
net: dsa: add support for BRIDGE_MROUTER attribute
pkt_sched: Include const.h
net: netsec: remove static declaration for netsec_set_tx_de()
net: netsec: remove superfluous if statement
netfilter: nf_tables: add hardware offload support
net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload
net: flow_offload: add flow_block_cb_is_busy() and use it
net: sched: remove tcf block API
drivers: net: use flow block API
net: sched: use flow block API
net: flow_offload: add flow_block_cb_{priv, incref, decref}()
net: flow_offload: add list handling functions
net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*
net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND
net: flow_offload: add flow_block_cb_setup_simple()
net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC
net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC
...
Pull clone3 system call from Christian Brauner:
"This adds the clone3 syscall which is an extensible successor to clone
after we snagged the last flag with CLONE_PIDFD during the 5.2 merge
window for clone(). It cleanly supports all of the flags from clone()
and thus all legacy workloads.
There are few user visible differences between clone3 and clone.
First, CLONE_DETACHED will cause EINVAL with clone3 so we can reuse
this flag. Second, the CSIGNAL flag is deprecated and will cause
EINVAL to be reported. It is superseeded by a dedicated "exit_signal"
argument in struct clone_args thus freeing up even more flags. And
third, clone3 gives CLONE_PIDFD a dedicated return argument in struct
clone_args instead of abusing CLONE_PARENT_SETTID's parent_tidptr
argument.
The clone3 uapi is designed to be easy to handle on 32- and 64 bit:
/* uapi */
struct clone_args {
__aligned_u64 flags;
__aligned_u64 pidfd;
__aligned_u64 child_tid;
__aligned_u64 parent_tid;
__aligned_u64 exit_signal;
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
};
and a separate kernel struct is used that uses proper kernel typing:
/* kernel internal */
struct kernel_clone_args {
u64 flags;
int __user *pidfd;
int __user *child_tid;
int __user *parent_tid;
int exit_signal;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
};
The system call comes with a size argument which enables the kernel to
detect what version of clone_args userspace is passing in. clone3
validates that any additional bytes a given kernel does not know about
are set to zero and that the size never exceeds a page.
A nice feature is that this patchset allowed us to cleanup and
simplify various core kernel codepaths in kernel/fork.c by making the
internal _do_fork() function take struct kernel_clone_args even for
legacy clone().
This patch also unblocks the time namespace patchset which wants to
introduce a new CLONE_TIMENS flag.
Note, that clone3 has only been wired up for x86{_32,64}, arm{64}, and
xtensa. These were the architectures that did not require special
massaging.
Other architectures treat fork-like system calls individually and
after some back and forth neither Arnd nor I felt confident that we
dared to add clone3 unconditionally to all architectures. We agreed to
leave this up to individual architecture maintainers. This is why
there's an additional patch that introduces __ARCH_WANT_SYS_CLONE3
which any architecture can set once it has implemented support for
clone3. The patch also adds a cond_syscall(clone3) for architectures
such as nios2 or h8300 that generate their syscall table by simply
including asm-generic/unistd.h. The hope is to get rid of
__ARCH_WANT_SYS_CLONE3 and cond_syscall() rather soon"
* tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
arch: handle arches who do not yet define clone3
arch: wire-up clone3() syscall
fork: add clone3
There is SO_ATTACH_REUSEPORT_[CE]BPF but there is no DETACH.
This patch adds SO_DETACH_REUSEPORT_BPF sockopt. The same
sockopt can be used to undo both SO_ATTACH_REUSEPORT_[CE]BPF.
reseport_detach_prog() is added and it is mostly a mirror
of the existing reuseport_attach_prog(). The differences are,
it does not call reuseport_alloc() and returns -ENOENT when
there is no old prog.
Cc: Craig Gallek <kraig@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The 'timeval' and 'timespec' data structures used for socket timestamps
are going to be redefined in user space based on 64-bit time_t in future
versions of the C library to deal with the y2038 overflow problem,
which breaks the ABI definition.
Unlike many modern ioctl commands, SIOCGSTAMP and SIOCGSTAMPNS do not
use the _IOR() macro to encode the size of the transferred data, so it
remains ambiguous whether the application uses the old or new layout.
The best workaround I could find is rather ugly: we redefine the command
code based on the size of the respective data structure with a ternary
operator. This lets it get evaluated as late as possible, hopefully after
that structure is visible to the caller. We cannot use an #ifdef here,
because inux/sockios.h might have been included before any libc header
that could determine the size of time_t.
The ioctl implementation now interprets the new command codes as always
referring to the 64-bit structure on all architectures, while the old
architecture specific command code still refers to the old architecture
specific layout. The new command number is only used when they are
actually different.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull more Kbuild updates from Masahiro Yamada:
- add more Build-Depends to Debian source package
- prefix header search paths with $(srctree)/
- make modpost show verbose section mismatch warnings
- avoid hard-coded CROSS_COMPILE for h8300
- fix regression for Debian make-kpkg command
- add semantic patch to detect missing put_device()
- fix some warnings of 'make deb-pkg'
- optimize NOSTDINC_FLAGS evaluation
- add warnings about redundant generic-y
- clean up Makefiles and scripts
* tag 'kbuild-v5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: remove stale lxdialog/.gitignore
kbuild: force all architectures except um to include mandatory-y
kbuild: warn redundant generic-y
Revert "modsign: Abort modules_install when signing fails"
kbuild: Make NOSTDINC_FLAGS a simply expanded variable
kbuild: deb-pkg: avoid implicit effects
coccinelle: semantic code search for missing put_device()
kbuild: pkg: grep include/config/auto.conf instead of $KCONFIG_CONFIG
kbuild: deb-pkg: introduce is_enabled and if_enabled_echo to builddeb
kbuild: deb-pkg: add CONFIG_ prefix to kernel config options
kbuild: add workaround for Debian make-kpkg
kbuild: source include/config/auto.conf instead of ${KCONFIG_CONFIG}
unicore32: simplify linker script generation for decompressor
h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux-
kbuild: move archive command to scripts/Makefile.lib
modpost: always show verbose warning for section mismatch
ia64: prefix header search path with $(srctree)/
libfdt: prefix header search paths with $(srctree)/
deb-pkg: generate correct build dependencies
Currently, every arch/*/include/uapi/asm/Kbuild explicitly includes
the common Kbuild.asm file. Factor out the duplicated include directives
to scripts/Makefile.asm-generic so that no architecture would opt out
of the mandatory-y mechanism.
um is not forced to include mandatory-y since it is a very exceptional
case which does not support UAPI.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Pull pidfd system call from Christian Brauner:
"This introduces the ability to use file descriptors from /proc/<pid>/
as stable handles on struct pid. Even if a pid is recycled the handle
will not change. For a start these fds can be used to send signals to
the processes they refer to.
With the ability to use /proc/<pid> fds as stable handles on struct
pid we can fix a long-standing issue where after a process has exited
its pid can be reused by another process. If a caller sends a signal
to a reused pid it will end up signaling the wrong process.
With this patchset we enable a variety of use cases. One obvious
example is that we can now safely delegate an important part of
process management - sending signals - to processes other than the
parent of a given process by sending file descriptors around via scm
rights and not fearing that the given process will have been recycled
in the meantime. It also allows for easy testing whether a given
process is still alive or not by sending signal 0 to a pidfd which is
quite handy.
There has been some interest in this feature e.g. from systems
management (systemd, glibc) and container managers. I have requested
and gotten comments from glibc to make sure that this syscall is
suitable for their needs as well. In the future I expect it to take on
most other pid-based signal syscalls. But such features are left for
the future once they are needed.
This has been sitting in linux-next for quite a while and has not
caused any issues. It comes with selftests which verify basic
functionality and also test that a recycled pid cannot be signaled via
a pidfd.
Jon has written about a prior version of this patchset. It should
cover the basic functionality since not a lot has changed since then:
https://lwn.net/Articles/773459/
The commit message for the syscall itself is extensively documenting
the syscall, including it's functionality and extensibility"
* tag 'pidfd-v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
selftests: add tests for pidfd_send_signal()
signal: add pidfd_send_signal() syscall
Pull networking fixes from David Miller:
"More fixes in the queue:
1) Netfilter nat can erroneously register the device notifier twice,
fix from Florian Westphal.
2) Use after free in nf_tables, from Pablo Neira Ayuso.
3) Parallel update of steering rule fix in mlx5 river, from Eli
Britstein.
4) RX processing panic in lan743x, fix from Bryan Whitehead.
5) Use before initialization of TCP_SKB_CB, fix from Christoph Paasch.
6) Fix locking in SRIOV mode of mlx4 driver, from Jack Morgenstein.
7) Fix TX stalls in lan743x due to mishandling of interrupt ACKing
modes, from Bryan Whitehead.
8) Fix infoleak in l2tp_ip6_recvmsg(), from Eric Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
pptp: dst_release sk_dst_cache in pptp_sock_destruct
MAINTAINERS: GENET & SYSTEMPORT: Add internal Broadcom list
l2tp: fix infoleak in l2tp_ip6_recvmsg()
net/tls: Inform user space about send buffer availability
net_sched: return correct value for *notify* functions
lan743x: Fix TX Stall Issue
net/mlx4_core: Fix qp mtt size calculation
net/mlx4_core: Fix locking in SRIOV mode when switching between events and polling
net/mlx4_core: Fix reset flow when in command polling mode
mlxsw: minimal: Initialize base_mac
mlxsw: core: Prevent duplication during QSFP module initialization
net: dwmac-sun8i: fix a missing check of of_get_phy_mode
net: sh_eth: fix a missing check of of_get_phy_mode
net: 8390: fix potential NULL pointer dereferences
net: fujitsu: fix a potential NULL pointer dereference
net: qlogic: fix a potential NULL pointer dereference
isdn: hfcpci: fix potential NULL pointer dereference
Documentation: devicetree: add a new optional property for port mac address
net: rocker: fix a potential NULL pointer dereference
net: qlge: fix a potential NULL pointer dereference
...
Referencing the __kernel_long_t type caused some user space applications
to stop compiling when they had not already included linux/posix_types.h,
e.g.
s/multicast.c -o ext/sockets/multicast.lo
In file included from /builddir/build/BUILD/php-7.3.3/main/php.h:468,
from /builddir/build/BUILD/php-7.3.3/ext/sockets/sockets.c:27:
/builddir/build/BUILD/php-7.3.3/ext/sockets/sockets.c: In function 'zm_startup_sockets':
/builddir/build/BUILD/php-7.3.3/ext/sockets/sockets.c:776:40: error: '__kernel_long_t' undeclared (first use in this function)
776 | REGISTER_LONG_CONSTANT("SO_SNDTIMEO", SO_SNDTIMEO, CONST_CS | CONST_PERSISTENT);
It is safe to include that header here, since it only contains kernel
internal types that do not conflict with other user space types.
It's still possible that some related build failures remain, but those
are likely to be for code that is not already y2038 safe.
Reported-by: Laura Abbott <labbott@redhat.com>
Fixes: a9beb86ae6 ("sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>