The following
if (val < 0)
*lvalp = (unsigned long)-val;
is incorrect because the compiler is free to assume -val to be positive
and use a sign-extend instruction for extending the bit pattern. This is
a problem if val == INT_MIN:
# echo -2147483648 >/proc/sys/dev/scsi/logging_level
# cat /proc/sys/dev/scsi/logging_level
-18446744071562067968
Cast to unsigned long before negation - that way we first sign-extend and
then negate an unsigned, which is well defined. With this:
# cat /proc/sys/dev/scsi/logging_level
-2147483648
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Cc: Mikulas Patocka <mikulas@twibright.com>
Cc: Robert Xiao <nneonneo@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In x86_64, since v2.6.26 the KERNEL_IMAGE_SIZE is changed to 512M, and
accordingly the MODULES_VADDR is changed to 0xffffffffa0000000. However,
in v3.12 Kees Cook introduced kaslr to randomise the location of kernel.
And the kernel text mapping addr space is enlarged from 512M to 1G. That
means now KERNEL_IMAGE_SIZE is variable, its value is 512M when kaslr
support is not compiled in and 1G when kaslr support is compiled in.
Accordingly the MODULES_VADDR is changed too to be:
#define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
So when kaslr is compiled in and enabled, the kernel text mapping addr
space and modules vaddr space need be adjusted. Otherwise makedumpfile
will collapse since the addr for some symbols is not correct.
Hence KERNEL_IMAGE_SIZE need be exported to vmcoreinfo and got in
makedumpfile to help calculate MODULES_VADDR.
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
People reported that crash_notes in /proc/vmcore were corrupted and this
cause crash kdump failure. With code debugging and log we got the root
cause. This is because percpu variable crash_notes are allocated in 2
vmalloc pages. Currently percpu is based on vmalloc by default. Vmalloc
can't guarantee 2 continuous vmalloc pages are also on 2 continuous
physical pages. So when 1st kernel exports the starting address and size
of crash_notes through sysfs like below:
/sys/devices/system/cpu/cpux/crash_notes
/sys/devices/system/cpu/cpux/crash_notes_size
kdump kernel use them to get the content of crash_notes. However the 2nd
part may not be in the next neighbouring physical page as we expected if
crash_notes are allocated accross 2 vmalloc pages. That's why
nhdr_ptr->n_namesz or nhdr_ptr->n_descsz could be very huge in
update_note_header_size_elf64() and cause note header merging failure or
some warnings.
In this patch change to call __alloc_percpu() to passed in the align value
by rounding crash_notes_size up to the nearest power of two. This makes
sure the crash_notes is allocated inside one physical page since
sizeof(note_buf_t) in all ARCHS is smaller than PAGE_SIZE. Meanwhile add
a BUILD_BUG_ON to break compile if size is bigger than PAGE_SIZE since
crash_notes definitely will be in 2 pages. That need be avoided, and need
be reported if it's unavoidable.
[akpm@linux-foundation.org: use correct comment layout]
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Transforming PFN(Page Frame Number) to struct page is never failure, so we
can simplify the code logic to do the image->control_page assignment
directly in the loop, and remove the unnecessary conditional judgement.
Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Acked-by: Dave Young <dyoung@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Simon Horman <horms@verge.net.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two kexec load syscalls, kexec_load another and kexec_file_load.
kexec_file_load has been splited as kernel/kexec_file.c. In this patch I
split kexec_load syscall code to kernel/kexec.c.
And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
use kexec_file_load only, or vice verse.
The original requirement is from Ted Ts'o, he want kexec kernel signature
being checked with CONFIG_KEXEC_VERIFY_SIG enabled. But kexec-tools use
kexec_load syscall can bypass the checking.
Vivek Goyal proposed to create a common kconfig option so user can compile
in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects
KEXEC_CORE so that old config files still work.
Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
KEXEC_CORE in arch Kconfig. Also updated general kernel code with to
kexec_load syscall.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Split kexec_file syscall related code to another file kernel/kexec_file.c
so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped.
Sharing variables and functions are moved to kernel/kexec_internal.h per
suggestion from Vivek and Petr.
[akpm@linux-foundation.org: fix bisectability]
[akpm@linux-foundation.org: declare the various arch_kexec functions]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The UMH_WAIT_PROC handler runs in its own thread in order to make sure
that waiting for the exec kernel thread completion won't block other
usermodehelper queued jobs.
On older workqueue implementations, worklets couldn't sleep without
blocking the rest of the queue. But now the workqueue subsystem handles
that. Khelper still had the older limitation due to its singlethread
properties but we replaced it to system unbound workqueues.
Those are affine to the current node and can block up to some number of
instances.
They are a good candidate to handle UMH_WAIT_PROC assuming that we have
enough system unbound workers to handle lots of parallel usermodehelper
jobs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to launch the usermodehelper kernel threads with the widest
affinity and this is partly why we use khelper. This workqueue has
unbound properties and thus a wide affinity inherited by all its children.
Now khelper also has special properties that we aren't much interested in:
ordered and singlethread. There is really no need about ordering as all
we do is creating kernel threads. This can be done concurrently. And
singlethread is a useless limitation as well.
The workqueue engine already proposes generic unbound workqueues that
don't share these useless properties and handle well parallel jobs.
The only worrysome specific is their affinity to the node of the current
CPU. It's fine for creating the usermodehelper kernel threads but those
inherit this affinity for longer jobs such as requesting modules.
This patch proposes to use these node affine unbound workqueues assuming
that a node is sufficient to handle several parallel usermodehelper
requests.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There seem to be quite some confusions on the comments, likely due to
changes that came after them.
Now since it's very non obvious why we have 3 levels of asynchronous code
to implement usermodehelpers, it's important to comment in detail the
reason of this layout.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Khelper is affine to all CPUs. Now since it creates the
call_usermodehelper_exec_[a]sync() kernel threads, those inherit the wide
affinity.
As such explicitly forcing a wide affinity from those kernel threads
is like a no-op.
Just remove it. It's needless and it breaks CPU isolation users who
rely on workqueue affinity tuning.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset does a bunch of cleanups and converts khelper to use system
unbound workqueues. The 3 first patches should be uncontroversial. The
last 2 patches are debatable.
Kmod creates kernel threads that perform userspace jobs and we want those
to have a large affinity in order not to contend busy CPUs. This is
(partly) why we use khelper which has a wide affinity that the kernel
threads it create can inherit from. Now khelper is a dedicated workqueue
that has singlethread properties which we aren't interested in.
Hence those two debatable changes:
_ We would like to use generic workqueues. System unbound workqueues are
a very good candidate but they are not wide affine, only node affine.
Now probably a node is enough to perform many parallel kmod jobs.
_ We would like to remove the wait_for_helper kernel thread (UMH_WAIT_PROC
handler) to use the workqueue. It means that if the workqueue blocks,
and no other worker can take pending kmod request, we can be screwed.
Now if we have 512 threads, this should be enough.
This patch (of 5):
Underscores on function names aren't much verbose to explain the purpose
of a function. And kmod has interesting such flavours.
Lets rename the following functions:
* __call_usermodehelper -> call_usermodehelper_exec_work
* ____call_usermodehelper -> call_usermodehelper_exec_async
* wait_for_helper -> call_usermodehelper_exec_sync
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If request_module() successfully runs modprobe, but modprobe exits with a
non-zero status, then the return value from request_module() will be that
(positive) error status. So the return from request_module can be:
negative errno
zero for success
positive exit code.
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e0e817392b ("CRED: Add some configurable debugging [try #6]")
added the kdebug mechanism to this file back in 2009.
The kdebug macro calls no_printk which always evaluates arguments.
Most of the kdebug uses have an unnecessary call of
atomic_read(&cred->usage)
Make the kdebug macro do nothing by defining it with
do { if (0) no_printk(...); } while (0)
when not enabled.
$ size kernel/cred.o* (defconfig x86-64)
text data bss dec hex filename
2748 336 8 3092 c14 kernel/cred.o.new
2788 336 8 3132 c3c kernel/cred.o.old
Miscellanea:
o Neaten the #define kdebug macros while there
Signed-off-by: Joe Perches <joe@perches.com>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge second patch-bomb from Andrew Morton:
"Almost all of the rest of MM. There was an unusually large amount of
MM material this time"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (141 commits)
zpool: remove no-op module init/exit
mm: zbud: constify the zbud_ops
mm: zpool: constify the zpool_ops
mm: swap: zswap: maybe_preload & refactoring
zram: unify error reporting
zsmalloc: remove null check from destroy_handle_cache()
zsmalloc: do not take class lock in zs_shrinker_count()
zsmalloc: use class->pages_per_zspage
zsmalloc: consider ZS_ALMOST_FULL as migrate source
zsmalloc: partial page ordering within a fullness_list
zsmalloc: use shrinker to trigger auto-compaction
zsmalloc: account the number of compacted pages
zsmalloc/zram: introduce zs_pool_stats api
zsmalloc: cosmetic compaction code adjustments
zsmalloc: introduce zs_can_compact() function
zsmalloc: always keep per-class stats
zsmalloc: drop unused variable `nr_to_migrate'
mm/memblock.c: fix comment in __next_mem_range()
mm/page_alloc.c: fix type information of memoryless node
memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
...
alloc_pages_exact_node() was introduced in commit 6484eb3e2a ("page
allocator: do not check NUMA node ID when the caller knows the node is
valid") as an optimized variant of alloc_pages_node(), that doesn't
fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
name of the function can easily suggest that the allocation is
restricted to the given node and fails otherwise. In truth, the node is
only preferred, unless __GFP_THISNODE is passed among the gfp flags.
The misleading name has lead to mistakes in the past, see for example
commits 5265047ac3 ("mm, thp: really limit transparent hugepage
allocation to local node") and b360edb43f ("mm, mempolicy:
migrate_to_node should only migrate to node").
Another issue with the name is that there's a family of
alloc_pages_exact*() functions where 'exact' means exact size (instead
of page order), which leads to more confusion.
To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that
it's an optimized variant of alloc_pages_node() not intended for general
usage. Both functions get described in comments.
It has been also considered to really provide a convenience function for
allocations restricted to a node, but the major opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't
duplicate the API needlessly. The number of users would be small
anyway.
Existing callers of alloc_pages_exact_node() are simply converted to
call __alloc_pages_node(), with the exception of sba_alloc_coherent()
which open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead. This means it no longer performs some
VM_BUG_ON checks, and since the current check for nid in
alloc_pages_node() uses a 'nid < 0' comparison (which includes
NUMA_NO_NODE), it may hide wrong values which would be previously
exposed.
Both differences will be rectified by the next patch.
To sum up, this patch makes no functional changes, except temporarily
hiding potentially buggy callers. Restricting the checks in
alloc_pages_node() is left for the next patch which can in turn expose
more existing buggy callers.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Robin Holt <robinmholt@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cliff Whickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When seq_show_option (commit a068acf2ee: "fs: create and use
seq_show_option for escaping") was merged, it did not correctly collide
with cgroup's addition of legacy_name (commit 3e1d2eed39: "cgroup:
introduce cgroup_subsys->legacy_name") changes.
This fixes the reported name.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull libnvdimm updates from Dan Williams:
"This update has successfully completed a 0day-kbuild run and has
appeared in a linux-next release. The changes outside of the typical
drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
the introduction of ZONE_DEVICE + devm_memremap_pages().
Summary:
- Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
mechanism for adding device-driver-discovered memory regions to the
kernel's direct map.
This facility is used by the pmem driver to enable pfn_to_page()
operations on the page frames returned by DAX ('direct_access' in
'struct block_device_operations').
For now, the 'memmap' allocation for these "device" pages comes
from "System RAM". Support for allocating the memmap from device
memory will arrive in a later kernel.
- Introduce memremap() to replace usages of ioremap_cache() and
ioremap_wt(). memremap() drops the __iomem annotation for these
mappings to memory that do not have i/o side effects. The
replacement of ioremap_cache() with memremap() is limited to the
pmem driver to ease merging the api change in v4.3.
Completion of the conversion is targeted for v4.4.
- Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
driver, update the VFS DAX implementation and PMEM api to provide
persistence guarantees for kernel operations on a DAX mapping.
- Convert the ACPI NFIT 'BLK' driver to map the block apertures as
cacheable to improve performance.
- Miscellaneous updates and fixes to libnvdimm including support for
issuing "address range scrub" commands, clarifying the optimal
'sector size' of pmem devices, a clarification of the usage of the
ACPI '_STA' (status) property for DIMM devices, and other minor
fixes"
* tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
libnvdimm, pmem: direct map legacy pmem by default
libnvdimm, pmem: 'struct page' for pmem
libnvdimm, pfn: 'struct page' provider infrastructure
x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
add devm_memremap_pages
mm: ZONE_DEVICE for "device memory"
mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
dax: drop size parameter to ->direct_access()
nd_blk: change aperture mapping from WC to WB
nvdimm: change to use generic kvfree()
pmem, dax: have direct_access use __pmem annotation
dax: update I/O path to do proper PMEM flushing
pmem: add copy_from_iter_pmem() and clear_pmem()
pmem, x86: clean up conditional pmem includes
pmem: remove layer when calling arch_has_wmb_pmem()
pmem, x86: move x86 PMEM API to new pmem.h header
libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
pmem: switch to devm_ allocations
devres: add devm_memremap
libnvdimm, btt: write and validate parent_uuid
...
Pull tracing update from Steven Rostedt:
"Mostly this is just clean ups and micro optimizations.
The changes with more meat are:
- Allowing the trace event filters to filter on CPU number and
process ids
- Two new markers for trace output latency were added (10 and 100
msec latencies)
- Have tracing_thresh filter function profiling time
I also worked on modifying the ring buffer code for some future work,
and moved the adding of the timestamp around. One of my changes
caused a regression, and since other changes were built on top of it
and already tested, I had to operate a revert of that change. Instead
of rebasing, this change set has the code that caused a regression as
well as the code to revert that change without touching the other
changes that were made on top of it"
* tag 'trace-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Revert "ring-buffer: Get timestamp after event is allocated"
tracing: Don't make assumptions about length of string on task rename
tracing: Allow triggers to filter for CPU ids and process names
ftrace: Format MCOUNT_ADDR address as type unsigned long
tracing: Introduce two additional marks for delay
ftrace: Fix function_graph duration spacing with 7-digits
ftrace: add tracing_thresh to function profile
tracing: Clean up stack tracing and fix fentry updates
ring-buffer: Reorganize function locations
ring-buffer: Make sure event has enough room for extend and padding
ring-buffer: Get timestamp after event is allocated
ring-buffer: Move the adding of the extended timestamp out of line
ring-buffer: Add event descriptor to simplify passing data
ftrace: correct the counter increment for trace_buffer data
tracing: Fix for non-continuous cpu ids
tracing: Prefer kcalloc over kzalloc with multiply
Pull audit update from Paul Moore:
"This is one of the larger audit patchsets in recent history,
consisting of eight patches and almost 400 lines of changes.
The bulk of the patchset is the new "audit by executable"
functionality which allows admins to set an audit watch based on the
executable on disk. Prior to this, admins could only track an
application by PID, which has some obvious limitations.
Beyond the new functionality we also have some refcnt fixes and a few
minor cleanups"
* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
fixup: audit: implement audit by executable
audit: implement audit by executable
audit: clean simple fsnotify implementation
audit: use macros for unset inode and device values
audit: make audit_del_rule() more robust
audit: fix uninitialized variable in audit_add_rule()
audit: eliminate unnecessary extra layer of watch parent references
audit: eliminate unnecessary extra layer of watch references
Pull security subsystem updates from James Morris:
"Highlights:
- PKCS#7 support added to support signed kexec, also utilized for
module signing. See comments in 3f1e1bea.
** NOTE: this requires linking against the OpenSSL library, which
must be installed, e.g. the openssl-devel on Fedora **
- Smack
- add IPv6 host labeling; ignore labels on kernel threads
- support smack labeling mounts which use binary mount data
- SELinux:
- add ioctl whitelisting (see
http://kernsec.org/files/lss2015/vanderstoep.pdf)
- fix mprotect PROT_EXEC regression caused by mm change
- Seccomp:
- add ptrace options for suspend/resume"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (57 commits)
PKCS#7: Add OIDs for sha224, sha284 and sha512 hash algos and use them
Documentation/Changes: Now need OpenSSL devel packages for module signing
scripts: add extract-cert and sign-file to .gitignore
modsign: Handle signing key in source tree
modsign: Use if_changed rule for extracting cert from module signing key
Move certificate handling to its own directory
sign-file: Fix warning about BIO_reset() return value
PKCS#7: Add MODULE_LICENSE() to test module
Smack - Fix build error with bringup unconfigured
sign-file: Document dependency on OpenSSL devel libraries
PKCS#7: Appropriately restrict authenticated attributes and content type
KEYS: Add a name for PKEY_ID_PKCS7
PKCS#7: Improve and export the X.509 ASN.1 time object decoder
modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS
extract-cert: Cope with multiple X.509 certificates in a single file
sign-file: Generate CMS message as signature instead of PKCS#7
PKCS#7: Support CMS messages also [RFC5652]
X.509: Change recorded SKID & AKID to not include Subject or Issuer
PKCS#7: Check content type and versions
MAINTAINERS: The keyrings mailing list has moved
...
Pull vfs updates from Al Viro:
"In this one:
- d_move fixes (Eric Biederman)
- UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
handling ought to be fixed)
- switch of sb_writers to percpu rwsem (Oleg Nesterov)
- superblock scalability (Josef Bacik and Dave Chinner)
- swapon(2) race fix (Hugh Dickins)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
vfs: Test for and handle paths that are unreachable from their mnt_root
dcache: Reduce the scope of i_lock in d_splice_alias
dcache: Handle escaped paths in prepend_path
mm: fix potential data race in SyS_swapon
inode: don't softlockup when evicting inodes
inode: rename i_wb_list to i_io_list
sync: serialise per-superblock sync operations
inode: convert inode_sb_list_lock to per-sb
inode: add hlist_fake to avoid the inode hash lock in evict
writeback: plug writeback at a high level
change sb_writers to use percpu_rw_semaphore
shift percpu_counter_destroy() into destroy_super_work()
percpu-rwsem: kill CONFIG_PERCPU_RWSEM
percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
percpu-rwsem: introduce percpu_down_read_trylock()
document rwsem_release() in sb_wait_write()
fix the broken lockdep logic in __sb_start_write()
introduce __sb_writers_{acquired,release}() helpers
ufs_inode_get{frag,block}(): get rid of 'phys' argument
ufs_getfrag_block(): tidy up a bit
...
Merge patch-bomb from Andrew Morton:
- a few misc things
- Andy's "ambient capabilities"
- fs/nofity updates
- the ocfs2 queue
- kernel/watchdog.c updates and feature work.
- some of MM. Includes Andrea's userfaultfd feature.
[ Hadn't noticed that userfaultfd was 'default y' when applying the
patches, so that got fixed in this merge instead. We do _not_ mark
new features that nobody uses yet 'default y' - Linus ]
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
mm/hugetlb.c: make vma_has_reserves() return bool
mm/madvise.c: make madvise_behaviour_valid() return bool
mm/memory.c: make tlb_next_batch() return bool
mm/dmapool.c: change is_page_busy() return from int to bool
mm: remove struct node_active_region
mremap: simplify the "overlap" check in mremap_to()
mremap: don't do uneccesary checks if new_len == old_len
mremap: don't do mm_populate(new_addr) on failure
mm: move ->mremap() from file_operations to vm_operations_struct
mremap: don't leak new_vma if f_op->mremap() fails
mm/hugetlb.c: make vma_shareable() return bool
mm: make GUP handle pfn mapping unless FOLL_GET is requested
mm: fix status code which move_pages() returns for zero page
mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout()
genalloc: add support of multiple gen_pools per device
genalloc: add name arg to gen_pool_get() and devm_gen_pool_create()
mm/memblock: WARN_ON when nid differs from overlap region
Documentation/features/vm: add feature description and arch support status for batched TLB flush after unmap
mm: defer flush of writable TLB entries
mm: send one IPI per CPU to TLB flush all entries after unmapping pages
...
In commit f341861fb0 ("task_work: add a scheduling point in
task_work_run()") I fixed a latency problem adding a cond_resched()
call.
Later, commit ac3d0da8f3 added yet another loop to reverse a list,
bringing back the latency spike :
I've seen in some cases this loop taking 275 ms, if for example a
process with 2,000,000 files is killed.
We could add yet another cond_resched() in the reverse loop, or we
can simply remove the reversal, as I do not think anything
would depend on order of task_work_add() submitted works.
Fixes: ac3d0da8f3 ("task_work: Make task_work_add() lockless")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>