SLUB creates a per-cache directory under /sys/kernel/slab which hosts a
bunch of debug files. Usually, there aren't that many caches on a
system and this doesn't really matter; however, if memcg is in use, each
cache can have per-cgroup sub-caches. SLUB creates the same directories
for these sub-caches under /sys/kernel/slab/$CACHE/cgroup.
Unfortunately, because there can be a lot of cgroups, active or
draining, the product of the numbers of caches, cgroups and files in
each directory can reach a very high number - hundreds of thousands is
commonplace. Millions and beyond aren't difficult to reach either.
What's under /sys/kernel/slab is primarily for debugging and the
information and control on the a root cache already cover its
sub-caches. While having a separate directory for each sub-cache can be
helpful for development, it doesn't make much sense to pay this amount
of overhead by default.
This patch introduces a boot parameter slub_memcg_sysfs which determines
whether to create sysfs directories for per-memcg sub-caches. It also
adds CONFIG_SLUB_MEMCG_SYSFS_ON which determines the boot parameter's
default value and defaults to 0.
[akpm@linux-foundation.org: kset_unregister(NULL) is legal]
Link: http://lkml.kernel.org/r/20170204145203.GB26958@mtj.duckdns.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull char/misc driver updates from Greg KH:
"Here is the big char/misc driver patchset for 4.11-rc1.
Lots of different driver subsystems updated here: rework for the
hyperv subsystem to handle new platforms better, mei and w1 and extcon
driver updates, as well as a number of other "minor" driver updates.
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (169 commits)
goldfish: Sanitize the broken interrupt handler
x86/platform/goldfish: Prevent unconditional loading
vmbus: replace modulus operation with subtraction
vmbus: constify parameters where possible
vmbus: expose hv_begin/end_read
vmbus: remove conditional locking of vmbus_write
vmbus: add direct isr callback mode
vmbus: change to per channel tasklet
vmbus: put related per-cpu variable together
vmbus: callback is in softirq not workqueue
binder: Add support for file-descriptor arrays
binder: Add support for scatter-gather
binder: Add extra size to allocator
binder: Refactor binder_transact()
binder: Support multiple /dev instances
binder: Deal with contexts in debugfs
binder: Support multiple context managers
binder: Split flat_binder_object
auxdisplay: ht16k33: remove private workqueue
auxdisplay: ht16k33: rework input device initialization
...
Pull rodata updates from Kees Cook:
"This renames the (now inaccurate) DEBUG_RODATA and related
SET_MODULE_RONX configs to the more sensible STRICT_KERNEL_RWX and
STRICT_MODULE_RWX"
* tag 'rodata-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX
arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common
Pull exception table module split from Paul Gortmaker:
"Final extable.h related changes.
This completes the separation of exception table content from the
module.h header file. This is achieved with the final commit that
removes the one line back compatible change that sourced extable.h
into the module.h file.
The commits are unchanged since January, with the exception of a
couple Acks that came in for the last two commits a bit later. The
changes have been in linux-next for quite some time[1] and have got
widespread arch coverage via toolchains I have and also from
additional ones the kbuild bot has.
Maintaners of the various arch were Cc'd during the postings to
lkml[2] and informed that the intention was to take the remaining arch
specific changes and lump them together with the final two non-arch
specific changes and submit for this merge window.
The ia64 diffstat stands out and probably warrants a mention. In an
earlier review, Al Viro made a valid comment that the original header
separation of content left something to be desired, and that it get
fixed as a part of this change, hence the larger diffstat"
* tag 'extable-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (21 commits)
module.h: remove extable.h include now users have migrated
core: migrate exception table users off module.h and onto extable.h
cris: migrate exception table users off module.h and onto extable.h
hexagon: migrate exception table users off module.h and onto extable.h
microblaze: migrate exception table users off module.h and onto extable.h
unicore32: migrate exception table users off module.h and onto extable.h
score: migrate exception table users off module.h and onto extable.h
metag: migrate exception table users off module.h and onto extable.h
arc: migrate exception table users off module.h and onto extable.h
nios2: migrate exception table users off module.h and onto extable.h
sparc: migrate exception table users onto extable.h
openrisc: migrate exception table users off module.h and onto extable.h
frv: migrate exception table users off module.h and onto extable.h
sh: migrate exception table users off module.h and onto extable.h
xtensa: migrate exception table users off module.h and onto extable.h
mn10300: migrate exception table users off module.h and onto extable.h
alpha: migrate exception table users off module.h and onto extable.h
arm: migrate exception table users off module.h and onto extable.h
m32r: migrate exception table users off module.h and onto extable.h
ia64: ensure exception table search users include extable.h
...
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- Implement wraparound-safe refcount_t and kref_t types based on
generic atomic primitives (Peter Zijlstra)
- Improve and fix the ww_mutex code (Nicolai Hähnle)
- Add self-tests to the ww_mutex code (Chris Wilson)
- Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
Bueso)
- Micro-optimize the current-task logic all around the core kernel
(Davidlohr Bueso)
- Tidy up after recent optimizations: remove stale code and APIs,
clean up the code (Waiman Long)
- ... plus misc fixes, updates and cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
fork: Fix task_struct alignment
locking/spinlock/debug: Remove spinlock lockup detection code
lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
lkdtm: Convert to refcount_t testing
kref: Implement 'struct kref' using refcount_t
refcount_t: Introduce a special purpose refcount type
sched/wake_q: Clarify queue reinit comment
sched/wait, rcuwait: Fix typo in comment
locking/mutex: Fix lockdep_assert_held() fail
locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
locking/rwsem: Reinit wake_q after use
locking/rwsem: Remove unnecessary atomic_long_t casts
jump_labels: Move header guard #endif down where it belongs
locking/atomic, kref: Implement kref_put_lock()
locking/ww_mutex: Turn off __must_check for now
locking/atomic, kref: Avoid more abuse
locking/atomic, kref: Use kref_get_unless_zero() more
locking/atomic, kref: Kill kref_sub()
locking/atomic, kref: Add kref_read()
locking/atomic, kref: Add KREF_INIT()
...
Pull scheduler updates from Ingo Molnar:
"The main changes in this (fairly busy) cycle were:
- There was a class of scheduler bugs related to forgetting to update
the rq-clock timestamp which can cause weird and hard to debug
problems, so there's a new debug facility for this: which uncovered
a whole lot of bugs which convinced us that we want to keep the
debug facility.
(Peter Zijlstra, Matt Fleming)
- Various cputime related updates: eliminate cputime and use u64
nanoseconds directly, simplify and improve the arch interfaces,
implement delayed accounting more widely, etc. - (Frederic
Weisbecker)
- Move code around for better structure plus cleanups (Ingo Molnar)
- Move IO schedule accounting deeper into the scheduler plus related
changes to improve the situation (Tejun Heo)
- ... plus a round of sched/rt and sched/deadline fixes, plus other
fixes, updats and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
sched/core: Remove unlikely() annotation from sched_move_task()
sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
sched/topology: Split out scheduler topology code from core.c into topology.c
sched/core: Remove unnecessary #include headers
sched/rq_clock: Consolidate the ordering of the rq_clock methods
delayacct: Include <uapi/linux/taskstats.h>
sched/core: Clean up comments
sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
sched/clock: Add dummy clear_sched_clock_stable() stub function
sched/cputime: Remove generic asm headers
sched/cputime: Remove unused nsec_to_cputime()
s390, sched/cputime: Remove unused cputime definitions
powerpc, sched/cputime: Remove unused cputime definitions
s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
ia64, sched/cputime: Remove unused cputime definitions
ia64: Convert vtime to use nsec units directly
ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
sched/cputime: Remove jiffies based cputime
sched/cputime, vtime: Return nsecs instead of cputime_t to account
sched/cputime: Complete nsec conversion of tick based accounting
...
Pull EFI updates from Ingo Molnar:
"The main changes in this cycle were:
- Changes to the EFI init code to establish whether secure boot
authentication was performed at boot time. (Josh Boyer, David
Howells)
- Wire up the UEFI memory attributes table for x86. This eliminates
any runtime memory regions that are both writable and executable,
on recent firmware versions. (Sai Praneeth)
- Move the BGRT init code to an earlier stage so that we can still
use efi_mem_reserve(). (Dave Young)
- Preserve debug symbols in the ARM/arm64 UEFI stub (Ard Biesheuvel)
- Code deduplication work and various other cleanups (Lukas Wunner)
- ... plus various other fixes and cleanups"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/libstub: Make file I/O chunking x86-specific
efi: Print the secure boot status in x86 setup_arch()
efi: Disable secure boot if shim is in insecure mode
efi: Get and store the secure boot status
efi: Add SHIM and image security database GUID definitions
arm/efi: Allow invocation of arbitrary runtime services
x86/efi: Allow invocation of arbitrary runtime services
efi/libstub: Preserve .debug sections after absolute relocation check
efi/x86: Add debug code to print cooked memmap
efi/x86: Move the EFI BGRT init code to early init code
efi: Use typed function pointers for the runtime services table
efi/esrt: Fix typo in pr_err() message
x86/efi: Add support for EFI_MEMORY_ATTRIBUTES_TABLE
efi: Introduce the EFI_MEM_ATTR bit and set it from the memory attributes table
efi: Make EFI_MEMORY_ATTRIBUTES_TABLE initialization common across all architectures
x86/efi: Deduplicate efi_char16_printk()
efi: Deduplicate efi_file_size() / _read() / _close()
Pull RCU updates from Ingo Molnar:
"The RCU changes in this cycle are:
- Dynticks updates, consolidating open-coded counter accesses into a
well-defined API
- SRCU updates: Simplify algorithm, add formal verification
- Documentation updates
- Miscellaneous fixes
- Torture-test updates
Most of the diffstat comes from the relatively large documentation
update"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
srcu: Reduce probability of SRCU ->unlock_count[] counter overflow
rcutorture: Add CBMC-based formal verification for SRCU
srcu: Force full grace-period ordering
srcu: Implement more-efficient reader counts
rcu: Adjust FQS offline checks for exact online-CPU detection
rcu: Check cond_resched_rcu_qs() state less often to reduce GP overhead
rcu: Abstract extended quiescent state determination
rcu: Abstract dynticks extended quiescent state enter/exit operations
rcu: Add lockdep checks to synchronous expedited primitives
rcu: Eliminate unused expedited_normal counter
llist: Clarify comments about when locking is needed
rcu: Fix comment in rcu_organize_nocb_kthreads()
rcu: Enable RCU tracepoints by default to aid in debugging
rcu: Make rcu_cpu_starting() use its "cpu" argument
rcu: Add comment headers to expedited-grace-period counter functions
rcu: Don't wake rcuc/X kthreads on NOCB CPUs
rcu: Re-enable TASKS_RCU for User Mode Linux
rcu: Once again use NMI-based stack traces in stall warnings
rcu: Remove short-term CPU kicking
rcu: Add long-term CPU kicking
...
These files were including module.h for exception table related
functions. We've now separated that content out into its own file
"extable.h" so now move over to that and where possible, avoid all
the extra header content in module.h that we don't really need to
compile these non-modular files.
Note:
init/main.c still needs module.h for __init_or_module
kernel/extable.c still needs module.h for is_module_text_address
...and so we don't get the benefit of removing module.h from the cpp
feed for these two files, unlike the almost universal 1:1 exchange
of module.h for extable.h we were able to do in the arch dirs.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Both of these options are poorly named. The features they provide are
necessary for system security and should not be considered debug only.
Change the names to CONFIG_STRICT_KERNEL_RWX and
CONFIG_STRICT_MODULE_RWX to better describe what these options do.
Signed-off-by: Laura Abbott <labbott@redhat.com>
Acked-by: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
This add the kbuild infrastructure that will allow architectures to emit
vmlinux symbol CRCs as 32-bit offsets to another location in the kernel
where the actual value is stored. This works around problems with CRCs
being mistaken for relocatable symbols on kernels that self relocate at
runtime (i.e., powerpc with CONFIG_RELOCATABLE=y)
For the kbuild side of things, this comes down to the following:
- introducing a Kconfig symbol MODULE_REL_CRCS
- adding a -R switch to genksyms to instruct it to emit the CRC symbols
as references into the .rodata section
- making modpost distinguish such references from absolute CRC symbols
by the section index (SHN_ABS)
- making kallsyms disregard non-absolute symbols with a __crc_ prefix
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that User Mode Linux supports arch_irqs_disabled_flags(), this
commit re-enables TASKS_RCU for User Mode Linux.
Reported-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
PC/104 form factor devices serve a specific niche of embedded system
users; most Linux users will not have PC/104 form factor devices. This
patch introduces the PC104 Kconfig option, which should be used to
filter PC/104 specific device drivers and options, so that only those
users interested in PC/104 related options are exposed to them.
Signed-off-by: William Breathitt Gray <vilhelm.gray@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
RCU_EXPEDITE_BOOT should speed up the boot process by enforcing
synchronize_rcu_expedited() instead of synchronize_rcu() during the boot
process. There should be no reason why one does not want this and there
is no need worry about real time latency at this point.
Therefore make it default.
Note that users wishing to avoid expediting entirely, for example when
bringing up new hardware possibly having flaky IPIs, can use the
rcu_normal boot parameter to override boot-time expediting.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[ paulmck: Reworded commit log. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently we switch to the stable sched_clock if we guess the TSC is
usable, and then switch back to the unstable path if it turns out TSC
isn't stable during SMP bringup after all.
Delay switching to the stable path until after SMP bringup is
complete. This way we'll avoid switching during the time we detect the
worst of the TSC offences.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We now 'select SOCK_CGROUP_DATA' but Kconfig complains that this is
not right when CONFIG_NET is disabled and there is no socket interface:
warning: (CGROUP_BPF) selects SOCK_CGROUP_DATA which has unmet direct dependencies (NET)
I don't know what the correct solution for this is, but simply removing
the dependency on NET from SOCK_CGROUP_DATA by moving it out of the
'if NET' section avoids the warning and does not produce other build
errors.
Fixes: 483c4933ea ("cgroup: Fix CGROUP_BPF config")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new page flag, PageWaiters, to indicate the page waitqueue has
tasks waiting. This can be tested rather than testing waitqueue_active
which requires another cacheline load.
This bit is always set when the page has tasks on page_waitqueue(page),
and is set and cleared under the waitqueue lock. It may be set when
there are no tasks on the waitqueue, which will cause a harmless extra
wakeup check that will clears the bit.
The generic bit-waitqueue infrastructure is no longer used for pages.
Instead, waitqueues are used directly with a custom key type. The
generic code was not flexible enough to have PageWaiters manipulation
under the waitqueue lock (which simplifies concurrency).
This improves the performance of page lock intensive microbenchmarks by
2-3%.
Putting two bits in the same word opens the opportunity to remove the
memory barrier between clearing the lock bit and testing the waiters
bit, after some work on the arch primitives (e.g., ensuring memory
operand widths match and cover both bits).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes and cleanups from David Miller:
1) Revert bogus nla_ok() change, from Alexey Dobriyan.
2) Various bpf validator fixes from Daniel Borkmann.
3) Add some necessary SET_NETDEV_DEV() calls to hsis_femac and hip04
drivers, from Dongpo Li.
4) Several ethtool ksettings conversions from Philippe Reynes.
5) Fix bugs in inet port management wrt. soreuseport, from Tom Herbert.
6) XDP support for virtio_net, from John Fastabend.
7) Fix NAT handling within a vrf, from David Ahern.
8) Endianness fixes in dpaa_eth driver, from Claudiu Manoil
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (63 commits)
net: mv643xx_eth: fix build failure
isdn: Constify some function parameters
mlxsw: spectrum: Mark split ports as such
cgroup: Fix CGROUP_BPF config
qed: fix old-style function definition
net: ipv6: check route protocol when deleting routes
r6040: move spinlock in r6040_close as SOFTIRQ-unsafe lock order detected
irda: w83977af_ir: cleanup an indent issue
net: sfc: use new api ethtool_{get|set}_link_ksettings
net: davicom: dm9000: use new api ethtool_{get|set}_link_ksettings
net: cirrus: ep93xx: use new api ethtool_{get|set}_link_ksettings
net: chelsio: cxgb3: use new api ethtool_{get|set}_link_ksettings
net: chelsio: cxgb2: use new api ethtool_{get|set}_link_ksettings
bpf: fix mark_reg_unknown_value for spilled regs on map value marking
bpf: fix overflow in prog accounting
bpf: dynamically allocate digest scratch buffer
gtp: Fix initialization of Flags octet in GTPv1 header
gtp: gtp_check_src_ms_ipv4() always return success
net/x25: use designated initializers
isdn: use designated initializers
...
CGROUP_BPF depended on SOCK_CGROUP_DATA which can't be manually
enabled, making it rather challenging to turn CGROUP_BPF on.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>