// SPDX-License-Identifier: GPL-2.0-only
/*
 * sysctl.c: General linux system control interface
 *
 * Begun 24 March 1995, Stephen Tweedie
 * Added /proc support, Dec 1995
 * Added bdflush entry and intvec min/max checking, 2/23/96, Tom Dyas.
 * Added hooks for /proc/sys/net (minor, minor patch), 96/4/1, Mike Shaver.
 * Added kernel/java-{interpreter,appletviewer}, 96/5/10, Mike Shaver.
 * Dynamic registration fixes, Stephen Tweedie.
 * Added kswapd-interval, ctrl-alt-del, printk stuff, 1/8/97, Chris Horn.
 * Made sysctl support optional via CONFIG_SYSCTL, 1/10/97, Chris
 * Horn.
 * Added proc_doulongvec_ms_jiffies_minmax, 09/08/99, Carlos H. Bauer.
 * Added proc_doulongvec_minmax, 09/08/99, Carlos H. Bauer.
 * Changed linked lists to use list.h instead of lists.h, 02/24/00, Bill
 * Wendling.
 * The list_for_each() macro wasn't appropriate for the sysctl loop.
 * Removed it and replaced it with older style, 03/23/00, Bill Wendling
 */

#include <linux/module.h>
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/slab.h>
#include <linux/sysctl.h>
#include <linux/bitmap.h>
#include <linux/signal.h>
#include <linux/panic.h>
kptr_restrict for hiding kernel pointers from unprivileged users
Add the %pK printk format specifier and the /proc/sys/kernel/kptr_restrict
sysctl.
The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces. Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers. The behavior of %pK depends on the kptr_restrict sysctl.
If kptr_restrict is set to 0, no deviation from the standard %p behavior
occurs. If kptr_restrict is set to 1, the default, if the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges. Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".
[akpm@linux-foundation.org: check for IRQ context when !kptr_restrict, save an indent level, s/WARN/WARN_ONCE/]
[akpm@linux-foundation.org: coding-style fixup]
[randy.dunlap@oracle.com: fix kernel/sysctl.c warning]
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Thomas Graf <tgraf@infradead.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-12 16:59:41 -08:00
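The three kptr_restrict levels described in the changelog above can be modeled in user space. What follows is a minimal sketch under stated assumptions, not the kernel's vsprintf code: the helper name format_kptr and the has_cap_syslog flag are hypothetical, the output is fixed-width hex rather than the kernel's exact %p formatting, and the IRQ-context check mentioned in the changelog notes is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical user-space model of the %pK policy described above.
 * level 0: print the pointer value unchanged.
 * level 1: print zeros unless the reader holds CAP_SYSLOG.
 * level 2: always print zeros.
 * Returns the number of characters written (excluding the NUL).
 */
static int format_kptr(char *buf, size_t size, int level,
		       bool has_cap_syslog, uintptr_t ptr)
{
	bool hide;

	assert(level >= 0 && level <= 2);
	hide = (level == 2) || (level == 1 && !has_cap_syslog);
	if (hide)
		ptr = 0;
	/* Fixed-width hex, so a hidden pointer reads as a run of 0's. */
	return snprintf(buf, size, "%0*llx", (int)(2 * sizeof(void *)),
			(unsigned long long)ptr);
}
```

Calling format_kptr() with level 1 and has_cap_syslog set to false fills the buffer with zeros only, while the same call with has_cap_syslog set to true keeps the pointer digits.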
#include <linux/printk.h>
#include <linux/proc_fs.h>
V3 file capabilities: alter behavior of cap_setpcap
The non-filesystem capability meaning of CAP_SETPCAP is that a process, p1,
can change the capabilities of another process, p2. This is not the
meaning that was intended for this capability at all, and this
implementation came about purely because, without filesystem capabilities,
there was no way to use capabilities without one process bestowing them on
another.
Since we now have filesystem support for capabilities we can fix the
implementation of CAP_SETPCAP.
The most significant thing about this change is that, with it in effect, no
process can set the capabilities of another process.
The capabilities of a program are set via the capability convolution
rules:
pI(post-exec) = pI(pre-exec)
pP(post-exec) = (X(aka cap_bset) & fP) | (pI(post-exec) & fI)
pE(post-exec) = fE ? pP(post-exec) : 0
at exec() time. As such, the only influence the pre-exec() program can
have on the post-exec() program's capabilities are through the pI
capability set.
The correct implementation for CAP_SETPCAP (and that enabled by this patch)
is that it can be used to add extra pI capabilities to the current process
- to be picked up by subsequent exec()s when the above convolution rules
are applied.
Here is how it works:
Let's say we have a process, p. It has capability sets, pE, pP and pI.
Generally, p, can change the value of its own pI to pI' where
(pI' & ~pI) & ~pP = 0.
That is, the only new things in pI' that were not present in pI need to
be present in pP.
The role of CAP_SETPCAP is basically to permit changes to pI beyond
the above:
if (pE & CAP_SETPCAP) {
pI' = anything; /* ie., even (pI' & ~pI) & ~pP != 0 */
}
This capability is useful for things like login, which (say, via
pam_cap) might want to raise certain inheritable capabilities for use
by the children of the logged-in user's shell, but those capabilities
are not useful to or needed by the login program itself.
One such use might be to limit who can run ping. You set the
capabilities of the 'ping' program to be "= cap_net_raw+i", and then
only shells that have (pI & CAP_NET_RAW) will be able to run
it. Without CAP_SETPCAP implemented as described above, login(pam_cap)
would have to also have (pP & CAP_NET_RAW) in order to raise this
capability and pass it on through the inheritable set.
Signed-off-by: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 03:05:59 -07:00
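The convolution rules and the pI' condition from the changelog above can be written out as bit-mask arithmetic. This is a minimal sketch, assuming capability sets fit in a 64-bit mask; the type capset_t, the struct proc_caps and both function names are hypothetical and do not correspond to kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Capability sets modeled as bit masks; names follow the changelog above. */
typedef uint64_t capset_t;

struct proc_caps {
	capset_t inheritable;	/* pI */
	capset_t permitted;	/* pP */
	capset_t effective;	/* pE */
};

/*
 * Apply the exec()-time rules quoted above:
 *   pI' = pI
 *   pP' = (X & fP) | (pI' & fI)
 *   pE' = fE ? pP' : 0
 * bset is X (cap_bset); file_p, file_i and file_e are fP, fI and fE.
 */
static struct proc_caps caps_after_exec(struct proc_caps pre, capset_t bset,
					capset_t file_p, capset_t file_i,
					bool file_e)
{
	struct proc_caps post;

	post.inheritable = pre.inheritable;
	post.permitted = (bset & file_p) | (post.inheritable & file_i);
	post.effective = file_e ? post.permitted : 0;
	return post;
}

/*
 * A process may change pI to new_i if every newly added bit is already
 * in pP, i.e. (new_i & ~pI) & ~pP == 0, or unconditionally if
 * setpcap_bit (the bit standing in for CAP_SETPCAP) is set in pE.
 */
static bool may_set_inheritable(struct proc_caps p, capset_t new_i,
				capset_t setpcap_bit)
{
	assert(setpcap_bit != 0);
	if (p.effective & setpcap_bit)
		return true;
	return ((new_i & ~p.inheritable) & ~p.permitted) == 0;
}
```

In the ping scenario above, a login process with only the CAP_SETPCAP bit in pE passes may_set_inheritable() for a new_i containing the CAP_NET_RAW bit, and a shell that inherits that bit ends up with it in pP after caps_after_exec() when the file's fI contains it.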
#include <linux/security.h>
#include <linux/ctype.h>
#include <linux/kmemleak.h>
#include <linux/filter.h>
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/net.h>
#include <linux/sysrq.h>
#include <linux/highuid.h>
#include <linux/writeback.h>
#include <linux/ratelimit.h>
#include <linux/compaction.h>
#include <linux/hugetlb.h>
#include <linux/initrd.h>
#include <linux/key.h>
#include <linux/times.h>
#include <linux/limits.h>
#include <linux/dcache.h>
#include <linux/syscalls.h>
#include <linux/vmstat.h>
#include <linux/nfs_fs.h>
#include <linux/acpi.h>
#include <linux/reboot.h>
#include <linux/ftrace.h>
perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has outgrown its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in all, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks go to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 12:02:48 +02:00
#include <linux/perf_event.h>
#include <linux/oom.h>
#include <linux/kmod.h>
#include <linux/capability.h>
#include <linux/binfmts.h>
#include <linux/sched/sysctl.h>
#include <linux/mount.h>
userfaultfd/sysctl: add vm.unprivileged_userfaultfd
Userfaultfd can be misused to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available. By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote, stable for an extended period, which it can then access using an
existing exploit. While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.
We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.
Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users. When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.
Andrea said:
: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd. In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.
[dgilbert@redhat.com: changelog additions]
[akpm@linux-foundation.org: documentation tweak, per Mike]
Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-13 17:16:41 -07:00
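The access rule stated in the changelog above reduces to a two-input decision. A minimal sketch, assuming the knob only takes the values 0 and 1; the function name userfaultfd_permitted and the has_cap_sys_ptrace flag are hypothetical and this is not the kernel's actual check.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the policy described above: with the knob at 0,
 * only callers holding CAP_SYS_PTRACE may use userfaultfd; with it at 1,
 * any caller may.
 */
static bool userfaultfd_permitted(int unprivileged_userfaultfd,
				  bool has_cap_sys_ptrace)
{
	assert(unprivileged_userfaultfd == 0 || unprivileged_userfaultfd == 1);
	return unprivileged_userfaultfd || has_cap_sys_ptrace;
}
```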
#include <linux/userfaultfd_k.h>
#include <linux/pid.h>

#include "../lib/kstrtox.h"

#include <linux/uaccess.h>
#include <asm/processor.h>

#ifdef CONFIG_X86
#include <asm/nmi.h>
#include <asm/stacktrace.h>
#include <asm/io.h>
#endif
#ifdef CONFIG_SPARC
#include <asm/setup.h>
#endif
#ifdef CONFIG_RT_MUTEXES
#include <linux/rtmutex.h>
#endif

#if defined(CONFIG_SYSCTL)

/* Constants used for minimum and maximum */

#ifdef CONFIG_PERF_EVENTS
static const int six_hundred_forty_kb = 640 * 1024;
#endif


static const int ngroups_max = NGROUPS_MAX;
static const int cap_last_cap = CAP_LAST_CAP;

#ifdef CONFIG_PROC_SYSCTL
sysctl: allow for strict write position handling
When writing to a sysctl string, each write, regardless of VFS position,
begins writing the string from the start. This means the contents of
the last write to the sysctl controls the string contents instead of the
first:
open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
write(1, "/bin/true", 9) = 9
close(1) = 0
$ cat /proc/sys/kernel/modprobe
/bin/true
Expected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write. Similarly, multiple short writes would
not append to the sysctl.
The old behavior is unlike regular POSIX files enough that doing audits
of software that interact with sysctls can end up in unexpected or
dangerous situations. For example, "as long as the input starts with a
trusted path" turns out to be an insufficient filter, as what must also
happen is for the input to be entirely contained in a single write
syscall -- not a common consideration, especially for high level tools.
This provides kernel.sysctl_writes_strict as a way to make this behavior
act in a less surprising manner for strings, and disallows non-zero file
position when writing numeric sysctls (similar to what is already done
when reading from non-zero file positions). For now, the default (0) is
to warn about non-zero file position use, but retain the legacy
behavior. Setting this to -1 disables the warning, and setting this to
1 enables the file position respecting behavior.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: move misplaced hunk, per Randy]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 14:37:19 -07:00
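The difference between the legacy and the position-respecting behaviour described in the changelog above can be reproduced with a small user-space model of a single write() to a string sysctl. This is a sketch under stated assumptions: the function name model_string_write and its parameters are hypothetical, and the strict flag stands in for kernel.sysctl_writes_strict being 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * User-space model of one write() to a string sysctl. data is the
 * sysctl's backing buffer of maxlen bytes, and *pos is the file position
 * carried between writes. With strict set, the write lands at *pos and is
 * dropped if *pos lies beyond the current string; otherwise every write
 * restarts at offset 0, which is the legacy behaviour.
 */
static void model_string_write(char *data, size_t maxlen, bool strict,
			       const char *src, size_t count, size_t *pos)
{
	size_t len = 0;
	size_t i;

	assert(data && maxlen > 0 && pos);
	if (strict) {
		len = strlen(data);
		if (len > maxlen - 1)
			len = maxlen - 1;
		if (*pos > len)
			return;		/* position past the end: ignore */
		len = *pos;
	}

	*pos += count;
	for (i = 0; i < count && len < maxlen - 1; i++) {
		if (src[i] == '\0' || src[i] == '\n')
			break;
		data[len++] = src[i];
	}
	data[len] = '\0';
}
```

With an 8-byte buffer, writing "AAAA" and then "/bin" through the same position leaves "/bin" in legacy mode and "AAAA/bi" in strict mode. If the first write is longer than the buffer, the strict model keeps the truncated first value and ignores the second write, which matches the "expected behaviour" the changelog describes.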

/**
 * enum sysctl_writes_mode - supported sysctl write modes
 *
 * @SYSCTL_WRITES_LEGACY: each write syscall must fully contain the sysctl value
 *	to be written, and multiple writes on the same sysctl file descriptor
 *	will rewrite the sysctl value, regardless of file position. No warning
 *	is issued when the initial position is not 0.
 * @SYSCTL_WRITES_WARN: same as above but warn when the initial file position is
 *	not 0.
 * @SYSCTL_WRITES_STRICT: writes to numeric sysctl entries must always be at
 *	file position 0 and the value must be fully contained in the buffer
 *	sent to the write syscall. If dealing with strings respect the file
 *	position, but restrict this to the max length of the buffer, anything
 *	past the max length will be ignored. Multiple writes will append
 *	to the buffer.
 *
 * These write modes control how current file position affects the behavior of
 * updating sysctl values through the proc interface on each write.
 */
enum sysctl_writes_mode {
	SYSCTL_WRITES_LEGACY		= -1,
	SYSCTL_WRITES_WARN		= 0,
	SYSCTL_WRITES_STRICT		= 1,
};

static enum sysctl_writes_mode sysctl_writes_strict = SYSCTL_WRITES_STRICT;
#endif /* CONFIG_PROC_SYSCTL */

#if defined(HAVE_ARCH_PICK_MMAP_LAYOUT) || \
    defined(CONFIG_ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT)
int sysctl_legacy_va_layout;
#endif

#ifdef CONFIG_COMPACTION
/* min_extfrag_threshold is SYSCTL_ZERO */
static const int max_extfrag_threshold = 1000;
#endif

#endif /* CONFIG_SYSCTL */

/*
 * /proc/sys support
 */

#ifdef CONFIG_PROC_SYSCTL

static int _proc_do_string(char *data, int maxlen, int write,
		char *buffer, size_t *lenp, loff_t *ppos)
{
	size_t len;
	char c, *p;

	if (!data || !maxlen || !*lenp) {
		*lenp = 0;
		return 0;
	}

	if (write) {
		if (sysctl_writes_strict == SYSCTL_WRITES_STRICT) {
			/* Only continue writes not past the end of buffer. */
			len = strlen(data);
			if (len > maxlen - 1)
				len = maxlen - 1;

			if (*ppos > len)
				return 0;
			len = *ppos;
		} else {
			/* Start writing from beginning of buffer. */
			len = 0;
		}

		*ppos += *lenp;
		p = buffer;
		while ((p - buffer) < *lenp && len < maxlen - 1) {
			c = *(p++);
			if (c == 0 || c == '\n')
				break;
			data[len++] = c;
		}
		data[len] = 0;
	} else {
		len = strlen(data);
		if (len > maxlen)
			len = maxlen;

		if (*ppos > len) {
			*lenp = 0;
			return 0;
		}

		data += *ppos;
		len -= *ppos;

		if (len > *lenp)
			len = *lenp;
		if (len)
			memcpy(buffer, data, len);
		if (len < *lenp) {
			buffer[len] = '\n';
			len++;
		}
		*lenp = len;
		*ppos += len;
	}
	return 0;
}

static void warn_sysctl_write(struct ctl_table *table)
{
	pr_warn_once("%s wrote to %s when file position was not 0!\n"
		"This will not be supported in the future. To silence this\n"
		"warning, set kernel.sysctl_writes_strict = -1\n",
		current->comm, table->procname);
}

/**
 * proc_first_pos_non_zero_ignore - check if first position is allowed
 * @ppos: file position
 * @table: the sysctl table
 *
 * Returns true if the first position is non-zero and the sysctl_writes_strict
 * mode indicates this is not allowed for numeric input types. String proc
 * handlers can ignore the return value.
 */
static bool proc_first_pos_non_zero_ignore(loff_t *ppos,
					   struct ctl_table *table)
{
	if (!*ppos)
		return false;

	switch (sysctl_writes_strict) {
	case SYSCTL_WRITES_STRICT:
		return true;
	case SYSCTL_WRITES_WARN:
		warn_sysctl_write(table);
		return false;
	default:
		return false;
	}
}

/**
 * proc_dostring - read a string sysctl
 * @table: the sysctl table
 * @write: %TRUE if this is a write to the sysctl file
 * @buffer: the user buffer
 * @lenp: the size of the user buffer
 * @ppos: file position
 *
 * Reads/writes a string from/to the user buffer. If the kernel
 * buffer provided is not large enough to hold the string, the
 * string is truncated. The copied string is %NULL-terminated.
 * If the string is being read by the user process, it is copied
 * and a newline '\n' is added. It is truncated if the buffer is
 * not large enough.
 *
 * Returns 0 on success.
 */
int proc_dostring(struct ctl_table *table, int write,
		  void *buffer, size_t *lenp, loff_t *ppos)
{
	if (write)
		proc_first_pos_non_zero_ignore(ppos, table);
sysctl: allow for strict write position handling
When writing to a sysctl string, each write, regardless of VFS position,
begins writing the string from the start. This means the contents of
the last write to the sysctl controls the string contents instead of the
first:
open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
write(1, "/bin/true", 9) = 9
close(1) = 0
$ cat /proc/sys/kernel/modprobe
/bin/true
Expected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write. Similarly, multiple short writes would
not append to the sysctl.
The old behavior is unlike regular POSIX files enough that doing audits
of software that interact with sysctls can end up in unexpected or
dangerous situations. For example, "as long as the input starts with a
trusted path" turns out to be an insufficient filter, as what must also
happen is for the input to be entirely contained in a single write
syscall -- not a common consideration, especially for high level tools.
This provides kernel.sysctl_writes_strict as a way to make this behavior
act in a less surprising manner for strings, and disallows non-zero file
position when writing numeric sysctls (similar to what is already done
when reading from non-zero file positions). For now, the default (0) is
to warn about non-zero file position use, but retain the legacy
behavior. Setting this to -1 disables the warning, and setting this to
1 enables the file position respecting behavior.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: move misplaced hunk, per Randy]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

	return _proc_do_string(table->data, table->maxlen, write, buffer, lenp,
			ppos);
}

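The write-position semantics described in the "strict write position handling" commit message can be modelled in userspace. The sketch below is not the kernel's _proc_do_string(); struct sim_sysctl, sim_write(), SIM_MAXLEN and the strict flag are hypothetical names. It only illustrates the difference the message describes: legacy mode restarts at offset 0 on every write, while strict mode honours the file position and caps the string at the maximum length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_MAXLEN 16	/* hypothetical stand-in for table->maxlen */

struct sim_sysctl {
	char data[SIM_MAXLEN];
	size_t pos;		/* hypothetical stand-in for *ppos */
};

/*
 * Model one write() call. In legacy mode the file position is ignored and
 * the string is rewritten from the start. In strict mode the bytes land at
 * the current position, capped at SIM_MAXLEN - 1 characters.
 */
static void sim_write(struct sim_sysctl *s, const char *buf, size_t len,
		      int strict)
{
	size_t start = strict ? s->pos : 0;
	size_t room;

	s->pos += len;			/* the position always advances */
	if (start >= SIM_MAXLEN - 1)
		return;			/* nothing fits any more */

	room = SIM_MAXLEN - 1 - start;
	if (len > room)
		len = room;
	memcpy(s->data + start, buf, len);
	s->data[start + len] = '\0';
}
```

With two consecutive writes of a long run of 'A' characters followed by "/bin/true", the legacy model ends up holding "/bin/true", whereas the strict model keeps the truncated run of 'A' characters from the first write.
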
static size_t proc_skip_spaces(char **buf)
{
	size_t ret;
	char *tmp = skip_spaces(*buf);
	ret = tmp - *buf;
	*buf = tmp;
	return ret;
}

static void proc_skip_char(char **buf, size_t *size, const char v)
{
	while (*size) {
		if (**buf != v)
			break;
		(*size)--;
		(*buf)++;
	}
}

/**
 * strtoul_lenient - parse an ASCII formatted integer from a buffer and only
 *                   fail on overflow
 *
 * @cp: kernel buffer containing the string to parse
 * @endp: pointer to store the trailing characters
 * @base: the base to use
 * @res: where the parsed integer will be stored
 *
 * In case of success 0 is returned and @res will contain the parsed integer,
 * @endp will hold any trailing characters.
 * This function will fail the parse on overflow. If there wasn't an overflow
 * the function will defer the decision what characters count as invalid to the
 * caller.
 */
static int strtoul_lenient(const char *cp, char **endp, unsigned int base,
			   unsigned long *res)
{
	unsigned long long result;
	unsigned int rv;

	cp = _parse_integer_fixup_radix(cp, &base);
	rv = _parse_integer(cp, base, &result);
	if ((rv & KSTRTOX_OVERFLOW) || (result != (unsigned long)result))
		return -ERANGE;

	cp += rv;

	if (endp)
		*endp = (char *)cp;

	*res = (unsigned long)result;
	return 0;
}

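A rough userspace analogue of the "fail only on overflow" contract documented for strtoul_lenient() can be written with the standard strtoull(). This is a sketch under assumptions, not the kernel implementation: lenient_strtoul() is a hypothetical name, and strtoull() differs from _parse_integer() in that it also accepts leading whitespace and a sign, which the kernel caller filters out beforehand.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse an unsigned long, reporting only overflow as an error. Whatever
 * follows the digits is handed back through @endp for the caller to judge.
 */
static int lenient_strtoul(const char *cp, char **endp, unsigned int base,
			   unsigned long *res)
{
	unsigned long long result;
	char *end;

	errno = 0;
	result = strtoull(cp, &end, (int)base);
	/* Overflow of unsigned long long, or a value too wide for unsigned long. */
	if (errno == ERANGE || result != (unsigned long)result)
		return -ERANGE;

	if (endp)
		*endp = end;
	*res = (unsigned long)result;
	return 0;
}
```

For an input such as "123abc" the sketch stores 123 and leaves the end pointer at 'a'; deciding whether 'a' is an acceptable trailer is left to the caller, as in proc_get_long().
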
#define TMPBUFLEN 22
/**
 * proc_get_long - reads an ASCII formatted integer from a kernel buffer
 *
 * @buf: a kernel buffer
 * @size: size of the kernel buffer
 * @val: this is where the number will be stored
 * @neg: set to %TRUE if number is negative
 * @perm_tr: a vector which contains the allowed trailers
 * @perm_tr_len: size of the perm_tr vector
 * @tr: pointer to store the trailer character
 *
 * In case of success %0 is returned and @buf and @size are updated with
 * the amount of bytes read. If @tr is non-NULL and a trailing
 * character exists (size is non-zero after returning from this
 * function), @tr is updated with the trailing character.
 */
static int proc_get_long(char **buf, size_t *size,
			  unsigned long *val, bool *neg,
			  const char *perm_tr, unsigned perm_tr_len, char *tr)
{
	int len;
	char *p, tmp[TMPBUFLEN];

	if (!*size)
		return -EINVAL;

	len = *size;
	if (len > TMPBUFLEN - 1)
		len = TMPBUFLEN - 1;

	memcpy(tmp, *buf, len);

	tmp[len] = 0;
	p = tmp;
	if (*p == '-' && *size > 1) {
		*neg = true;
		p++;
	} else
		*neg = false;
	if (!isdigit(*p))
		return -EINVAL;

	if (strtoul_lenient(p, &p, 0, val))
		return -EINVAL;

	len = p - tmp;

	/* We don't know if the next char is whitespace thus we may accept
	 * invalid integers (e.g. 1234...a) or two integers instead of one
	 * (e.g. 123...1). So let's not allow such large numbers. */
	if (len == TMPBUFLEN - 1)
		return -EINVAL;

	if (len < *size && perm_tr_len && !memchr(perm_tr, *p, perm_tr_len))
		return -EINVAL;

	if (tr && (len < *size))
		*tr = *p;

	*buf += len;
	*size -= len;

	return 0;
}

/**
 * proc_put_long - converts an integer to a decimal ASCII formatted string
 *
 * @buf: the user buffer
 * @size: the size of the user buffer
 * @val: the integer to be converted
 * @neg: sign of the number, %TRUE for negative
 *
 * In case of success @buf and @size are updated with the amount of bytes
 * written.
 */
static void proc_put_long(void **buf, size_t *size, unsigned long val, bool neg)
{
	int len;
	char tmp[TMPBUFLEN], *p = tmp;

	sprintf(p, "%s%lu", neg ? "-" : "", val);
	len = strlen(tmp);
	if (len > *size)
		len = *size;
	memcpy(*buf, tmp, len);
	*size -= len;
	*buf += len;
}
#undef TMPBUFLEN

static void proc_put_char(void **buf, size_t *size, char c)
{
	if (*size) {
		char **buffer = (char **)buf;
		**buffer = c;

		(*size)--;
		(*buffer)++;
		*buf = *buffer;
	}
}

static int do_proc_dobool_conv(bool *negp, unsigned long *lvalp,
				int *valp,
				int write, void *data)
{
	if (write) {
		*(bool *)valp = *lvalp;
	} else {
		int val = *(bool *)valp;

		*lvalp = (unsigned long)val;
		*negp = false;
	}
	return 0;
}

static int do_proc_dointvec_conv(bool *negp, unsigned long *lvalp,
				 int *valp,
				 int write, void *data)
{
	if (write) {
		if (*negp) {
			if (*lvalp > (unsigned long) INT_MAX + 1)
				return -EINVAL;
			*valp = -*lvalp;
		} else {
			if (*lvalp > (unsigned long) INT_MAX)
				return -EINVAL;
			*valp = *lvalp;
		}
	} else {
		int val = *valp;
		if (val < 0) {
			*negp = true;
			*lvalp = -(unsigned long)val;
		} else {
			*negp = false;
			*lvalp = (unsigned long)val;
		}
	}
	return 0;
}

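The asymmetric limits in do_proc_dointvec_conv() (INT_MAX for positive input, INT_MAX + 1 for negative input) follow from two's complement: INT_MIN has a magnitude one larger than INT_MAX. The standalone sketch below illustrates the same sign/magnitude round trip in userspace; magnitude_to_int() and int_to_magnitude() are hypothetical names, and the -1 error return is a stand-in for -EINVAL.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Fold a parsed (sign, magnitude) pair into an int, rejecting overflow. */
static int magnitude_to_int(bool neg, unsigned long mag, int *out)
{
	if (neg) {
		if (mag > (unsigned long)INT_MAX + 1)
			return -1;
		/* INT_MIN is handled explicitly so that no signed negation overflows. */
		*out = (mag == (unsigned long)INT_MAX + 1) ? INT_MIN : -(int)mag;
	} else {
		if (mag > (unsigned long)INT_MAX)
			return -1;
		*out = (int)mag;
	}
	return 0;
}

/* Split an int back into sign and magnitude, as the read path does. */
static void int_to_magnitude(int val, bool *neg, unsigned long *mag)
{
	*neg = val < 0;
	/* Negating in unsigned arithmetic is well defined, even for INT_MIN. */
	*mag = *neg ? -(unsigned long)val : (unsigned long)val;
}
```

The read direction negates after converting to unsigned long, which is the same trick the kernel code uses with `-(unsigned long)val` to avoid undefined behaviour on INT_MIN.
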
static int do_proc_douintvec_conv(unsigned long *lvalp,
				  unsigned int *valp,
				  int write, void *data)
{
	if (write) {
		if (*lvalp > UINT_MAX)
			return -EINVAL;
		*valp = *lvalp;
	} else {
		unsigned int val = *valp;
		*lvalp = (unsigned long)val;
	}
	return 0;
}

static const char proc_wspace_sep[] = { ' ', '\t', '\n' };

static int __do_proc_dointvec(void *tbl_data, struct ctl_table *table,
		  int write, void *buffer,
		  size_t *lenp, loff_t *ppos,
		  int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
			      int write, void *data),
		  void *data)
{
	int *i, vleft, first = 1, err = 0;
	size_t left;
	char *p;

	if (!tbl_data || !table->maxlen || !*lenp || (*ppos && !write)) {
		*lenp = 0;
		return 0;
	}

	i = (int *) tbl_data;
	vleft = table->maxlen / sizeof(*i);
	left = *lenp;

	if (!conv)
		conv = do_proc_dointvec_conv;

	if (write) {
		if (proc_first_pos_non_zero_ignore(ppos, table))
			goto out;

		if (left > PAGE_SIZE - 1)
			left = PAGE_SIZE - 1;
		p = buffer;
	}

	for (; left && vleft--; i++, first=0) {
		unsigned long lval;
		bool neg;

		if (write) {
			left -= proc_skip_spaces(&p);

			if (!left)
				break;
			err = proc_get_long(&p, &left, &lval, &neg,
					     proc_wspace_sep,
					     sizeof(proc_wspace_sep), NULL);
			if (err)
				break;
			if (conv(&neg, &lval, i, 1, data)) {
				err = -EINVAL;
				break;
			}
		} else {
			if (conv(&neg, &lval, i, 0, data)) {
				err = -EINVAL;
				break;
			}
			if (!first)
				proc_put_char(&buffer, &left, '\t');
			proc_put_long(&buffer, &left, lval, neg);
		}
	}

	if (!write && !first && left && !err)
		proc_put_char(&buffer, &left, '\n');
	if (write && !err && left)
		left -= proc_skip_spaces(&p);
	if (write && first)
		return err ? : -EINVAL;
	*lenp -= left;
out:
	*ppos += *lenp;
	return err;
}

static int do_proc_dointvec(struct ctl_table *table, int write,
		  void *buffer, size_t *lenp, loff_t *ppos,
		  int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
			      int write, void *data),
		  void *data)
{
	return __do_proc_dointvec(table->data, table, write,
			buffer, lenp, ppos, conv, data);
}

static int do_proc_douintvec_w(unsigned int *tbl_data,
			       struct ctl_table *table,
			       void *buffer,
			       size_t *lenp, loff_t *ppos,
			       int (*conv)(unsigned long *lvalp,
					   unsigned int *valp,
					   int write, void *data),
			       void *data)
{
	unsigned long lval;
	int err = 0;
	size_t left;
	bool neg;
	char *p = buffer;

	left = *lenp;

	if (proc_first_pos_non_zero_ignore(ppos, table))
		goto bail_early;

	if (left > PAGE_SIZE - 1)
		left = PAGE_SIZE - 1;

	left -= proc_skip_spaces(&p);
	if (!left) {
		err = -EINVAL;
		goto out_free;
	}

	err = proc_get_long(&p, &left, &lval, &neg,
			     proc_wspace_sep,
			     sizeof(proc_wspace_sep), NULL);
	if (err || neg) {
		err = -EINVAL;
		goto out_free;
	}

	if (conv(&lval, tbl_data, 1, data)) {
		err = -EINVAL;
		goto out_free;
	}

	if (!err && left)
		left -= proc_skip_spaces(&p);

out_free:
	if (err)
		return -EINVAL;

	return 0;

	/* This is in keeping with old __do_proc_dointvec() */
bail_early:
	*ppos += *lenp;
	return err;
}

static int do_proc_douintvec_r(unsigned int *tbl_data, void *buffer,
			       size_t *lenp, loff_t *ppos,
			       int (*conv)(unsigned long *lvalp,
					   unsigned int *valp,
					   int write, void *data),
			       void *data)
{
	unsigned long lval;
	int err = 0;
	size_t left;

	left = *lenp;

	if (conv(&lval, tbl_data, 0, data)) {
		err = -EINVAL;
		goto out;
	}

	proc_put_long(&buffer, &left, lval, false);
	if (!left)
		goto out;

	proc_put_char(&buffer, &left, '\n');

out:
	*lenp -= left;
	*ppos += *lenp;

	return err;
}

static int __do_proc_douintvec(void *tbl_data, struct ctl_table *table,
			       int write, void *buffer,
			       size_t *lenp, loff_t *ppos,
			       int (*conv)(unsigned long *lvalp,
					   unsigned int *valp,
					   int write, void *data),
			       void *data)
{
	unsigned int *i, vleft;

	if (!tbl_data || !table->maxlen || !*lenp || (*ppos && !write)) {
		*lenp = 0;
		return 0;
	}

	i = (unsigned int *) tbl_data;
	vleft = table->maxlen / sizeof(*i);

	/*
	 * Arrays are not supported, keep this simple. *Do not* add
	 * support for them.
	 */
	if (vleft != 1) {
		*lenp = 0;
		return -EINVAL;
	}

	if (!conv)
		conv = do_proc_douintvec_conv;

	if (write)
		return do_proc_douintvec_w(i, table, buffer, lenp, ppos,
					   conv, data);
	return do_proc_douintvec_r(i, buffer, lenp, ppos, conv, data);
}

int do_proc_douintvec(struct ctl_table *table, int write,
		      void *buffer, size_t *lenp, loff_t *ppos,
		      int (*conv)(unsigned long *lvalp,
				  unsigned int *valp,
				  int write, void *data),
		      void *data)
{
	return __do_proc_douintvec(table->data, table, write,
				   buffer, lenp, ppos, conv, data);
}

/**
 * proc_dobool - read/write a bool
 * @table: the sysctl table
 * @write: %TRUE if this is a write to the sysctl file
 * @buffer: the user buffer
 * @lenp: the size of the user buffer
 * @ppos: file position
 *
 * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
 * values from/to the user buffer, treated as an ASCII string.
 *
 * Returns 0 on success.
 */
int proc_dobool(struct ctl_table *table, int write, void *buffer,
		size_t *lenp, loff_t *ppos)
{
	return do_proc_dointvec(table, write, buffer, lenp, ppos,
				do_proc_dobool_conv, NULL);
}

/**
 * proc_dointvec - read a vector of integers
 * @table: the sysctl table
 * @write: %TRUE if this is a write to the sysctl file
 * @buffer: the user buffer
 * @lenp: the size of the user buffer
 * @ppos: file position
 *
 * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
 * values from/to the user buffer, treated as an ASCII string.
 *
 * Returns 0 on success.
 */
int proc_dointvec(struct ctl_table *table, int write, void *buffer,
		  size_t *lenp, loff_t *ppos)
{
	return do_proc_dointvec(table, write, buffer, lenp, ppos, NULL, NULL);
}

#ifdef CONFIG_COMPACTION
static int proc_dointvec_minmax_warn_RT_change(struct ctl_table *table,
		int write, void *buffer, size_t *lenp, loff_t *ppos)
{
	int ret, old;

	if (!IS_ENABLED(CONFIG_PREEMPT_RT) || !write)
		return proc_dointvec_minmax(table, write, buffer, lenp, ppos);

	old = *(int *)table->data;
	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
	if (ret)
		return ret;
	if (old != *(int *)table->data)
		pr_warn_once("sysctl attribute %s changed by %s[%d]\n",
			     table->procname, current->comm,
			     task_pid_nr(current));
	return ret;
}
#endif

/**
 * proc_douintvec - read a vector of unsigned integers
 * @table: the sysctl table
 * @write: %TRUE if this is a write to the sysctl file
 * @buffer: the user buffer
 * @lenp: the size of the user buffer
 * @ppos: file position
 *
 * Reads/writes up to table->maxlen/sizeof(unsigned int) unsigned integer
 * values from/to the user buffer, treated as an ASCII string.
 *
 * Returns 0 on success.
 */
int proc_douintvec(struct ctl_table *table, int write, void *buffer,
		   size_t *lenp, loff_t *ppos)
{
	return do_proc_douintvec(table, write, buffer, lenp, ppos,
				 do_proc_douintvec_conv, NULL);
}

/*
 * Taint values can only be increased
 * This means we can safely use a temporary.
 */
static int proc_taint(struct ctl_table *table, int write,
			       void *buffer, size_t *lenp, loff_t *ppos)
{
	struct ctl_table t;
	unsigned long tmptaint = get_taint();
	int err;

	if (write && !capable(CAP_SYS_ADMIN))
		return -EPERM;

	t = *table;
	t.data = &tmptaint;
	err = proc_doulongvec_minmax(&t, write, buffer, lenp, ppos);
	if (err < 0)
		return err;

	if (write) {
		int i;

		/*
		 * If we are relying on panic_on_taint not producing
		 * false positives due to userspace input, bail out
		 * before setting the requested taint flags.
		 */
		if (panic_on_taint_nousertaint && (tmptaint & panic_on_taint))
			return -EINVAL;

		/*
		 * Poor man's atomic or. Not worth adding a primitive
		 * to everyone's atomic.h for this
		 */
		for (i = 0; i < TAINT_FLAGS_COUNT; i++)
			if ((1UL << i) & tmptaint)
				add_taint(i, LOCKDEP_STILL_OK);
	}

	return err;
}

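The "poor man's atomic or" loop in proc_taint() applies a requested mask one bit at a time through a per-flag setter, so bits are only ever added. A minimal userspace sketch of that pattern is shown below. All names are hypothetical stand-ins (sim_tainted for the taint mask, sim_add_taint() for add_taint(), SIM_FLAGS_COUNT for TAINT_FLAGS_COUNT), the value 18 is an arbitrary choice for the illustration, and the sketch itself makes no attempt at atomicity.

```c
#include <assert.h>

#define SIM_FLAGS_COUNT 18	/* hypothetical; stands in for TAINT_FLAGS_COUNT */

static unsigned long sim_tainted;	/* hypothetical stand-in for the taint mask */

/* Set a single flag. Bits are only ever added, never cleared. */
static void sim_add_taint(unsigned int flag)
{
	sim_tainted |= 1UL << flag;
}

/* Apply every known bit of @requested through the per-flag setter. */
static void sim_apply_mask(unsigned long requested)
{
	unsigned int i;

	for (i = 0; i < SIM_FLAGS_COUNT; i++)
		if ((1UL << i) & requested)
			sim_add_taint(i);
}
```

Because the loop stops at SIM_FLAGS_COUNT, bits above the known flag range in the requested mask are silently ignored rather than stored.
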
/**
 * struct do_proc_dointvec_minmax_conv_param - proc_dointvec_minmax() range checking structure
 * @min: pointer to minimum allowable value
 * @max: pointer to maximum allowable value
 *
 * The do_proc_dointvec_minmax_conv_param structure provides the
 * minimum and maximum values for doing range checking for those sysctl
 * parameters that use the proc_dointvec_minmax() handler.
 */
struct do_proc_dointvec_minmax_conv_param {
	int *min;
	int *max;
};

static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
					int *valp,
					int write, void *data)
{
	int tmp, ret;
	struct do_proc_dointvec_minmax_conv_param *param = data;
	/*
	 * If writing, first do so via a temporary local int so we can
	 * bounds-check it before touching *valp.
	 */
	int *ip = write ? &tmp : valp;

	ret = do_proc_dointvec_conv(negp, lvalp, ip, write, data);
	if (ret)
		return ret;

	if (write) {
		if ((param->min && *param->min > tmp) ||
		    (param->max && *param->max < tmp))
			return -EINVAL;
		*valp = tmp;
	}

	return 0;
}

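The comment in do_proc_dointvec_minmax_conv() describes validating a write through a temporary before touching the destination. The same idea can be sketched in isolation as follows; struct int_range and store_checked() are hypothetical names, NULL bounds mean "unbounded" as in the kernel's optional min/max pointers, and -1 stands in for -EINVAL.

```c
#include <assert.h>
#include <stddef.h>

struct int_range {
	const int *min;	/* NULL means no lower bound */
	const int *max;	/* NULL means no upper bound */
};

/* Store @candidate into *@dst only if it passes the range check. */
static int store_checked(int *dst, int candidate, const struct int_range *r)
{
	int tmp = candidate;	/* validate a private copy first */

	if ((r->min && *r->min > tmp) || (r->max && *r->max < tmp))
		return -1;	/* *dst is left untouched on failure */
	*dst = tmp;
	return 0;
}
```

The point of the temporary is that a rejected write leaves the live value untouched, so a concurrent reader never observes an out-of-range intermediate.
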
/**
 * proc_dointvec_minmax - read a vector of integers with min/max values
 * @table: the sysctl table
 * @write: %TRUE if this is a write to the sysctl file
 * @buffer: the user buffer
 * @lenp: the size of the user buffer
 * @ppos: file position
 *
 * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
 * values from/to the user buffer, treated as an ASCII string.
 *
 * This routine will ensure the values are within the range specified by
 * table->extra1 (min) and table->extra2 (max).
 *
 * Returns 0 on success or -EINVAL on write when the range check fails.
 */
int proc_dointvec_minmax(struct ctl_table *table, int write,
		  void *buffer, size_t *lenp, loff_t *ppos)
{
	struct do_proc_dointvec_minmax_conv_param param = {
		.min = (int *) table->extra1,
		.max = (int *) table->extra2,
	};
	return do_proc_dointvec(table, write, buffer, lenp, ppos,
				do_proc_dointvec_minmax_conv, &param);
}

/**
 * struct do_proc_douintvec_minmax_conv_param - proc_douintvec_minmax() range checking structure
 * @min: pointer to minimum allowable value
 * @max: pointer to maximum allowable value
 *
 * The do_proc_douintvec_minmax_conv_param structure provides the
 * minimum and maximum values for doing range checking for those sysctl
 * parameters that use the proc_douintvec_minmax() handler.
 */
struct do_proc_douintvec_minmax_conv_param {
	unsigned int *min;
	unsigned int *max;
};

static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
					 unsigned int *valp,
					 int write, void *data)
{
	int ret;
	unsigned int tmp;
	struct do_proc_douintvec_minmax_conv_param *param = data;
	/* write via temporary local uint for bounds-checking */
	unsigned int *up = write ? &tmp : valp;

	ret = do_proc_douintvec_conv(lvalp, up, write, data);
	if (ret)
		return ret;

	if (write) {
		if ((param->min && *param->min > tmp) ||
		    (param->max && *param->max < tmp))
			return -ERANGE;

		*valp = tmp;
	}

	return 0;
}

/**
 * proc_douintvec_minmax - read a vector of unsigned ints with min/max values
 * @table: the sysctl table
 * @write: %TRUE if this is a write to the sysctl file
 * @buffer: the user buffer
 * @lenp: the size of the user buffer
 * @ppos: file position
 *
 * Reads/writes up to table->maxlen/sizeof(unsigned int) unsigned integer
 * values from/to the user buffer, treated as an ASCII string. Negative
 * strings are not allowed.
 *
 * This routine will ensure the values are within the range specified by
 * table->extra1 (min) and table->extra2 (max). There is a final sanity
 * check for UINT_MAX to avoid having to support wrap around uses from
 * userspace.
 *
 * Returns 0 on success or -ERANGE on write when the range check fails.
 */
int proc_douintvec_minmax(struct ctl_table *table, int write,
			  void *buffer, size_t *lenp, loff_t *ppos)
{
	struct do_proc_douintvec_minmax_conv_param param = {
		.min = (unsigned int *) table->extra1,
		.max = (unsigned int *) table->extra2,
	};
	return do_proc_douintvec(table, write, buffer, lenp, ppos,
				 do_proc_douintvec_minmax_conv, &param);
}

/**
|
|
|
|
|
* proc_dou8vec_minmax - read a vector of unsigned chars with min/max values
|
|
|
|
|
* @table: the sysctl table
|
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
|
* @buffer: the user buffer
|
|
|
|
|
* @lenp: the size of the user buffer
|
|
|
|
|
* @ppos: file position
|
|
|
|
|
*
|
|
|
|
|
* Reads/writes up to table->maxlen/sizeof(u8) unsigned chars
|
|
|
|
|
* values from/to the user buffer, treated as an ASCII string. Negative
|
|
|
|
|
* strings are not allowed.
|
|
|
|
|
*
|
|
|
|
|
* This routine will ensure the values are within the range specified by
|
|
|
|
|
* table->extra1 (min) and table->extra2 (max).
|
|
|
|
|
*
|
|
|
|
|
* Returns 0 on success or an error on write when the range check fails.
|
|
|
|
|
*/
|
|
|
|
|
int proc_dou8vec_minmax(struct ctl_table *table, int write,
|
|
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
|
{
|
|
|
|
|
struct ctl_table tmp;
|
|
|
|
|
unsigned int min = 0, max = 255U, val;
|
|
|
|
|
u8 *data = table->data;
|
|
|
|
|
struct do_proc_douintvec_minmax_conv_param param = {
|
|
|
|
|
.min = &min,
|
|
|
|
|
.max = &max,
|
|
|
|
|
};
|
|
|
|
|
int res;
|
|
|
|
|
|
|
|
|
|
/* Do not support arrays yet. */
|
|
|
|
|
if (table->maxlen != sizeof(u8))
|
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
|
|
if (table->extra1) {
|
|
|
|
|
min = *(unsigned int *) table->extra1;
|
|
|
|
|
if (min > 255U)
|
|
|
|
|
return -EINVAL;
|
|
|
|
|
}
|
|
|
|
|
if (table->extra2) {
|
|
|
|
|
max = *(unsigned int *) table->extra2;
|
|
|
|
|
if (max > 255U)
|
|
|
|
|
return -EINVAL;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
tmp = *table;
|
|
|
|
|
|
|
|
|
|
tmp.maxlen = sizeof(val);
|
|
|
|
|
tmp.data = &val;
|
|
|
|
|
val = *data;
|
|
|
|
|
res = do_proc_douintvec(&tmp, write, buffer, lenp, ppos,
|
|
|
|
|
do_proc_douintvec_minmax_conv, &param);
|
|
|
|
|
if (res)
|
|
|
|
|
return res;
|
|
|
|
|
if (write)
|
|
|
|
|
*data = val;
|
|
|
|
|
return 0;
|
|
|
|
|
}
|
|
|
|
|
EXPORT_SYMBOL_GPL(proc_dou8vec_minmax);
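The handler above never parses straight into the byte: it widens the value into an unsigned int, range-checks it there, and narrows it back only on success. A minimal userspace sketch of that pattern follows. The name store_u8_checked and the use of strtoul() are illustrative assumptions, not kernel API, and the choice of -ERANGE for a failed range check is borrowed from the proc_douintvec_minmax() comment rather than confirmed for this handler.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse `text` as an unsigned decimal, check it
 * against [min, max], and only then narrow it into the byte at *dst.
 * Returns 0 on success, -EINVAL on malformed input or bounds that do
 * not fit in a u8, and -ERANGE when the value is outside [min, max].
 */
static int store_u8_checked(uint8_t *dst, const char *text,
			    unsigned int min, unsigned int max)
{
	char *end;
	unsigned long val;

	if (min > 255U || max > 255U)
		return -EINVAL;		/* bounds must themselves fit in a u8 */
	if (*text < '0' || *text > '9')
		return -EINVAL;		/* rejects "", signs and leading spaces */

	errno = 0;
	val = strtoul(text, &end, 10);
	if (errno || (*end != '\0' && *end != '\n'))
		return -EINVAL;		/* overflow or trailing junk */
	if (val < min || val > max)
		return -ERANGE;

	*dst = (uint8_t)val;		/* safe: val <= max <= 255 */
	return 0;
}
```

Because the destination is written only after every check has passed, a rejected write leaves the previous byte intact.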
|
|
|
|
|
|
2020-03-02 17:51:34 +00:00
|
|
|
#ifdef CONFIG_MAGIC_SYSRQ
|
|
|
|
|
static int sysrq_sysctl_handler(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2020-03-02 17:51:34 +00:00
|
|
|
{
|
|
|
|
|
int tmp, ret;
|
|
|
|
|
|
|
|
|
|
tmp = sysrq_mask();
|
|
|
|
|
|
|
|
|
|
ret = __do_proc_dointvec(&tmp, table, write, buffer,
|
|
|
|
|
lenp, ppos, NULL, NULL);
|
|
|
|
|
if (ret || !write)
|
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
|
|
if (write)
|
|
|
|
|
sysrq_toggle_support(tmp);
|
|
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
}
|
|
|
|
|
#endif
|
|
|
|
|
|
2020-04-24 08:43:38 +02:00
|
|
|
static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table,
|
|
|
|
|
int write, void *buffer, size_t *lenp, loff_t *ppos,
|
|
|
|
|
unsigned long convmul, unsigned long convdiv)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2010-05-05 00:26:45 +00:00
|
|
|
unsigned long *i, *min, *max;
|
|
|
|
|
int vleft, first = 1, err = 0;
|
|
|
|
|
size_t left;
|
2020-04-24 08:43:38 +02:00
|
|
|
char *p;
|
2010-05-05 00:26:45 +00:00
|
|
|
|
|
|
|
|
if (!data || !table->maxlen || !*lenp || (*ppos && !write)) {
|
2005-04-16 15:20:36 -07:00
|
|
|
*lenp = 0;
|
|
|
|
|
return 0;
|
|
|
|
|
}
|
2010-05-05 00:26:45 +00:00
|
|
|
|
2006-10-02 02:18:23 -07:00
|
|
|
i = (unsigned long *) data;
|
2005-04-16 15:20:36 -07:00
|
|
|
min = (unsigned long *) table->extra1;
|
|
|
|
|
max = (unsigned long *) table->extra2;
|
|
|
|
|
vleft = table->maxlen / sizeof(unsigned long);
|
|
|
|
|
left = *lenp;
|
2010-05-05 00:26:45 +00:00
|
|
|
|
|
|
|
|
if (write) {
|
2017-07-12 14:33:33 -07:00
|
|
|
if (proc_first_pos_non_zero_ignore(ppos, table))
|
|
|
|
|
goto out;
|
sysctl: allow for strict write position handling
When writing to a sysctl string, each write, regardless of VFS position,
begins writing the string from the start. This means the contents of
the last write to the sysctl controls the string contents instead of the
first:
open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
write(1, "/bin/true", 9) = 9
close(1) = 0
$ cat /proc/sys/kernel/modprobe
/bin/true
Expected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write. Similarly, multiple short writes would
not append to the sysctl.
The old behavior is unlike regular POSIX files enough that doing audits
of software that interact with sysctls can end up in unexpected or
dangerous situations. For example, "as long as the input starts with a
trusted path" turns out to be an insufficient filter, as what must also
happen is for the input to be entirely contained in a single write
syscall -- not a common consideration, especially for high level tools.
This provides kernel.sysctl_writes_strict as a way to make this behavior
act in a less surprising manner for strings, and disallows non-zero file
position when writing numeric sysctls (similar to what is already done
when reading from non-zero file positions). For now, the default (0) is
to warn about non-zero file position use, but retain the legacy
behavior. Setting this to -1 disables the warning, and setting this to
1 enables the file position respecting behavior.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: move misplaced hunk, per Randy]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 14:37:19 -07:00
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
if (left > PAGE_SIZE - 1)
|
|
|
|
|
left = PAGE_SIZE - 1;
|
2020-04-24 08:43:38 +02:00
|
|
|
p = buffer;
|
2010-05-05 00:26:45 +00:00
|
|
|
}
|
|
|
|
|
|
2010-10-07 12:59:29 -07:00
|
|
|
for (; left && vleft--; i++, first = 0) {
|
2010-05-05 00:26:45 +00:00
|
|
|
unsigned long val;
|
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
if (write) {
|
2010-05-05 00:26:45 +00:00
|
|
|
bool neg;
|
|
|
|
|
|
2015-12-24 00:13:10 -05:00
|
|
|
left -= proc_skip_spaces(&p);
|
proc/sysctl: fix return error for proc_doulongvec_minmax()
If the number of input parameters is less than the total parameters, an
EINVAL error will be returned.
For example, we use proc_doulongvec_minmax to pass up to two parameters
with kern_table:
{
.procname = "monitor_signals",
.data = &monitor_sigs,
.maxlen = 2*sizeof(unsigned long),
.mode = 0644,
.proc_handler = proc_doulongvec_minmax,
},
Reproduce:
When passing two parameters, it works normally. But when passing only one
parameter, an "Invalid argument" (EINVAL) error is returned.
[root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
1 2
[root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
-bash: echo: write error: Invalid argument
[root@cl150 ~]# echo $?
1
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
3 2
[root@cl150 ~]#
The following is the result after applying this patch. No error is
returned when the number of input parameters is less than the total
parameters.
[root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
1 2
[root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
[root@cl150 ~]# echo $?
0
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
3 2
[root@cl150 ~]#
There are three processing functions dealing with digital parameters,
__do_proc_dointvec/__do_proc_douintvec/__do_proc_doulongvec_minmax.
This patch deals with __do_proc_doulongvec_minmax, just as
__do_proc_dointvec does, adding a check for parameters 'left'. In
__do_proc_douintvec, its code implementation explicitly does not support
multiple inputs.
static int __do_proc_douintvec(...){
...
/*
* Arrays are not supported, keep this simple. *Do not* add
* support for them.
*/
if (vleft != 1) {
*lenp = 0;
return -EINVAL;
}
...
}
So, just __do_proc_doulongvec_minmax has the problem. And most use of
proc_doulongvec_minmax/proc_doulongvec_ms_jiffies_minmax just have one
parameter.
Link: http://lkml.kernel.org/r/1544081775-15720-1-git-send-email-cheng.lin130@zte.com.cn
Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-03 15:26:13 -08:00
|
|
|
if (!left)
|
|
|
|
|
break;
|
2010-05-05 00:26:45 +00:00
|
|
|
|
2015-12-24 00:13:10 -05:00
|
|
|
err = proc_get_long(&p, &left, &val, &neg,
|
2010-05-05 00:26:45 +00:00
|
|
|
proc_wspace_sep,
|
|
|
|
|
sizeof(proc_wspace_sep), NULL);
|
2022-01-21 22:13:48 -08:00
|
|
|
if (err || neg) {
|
|
|
|
|
err = -EINVAL;
|
2005-04-16 15:20:36 -07:00
|
|
|
break;
|
2022-01-21 22:13:48 -08:00
|
|
|
}
|
|
|
|
|
|
2017-01-25 18:20:55 -08:00
|
|
|
val = convmul * val / convdiv;
|
2019-05-14 15:44:55 -07:00
|
|
|
if ((min && val < *min) || (max && val > *max)) {
|
|
|
|
|
err = -EINVAL;
|
|
|
|
|
break;
|
|
|
|
|
}
|
2005-04-16 15:20:36 -07:00
|
|
|
*i = val;
|
|
|
|
|
} else {
|
2010-05-05 00:26:45 +00:00
|
|
|
val = convdiv * (*i) / convmul;
|
2020-04-24 08:43:38 +02:00
|
|
|
if (!first)
|
|
|
|
|
proc_put_char(&buffer, &left, '\t');
|
|
|
|
|
proc_put_long(&buffer, &left, val, false);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
if (!write && !first && left && !err)
|
2020-04-24 08:43:38 +02:00
|
|
|
proc_put_char(&buffer, &left, '\n');
|
2010-05-05 00:26:45 +00:00
|
|
|
if (write && !err)
|
2015-12-24 00:13:10 -05:00
|
|
|
left -= proc_skip_spaces(&p);
|
2020-04-24 08:43:38 +02:00
|
|
|
if (write && first)
|
|
|
|
|
return err ? : -EINVAL;
|
2005-04-16 15:20:36 -07:00
|
|
|
*lenp -= left;
|
2014-06-06 14:37:19 -07:00
|
|
|
out:
|
2005-04-16 15:20:36 -07:00
|
|
|
*ppos += *lenp;
|
2010-05-05 00:26:45 +00:00
|
|
|
return err;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
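The write path above accepts fewer values than the vector has slots, rejects negative numbers, and fails when nothing at all was parsed. A rough userspace sketch of those rules is given here. parse_ulong_vec is a hypothetical name, strtoul() stands in for proc_get_long(), and the convmul/convdiv scaling is left out for brevity.

```c
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: fill up to `slots` entries of `vec` from
 * whitespace-separated decimals in `text`. Supplying fewer values than
 * slots is fine; the remaining entries are left untouched. A NULL `min`
 * or `max` disables that bound.
 * Returns the number of entries written, or -EINVAL on bad input.
 */
static int parse_ulong_vec(const char *text, unsigned long *vec, size_t slots,
			   const unsigned long *min, const unsigned long *max)
{
	size_t n = 0;

	while (n < slots) {
		char *end;
		unsigned long val;

		while (isspace((unsigned char)*text))
			text++;
		if (*text == '\0')
			break;			/* input exhausted: partial update */
		if (!isdigit((unsigned char)*text))
			return -EINVAL;		/* rejects '-' and other junk */

		errno = 0;
		val = strtoul(text, &end, 10);
		if (errno)
			return -EINVAL;		/* value does not fit */
		if (*end != '\0' && !isspace((unsigned char)*end))
			return -EINVAL;		/* trailing junk such as "12x" */
		if ((min && val < *min) || (max && val > *max))
			return -EINVAL;

		vec[n++] = val;
		text = end;
	}
	return n ? (int)n : -EINVAL;		/* nothing parsed is an error */
}
```

As in the kernel loop, entries stored before a failing value stay stored; only the failing value and anything after it are dropped.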
|
|
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
static int do_proc_doulongvec_minmax(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos, unsigned long convmul,
|
|
|
|
|
unsigned long convdiv)
|
2006-10-02 02:18:23 -07:00
|
|
|
{
|
|
|
|
|
return __do_proc_doulongvec_minmax(table->data, table, write,
|
2009-09-23 15:57:19 -07:00
|
|
|
buffer, lenp, ppos, convmul, convdiv);
|
2006-10-02 02:18:23 -07:00
|
|
|
}
|
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/**
|
|
|
|
|
* proc_doulongvec_minmax - read a vector of unsigned long integers with min/max values
|
|
|
|
|
* @table: the sysctl table
|
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
|
* @buffer: the user buffer
|
|
|
|
|
* @lenp: the size of the user buffer
|
|
|
|
|
* @ppos: file position
|
|
|
|
|
*
|
|
|
|
|
* Reads/writes up to table->maxlen/sizeof(unsigned long) unsigned long
|
|
|
|
|
* values from/to the user buffer, treated as an ASCII string.
|
|
|
|
|
*
|
|
|
|
|
* This routine will ensure the values are within the range specified by
|
|
|
|
|
* table->extra1 (min) and table->extra2 (max).
|
|
|
|
|
*
|
|
|
|
|
* Returns 0 on success.
|
|
|
|
|
*/
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_doulongvec_minmax(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2009-09-23 15:57:19 -07:00
|
|
|
return do_proc_doulongvec_minmax(table, write, buffer, lenp, ppos, 1l, 1l);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* proc_doulongvec_ms_jiffies_minmax - read a vector of millisecond values with min/max values
|
|
|
|
|
* @table: the sysctl table
|
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
|
* @buffer: the user buffer
|
|
|
|
|
* @lenp: the size of the user buffer
|
|
|
|
|
* @ppos: file position
|
|
|
|
|
*
|
|
|
|
|
* Reads/writes up to table->maxlen/sizeof(unsigned long) unsigned long
|
|
|
|
|
* values from/to the user buffer, treated as an ASCII string. The values
|
|
|
|
|
* are treated as milliseconds, and converted to jiffies when they are stored.
|
|
|
|
|
*
|
|
|
|
|
* This routine will ensure the values are within the range specified by
|
|
|
|
|
* table->extra1 (min) and table->extra2 (max).
|
|
|
|
|
*
|
|
|
|
|
* Returns 0 on success.
|
|
|
|
|
*/
|
2007-10-18 03:05:22 -07:00
|
|
|
int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2009-09-23 15:57:19 -07:00
|
|
|
return do_proc_doulongvec_minmax(table, write, buffer,
|
2005-04-16 15:20:36 -07:00
|
|
|
lenp, ppos, HZ, 1000l);
|
|
|
|
|
}
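The millisecond variant differs from proc_doulongvec_minmax() only in the convmul/convdiv pair (HZ and 1000). The arithmetic can be sketched in userspace as below. The function names are hypothetical, and hz is passed as a parameter because the real HZ is a kernel build-time constant.

```c
/*
 * Write path: user-supplied milliseconds -> stored jiffies,
 * mirroring "val = convmul * val / convdiv" with convmul = HZ, convdiv = 1000.
 */
static unsigned long ms_to_stored(unsigned long ms, unsigned long hz)
{
	return hz * ms / 1000UL;
}

/*
 * Read path: stored jiffies -> reported milliseconds,
 * mirroring "val = convdiv * (*i) / convmul".
 */
static unsigned long stored_to_ms(unsigned long stored, unsigned long hz)
{
	return 1000UL * stored / hz;
}
```

With hz = 250, writing 1000 stores 250 and reads back as 1000, but writing 10 stores 2 (2500 / 1000 truncates) and reads back as 8. Integer division means a value does not always round-trip.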
|
|
|
|
|
|
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
static int do_proc_dointvec_jiffies_conv(bool *negp, unsigned long *lvalp,
|
2005-04-16 15:20:36 -07:00
|
|
|
int *valp,
|
|
|
|
|
int write, void *data)
|
|
|
|
|
{
|
|
|
|
|
if (write) {
|
2017-05-08 15:54:58 -07:00
|
|
|
if (*lvalp > INT_MAX / HZ)
|
2006-03-24 03:15:50 -08:00
|
|
|
return 1;
|
2005-04-16 15:20:36 -07:00
|
|
|
*valp = *negp ? -(*lvalp*HZ) : (*lvalp*HZ);
|
|
|
|
|
} else {
|
|
|
|
|
int val = *valp;
|
|
|
|
|
unsigned long lval;
|
|
|
|
|
if (val < 0) {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = true;
|
2015-09-09 15:39:06 -07:00
|
|
|
lval = -(unsigned long)val;
|
2005-04-16 15:20:36 -07:00
|
|
|
} else {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = false;
|
2005-04-16 15:20:36 -07:00
|
|
|
lval = (unsigned long)val;
|
|
|
|
|
}
|
|
|
|
|
*lvalp = lval / HZ;
|
|
|
|
|
}
|
|
|
|
|
return 0;
|
|
|
|
|
}
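The write branch above guards the multiplication by HZ so that the result still fits in an int. A small userspace sketch of that guard is shown here. seconds_to_jiffies_checked is a hypothetical name and hz is a parameter standing in for the kernel's HZ constant.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper: returns 0 and stores the signed jiffies count in
 * *out, or returns 1 (as the kernel conversion callback does) when
 * secs * hz would not fit in an int.
 */
static int seconds_to_jiffies_checked(unsigned long secs, bool neg,
				      unsigned long hz, int *out)
{
	if (secs > (unsigned long)INT_MAX / hz)
		return 1;		/* product would overflow an int */
	*out = neg ? -(int)(secs * hz) : (int)(secs * hz);
	return 0;
}
```

Checking against INT_MAX / hz before multiplying avoids ever computing an out-of-range product.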
|
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
static int do_proc_dointvec_userhz_jiffies_conv(bool *negp, unsigned long *lvalp,
|
2005-04-16 15:20:36 -07:00
|
|
|
int *valp,
|
|
|
|
|
int write, void *data)
|
|
|
|
|
{
|
|
|
|
|
if (write) {
|
2006-03-24 03:15:50 -08:00
|
|
|
if (USER_HZ < HZ && *lvalp > (LONG_MAX / HZ) * USER_HZ)
|
|
|
|
|
return 1;
|
2005-04-16 15:20:36 -07:00
|
|
|
*valp = clock_t_to_jiffies(*negp ? -*lvalp : *lvalp);
|
|
|
|
|
} else {
|
|
|
|
|
int val = *valp;
|
|
|
|
|
unsigned long lval;
|
|
|
|
|
if (val < 0) {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = true;
|
2015-09-09 15:39:06 -07:00
|
|
|
lval = -(unsigned long)val;
|
2005-04-16 15:20:36 -07:00
|
|
|
} else {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = false;
|
2005-04-16 15:20:36 -07:00
|
|
|
lval = (unsigned long)val;
|
|
|
|
|
}
|
|
|
|
|
*lvalp = jiffies_to_clock_t(lval);
|
|
|
|
|
}
|
|
|
|
|
return 0;
|
|
|
|
|
}
|
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
static int do_proc_dointvec_ms_jiffies_conv(bool *negp, unsigned long *lvalp,
|
2005-04-16 15:20:36 -07:00
|
|
|
int *valp,
|
|
|
|
|
int write, void *data)
|
|
|
|
|
{
|
|
|
|
|
if (write) {
|
2013-07-24 10:39:07 +02:00
|
|
|
unsigned long jif = msecs_to_jiffies(*negp ? -*lvalp : *lvalp);
|
|
|
|
|
|
|
|
|
|
if (jif > INT_MAX)
|
|
|
|
|
return 1;
|
|
|
|
|
*valp = (int)jif;
|
2005-04-16 15:20:36 -07:00
|
|
|
} else {
|
|
|
|
|
int val = *valp;
|
|
|
|
|
unsigned long lval;
|
|
|
|
|
if (val < 0) {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = true;
|
2015-09-09 15:39:06 -07:00
|
|
|
lval = -(unsigned long)val;
|
2005-04-16 15:20:36 -07:00
|
|
|
} else {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = false;
|
2005-04-16 15:20:36 -07:00
|
|
|
lval = (unsigned long)val;
|
|
|
|
|
}
|
|
|
|
|
*lvalp = jiffies_to_msecs(lval);
|
|
|
|
|
}
|
|
|
|
|
return 0;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* proc_dointvec_jiffies - read a vector of integers as seconds
|
|
|
|
|
* @table: the sysctl table
|
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
|
* @buffer: the user buffer
|
|
|
|
|
* @lenp: the size of the user buffer
|
|
|
|
|
* @ppos: file position
|
|
|
|
|
*
|
|
|
|
|
* Reads/writes up to table->maxlen/sizeof(unsigned int) integer
|
|
|
|
|
* values from/to the user buffer, treated as an ASCII string.
|
|
|
|
|
* The values read are assumed to be in seconds, and are converted into
|
|
|
|
|
* jiffies.
|
|
|
|
|
*
|
|
|
|
|
* Returns 0 on success.
|
|
|
|
|
*/
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec_jiffies(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2009-09-23 15:57:19 -07:00
|
|
|
return do_proc_dointvec(table, write, buffer, lenp, ppos,
|
2005-04-16 15:20:36 -07:00
|
|
|
do_proc_dointvec_jiffies_conv, NULL);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* proc_dointvec_userhz_jiffies - read a vector of integers as 1/USER_HZ seconds
|
|
|
|
|
* @table: the sysctl table
|
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
|
* @buffer: the user buffer
|
|
|
|
|
* @lenp: the size of the user buffer
|
2005-11-07 01:01:06 -08:00
|
|
|
* @ppos: pointer to the file position
|
2005-04-16 15:20:36 -07:00
|
|
|
*
|
|
|
|
|
* Reads/writes up to table->maxlen/sizeof(unsigned int) integer
|
|
|
|
|
* values from/to the user buffer, treated as an ASCII string.
|
|
|
|
|
* The values read are assumed to be in 1/USER_HZ seconds, and
|
|
|
|
|
* are converted into jiffies.
|
|
|
|
|
*
|
|
|
|
|
* Returns 0 on success.
|
|
|
|
|
*/
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec_userhz_jiffies(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2009-09-23 15:57:19 -07:00
|
|
|
return do_proc_dointvec(table, write, buffer, lenp, ppos,
|
2005-04-16 15:20:36 -07:00
|
|
|
do_proc_dointvec_userhz_jiffies_conv, NULL);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* proc_dointvec_ms_jiffies - read a vector of integers as milliseconds
|
|
|
|
|
* @table: the sysctl table
|
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
|
* @buffer: the user buffer
|
|
|
|
|
* @lenp: the size of the user buffer
|
2005-05-01 08:59:26 -07:00
|
|
|
* @ppos: file position
|
2005-04-16 15:20:36 -07:00
|
|
|
*
|
|
|
|
|
* Reads/writes up to table->maxlen/sizeof(unsigned int) integer
|
|
|
|
|
* values from/to the user buffer, treated as an ASCII string.
|
|
|
|
|
* The values read are assumed to be in 1/1000 seconds, and
|
|
|
|
|
* are converted into jiffies.
|
|
|
|
|
*
|
|
|
|
|
* Returns 0 on success.
|
|
|
|
|
*/
|
2020-04-24 08:43:38 +02:00
|
|
|
int proc_dointvec_ms_jiffies(struct ctl_table *table, int write, void *buffer,
|
|
|
|
|
size_t *lenp, loff_t *ppos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2009-09-23 15:57:19 -07:00
|
|
|
return do_proc_dointvec(table, write, buffer, lenp, ppos,
|
2005-04-16 15:20:36 -07:00
|
|
|
do_proc_dointvec_ms_jiffies_conv, NULL);
|
|
|
|
|
}
|
|
|
|
|
|
2020-04-24 08:43:38 +02:00
|
|
|
static int proc_do_cad_pid(struct ctl_table *table, int write, void *buffer,
|
|
|
|
|
size_t *lenp, loff_t *ppos)
|
2006-10-02 02:19:00 -07:00
|
|
|
{
|
|
|
|
|
struct pid *new_pid;
|
|
|
|
|
pid_t tmp;
|
|
|
|
|
int r;
|
|
|
|
|
|
2008-02-08 04:19:20 -08:00
|
|
|
tmp = pid_vnr(cad_pid);
|
2006-10-02 02:19:00 -07:00
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
r = __do_proc_dointvec(&tmp, table, write, buffer,
|
2006-10-02 02:19:00 -07:00
|
|
|
lenp, ppos, NULL, NULL);
|
|
|
|
|
if (r || !write)
|
|
|
|
|
return r;
|
|
|
|
|
|
|
|
|
|
new_pid = find_get_pid(tmp);
|
|
|
|
|
if (!new_pid)
|
|
|
|
|
return -ESRCH;
|
|
|
|
|
|
|
|
|
|
put_pid(xchg(&cad_pid, new_pid));
|
|
|
|
|
return 0;
|
|
|
|
|
}
|
|
|
|
|
|
2010-05-05 00:26:55 +00:00
|
|
|
/**
|
|
|
|
|
* proc_do_large_bitmap - read/write from/to a large bitmap
|
|
|
|
|
* @table: the sysctl table
|
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
|
* @buffer: the user buffer
|
|
|
|
|
* @lenp: the size of the user buffer
|
|
|
|
|
* @ppos: file position
|
|
|
|
|
*
|
|
|
|
|
* The bitmap is stored at table->data and the bitmap length (in bits)
|
|
|
|
|
* in table->maxlen.
|
|
|
|
|
*
|
|
|
|
|
* We use a range comma separated format (e.g. 1,3-4,10-10) so that
|
|
|
|
|
* large bitmaps may be represented in a compact manner. Writing into
|
|
|
|
|
* the file will clear the bitmap then update it with the given input.
|
|
|
|
|
*
|
|
|
|
|
* Returns 0 on success.
|
|
|
|
|
*/
|
|
|
|
|
int proc_do_large_bitmap(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2010-05-05 00:26:55 +00:00
|
|
|
{
|
|
|
|
|
int err = 0;
|
|
|
|
|
size_t left = *lenp;
|
|
|
|
|
unsigned long bitmap_len = table->maxlen;
|
2014-05-12 16:04:53 -07:00
|
|
|
unsigned long *bitmap = *(unsigned long **) table->data;
|
2010-05-05 00:26:55 +00:00
|
|
|
unsigned long *tmp_bitmap = NULL;
|
|
|
|
|
char tr_a[] = { '-', ',', '\n' }, tr_b[] = { ',', '\n', 0 }, c;
|
|
|
|
|
|
2014-05-12 16:04:53 -07:00
|
|
|
if (!bitmap || !bitmap_len || !left || (*ppos && !write)) {
|
2010-05-05 00:26:55 +00:00
|
|
|
*lenp = 0;
|
|
|
|
|
return 0;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
if (write) {
|
2020-04-24 08:43:38 +02:00
|
|
|
char *p = buffer;
|
2019-05-14 15:45:13 -07:00
|
|
|
size_t skipped = 0;
|
2010-05-05 00:26:55 +00:00
|
|
|
|
2019-05-14 15:45:13 -07:00
|
|
|
if (left > PAGE_SIZE - 1) {
|
2010-05-05 00:26:55 +00:00
|
|
|
left = PAGE_SIZE - 1;
|
2019-05-14 15:45:13 -07:00
|
|
|
/* How much of the buffer we'll skip this pass */
|
|
|
|
|
skipped = *lenp - left;
|
|
|
|
|
}
|
2010-05-05 00:26:55 +00:00
|
|
|
|
2019-05-14 15:44:52 -07:00
|
|
|
tmp_bitmap = bitmap_zalloc(bitmap_len, GFP_KERNEL);
|
2020-04-24 08:43:38 +02:00
|
|
|
if (!tmp_bitmap)
|
2010-05-05 00:26:55 +00:00
|
|
|
return -ENOMEM;
|
2015-12-24 00:13:10 -05:00
|
|
|
proc_skip_char(&p, &left, '\n');
|
2010-05-05 00:26:55 +00:00
|
|
|
while (!err && left) {
|
|
|
|
|
unsigned long val_a, val_b;
|
|
|
|
|
bool neg;
|
2019-05-14 15:45:13 -07:00
|
|
|
size_t saved_left;
|
2010-05-05 00:26:55 +00:00
|
|
|
|
2019-05-14 15:45:13 -07:00
|
|
|
/* In case we stop parsing mid-number, we can reset */
|
|
|
|
|
saved_left = left;
|
2015-12-24 00:13:10 -05:00
|
|
|
err = proc_get_long(&p, &left, &val_a, &neg, tr_a,
|
2010-05-05 00:26:55 +00:00
|
|
|
sizeof(tr_a), &c);
|
2019-05-14 15:45:13 -07:00
|
|
|
/*
|
|
|
|
|
* If we consumed the entirety of a truncated buffer or
|
|
|
|
|
* only one char is left (may be a "-"), then stop here,
|
|
|
|
|
* reset, & come back for more.
|
|
|
|
|
*/
|
|
|
|
|
if ((left <= 1) && skipped) {
|
|
|
|
|
left = saved_left;
|
|
|
|
|
break;
|
|
|
|
|
}
|
|
|
|
|
|
2010-05-05 00:26:55 +00:00
|
|
|
if (err)
|
|
|
|
|
break;
|
|
|
|
|
if (val_a >= bitmap_len || neg) {
|
|
|
|
|
err = -EINVAL;
|
|
|
|
|
break;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
val_b = val_a;
|
|
|
|
|
if (left) {
|
2015-12-24 00:13:10 -05:00
|
|
|
p++;
|
2010-05-05 00:26:55 +00:00
|
|
|
left--;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
if (c == '-') {
|
2015-12-24 00:13:10 -05:00
|
|
|
err = proc_get_long(&p, &left, &val_b,
|
2010-05-05 00:26:55 +00:00
|
|
|
&neg, tr_b, sizeof(tr_b),
|
|
|
|
|
&c);
|
2019-05-14 15:45:13 -07:00
|
|
|
/*
|
|
|
|
|
* If we consumed all of a truncated buffer,
|
|
|
|
|
* then stop here, reset, & come back for more.
|
|
|
|
|
*/
|
|
|
|
|
if (!left && skipped) {
|
|
|
|
|
left = saved_left;
|
|
|
|
|
break;
|
|
|
|
|
}
|
|
|
|
|
|
2010-05-05 00:26:55 +00:00
|
|
|
if (err)
|
|
|
|
|
break;
|
|
|
|
|
if (val_b >= bitmap_len || neg ||
|
|
|
|
|
val_a > val_b) {
|
|
|
|
|
err = -EINVAL;
|
|
|
|
|
break;
|
|
|
|
|
}
|
|
|
|
|
if (left) {
|
2015-12-24 00:13:10 -05:00
|
|
|
p++;
|
2010-05-05 00:26:55 +00:00
|
|
|
left--;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
2012-03-28 14:42:50 -07:00
|
|
|
bitmap_set(tmp_bitmap, val_a, val_b - val_a + 1);
|
2015-12-24 00:13:10 -05:00
|
|
|
proc_skip_char(&p, &left, '\n');
|
2010-05-05 00:26:55 +00:00
|
|
|
}
|
2019-05-14 15:45:13 -07:00
|
|
|
left += skipped;
|
2010-05-05 00:26:55 +00:00
|
|
|
} else {
|
|
|
|
|
unsigned long bit_a, bit_b = 0;
|
2021-06-30 18:54:53 -07:00
|
|
|
bool first = 1;
|
2010-05-05 00:26:55 +00:00
|
|
|
|
|
|
|
|
while (left) {
|
|
|
|
|
bit_a = find_next_bit(bitmap, bitmap_len, bit_b);
|
|
|
|
|
if (bit_a >= bitmap_len)
|
|
|
|
|
break;
|
|
|
|
|
bit_b = find_next_zero_bit(bitmap, bitmap_len,
|
|
|
|
|
bit_a + 1) - 1;
|
|
|
|
|
|
2020-04-24 08:43:38 +02:00
|
|
|
if (!first)
|
|
|
|
|
proc_put_char(&buffer, &left, ',');
|
|
|
|
|
proc_put_long(&buffer, &left, bit_a, false);
|
2010-05-05 00:26:55 +00:00
|
|
|
if (bit_a != bit_b) {
|
2020-04-24 08:43:38 +02:00
|
|
|
proc_put_char(&buffer, &left, '-');
|
|
|
|
|
proc_put_long(&buffer, &left, bit_b, false);
|
2010-05-05 00:26:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
first = 0; bit_b++;
|
|
|
|
|
}
|
2020-04-24 08:43:38 +02:00
|
|
|
proc_put_char(&buffer, &left, '\n');
|
2010-05-05 00:26:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
if (!err) {
|
|
|
|
|
if (write) {
|
|
|
|
|
if (*ppos)
|
|
|
|
|
bitmap_or(bitmap, bitmap, tmp_bitmap, bitmap_len);
|
|
|
|
|
else
|
2012-03-28 14:42:50 -07:00
|
|
|
bitmap_copy(bitmap, tmp_bitmap, bitmap_len);
|
2010-05-05 00:26:55 +00:00
|
|
|
}
|
|
|
|
|
*lenp -= left;
|
|
|
|
|
*ppos += *lenp;
|
|
|
|
|
}
|
2017-11-17 15:30:26 -08:00
|
|
|
|
2019-05-14 15:44:52 -07:00
|
|
|
bitmap_free(tmp_bitmap);
|
2017-11-17 15:30:26 -08:00
|
|
|
return err;
|
2010-05-05 00:26:55 +00:00
|
|
|
}
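The read branch above walks the bitmap run by run and prints each run as either a single bit number or an "a-b" range. A self-contained userspace sketch of that output format follows. format_bit_ranges is a hypothetical name, a single 64-bit word stands in for the kernel bitmap, and the bit loops stand in for find_next_bit() and find_next_zero_bit(). Unlike the kernel handler, it does not append a trailing newline.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format the set bits of `mask` (bits 0..nbits-1,
 * nbits <= 64) as a comma-separated list of ranges, e.g. "1,3-4,10".
 * Returns the number of characters written, excluding the NUL.
 */
static size_t format_bit_ranges(unsigned long long mask, unsigned int nbits,
				char *out, size_t outlen)
{
	size_t used = 0;
	unsigned int a = 0;
	int first = 1;

	if (outlen)
		out[0] = '\0';

	while (a < nbits) {
		const char *sep = first ? "" : ",";
		unsigned int b;
		int n;

		/* Find the next set bit. */
		while (a < nbits && !((mask >> a) & 1ULL))
			a++;
		if (a >= nbits)
			break;

		/* Extend b to the last bit of this run of ones. */
		b = a;
		while (b + 1 < nbits && ((mask >> (b + 1)) & 1ULL))
			b++;

		n = (a == b)
			? snprintf(out + used, outlen - used, "%s%u", sep, a)
			: snprintf(out + used, outlen - used, "%s%u-%u", sep, a, b);
		if (n < 0 || (size_t)n >= outlen - used) {
			if (used < outlen)
				out[used] = '\0';	/* drop the partial item */
			break;
		}
		used += (size_t)n;
		first = 0;
		a = b + 1;
	}
	return used;
}
```

For a mask with bits 1, 3, 4 and 10 set, the buffer ends up holding "1,3-4,10". The write side also accepts a degenerate range such as "10-10", but the formatter never emits one.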
|
|
|
|
|
|
2011-01-12 17:00:45 -08:00
|
|
|
#else /* CONFIG_PROC_SYSCTL */
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dostring(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
return -ENOSYS;
|
|
|
|
|
}
|
|
|
|
|
|
2021-08-03 12:59:36 +02:00
|
|
|
int proc_dobool(struct ctl_table *table, int write,
|
|
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
|
{
|
|
|
|
|
return -ENOSYS;
|
|
|
|
|
}
|
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
return -ENOSYS;
|
|
|
|
|
}
|
|
|
|
|
|
2016-08-25 15:16:51 -07:00
|
|
|
int proc_douintvec(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2016-08-25 15:16:51 -07:00
|
|
|
{
|
|
|
|
|
return -ENOSYS;
|
|
|
|
|
}
|
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec_minmax(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
return -ENOSYS;
|
|
|
|
|
}
|
|
|
|
|
|
2017-07-12 14:33:40 -07:00
|
|
|
int proc_douintvec_minmax(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2017-07-12 14:33:40 -07:00
|
|
|
{
|
|
|
|
|
return -ENOSYS;
|
|
|
|
|
}
|
|
|
|
|
|
2021-03-25 11:08:13 -07:00
|
|
|
int proc_dou8vec_minmax(struct ctl_table *table, int write,
|
|
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
|
{
|
|
|
|
|
return -ENOSYS;
|
|
|
|
|
}
|
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec_jiffies(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
return -ENOSYS;
|
|
|
|
|
}
|
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec_userhz_jiffies(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
return -ENOSYS;
|
|
|
|
|
}
|
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec_ms_jiffies(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
return -ENOSYS;
|
|
|
|
|
}
|
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_doulongvec_minmax(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
return -ENOSYS;
|
|
|
|
|
}
|
|
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2020-04-24 08:43:38 +02:00
|
|
|
return -ENOSYS;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
|
2019-04-17 16:35:49 -04:00
|
|
|
int proc_do_large_bitmap(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2019-04-17 16:35:49 -04:00
|
|
|
{
|
|
|
|
|
return -ENOSYS;
|
|
|
|
|
}
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2011-01-12 17:00:45 -08:00
|
|
|
#endif /* CONFIG_PROC_SYSCTL */
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2019-06-14 16:22:18 -07:00
|
|
|
#if defined(CONFIG_SYSCTL)
|
|
|
|
|
int proc_do_static_key(struct ctl_table *table, int write,
|
2020-04-24 08:43:38 +02:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
2019-02-25 14:28:39 -08:00
|
|
|
{
|
2019-06-14 16:22:18 -07:00
|
|
|
struct static_key *key = (struct static_key *)table->data;
|
|
|
|
|
static DEFINE_MUTEX(static_key_mutex);
|
|
|
|
|
int val, ret;
|
|
|
|
|
struct ctl_table tmp = {
|
|
|
|
|
.data = &val,
|
|
|
|
|
.maxlen = sizeof(val),
|
|
|
|
|
.mode = table->mode,
|
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
|
|
|
|
.extra2 = SYSCTL_ONE,
|
2019-06-14 16:22:18 -07:00
|
|
|
};
|
2019-02-25 14:28:39 -08:00
|
|
|
|
|
|
|
|
if (write && !capable(CAP_SYS_ADMIN))
|
|
|
|
|
return -EPERM;
|
|
|
|
|
|
2019-06-14 16:22:18 -07:00
|
|
|
mutex_lock(&static_key_mutex);
|
|
|
|
|
val = static_key_enabled(key);
|
2019-02-25 14:28:39 -08:00
|
|
|
ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
|
|
|
|
|
if (write && !ret) {
|
2019-06-14 16:22:18 -07:00
|
|
|
if (val)
|
|
|
|
|
static_key_enable(key);
|
2019-02-25 14:28:39 -08:00
|
|
|
else
|
2019-06-14 16:22:18 -07:00
|
|
|
static_key_disable(key);
|
2019-02-25 14:28:39 -08:00
|
|
|
}
|
2019-06-14 16:22:18 -07:00
|
|
|
mutex_unlock(&static_key_mutex);
|
2019-02-25 14:28:39 -08:00
|
|
|
return ret;
|
|
|
|
|
}
|
2020-04-24 08:43:37 +02:00
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
static struct ctl_table kern_table[] = {
|
2021-03-24 13:39:16 +00:00
|
|
|
#ifdef CONFIG_NUMA_BALANCING
|
2014-01-23 15:53:13 -08:00
|
|
|
{
|
|
|
|
|
.procname = "numa_balancing",
|
|
|
|
|
.data = NULL, /* filled in by handler */
|
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = sysctl_numa_balancing,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
NUMA balancing: optimize page placement for memory tiering system
With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory). The
memory subsystem of these machines can be called a memory tiering
system, because the performance of the different types of memory is
usually different.
In such a system, because memory access patterns change over time,
some pages in the slow memory may become globally hot. So in this
patch, the NUMA balancing mechanism is enhanced to dynamically optimize
page placement among the different memory types according to whether
pages are hot or cold.
In a typical memory tiering system, there are CPUs, fast memory and slow
memory in each physical NUMA node. The CPUs and the fast memory will be
put in one logical node (called fast memory node), while the slow memory
will be put in another (faked) logical node (called slow memory node).
That is, the fast memory is regarded as local while the slow memory is
regarded as remote. So it's possible for the recently accessed pages in
the slow memory node to be promoted to the fast memory node via the
existing NUMA balancing mechanism.
The original NUMA balancing mechanism stops migrating pages if the
free memory of the target node falls below the high watermark. This
is a reasonable policy if there's only one memory type, but it makes
the original mechanism nearly useless for optimizing page placement
among different memory types. Details are as follows.
Commonly, the working-set size of the workload is larger than the size
of the fast memory nodes; otherwise, it would be unnecessary to use
the slow memory at all. So there are almost never enough free pages in
the fast memory nodes, and the globally hot pages in the slow memory
node cannot be promoted to the fast memory node. To solve the issue,
we have two choices:
a. Ignore the free pages watermark check when promoting hot pages
   from the slow memory node to the fast memory node. This will
   create some memory pressure in the fast memory node and thus
   trigger memory reclaim, so that the cold pages in the fast memory
   node will be demoted to the slow memory node.
b. Define a new watermark called wmark_promo, which is higher than
   wmark_high, and have kswapd reclaim pages until free pages reach
   that watermark. The scenario is as follows: when we want to promote
   hot pages from slow memory to fast memory, but the fast memory's
   free pages would drop below the high watermark with such a
   promotion, we wake up kswapd with the wmark_promo watermark in order
   to demote cold pages and free up some space. So the next time we
   want to promote hot pages, we might have a chance of doing so.
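The decision described in choice "b" can be modelled with a small user-space sketch. It is not the kernel's implementation: struct demo_node and both helpers are hypothetical, and all quantities are plain page counts.

```c
#include <stdbool.h>

/* Hypothetical per-node state; every field is a number of pages. */
struct demo_node {
	unsigned long free_pages;
	unsigned long wmark_high;
	unsigned long wmark_promo;  /* assumed to be higher than wmark_high */
};

/* True if promoting nr pages keeps free memory at or above the high watermark. */
static bool demo_can_promote(const struct demo_node *n, unsigned long nr)
{
	return n->free_pages >= nr && n->free_pages - nr >= n->wmark_high;
}

/*
 * Pages a background reclaimer would have to free (by demoting cold
 * pages) before free memory reaches the promotion watermark.
 */
static unsigned long demo_reclaim_target(const struct demo_node *n)
{
	if (n->free_pages >= n->wmark_promo)
		return 0;
	return n->wmark_promo - n->free_pages;
}
```

When demo_can_promote() fails, the model's caller would skip the promotion and wake the reclaimer, which frees demo_reclaim_target() pages so that a later promotion attempt can succeed.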
Choice "a" may create high memory pressure in the fast memory node.
If the memory pressure of the workload is already high, it may become
so high that the memory allocation latency of the workload is
affected, e.g. direct reclaim may be triggered.
Choice "b" works much better in this respect. If the memory
pressure of the workload is high, hot page promotion will stop
earlier because its allocation watermark is higher than that of
normal memory allocation. So in this patch, choice "b" is implemented.
A new zone watermark (WMARK_PROMO) is added, which is larger than the
high watermark and can be controlled via watermark_scale_factor.
In addition to the original page placement optimization among sockets,
the NUMA balancing mechanism is extended to optimize page placement
among different memory types according to whether pages are hot or
cold. So the sysctl user space interface (numa_balancing) is extended
in a backward-compatible way as follows, so that users can
enable/disable these functionalities individually.
The sysctl is converted from a Boolean value to a bit field. The
flags are defined as:
- 0: NUMA_BALANCING_DISABLED
- 1: NUMA_BALANCING_NORMAL
- 2: NUMA_BALANCING_MEMORY_TIERING
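A minimal sketch of how such a bit field can be tested, using the flag values listed above. The helper demo_numa_mode_has() is hypothetical and exists only for illustration.

```c
#include <stdbool.h>

/* Flag values as listed in the commit message. */
#define NUMA_BALANCING_DISABLED        0x0
#define NUMA_BALANCING_NORMAL          0x1
#define NUMA_BALANCING_MEMORY_TIERING  0x2

/* Hypothetical helper: is the given feature bit set in the sysctl value? */
static bool demo_numa_mode_has(unsigned int mode, unsigned int flag)
{
	return (mode & flag) != 0;
}
```

Because the old Boolean value 1 equals NUMA_BALANCING_NORMAL, existing settings keep their meaning, while a value of 3 enables both features at once.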
We have tested the patch with the pmbench memory access benchmark
with an 80:20 read/write ratio and a Gaussian access address
distribution on a 2-socket Intel server with Optane DC Persistent
Memory Model. The test results show that the pmbench score can
improve by up to 95.9%.
Thanks to Andrew Morton for helping fix the document format error.
Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 14:46:23 -07:00
|
|
|
.extra2 = SYSCTL_FOUR,
|
2014-01-23 15:53:13 -08:00
|
|
|
},
|
2012-10-25 14:16:43 +02:00
|
|
|
#endif /* CONFIG_NUMA_BALANCING */
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "panic",
|
|
|
|
|
.data = &panic_timeout,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2007-02-10 01:45:24 -08:00
|
|
|
#ifdef CONFIG_PROC_SYSCTL
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "tainted",
|
2008-10-15 22:01:41 -07:00
|
|
|
.maxlen = sizeof(long),
|
2007-02-10 01:45:24 -08:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_taint,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
sysctl: allow for strict write position handling
When writing to a sysctl string, each write, regardless of VFS position,
begins writing the string from the start. This means the contents of
the last write to the sysctl control the string contents instead of the
first:
    open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
    write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
    write(1, "/bin/true", 9) = 9
    close(1) = 0

    $ cat /proc/sys/kernel/modprobe
    /bin/true
Expected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write. Similarly, multiple short writes would
not append to the sysctl.
The old behavior differs enough from that of regular POSIX files that
audits of software that interacts with sysctls can end up in unexpected
or dangerous situations. For example, "as long as the input starts with
a trusted path" turns out to be an insufficient filter, as the input
must also be entirely contained in a single write syscall -- not a
common consideration, especially for high-level tools.
This provides kernel.sysctl_writes_strict as a way to make this behavior
act in a less surprising manner for strings, and disallows non-zero file
position when writing numeric sysctls (similar to what is already done
when reading from non-zero file positions). For now, the default (0) is
to warn about non-zero file position use, but retain the legacy
behavior. Setting this to -1 disables the warning, and setting this to
1 enables the file-position-respecting behavior.
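The difference between the legacy and strict string handling can be modelled in user space. The sketch below is an assumption-laden illustration, not the kernel code: demo_string_write() and the DEMO_* modes are hypothetical, and the warning issued in the default mode is not modelled.

```c
#include <stddef.h>
#include <string.h>

enum demo_write_mode { DEMO_LEGACY = 0, DEMO_STRICT = 1 };

/*
 * Write len bytes of src into the NUL-terminated string buf, whose
 * capacity (including the terminator) is maxlen, at file position pos.
 * Returns the resulting string length.
 */
static size_t demo_string_write(char *buf, size_t maxlen, size_t pos,
				const char *src, size_t len,
				enum demo_write_mode mode)
{
	size_t cur, start = 0;

	if (maxlen == 0)
		return 0;
	cur = strlen(buf);
	if (mode == DEMO_STRICT) {
		/* Strict: honour the position; ignore writes past the end. */
		if (pos > cur)
			return cur;
		start = pos;
	}
	/* Legacy: start stays 0, so every write restarts the string. */
	if (start >= maxlen - 1)
		return cur;
	if (len > maxlen - 1 - start)
		len = maxlen - 1 - start;
	memcpy(buf + start, src, len);
	buf[start + len] = '\0';
	return start + len;
}
```

In this model a long write followed by a short one leaves only the short string in legacy mode, whereas in strict mode the first, capped contents survive and consecutive short writes append.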
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: move misplaced hunk, per Randy]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 14:37:19 -07:00
|
|
|
{
|
|
|
|
|
.procname = "sysctl_writes_strict",
|
|
|
|
|
.data = &sysctl_writes_strict,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2022-01-21 22:10:55 -08:00
|
|
|
.extra1 = SYSCTL_NEG_ONE,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra2 = SYSCTL_ONE,
|
2014-06-06 14:37:19 -07:00
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif
|
2007-07-15 23:40:10 -07:00
|
|
|
{
|
|
|
|
|
.procname = "print-fatal-signals",
|
|
|
|
|
.data = &print_fatal_signals,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2007-07-15 23:40:10 -07:00
|
|
|
},
|
2008-09-11 23:29:54 -07:00
|
|
|
#ifdef CONFIG_SPARC
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "reboot-cmd",
|
|
|
|
|
.data = reboot_command,
|
|
|
|
|
.maxlen = 256,
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dostring,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
.procname = "stop-a",
|
|
|
|
|
.data = &stop_a_enabled,
|
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
.procname = "scons-poweroff",
|
|
|
|
|
.data = &scons_pwroff,
|
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
|
#endif
|
2008-11-16 23:49:24 -08:00
|
|
|
#ifdef CONFIG_SPARC64
|
|
|
|
|
{
|
|
|
|
|
.procname = "tsb-ratio",
|
|
|
|
|
.data = &sysctl_tsb_ratio,
|
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-11-16 23:49:24 -08:00
|
|
|
},
|
|
|
|
|
#endif
|
2019-10-04 13:10:09 +02:00
|
|
|
#ifdef CONFIG_PARISC
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "soft-power",
|
|
|
|
|
.data = &pwrsw_enabled,
|
|
|
|
|
.maxlen = sizeof (int),
|
2020-04-24 08:43:37 +02:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2013-01-18 15:12:24 +05:30
|
|
|
#endif
|
|
|
|
|
#ifdef CONFIG_SYSCTL_ARCH_UNALIGN_ALLOW
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "unaligned-trap",
|
|
|
|
|
.data = &unaligned_enabled,
|
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2008-05-12 21:20:43 +02:00
|
|
|
#endif
|
2008-12-16 23:06:40 -05:00
|
|
|
#ifdef CONFIG_STACK_TRACER
|
|
|
|
|
{
|
|
|
|
|
.procname = "stack_tracer_enabled",
|
|
|
|
|
.data = &stack_tracer_enabled,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = stack_trace_sysctl,
|
2008-12-16 23:06:40 -05:00
|
|
|
},
|
|
|
|
|
#endif
|
2008-10-23 19:26:08 -04:00
|
|
|
#ifdef CONFIG_TRACING
|
|
|
|
|
{
|
2008-11-04 11:58:21 +01:00
|
|
|
.procname = "ftrace_dump_on_oops",
|
2008-10-23 19:26:08 -04:00
|
|
|
.data = &ftrace_dump_on_oops,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-10-23 19:26:08 -04:00
|
|
|
},
|
2013-06-14 16:21:43 -04:00
|
|
|
{
|
|
|
|
|
.procname = "traceoff_on_warning",
|
|
|
|
|
.data = &__disable_trace_on_warning,
|
|
|
|
|
.maxlen = sizeof(__disable_trace_on_warning),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
|
},
|
2014-12-12 22:27:10 -05:00
|
|
|
{
|
|
|
|
|
.procname = "tracepoint_printk",
|
|
|
|
|
.data = &tracepoint_printk,
|
|
|
|
|
.maxlen = sizeof(tracepoint_printk),
|
|
|
|
|
.mode = 0644,
|
2016-11-23 15:52:45 -05:00
|
|
|
.proc_handler = tracepoint_printk_sysctl,
|
2014-12-12 22:27:10 -05:00
|
|
|
},
|
2008-10-23 19:26:08 -04:00
|
|
|
#endif
|
2008-07-08 19:00:17 +02:00
|
|
|
#ifdef CONFIG_MODULES
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "modprobe",
|
|
|
|
|
.data = &modprobe_path,
|
|
|
|
|
.maxlen = KMOD_PATH_LEN,
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dostring,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2009-04-02 15:49:29 -07:00
|
|
|
{
|
|
|
|
|
.procname = "modules_disabled",
|
|
|
|
|
.data = &modules_disabled,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
/* only handle a transition from default "0" to "1" */
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ONE,
|
|
|
|
|
.extra2 = SYSCTL_ONE,
|
2009-04-02 15:49:29 -07:00
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif
|
2014-04-10 14:09:31 -07:00
|
|
|
#ifdef CONFIG_UEVENT_HELPER
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "hotplug",
|
2005-11-16 09:00:00 +01:00
|
|
|
.data = &uevent_helper,
|
|
|
|
|
.maxlen = UEVENT_HELPER_PATH_LEN,
|
2005-04-16 15:20:36 -07:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dostring,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2014-04-10 14:09:31 -07:00
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
#ifdef CONFIG_MAGIC_SYSRQ
|
|
|
|
|
{
|
|
|
|
|
.procname = "sysrq",
|
2020-03-02 17:51:34 +00:00
|
|
|
.data = NULL,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof (int),
|
|
|
|
|
.mode = 0644,
|
2010-03-21 22:31:26 -07:00
|
|
|
.proc_handler = sysrq_sysctl_handler,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
|
#endif
|
2006-10-19 23:28:34 -07:00
|
|
|
#ifdef CONFIG_PROC_SYSCTL
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "cad_pid",
|
2006-10-02 02:19:00 -07:00
|
|
|
.data = NULL,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof (int),
|
|
|
|
|
.mode = 0600,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_do_cad_pid,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2006-10-19 23:28:34 -07:00
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "threads-max",
|
2015-04-16 12:47:50 -07:00
|
|
|
.data = NULL,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2015-04-16 12:47:50 -07:00
|
|
|
.proc_handler = sysctl_max_threads,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2011-04-01 17:07:50 -04:00
|
|
|
{
|
|
|
|
|
.procname = "usermodehelper",
|
|
|
|
|
.mode = 0555,
|
|
|
|
|
.child = usermodehelper_table,
|
|
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "overflowuid",
|
|
|
|
|
.data = &overflowuid,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2022-01-21 22:11:19 -08:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2022-01-21 22:13:03 -08:00
|
|
|
.extra2 = SYSCTL_MAXOLDUID,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
.procname = "overflowgid",
|
|
|
|
|
.data = &overflowgid,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2022-01-21 22:11:19 -08:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2022-01-21 22:13:03 -08:00
|
|
|
.extra2 = SYSCTL_MAXOLDUID,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2006-01-06 00:19:28 -08:00
|
|
|
#ifdef CONFIG_S390
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "userprocess_debug",
|
2010-05-17 10:00:21 +02:00
|
|
|
.data = &show_unhandled_signals,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
|
#endif
|
|
|
|
|
{
|
|
|
|
|
.procname = "pid_max",
|
|
|
|
|
.data = &pid_max,
|
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2005-04-16 15:20:36 -07:00
|
|
|
.extra1 = &pid_max_min,
|
|
|
|
|
.extra2 = &pid_max_max,
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
.procname = "panic_on_oops",
|
|
|
|
|
.data = &panic_on_oops,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2019-01-03 15:28:20 -08:00
|
|
|
{
|
|
|
|
|
.procname = "panic_print",
|
|
|
|
|
.data = &panic_print,
|
|
|
|
|
.maxlen = sizeof(unsigned long),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
|
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "ngroups_max",
|
2022-01-21 22:11:09 -08:00
|
|
|
.data = (void *)&ngroups_max,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof (int),
|
|
|
|
|
.mode = 0444,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2011-10-31 17:11:20 -07:00
|
|
|
{
|
|
|
|
|
.procname = "cap_last_cap",
|
|
|
|
|
.data = (void *)&cap_last_cap,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0444,
|
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
|
},
|
2010-11-29 17:07:17 -05:00
|
|
|
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
|
|
|
|
|
{
|
|
|
|
|
.procname = "unknown_nmi_panic",
|
|
|
|
|
.data = &unknown_nmi_panic,
|
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
|
},
|
2010-02-12 17:19:19 -05:00
|
|
|
#endif
|
2020-04-11 21:06:19 +08:00
|
|
|
|
|
|
|
|
#if (defined(CONFIG_X86_32) || defined(CONFIG_PARISC)) && \
|
|
|
|
|
defined(CONFIG_DEBUG_STACKOVERFLOW)
|
|
|
|
|
{
|
|
|
|
|
.procname = "panic_on_stackoverflow",
|
|
|
|
|
.data = &sysctl_panic_on_stackoverflow,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
|
},
|
|
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
#if defined(CONFIG_X86)
|
2006-09-26 10:52:27 +02:00
|
|
|
{
|
|
|
|
|
.procname = "panic_on_unrecovered_nmi",
|
|
|
|
|
.data = &panic_on_unrecovered_nmi,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-09-26 10:52:27 +02:00
|
|
|
},
|
2009-06-24 14:32:11 -07:00
|
|
|
{
|
|
|
|
|
.procname = "panic_on_io_nmi",
|
|
|
|
|
.data = &panic_on_io_nmi,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-06-24 14:32:11 -07:00
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "bootloader_type",
|
|
|
|
|
.data = &bootloader_type,
|
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
|
.mode = 0444,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2009-05-07 16:54:11 -07:00
|
|
|
{
|
|
|
|
|
.procname = "bootloader_version",
|
|
|
|
|
.data = &bootloader_version,
|
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
|
.mode = 0444,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-05-07 16:54:11 -07:00
|
|
|
},
|
2008-01-30 13:30:05 +01:00
|
|
|
{
|
|
|
|
|
.procname = "io_delay_type",
|
|
|
|
|
.data = &io_delay_type,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-01-30 13:30:05 +01:00
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif
|
2006-02-20 18:28:07 -08:00
|
|
|
#if defined(CONFIG_MMU)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "randomize_va_space",
|
|
|
|
|
.data = &randomize_va_space,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2006-02-20 18:28:07 -08:00
|
|
|
#endif
|
2006-01-14 13:21:00 -08:00
|
|
|
#if defined(CONFIG_S390) && defined(CONFIG_SMP)
|
2005-07-27 11:44:57 -07:00
|
|
|
{
|
|
|
|
|
.procname = "spin_retry",
|
|
|
|
|
.data = &spin_retry,
|
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-07-27 11:44:57 -07:00
|
|
|
},
|
2006-02-20 18:27:58 -08:00
|
|
|
#endif
|
2007-07-28 03:33:16 -04:00
|
|
|
#if defined(CONFIG_ACPI_SLEEP) && defined(CONFIG_X86)
|
2006-02-20 18:27:58 -08:00
|
|
|
{
|
|
|
|
|
.procname = "acpi_video_flags",
|
2007-07-19 01:47:41 -07:00
|
|
|
.data = &acpi_realmode_flags,
|
2006-02-20 18:27:58 -08:00
|
|
|
.maxlen = sizeof (unsigned long),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
2006-02-20 18:27:58 -08:00
|
|
|
},
|
2006-02-28 09:42:23 -08:00
|
|
|
#endif
|
2013-01-09 20:06:28 +05:30
|
|
|
#ifdef CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN
|
2006-02-28 09:42:23 -08:00
|
|
|
{
|
|
|
|
|
.procname = "ignore-unaligned-usertrap",
|
|
|
|
|
.data = &no_unaligned_warning,
|
|
|
|
|
.maxlen = sizeof (int),
|
2020-04-24 08:43:37 +02:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-02-28 09:42:23 -08:00
|
|
|
},
|
2013-01-09 20:06:28 +05:30
|
|
|
#endif
|
|
|
|
|
#ifdef CONFIG_IA64
|
2009-01-15 10:38:56 -08:00
|
|
|
{
|
|
|
|
|
.procname = "unaligned-dump-stack",
|
|
|
|
|
.data = &unaligned_dump_stack,
|
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-01-15 10:38:56 -08:00
|
|
|
},
|
2006-06-26 13:56:52 +02:00
|
|
|
#endif
|
2006-06-27 02:54:53 -07:00
|
|
|
#ifdef CONFIG_RT_MUTEXES
|
|
|
|
|
{
|
|
|
|
|
.procname = "max_lock_depth",
|
|
|
|
|
.data = &max_lock_depth,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-06-27 02:54:53 -07:00
|
|
|
},
|
2007-05-08 00:26:04 -07:00
|
|
|
#endif
|
2008-04-29 01:01:32 -07:00
|
|
|
#ifdef CONFIG_KEYS
|
|
|
|
|
{
|
|
|
|
|
.procname = "keys",
|
|
|
|
|
.mode = 0555,
|
|
|
|
|
.child = key_sysctls,
|
|
|
|
|
},
|
|
|
|
|
#endif
|
perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has outgrown its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in all, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names (in an ABI-compatible fashion).
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks go to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 12:02:48 +02:00
|
|
|
#ifdef CONFIG_PERF_EVENTS
|
2011-06-03 17:54:40 -04:00
|
|
|
/*
|
|
|
|
|
* User-space scripts rely on the existence of this file
|
|
|
|
|
* as a feature check for perf_events being enabled.
|
|
|
|
|
*
|
|
|
|
|
* So it's an ABI, do not remove!
|
|
|
|
|
*/
|
2009-04-09 10:53:45 +02:00
|
|
|
{
|
2009-09-21 12:02:48 +02:00
|
|
|
.procname = "perf_event_paranoid",
|
|
|
|
|
.data = &sysctl_perf_event_paranoid,
|
|
|
|
|
.maxlen = sizeof(sysctl_perf_event_paranoid),
|
2009-04-09 10:53:45 +02:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-04-09 10:53:45 +02:00
|
|
|
},
|
2009-05-05 17:50:24 +02:00
|
|
|
{
|
2009-09-21 12:02:48 +02:00
|
|
|
.procname = "perf_event_mlock_kb",
|
|
|
|
|
.data = &sysctl_perf_event_mlock,
|
|
|
|
|
.maxlen = sizeof(sysctl_perf_event_mlock),
|
2009-05-05 17:50:24 +02:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-05-05 17:50:24 +02:00
|
|
|
},
|
2009-05-25 17:39:05 +02:00
|
|
|
{
|
2009-09-21 12:02:48 +02:00
|
|
|
.procname = "perf_event_max_sample_rate",
|
|
|
|
|
.data = &sysctl_perf_event_sample_rate,
|
|
|
|
|
.maxlen = sizeof(sysctl_perf_event_sample_rate),
|
2009-05-25 17:39:05 +02:00
|
|
|
.mode = 0644,
|
2011-02-16 11:22:34 +01:00
|
|
|
.proc_handler = perf_proc_update_handler,
|
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ONE,
|
2009-05-25 17:39:05 +02:00
|
|
|
},
|
2013-06-21 08:51:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "perf_cpu_time_max_percent",
|
|
|
|
|
.data = &sysctl_perf_cpu_time_max_percent,
|
|
|
|
|
.maxlen = sizeof(sysctl_perf_cpu_time_max_percent),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = perf_cpu_time_max_percent_handler,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2022-01-21 22:10:55 -08:00
|
|
|
.extra2 = SYSCTL_ONE_HUNDRED,
|
2013-06-21 08:51:36 -07:00
|
|
|
},
|
2016-04-21 12:28:50 -03:00
|
|
|
{
|
|
|
|
|
.procname = "perf_event_max_stack",
|
2016-05-10 16:34:53 -03:00
|
|
|
.data = &sysctl_perf_event_max_stack,
|
2016-04-21 12:28:50 -03:00
|
|
|
.maxlen = sizeof(sysctl_perf_event_max_stack),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = perf_event_max_stack_handler,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2022-01-21 22:11:14 -08:00
|
|
|
.extra2 = (void *)&six_hundred_forty_kb,
|
2016-04-21 12:28:50 -03:00
|
|
|
},
|
2016-05-12 13:06:21 -03:00
|
|
|
{
|
|
|
|
|
.procname = "perf_event_max_contexts_per_stack",
|
|
|
|
|
.data = &sysctl_perf_event_max_contexts_per_stack,
|
|
|
|
|
.maxlen = sizeof(sysctl_perf_event_max_contexts_per_stack),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = perf_event_max_stack_handler,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2022-01-21 22:10:55 -08:00
|
|
|
.extra2 = SYSCTL_ONE_THOUSAND,
|
2016-05-12 13:06:21 -03:00
|
|
|
},
|
2009-09-15 21:53:11 +02:00
|
|
|
#endif
|
2014-12-10 15:45:50 -08:00
|
|
|
{
|
|
|
|
|
.procname = "panic_on_warn",
|
|
|
|
|
.data = &panic_on_warn,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
|
|
|
|
.extra2 = SYSCTL_ONE,
|
2014-12-10 15:45:50 -08:00
|
|
|
},
|
2019-10-15 02:55:57 +00:00
|
|
|
#if defined(CONFIG_TREE_RCU)
|
2016-06-02 13:51:41 -03:00
|
|
|
{
|
|
|
|
|
.procname = "panic_on_rcu_stall",
|
|
|
|
|
.data = &sysctl_panic_on_rcu_stall,
|
|
|
|
|
.maxlen = sizeof(sysctl_panic_on_rcu_stall),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
|
|
|
|
.extra2 = SYSCTL_ONE,
|
2016-06-02 13:51:41 -03:00
|
|
|
},
|
2018-08-17 01:17:03 +03:00
|
|
|
#endif
|
2020-08-30 23:41:17 -07:00
|
|
|
#if defined(CONFIG_TREE_RCU)
|
|
|
|
|
{
|
|
|
|
|
.procname = "max_rcu_stall_to_panic",
|
|
|
|
|
.data = &sysctl_max_rcu_stall_to_panic,
|
|
|
|
|
.maxlen = sizeof(sysctl_max_rcu_stall_to_panic),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
|
.extra1 = SYSCTL_ONE,
|
|
|
|
|
.extra2 = SYSCTL_INT_MAX,
|
|
|
|
|
},
|
2015-05-26 22:50:33 +00:00
|
|
|
#endif
|
2009-04-03 02:30:53 -07:00
|
|
|
{ }
|
2005-04-16 15:20:36 -07:00
|
|
|
};
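Several entries in the table above pair proc_dointvec_minmax with .extra1 and .extra2 pointers into a shared array of constants (SYSCTL_ZERO, SYSCTL_ONE, SYSCTL_INT_MAX). The following is a minimal user-space sketch of that range-check idea, not the kernel's implementation: demo_vals, the DEMO_* macros, struct demo_ctl and demo_write_minmax() are hypothetical names invented for illustration, and the real handler parses text from user space and returns -EINVAL rather than -1.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for a shared array of common bounds. */
static const int demo_vals[] = { 0, 1, INT_MAX };

/* Hypothetical macros handing out pointers into the shared array. */
#define DEMO_ZERO    ((void *)&demo_vals[0])
#define DEMO_ONE     ((void *)&demo_vals[1])
#define DEMO_INT_MAX ((void *)&demo_vals[2])

/* Hypothetical, cut-down table entry; the kernel's struct has more fields. */
struct demo_ctl {
	const char *procname;
	int *data;	/* variable backing the entry */
	void *extra1;	/* optional minimum, NULL means no lower bound */
	void *extra2;	/* optional maximum, NULL means no upper bound */
};

/* Store val only if it lies within [*extra1, *extra2]; -1 on rejection. */
static int demo_write_minmax(const struct demo_ctl *t, int val)
{
	const int *min;
	const int *max;

	assert(t != NULL && t->data != NULL);
	min = t->extra1;
	max = t->extra2;

	if ((min && val < *min) || (max && val > *max))
		return -1;	/* out of range: leave *data untouched */

	*t->data = val;
	return 0;
}
```

With an entry bounded by DEMO_ZERO and DEMO_ONE, as panic_on_warn is bounded by SYSCTL_ZERO and SYSCTL_ONE above, writing 2 is rejected and the stored value is left unchanged. Sharing one constant array avoids a private zero or one variable in every source file that needs a bound.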
|
|
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
static struct ctl_table vm_table[] = {
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "overcommit_memory",
|
|
|
|
|
.data = &sysctl_overcommit_memory,
|
|
|
|
|
.maxlen = sizeof(sysctl_overcommit_memory),
|
|
|
|
|
.mode = 0644,
|
2020-08-06 23:23:15 -07:00
|
|
|
.proc_handler = overcommit_policy_handler,
|
2019-07-18 15:58:50 -07:00
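The idea in the commit message above can be sketched in user space as follows. This is a minimal illustration, not the kernel implementation: `struct demo_ctl` and `demo_write_minmax()` are hypothetical stand-ins for `struct ctl_table` and the range check that `proc_dointvec_minmax()` performs, and only the three values named in the message (0, 1 and INT_MAX) are placed in the shared array.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* One shared, read-only table of common bounds instead of a private
 * zero/one/int_max variable in every source file. */
static const int sysctl_vals[] = { 0, 1, INT_MAX };

/* Macros that name the array members, as described in the commit message. */
#define SYSCTL_ZERO    ((void *)&sysctl_vals[0])
#define SYSCTL_ONE     ((void *)&sysctl_vals[1])
#define SYSCTL_INT_MAX ((void *)&sysctl_vals[2])

/* Hypothetical, cut-down stand-in for struct ctl_table. */
struct demo_ctl {
	const char *procname;
	int *data;      /* variable the entry controls */
	void *extra1;   /* pointer to the minimum, or NULL for no lower bound */
	void *extra2;   /* pointer to the maximum, or NULL for no upper bound */
};

/* Hypothetical helper: store val only if it lies within [extra1, extra2]. */
static int demo_write_minmax(const struct demo_ctl *t, int val)
{
	assert(t->data != NULL);
	if (t->extra1 && val < *(const int *)t->extra1)
		return -EINVAL;
	if (t->extra2 && val > *(const int *)t->extra2)
		return -EINVAL;
	*t->data = val;
	return 0;
}
```

An entry declared with `.extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE` then accepts only 0 and 1, and every such entry points into the same array rather than at a per-file constant.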
|
|
|
.extra1 = SYSCTL_ZERO,
|
2022-01-21 22:10:55 -08:00
|
|
|
.extra2 = SYSCTL_TWO,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
.procname = "overcommit_ratio",
|
|
|
|
|
.data = &sysctl_overcommit_ratio,
|
|
|
|
|
.maxlen = sizeof(sysctl_overcommit_ratio),
|
|
|
|
|
.mode = 0644,
|
2014-01-21 15:49:14 -08:00
|
|
|
.proc_handler = overcommit_ratio_handler,
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
.procname = "overcommit_kbytes",
|
|
|
|
|
.data = &sysctl_overcommit_kbytes,
|
|
|
|
|
.maxlen = sizeof(sysctl_overcommit_kbytes),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = overcommit_kbytes_handler,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
|
{
|
2020-04-24 08:43:37 +02:00
|
|
|
.procname = "page-cluster",
|
2005-04-16 15:20:36 -07:00
|
|
|
.data = &page_cluster,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2011-03-23 16:43:09 -07:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2015-03-17 12:23:32 -04:00
|
|
|
{
|
|
|
|
|
.procname = "dirtytime_expire_seconds",
|
|
|
|
|
.data = &dirtytime_expire_interval,
|
2018-04-10 16:35:14 -07:00
|
|
|
.maxlen = sizeof(dirtytime_expire_interval),
|
2015-03-17 12:23:32 -04:00
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = dirtytime_interval_handler,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2015-03-17 12:23:32 -04:00
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "swappiness",
|
|
|
|
|
.data = &vm_swappiness,
|
|
|
|
|
.maxlen = sizeof(vm_swappiness),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2022-01-21 22:10:55 -08:00
|
|
|
.extra2 = SYSCTL_TWO_HUNDRED,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
|
#ifdef CONFIG_HUGETLB_PAGE
|
hugetlb: derive huge pages nodes allowed from task mempolicy
This patch derives a "nodes_allowed" node mask from the numa mempolicy of
the task modifying the number of persistent huge pages to control the
allocation, freeing and adjusting of surplus huge pages when the pool page
count is modified via the new sysctl or sysfs attribute
"nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:
* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
is produced. This will cause the hugetlb subsystem to use
node_online_map as the "nodes_allowed". This preserves the
behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
a nodemask with the single preferred node will be produced.
"local" policy will NOT track any internode migrations of the
task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
will be used.
* Other than to inform the construction of the nodes_allowed node
mask, the actual mempolicy mode is ignored. That is, all modes
behave like interleave over the resulting nodes_allowed mask
with no "fallback".
See the updated documentation [next patch] for more information
about the implications of this patch.
Examples:
Starting with:
Node 0 HugePages_Total: 0
Node 1 HugePages_Total: 0
Node 2 HugePages_Total: 0
Node 3 HugePages_Total: 0
Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:
sysctl vm.nr_hugepages[_mempolicy]=32
yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 8
Node 3 HugePages_Total: 8
Of course, we only have nr_hugepages_mempolicy with the patch,
but with default mempolicy, nr_hugepages_mempolicy behaves the
same as nr_hugepages.
Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes. So, starting from the
condition above, with 8 huge pages per node, add 8 more to
node 2 using:
numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40
This yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.
Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:
numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32
yields:
Node 0 HugePages_Total: 4
Node 1 HugePages_Total: 4
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The 8 huge pages freed were balanced over nodes 0 and 1.
[rientjes@google.com: accommodate reworked NODEMASK_ALLOC]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-14 17:58:21 -08:00
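The worked example in the commit message above (32 pages over four nodes, 8 more on node 2, then 8 freed from nodes 0 and 1) can be reproduced with a small user-space model. This is a sketch under simplifying assumptions, not the hugetlb code: `struct demo_pool`, `demo_pick()` and `demo_set_nr_hugepages()` are hypothetical names, the node count is fixed at 4, and allocation never fails. It only shows the "interleave over nodes_allowed" behaviour, with the target interpreted as the global page count.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NODES 4

/* Hypothetical model of the persistent huge page pool. */
struct demo_pool {
	long per_node[DEMO_NODES]; /* huge pages currently on each node */
	int next_alloc;            /* round-robin cursor for allocations */
	int next_free;             /* round-robin cursor for frees */
};

static long demo_total(const struct demo_pool *p)
{
	long sum = 0;
	for (int i = 0; i < DEMO_NODES; i++)
		sum += p->per_node[i];
	return sum;
}

/* Pick the next allowed node starting at *cursor. When need_pages is true,
 * skip nodes that hold no pages. Returns -1 if no node qualifies. */
static int demo_pick(const struct demo_pool *p, int *cursor,
		     const bool allowed[DEMO_NODES], bool need_pages)
{
	for (int i = 0; i < DEMO_NODES; i++) {
		int nid = (*cursor + i) % DEMO_NODES;
		if (!allowed[nid] || (need_pages && p->per_node[nid] == 0))
			continue;
		*cursor = (nid + 1) % DEMO_NODES;
		return nid;
	}
	return -1;
}

/* Grow or shrink the global pool to target, touching only allowed nodes. */
static void demo_set_nr_hugepages(struct demo_pool *p, long target,
				  const bool allowed[DEMO_NODES])
{
	while (demo_total(p) < target) {
		int nid = demo_pick(p, &p->next_alloc, allowed, false);
		if (nid < 0)
			return;
		p->per_node[nid]++;
	}
	while (demo_total(p) > target) {
		int nid = demo_pick(p, &p->next_free, allowed, true);
		if (nid < 0)
			return;
		p->per_node[nid]--;
	}
}
```

Calling `demo_set_nr_hugepages()` with all nodes allowed and a target of 32, then with only node 2 allowed and a target of 40, then with nodes 0 and 1 allowed and a target of 32, walks through the same three states as the commit message.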
|
|
|
{
|
2005-04-16 15:20:36 -07:00
|
|
|
.procname = "nr_hugepages",
|
2008-07-23 21:27:42 -07:00
|
|
|
.data = NULL,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof(unsigned long),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = hugetlb_sysctl_handler,
|
2009-12-14 17:58:21 -08:00
|
|
|
},
|
|
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
|
{
|
|
|
|
|
.procname = "nr_hugepages_mempolicy",
|
|
|
|
|
.data = NULL,
|
|
|
|
|
.maxlen = sizeof(unsigned long),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = hugetlb_mempolicy_sysctl_handler,
|
|
|
|
|
},
|
2017-11-15 17:38:22 -08:00
|
|
|
{
|
|
|
|
|
.procname = "numa_stat",
|
|
|
|
|
.data = &sysctl_vm_numa_stat,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = sysctl_vm_numa_stat_handler,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
|
|
|
|
.extra2 = SYSCTL_ONE,
|
2017-11-15 17:38:22 -08:00
|
|
|
},
|
2009-12-14 17:58:21 -08:00
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "hugetlb_shm_group",
|
|
|
|
|
.data = &sysctl_hugetlb_shm_group,
|
|
|
|
|
.maxlen = sizeof(gid_t),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
hugetlb: introduce nr_overcommit_hugepages sysctl
While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
became convinced that having a boolean sysctl was insufficient:
1) To support per-node control of hugepages, I have previously submitted
patches to add a sysfs attribute related to nr_hugepages. However, with
a boolean global value and per-mount quota enforcement constraining the
dynamic pool, adding corresponding control of the dynamic pool on a
per-node basis seems inconsistent to me.
2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
mount points is, arguably, more arduous than it needs to be. Each quota
would need to be set separately, and the sum would need to be monitored.
To ease the administration, and to help pave the way for per-node
control of the static & dynamic hugepage pool, I added a separate
sysctl, nr_overcommit_hugepages. This value serves as a high watermark
for the overall hugepage pool, while nr_hugepages serves as a low
watermark. The boolean sysctl can then be removed, as the condition
nr_overcommit_hugepages > 0
indicates the same administrative setting as
hugetlb_dynamic_pool == 1
Quotas still serve as local enforcement of the size of the pool on a
per-mount basis.
A few caveats:
1) There is a race whereby the global surplus huge page counter is
incremented before a hugepage has been allocated. Another process could then
try to grow the pool, and fail to convert a surplus huge page to a normal
huge page and instead allocate a fresh huge page. I believe this is
benign, as no memory is leaked (the actual pages are still tracked
correctly) and the counters won't go out of sync.
2) Shrinking the static pool while a surplus is in effect will allow the
number of surplus huge pages to exceed the overcommit value. As long as
this condition holds, however, no more surplus huge pages will be
allowed on the system until one of the two sysctls is increased
sufficiently, or the surplus huge pages go out of use and are freed.
Successfully tested on x86_64 with the current libhugetlbfs snapshot,
modified to use the new sysctl.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17 16:20:12 -08:00
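The watermark semantics described in the commit message above can be sketched as a small user-space model. Everything here is an illustrative assumption rather than the kernel's code: `struct demo_hstate` and its helpers are hypothetical, and the model covers only the gate on surplus allocation and the equivalence between `nr_overcommit_hugepages > 0` and the old boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the global huge page counters. */
struct demo_hstate {
	unsigned long nr_huge_pages;       /* all pages in the pool, surplus included */
	unsigned long surplus_huge_pages;  /* pages allocated beyond the static pool */
	unsigned long nr_overcommit;       /* the nr_overcommit_hugepages setting */
};

/* The dynamic pool is enabled exactly when the overcommit limit is non-zero,
 * which is the condition that replaces hugetlb_dynamic_pool == 1. */
static bool demo_dynamic_pool_enabled(const struct demo_hstate *h)
{
	return h->nr_overcommit > 0;
}

/* Try to account one surplus page. Refused once the surplus count has
 * reached (or, after a static-pool shrink, exceeded) the overcommit limit. */
static bool demo_alloc_surplus(struct demo_hstate *h)
{
	if (h->surplus_huge_pages >= h->nr_overcommit)
		return false;
	h->nr_huge_pages++;
	h->surplus_huge_pages++;
	return true;
}
```

The `>=` comparison also covers the second caveat: if the surplus count already exceeds the limit, no further surplus pages are granted until the limit is raised or surplus pages are freed.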
|
|
|
{
|
|
|
|
|
.procname = "nr_overcommit_hugepages",
|
2008-07-23 21:27:42 -07:00
|
|
|
.data = NULL,
|
|
|
|
|
.maxlen = sizeof(unsigned long),
|
2007-12-17 16:20:12 -08:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = hugetlb_overcommit_handler,
|
2007-12-17 16:20:12 -08:00
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif
|
|
|
|
|
{
|
|
|
|
|
.procname = "lowmem_reserve_ratio",
|
|
|
|
|
.data = &sysctl_lowmem_reserve_ratio,
|
|
|
|
|
.maxlen = sizeof(sysctl_lowmem_reserve_ratio),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = lowmem_reserve_ratio_sysctl_handler,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2006-01-08 01:00:39 -08:00
|
|
|
{
|
|
|
|
|
.procname = "drop_caches",
|
|
|
|
|
.data = &sysctl_drop_caches,
|
|
|
|
|
.maxlen = sizeof(int),
|
2019-11-30 17:56:08 -08:00
|
|
|
.mode = 0200,
|
2006-01-08 01:00:39 -08:00
|
|
|
.proc_handler = drop_caches_sysctl_handler,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ONE,
|
2022-01-21 22:10:55 -08:00
|
|
|
.extra2 = SYSCTL_FOUR,
|
2006-01-08 01:00:39 -08:00
|
|
|
},
|
2010-05-24 14:32:28 -07:00
|
|
|
#ifdef CONFIG_COMPACTION
|
|
|
|
|
{
|
|
|
|
|
.procname = "compact_memory",
|
2021-05-04 18:36:48 -07:00
|
|
|
.data = NULL,
|
2010-05-24 14:32:28 -07:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0200,
|
|
|
|
|
.proc_handler = sysctl_compaction_handler,
|
|
|
|
|
},
|
mm: proactive compaction
For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented. Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency. Experiments with one-time full memory compaction (followed by
hugepage allocations) show that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system. Such data suggests that a more proactive compaction can
help us allocate a large fraction of memory as hugepages keeping
allocation latencies low.
For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.
The tunable takes a value in range [0, 100], with a default of 20.
Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl. Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate. The internal interpretation of this opaque
value allows for future fine-tuning.
Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation. A zone's present_pages determines its weight.
To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to check the same. If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value. By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.
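The translation described above can be sketched in plain C as follows. This is an illustration under stated assumptions, not the patch itself: struct demo_zone and the demo_* functions are hypothetical names, "high=low+10%" is read as ten score points (which matches the 80/90 example), the cap at 100 is an assumption, and integer division truncates the weighted mean.

```c
#include <stddef.h>

/* Hypothetical per-zone input: its weight and its fragmentation score. */
struct demo_zone {
	unsigned long present_pages;	/* weight of the zone */
	unsigned int extfrag;		/* external fragmentation, 0..100 */
};

/* low = 100 - proactiveness, for proactiveness in [0, 100]. */
static unsigned int demo_low_threshold(unsigned int proactiveness)
{
	return 100 - proactiveness;
}

/* high = low + 10, capped at 100 (the cap is an assumption here). */
static unsigned int demo_high_threshold(unsigned int proactiveness)
{
	unsigned int high = demo_low_threshold(proactiveness) + 10;

	return high > 100 ? 100 : high;
}

/* Node score: mean of per-zone scores weighted by present_pages. */
static unsigned int demo_node_score(const struct demo_zone *zones, size_t n)
{
	unsigned long long weighted = 0, total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		weighted += (unsigned long long)zones[i].present_pages *
			    zones[i].extfrag;
		total += zones[i].present_pages;
	}
	return total ? (unsigned int)(weighted / total) : 0;
}

/* Proactive compaction starts only once the score exceeds high. */
static int demo_should_start(unsigned int score, unsigned int proactiveness)
{
	return score > demo_high_threshold(proactiveness);
}
```

For the default proactiveness of 20 this gives low=80 and high=90, so a node scoring 90 is left alone and a node scoring 91 would trigger compaction down toward 80.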
This patch is largely based on ideas from Michal Hocko [2]. See also the
LWN article [3].
Performance data
================
System: x86_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap. The workload is mainly anonymous
userspace pages, which are easy to move around. I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.
1. Kernel hugepage allocation latencies
With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:
(all latency values are in microseconds)
- With vanilla 5.6.0-rc3
percentile latency
–––––––––– –––––––
         5    7894
        10    9496
        25   12561
        30   15295
        40   18244
        50   21229
        60   27556
        75   30147
        80   31047
        90   32859
        95   33799
Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
- With 5.6.0-rc3 + this patch, with proactiveness=20
sysctl -w vm.compaction_proactiveness=20
percentile latency
–––––––––– –––––––
         5       2
        10       2
        25       3
        30       3
        40       3
        50       4
        60       4
        75       4
        80       4
        90       5
        95     429
Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
2. JAVA heap allocation
In this test, we first fragment memory using the same method as for (1).
Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages. We also set THP to madvise to
allow hugepage backing of this heap.
/usr/bin/time
java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
The above command allocates 700G of Java heap using hugepages.
- With vanilla 5.6.0-rc3
17.39user 1666.48system 27:37.89elapsed
- With 5.6.0-rc3 + this patch, with proactiveness=20
8.35user 194.58system 3:19.62elapsed
Elapsed time remains around 3:15, as proactiveness is further increased.
Note that proactive compaction happens throughout the runtime of these
workloads. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.
In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, the node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80). Repeat.
bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.
Backoff behavior
================
Above workloads produce a memory state which is easy to compact. However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off. To test this aspect:
- Created a kernel driver that allocates almost all memory as hugepages
followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred the maximum number of
times, with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
(=> ~30 seconds between retries).
[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/
Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nitin Gupta <ngupta@nitingupta.dev>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-11 18:31:00 -07:00
|
|
|
{
|
|
|
|
|
.procname = "compaction_proactiveness",
|
|
|
|
|
.data = &sysctl_compaction_proactiveness,
|
2020-08-11 18:31:07 -07:00
|
|
|
.maxlen = sizeof(sysctl_compaction_proactiveness),
|
2020-08-11 18:31:00 -07:00
|
|
|
.mode = 0644,
|
2021-09-02 14:59:59 -07:00
|
|
|
.proc_handler = compaction_proactiveness_sysctl_handler,
|
2020-08-11 18:31:00 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2022-01-21 22:10:55 -08:00
|
|
|
.extra2 = SYSCTL_ONE_HUNDRED,
|
2020-08-11 18:31:00 -07:00
|
|
|
},
|
2010-05-24 14:32:31 -07:00
|
|
|
{
|
|
|
|
|
.procname = "extfrag_threshold",
|
|
|
|
|
.data = &sysctl_extfrag_threshold,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2019-03-05 15:43:41 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2022-01-21 22:11:19 -08:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2022-01-21 22:11:14 -08:00
|
|
|
.extra2 = (void *)&max_extfrag_threshold,
|
2010-05-24 14:32:31 -07:00
|
|
|
},
|
2015-04-15 16:13:20 -07:00
|
|
|
{
|
|
|
|
|
.procname = "compact_unevictable_allowed",
|
|
|
|
|
.data = &sysctl_compact_unevictable_allowed,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
2020-04-01 21:10:42 -07:00
|
|
|
.proc_handler = proc_dointvec_minmax_warn_RT_change,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
|
|
|
|
.extra2 = SYSCTL_ONE,
|
2015-04-15 16:13:20 -07:00
|
|
|
},
|
2010-05-24 14:32:31 -07:00
|
|
|
|
2010-05-24 14:32:28 -07:00
|
|
|
#endif /* CONFIG_COMPACTION */
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "min_free_kbytes",
|
|
|
|
|
.data = &min_free_kbytes,
|
|
|
|
|
.maxlen = sizeof(min_free_kbytes),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = min_free_kbytes_sysctl_handler,
|
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate that a user-supplied value lies within an allowed range. This
function uses the extra1 and extra2 members of struct ctl_table as the
minimum and maximum allowed values.
On sysctl handler declaration, every source file defines some read-only
variables, each containing just an integer whose address is assigned to
the extra1 and extra2 members so that the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundaries, leading to duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
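The idea described above can be sketched in user space as follows. This is a minimal illustration, not the kernel code: the names demo_vals, DEMO_ZERO, DEMO_ONE, DEMO_INT_MAX, struct demo_table and demo_write_minmax() are hypothetical stand-ins, and the table structure is reduced to the fields the range check needs.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* One shared read-only array of common bounds (hypothetical demo name). */
const int demo_vals[] = { 0, 1, INT_MAX };

/* Macros that point at the right array member, so no file needs its own copy. */
#define DEMO_ZERO    ((void *)&demo_vals[0])
#define DEMO_ONE     ((void *)&demo_vals[1])
#define DEMO_INT_MAX ((void *)&demo_vals[2])

/* Reduced stand-in for a sysctl table entry. */
struct demo_table {
	int  *data;    /* variable the entry controls */
	void *extra1;  /* pointer to the minimum, or NULL for no lower bound */
	void *extra2;  /* pointer to the maximum, or NULL for no upper bound */
};

/* Store val only if it lies within [*extra1, *extra2]; otherwise -EINVAL. */
int demo_write_minmax(const struct demo_table *t, int val)
{
	const int *min = t->extra1;
	const int *max = t->extra2;

	if ((min && val < *min) || (max && val > *max))
		return -EINVAL;

	*t->data = val;
	return 0;
}
```

An entry that sets extra1 to DEMO_ZERO and extra2 to DEMO_ONE accepts only 0 and 1; leaving extra2 as NULL removes the upper bound.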
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                         old     new   delta
    sysctl_vals                    -      12     +12
    __kstrtab_sysctl_vals          -      12     +12
    max                           14      10      -4
    int_max                       16       -     -16
    one                           68       -     -68
    zero                         128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
mm: reclaim small amounts of memory when an external fragmentation event occurs
An external fragmentation event was previously described as

    When the page allocator fragments memory, it records the event using
    the mm_page_alloc_extfrag event. If the fallback_order is smaller
    than a pageblock order (order-9 on 64-bit x86) then it's considered
    an event that will cause external fragmentation issues in the future.

The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system. This works reasonably well in general but if
there are enough sparsely populated pageblocks then the problem can still
occur as enough memory is free overall and kswapd stays asleep.
This patch introduces a watermark_boost_factor sysctl that allows a zone
watermark to be temporarily boosted when an event that causes external
fragmentation occurs. The boosting will stall allocations that would
decrease free memory below the boosted low watermark, and kswapd is woken,
if the calling context allows, to reclaim an amount of memory relative to
the size of the high watermark and the watermark_boost_factor until the
boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order
to clean some of the pageblocks that may have been affected by the
fragmentation event. kswapd avoids any writeback, slab shrinkage and swap
from reclaim context during this operation to avoid excessive system
disruption in the name of fragmentation avoidance. Care is taken so that
kswapd will do normal reclaim work if the system is really low on memory.
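The boosting step described above can be sketched in user space as follows. This is only an illustration under stated assumptions: struct demo_zone and demo_boost_watermark() are hypothetical names, the factor is assumed to be expressed in units of 1/10000 of the high watermark, a factor of 0 is assumed to disable boosting, and each event is assumed to raise the boost by one pageblock up to the cap.

```c
/* Assumption: the boost factor is a fraction of the high watermark in 1/10000 units. */
#define DEMO_BOOST_UNITS 10000UL

/* Reduced stand-in for the per-zone state the text refers to. */
struct demo_zone {
	unsigned long high_wmark;  /* high watermark, in pages */
	unsigned long boost;       /* current temporary boost, in pages */
};

/*
 * Record one fragmentation event: raise the boost by one pageblock,
 * capped at a limit derived from the high watermark and the factor.
 * Note: the multiplication below is not overflow-safe for huge inputs.
 */
void demo_boost_watermark(struct demo_zone *z, unsigned long factor,
			  unsigned long pageblock_pages)
{
	unsigned long max_boost;

	if (!factor)
		return;  /* assumed: a factor of 0 turns boosting off */

	max_boost = z->high_wmark * factor / DEMO_BOOST_UNITS;
	if (max_boost < pageblock_pages)
		max_boost = pageblock_pages;

	z->boost += pageblock_pages;
	if (z->boost > max_boost)
		z->boost = max_boost;
}
```

With a high watermark of 1000 pages, a factor of 15000 and 512-page pageblocks, repeated events raise the boost to 512, then 1024, and then hold it at the cap of 1500 pages.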
This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------
4.20-rc3 extfrag events < order 9:  804694
4.20-rc3+patch:                     408912 (49% reduction)
4.20-rc3+patch1-4:                   18421 (98% reduction)

                                4.20.0-rc3              4.20.0-rc3
                              lowzone-v5r8              boost-v5r8
Amean  fault-base-1      653.58 (   0.00%)       652.71 (   0.13%)
Amean  fault-huge-1        0.00 (   0.00%)       178.93 * -99.00%*

                                4.20.0-rc3              4.20.0-rc3
                              lowzone-v5r8              boost-v5r8
Percentage  huge-1         0.00 (   0.00%)         5.12 ( 100.00%)

Note that events that cause external fragmentation are massively reduced
by this patch, whether in comparison to the previous kernel or the vanilla
kernel. The fault latency for huge pages appears to be increased but that
is only because THP allocations were successful with the patch applied.
1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------
4.20-rc3 extfrag events < order 9:  291392
4.20-rc3+patch:                     191187 (34% reduction)
4.20-rc3+patch1-4:                   13464 (95% reduction)

thpfioscale Fault Latencies
                                4.20.0-rc3              4.20.0-rc3
                              lowzone-v5r8              boost-v5r8
Min    fault-base-1      912.00 (   0.00%)       905.00 (   0.77%)
Min    fault-huge-1      127.00 (   0.00%)       135.00 (  -6.30%)
Amean  fault-base-1     1467.55 (   0.00%)      1481.67 (  -0.96%)
Amean  fault-huge-1     1127.11 (   0.00%)      1063.88 *   5.61%*

                                4.20.0-rc3              4.20.0-rc3
                              lowzone-v5r8              boost-v5r8
Percentage  huge-1        77.64 (   0.00%)        83.46 (   7.49%)

As before, massive reduction in external fragmentation events, some jitter
on latencies and an increase in THP allocation success rates.
2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------
4.20-rc3 extfrag events < order 9:  215698
4.20-rc3+patch:                     200210 (7% reduction)
4.20-rc3+patch1-4:                   14263 (93% reduction)

                                4.20.0-rc3              4.20.0-rc3
                              lowzone-v5r8              boost-v5r8
Amean  fault-base-5     1346.45 (   0.00%)      1306.87 (   2.94%)
Amean  fault-huge-5     3418.60 (   0.00%)      1348.94 (  60.54%)

                                4.20.0-rc3              4.20.0-rc3
                              lowzone-v5r8              boost-v5r8
Percentage  huge-5         0.78 (   0.00%)         7.91 ( 910.64%)

There is a 93% reduction in fragmentation-causing events, a big
reduction in the huge page fault latency, and a higher allocation
success rate.
2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------
4.20-rc3 extfrag events < order 9:  166352
4.20-rc3+patch:                     147463 (11% reduction)
4.20-rc3+patch1-4:                   11095 (93% reduction)

thpfioscale Fault Latencies
                                4.20.0-rc3              4.20.0-rc3
                              lowzone-v5r8              boost-v5r8
Amean  fault-base-5     6217.43 (   0.00%)      7419.67 * -19.34%*
Amean  fault-huge-5     3163.33 (   0.00%)      3263.80 (  -3.18%)

                                4.20.0-rc3              4.20.0-rc3
                              lowzone-v5r8              boost-v5r8
Percentage  huge-5        95.14 (   0.00%)        87.98 (  -7.53%)

There is a large reduction in fragmentation events with some jitter around
the latencies and success rates. As before, the high THP allocation
success rate does mean the system is under a lot of pressure. However, as
the fragmentation events are reduced, it would be expected that the
long-term allocation success rate would be higher.
Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 00:35:52 -08:00
|
|
|
{
|
|
|
|
|
.procname = "watermark_boost_factor",
|
|
|
|
|
.data = &watermark_boost_factor,
|
|
|
|
|
.maxlen = sizeof(watermark_boost_factor),
|
|
|
|
|
.mode = 0644,
|
2020-04-24 08:43:37 +02:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
proc/sysctl: add shared variables for range check
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
mm: reclaim small amounts of memory when an external fragmentation event occurs
2018-12-28 00:35:52 -08:00
|
|
|
},
|
2016-03-17 14:19:14 -07:00
|
|
|
{
|
|
|
|
|
.procname = "watermark_scale_factor",
|
|
|
|
|
.data = &watermark_scale_factor,
|
|
|
|
|
.maxlen = sizeof(watermark_scale_factor),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = watermark_scale_factor_sysctl_handler,
|
proc/sysctl: add shared variables for range check
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ONE,
|
2022-01-21 22:10:55 -08:00
|
|
|
.extra2 = SYSCTL_THREE_THOUSAND,
|
2016-03-17 14:19:14 -07:00
|
|
|
},
|
2006-01-08 01:00:40 -08:00
|
|
|
{
|
2021-06-28 19:42:24 -07:00
|
|
|
.procname = "percpu_pagelist_high_fraction",
|
|
|
|
|
.data = &percpu_pagelist_high_fraction,
|
|
|
|
|
.maxlen = sizeof(percpu_pagelist_high_fraction),
|
2006-01-08 01:00:40 -08:00
|
|
|
.mode = 0644,
|
2021-06-28 19:42:24 -07:00
|
|
|
.proc_handler = percpu_pagelist_high_fraction_sysctl_handler,
|
proc/sysctl: add shared variables for range check
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2006-01-08 01:00:40 -08:00
|
|
|
},
|
mm: allow a controlled amount of unfairness in the page lock
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.
That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.
It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.
Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to. But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.
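The notion of a bounded amount of lock stealing can be sketched in user space as follows. This is a simplified model and not the kernel's wait-queue implementation: struct demo_waiter, demo_waiter_init() and demo_lock_was_stolen() are hypothetical names, and the model only tracks how many times a waiter tolerates losing the lock before it insists on strict ordering.

```c
#include <stdbool.h>

/* Per-waiter state in this simplified model. */
struct demo_waiter {
	int  steals_left;  /* how many more times the lock may be stolen */
	bool strict;       /* once true, the waiter demands an in-order handoff */
};

/* Start a wait with a budget taken from the tunable (assumed non-negative). */
void demo_waiter_init(struct demo_waiter *w, int unfairness)
{
	w->steals_left = unfairness;
	w->strict = (unfairness <= 0);  /* a budget of 0 means fully fair */
}

/* Called when the waiter wakes up and finds that someone else took the lock. */
void demo_lock_was_stolen(struct demo_waiter *w)
{
	if (w->strict)
		return;
	if (--w->steals_left <= 0)
		w->strict = true;  /* budget used up: switch to strict ordering */
}
```

With a budget of 5, the waiter tolerates four steals and turns strict on the fifth; a budget of 0 makes it strict from the start, which corresponds to the fully fair behavior described earlier in the message.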
There is also a hint from Matthieu Baerts that the fair page coloring
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.
The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.
This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain locks. And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but for things like just verifying that the page file
mapping is stable while faulting in the page into a page table.
Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-13 14:05:35 -07:00
|
|
|
{
|
|
|
|
|
.procname = "page_lock_unfairness",
|
|
|
|
|
.data = &sysctl_page_lock_unfairness,
|
|
|
|
|
.maxlen = sizeof(sysctl_page_lock_unfairness),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
|
.extra1 = SYSCTL_ZERO,
|
|
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
#ifdef CONFIG_MMU
|
|
|
|
|
{
|
|
|
|
|
.procname = "max_map_count",
|
|
|
|
|
.data = &sysctl_max_map_count,
|
|
|
|
|
.maxlen = sizeof(sysctl_max_map_count),
|
|
|
|
|
.mode = 0644,
|
2009-12-17 15:27:05 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
proc/sysctl: add shared variables for range check
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2009-01-08 12:04:47 +00:00
|
|
|
#else
|
|
|
|
|
{
|
|
|
|
|
.procname = "nr_trim_pages",
|
|
|
|
|
.data = &sysctl_nr_trim_pages,
|
|
|
|
|
.maxlen = sizeof(sysctl_nr_trim_pages),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
proc/sysctl: add shared variables for range check
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2009-01-08 12:04:47 +00:00
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif
|
|
|
|
|
{
|
|
|
|
|
.procname = "vfs_cache_pressure",
|
|
|
|
|
.data = &sysctl_vfs_cache_pressure,
|
|
|
|
|
.maxlen = sizeof(sysctl_vfs_cache_pressure),
|
|
|
|
|
.mode = 0644,
|
2021-02-25 17:20:53 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
proc/sysctl: add shared variables for range check
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2019-09-23 15:38:47 -07:00
|
|
|
#if defined(HAVE_ARCH_PICK_MMAP_LAYOUT) || \
|
|
|
|
|
defined(CONFIG_ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
|
.procname = "legacy_va_layout",
|
|
|
|
|
.data = &sysctl_legacy_va_layout,
|
|
|
|
|
.maxlen = sizeof(sysctl_legacy_va_layout),
|
|
|
|
|
.mode = 0644,
|
2021-02-25 17:20:53 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
proc/sysctl: add shared variables for range check
In the sysctl code, the proc_dointvec_minmax() function is often used to
validate that a user-supplied value falls within an allowed range. This
function uses the extra1 and extra2 members of struct ctl_table as the
minimum and maximum allowed values.
On sysctl handler declaration, every source file defines some read-only
variables, each containing just an integer whose address is assigned
to the extra1 and extra2 members, so that the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundaries, leading to duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
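The range-check mechanism described in the commit message above can be illustrated with a small userspace sketch. This is not the kernel implementation: `demo_sysctl_vals`, the `DEMO_*` macros, `struct demo_ctl_table` and `demo_write_minmax()` are hypothetical names invented for illustration, and the struct keeps only the fields the range check needs. It assumes a NULL bound means "no limit on that side".

```c
#include <limits.h>
#include <stddef.h>

/* One shared array of common boundary values, as the commit describes. */
static const int demo_sysctl_vals[] = { 0, 1, INT_MAX };

/* Hypothetical macros pointing at the shared array members. */
#define DEMO_ZERO    ((const void *)&demo_sysctl_vals[0])
#define DEMO_ONE     ((const void *)&demo_sysctl_vals[1])
#define DEMO_INT_MAX ((const void *)&demo_sysctl_vals[2])

/* Simplified stand-in for struct ctl_table (illustration only). */
struct demo_ctl_table {
	const char *procname;
	int *data;
	const void *extra1;	/* minimum, or NULL for no lower bound */
	const void *extra2;	/* maximum, or NULL for no upper bound */
};

/*
 * Store val only if it lies within [extra1, extra2].
 * Returns 0 on success and -1 if the value is out of range.
 */
static int demo_write_minmax(const struct demo_ctl_table *t, int val)
{
	if (t->extra1 && val < *(const int *)t->extra1)
		return -1;
	if (t->extra2 && val > *(const int *)t->extra2)
		return -1;
	*t->data = val;
	return 0;
}
```

A table entry initialised with `DEMO_ZERO` and `DEMO_ONE` then accepts only 0 or 1, without the source file having to define its own `zero` and `one` variables.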
|
|
|
.extra1 = SYSCTL_ZERO,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
|
#endif
|
2006-01-18 17:42:32 -08:00
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
|
{
|
|
|
|
|
.procname = "zone_reclaim_mode",
|
2016-07-28 15:46:32 -07:00
|
|
|
.data = &node_reclaim_mode,
|
|
|
|
|
.maxlen = sizeof(node_reclaim_mode),
|
2006-01-18 17:42:32 -08:00
|
|
|
.mode = 0644,
|
2021-02-25 17:20:53 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2006-01-18 17:42:32 -08:00
|
|
|
},
|
2006-07-03 00:24:13 -07:00
|
|
|
{
|
|
|
|
|
.procname = "min_unmapped_ratio",
|
|
|
|
|
.data = &sysctl_min_unmapped_ratio,
|
|
|
|
|
.maxlen = sizeof(sysctl_min_unmapped_ratio),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = sysctl_min_unmapped_ratio_sysctl_handler,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2022-01-21 22:10:55 -08:00
|
|
|
.extra2 = SYSCTL_ONE_HUNDRED,
|
2006-07-03 00:24:13 -07:00
|
|
|
},
|
2006-09-25 23:31:52 -07:00
|
|
|
{
|
|
|
|
|
.procname = "min_slab_ratio",
|
|
|
|
|
.data = &sysctl_min_slab_ratio,
|
|
|
|
|
.maxlen = sizeof(sysctl_min_slab_ratio),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = sysctl_min_slab_ratio_sysctl_handler,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2022-01-21 22:10:55 -08:00
|
|
|
.extra2 = SYSCTL_ONE_HUNDRED,
|
2006-09-25 23:31:52 -07:00
|
|
|
},
|
[PATCH] vdso: randomize the i386 vDSO by moving it into a vma
Move the i386 VDSO down into a vma and thus randomize it.
Besides the security implications, this feature also helps debuggers, which
can COW a vma-backed VDSO just like a normal DSO and can thus do
single-stepping and other debugging features.
It's good for hypervisors (Xen, VMWare) too, which typically live in the same
high-mapped address space as the VDSO, hence whenever the VDSO is used, they
get lots of guest pagefaults and have to fix such guest accesses up - which
slows things down instead of speeding things up (the primary purpose of the
VDSO).
There's a new CONFIG_COMPAT_VDSO (default=y) option, which provides support
for older glibcs that still rely on a prelinked high-mapped VDSO. Newer
distributions (using glibc 2.3.3 or later) can turn this option off. Turning
it off is also recommended for security reasons: attackers cannot use the
predictable high-mapped VDSO page as syscall trampoline anymore.
There is a new vdso=[0|1] boot option as well, and a runtime
/proc/sys/vm/vdso_enabled sysctl switch, that allows the VDSO to be turned
on/off.
(This version of the VDSO-randomization patch also has working ELF
coredumping, the previous patch crashed in the coredumping code.)
This code is a combined work of the exec-shield VDSO randomization
code and Gerd Hoffmann's hypervisor-centric VDSO patch. Rusty Russell
started this patch and I completed it.
[akpm@osdl.org: cleanups]
[akpm@osdl.org: compile fix]
[akpm@osdl.org: compile fix 2]
[akpm@osdl.org: compile fix 3]
[akpm@osdl.org: revert MAXMEM change]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Cc: Gerd Hoffmann <kraxel@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 02:53:50 -07:00
|
|
|
#endif
|
2007-05-09 02:35:13 -07:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
|
{
|
|
|
|
|
.procname = "stat_interval",
|
|
|
|
|
.data = &sysctl_stat_interval,
|
|
|
|
|
.maxlen = sizeof(sysctl_stat_interval),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_jiffies,
|
2007-05-09 02:35:13 -07:00
|
|
|
},
|
2016-05-19 17:12:50 -07:00
|
|
|
{
|
|
|
|
|
.procname = "stat_refresh",
|
|
|
|
|
.data = NULL,
|
|
|
|
|
.maxlen = 0,
|
|
|
|
|
.mode = 0600,
|
|
|
|
|
.proc_handler = vmstat_refresh,
|
|
|
|
|
},
|
2007-05-09 02:35:13 -07:00
|
|
|
#endif
|
2009-12-15 19:27:45 +00:00
|
|
|
#ifdef CONFIG_MMU
|
2007-06-28 15:55:21 -04:00
|
|
|
{
|
|
|
|
|
.procname = "mmap_min_addr",
|
2009-07-31 12:54:11 -04:00
|
|
|
.data = &dac_mmap_min_addr,
|
|
|
|
|
.maxlen = sizeof(unsigned long),
|
2007-06-28 15:55:21 -04:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = mmap_min_addr_handler,
|
2007-06-28 15:55:21 -04:00
|
|
|
},
|
2009-12-15 19:27:45 +00:00
|
|
|
#endif
|
2007-07-15 23:38:01 -07:00
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
|
{
|
|
|
|
|
.procname = "numa_zonelist_order",
|
|
|
|
|
.data = &numa_zonelist_order,
|
|
|
|
|
.maxlen = NUMA_ZONELIST_ORDER_LEN,
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = numa_zonelist_order_handler,
|
2007-07-15 23:38:01 -07:00
|
|
|
},
|
|
|
|
|
#endif
|
2007-10-13 08:16:04 +01:00
|
|
|
#if (defined(CONFIG_X86_32) && !defined(CONFIG_UML))|| \
|
2007-03-01 10:07:42 +09:00
|
|
|
(defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL))
|
2006-06-27 02:53:50 -07:00
|
|
|
{
|
|
|
|
|
.procname = "vdso_enabled",
|
2014-05-05 12:19:32 -07:00
|
|
|
#ifdef CONFIG_X86_32
|
|
|
|
|
.data = &vdso32_enabled,
|
|
|
|
|
.maxlen = sizeof(vdso32_enabled),
|
|
|
|
|
#else
|
2006-06-27 02:53:50 -07:00
|
|
|
.data = &vdso_enabled,
|
|
|
|
|
.maxlen = sizeof(vdso_enabled),
|
2014-05-05 12:19:32 -07:00
|
|
|
#endif
|
2006-06-27 02:53:50 -07:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2006-06-27 02:53:50 -07:00
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif
|
2009-09-16 11:50:15 +02:00
|
|
|
#ifdef CONFIG_MEMORY_FAILURE
|
|
|
|
|
{
|
|
|
|
|
.procname = "memory_failure_early_kill",
|
|
|
|
|
.data = &sysctl_memory_failure_early_kill,
|
|
|
|
|
.maxlen = sizeof(sysctl_memory_failure_early_kill),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
|
|
|
|
.extra2 = SYSCTL_ONE,
|
2009-09-16 11:50:15 +02:00
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
.procname = "memory_failure_recovery",
|
|
|
|
|
.data = &sysctl_memory_failure_recovery,
|
|
|
|
|
.maxlen = sizeof(sysctl_memory_failure_recovery),
|
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
|
|
|
|
.extra2 = SYSCTL_ONE,
|
2009-09-16 11:50:15 +02:00
|
|
|
},
|
|
|
|
|
#endif
|
mm: limit growth of 3% hardcoded other user reserve
Add user_reserve_kbytes knob.
Limit the growth of the memory reserved for other user processes to
min(3% current process size, user_reserve_pages). Only about 8MB is
necessary to enable recovery in the default mode, and only a few hundred
MB are required even when overcommit is disabled.
user_reserve_pages defaults to min(3% free pages, 128MB)
I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
then adding the RSS of each.
This only affects OVERCOMMIT_NEVER mode.
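The two min() expressions above can be worked through with a small userspace sketch. It is only an illustration under stated assumptions: the function names are hypothetical, all sizes are in kilobytes, and "3%" is approximated as a division by 32, which is an assumption of this sketch rather than something the text above states.

```c
#include <stdint.h>

/* Helper: smaller of two unsigned 64-bit values. */
static uint64_t demo_min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Default reserve: min(~3% of free memory, 128 MB), in kilobytes. */
static uint64_t demo_default_user_reserve_kb(uint64_t free_kb)
{
	return demo_min_u64(free_kb / 32, UINT64_C(128) * 1024);
}

/*
 * Memory held back for other user processes:
 * min(~3% of the current process size, the tunable reserve), in kilobytes.
 */
static uint64_t demo_user_reserve_kb(uint64_t process_kb, uint64_t reserve_kb)
{
	return demo_min_u64(process_kb / 32, reserve_kb);
}
```

With 32 GB free, the default works out to the 128 MB cap instead of roughly 1 GB, so a single large process loses far less memory to the reserve than under the fixed 3% rule.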
Background
1. user reserve
__vm_enough_memory reserves a hardcoded 3% of the current process size for
other applications when overcommit is disabled. This was done so that a
user could recover if they launched a memory hogging process. Without the
reserve, a user would easily run into a message such as:
bash: fork: Cannot allocate memory
2. admin reserve
Additionally, a hardcoded 3% of free memory is reserved for root in both
overcommit 'guess' and 'never' modes. This was intended to prevent a
scenario where root cannot log in and perform recovery operations.
Note that this reserve shrinks, and doesn't guarantee a useful reserve.
Motivation
The two hardcoded memory reserves should be updated to account for current
memory sizes.
Also, the admin reserve would be more useful if it didn't shrink too much.
When the current code was originally written, 1GB was considered
"enterprise". Now the 3% reserve can grow to multiple GB on large memory
systems, and it only needs to be a few hundred MB at most to enable a user
or admin to recover a system with an unwanted memory hogging process.
I've found that reducing these reserves is especially beneficial for a
specific type of application load:
* single application system
* one or few processes (e.g. one per core)
* allocating all available memory
* not initializing every page immediately
* long running
I've run scientific clusters with this sort of load. A long running job
sometimes failed many hours (weeks of CPU time) into a calculation. They
weren't initializing all of their memory immediately, and they weren't
using calloc, so I put systems into overcommit 'never' mode. These
clusters run diskless and have no swap.
However, with the current reserves, a user wishing to allocate as much
memory as possible to one process may be prevented from using, for
example, almost 2GB out of 32GB.
The effect is less, but still significant when a user starts a job with
one process per core. I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could not
allocate the amount of memory a user would expect to be able to allocate.
For example, Message Passing Interface (MPI) processes, one per core. And
it is similar for other parallel programming frameworks.
Changing this reserve code will make the overcommit never mode more useful
by allowing applications to allocate nearly all of the available memory.
Also, the new admin_reserve_kbytes will be safer than the current behavior
since the hardcoded 3% of available memory reserve can shrink to something
useless in the case where applications have grabbed all available memory.
Risks
* "bash: fork: Cannot allocate memory"
The downside of the first patch-- which creates a tunable user reserve
that is only used in overcommit 'never' mode--is that an admin can set
it so low that a user may not be able to kill their process, even if
they already have a shell prompt.
Of course, a user can get in the same predicament with the current 3%
reserve--they just have to launch processes until 3% becomes negligible.
* root-cant-log-in problem
The second patch, adding the tunable admin_reserve_kbytes, allows
the admin to shoot themselves in the foot by setting it too small. They
can easily get the system into a state where root-can't-log-in.
However, the new admin_reserve_kbytes will be safer than the current
behavior since the hardcoded 3% of available memory reserve can shrink
to something useless in the case where applications have grabbed all
available memory.
Alternatives
* Memory cgroups provide a more flexible way to limit application memory.
Not everyone wants to set up cgroups or deal with their overhead.
* We could create a fourth overcommit mode which provides smaller reserves.
The size of useful reserves may be drastically different depending
on whether the system is embedded or enterprise.
* Force users to initialize all of their memory or use calloc.
Some users don't want/expect the system to overcommit when they malloc.
Overcommit 'never' mode is for this scenario, and it should work well.
The new user and admin reserve tunables are simple to use, with low
overhead compared to cgroups. The patches preserve current behavior where
3% of memory is less than 128MB, except that the admin reserve doesn't
shrink to an unusable size under pressure. The code allows admins to tune
for embedded and enterprise usage.
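The default sizing described above can be modeled roughly as "the smaller of 3% of free memory and a fixed cap". The sketch below is only an illustration of that rule, not the code from the patch: the helper name, the use of kilobytes, and the choice of caps passed by the caller (for instance 128MB for the user reserve) are assumptions made for the example.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the patch):
 * reserve = min(3% of free memory, cap), all values in kB.
 */
static unsigned long default_reserve_kb(unsigned long free_kb,
					unsigned long cap_kb)
{
	/* Divide first so the multiplication cannot overflow. */
	unsigned long three_percent = free_kb / 100 * 3;

	assert(cap_kb > 0);
	return three_percent < cap_kb ? three_percent : cap_kb;
}
```

On a small system the 3% term wins, so the reserve tracks free memory as before; on a large system the cap wins, which keeps the reserve from growing to multiple GB.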
FAQ
* How is the root-cant-login problem addressed?
What happens if admin_reserve_kbytes is set to 0?
Root is free to shoot themselves in the foot by setting
admin_reserve_kbytes too low.
On x86_64, the minimum useful reserve is:
8MB for overcommit 'guess'
128MB for overcommit 'never'
admin_reserve_kbytes defaults to min(3% free memory, 8MB)
So, anyone switching to 'never' mode needs to adjust
admin_reserve_kbytes.
* How do you calculate a minimum useful reserve?
A user or the admin needs enough memory to login and perform
recovery operations, which includes, at a minimum:
sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
For overcommit 'guess', we can sum resident set sizes (RSS)
because we only need enough memory to handle what the recovery
programs will typically use. On x86_64 this is about 8MB.
For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS. We use VSZ instead of RSS because this mode
forces us to ensure we can fulfill all of the requested memory allocations,
even if the programs only use a fraction of what they ask for.
On x86_64 this is about 128MB.
When swap is enabled, reserves are useful even when they are as
small as 10MB, regardless of overcommit mode.
When both swap and overcommit are disabled, then the admin should
tune the reserves higher to be absolutely safe. Over 230MB each
was safest in my testing.
* What happens if user_reserve_kbytes is set to 0?
Note, this only affects overcommit 'never' mode.
Then a user will be able to allocate all available memory minus
admin_reserve_kbytes.
However, they will easily see a message such as:
"bash: fork: Cannot allocate memory"
And they won't be able to recover/kill their application.
The admin should be able to recover the system if
admin_reserve_kbytes is set appropriately.
* What's the difference between overcommit 'guess' and 'never'?
"Guess" allows an allocation if there are enough free + reclaimable
pages. It has a hardcoded 3% of free pages reserved for root.
"Never" allows an allocation if there is enough swap + a configurable
percentage (default is 50) of physical RAM. It has a hardcoded 3% of
free pages reserved for root, like "Guess" mode. It also has a
hardcoded 3% of the current process size reserved for additional
applications.
* Why is overcommit 'guess' not suitable even when an app eventually
writes to every page? It takes free pages, file pages, available
swap pages, reclaimable slab pages into consideration. In other words,
these are all pages available, then why isn't overcommit suitable?
Because it only looks at the present state of the system. It
does not take into account the memory that other applications have
malloced, but haven't initialized yet. It overcommits the system.
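The rule of thumb given in the FAQ for a minimum useful reserve can be written down as a small calculation. This is a sketch under stated assumptions: the struct, the function names, and the idea of feeding in RSS and VSZ figures by hand (for example as read from ps) are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Memory footprint of one recovery program, in kB (illustrative type). */
struct prog_mem {
	unsigned long rss_kb;	/* resident set size */
	unsigned long vsz_kb;	/* virtual size */
};

/* Overcommit 'guess': the sum of the resident set sizes. */
static unsigned long min_reserve_guess_kb(const struct prog_mem *p, size_t n)
{
	unsigned long sum = 0;

	assert(p != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		sum += p[i].rss_kb;
	return sum;
}

/* Overcommit 'never': the largest virtual size plus the sum of the RSS. */
static unsigned long min_reserve_never_kb(const struct prog_mem *p, size_t n)
{
	unsigned long max_vsz = 0;

	assert(p != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (p[i].vsz_kb > max_vsz)
			max_vsz = p[i].vsz_kb;
	return max_vsz + min_reserve_guess_kb(p, n);
}
```

Passing one entry each for sshd (or login), the shell, and top gives an estimate for each mode; the 'never' figure is much larger because a single large virtual size dominates it.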
Test Summary
There was little change in behavior in the default overcommit 'guess'
mode with swap enabled before and after the patch. This was expected.
Systems run most predictably (i.e. no oom kills) in overcommit 'never'
mode with swap enabled. This also allowed the most memory to be allocated
to a user application.
Overcommit 'guess' mode without swap is a bad idea. It is easy to
crash the system. None of the other tested combinations crashed.
This matches my experience on the Roadrunner supercomputer.
Without the tunable user reserve, a system in overcommit 'never' mode
and without swap does not allow the user to recover, although the
admin can.
With the new tunable reserves, a system in overcommit 'never' mode
and without swap can be configured to:
1. maximize user-allocatable memory, running close to the edge of
recoverability
2. maximize recoverability, sacrificing allocatable memory to
ensure that a user cannot take down a system
Test Description
Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
System is booted into multiuser console mode, with unnecessary services
turned off. Caches were dropped before each test.
Hogs are user memtester processes that attempt to allocate all free memory
as reported by /proc/meminfo
In overcommit 'never' mode, memory_ratio=100
Test Results
3.9.0-rc1-mm1
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
----------   ----   ----   -------------   ----   -------------   --------------
guess        yes    1      5432/5432       no     yes             yes
guess        yes    4      5444/5444       1      yes             yes
guess        no     1      5302/5449       no     yes             yes
guess        no     4      -               crash  no              no
never        yes    1      5460/5460       1      yes             yes
never        yes    4      5460/5460       1      yes             yes
never        no     1      5218/5432       no     no              yes
never        no     4      5203/5448       no     no              yes
3.9.0-rc1-mm1-tunablereserves
User and Admin Recovery show their respective reserves, if applicable.
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
----------   ----   ----   -------------   ----   -------------   --------------
guess        yes    1      5419/5419       no     -     yes       8MB   yes
guess        yes    4      5436/5436       1      -     yes       8MB   yes
guess        no     1      5440/5440       *      -     yes       8MB   yes
guess        no     4      -               crash  -     no        8MB   no
* process would successfully mlock, then the oom killer would pick it
never        yes    1      5446/5446       no     10MB  yes       20MB  yes
never        yes    4      5456/5456       no     10MB  yes       20MB  yes
never        no     1      5387/5429       no     128MB no        8MB   barely
never        no     1      5323/5428       no     226MB barely    8MB   barely
never        no     1      5323/5428       no     226MB barely    8MB   barely
never        no     1      5359/5448       no     10MB  no        10MB  barely
never        no     1      5323/5428       no     0MB   no        10MB  barely
never        no     1      5332/5428       no     0MB   no        50MB  yes
never        no     1      5293/5429       no     0MB   no        90MB  yes
never        no     1      5001/5427       no     230MB yes       338MB yes
never        no     4*     4998/5424       no     230MB yes       338MB yes
* more memtesters were launched, able to allocate approximately another 100MB
Future Work
- Test larger memory systems.
- Test an embedded image.
- Test other architectures.
- Time malloc microbenchmarks.
- Would it be useful to be able to set overcommit policy for
each memory cgroup?
- Some lines are slightly above 80 chars.
Perhaps define a macro to convert between pages and kb?
Other places in the kernel do this.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_user_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:08:10 -07:00
|
|
|
{
|
|
|
|
|
.procname = "user_reserve_kbytes",
|
|
|
|
|
.data = &sysctl_user_reserve_kbytes,
|
|
|
|
|
.maxlen = sizeof(sysctl_user_reserve_kbytes),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
|
|
|
|
},
|
2013-04-29 15:08:11 -07:00
|
|
|
{
|
|
|
|
|
.procname = "admin_reserve_kbytes",
|
|
|
|
|
.data = &sysctl_admin_reserve_kbytes,
|
|
|
|
|
.maxlen = sizeof(sysctl_admin_reserve_kbytes),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
|
|
|
|
},
|
mm: mmap: add new /proc tunable for mmap_base ASLR
Address Space Layout Randomization (ASLR) provides a barrier to
exploitation of user-space processes in the presence of security
vulnerabilities by making it more difficult to find desired code/data
which could help an attack. This is done by adding a random offset to
the location of regions in the process address space, with a greater
range of potential offset values corresponding to better protection/a
larger search-space for brute force, but also to greater potential for
fragmentation.
The offset added to the mmap_base address, which provides the basis for
the majority of the mappings for a process, is set once on process exec
in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
which reflect, hopefully, the best compromise for all systems. The
trade-off between increased entropy in the offset value generation and
the corresponding increased variability in address space fragmentation
is not absolute, however, and some platforms may tolerate higher amounts
of entropy. This patch introduces both new Kconfig values and a sysctl
interface which may be used to change the amount of entropy used for
offset generation on a system.
This change was directly motivated by the libstagefright vulnerabilities
that affected Android, specifically by information provided by Google's
Project Zero at:
http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html
The attack presented therein, by Google's Project Zero, specifically
targeted the limited randomness used to generate the offset added to the
mmap_base address in order to craft a brute-force-based attack.
Concretely, the attack was against the mediaserver process, which was
limited to respawning every 5 seconds, on an arm device. The hard-coded
8 bits used meant that, on average, an attacker could expect to defeat
the mmap ASLR after just over 10 minutes (128 tries at 5 seconds
apiece). With this patch, and an accompanying increase in the entropy
value to 16 bits, the same attack would take an average expected time of
over 45 hours (32768 tries), which makes it both less feasible and more
likely to be noticed.
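The figures in the paragraph above follow from simple arithmetic: with 2^bits equally likely offsets, a brute-force attacker needs on average half of them, and each try costs one respawn interval. The helpers below are an illustrative sketch only; their names are made up for this example and nothing like them exists in the patch.

```c
#include <assert.h>

/* Average number of guesses needed to hit one of 2^bits offsets. */
static unsigned long expected_tries(unsigned int bits)
{
	assert(bits >= 1 && bits < 32);
	return 1UL << (bits - 1);
}

/* Average attack time in seconds when every try costs respawn_secs. */
static unsigned long expected_seconds(unsigned int bits,
				      unsigned long respawn_secs)
{
	return expected_tries(bits) * respawn_secs;
}
```

With a 5 second respawn interval, 8 bits gives 128 tries and 640 seconds (just over 10 minutes), while 16 bits gives 32768 tries and 163840 seconds (about 45.5 hours).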
The introduced Kconfig and sysctl options are limited by per-arch
minimum and maximum values, the minimum of which was chosen to match the
current hard-coded value and the maximum of which was chosen so as to
give the greatest flexibility without generating an invalid mmap_base
address, generally 3-4 bits less than the number of bits in the
user-space accessible virtual address space.
When deciding whether or not to change the default value, a system
developer should consider that the mmap_base address could be placed
anywhere up to 2^(value) bits away from the non-randomized location,
which would introduce variable-sized areas above and below the mmap_base
address such that the maximum vm_area_struct size may be reduced,
preventing very large allocations.
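To get a feel for how far the base can move, the sketch below masks a random value down to the configured number of bits and scales it to bytes. It assumes that the random value counts whole pages and that pages are 4 KiB; both are assumptions of this example, as are the function and macro names.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Keep the low `bits` bits of a random value and convert pages to bytes. */
static uint64_t mmap_offset_bytes(uint64_t random_value, unsigned int bits)
{
	assert(bits >= 1 && bits <= 32);

	uint64_t mask = (UINT64_C(1) << bits) - 1;

	return (random_value & mask) << EXAMPLE_PAGE_SHIFT;
}

/* Largest displacement that a given number of bits can produce. */
static uint64_t max_mmap_offset_bytes(unsigned int bits)
{
	return mmap_offset_bytes(UINT64_MAX, bits);
}
```

Under these assumptions 8 bits moves the base by at most just under 1 MiB and 16 bits by at most just under 256 MiB, which shows why a larger value can eat into the room left for very large allocations on a 32-bit address space.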
This patch (of 4):
ASLR only uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures. This value was chosen to
prevent a poorly chosen value from dividing the address space in such a
way as to prevent large allocations. This may not be an issue on all
platforms. Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.
Signed-off-by: Daniel Cashman <dcashman@google.com>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Borislav Petkov <bp@suse.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 15:19:53 -08:00
|
|
|
#ifdef CONFIG_HAVE_ARCH_MMAP_RND_BITS
|
|
|
|
|
{
|
|
|
|
|
.procname = "mmap_rnd_bits",
|
|
|
|
|
.data = &mmap_rnd_bits,
|
|
|
|
|
.maxlen = sizeof(mmap_rnd_bits),
|
|
|
|
|
.mode = 0600,
|
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
|
.extra1 = (void *)&mmap_rnd_bits_min,
|
|
|
|
|
.extra2 = (void *)&mmap_rnd_bits_max,
|
|
|
|
|
},
|
|
|
|
|
#endif
|
|
|
|
|
#ifdef CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS
|
|
|
|
|
{
|
|
|
|
|
.procname = "mmap_rnd_compat_bits",
|
|
|
|
|
.data = &mmap_rnd_compat_bits,
|
|
|
|
|
.maxlen = sizeof(mmap_rnd_compat_bits),
|
|
|
|
|
.mode = 0600,
|
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
|
.extra1 = (void *)&mmap_rnd_compat_bits_min,
|
|
|
|
|
.extra2 = (void *)&mmap_rnd_compat_bits_max,
|
|
|
|
|
},
|
userfaultfd/sysctl: add vm.unprivileged_userfaultfd
Userfaultfd can be misused to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available. By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote, stable for an extended period, which it can then access using an
existing exploit. While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.
We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.
Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users. When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.
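The policy in the paragraph above reduces to a two-input decision. The following is a userspace model of that decision only, assuming the knob takes the values 0 and 1 as described; the function name and its parameters are hypothetical, and the real check in the kernel is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the policy: a knob value of 1 lets every caller use
 * userfaultfd, a value of 0 restricts it to privileged callers
 * (represented here by holding CAP_SYS_PTRACE).
 */
static bool userfaultfd_allowed(int unprivileged_userfaultfd,
				bool has_cap_sys_ptrace)
{
	assert(unprivileged_userfaultfd == 0 || unprivileged_userfaultfd == 1);
	return unprivileged_userfaultfd == 1 || has_cap_sys_ptrace;
}
```

The only combination that is refused is an unprivileged caller while the knob is 0.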
Andrea said:
: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd. In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.
[dgilbert@redhat.com: changelog additions]
[akpm@linux-foundation.org: documentation tweak, per Mike]
Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-13 17:16:41 -07:00
|
|
|
#endif
|
|
|
|
|
#ifdef CONFIG_USERFAULTFD
|
|
|
|
|
{
|
|
|
|
|
.procname = "unprivileged_userfaultfd",
|
|
|
|
|
.data = &sysctl_unprivileged_userfaultfd,
|
|
|
|
|
.maxlen = sizeof(sysctl_unprivileged_userfaultfd),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
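A userspace sketch of the idea follows: one shared constant array, macros that point into it, and a range check that reads its bounds through extra1 and extra2. The names sysctl_vals, SYSCTL_ZERO and SYSCTL_ONE appear in this commit; SYSCTL_INT_MAX, the trimmed-down struct, and the setter function are assumptions made for illustration and do not reproduce the kernel's struct ctl_table or proc_dointvec_minmax().

```c
#include <assert.h>
#include <limits.h>

/* One shared table instead of per-file zero/one/int_max variables. */
static const int sysctl_vals[] = { 0, 1, INT_MAX };

#define SYSCTL_ZERO	((void *)&sysctl_vals[0])
#define SYSCTL_ONE	((void *)&sysctl_vals[1])
#define SYSCTL_INT_MAX	((void *)&sysctl_vals[2])

/* Illustrative stand-in for the relevant members of struct ctl_table. */
struct example_ctl {
	int *data;	/* variable controlled by the knob */
	void *extra1;	/* pointer to the minimum, or NULL for none */
	void *extra2;	/* pointer to the maximum, or NULL for none */
};

/* Store val if it lies within [min, max]; return 0 on success, -1 if not. */
static int example_set_minmax(struct example_ctl *t, int val)
{
	assert(t && t->data);

	if (t->extra1 && val < *(int *)t->extra1)
		return -1;
	if (t->extra2 && val > *(int *)t->extra2)
		return -1;
	*t->data = val;
	return 0;
}
```

A boolean-style knob then needs no local variables at all: it sets extra1 to SYSCTL_ZERO and extra2 to SYSCTL_ONE, and any write outside 0..1 is rejected without touching the stored value.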
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
|
|
|
|
.extra2 = SYSCTL_ONE,
|
2019-05-13 17:16:41 -07:00
|
|
|
},
|
2016-01-14 15:19:53 -08:00
|
|
|
#endif
|
2009-04-03 02:30:53 -07:00
|
|
|
{ }
|
2005-04-16 15:20:36 -07:00
|
|
|
};
|
|
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
static struct ctl_table debug_table[] = {
|
2012-10-08 16:28:16 -07:00
|
|
|
#ifdef CONFIG_SYSCTL_EXCEPTION_TRACE
|
2007-07-22 11:12:28 +02:00
|
|
|
{
|
|
|
|
|
.procname = "exception-trace",
|
|
|
|
|
.data = &show_unhandled_signals,
|
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
|
.mode = 0644,
|
|
|
|
|
.proc_handler = proc_dointvec
|
|
|
|
|
},
|
|
|
|
|
#endif
|
2009-04-03 02:30:53 -07:00
|
|
|
{ }
|
2005-04-16 15:20:36 -07:00
|
|
|
};
|
|
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
static struct ctl_table dev_table[] = {
|
2009-04-03 02:30:53 -07:00
|
|
|
{ }
|
[PATCH] inotify
inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:
* dnotify requires the opening of one fd per each directory
that you intend to watch. This quickly results in too many
open files and pins removable media, preventing unmount.
* dnotify is directory-based. You only learn about changes to
directories. Sure, a change to a file in a directory affects
the directory, but you are then forced to keep a cache of
stat structures.
* dnotify's interface to user-space is awful. Signals?
inotify provides a more usable, simple, powerful solution to file change
notification:
* inotify's interface is a system call that returns a fd, not SIGIO.
You get a single fd, which is select()-able.
* inotify has an event that says "the filesystem that the item
you were watching is on was unmounted."
* inotify can watch directories or files.
Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.
See Documentation/filesystems/inotify.txt.
Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12 17:06:03 -04:00
|
|
|
};
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2022-01-21 22:13:24 -08:00
|
|
|
DECLARE_SYSCTL_BASE(kernel, kern_table);
|
|
|
|
|
DECLARE_SYSCTL_BASE(vm, vm_table);
|
|
|
|
|
DECLARE_SYSCTL_BASE(debug, debug_table);
|
|
|
|
|
DECLARE_SYSCTL_BASE(dev, dev_table);
|
2020-04-24 08:43:37 +02:00
|
|
|
|
2022-01-21 22:13:31 -08:00
|
|
|
int __init sysctl_init_bases(void)
|
2005-11-04 10:18:40 +00:00
|
|
|
{
|
2022-01-21 22:13:24 -08:00
|
|
|
register_sysctl_base(kernel);
|
|
|
|
|
register_sysctl_base(vm);
|
|
|
|
|
register_sysctl_base(debug);
|
|
|
|
|
register_sysctl_base(dev);
|
2012-07-30 14:42:48 -07:00
|
|
|
|
2007-02-14 00:34:13 -08:00
|
|
|
return 0;
|
|
|
|
|
}
|
2006-09-27 01:51:04 -07:00
|
|
|
#endif /* CONFIG_SYSCTL */
|
2005-04-16 15:20:36 -07:00
|
|
|
/*
|
|
|
|
|
* No sense putting this after each symbol definition, twice,
|
|
|
|
|
* exception granted :-)
|
|
|
|
|
*/
|
2021-08-03 12:59:36 +02:00
|
|
|
EXPORT_SYMBOL(proc_dobool);
|
2005-04-16 15:20:36 -07:00
|
|
|
EXPORT_SYMBOL(proc_dointvec);
|
2016-08-25 15:16:51 -07:00
|
|
|
EXPORT_SYMBOL(proc_douintvec);
|
2005-04-16 15:20:36 -07:00
|
|
|
EXPORT_SYMBOL(proc_dointvec_jiffies);
|
|
|
|
|
EXPORT_SYMBOL(proc_dointvec_minmax);
|
2017-07-12 14:33:40 -07:00
|
|
|
EXPORT_SYMBOL_GPL(proc_douintvec_minmax);
|
2005-04-16 15:20:36 -07:00
|
|
|
EXPORT_SYMBOL(proc_dointvec_userhz_jiffies);
|
|
|
|
|
EXPORT_SYMBOL(proc_dointvec_ms_jiffies);
|
|
|
|
|
EXPORT_SYMBOL(proc_dostring);
|
|
|
|
|
EXPORT_SYMBOL(proc_doulongvec_minmax);
|
|
|
|
|
EXPORT_SYMBOL(proc_doulongvec_ms_jiffies_minmax);
|
2019-04-17 16:35:49 -04:00
|
|
|
EXPORT_SYMBOL(proc_do_large_bitmap);
|