PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
existing mapping because it only updated mm->def_flags which is a
template for new mappings.
The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE
flag set. This can be quite surprising for all those applications which
do not do prctl(); fork() & exec() and want to control their own THP
behavior.
Another usecase when the immediate semantic of the prctl might be useful
is a combination of pre- and post-copy migration of containers with
CRIU. In this case CRIU populates a part of a memory region with data
that was saved during the pre-copy stage. Afterwards, the region is
registered with userfaultfd and CRIU expects to get page faults for the
parts of the region that were not yet populated. However, khugepaged
collapses the pages and the expected page faults do not occur.
In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
temporary mechanism for enabling/disabling THP process wide.
Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is
tested when decision whether to use huge pages is taken either during
page fault of at the time of THP collapse.
It should be noted, that the new implementation makes PR_SET_THP_DISABLE
master override to any per-VMA setting, which was not the case
previously.
Fixes: a0715cc226 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull misc compat stuff updates from Al Viro:
"This part is basically untangling various compat stuff. Compat
syscalls moved to their native counterparts, getting rid of quite a
bit of double-copying and/or set_fs() uses. A lot of field-by-field
copyin/copyout killed off.
- kernel/compat.c is much closer to containing just the
copyin/copyout of compat structs. Not all compat syscalls are gone
from it yet, but it's getting there.
- ipc/compat_mq.c killed off completely.
- block/compat_ioctl.c cleaned up; floppy compat ioctls moved to
drivers/block/floppy.c where they belong. Yes, there are several
drivers that implement some of the same ioctls. Some are m68k and
one is 32bit-only pmac. drivers/block/floppy.c is the only one in
that bunch that can be built on biarch"
* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
mqueue: move compat syscalls to native ones
usbdevfs: get rid of field-by-field copyin
compat_hdio_ioctl: get rid of set_fs()
take floppy compat ioctls to sodding floppy.c
ipmi: get rid of field-by-field __get_user()
ipmi: get COMPAT_IPMICTL_RECEIVE_MSG in sync with the native one
rt_sigtimedwait(): move compat to native
select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()
put_compat_rusage(): switch to copy_to_user()
sigpending(): move compat to native
getrlimit()/setrlimit(): move compat to native
times(2): move compat to native
compat_{get,put}_bitmap(): use unsafe_{get,put}_user()
fb_get_fscreeninfo(): don't bother with do_fb_ioctl()
do_sigaltstack(): lift copying to/from userland into callers
take compat_sys_old_getrlimit() to native syscall
trim __ARCH_WANT_SYS_OLD_GETRLIMIT
New helpers: kernel_waitid() and kernel_wait4(). sys_waitid(),
sys_wait4() and their compat variants switched to those. Copying
struct rusage to userland is left to syscall itself. For
compat_sys_wait4() that eliminates the use of set_fs() completely.
For compat_sys_waitid() it's still needed (for siginfo handling);
that will change shortly.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull namespace updates from Eric Biederman:
"This is a set of small fixes that were mostly stumbled over during
more significant development. This proc fix and the fix to
posix-timers are the most significant of the lot.
There is a lot of good development going on but unfortunately it
didn't quite make the merge window"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Fix unbalanced hard link numbers
signal: Make kill_proc_info static
rlimit: Properly call security_task_setrlimit
signal: Remove unused definition of sig_user_definied
ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET
ipc: Remove unused declaration of recompute_msgmni
posix-timers: Correct sanity check in posix_cpu_nsleep
sysctl: Remove dead register_sysctl_root
Modify do_prlimit to call security_task_setrlimit passing the task
whose rlimit we are changing not the tsk->group_leader.
In general this should not matter as the lsms implementing
security_task_setrlimit apparmor and selinux both examine the
task->cred to see what should be allowed on the destination task.
That task->cred is shared between tasks created with CLONE_THREAD
unless thread keyrings are in play, in which case both apparmor and
selinux create duplicate security contexts.
So the only time when it will matter which thread is passed to
security_task_setrlimit is if one of the threads of a process performs
an operation that changes only it's credentials. At which point if a
thread has done that we don't want to hide that information from the
lsms.
So fix the call of security_task_setrlimit. With the removal
of tsk->group_leader this makes the code slightly faster,
more comprehensible and maintainable.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When SELinux was first added to the kernel, a process could only get
and set its own resource limits via getrlimit(2) and setrlimit(2), so no
MAC checks were required for those operations, and thus no security hooks
were defined for them. Later, SELinux introduced a hook for setlimit(2)
with a check if the hard limit was being changed in order to be able to
rely on the hard limit value as a safe reset point upon context
transitions.
Later on, when prlimit(2) was added to the kernel with the ability to get
or set resource limits (hard or soft) of another process, LSM/SELinux was
not updated other than to pass the target process to the setrlimit hook.
This resulted in incomplete control over both getting and setting the
resource limits of another process.
Add a new security_task_prlimit() hook to the check_prlimit_permission()
function to provide complete mediation. The hook is only called when
acting on another task, and only if the existing DAC/capability checks
would allow access. Pass flags down to the hook to indicate whether the
prlimit(2) call will read, write, or both read and write the resource
limits of the target process.
The existing security_task_setrlimit() hook is left alone; it continues
to serve a purpose in supporting the ability to make decisions based on
the old and/or new resource limit values when setting limits. This
is consistent with the DAC/capability logic, where
check_prlimit_permission() performs generic DAC/capability checks for
acting on another task, while do_prlimit() performs a capability check
based on a comparison of the old and new resource limits. Fix the
inline documentation for the hook to match the code.
Implement the new hook for SELinux. For setting resource limits, we
reuse the existing setrlimit permission. Note that this does overload
the setrlimit permission to mean the ability to set the resource limit
(soft or hard) of another process or the ability to change one's own
hard limit. For getting resource limits, a new getrlimit permission
is defined. This was not originally defined since getrlimit(2) could
only be used to obtain a process' own limits.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <james.l.morris@oracle.com>
We are going to split <linux/sched/task.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/task.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/stat.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/stat.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/coredump.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/coredump.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
The APIs that are going to be moved first are:
mm_alloc()
__mmdrop()
mmdrop()
mmdrop_async_fn()
mmdrop_async()
mmget_not_zero()
mmput()
mmput_async()
get_task_mm()
mm_access()
mm_release()
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/autogroup.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/autogroup.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/loadavg.h> out of <linux/sched.h>, which
will have to be picked up from a couple of .c files.
Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull namespace updates from Eric Biederman:
"There is a lot here. A lot of these changes result in subtle user
visible differences in kernel behavior. I don't expect anything will
care but I will revert/fix things immediately if any regressions show
up.
From Seth Forshee there is a continuation of the work to make the vfs
ready for unpriviled mounts. We had thought the previous changes
prevented the creation of files outside of s_user_ns of a filesystem,
but it turns we missed the O_CREAT path. Ooops.
Pavel Tikhomirov and Oleg Nesterov worked together to fix a long
standing bug in the implemenation of PR_SET_CHILD_SUBREAPER where only
children that are forked after the prctl are considered and not
children forked before the prctl. The only known user of this prctl
systemd forks all children after the prctl. So no userspace
regressions will occur. Holding earlier forked children to the same
rules as later forked children creates a semantic that is sane enough
to allow checkpoing of processes that use this feature.
There is a long delayed change by Nikolay Borisov to limit inotify
instances inside a user namespace.
Michael Kerrisk extends the API for files used to maniuplate
namespaces with two new trivial ioctls to allow discovery of the
hierachy and properties of namespaces.
Konstantin Khlebnikov with the help of Al Viro adds code that when a
network namespace exits purges it's sysctl entries from the dcache. As
in some circumstances this could use a lot of memory.
Vivek Goyal fixed a bug with stacked filesystems where the permissions
on the wrong inode were being checked.
I continue previous work on ptracing across exec. Allowing a file to
be setuid across exec while being ptraced if the tracer has enough
credentials in the user namespace, and if the process has CAP_SETUID
in it's own namespace. Proc files for setuid or otherwise undumpable
executables are now owned by the root in the user namespace of their
mm. Allowing debugging of setuid applications in containers to work
better.
A bug I introduced with permission checking and automount is now
fixed. The big change is to mark the mounts that the kernel initiates
as a result of an automount. This allows the permission checks in sget
to be safely suppressed for this kind of mount. As the permission
check happened when the original filesystem was mounted.
Finally a special case in the mount namespace is removed preventing
unbounded chains in the mount hash table, and making the semantics
simpler which benefits CRIU.
The vfs fix along with related work in ima and evm I believe makes us
ready to finish developing and merge fully unprivileged mounts of the
fuse filesystem. The cleanups of the mount namespace makes discussing
how to fix the worst case complexity of umount. The stacked filesystem
fixes pave the way for adding multiple mappings for the filesystem
uids so that efficient and safer containers can be implemented"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc/sysctl: Don't grab i_lock under sysctl_lock.
vfs: Use upper filesystem inode in bprm_fill_uid()
proc/sysctl: prune stale dentries during unregistering
mnt: Tuck mounts under others instead of creating shadow/side mounts.
prctl: propagate has_child_subreaper flag to every descendant
introduce the walk_process_tree() helper
nsfs: Add an ioctl() to return owner UID of a userns
fs: Better permission checking for submounts
exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
vfs: open() with O_CREAT should not create inodes with unknown ids
nsfs: Add an ioctl() to return the namespace type
proc: Better ownership of files for non-dumpable tasks in user namespaces
exec: Remove LSM_UNSAFE_PTRACE_CAP
exec: Test the ptracer's saved cred to see if the tracee can gain caps
exec: Don't reset euid and egid when the tracee has CAP_SETUID
inotify: Convert to using per-namespace limits
If process forks some children when it has is_child_subreaper
flag enabled they will inherit has_child_subreaper flag - first
group, when is_child_subreaper is disabled forked children will
not inherit it - second group. So child-subreaper does not reparent
all his descendants when their parents die. Having these two
differently behaving groups can lead to confusion. Also it is
a problem for CRIU, as when we restore process tree we need to
somehow determine which descendants belong to which group and
much harder - to put them exactly to these group.
To simplify these we can add a propagation of has_child_subreaper
flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child-
subreaper to setup has_child_subreaper flag.
In common cases when process like systemd first sets itself to
be a child-subreaper and only after that forks its services, we will
have zero-length list of descendants to walk. Testing with binary
subtree of 2^15 processes prctl took < 0.007 sec and has shown close
to linear dependency(~0.2 * n * usec) on lower numbers of processes.
Moreover, I doubt someone intentionaly pre-forks the children whitch
should reparent to init before becoming subreaper, because some our
ancestor migh have had is_child_subreaper flag while forking our
sub-tree and our childs will all inherit has_child_subreaper flag,
and we have no way to influence it. And only way to check if we have
no has_child_subreaper flag is to create some childs, kill them and
see where they will reparent to.
Using walk_process_tree helper to walk subtree, thanks to Oleg! Timing
seems to be the same.
Optimize:
a) When descendant already has has_child_subreaper flag all his subtree
has it too already.
* for a) to be true need to move has_child_subreaper inheritance under
the same tasklist_lock with adding task to its ->real_parent->children
as without it process can inherit zero has_child_subreaper, then we
set 1 to it's parent flag, check that parent has no more children, and
only after child with wrong flag is added to the tree.
* Also make these inheritance more clear by using real_parent instead of
current, as on clone(CLONE_PARENT) if current has is_child_subreaper
and real_parent has no is_child_subreaper or has_child_subreaper, child
will have has_child_subreaper flag set without actually having a
subreaper in it's ancestors.
b) When some descendant is child_reaper, it's subtree is in different
pidns from us(original child-subreaper) and processes from other pidns
will never reparent to us.
So we can skip their(a,b) subtree from walk.
v2: switch to walk_process_tree() general helper, move
has_child_subreaper inheritance
v3: remove csr_descendant leftover, change current to real_parent
in has_child_subreaper inheritance
v4: small commit message fix
Fixes: ebec18a6d3 ("prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge updates from Andrew Morton:
- various misc bits
- most of MM (quite a lot of MM material is awaiting the merge of
linux-next dependencies)
- kasan
- printk updates
- procfs updates
- MAINTAINERS
- /lib updates
- checkpatch updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (123 commits)
init: reduce rootwait polling interval time to 5ms
binfmt_elf: use vmalloc() for allocation of vma_filesz
checkpatch: don't emit unified-diff error for rename-only patches
checkpatch: don't check c99 types like uint8_t under tools
checkpatch: avoid multiple line dereferences
checkpatch: don't check .pl files, improve absolute path commit log test
scripts/checkpatch.pl: fix spelling
checkpatch: don't try to get maintained status when --no-tree is given
lib/ida: document locking requirements a bit better
lib/rbtree.c: fix typo in comment of ____rb_erase_color
lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM
MAINTAINERS: add drm and drm/i915 irc channels
MAINTAINERS: add "C:" for URI for chat where developers hang out
MAINTAINERS: add drm and drm/i915 bug filing info
MAINTAINERS: add "B:" for URI where to file bugs
get_maintainer: look for arbitrary letter prefixes in sections
printk: add Kconfig option to set default console loglevel
printk/sound: handle more message headers
printk/btrfs: handle more message headers
printk/kdb: handle more message headers
...
This limitation came with the reason to remove "another way for
malicious code to obscure a compromised program and masquerade as a
benign process" by allowing "security-concious program can use this
prctl once during its early initialization to ensure the prctl cannot
later be abused for this purpose":
http://marc.info/?l=linux-kernel&m=133160684517468&w=2
This explanation doesn't look sufficient. The only thing "exe" link is
indicating is the file, used to execve, which is basically nothing and
not reliable immediately after process has returned from execve system
call.
Moreover, to use this feture, all the mappings to previous exe file have
to be unmapped and all the new exe file permissions must be satisfied.
Which means, that changing exe link is very similar to calling execve on
the binary.
The need to remove this limitations comes from migration of NFS mount
point, which is not accessible during restore and replaced by other file
system. Because of this exe link has to be changed twice.
[akpm@linux-foundation.org: fix up comment]
Link: http://lkml.kernel.org/r/20160927153755.9337.69650.stgit@localhost.localdomain
Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some embedded systems have no use for them. This removes about
25KB from the kernel binary size when configured out.
Corresponding syscalls are routed to a stub logging the attempt to
use those syscalls which should be enough of a clue if they were
disabled without proper consideration. They are: timer_create,
timer_gettime: timer_getoverrun, timer_settime, timer_delete,
clock_adjtime, setitimer, getitimer, alarm.
The clock_settime, clock_gettime, clock_getres and clock_nanosleep
syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME,
CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast
majority of use cases with very little code.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: linux-kbuild@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Michal Marek <mmarek@suse.com>
Cc: Edward Cree <ecree@solarflare.com>
Link: http://lkml.kernel.org/r/1478841010-28605-7-git-send-email-nicolas.pitre@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
PR_SET_THP_DISABLE requires mmap_sem for write. If the waiting task
gets killed by the oom killer it would block oom_reaper from
asynchronous address space reclaim and reduce the chances of timely OOM
resolving. Wait for the lock in the killable mode and return with EINTR
if the task got killed while waiting.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>