[ Upstream commit 905ae01c4a ]
For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Adding a pointer to ucounts in the struct cred will
allow to track RLIMIT_NPROC not only for user in the system, but for
user in the user_namespace.
Updating ucounts may require memory allocation which may fail. So, we
cannot change cred.ucounts in the commit_creds() because this function
cannot fail and it should always return 0. For this reason, we modify
cred.ucounts before calling the commit_creds().
Changelog
v6:
* Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
error was caused by the fact that cred_alloc_blank() left the ucounts
pointer empty.
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/b37aaef28d8b9b0d757e07ba6dd27281bbe39259.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Pull powerpc updates from Michael Ellerman:
- A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting
it for powerpc, as well as a related fix for sparc.
- Remove support for PowerPC 601.
- Some fixes for watchpoints & addition of a new ptrace flag for
detecting ISA v3.1 (Power10) watchpoint features.
- A fix for kernels using 4K pages and the hash MMU on bare metal
Power9 systems with > 16TB of RAM, or RAM on the 2nd node.
- A basic idle driver for shallow stop states on Power10.
- Tweaks to our sched domains code to better inform the scheduler about
the hardware topology on Power9/10, where two SMT4 cores can be
presented by firmware as an SMT8 core.
- A series doing further reworks & cleanups of our EEH code.
- Addition of a filter for RTAS (firmware) calls done via sys_rtas(),
to prevent root from overwriting kernel memory.
- Other smaller features, fixes & cleanups.
Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe
Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn
Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero,
Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad
Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca
Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas
Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro
Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang
Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha,
Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
Yingliang, zhengbin.
* tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits)
Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"
selftests/powerpc: Fix eeh-basic.sh exit codes
cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier
powerpc/time: Make get_tb() common to PPC32 and PPC64
powerpc/time: Make get_tbl() common to PPC32 and PPC64
powerpc/time: Remove get_tbu()
powerpc/time: Avoid using get_tbl() and get_tbu() internally
powerpc/time: Make mftb() common to PPC32 and PPC64
powerpc/time: Rename mftbl() to mftb()
powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S
powerpc/32s: Rename head_32.S to head_book3s_32.S
powerpc/32s: Setup the early hash table at all time.
powerpc/time: Remove ifdef in get_dec() and set_dec()
powerpc: Remove get_tb_or_rtc()
powerpc: Remove __USE_RTC()
powerpc: Tidy up a bit after removal of PowerPC 601.
powerpc: Remove support for PowerPC 601
powerpc: Remove PowerPC 601
powerpc: Drop SYNC_601() ISYNC_601() and SYNC()
powerpc: Remove CONFIG_PPC601_SYNC_FIX
...
Pull char/misc driver updates from Greg KH:
"Here is the big set of char, misc, and other assorted driver subsystem
patches for 5.10-rc1.
There's a lot of different things in here, all over the drivers/
directory. Some summaries:
- soundwire driver updates
- habanalabs driver updates
- extcon driver updates
- nitro_enclaves new driver
- fsl-mc driver and core updates
- mhi core and bus updates
- nvmem driver updates
- eeprom driver updates
- binder driver updates and fixes
- vbox minor bugfixes
- fsi driver updates
- w1 driver updates
- coresight driver updates
- interconnect driver updates
- misc driver updates
- other minor driver updates
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (396 commits)
binder: fix UAF when releasing todo list
docs: w1: w1_therm: Fix broken xref, mistakes, clarify text
misc: Kconfig: fix a HISI_HIKEY_USB dependency
LSM: Fix type of id parameter in kernel_post_load_data prototype
misc: Kconfig: add a new dependency for HISI_HIKEY_USB
firmware_loader: fix a kernel-doc markup
w1: w1_therm: make w1_poll_completion static
binder: simplify the return expression of binder_mmap
test_firmware: Test partial read support
firmware: Add request_partial_firmware_into_buf()
firmware: Store opt_flags in fw_priv
fs/kernel_file_read: Add "offset" arg for partial reads
IMA: Add support for file reads without contents
LSM: Add "contents" flag to kernel_read_file hook
module: Call security_kernel_post_load_data()
firmware_loader: Use security_post_load_data()
LSM: Introduce kernel_post_load_data() hook
fs/kernel_read_file: Add file_size output argument
fs/kernel_read_file: Switch buffer size arg to size_t
fs/kernel_read_file: Remove redundant size argument
...
Grab actual references to the files_struct. To avoid circular references
issues due to this, we add a per-task note that keeps track of what
io_uring contexts a task has used. When the tasks execs or exits its
assigned files, we cancel requests based on this tracking.
With that, we can grab proper references to the files table, and no
longer need to rely on stashing away ring_fd and ring_file to check
if the ring_fd may have been closed.
Cc: stable@vger.kernel.org # v5.5+
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.
This is similar to commit 38cf307c1f ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).
Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:
call_usermodehelper()
kernel_execve()
old_mm = current->mm;
active_mm = current->active_mm;
*** preempt *** --------------------> schedule()
prev->active_mm = NULL;
mmdrop(prev active_mm);
...
<-------------------- schedule()
current->mm = mm;
current->active_mm = mm;
if (!old_mm)
mmdrop(active_mm);
If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.
Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().
So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.
This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com
The path_noexec() check, like the regular file check, was happening too
late, letting LSMs see impossible execve()s. Check it earlier as well in
may_open() and collect the redundant fs/exec.c path_noexec() test under
the same robustness comment as the S_ISREG() check.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
...
do_filp_open(dfd, filename, open_flags)
path_openat(nameidata, open_flags, flags)
file = alloc_empty_file(open_flags, current_cred());
do_open(nameidata, file, open_flags)
may_open(path, acc_mode, open_flag)
/* new location of MAY_EXEC vs path_noexec() test */
inode_permission(inode, MAY_OPEN | acc_mode)
security_inode_permission(inode, acc_mode)
vfs_open(path, file)
do_dentry_open(file, path->dentry->d_inode, open)
security_file_open(f)
open()
/* old location of path_noexec() test */
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-4-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The execve(2)/uselib(2) syscalls have always rejected non-regular files.
Recently, it was noticed that a deadlock was introduced when trying to
execute pipes, as the S_ISREG() test was happening too late. This was
fixed in commit 73601ea5b7 ("fs/open.c: allow opening only regular files
during execve()"), but it was added after inode_permission() had already
run, which meant LSMs could see bogus attempts to execute non-regular
files.
Move the test into the other inode type checks (which already look for
other pathological conditions[1]). Since there is no need to use
FMODE_EXEC while we still have access to "acc_mode", also switch the test
to MAY_EXEC.
Also include a comment with the redundant S_ISREG() checks at the end of
execve(2)/uselib(2) to note that they are present to avoid any mistakes.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
...
do_filp_open(dfd, filename, open_flags)
path_openat(nameidata, open_flags, flags)
file = alloc_empty_file(open_flags, current_cred());
do_open(nameidata, file, open_flags)
may_open(path, acc_mode, open_flag)
/* new location of MAY_EXEC vs S_ISREG() test */
inode_permission(inode, MAY_OPEN | acc_mode)
security_inode_permission(inode, acc_mode)
vfs_open(path, file)
do_dentry_open(file, path->dentry->d_inode, open)
/* old location of FMODE_EXEC vs S_ISREG() test */
security_file_open(f)
open()
[1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To allow the kernel not to play games with set_fs to call exec
implement kernel_execve. The function kernel_execve takes pointers
into kernel memory and copies the values pointed to onto the new
userspace stack.
The calls with arguments from kernel space of do_execve are replaced
with calls to kernel_execve.
The calls do_execve and do_execveat are made static as there are now
no callers outside of exec.
The comments that mention do_execve are updated to refer to
kernel_execve or execve depending on the circumstances. In addition
to correcting the comments, this makes it easy to grep for do_execve
and verify it is not used.
Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87wo365ikj.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In preparation for implementiong kernel_execve (which will take kernel
pointers not userspace pointers) factor out bprm_stack_limits out of
prepare_arg_pages. This separates the counting which depends upon the
getting data from userspace from the calculations of the stack limits
which is usable in kernel_execve.
The remove prepare_args_pages and compute bprm->argc and bprm->envc
directly in do_execveat_common, before bprm_stack_limits is called.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/87365u6x60.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently it is necessary for the usermode helper code and the code
that launches init to use set_fs so that pages coming from the kernel
look like they are coming from userspace.
To allow that usage of set_fs to be removed cleanly the argument
copying from userspace needs to happen earlier. Factor bprm_execve
out of do_execve_common to separate out the copying of arguments
to the newe stack, and the rest of exec.
In separating bprm_execve from do_execve_common the copying
of the arguments onto the new stack happens earlier.
As the copying of the arguments does not depend any security hooks,
files, the file table, current->in_execve, current->fs->in_exec,
bprm->unsafe, or creds this is safe.
Likewise the security hook security_creds_for_exec does not depend upon
preventing the argument copying from happening.
In addition to making it possible to implement kernel_execve that
performs the copying differently, this separation of bprm_execve from
do_execve_common makes for a nice separation of responsibilities making
the exec code easier to navigate.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/878sfm6x6x.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently it is necessary for the usermode helper code and the code that
launches init to use set_fs so that pages coming from the kernel look like
they are coming from userspace.
To allow that usage of set_fs to be removed cleanly the argument copying
from userspace needs to happen earlier. Move the allocation and
initialization of bprm->mm into alloc_bprm so that the bprm->mm is
available early to store the new user stack into. This is a prerequisite
for copying argv and envp into the new user stack early before ther rest of
exec.
To keep the things consistent the cleanup of bprm->mm is moved into
free_bprm. So that bprm->mm will be cleaned up whenever bprm->mm is
allocated and free_bprm are called.
Moving bprm_mm_init earlier is safe as it does not depend on any files,
current->in_execve, current->fs->in_exec, bprm->unsafe, or the if the file
table is shared. (AKA bprm_mm_init does not depend on any of the code that
happens between alloc_bprm and where it was previously called.)
This moves bprm->mm cleanup after current->fs->in_exec is set to 0. This
is safe because current->fs->in_exec is only used to preventy taking an
additional reference on the fs_struct.
This moves bprm->mm cleanup after current->in_execve is set to 0. This is
safe because current->in_execve is only used by the lsms (apparmor and
tomoyou) and always for LSM specific functions, never for anything to do
with the mm.
This adds bprm->mm cleanup into the successful return path. This is safe
because being on the successful return path implies that begin_new_exec
succeeded and set brpm->mm to NULL. As bprm->mm is NULL bprm cleanup I am
moving into free_bprm will do nothing.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/87eepe6x7p.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently it is necessary for the usermode helper code and the code
that launches init to use set_fs so that pages coming from the kernel
look like they are coming from userspace.
To allow that usage of set_fs to be removed cleanly the argument
copying from userspace needs to happen earlier. Move the computation
of bprm->filename and possible allocation of a name in the case
of execveat into alloc_bprm to make that possible.
The exectuable name, the arguments, and the environment are
copied into the new usermode stack which is stored in bprm
until exec passes the point of no return.
As the executable name is copied first onto the usermode stack
it needs to be known. As there are no dependencies to computing
the executable name, compute it early in alloc_bprm.
As an implementation detail if the filename needs to be generated
because it embeds a file descriptor store that filename in a new field
bprm->fdpath, and free it in free_bprm. Previously this was done in
an independent variable pathbuf. I have renamed pathbuf fdpath
because fdpath is more suggestive of what kind of path is in the
variable. I moved fdpath into struct linux_binprm because it is
tightly tied to the other variables in struct linux_binprm, and as
such is needed to allow the call alloc_binprm to move.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/87k0z66x8f.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently it is necessary for the usermode helper code and the code
that launches init to use set_fs so that pages coming from the kernel
look like they are coming from userspace.
To allow that usage of set_fs to be removed cleanly the argument
copying from userspace needs to happen earlier. Move the allocation
of the bprm into it's own function (alloc_bprm) and move the call of
alloc_bprm before unshare_files so that bprm can ultimately be
allocated, the arguments can be placed on the new stack, and then the
bprm can be passed into the core of exec.
Neither the allocation of struct binprm nor the unsharing depend upon each
other so swapping the order in which they are called is trivially safe.
To keep things consistent the order of cleanup at the end of
do_execve_common swapped to match the order of initialization.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87pn8y6x9a.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>