There are two big uses of do_exit. The first is it's design use to be
the guts of the exit(2) system call. The second use is to terminate
a task after something catastrophic has happened like a NULL pointer
in kernel code.
Add a function make_task_dead that is initialy exactly the same as
do_exit to cover the cases where do_exit is called to handle
catastrophic failure. In time this can probably be reduced to just a
light wrapper around do_task_dead. For now keep it exactly the same so
that there will be no behavioral differences introducing this new
concept.
Replace all of the uses of do_exit that use it for catastraphic
task cleanup with make_task_dead to make it clear what the code
is doing.
As part of this rename rewind_stack_do_exit
rewind_stack_and_make_dead.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull exit cleanups from Eric Biederman:
"While looking at some issues related to the exit path in the kernel I
found several instances where the code is not using the existing
abstractions properly.
This set of changes introduces force_fatal_sig a way of sending a
signal and not allowing it to be caught, and corrects the misuse of
the existing abstractions that I found.
A lot of the misuse of the existing abstractions are silly things such
as doing something after calling a no return function, rolling BUG by
hand, doing more work than necessary to terminate a kernel thread, or
calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL).
In the review a deficiency in force_fatal_sig and force_sig_seccomp
where ptrace or sigaction could prevent the delivery of the signal was
found. I have added a change that adds SA_IMMUTABLE to change that
makes it impossible to interrupt the delivery of those signals, and
allows backporting to fix force_sig_seccomp
And Arnd found an issue where a function passed to kthread_run had the
wrong prototype, and after my cleanup was failing to build."
* 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
soc: ti: fix wkup_m3_rproc_boot_thread return type
signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed
signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
exit/r8188eu: Replace the macro thread_exit with a simple return 0
exit/rtl8712: Replace the macro thread_exit with a simple return 0
exit/rtl8723bs: Replace the macro thread_exit with a simple return 0
signal/x86: In emulate_vsyscall force a signal instead of calling do_exit
signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig
signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails
exit/syscall_user_dispatch: Send ordinary signals on failure
signal: Implement force_fatal_sig
exit/kthread: Have kernel threads return instead of calling do_exit
signal/s390: Use force_sigsegv in default_trap_handler
signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.
signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON
signal/sparc: In setup_tsb_params convert open coded BUG into BUG
signal/powerpc: On swapcontext failure force SIGSEGV
signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT
signal/sparc32: Remove unreachable do_exit in do_sparc_fault
...
Instead of using args[0] for the value of the last breaking event
address register, add a member to make things more obvious.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Currently, the log-level of show_stack() depends on a platform
realization. It creates situations where the headers are printed with
lower log level or higher than the stacktrace (depending on a platform or
user).
Furthermore, it forces the logic decision from user to an architecture
side. In result, some users as sysrq/kdb/etc are doing tricks with
temporary rising console_loglevel while printing their messages. And in
result it not only may print unwanted messages from other CPUs, but also
omit printing at all in the unlucky case where the printk() was deferred.
Introducing log-level parameter and KERN_UNSUPPRESSED [1] seems an easier
approach than introducing more printk buffers. Also, it will consolidate
printings with headers.
Introduce show_stack_loglvl(), that eventually will substitute
show_stack().
[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/T/#u
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Link: http://lkml.kernel.org/r/20200418201944.482088-29-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add stack name, sp and reliable information into test unwinding
results. Also consider ip outside of kernel text as failure if the
state is reported reliable.
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Avoid mixture of task == NULL and task == current meaning the same
thing and simply always initialize task with current in unwind_start.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Remove pointless stack recursion on stack type ... warning, which
only confuses people. There is no way to make backchain unwinder 100%
reliable. When a task is interrupted in-between stack frame allocation
and backchain write instructions new stack frame backchain pointer is
left uninitialized (there are also sometimes additional instruction
in-between stack frame allocation and backchain write instructions due
to gcc shrink-wrapping). In attempt to unwind such stack the unwinder
would still try to use that invalid backchain value and perform all kind
of sanity checks on it to make sure we are not pointed out of stack. In
some cases that invalid backchain value would be 0 and we would falsely
treat next stackframe as pt_regs and again gprs[15] in those pt_regs
might happen to point at some address within the task's stack.
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
There never have been distributions that shiped with CONFIG_SMP=n for
s390. In addition the kernel currently doesn't even compile with
CONFIG_SMP=n for s390. Most likely it wouldn't even work, even if we
fix the compile error, since nobody tests it, since there is no use
case that I can think of.
Therefore simply enforce CONFIG_SMP and get rid of some more or
less unused code.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Rework the dump_trace() stack unwinder interface to support different
unwinding algorithms. The new interface looks like this:
struct unwind_state state;
unwind_for_each_frame(&state, task, regs, start_stack)
do_something(state.sp, state.ip, state.reliable);
The unwind_bc.c file contains the implementation for the classic
back-chain unwinder.
One positive side effect of the new code is it now handles ftraced
functions gracefully. It prints the real name of the return function
instead of 'return_to_handler'.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
With pointer obfuscation the output of show_registers() became quite useless:
Krnl PSW : (____ptrval____) (____ptrval____) (__list_add_valid+0x98/0xa8)
In order to print the psw mask and address use %px instead of %p.
And the output looks again like this:
Krnl PSW : 0404d00180000000 00000000007c0dd0 (__list_add_valid+0x98/0xa8)
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Remove STACK_ORDER and STACK_SIZE in favour of identical THREAD_SIZE_ORDER
and THREAD_SIZE definitions. THREAD_SIZE and THREAD_SIZE_ORDER naming is
misleading since it is used as general kernel stack size information. But
both those definitions are used in the common code and throughout
architectures specific code, so changing the naming is problematic.
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
With virtually mapped kernel stacks the kernel stack overflow detection
is now fault based, every stack has a guard page in the vmalloc space.
The panic_stack is renamed to nodat_stack and is used for all function
that need to run without DAT, e.g. memcpy_real or do_start_kdump.
The main effect is a reduction in the kernel image size as with vmap
stacks the old style overflow checking that adds two instructions per
function is not needed anymore. Result from bloat-o-meter:
add/remove: 20/1 grow/shrink: 13/26854 up/down: 2198/-216240 (-214042)
In regard to performance the micro-benchmark for fork has a hit of a
few microseconds, allocating 4 pages in vmalloc space is more expensive
compare to an order-2 page allocation. But with real workload I could
not find a noticeable difference.
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Rename a couple of the struct psw_bits members so it is more obvious
for what they are good. Initially I thought using the single character
names from the PoP would be sufficient and obvious, but admittedly
that is not true.
The current implementation is not easy to use, if one has to look into
the source file to figure out which member represents the 'per' bit
(which is the 'r' member).
Therefore rename the members to sane names that are identical to the
uapi psw mask defines:
r -> per
i -> io
e -> ext
t -> dat
m -> mcheck
w -> wait
p -> pstate
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Remove raw stack dumps that are printed before call traces in case of
a warning, or the 'l' sysrq trigger (show a stack backtrace for all
active CPUs).
Besides that a raw stack dump should not be shown for the 'l' sysrq
trigger the value of the dump is close to zero. That's also why we
don't print it in case of a panic since ages anymore. That this is
still printed on warnings is just a leftover. So get rid of this
completely.
The following won't be printed anymore with this change:
Stack:
00000000bbc4fbc8 00000000bbc4fc58 0000000000000003 0000000000000000
00000000bbc4fcf8 00000000bbc4fc70 00000000bbc4fc70 0000000000000020
000000007fe00098 00000000bfe8be00 00000000bbc4fe94 000000000000000a
000000000000000c 00000000bbc4fcc0 0000000000000000 0000000000000000
000000000095b930 0000000000113366 00000000bbc4fc58 00000000bbc4fca0
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/task_stack.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use pr_cont instead of printk calls also within show_stack and
die in order to avoid extra line breaks.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>