The current code is rather complex and caused a lot of subtle
and hard to debug bugs in the past. Simplify the code by calling
the system_call handler with interrupts disabled, save
machine state, and re-enable them later.
This requires significant changes to the machine check handling code
as well. When the machine check interrupt arrived while being in kernel
mode the new code will signal pending machine checks with a SIGP external
call. When userspace was interrupted, the handler will switch to the
kernel stack and directly execute s390_handle_mcck().
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Pull s390 updates from Vasily Gorbik:
- Update maintainers. Niklas Schnelle takes over zpci and Vineeth
Vijayan common io code.
- Extend cpuinfo to include topology information.
- Add new extended counters for IBM z15 and sampling buffer allocation
rework in perf code.
- Add control over zeroing out memory during system restart.
- CCA protected key block version 2 support and other
fixes/improvements in crypto code.
- Convert to new fallthrough; annotations.
- Replace zero-length arrays with flexible-arrays.
- QDIO debugfs and other small improvements.
- Drop 2-level paging support optimization for compat tasks. Varios mm
cleanups.
- Remove broken and unused hibernate / power management support.
- Remove fake numa support which does not bring any benefits.
- Exclude offline CPUs from CPU topology masks to be more consistent
with other architectures.
- Prevent last branching instruction address leaking to userspace.
- Other small various fixes and improvements all over the code.
* tag 's390-5.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (57 commits)
s390/mm: cleanup init_new_context() callback
s390/mm: cleanup virtual memory constants usage
s390/mm: remove page table downgrade support
s390/qdio: set qdio_irq->cdev at allocation time
s390/qdio: remove unused function declarations
s390/ccwgroup: remove pm support
s390/ap: remove power management code from ap bus and drivers
s390/zcrypt: use kvmalloc instead of kmalloc for 256k alloc
s390/mm: cleanup arch_get_unmapped_area() and friends
s390/ism: remove pm support
s390/cio: use fallthrough;
s390/vfio: use fallthrough;
s390/zcrypt: use fallthrough;
s390: use fallthrough;
s390/cpum_sf: Fix wrong page count in error message
s390/diag: fix display of diagnose call statistics
s390/ap: Remove ap device suspend and resume callbacks
s390/pci: Improve handling of unset UID
s390/pci: Fix zpci_alloc_domain() over allocation
s390/qdio: pass ISC as parameter to chsc_sadc()
...
This update consolidates page table handling code. Because
there are hardly any 31-bit binaries left we do not need to
optimize for that.
No extra efforts are needed to ensure that a compat task does
not map anything above 2GB. The TASK_SIZE limit for 31-bit
tasks is 2GB already and the generic code does check that a
resulting map address would not surpass that limit.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
When userspace executes a syscall or gets interrupted,
BEAR contains a kernel address when returning to userspace.
This make it pretty easy to figure out where the kernel is
mapped even with KASLR enabled. To fix this, add lpswe to
lowcore and always execute it there, so userspace sees only
the lowcore address of lpswe. For this we have to extend
both critical_cleanup and the SWITCH_ASYNC macro to also check
for lpswe addresses in lowcore.
Fixes: b2d24b97b2 ("s390/kernel: add support for kernel address space layout randomization (KASLR)")
Cc: <stable@vger.kernel.org> # v5.2+
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
s390 math emulation was removed with commit 5a79859ae0 ("s390:
remove 31 bit support"), rendering ieee_emulation_warnings useless.
The code still built because it was protected by CONFIG_MATHEMU, which
was no longer selectable.
This patch removes the sysctl_ieee_emulation_warnings declaration and
the sysctl entry declaration.
Link: https://lkml.kernel.org/r/20200214172628.3598516-1-steve@sk2.org
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Stephen Kitt <steve@sk2.org>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
disabled_wait uses _THIS_IP_ and assumes that compiler would inline it.
Make sure this assumption is always correct by utilizing __always_inline.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This function must be inlined since any caller expects the current
stack pointer; which wouldn't be true if the function isn't inlined.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
s390_base_mcck_handler was used during system reset if diag308 set was
not available. But after commit d485235b00 ("s390: assume diag308 set
always works") is a dead code and could be removed.
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
stop_machine is the only user left of cpu_relax_yield. Given that it
now has special semantics which are tied to stop_machine introduce a
weak stop_machine_yield function which architectures can override, and
get rid of the generic cpu_relax_yield implementation.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
The stop_machine loop to advance the state machine and to wait for all
affected CPUs to check-in calls cpu_relax_yield in a tight loop until
the last missing CPUs acknowledged the state transition.
On a virtual system where not all logical CPUs are backed by real CPUs
all the time it can take a while for all CPUs to check-in. With the
current definition of cpu_relax_yield a diagnose 0x44 is done which
tells the hypervisor to schedule *some* other CPU. That can be any
CPU and not necessarily one of the CPUs that need to run in order to
advance the state machine. This can lead to a pretty bad diagnose 0x44
storm until the last missing CPU finally checked-in.
Replace the undirected cpu_relax_yield based on diagnose 0x44 with a
directed yield. Each CPU in the wait loop will pick up the next CPU
in the cpumask of stop_machine. The diagnose 0x9c is used to tell the
hypervisor to run this next CPU instead of the current one. If there
is only a limited number of real CPUs backing the virtual CPUs we
end up with the real CPUs passed around in a round-robin fashion.
[heiko.carstens@de.ibm.com]:
Use cpumask_next_wrap as suggested by Peter Zijlstra.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
The disabled_wait() function uses its argument as the PSW address when
it stops the CPU with a wait PSW that is disabled for interrupts.
The different callers sometimes use a specific number like 0xdeadbeef
to indicate a specific failure, the early boot code uses 0 and some
other calls sites use __builtin_return_address(0).
At the time a dump is created the current PSW and the registers of a
CPU are written to lowcore to make them avaiable to the dump analysis
tool. For a CPU stopped with disabled_wait the PSW and the registers
do not really make sense together, the PSW address does not point to
the function the registers belong to.
Simplify disabled_wait() by using _THIS_IP_ for the PSW address and
drop the argument to the function.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Rework the dump_trace() stack unwinder interface to support different
unwinding algorithms. The new interface looks like this:
struct unwind_state state;
unwind_for_each_frame(&state, task, regs, start_stack)
do_something(state.sp, state.ip, state.reliable);
The unwind_bc.c file contains the implementation for the classic
back-chain unwinder.
One positive side effect of the new code is it now handles ftraced
functions gracefully. It prints the real name of the return function
instead of 'return_to_handler'.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
clang fails to use the %O and %R inline assembly modifiers
the same way as gcc, leading to build failures with every use
of __load_psw_mask():
/tmp/nmi-4a9f80.s: Assembler messages:
/tmp/nmi-4a9f80.s:571: Error: junk at end of line: `+8(160(%r11))'
/tmp/nmi-4a9f80.s:626: Error: junk at end of line: `+8(160(%r11))'
Replace these with a more conventional way of passing the addresses
that should work with both clang and gcc.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The CALL_ON_STACK helper currently does not work with clang and for
calls without arguments. It does not initialize r2 although the constraint
is "+&d". Rework the CALL_FMT_x and the CALL_ON_STACK macros to work
with clang and produce optimal code in all cases.
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The __no_sanitize_address_or_inline and __no_kasan_or_inline defines
are almost identical. The only difference is that __no_kasan_or_inline
does not have the 'notrace' attribute.
To be able to replace __no_sanitize_address_or_inline with the older
definition, add 'notrace' to __no_kasan_or_inline and change to two
users of __no_sanitize_address_or_inline in the s390 code.
The 'notrace' option is necessary for e.g. the __load_psw_mask function
in arch/s390/include/asm/processor.h. Without the option it is possible
to trace __load_psw_mask which leads to kernel stack overflow.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Pointed-out-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some functions from both arch/s390/kernel/ipl.c and
arch/s390/kernel/machine_kexec.c are called without DAT enabled
(or with and without DAT enabled code paths). There is no easy way
to partially disable kasan for those files without a substantial
rework. Disable kasan for both files for now.
To avoid disabling kasan for arch/s390/kernel/diag.c DAT flag is
enabled in diag308 call. pcpu_delegate which disables DAT is marked
with __no_sanitize_address to disable instrumentation for that one
function.
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
smp_start_secondary function is called without DAT enabled. To avoid
disabling kasan instrumentation for entire arch/s390/kernel/smp.c
smp_start_secondary has been split in 2 parts. smp_start_secondary has
instrumentation disabled, it does minimal setup and enables DAT. Then
instrumentated __smp_start_secondary is called to do the rest.
__load_psw_mask function instrumentation has been disabled as well
to be able to call it from smp_start_secondary.
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Remove STACK_ORDER and STACK_SIZE in favour of identical THREAD_SIZE_ORDER
and THREAD_SIZE definitions. THREAD_SIZE and THREAD_SIZE_ORDER naming is
misleading since it is used as general kernel stack size information. But
both those definitions are used in the common code and throughout
architectures specific code, so changing the naming is problematic.
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
With virtually mapped kernel stacks the kernel stack overflow detection
is now fault based, every stack has a guard page in the vmalloc space.
The panic_stack is renamed to nodat_stack and is used for all function
that need to run without DAT, e.g. memcpy_real or do_start_kdump.
The main effect is a reduction in the kernel image size as with vmap
stacks the old style overflow checking that adds two instructions per
function is not needed anymore. Result from bloat-o-meter:
add/remove: 20/1 grow/shrink: 13/26854 up/down: 2198/-216240 (-214042)
In regard to performance the micro-benchmark for fork has a hit of a
few microseconds, allocating 4 pages in vmalloc space is more expensive
compare to an order-2 page allocation. But with real workload I could
not find a noticeable difference.
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Define TIF_ISOLATE_BP and TIF_ISOLATE_BP_GUEST and add the necessary
plumbing in entry.S to be able to run user space and KVM guests with
limited branch prediction.
To switch a user space process to limited branch prediction the
s390_isolate_bp() function has to be call, and to run a vCPU of a KVM
guest associated with the current task with limited branch prediction
call s390_isolate_bp_guest().
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add the PPA instruction to the system entry and exit path to switch
the kernel to a different branch prediction behaviour. The instructions
are added via CPU alternatives and can be disabled with the "nospec"
or the "nobp=0" kernel parameter. If the default behaviour selected
with CONFIG_KERNEL_NOBP is set to "n" then the "nobp=1" parameter can be
used to enable the changed kernel branch prediction.
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>