Pull CPU hotplug fix from Thomas Gleixner:
"A single fix preventing the concurrent execution of the CPU hotplug
callback install/invocation machinery. Long standing bug caused by a
massive brain slip of that Gleixner dude, which went unnoticed for
almost a year"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Serialize callback invocations proper
Pull perf fixes from Thomas Gleixner:
"A set of perf related fixes:
- fix a CR4.PCE propagation issue caused by usage of mm instead of
active_mm and therefore propagated the wrong value.
- perf core fixes, which plug a use-after-free issue and make the
event inheritance on fork more robust.
- a tooling fix for symbol handling"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf symbols: Fix symbols__fixup_end heuristic for corner cases
x86/perf: Clarify why x86_pmu_event_mapped() isn't racy
x86/perf: Fix CR4.PCE propagation to use active_mm instead of mm
perf/core: Better explain the inherit magic
perf/core: Simplify perf_event_free_task()
perf/core: Fix event inheritance on fork()
perf/core: Fix use-after-free in perf_release()
Pull scheduler fixes from Thomas Gleixner:
"From the scheduler departement:
- a bunch of sched deadline related fixes which deal with various
buglets and corner cases.
- two fixes for the loadavg spikes which are caused by the delayed
NOHZ accounting"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Use deadline instead of period when calculating overflow
sched/deadline: Throttle a constrained deadline task activated after the deadline
sched/deadline: Make sure the replenishment timer fires in the next period
sched/loadavg: Use {READ,WRITE}_ONCE() for sample window
sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
sched/deadline: Add missing update_rq_clock() in dl_task_timer()
Pull locking fixes from Thomas Gleixner:
"Three fixes related to locking:
- fix a SIGKILL issue for RWSEM_GENERIC_SPINLOCK which has been fixed
for the XCHGADD variant already
- plug a potential use after free in the futex code
- prevent leaking a held spinlock in an futex error handling code
path"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=y
futex: Add missing error handling to FUTEX_REQUEUE_PI
futex: Fix potential use-after-free in FUTEX_REQUEUE_PI
Commit bfc8c90139 ("mem-hotplug: implement get/put_online_mems")
introduced new functions get/put_online_mems() and mem_hotplug_begin/end()
in order to allow similar semantics for memory hotplug like for cpu
hotplug.
The corresponding functions for cpu hotplug are get/put_online_cpus()
and cpu_hotplug_begin/done() for cpu hotplug.
The commit however missed to introduce functions that would serialize
memory hotplug operations like they are done for cpu hotplug with
cpu_maps_update_begin/done().
This basically leaves mem_hotplug.active_writer unprotected and allows
concurrent writers to modify it, which may lead to problems as outlined
by commit f931ab479d ("mm: fix devm_memremap_pages crash, use
mem_hotplug_{begin, done}").
That commit was extended again with commit b5d24fda9c ("mm,
devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
done}") which serializes memory hotplug operations for some call sites
by using the device_hotplug lock.
In addition with commit 3fc2192410 ("mm: validate device_hotplug is held
for memory hotplug") a sanity check was added to mem_hotplug_begin() to
verify that the device_hotplug lock is held.
This in turn triggers the following warning on s390:
WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58
Call Trace:
assert_held_device_hotplug+0x40/0x58)
mem_hotplug_begin+0x34/0xc8
add_memory_resource+0x7e/0x1f8
add_memory+0xda/0x130
add_memory_merged+0x15c/0x178
sclp_detect_standby_memory+0x2ae/0x2f8
do_one_initcall+0xa2/0x150
kernel_init_freeable+0x228/0x2d8
kernel_init+0x2a/0x140
kernel_thread_starter+0x6/0xc
One possible fix would be to add more lock_device_hotplug() and
unlock_device_hotplug() calls around each call site of
mem_hotplug_begin/end(). But that would give the device_hotplug lock
additional semantics it better should not have (serialize memory hotplug
operations).
Instead add a new memory_add_remove_lock which has the similar semantics
like cpu_add_remove_lock for cpu hotplug.
To keep things hopefully a bit easier the lock will be locked and unlocked
within the mem_hotplug_begin/end() functions.
Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.com
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While going through the event inheritance code Oleg got confused.
Add some comments to better explain the silent dissapearance of
orphaned events.
So what happens is that at perf_event_release_kernel() time; when an
event looses its connection to userspace (and ceases to exist from the
user's perspective) we can still have an arbitrary amount of inherited
copies of the event. We want to synchronously find and remove all
these child events.
Since that requires a bit of lock juggling, there is the possibility
that concurrent clone()s will create new child events. Therefore we
first mark the parent event as DEAD, which marks all the extant child
events as orphaned.
We then avoid copying orphaned events; in order to avoid getting more
of them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20170316125823.289567442@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I was testing Daniel's changes with his test case, and tweaked it a
little. Instead of having the runtime equal to the deadline, I
increased the deadline ten fold.
Daniel's test case had:
attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */
attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */
attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */
To make it more interesting, I changed it to:
attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */
attr.sched_deadline = 20 * 1000 * 1000; /* 20 ms */
attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */
The results were rather surprising. The behavior that Daniel's patch
was fixing came back. The task started using much more than .1% of the
CPU. More like 20%.
Looking into this I found that it was due to the dl_entity_overflow()
constantly returning true. That's because it uses the relative period
against relative runtime vs the absolute deadline against absolute
runtime.
runtime / (deadline - t) > dl_runtime / dl_period
There's even a comment mentioning this, and saying that when relative
deadline equals relative period, that the equation is the same as using
deadline instead of period. That comment is backwards! What we really
want is:
runtime / (deadline - t) > dl_runtime / dl_deadline
We care about if the runtime can make its deadline, not its period. And
then we can say "when the deadline equals the period, the equation is
the same as using dl_period instead of dl_deadline".
After correcting this, now when the task gets enqueued, it can throttle
correctly, and Daniel's fix to the throttling of sleeping deadline
tasks works even when the runtime and deadline are not the same.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
During the activation, CBS checks if it can reuse the current task's
runtime and period. If the deadline of the task is in the past, CBS
cannot use the runtime, and so it replenishes the task. This rule
works fine for implicit deadline tasks (deadline == period), and the
CBS was designed for implicit deadline tasks. However, a task with
constrained deadline (deadine < period) might be awakened after the
deadline, but before the next period. In this case, replenishing the
task would allow it to run for runtime / deadline. As in this case
deadline < period, CBS enables a task to run for more than the
runtime / period. In a very loaded system, this can cause a domino
effect, making other tasks miss their deadlines.
To avoid this problem, in the activation of a constrained deadline
task after the deadline but before the next period, throttle the
task and set the replenishing timer to the begin of the next period,
unless it is boosted.
Reproducer:
--------------- %< ---------------
int main (int argc, char **argv)
{
int ret;
int flags = 0;
unsigned long l = 0;
struct timespec ts;
struct sched_attr attr;
memset(&attr, 0, sizeof(attr));
attr.size = sizeof(attr);
attr.sched_policy = SCHED_DEADLINE;
attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */
attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */
attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */
ts.tv_sec = 0;
ts.tv_nsec = 2000 * 1000; /* 2 ms */
ret = sched_setattr(0, &attr, flags);
if (ret < 0) {
perror("sched_setattr");
exit(-1);
}
for(;;) {
/* XXX: you may need to adjust the loop */
for (l = 0; l < 150000; l++);
/*
* The ideia is to go to sleep right before the deadline
* and then wake up before the next period to receive
* a new replenishment.
*/
nanosleep(&ts, NULL);
}
exit(0);
}
--------------- >% ---------------
On my box, this reproducer uses almost 50% of the CPU time, which is
obviously wrong for a task with 2/2000 reservation.
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the replenishment timer is set to fire at the deadline
of a task. Although that works for implicit deadline tasks because the
deadline is equals to the begin of the next period, that is not correct
for constrained deadline tasks (deadline < period).
For instance:
f.c:
--------------- %< ---------------
int main (void)
{
for(;;);
}
--------------- >% ---------------
# gcc -o f f.c
# trace-cmd record -e sched:sched_switch \
-e syscalls:sys_exit_sched_setattr \
chrt -d --sched-runtime 490000000 \
--sched-deadline 500000000 \
--sched-period 1000000000 0 ./f
# trace-cmd report | grep "{pid of ./f}"
After setting parameters, the task is replenished and continue running
until being throttled:
f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0
The task is throttled after running 492318 ms, as expected:
f-11295 [003] 13322.606094: sched_switch: f:11295 [-1] R ==> watchdog/3:32 [0]
But then, the task is replenished 500719 ms after the first
replenishment:
<idle>-0 [003] 13322.614495: sched_switch: swapper/3:0 [120] R ==> f:11295 [-1]
Running for 490277 ms:
f-11295 [003] 13323.104772: sched_switch: f:11295 [-1] R ==> swapper/3:0 [120]
Hence, in the first period, the task runs 2 * runtime, and that is a bug.
During the first replenishment, the next deadline is set one period away.
So the runtime / period starts to be respected. However, as the second
replenishment took place in the wrong instant, the next replenishment
will also be held in a wrong instant of time. Rather than occurring in
the nth period away from the first activation, it is taking place
in the (nth period - relative deadline).
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luca Abeni <luca.abeni@santannapisa.it>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
the pending sample window time on exit, setting the next update not
one window into the future, but two.
This situation on exiting NO_HZ is described by:
this_rq->calc_load_update < jiffies < calc_load_update
In this scenario, what we should be doing is:
this_rq->calc_load_update = calc_load_update [ next window ]
But what we actually do is:
this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ]
This has the effect of delaying load average updates for potentially
up to ~9seconds.
This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.
It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zero'd when the idle load was folded in calc_global_load().
This issue is easy to reproduce before,
commit 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")
just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following warning can be triggered by hot-unplugging the CPU
on which an active SCHED_DEADLINE task is running on:
------------[ cut here ]------------
WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40
rq->clock_update_flags < RQCF_ACT_SKIP
CPU: 7 PID: 0 Comm: swapper/7 Tainted: G B 4.11.0-rc1+ #24
Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
Call Trace:
<IRQ>
dump_stack+0x85/0xc4
__warn+0x172/0x1b0
warn_slowpath_fmt+0xb4/0xf0
? __warn+0x1b0/0x1b0
? debug_check_no_locks_freed+0x2c0/0x2c0
? cpudl_set+0x3d/0x2b0
replenish_dl_entity+0x71e/0xc40
enqueue_task_dl+0x2ea/0x12e0
? dl_task_timer+0x777/0x990
? __hrtimer_run_queues+0x270/0xa50
dl_task_timer+0x316/0x990
? enqueue_task_dl+0x12e0/0x12e0
? enqueue_task_dl+0x12e0/0x12e0
__hrtimer_run_queues+0x270/0xa50
? hrtimer_cancel+0x20/0x20
? hrtimer_interrupt+0x119/0x600
hrtimer_interrupt+0x19c/0x600
? trace_hardirqs_off+0xd/0x10
local_apic_timer_interrupt+0x74/0xe0
smp_apic_timer_interrupt+0x76/0xa0
apic_timer_interrupt+0x93/0xa0
The DL task will be migrated to a suitable later deadline rq once the DL
timer fires and currnet rq is offline. The rq clock of the new rq should
be updated. This patch fixes it by updating the rq clock after holding
the new rq's rq lock.
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking fixes from David Miller:
1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver,
from Steffen Klassert.
2) Fix crashes when user tries to get_next_key on an LPM bpf map, from
Alexei Starovoitov.
3) Fix detection of VLAN fitlering feature for bnx2x VF devices, from
Michal Schmidt.
4) We can get a divide by zero when TCP socket are morphed into
listening state, fix from Eric Dumazet.
5) Fix socket refcounting bugs in skb_complete_wifi_ack() and
skb_complete_tx_timestamp(). From Eric Dumazet.
6) Use after free in dccp_feat_activate_values(), also from Eric
Dumazet.
7) Like bonding team needs to use ETH_MAX_MTU as netdev->max_mtu, from
Jarod Wilson.
8) Fix use after free in vrf_xmit(), from David Ahern.
9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from
Alexey Kodanev.
10) Properly check napi_complete_done() return value in order to decide
whether to re-enable IRQs or not in amd-xgbe driver, from Thomas
Lendacky.
11) Fix double free of hwmon device in marvell phy driver, from Andrew
Lunn.
12) Don't crash on malformed netlink attributes in act_connmark, from
Etienne Noss.
13) Don't remove routes with a higher metric in ipv6 ECMP route replace,
from Sabrina Dubroca.
14) Don't write into a cloned SKB in ipv6 fragmentation handling, from
Florian Westphal.
15) Fix routing redirect races in dccp and tcp, basically the ICMP
handler can't modify the socket's cached route in it's locked by the
user at this moment. From Jon Maxwell.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits)
qed: Enable iSCSI Out-of-Order
qed: Correct out-of-bound access in OOO history
qed: Fix interrupt flags on Rx LL2
qed: Free previous connections when releasing iSCSI
qed: Fix mapping leak on LL2 rx flow
qed: Prevent creation of too-big u32-chains
qed: Align CIDs according to DORQ requirement
mlxsw: reg: Fix SPVMLR max record count
mlxsw: reg: Fix SPVM max record count
net: Resend IGMP memberships upon peer notification.
dccp: fix memory leak during tear-down of unsuccessful connection request
tun: fix premature POLLOUT notification on tun devices
dccp/tcp: fix routing redirect race
ucc/hdlc: fix two little issue
vxlan: fix ovs support
net: use net->count to check whether a netns is alive or not
bridge: drop netfilter fake rtable unconditionally
ipv6: avoid write to a possibly cloned skb
net: wimax/i2400m: fix NULL-deref at probe
isdn/gigaset: fix NULL-deref at probe
...
Pull workqueue fix from Tejun Heo:
"If a delayed work is queued with NULL @wq, workqueue code explodes
after the timer expires at which point it's difficult to tell who the
culprit was.
This actually happened and the offender was net/smc this time.
Add an explicit sanity check for it in the queueing path"
* 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: trigger WARN if queue_delayed_work() is called with NULL @wq
The setup/remove_state/instance() functions in the hotplug core code are
serialized against concurrent CPU hotplug, but unfortunately not serialized
against themself.
As a consequence a concurrent invocation of these function results in
corruption of the callback machinery because two instances try to invoke
callbacks on remote cpus at the same time. This results in missing callback
invocations and initiator threads waiting forever on the completion.
The obvious solution to replace get_cpu_online() with cpu_hotplug_begin()
is not possible because at least one callsite calls into these functions
from a get_online_cpu() locked region.
Extend the protection scope of the cpuhp_state_mutex from solely protecting
the state arrays to cover the callback invocation machinery as well.
Fixes: 5b7aa87e04 ("cpu/hotplug: Implement setup/removal interface")
Reported-and-tested-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: hpa@zytor.com
Cc: mingo@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20170314150645.g4tdyoszlcbajmna@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull x86 fixes from Thomas Gleixner:
- a fix for the kexec/purgatory regression which was introduced in the
merge window via an innocent sparse fix. We could have reverted that
commit, but on deeper inspection it turned out that the whole
machinery is neither documented nor robust. So a proper cleanup was
done instead
- the fix for the TLB flush issue which was discovered recently
- a simple typo fix for a reboot quirk
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tlb: Fix tlb flushing when lguest clears PGE
kexec, x86/purgatory: Unbreak it and clean it up
x86/reboot/quirks: Fix typo in ASUS EeeBook X205TA reboot quirk
The purgatory code defines global variables which are referenced via a
symbol lookup in the kexec code (core and arch).
A recent commit addressing sparse warnings made these static and thereby
broke kexec_file.
Why did this happen? Simply because the whole machinery is undocumented and
lacks any form of forward declarations. The variable names are unspecific
and lack a prefix, so adding forward declarations creates shadow variables
in the core code. Aside of that the code relies on magic constants and
duplicate struct definitions with no way to ensure that these things stay
in sync. The section placement of the purgatory variables happened by
chance and not by design.
Unbreak kexec and cleanup the mess:
- Add proper forward declarations and document the usage
- Use common struct definition
- Use the proper common defines instead of magic constants
- Add a purgatory_ prefix to have a proper name space
- Use ARRAY_SIZE() instead of a homebrewn reimplementation
- Add proper sections to the purgatory variables [ From Mike ]
Fixes: 72042a8c7b ("x86/purgatory: Make functions and variables static")
Reported-by: Mike Galbraith <<efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Nicholas Mc Guire <der.herr@hofr.at>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Tobin C. Harding" <me@tobin.cc>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1703101315140.3681@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Merge fixes from Andrew Morton:
"26 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
userfaultfd: remove wrong comment from userfaultfd_ctx_get()
fat: fix using uninitialized fields of fat_inode/fsinfo_inode
sh: cayman: IDE support fix
kasan: fix races in quarantine_remove_cache()
kasan: resched in quarantine_remove_cache()
mm: do not call mem_cgroup_free() from within mem_cgroup_alloc()
thp: fix another corner case of munlock() vs. THPs
rmap: fix NULL-pointer dereference on THP munlocking
mm/memblock.c: fix memblock_next_valid_pfn()
userfaultfd: selftest: vm: allow to build in vm/ directory
userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED
userfaultfd: non-cooperative: fix fork fctx->new memleak
mm/cgroup: avoid panic when init with low memory
drivers/md/bcache/util.h: remove duplicate inclusion of blkdev.h
mm/vmstats: add thp_split_pud event for clarity
include/linux/fs.h: fix unsigned enum warning with gcc-4.2
userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete
userfaultfd: non-cooperative: robustness check
userfaultfd: non-cooperative: rollback userfaultfd_exit
x86, mm: unify exit paths in gup_pte_range()
...