Pull percpu consistent-ops changes from Tejun Heo:
"Way back, before the current percpu allocator was implemented, static
and dynamic percpu memory areas were allocated and handled separately
and had their own accessors. The distinction has been gone for many
years now; however, the now duplicate two sets of accessors remained
with the pointer based ones - this_cpu_*() - evolving various other
operations over time. During the process, we also accumulated other
inconsistent operations.
This pull request contains Christoph's patches to clean up the
duplicate accessor situation. __get_cpu_var() uses are replaced with
with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().
Unfortunately, the former sometimes is tricky thanks to C being a bit
messy with the distinction between lvalues and pointers, which led to
a rather ugly solution for cpumask_var_t involving the introduction of
this_cpu_cpumask_var_ptr().
This converts most of the uses but not all. Christoph will follow up
with the remaining conversions in this merge window and hopefully
remove the obsolete accessors"
* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
irqchip: Properly fetch the per cpu offset
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
Revert "powerpc: Replace __get_cpu_var uses"
percpu: Remove __this_cpu_ptr
clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
sparc: Replace __get_cpu_var uses
avr32: Replace __get_cpu_var with __this_cpu_write
blackfin: Replace __get_cpu_var uses
tile: Use this_cpu_ptr() for hardware counters
tile: Replace __get_cpu_var uses
powerpc: Replace __get_cpu_var uses
alpha: Replace __get_cpu_var
ia64: Replace __get_cpu_var uses
s390: cio driver &__get_cpu_var replacements
s390: Replace __get_cpu_var uses
mips: Replace __get_cpu_var uses
MIPS: Replace __get_cpu_var uses in FPU emulator.
arm: Replace __this_cpu_ptr with raw_cpu_ptr
...
In some cases we don't want hard lockup detection enabled by default.
An example is when running as a guest. Introduce
watchdog_enable_hardlockup_detector(bool)
allowing those cases to disable hard lockup detection. This must be
executed early by the boot processor from e.g. smp_prepare_boot_cpu, in
order to allow kernel command line arguments to override it, as well as
to avoid hard lockup detection being enabled before we've had a chance
to indicate that it's unwanted. In summary,
initial boot: default=enabled
smp_prepare_boot_cpu
watchdog_enable_hardlockup_detector(false): default=disabled
cmdline has 'nmi_watchdog=1': default=enabled
The running kernel still has the ability to enable/disable at any time
with /proc/sys/kernel/nmi_watchdog us usual. However even when the
default has been overridden /proc/sys/kernel/nmi_watchdog will initially
show '1'. To truly turn it on one must disable/enable it, i.e.
echo 0 > /proc/sys/kernel/nmi_watchdog
echo 1 > /proc/sys/kernel/nmi_watchdog
This patch will be immediately useful for KVM with the next patch of this
series. Other hypervisor guest types may find it useful as well.
[akpm@linux-foundation.org: fix build]
[dzickus@redhat.com: fix compile issues on sparc]
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For now, soft lockup detector warns once for each case of process
softlockup. But the thread 'watchdog/n' may not always get the cpu at the
time slot between the task switch of two processes hogging that cpu to
reset soft_watchdog_warn.
An example would be two processes hogging the cpu. Process A causes the
softlockup warning and is killed manually by a user. Process B
immediately becomes the new process hogging the cpu preventing the
softlockup code from resetting the soft_watchdog_warn variable.
This case is a false negative of "warn only once for a process", as there
may be a different process that is going to hog the cpu. Resolve this by
saving/checking the task pointer of the hogging process and use that to
reset soft_watchdog_warn too.
[dzickus@redhat.com: update comment]
Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most of these are the uses of &__raw_get_cpu_var for address calculation.
touch_softlockup_watchdog_sync() uses __raw_get_cpu_var to write to
per cpu variables. Use __this_cpu_write instead.
Cc: Wim Van Sebroeck <wim@iguana.be>
Cc: linux-watchdog@vger.kernel.org
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch avoids printing the message 'enabled on all CPUs,
...' multiple times. For example, the issue can occur in the
following scenario:
1) watchdog_nmi_enable() fails to enable PMU counters and sets
cpu0_err.
2) 'echo [0|1] > /proc/sys/kernel/nmi_watchdog' is executed to
disable and re-enable the watchdog mechanism 'on the fly'.
3) If watchdog_nmi_enable() succeeds to enable PMU counters,
each CPU will print the message because step1 left behind a
non-zero cpu0_err.
if (!IS_ERR(event)) {
if (cpu == 0 || cpu0_err)
pr_info("enabled on all CPUs, ...")
The patch avoids this by clearing cpu0_err in watchdog_nmi_disable().
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: pbonzini@redhat.com
Link: http://lkml.kernel.org/r/1407768567-171794-4-git-send-email-dzickus@redhat.com
[ Applied small cleanups. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This taint flag will be set if the system has ever entered a softlockup
state. Similar to TAINT_WARN it is useful to know whether or not the
system has been in a softlockup state when debugging.
[akpm@linux-foundation.org: apply the taint before calling panic()]
Signed-off-by: Josh Hunt <johunt@akamai.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A 'softlockup' is defined as a bug that causes the kernel to loop in
kernel mode for more than a predefined period to time, without giving
other tasks a chance to run.
Currently, upon detection of this condition by the per-cpu watchdog
task, debug information (including a stack trace) is sent to the system
log.
On some occasions, we have observed that the "victim" rather than the
actual "culprit" (i.e. the owner/holder of the contended resource) is
reported to the user. Often this information has proven to be
insufficient to assist debugging efforts.
To avoid loss of useful debug information, for architectures which
support NMI, this patch makes it possible to improve soft lockup
reporting. This is accomplished by issuing an NMI to each cpu to obtain
a stack trace.
If NMI is not supported we just revert back to the old method. A sysctl
and boot-time parameter is available to toggle this feature.
[dzickus@redhat.com: add CONFIG_SMP in certain areas]
[akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations]
[mq@suse.cz: fix warning]
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jan Moskyto Matejka <mq@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Wu noticed the following splat on his machine when updating
/proc/sys/kernel/watchdog_thresh:
BUG: sleeping function called from invalid context at mm/slub.c:965
in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
3 locks held by init/1:
#0: (sb_writers#3){.+.+.+}, at: [<ffffffff8117b663>] vfs_write+0x143/0x180
#1: (watchdog_proc_mutex){+.+.+.}, at: [<ffffffff810e02d3>] proc_dowatchdog+0x33/0x110
#2: (cpu_hotplug.lock){.+.+.+}, at: [<ffffffff810589c2>] get_online_cpus+0x32/0x80
Preemption disabled at:[<ffffffff810e0384>] proc_dowatchdog+0xe4/0x110
CPU: 0 PID: 1 Comm: init Not tainted 3.16.0-rc1-testing #34
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x4e/0x7a
__might_sleep+0x11d/0x190
kmem_cache_alloc_trace+0x4e/0x1e0
perf_event_alloc+0x55/0x440
perf_event_create_kernel_counter+0x26/0xe0
watchdog_nmi_enable+0x75/0x140
update_timers_all_cpus+0x53/0xa0
proc_dowatchdog+0xe4/0x110
proc_sys_call_handler+0xb3/0xc0
proc_sys_write+0x14/0x20
vfs_write+0xad/0x180
SyS_write+0x49/0xb0
system_call_fastpath+0x16/0x1b
NMI watchdog: disabled (cpu0): hardware events not enabled
What happened is after updating the watchdog_thresh, the lockup detector
is restarted to utilize the new value. Part of this process involved
disabling preemption. Once preemption was disabled, perf tried to
allocate a new event (as part of the restart). This caused the above
BUG_ON as you can't sleep with preemption disabled.
The preemption restriction seemed agressive as we are not doing anything
on that particular cpu, but with all the online cpus (which are
protected by the get_online_cpus lock). Remove the restriction and the
BUG_ON goes away.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Peter Wu <peter@lekensteyn.nl>
Tested-by: Peter Wu <peter@lekensteyn.nl>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [3.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I ran into a scenario where while one cpu was stuck and should have
panic'd because of the NMI watchdog, it didn't. The reason was another
cpu was spewing stack dumps on to the console. Upon investigation, I
noticed that when writing to the console and also when dumping the
stack, the watchdog is touched.
This causes all the cpus to reset their NMI watchdog flags and the
'stuck' cpu just spins forever.
This change causes the semantics of touch_nmi_watchdog to be changed
slightly. Previously, I accidentally changed the semantics and we
noticed there was a codepath in which touch_nmi_watchdog could be
touched from a preemtible area. That caused a BUG() to happen when
CONFIG_DEBUG_PREEMPT was enabled. I believe it was the acpi code.
My attempt here re-introduces the change to have the
touch_nmi_watchdog() code only touch the local cpu instead of all of the
cpus. But instead of using __get_cpu_var(), I use the
__raw_get_cpu_var() version.
This avoids the preemption problem. However my reasoning wasn't because
I was trying to be lazy. Instead I rationalized it as, well if
preemption is enabled then interrupts should be enabled to and the NMI
watchdog will have no reason to trigger. So it won't matter if the
wrong cpu is touched because the percpu interrupt counters the NMI
watchdog uses should still be incrementing.
Don said:
: I'm ok with this patch, though it does alter the behaviour of how
: touch_nmi_watchdog works. For the most part I don't think most callers
: need to touch all of the watchdogs (on each cpu). Perhaps a corner case
: will pop up (the scheduler?? to mimic touch_all_softlockup_watchdogs() ).
:
: But this does address an issue where if a system is locked up and one cpu
: is spewing out useful debug messages (or error messages), the hard lockup
: will fail to go off. We have seen this on RHEL also.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Ben Zhang <benzh@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
watchdog_tresh controls how often nmi perf event counter checks per-cpu
hrtimer_interrupts counter and blows up if the counter hasn't changed
since the last check. The counter is updated by per-cpu
watchdog_hrtimer hrtimer which is scheduled with 2/5 watchdog_thresh
period which guarantees that hrtimer is scheduled 2 times per the main
period. Both hrtimer and perf event are started together when the
watchdog is enabled.
So far so good. But...
But what happens when watchdog_thresh is updated from sysctl handler?
proc_dowatchdog will set a new sampling period and hrtimer callback
(watchdog_timer_fn) will use the new value in the next round. The
problem, however, is that nobody tells the perf event that the sampling
period has changed so it is ticking with the period configured when it
has been set up.
This might result in an ear ripping dissonance between perf and hrtimer
parts if the watchdog_thresh is increased. And even worse it might lead
to KABOOM if the watchdog is configured to panic on such a spurious
lockup.
This patch fixes the issue by updating both nmi perf even counter and
hrtimers if the threshold value has changed.
The nmi one is disabled and then reinitialized from scratch. This has
an unpleasant side effect that the allocation of the new event might
fail theoretically so the hard lockup detector would be disabled for
such cpus. On the other hand such a memory allocation failure is very
unlikely because the original event is deallocated right before.
It would be much nicer if we just changed perf event period but there
doesn't seem to be any API to do that right now. It is also unfortunate
that perf_event_alloc uses GFP_KERNEL allocation unconditionally so we
cannot use on_each_cpu() and do the same thing from the per-cpu context.
The update from the current CPU should be safe because
perf_event_disable removes the event atomically before it clears the
per-cpu watchdog_ev so it cannot change anything under running handler
feet.
The hrtimer is simply restarted (thanks to Don Zickus who has pointed
this out) if it is queued because we cannot rely it will fire&adopt to
the new sampling period before a new nmi event triggers (when the
treshold is decreased).
[akpm@linux-foundation.org: the UP version of __smp_call_function_single ended up in the wrong place]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Fabio Estevam <festevam@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_dowatchdog doesn't synchronize multiple callers which might lead to
confusion when two parallel callers might confuse watchdog_enable_all_cpus
resp watchdog_disable_all_cpus (eg watchdog gets enabled even if
watchdog_thresh was set to 0 already).
This patch adds a local mutex which synchronizes callers to the sysctl
handler.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have two very conflicting state variable names in the
watchdog:
* watchdog_enabled: This one reflects the user interface. It's
set to 1 by default and can be overriden with boot options
or sysctl/procfs interface.
* watchdog_disabled: This is the internal toggle state that
tells if watchdog threads, timers and NMI events are currently
running or not. This state mostly depends on the user settings.
It's a convenient state latch.
Now we really need to find clearer names because those
are just too confusing to encourage deep review.
watchdog_enabled now becomes watchdog_user_enabled to reflect
its purpose as an interface.
watchdog_disabled becomes watchdog_running to suggest its
role as a pure internal state.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Anish Singh <anish198519851985@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Don Zickus <dzickus@redhat.com>
The user activation/deactivation of the watchdog through boot parameters
or systcl is currently implemented with a dance involving kthreads parking
and unparking methods: the threads are unconditionally registered on
boot and they park as soon as the user want the watchdog to be disabled.
This method involves a few noisy details to handle though: the watchdog
kthreads may be unparked anytime due to hotplug operations, after which
the watchdog internals have to decide to park again if it is user-disabled.
As a result the setup() and unpark() methods need to be able to request a
reparking. This is not currently supported in the kthread infrastructure
so this piece of the watchdog code only works halfway.
Besides, unparking/reparking the watchdog kthreads consume unnecessary
cputime on hotplug operations when those could be simply ignored in the
first place.
As suggested by Srivatsa, let's instead only register the watchdog
threads when they are needed. This way we don't need to think about
hotplug operations and we don't burden the CPU onlining when the watchdog
is simply disabled.
Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Anish Singh <anish198519851985@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Don Zickus <dzickus@redhat.com>
Pull core locking changes from Ingo Molnar:
"The biggest change is the rwsem lock-steal improvements, both to the
assembly optimized and the spinlock based variants.
The other notable change is the clean up of the seqlock implementation
to be based on the seqcount infrastructure.
The rest is assorted smaller debuggability, cleanup and continued -rt
locking changes."
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rwsem-spinlock: Implement writer lock-stealing for better scalability
futex: Revert "futex: Mark get_robust_list as deprecated"
generic: Use raw local irq variant for generic cmpxchg
lockdep: Selftest: convert spinlock to raw spinlock
seqlock: Use seqcount infrastructure
seqlock: Remove unused functions
ntp: Make ntp_lock raw
intel_idle: Convert i7300_idle_lock to raw_spinlock
locking: Various static lock initializer fixes
lockdep: Print more info when MAX_LOCK_DEPTH is exceeded
rwsem: Implement writer lock-stealing for better scalability
lockdep: Silence warning if CONFIG_LOCKDEP isn't set
watchdog: Use local_clock for get_timestamp()
lockdep: Rename print_unlock_inbalance_bug() to print_unlock_imbalance_bug()
locking/stat: Fix a typo
Commit 8d4516904b ("watchdog: Fix CPU hotplug regression") causes an
oops or hard lockup when doing
echo 0 > /proc/sys/kernel/nmi_watchdog
echo 1 > /proc/sys/kernel/nmi_watchdog
and the kernel is booted with nmi_watchdog=1 (default)
Running laptop-mode-tools and disconnecting/connecting AC power will
cause this to trigger, making it a common failure scenario on laptops.
Instead of bailing out of watchdog_disable() when !watchdog_enabled we
can initialize the hrtimer regardless of watchdog_enabled status. This
makes it safe to call watchdog_disable() in the nmi_watchdog=0 case,
without the negative effect on the enabled => disabled => enabled case.
All these tests pass with this patch:
- nmi_watchdog=1
echo 0 > /proc/sys/kernel/nmi_watchdog
echo 1 > /proc/sys/kernel/nmi_watchdog
- nmi_watchdog=0
echo 0 > /sys/devices/system/cpu/cpu1/online
- nmi_watchdog=0
echo mem > /sys/power/state
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=51661
Cc: <stable@vger.kernel.org> # v3.7
Cc: Norbert Warmuth <nwarmuth@t-online.de>
Cc: Joseph Salisbury <joseph.salisbury@canonical.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>