commit 203cbf77de59fc8f13502dcfd11350c6d4a5c95f upstream.
If CONFIG_TIME_LOW_RES is enabled we add a jiffie to the relative timeout to
prevent short sleeps, but we do not account for that in interfaces which
retrieve the remaining time.
Helge observed that timerfd can return a remaining time larger than the
relative timeout. That's not expected and breaks userland test programs.
Store the information that the timer was armed relative and provide functions
to adjust the remaining time. To avoid bloating the hrtimer struct make state
a u8, which as a bonus results in better code on x86 at least.
Reported-and-tested-by: Helge Deller <deller@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: linux-m68k@lists.linux-m68k.org
Cc: dhowells@redhat.com
Link: http://lkml.kernel.org/r/20160114164159.273328486@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
I noticed for non-monotonic timers in timer_list, some of the
output looked a little confusing.
For example:
#1: <0000000000000000>, posix_timer_fn, S:01, hrtimer_start_range_ns, leap-a-day/2360
# expires at 1434412800000000000-1434412800000000000 nsecs [in 1434410725062375469 to 1434410725062375469 nsecs]
You'll note the relative time till the expiration "[in xxx to
yyy nsecs]" is incorrect. This is because its printing the delta
between CLOCK_MONOTONIC time to the CLOCK_REALTIME expiration.
This patch fixes this issue by adding the clock offset to the
"now" time which we use to calculate the delta.
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Eric reported that the timer_migration sysctl is not really nice
performance wise as it needs to check at every timer insertion whether
the feature is enabled or not. Further the check does not live in the
timer code, so we have an extra function call which checks an extra
cache line to figure out that it is disabled.
We can do better and store that information in the per cpu (hr)timer
bases. I pondered to use a static key, but that's a nightmare to
update from the nohz code and the timer base cache line is hot anyway
when we select a timer base.
The old logic enabled the timer migration unconditionally if
CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
line.
With this modification, we start off with migration disabled. The user
visible sysctl is still set to enabled. If the kernel switches to NOHZ
migration is enabled, if the user did not disable it via the sysctl
prior to the switch. If nohz=off is on the kernel command line,
migration stays disabled no matter what.
Before:
47.76% hog [.] main
14.84% [kernel] [k] _raw_spin_lock_irqsave
9.55% [kernel] [k] _raw_spin_unlock_irqrestore
6.71% [kernel] [k] mod_timer
6.24% [kernel] [k] lock_timer_base.isra.38
3.76% [kernel] [k] detach_if_pending
3.71% [kernel] [k] del_timer
2.50% [kernel] [k] internal_add_timer
1.51% [kernel] [k] get_nohz_timer_target
1.28% [kernel] [k] __internal_add_timer
0.78% [kernel] [k] timerfn
0.48% [kernel] [k] wake_up_nohz_cpu
After:
48.10% hog [.] main
15.25% [kernel] [k] _raw_spin_lock_irqsave
9.76% [kernel] [k] _raw_spin_unlock_irqrestore
6.50% [kernel] [k] mod_timer
6.44% [kernel] [k] lock_timer_base.isra.38
3.87% [kernel] [k] detach_if_pending
3.80% [kernel] [k] del_timer
2.67% [kernel] [k] internal_add_timer
1.33% [kernel] [k] __internal_add_timer
0.73% [kernel] [k] timerfn
0.54% [kernel] [k] wake_up_nohz_cpu
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When no timers/hrtimers are pending, the expiry time is set to a
special value: 'KTIME_MAX'. This normally happens with
NO_HZ_{IDLE|FULL} in both LOWRES/HIGHRES modes.
When 'expiry == KTIME_MAX', we either cancel the 'tick-sched' hrtimer
(NOHZ_MODE_HIGHRES) or skip reprogramming clockevent device
(NOHZ_MODE_LOWRES). But, the clockevent device is already
reprogrammed from the tick-handler for next tick.
As the clock event device is programmed in ONESHOT mode it will at
least fire one more time (unnecessarily). Timers on few
implementations (like arm_arch_timer, etc.) only support PERIODIC mode
and their drivers emulate ONESHOT over that. Which means that on these
platforms we will get spurious interrupts periodically (at last
programmed interval rate, normally tick rate).
In order to avoid spurious interrupts, the clockevent device should be
stopped or its interrupts should be masked.
A simple (yet hacky) solution to get this fixed could be: update
hrtimer_force_reprogram() to always reprogram clockevent device and
update clockevent drivers to STOP generating events (or delay it to
max time) when 'expires' is set to KTIME_MAX. But the drawback here is
that every clockevent driver has to be hacked for this particular case
and its very easy for new ones to miss this.
However, Thomas suggested to add an optional state ONESHOT_STOPPED to
solve this problem: lkml.org/lkml/2014/5/9/508.
This patch adds support for ONESHOT_STOPPED state in clockevents
core. It will only be available to drivers that implement the
state-specific callbacks instead of the legacy ->set_mode() callback.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Cc: linaro-kernel@lists.linaro.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/b8b383a03ac07b13312c16850b5106b82e4245b5.1428031396.git.viresh.kumar@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
'enum clock_event_mode' is used for two purposes today:
- to pass mode to the driver of clockevent device::set_mode().
- for managing state of the device for clockevents core.
For supporting new modes/states we have moved away from the
legacy set_mode() callback to new per-mode/state callbacks. New
modes/states shouldn't be exposed to the legacy (now OBSOLOTE)
callbacks and so we shouldn't add new states to 'enum
clock_event_mode'.
Lets have separate enums for the two use cases mentioned above.
Keep using the earlier enum for legacy set_mode() callback and
mark it OBSOLETE. And add another enum to clearly specify the
possible states of a clockevent device.
This also renames the newly added per-mode callbacks to reflect
state changes.
We haven't got rid of 'mode' member of 'struct
clock_event_device' as it is used by some of the clockevent
drivers and it would automatically die down once we migrate
those drivers to the new interface. It ('mode') is only updated
now for the drivers using the legacy interface.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: linaro-kernel@lists.linaro.org
Cc: linaro-networking@linaro.org
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/b6b0143a8a57bd58352ad35e08c25424c879c0cb.1425037853.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is not possible for the clockevents core to know which modes (other than
those with a corresponding feature flag) are supported by a particular
implementation. And drivers are expected to handle transition to all modes
elegantly, as ->set_mode() would be issued for them unconditionally.
Now, adding support for a new mode complicates things a bit if we want to use
the legacy ->set_mode() callback. We need to closely review all clockevents
drivers to see if they would break on addition of a new mode. And after such
reviews, it is found that we have to do non-trivial changes to most of the
drivers [1].
Introduce mode-specific set_mode_*() callbacks, some of which the drivers may or
may not implement. A missing callback would clearly convey the message that the
corresponding mode isn't supported.
A driver may still choose to keep supporting the legacy ->set_mode() callback,
but ->set_mode() wouldn't be supporting any new modes beyond RESUME. If a driver
wants to benefit from using a new mode, it would be required to migrate to
the mode specific callbacks.
The legacy ->set_mode() callback and the newly introduced mode-specific
callbacks are mutually exclusive. Only one of them should be supported by the
driver.
Sanity check is done at the time of registration to distinguish between optional
and required callbacks and to make error recovery and handling simpler. If the
legacy ->set_mode() callback is provided, all mode specific ones would be
ignored by the core but a warning is thrown if they are present.
Call sites calling ->set_mode() directly are also updated to use
__clockevents_set_mode() instead, as ->set_mode() may not be available anymore
for few drivers.
[1] https://lkml.org/lkml/2014/12/9/605
[2] https://lkml.org/lkml/2015/1/23/255
Suggested-by: Thomas Gleixner <tglx@linutronix.de> [2]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: linaro-kernel@lists.linaro.org
Cc: linaro-networking@linaro.org
Link: http://lkml.kernel.org/r/792d59a40423f0acffc9bb0bec9de1341a06fa02.1423788565.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When running with 4096 cores attemping to read /proc/timer_list will fail
with an ENOMEM condition. On a sufficantly large systems the total amount
of data is more then 4mb, so it won't fit into a single buffer. The
failure can also occur on smaller systems when memory fragmentation is
high as reported by Dave Jones.
Convert /proc/timer_list to a proper seq_file with its own iterator. This
is a little more complex given that we have to make two passes with two
separate headers.
sysrq_timer_list_show also needed to be updated to reflect the fact that
now timer_list_show only does one cpu at at time.
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/1364345790-14577-3-git-send-email-nzimmer@sgi.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
For the ondemand cpufreq governor, it is desired that the iowait
time is microaccounted in a similar way as idle time is.
This patch introduces the infrastructure to account and expose
this information via the get_cpu_iowait_time_us() function.
[akpm@linux-foundation.org: fix CONFIG_NO_HZ=n build]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
LKML-Reference: <20100509082523.284feab6@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The current logic which handles clock events programming failures can
increase min_delta_ns unlimited and even can cause overflows.
Sanitize it by:
- prevent zero increase when min_delta_ns == 1
- limiting min_delta_ns to a jiffie
- bail out if the jiffie limit is hit
- add retries stats for /proc/timer_list so we can gather data
Reported-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
struct cpumask will be undefined soon with CONFIG_CPUMASK_OFFSTACK=y,
to avoid them being declared on the stack.
cpumask_bits() does what we want here (of course, this code is crap).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
To: Thomas Gleixner <tglx@linutronix.de>
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
The hrtimer_interrupt hang logic adjusts min_delta_ns based on the
execution time of the hrtimer callbacks.
This is error-prone for virtual machines, where a guest vcpu can be
scheduled out during the execution of the callbacks (and the callbacks
themselves can do operations that translate to blocking operations in
the hypervisor), which in can lead to large min_delta_ns rendering the
system unusable.
Replace the current heuristics with something more reliable. Allow the
interrupt code to try 3 times to catch up with the lost time. If that
fails use the total time spent in the interrupt handler to defer the
next timer interrupt so the system can catch up with other things
which got delayed. Limit that deferment to 100ms.
The retry events and the maximum time spent in the interrupt handler
are recorded and exposed via /proc/timer_list
Inspired by a patch from Marcelo.
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: kvm@vger.kernel.org