Pull timer core changes from Ingo Molnar:
"Continued cleanups of the core time and NTP code, plus more nohz work
preparing for tick-less userspace execution."
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Rework timekeeping functions to take timekeeper ptr as argument
time: Move xtime_nsec adjustment underflow handling timekeeping_adjust
time: Move arch_gettimeoffset() usage into timekeeping_get_ns()
time: Refactor accumulation of nsecs to secs
time: Condense timekeeper.xtime into xtime_sec
time: Explicitly use u32 instead of int for shift values
time: Whitespace cleanups per Ingo%27s requests
nohz: Move next idle expiry time record into idle logic area
nohz: Move ts->idle_calls incrementation into strict idle logic
nohz: Rename ts->idle_tick to ts->last_tick
nohz: Make nohz API agnostic against idle ticks cputime accounting
nohz: Separate idle sleeping time accounting from nohz logic
timers: Improve get_next_timer_interrupt()
timers: Add accounting of non deferrable timers
timers: Consolidate base->next_timer update
timers: Create detach_if_pending() and use it
Pull RCU changes from Ingo Molnar:
"Quoting from Paul, the major features of this series are:
1. Preventing latency spikes of more than 200 microseconds for
kernels built with NR_CPUS=4096, which is reportedly becoming the
default for some distros. This is a first step, as it does not
help with systems that actually -have- 4096 CPUs (work on this case
is in progress, but is not yet ready for mainline).
This category also includes improving concurrency of rcu_barrier(),
placed here due to conflicts. Posted to LKML at:
https://lkml.org/lkml/2012/6/22/381
Note that patches 18-22 of that series have been defered to 3.7, as
they have not yet proven themselves to be mainline-ready (and yes,
these are the ones intended to get rid of RCU's latency spikes for
systems that actually have 4096 CPUs).
2. Updates to documentation and rcutorture fixes, the latter category
including improvements to rcu_barrier() testing. Posted to LKML at
http://lkml.indiana.edu/hypermail/linux/kernel/1206.1/04094.html.
3. Miscellaneous fixes posted to LKML at:
https://lkml.org/lkml/2012/6/22/500
with the exception of the last commit, which was posted here:
http://www.gossamer-threads.com/lists/linux/kernel/1561830
4. RCU_FAST_NO_HZ fixes and improvements. Posted to LKML at:
http://lkml.indiana.edu/hypermail/linux/kernel/1206.1/00006.html
http://www.gossamer-threads.com/lists/linux/kernel/1561833
The first four patches of the first series went into 3.5 to fix a
regression.
5. Code-style fixes. These were posted to LKML at
http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/01180.html
http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/01181.html"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
rcu: Fix broken strings in RCU's source code.
rcu: Fix code-style issues involving "else"
rcu: Introduce check for callback list/count mismatch
rcu: Make RCU_FAST_NO_HZ respect nohz= boot parameter
rcu: Fix qlen_lazy breakage
rcu: Round FAST_NO_HZ lazy timeout to nearest second
rcu: The rcu_needs_cpu() function is not a quiescent state
rcu: Dump only the current CPU's buffers for idle-entry/exit warnings
rcu: Add check for CPUs going offline with callbacks queued
rcu: Disable preemption in rcu_blocking_is_gp()
rcu: Prevent uninitialized string in RCU CPU stall info
rcu: Fix rcu_is_cpu_idle() #ifdef in TINY_RCU
rcu: Split RCU core processing out of __call_rcu()
rcu: Prevent __call_rcu() from invoking RCU core on offline CPUs
rcu: Make __call_rcu() handle invocation from idle
rcu: Remove function versions of __kfree_rcu and __is_kfree_rcu_offset
rcu: Consolidate tree/tiny __rcu_read_{,un}lock() implementations
rcu: Remove return value from rcu_assign_pointer()
key: Remove extraneous parentheses from rcu_assign_keypointer()
rcu: Remove return value from RCU_INIT_POINTER()
...
One more time/ntp fix pulled from Ingo Molnar.
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ntp: Fix STA_INS/DEL clearing bug
As part of cleaning up the timekeeping code, this patch converts
a number of internal functions to takei a timekeeper ptr as an
argument, so that the internal functions don't access the global
timekeeper structure directly. This allows for further optimizations
to reduce lock hold time later.
This patch has been updated to include more consistent usage of the
timekeeper value, by making sure it is always passed as a argument
to non top-level functions.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-9-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The timekeeper struct has a xtime_nsec, which keeps the
sub-nanosecond remainder. This ends up being somewhat
duplicative of the timekeeper.xtime.tv_nsec value, and we
have to do extra work to keep them apart, copying the full
nsec portion out and back in over and over.
This patch simplifies some of the logic by taking the timekeeper
xtime value and splitting it into timekeeper.xtime_sec and
reuses the timekeeper.xtime_nsec for the sub-second portion
(stored in higher res shifted nanoseconds).
This simplifies some of the accumulation logic. And will
allow for more accurate timekeeping once the vsyscall code
is updated to use the shifted nanosecond remainder.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-5-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reason: Update to upstream changes to avoid further conflicts.
Fixup a trivial merge conflict in kernel/time/tick-sched.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull RCU, perf, and scheduler fixes from Ingo Molnar.
The RCU fix is a revert for an optimization that could cause deadlocks.
One of the scheduler commits (164c33c6ad "sched: Fix fork() error path
to not crash") is correct but not complete (some architectures like Tile
are not covered yet) - the resulting additional fixes are still WIP and
Ingo did not want to delay these pending fixes. See this thread on
lkml:
[PATCH] fork: fix error handling in dup_task()
The perf fixes are just trivial oneliners.
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf kvm: Fix segfault with report and mixed guestmount use
perf kvm: Fix regression with guest machine creation
perf script: Fix format regression due to libtraceevent merge
ring-buffer: Fix accounting of entries when removing pages
ring-buffer: Fix crash due to uninitialized new_pages list head
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS/sched: Update scheduler file pattern
sched/nohz: Rewrite and fix load-avg computation -- again
sched: Fix fork() error path to not crash
To finally fix the infamous leap second issue and other race windows
caused by functions which change the offsets between the various time
bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a
function which atomically gets the current monotonic time and updates
the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic
overhead. The previous patch which provides ktime_t offsets allows us
to make this function almost as cheap as ktime_get() which is going to
be replaced in hrtimer_interrupt().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/1341960205-56738-7-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The timekeeping code misses an update of the hrtimer subsystem after a
leap second happened. Due to that timers based on CLOCK_REALTIME are
either expiring a second early or late depending on whether a leap
second has been inserted or deleted until an operation is initiated
which causes that update. Unless the update happens by some other
means this discrepancy between the timekeeping and the hrtimer data
stays forever and timers are expired either early or late.
The reported immediate workaround - $ data -s "`date`" - is causing a
call to clock_was_set() which updates the hrtimer data structures.
See: http://www.sheeri.com/content/mysql-and-leap-second-high-cpu-and-fix
Add the missing clock_was_set() call to update_wall_time() in case of
a leap second event. The actual update is deferred to softirq context
as the necessary smp function call cannot be invoked from hard
interrupt context.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reported-by: Jan Engelhardt <jengelh@inai.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-3-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thanks to Charles Wang for spotting the defects in the current code:
- If we go idle during the sample window -- after sampling, we get a
negative bias because we can negate our own sample.
- If we wake up during the sample window we get a positive bias
because we push the sample to a known active period.
So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.
Reported-and-tested-by: Doug Smythies <dsmythies@telus.net>
Reported-and-tested-by: Charles Wang <muming.wq@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
[ minor edits ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the nohz= boot parameter disables nohz, then RCU_FAST_NO_HZ needs to
also disable itself. This commit therefore checks for tick_nohz_enabled
being zero, disabling rcu_prepare_for_idle() if so. This commit assumes
that tick_nohz_enabled can change at runtime: If this is not the case,
then a simpler approach suffices.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pull core updates (RCU and locking) from Ingo Molnar:
"Most of the diffstat comes from the RCU slow boot regression fixes,
but there's also a debuggability improvements/fixes."
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
memblock: Document memblock_is_region_{memory,reserved}()
rcu: Precompute RCU_FAST_NO_HZ timer offsets
rcu: Move RCU_FAST_NO_HZ per-CPU variables to rcu_dynticks structure
rcu: Update RCU_FAST_NO_HZ tracing for lazy callbacks
rcu: RCU_FAST_NO_HZ detection of callback adoption
spinlock: Indicate that a lockup is only suspected
kdump: Execute kmsg_dump(KMSG_DUMP_PANIC) after smp_send_stop()
panic: Make panic_on_oops configurable
When the timer tick fires, it accounts the new jiffy as either part
of system, user or idle time. This is how we record the cputime
statistics.
But when the tick is stopped from the idle task, we still need
to record the number of jiffies spent tickless until we restart
the tick and fall back to traditional tick-based cputime accounting.
To do this, we take a snapshot of jiffies when the tick is stopped
and compute the difference against the new value of jiffies when
the tick is restarted. Then we account this whole difference to
the idle cputime.
However we are preparing to be able to stop the tick from other places
than idle. So this idle time accounting needs to be performed from
the callers of nohz APIs, not from the nohz APIs themselves because
we now want them to be agnostic against places that stop/restart tick.
Therefore, we pull the tickless idle time accounting out of generic
nohz helpers up to idle entry/exit callers.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>