Pull scheduler updates from Ingo Molnar:
"The main changes in this (fairly busy) cycle were:
- There was a class of scheduler bugs related to forgetting to update
the rq-clock timestamp which can cause weird and hard to debug
problems, so there's a new debug facility for this: which uncovered
a whole lot of bugs which convinced us that we want to keep the
debug facility.
(Peter Zijlstra, Matt Fleming)
- Various cputime related updates: eliminate cputime and use u64
nanoseconds directly, simplify and improve the arch interfaces,
implement delayed accounting more widely, etc. - (Frederic
Weisbecker)
- Move code around for better structure plus cleanups (Ingo Molnar)
- Move IO schedule accounting deeper into the scheduler plus related
changes to improve the situation (Tejun Heo)
- ... plus a round of sched/rt and sched/deadline fixes, plus other
fixes, updats and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
sched/core: Remove unlikely() annotation from sched_move_task()
sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
sched/topology: Split out scheduler topology code from core.c into topology.c
sched/core: Remove unnecessary #include headers
sched/rq_clock: Consolidate the ordering of the rq_clock methods
delayacct: Include <uapi/linux/taskstats.h>
sched/core: Clean up comments
sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
sched/clock: Add dummy clear_sched_clock_stable() stub function
sched/cputime: Remove generic asm headers
sched/cputime: Remove unused nsec_to_cputime()
s390, sched/cputime: Remove unused cputime definitions
powerpc, sched/cputime: Remove unused cputime definitions
s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
ia64, sched/cputime: Remove unused cputime definitions
ia64: Convert vtime to use nsec units directly
ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
sched/cputime: Remove jiffies based cputime
sched/cputime, vtime: Return nsecs instead of cputime_t to account
sched/cputime: Complete nsec conversion of tick based accounting
...
Pull timer updates from Thomas Gleixner:
"Nothing exciting, just the usual pile of fixes, updates and cleanups:
- A bunch of clocksource driver updates
- Removal of CONFIG_TIMER_STATS and the related /proc file
- More posix timer slim down work
- A scalability enhancement in the tick broadcast code
- Math cleanups"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
hrtimer: Catch invalid clockids again
math64, tile: Fix build failure
clocksource/drivers/arm_arch_timer:: Mark cyclecounter __ro_after_init
timerfd: Protect the might cancel mechanism proper
timer_list: Remove useless cast when printing
time: Remove CONFIG_TIMER_STATS
clocksource/drivers/arm_arch_timer: Work around Hisilicon erratum 161010101
clocksource/drivers/arm_arch_timer: Introduce generic errata handling infrastructure
clocksource/drivers/arm_arch_timer: Remove fsl-a008585 parameter
clocksource/drivers/arm_arch_timer: Add dt binding for hisilicon-161010101 erratum
clocksource/drivers/ostm: Add renesas-ostm timer driver
clocksource/drivers/ostm: Document renesas-ostm timer DT bindings
clocksource/drivers/tcb_clksrc: Use 32 bit tcb as sched_clock
clocksource/drivers/gemini: Add driver for the Cortina Gemini
clocksource: add DT bindings for Cortina Gemini
clockevents: Add a clkevt-of mechanism like clksrc-of
tick/broadcast: Reduce lock cacheline contention
timers: Omit POSIX timer stuff from task_struct when disabled
x86/timer: Make delay() work during early bootup
delay: Add explanation of udelay() inaccuracy
...
Pull timer fixes from Thomas Gleixner:
"Two small fixes::
- Prevent deadlock on the tick broadcast lock. Found and fixed by
Mike.
- Stop using printk() in the timekeeping debug code to prevent a
deadlock against the scheduler"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Use deferred printk() in debug code
tick/broadcast: Prevent deadlock on tick_broadcast_lock
commit 82e88ff1ea ("hrtimer: Revert CLOCK_MONOTONIC_RAW support") removed
unfortunately a sanity check in the hrtimer code which was part of that
MONOTONIC_RAW patch series.
It would have caught the bogus usage of CLOCK_MONOTONIC_RAW in the wireless
code. So bring it back.
It is way too easy to take any random clockid and feed it to the hrtimer
subsystem. At best, it gets mapped to a monotonic base, but it would be
better to just catch illegal values as early as possible.
Detect invalid clockids, map them to CLOCK_MONOTONIC and emit a warning.
[ tglx: Replaced the BUG by a WARN and gracefully map to CLOCK_MONOTONIC ]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Tomasz Nowicki <tn@semihalf.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Link: http://lkml.kernel.org/r/1452879670-16133-3-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This reverts commit 24b91e360e and commit
7bdb59f1ad ("tick/nohz: Fix possible missing clock reprog after tick
soft restart") that depends on it,
Pavel reports that it causes occasional boot hangs for him that seem to
depend on just how the machine was booted. In particular, his machine
hangs at around the PCI fixups of the EHCI USB host controller, but only
hangs from cold boot, not from a warm boot.
Thomas Gleixner suspecs it's a CPU hotplug interaction, particularly
since Pavel also saw suspend/resume issues that seem to be related.
We're reverting for now while trying to figure out the root cause.
Reported-bisected-and-tested-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org # reverted commits were marked for stable
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
tick_broadcast_lock is taken from interrupt context, but the following call
chain takes the lock without disabling interrupts:
[ 12.703736] _raw_spin_lock+0x3b/0x50
[ 12.703738] tick_broadcast_control+0x5a/0x1a0
[ 12.703742] intel_idle_cpu_online+0x22/0x100
[ 12.703744] cpuhp_invoke_callback+0x245/0x9d0
[ 12.703752] cpuhp_thread_fun+0x52/0x110
[ 12.703754] smpboot_thread_fn+0x276/0x320
So the following deadlock can happen:
lock(tick_broadcast_lock);
<Interrupt>
lock(tick_broadcast_lock);
intel_idle_cpu_online() is the only place which violates the calling
convention of tick_broadcast_control(). This was caused by the removal of
the smp function call in course of the cpu hotplug rework.
Instead of slapping local_irq_disable/enable() at the call site, we can
relax the calling convention and handle it in the core code, which makes
the whole machinery more robust.
Fixes: 29d7bbada9 ("intel_idle: Remove superfluous SMP fuction call")
Reported-by: Gabriel C <nix.or.die@gmail.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ruslan Ruslichenko <rruslich@cisco.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: lwn@lwn.net
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: stable <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1486953115.5912.4.camel@gmx.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
ts->next_tick keeps track of the next tick deadline in order to optimize
clock programmation on irq exit and avoid redundant clock device writes.
Now if ts->next_tick missed an update, we may spuriously miss a clock
reprog later as the nohz code is fooled by an obsolete next_tick value.
This is what happens here on a specific path: when we observe an
expired timer from the nohz update code on irq exit, we perform a soft
tick restart which simply fires the closest possible tick without
actually exiting the nohz mode and restoring a periodic state. But we
forget to update ts->next_tick accordingly.
As a result, after the next tick resulting from such soft tick restart,
the nohz code sees a stale value on ts->next_tick which doesn't match
the clock deadline that just expired. If that obsolete ts->next_tick
value happens to collide with the actual next tick deadline to be
scheduled, we may spuriously bypass the clock reprogramming. In the
worst case, the tick may never fire again.
Fix this with a ts->next_tick reset on soft tick restart.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed: Wanpeng Li <wanpeng.li@hotmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1486485894-29173-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It was observed that on an Intel x86 system without the ARAT (Always
running APIC timer) feature and with fairly large number of CPUs as
well as CPUs coming in and out of intel_idle frequently, the lock
contention on the tick_broadcast_lock can become significant.
To reduce contention, the lock is put into its own cacheline and all
the cpumask_var_t variables are put into the __read_mostly section.
Running the SP benchmark of the NAS Parallel Benchmarks on a 4-socket
16-core 32-thread Nehalam system, the performance number improved
from 3353.94 Mop/s to 3469.31 Mop/s when this patch was applied on
a 4.9.6 kernel. This is a 3.4% improvement.
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1485799063-20857-1-git-send-email-longman@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When the tick is stopped and an interrupt occurs afterward, we check on
that interrupt exit if the next tick needs to be rescheduled. If it
doesn't need any update, we don't want to do anything.
In order to check if the tick needs an update, we compare it against the
clockevent device deadline. Now that's a problem because the clockevent
device is at a lower level than the tick itself if it is implemented
on top of hrtimer.
Every hrtimer share this clockevent device. So comparing the next tick
deadline against the clockevent device deadline is wrong because the
device may be programmed for another hrtimer whose deadline collides
with the tick. As a result we may end up not reprogramming the tick
accidentally.
In a worst case scenario under full dynticks mode, the tick stops firing
as it is supposed to every 1hz, leaving /proc/stat stalled:
Task in a full dynticks CPU
----------------------------
* hrtimer A is queued 2 seconds ahead
* the tick is stopped, scheduled 1 second ahead
* tick fires 1 second later
* on tick exit, nohz schedules the tick 1 second ahead but sees
the clockevent device is already programmed to that deadline,
fooled by hrtimer A, the tick isn't rescheduled.
* hrtimer A is cancelled before its deadline
* tick never fires again until an interrupt happens...
In order to fix this, store the next tick deadline to the tick_sched
local structure and reuse that value later to check whether we need to
reprogram the clock after an interrupt.
On the other hand, ts->sleep_length still wants to know about the next
clock event and not just the tick, so we want to improve the related
comment to avoid confusion.
Reported-by: James Hartsock <hartsjc@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1483539124-5693-1-git-send-email-fweisbec@gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The last caller to timekeeping_set_tai_offset() was in commit
0b5154fb90 (timekeeping: Simplify tai updating from
do_adjtimex, 2013-03-22) and the last caller to
timekeeping_get_tai_offset() was in commit 76f4108892 (hrtimer:
Cleanup hrtimer accessors to the timekepeing state, 2014-07-16).
Remove these unused functions now that we handle TAI offsets
differently.
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
become completely pointless.
Get rid of the union and just keep ktime_t as simple typedef of type s64.
The conversion was done with coccinelle and some manual mopping up.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
There is no point in having an extra type for extra confusion. u64 is
unambiguous.
Conversion was done with the following coccinelle script:
@rem@
@@
-typedef u64 cycle_t;
@fix@
typedef cycle_t;
@@
-cycle_t
+u64
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>