Right now, get_cpu_idle_time_us() only reports the idle
statistics upto the point the CPU entered last idle; not what is
valid right now.
This patch adds an update of the idle statistics to
get_cpu_idle_time_us(), so that calling this function always
returns statistics that are accurate at the point of the call.
This includes resetting the start of the idle time for
accounting purposes to avoid double accounting.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
LKML-Reference: <20100509082323.2d2f1945@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Entering nohz code on every micro-idle is costing ~10% throughput for netperf
TCP_RR when scheduling cross-cpu. Rate limiting entry fixes this, but raises
ticks a bit. On my Q6600, an idle box goes from ~85 interrupts/sec to 128.
The higher the context switch rate, the more nohz entry costs. With this patch
and some cycle recovery patches in my tree, max cross cpu context switch rate is
improved by ~16%, a large portion of which of which is this ratelimiting.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301003.6785.28.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timers, init: Limit the number of per cpu calibration bootup messages
posix-cpu-timers: optimize and document timer_create callback
clockevents: Add missing include to pacify sparse
x86: vmiclock: Fix printk format
x86: Fix printk format due to variable type change
sparc: fix printk for change of variable type
clocksource/events: Fix fallout of generic code changes
nohz: Allow 32-bit machines to sleep for more than 2.15 seconds
nohz: Track last do_timer() cpu
nohz: Prevent clocksource wrapping during idle
nohz: Type cast printk argument
mips: Use generic mult/shift factor calculation for clocks
clocksource: Provide a generic mult/shift factor calculation
clockevents: Use u32 for mult and shift factors
nohz: Introduce arch_needs_cpu
nohz: Reuse ktime in sub-functions of tick_check_idle.
time: Remove xtime_cache
time: Implement logarithmic time accumulation
The previous patch which limits the sleep time to the maximum
deferment time of the time keeping clocksource has some limitations on
SMP machines: if all CPUs are idle then for all CPUs the maximum sleep
time is limited.
Solve this by keeping track of which cpu had the do_timer() duty
assigned last and limit the sleep time only for this cpu.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Cc: Jon Hunter <jon-hunter@ti.com>
Cc: John Stultz <johnstul@us.ibm.com>
The dynamic tick allows the kernel to sleep for periods longer than a
single tick, but it does not limit the sleep time currently. In the
worst case the kernel could sleep longer than the wrap around time of
the time keeping clock source which would result in losing track of
time.
Prevent this by limiting it to the safe maximum sleep time of the
current time keeping clock source. The value is calculated when the
clock source is registered.
[ tglx: simplified the code a bit and massaged the commit msg ]
Signed-off-by: Jon Hunter <jon-hunter@ti.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1250617512-23567-2-git-send-email-jon-hunter@ti.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On some archs local_softirq_pending() has a data type of unsigned long
on others its unsigned int. Type cast it to (unsigned int) in the
printk to avoid the compiler warning.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Allow the architecture to request a normal jiffy tick when the system
goes idle and tick_nohz_stop_sched_tick is called . On s390 the hook is
used to prevent the system going fully idle if there has been an
interrupt other than a clock comparator interrupt since the last wakeup.
On s390 the HiperSockets response time for 1 connection ping-pong goes
down from 42 to 34 microseconds. The CPU cost decreases by 27%.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
LKML-Reference: <20090929122533.402715150@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On a system with NOHZ=y tick_check_idle calls tick_nohz_stop_idle and
tick_nohz_update_jiffies. Given the right conditions (ts->idle_active
and/or ts->tick_stopped) both function get a time stamp with ktime_get.
The same time stamp can be reused if both function require one.
On s390 this change has the additional benefit that gcc inlines the
tick_nohz_stop_idle function into tick_check_idle. The number of
instructions to execute tick_check_idle drops from 225 to 144
(without the ktime_get optimization it is 367 vs 215 instructions).
before:
0) | tick_check_idle() {
0) | tick_nohz_stop_idle() {
0) | ktime_get() {
0) | read_tod_clock() {
0) 0.601 us | }
0) 1.765 us | }
0) 3.047 us | }
0) | ktime_get() {
0) | read_tod_clock() {
0) 0.570 us | }
0) 1.727 us | }
0) | tick_do_update_jiffies64() {
0) 0.609 us | }
0) 8.055 us | }
after:
0) | tick_check_idle() {
0) | ktime_get() {
0) | read_tod_clock() {
0) 0.617 us | }
0) 1.773 us | }
0) | tick_do_update_jiffies64() {
0) 0.593 us | }
0) 4.477 us | }
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: john stultz <johnstul@us.ibm.com>
LKML-Reference: <20090929122533.206589318@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Commit f2e21c9610 had unfortunate side
effects with cpufreq governors on some systems.
If the system did not switch into NOHZ mode ts->inidle is not set when
tick_nohz_stop_sched_tick() is called from the idle routine. Therefor
all subsequent calls from irq_exit() to tick_nohz_stop_sched_tick()
fail to call tick_nohz_start_idle(). This results in bogus idle
accounting information which is passed to cpufreq governors.
Set the inidle flag unconditionally of the NOHZ active state to keep
the idle time accounting correct in any case.
[ tglx: Added comment and tweaked the changelog ]
Reported-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Greg KH <greg@kroah.com>
Cc: Steven Noonan <steven@uplinklabs.net>
Cc: stable@kernel.org
LKML-Reference: <1254907901.30157.93.camel@eenurkka-desktop>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
A call from irq_exit() may occasionally pause the timing
info for cpufreq ondemand governor. This results in the
cpufreq ondemand governor to fail to calculate the
system load properly. Thus, relocate the checks for this
particular case to keep the governor always functional.
Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
Reported-by: Tero Kristo <tero.kristo@nokia.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]:
The following pinned hrtimers have been identified and marked:
1)sched_rt_period_timer
2)tick_sched_timer
3)stack_trace_timer_fn
[ tglx: fixup the hrtimer pinned mode ]
Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: cleanup, reduce kernel size a bit, avoid sparse warning
Fixes sparse warning:
kernel/time/tick-sched.c:137:6: warning: symbol 'tick_nohz_update_jiffies' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (66 commits)
x86: export vector_used_by_percpu_irq
x86: use logical apicid in x2apic_cluster's x2apic_cpu_mask_to_apicid_and()
sched: nominate preferred wakeup cpu, fix
x86: fix lguest used_vectors breakage, -v2
x86: fix warning in arch/x86/kernel/io_apic.c
sched: fix warning in kernel/sched.c
sched: move test_sd_parent() to an SMP section of sched.h
sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0
sched: activate active load balancing in new idle cpus
sched: bias task wakeups to preferred semi-idle packages
sched: nominate preferred wakeup cpu
sched: favour lower logical cpu number for sched_mc balance
sched: framework for sched_mc/smt_power_savings=N
sched: convert BALANCE_FOR_xx_POWER to inline functions
x86: use possible_cpus=NUM to extend the possible cpus allowed
x86: fix cpu_mask_to_apicid_and to include cpu_online_mask
x86: update io_apic.c to the new cpumask code
x86: Introduce topology_core_cpumask()/topology_thread_cpumask()
x86: xen: use smp_call_function_many()
x86: use work_on_cpu in x86/kernel/cpu/mcheck/mce_amd_64.c
...
Fixed up trivial conflict in kernel/time/tick-sched.c manually
The cpu time spent by the idle process actually doing something is
currently accounted as idle time. This is plain wrong, the architectures
that support VIRT_CPU_ACCOUNTING=y can do better: distinguish between the
time spent doing nothing and the time spent by idle doing work. The first
is accounted with account_idle_time and the second with account_system_time.
The architectures that use the account_xxx_time interface directly and not
the account_xxx_ticks interface now need to do the check for the idle
process in their arch code. In particular to improve the system vs true
idle time accounting the arch code needs to measure the true idle time
instead of just testing for the idle process.
To improve the tick based accounting as well we would need an architecture
primitive that can tell us if the pt_regs of the interrupted context
points to the magic instruction that halts the cpu.
In addition idle time is no more added to the stime of the idle process.
This field now contains the system time of the idle process as it should
be. On systems without VIRT_CPU_ACCOUNTING this will always be zero as
every tick that occurs while idle is running will be accounted as idle
time.
This patch contains the necessary common code changes to be able to
distinguish idle system time and true idle time. The architectures with
support for VIRT_CPU_ACCOUNTING need some changes to exploit this.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The utimescaled / stimescaled fields in the task structure and the
global cpustat should be set on all architectures. On s390 the calls
to account_user_time_scaled and account_system_time_scaled never have
been added. In addition system time that is accounted as guest time
to the user time of a process is accounted to the scaled system time
instead of the scaled user time.
To fix the bugs and to prevent future forgetfulness this patch merges
account_system_time_scaled into account_system_time and
account_user_time_scaled into account_user_time.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Michael Neuling <mikey@neuling.org>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
In my device I get many interrupts from a high speed USB device in a very
short period of time. The system spends a lot of time reprogramming the
hardware timer which is in a slower timing domain as compared to the CPU.
This results in the CPU spending a huge amount of time waiting for the
timer posting to be done. All of this reprogramming is useless as the
wake up time has not changed.
As measured using ETM trace this drops my reprogramming penalty from
almost 60% CPU load down to 15% during high interrupt rate. I can send
traces to show this.
Suppress setting of duplicate timer event when timer already stopped.
Timer programming can be very costly and can result in long cpu stall/wait
times.
[akpm@linux-foundation.org: coding-style fixes]
[tglx@linutronix.de: move the check to the right place and avoid raising
the softirq for nothing]
Signed-off-by: Richard Woodruff <r-woodruff2@ti.com>
Cc: johnstul@us.ibm.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: remove false positive warning
After a cpu was taken down during cpu hotplug (read: disabled for interrupts)
it still might have pending softirqs. However take_cpu_down makes sure
that the idle task will run next instead of ksoftirqd on the taken down cpu.
The idle task will call tick_nohz_stop_sched_tick which might warn about
pending softirqs just before the cpu kills itself completely.
However the pending softirqs on the dead cpu aren't a problem because they
will be moved to an online cpu during CPU_DEAD handling.
So make sure we warn only for online cpus.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, move all hrtimer processing into hardirq context
This is an attempt at removing some of the hrtimer complexity by
reducing the number of callback modes to 1.
This means that all hrtimer callback functions will be ran from HARD-irq
context.
I went through all the 30 odd hrtimer callback functions in the kernel
and saw only one that I'm not quite sure of, which is the one in
net/can/bcm.c - hence I'm CC-ing the folks responsible for that code.
Furthermore, the hrtimer core now calls callbacks directly with IRQs
disabled in case you try to enqueue an expired timer. If this timer is a
periodic timer (which should use hrtimer_forward() to advance its time)
then it might be possible to end up in an inf. recursive loop due to the
fact that hrtimer_forward() doesn't round up to the next timer
granularity, and therefore keeps on calling the callback - obviously
this needs a fix.
Aside from that, this seems to compile and actually boot on my dual core
test box - although I'm sure there are some bugs in, me not hitting any
makes me certain :-)
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: (future) size reduction for large NR_CPUS.
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: nohz powersavings and wakeup regression
commit fb02fbc14d (NOHZ: restart tick
device from irq_enter()) causes a serious wakeup regression.
While the patch is correct it does not take into account that spurious
wakeups happen on x86. A fix for this issue is available, but we just
revert to the .27 behaviour and let long running softirqs screw
themself.
Disable it for now.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>