Commit Graph

180354 Commits

Author SHA1 Message Date
Suresh Siddha dd5feea14a sched: Fix SCHED_MC regression caused by change in sched cpu_power
On platforms like dual socket quad-core platform, the scheduler load
balancer is not detecting the load imbalances in certain scenarios. This
is leading to scenarios like where one socket is completely busy (with
all the 4 cores running with 4 tasks) and leaving another socket
completely idle. This causes performance issues as those 4 tasks share
the memory controller, last-level cache bandwidth etc. Also we won't be
taking advantage of turbo-mode as much as we would like, etc.

Some of the comparisons in the scheduler load balancing code are
comparing the "weighted cpu load that is scaled wrt sched_group's
cpu_power" with the "weighted average load per task that is not scaled
wrt sched_group's cpu_power". While this has probably been broken for a
longer time (for multi socket numa nodes etc), the problem got aggrevated
via this recent change:

 |
 |  commit f93e65c186
 |  Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
 |  Date:   Tue Sep 1 10:34:32 2009 +0200
 |
 |	sched: Restore __cpu_power to a straight sum of power
 |

Also with this change, the sched group cpu power alone no longer reflects
the group capacity that is needed to implement MC, MT performance
(default) and power-savings (user-selectable) policies.

We need to use the computed group capacity (sgs.group_capacity, that is
computed using the SD_PREFER_SIBLING logic in update_sd_lb_stats()) to
find out if the group with the max load is above its capacity and how
much load to move etc.

Reported-by: Ma Ling <ling.ma@intel.com>
Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
[ -v2: build fix ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org> # [2.6.32.x, 2.6.33.x]
LKML-Reference: <1266970432.11588.22.camel@sbs-t61.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-26 15:45:13 +01:00
Thomas Gleixner 83ab0aa0d5 sched: Don't use possibly stale sched_class
setscheduler() saves task->sched_class outside of the rq->lock held
region for a check after the setscheduler changes have become
effective. That might result in checking a stale value.

rtmutex_setprio() has the same problem, though it is protected by
p->pi_lock against setscheduler(), but for correctness sake (and to
avoid bad examples) it needs to be fixed as well.

Retrieve task->sched_class inside of the rq->lock held region.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@kernel.org
2010-02-17 11:58:18 +01:00
Thomas Gleixner 6e40f5bbbc Merge branch 'sched/urgent' into sched/core
Conflicts: kernel/sched.c

Necessary due to the urgent fixes which conflict with the code move
from sched.c to sched_fair.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-02-16 16:48:56 +01:00
Peter Zijlstra 0970d2992d sched: Fix race between ttwu() and task_rq_lock()
Thomas found that due to ttwu() changing a task's cpu without holding
the rq->lock, task_rq_lock() might end up locking the wrong rq.

Avoid this by serializing against TASK_WAKING.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1266241712.15770.420.camel@laptop>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-02-16 15:13:59 +01:00
Suresh Siddha 9000f05c6d sched: Fix SMT scheduler regression in find_busiest_queue()
Fix a SMT scheduler performance regression that is leading to a scenario
where SMT threads in one core are completely idle while both the SMT threads
in another core (on the same socket) are busy.

This is caused by this commit (with the problematic code highlighted)

   commit bdb94aa5db
   Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
   Date:   Tue Sep 1 10:34:38 2009 +0200

   sched: Try to deal with low capacity

   @@ -4203,15 +4223,18 @@ find_busiest_queue()
   ...
	for_each_cpu(i, sched_group_cpus(group)) {
   +	unsigned long power = power_of(i);

   ...

   -	wl = weighted_cpuload(i);
   +	wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
   +	wl /= power;

   -	if (rq->nr_running == 1 && wl > imbalance)
   +	if (capacity && rq->nr_running == 1 && wl > imbalance)
		continue;

On a SMT system, power of the HT logical cpu will be 589 and
the scheduler load imbalance (for scenarios like the one mentioned above)
can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
the weighted load with the power will result in "wl > imbalance" and
ultimately resulting in find_busiest_queue() return NULL, causing
load_balance() to think that the load is well balanced. But infact
one of the tasks can be moved to the idle core for optimal performance.

We don't need to use the weighted load (wl) scaled by the cpu power to
compare with  imabalance. In that condition, we already know there is only a
single task "rq->nr_running == 1" and the comparison between imbalance,
wl is to make sure that we select the correct priority thread which matches
imbalance. So we really need to compare the imabalnce with the original
weighted load of the cpu and not the scaled load.

But in other conditions where we want the most hammered(busiest) cpu, we can
use scaled load to ensure that we consider the cpu power in addition to the
actual load on that cpu, so that we can move the load away from the
guy that is getting most hammered with respect to the actual capacity,
as compared with the rest of the cpu's in that busiest group.

Fix it.

Reported-by: Ma Ling <ling.ma@intel.com>
Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1266023662.2808.118.camel@sbs-t61.sc.intel.com>
Cc: stable@kernel.org [2.6.32.x]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-02-16 15:13:59 +01:00
Vaidyanathan Srinivasan 28f5318167 sched: Fix sched_mv_power_savings for !SMT
Fix for sched_mc_powersavigs for pre-Nehalem platforms.
Child sched domain should clear SD_PREFER_SIBLING if parent will have
SD_POWERSAVINGS_BALANCE because they are contradicting.

Sets the flags correctly based on sched_mc_power_savings.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100208100555.GD2931@dirshya.in.ibm.com>
Cc: stable@kernel.org [2.6.32.x]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-02-16 15:13:59 +01:00
Anton Blanchard 301ba0457f kthread, sched: Remove reference to kthread_create_on_cpu
kthread_create_on_cpu doesn't exist so update a comment in
kthread.c to reflect this.

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100209040740.GB3702@kryten>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-09 11:47:39 +01:00
Anton Blanchard fa535a77bd sched: cpuacct: Use bigger percpu counter batch values for stats counters
When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are
enabled we can call cpuacct_update_stats with values much larger
than percpu_counter_batch.  This means the call to
percpu_counter_add will always add to the global count which is
protected by a spinlock and we end up with a global spinlock in
the scheduler.

Based on an idea by KOSAKI Motohiro, this patch scales the batch
value by cputime_one_jiffy such that we have the same batch
limit as we would if CONFIG_VIRT_CPU_ACCOUNTING was disabled.
His patch did this once at boot but that initialisation happened
too early on PowerPC (before time_init) and it was never updated
at runtime as a result of a hotplug cpu add/remove.

This patch instead scales percpu_counter_batch by
cputime_one_jiffy at runtime, which keeps the batch correct even
after cpu hotplug operations.  We cap it at INT_MAX in case of
overflow.

For architectures that do not support
CONFIG_VIRT_CPU_ACCOUNTING, cputime_one_jiffy is the constant 1
and gcc is smart enough to optimise min(s32
percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
least on x86 and PowerPC.  So there is no need to add an #ifdef.

On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark
is 234x faster and almost matches a CONFIG_CGROUP_CPUACCT
disabled kernel:

 CONFIG_CGROUP_CPUACCT disabled:   16906698 ctx switches/sec
 CONFIG_CGROUP_CPUACCT enabled:       61720 ctx switches/sec
 CONFIG_CGROUP_CPUACCT + patch:	   16663217 ctx switches/sec

Tested with:

 wget http://ozlabs.org/~anton/junkcode/context_switch.c
 make context_switch
 for i in `seq 0 63`; do taskset -c $i ./context_switch & done
 vmstat 1

Signed-off-by: Anton Blanchard <anton@samba.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-08 08:57:37 +01:00
Anton Blanchard 0c9cf2efd7 percpu_counter: Make __percpu_counter_add an inline function on UP
Even though batch isn't used on UP, we may want to pass one in
to keep the SMP and UP code paths similar. Convert
__percpu_counter_add to an inline function so we wont get
variable unused warnings if we do.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-08 08:57:36 +01:00
Ingo Molnar 6d3e0907b8 Merge branch 'sched/urgent' into sched/core
Merge reason: Merge dependent fix, update to latest -rc.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-08 08:55:46 +01:00
Andrew Morton 50200df462 kernel/sched.c: Suppress unused var warning
On UP:

 kernel/sched.c: In function 'wake_up_new_task':
 kernel/sched.c:2631: warning: unused variable 'cpu'

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-08 08:53:19 +01:00
Linus Torvalds 29275254ca Linux 2.6.33-rc7 2010-02-06 14:17:12 -08:00
Linus Torvalds 82e22d77bf Merge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging
* 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
  hwmon: (w83781d) Request I/O ports individually for probing
  hwmon: (lm78) Request I/O ports individually for probing
  hwmon: (adt7462) Wrong ADT7462_VOLT_COUNT
2010-02-06 13:02:31 -08:00
Linus Torvalds f6510ec5a9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel:
  drm/i915: Fix leak of relocs along do_execbuffer error path
  drm/i915: slow acpi_lid_open() causes flickering - V2
  drm/i915: Disable SR when more than one pipe is enabled
  drm/i915: page flip support for Ironlake
  drm/i915: Fix the incorrect DMI string for Samsung SX20S laptop
  drm/i915: Add support for SDVO composite TV
  drm/i915: don't trigger ironlake vblank interrupt at irq install
  drm/i915: handle non-flip pending case when unpinning the scanout buffer
  drm/i915: Fix the device info of Pineview
  drm/i915: enable vblank interrupt on ironlake
  drm/i915: Prevent use of uninitialized pointers along error path.
  drm/i915: disable hotplug detect before Ironlake CRT detect
2010-02-06 13:01:39 -08:00
Linus Torvalds 6f5a55f1a6 Fix potential crash with sys_move_pages
We incorrectly depended on the 'node_state/node_isset()' functions
testing the node range, rather than checking it explicitly.  That's not
reliable, even if it might often happen to work.  So do the proper
explicit test.

Reported-by: Marcus Meissner <meissner@suse.de>
Acked-and-tested-by: Brice Goglin <Brice.Goglin@inria.fr>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-06 13:00:37 -08:00
Linus Torvalds 9d9c3a51e7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
  ASoC: pandora: Add APLL supply to fix audio output
  ALSA: ice1724 - aureon - fix wm8770 volume offset
  ALSA: cosmetic: make hda intel interrupt name consistent with others
  ALSA: hda - Delay switching to polling mode if an interrupt was missing
  ALSA: ctxfi - fix PTP address initialization
2010-02-05 11:11:34 -08:00
Jean Delvare b0bcdd3cd0 hwmon: (w83781d) Request I/O ports individually for probing
Different motherboards have different PNP declarations for
W83781D/W83782D chips. Some declare the whole range of I/O ports (8
ports), some declare only the useful ports (2 ports at offset 5) and
some declare fancy ranges, for example 4 ports at offset 4. To
properly handle all cases, request all ports individually for probing.
After we have determined that we really have a W83781D or W83782D
chip, the useful port range will be requested again, as a single
block.

I did not see a board which needs this yet, but I know of one for lm78
driver and I'd like to keep the logic of these two drivers in sync.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: stable@kernel.org
2010-02-05 19:58:36 +01:00
Jean Delvare 197027e6ef hwmon: (lm78) Request I/O ports individually for probing
Different motherboards have different PNP declarations for LM78/LM79
chips. Some declare the whole range of I/O ports (8 ports), some
declare only the useful ports (2 ports at offset 5) and some declare
fancy ranges, for example 4 ports at offset 4. To properly handle all
cases, request all ports individually for probing. After we have
determined that we really have an LM78 or LM79 chip, the useful port
range will be requested again, as a single block.

This fixes the driver on the Olivetti M3000 DT 540, at least.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: stable@kernel.org
2010-02-05 19:58:36 +01:00
Ray Copeland 85f8d3e5fa hwmon: (adt7462) Wrong ADT7462_VOLT_COUNT
The #define ADT7462_VOLT_COUNT is wrong, it should be 13 not 12. All the 
for loops that use this as a limit count are of the typical form, "for 
(n = 0; n < ADT7462_VOLT_COUNT; n++)", so to loop through all voltages 
w/o missing the last one it is necessary for the count to be one greater 
than it is.  (Specifically, you will miss the +1.5V 3GPIO input with count 
= 12 vs. 13.)

Signed-off-by: Ray Copeland <ray.copeland@aprius.com>
Acked-by: "Darrick J. Wong" <djwong@us.ibm.com>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: stable@kernel.org
2010-02-05 19:58:35 +01:00
Takashi Iwai 3e0b33f786 Merge remote branch 'alsa/fixes' into for-linus 2010-02-05 19:57:23 +01:00
Takashi Iwai a26a408888 Merge branch 'fix/asoc' into for-linus 2010-02-05 19:57:16 +01:00
Takashi Iwai db9256c003 Merge branch 'fix/hda' into for-linus 2010-02-05 19:56:55 +01:00
Linus Torvalds 56dca4ceb7 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
  [libata] Call flush_dcache_page after PIO data transfers in libata-sff.c
  ahci: add Acer G725 to broken suspend list
  libata: fix ata_id_logical_per_physical_sectors
  libata-scsi passthru: fix bug which truncated LBA48 return values
2010-02-05 07:58:21 -08:00
Andres Salomon 73d2eaac8a CS5536: apply pci quirk for BIOS SMBUS bug
The new cs5535-* drivers use PCI header config info rather than MSRs to
determine the memory region to use for things like GPIOs and MFGPTs.  As
anticipated, we've run into a buggy BIOS:

[    0.081818] pci 0000:00:14.0: reg 10: [io  0x6000-0x7fff]
[    0.081906] pci 0000:00:14.0: reg 14: [io  0x6100-0x61ff]
[    0.082015] pci 0000:00:14.0: reg 18: [io  0x6200-0x63ff]
[    0.082917] pci 0000:00:14.2: reg 20: [io  0xe000-0xe00f]
[    0.083551] pci 0000:00:15.0: reg 10: [mem 0xa0010000-0xa0010fff]
[    0.084436] pci 0000:00:15.1: reg 10: [mem 0xa0011000-0xa0011fff]
[    0.088816] PCI: pci_cache_line_size set to 32 bytes
[    0.088938] pci 0000:00:14.0: address space collision: [io 0x6100-0x61ff] already in use
[    0.089052] pci 0000:00:14.0: can't reserve [io  0x6100-0x61ff]

This is a Soekris board, and its BIOS sets the size of the PCI ISA bridge
device's BAR0 to 8k.  In reality, it should be 8 bytes (BAR0 is used for
SMBus stuff).  This quirk checks for an incorrect size, and resets it
accordingly.

Signed-off-by: Andres Salomon <dilinger@collabora.co.uk>
Tested-by: Leigh Porter <leigh@leighporter.org>
Tested-by: Jens Rottmann <JRottmann@LiPPERTEmbedded.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-05 07:36:50 -08:00
Stephen Rothwell 2938429501 percpu: add __percpu for sparse
This is to make the annotation of percpu variables during the next merge
window less painfull.

Extracted from a patch by Rusty Russell.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-05 07:35:05 -08:00