Commit Graph

11814 Commits

Author SHA1 Message Date
Peter Zijlstra
d72bce0e67 rcu: Cure load woes
Commit cc3ce5176d (rcu: Start RCU kthreads in TASK_INTERRUPTIBLE
state) fudges a sleeping task' state, resulting in the scheduler seeing
a TASK_UNINTERRUPTIBLE task going to sleep, but a TASK_INTERRUPTIBLE
task waking up. The result is unbalanced load calculation.

The problem that patch tried to address is that the RCU threads could
stay in UNINTERRUPTIBLE state for quite a while and triggering the hung
task detector due to on-demand wake-ups.

Cure the problem differently by always giving the tasks at least one
wake-up once the CPU is fully up and running, this will kick them out of
the initial UNINTERRUPTIBLE state and into the regular INTERRUPTIBLE
wait state.

[ The alternative would be teaching kthread_create() to start threads as
  INTERRUPTIBLE but that needs a tad more thought. ]

Reported-by: Damien Wyart <damien.wyart@free.fr>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul E. McKenney <paul.mckenney@linaro.org>
Link: http://lkml.kernel.org/r/1306755291.1200.2872.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-31 10:01:48 +02:00
Linus Torvalds
6345d24daf mm: Fix boot crash in mm_alloc()
Thomas Gleixner reports that we now have a boot crash triggered by
CONFIG_CPUMASK_OFFSTACK=y:

    BUG: unable to handle kernel NULL pointer dereference at   (null)
    IP: [<c11ae035>] find_next_bit+0x55/0xb0
    Call Trace:
     [<c11addda>] cpumask_any_but+0x2a/0x70
     [<c102396b>] flush_tlb_mm+0x2b/0x80
     [<c1022705>] pud_populate+0x35/0x50
     [<c10227ba>] pgd_alloc+0x9a/0xf0
     [<c103a3fc>] mm_init+0xec/0x120
     [<c103a7a3>] mm_alloc+0x53/0xd0

which was introduced by commit de03c72cfc ("mm: convert
mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of
mm_init() vs mm_init_cpumask

Thomas wrote a patch to just fix the ordering of initialization, but I
hate the new double allocation in the fork path, so I ended up instead
doing some more radical surgery to clean it all up.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Ingo Molnar <mingo@elte.hu>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-29 11:32:28 -07:00
Linus Torvalds
f310642123 Merge branch 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6
* 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
  x86 idle: deprecate mwait_idle() and "idle=mwait" cmdline param
  x86 idle: deprecate "no-hlt" cmdline param
  x86 idle APM: deprecate CONFIG_APM_CPU_IDLE
  x86 idle floppy: deprecate disable_hlt()
  x86 idle: EXPORT_SYMBOL(default_idle, pm_idle) only when APM demands it
  x86 idle: clarify AMD erratum 400 workaround
  idle governor: Avoid lock acquisition to read pm_qos before entering idle
  cpuidle: menu: fixed wrapping timers at 4.294 seconds
2011-05-29 11:18:09 -07:00
Tim Chen
333c5ae994 idle governor: Avoid lock acquisition to read pm_qos before entering idle
Thanks to the reviews and comments by Rafael, James, Mark and Andi.
Here's version 2 of the patch incorporating your comments and also some
update to my previous patch comments.

I noticed that before entering idle state, the menu idle governor will
look up the current pm_qos target value according to the list of qos
requests received.  This look up currently needs the acquisition of a
lock to access the list of qos requests to find the qos target value,
slowing down the entrance into idle state due to contention by multiple
cpus to access this list.  The contention is severe when there are a lot
of cpus waking and going into idle.  For example, for a simple workload
that has 32 pair of processes ping ponging messages to each other, where
64 cpu cores are active in test system, I see the following profile with
37.82% of cpu cycles spent in contention of pm_qos_lock:

-     37.82%          swapper  [kernel.kallsyms]          [k]
_raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
      - 95.65% pm_qos_request
           menu_select
           cpuidle_idle_call
         - cpu_idle
              99.98% start_secondary

A better approach will be to cache the updated pm_qos target value so
reading it does not require lock acquisition as in the patch below.
With this patch the contention for pm_qos_lock is removed and I saw a
2.2X increase in throughput for my message passing workload.

cc: stable@kernel.org
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: James Bottomley <James.Bottomley@suse.de>
Acked-by: mark gross <markgross@thegnar.org>
Signed-off-by: Len Brown <len.brown@intel.com>
2011-05-29 00:50:59 -04:00
Linus Torvalds
08a8b79600 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed
  sched: Fix ->min_vruntime calculation in dequeue_entity()
  sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW
  sched: More sched_domain iterations fixes
2011-05-28 12:56:46 -07:00
Linus Torvalds
1ba4b8cb94 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  rcu: Start RCU kthreads in TASK_INTERRUPTIBLE state
  rcu: Remove waitqueue usage for cpu, node, and boost kthreads
  rcu: Avoid acquiring rcu_node locks in timer functions
  atomic: Add atomic_or()
  Documentation: Add statistics about nested locks
  rcu: Decrease memory-barrier usage based on semi-formal proof
  rcu: Make rcu_enter_nohz() pay attention to nesting
  rcu: Don't do reschedule unless in irq
  rcu: Remove old memory barriers from rcu_process_callbacks()
  rcu: Add memory barriers
  rcu: Fix unpaired rcu_irq_enter() from locking selftests
2011-05-28 12:56:32 -07:00
Linus Torvalds
c4a227d89f Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
  perf: Fix SIGIO handling
  perf top: Don't stop if no kernel symtab is found
  perf top: Handle kptr_restrict
  perf top: Remove unused macro
  perf events: initialize fd array to -1 instead of 0
  perf tools: Make sure kptr_restrict warnings fit 80 col terms
  perf tools: Fix build on older systems
  perf symbols: Handle /proc/sys/kernel/kptr_restrict
  perf: Remove duplicate headers
  ftrace: Add internal recursive checks
  tracing: Update btrfs's tracepoints to use u64 interface
  tracing: Add __print_symbolic_u64 to avoid warnings on 32bit machine
  ftrace: Set ops->flag to enabled even on static function tracing
  tracing: Have event with function tracer check error return
  ftrace: Have ftrace_startup() return failure code
  jump_label: Check entries limit in __jump_label_update
  ftrace/recordmcount: Avoid STT_FUNC symbols as base on ARM
  scripts/tags.sh: Add magic for trace-events for etags too
  scripts/tags.sh: Fix ctags for DEFINE_EVENT()
  x86/ftrace: Fix compiler warning in ftrace.c
  ...
2011-05-28 12:55:55 -07:00
Paul E. McKenney
cc3ce5176d rcu: Start RCU kthreads in TASK_INTERRUPTIBLE state
Upon creation, kthreads are in TASK_UNINTERRUPTIBLE state, which can
result in softlockup warnings.  Because some of RCU's kthreads can
legitimately be idle indefinitely, start them in TASK_INTERRUPTIBLE
state in order to avoid those warnings.

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-28 17:41:56 +02:00
Peter Zijlstra
08bca60a69 rcu: Remove waitqueue usage for cpu, node, and boost kthreads
It is not necessary to use waitqueues for the RCU kthreads because
we always know exactly which thread is to be awakened.  In addition,
wake_up() only issues an actual wakeup when there is a thread waiting on
the queue, which was why there was an extra explicit wake_up_process()
to get the RCU kthreads started.

Eliminating the waitqueues (and wake_up()) in favor of wake_up_process()
eliminates the need for the initial wake_up_process() and also shrinks
the data structure size a bit.  The wakeup logic is placed in a new
rcu_wait() macro.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-28 17:41:52 +02:00
Paul E. McKenney
8826f3b039 rcu: Avoid acquiring rcu_node locks in timer functions
This commit switches manipulations of the rcu_node ->wakemask field
to atomic operations, which allows rcu_cpu_kthread_timer() to avoid
acquiring the rcu_node lock.  This should avoid the following lockdep
splat reported by Valdis Kletnieks:

[   12.872150] usb 1-4: new high speed USB device number 3 using ehci_hcd
[   12.986667] usb 1-4: New USB device found, idVendor=413c, idProduct=2513
[   12.986679] usb 1-4: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[   12.987691] hub 1-4:1.0: USB hub found
[   12.987877] hub 1-4:1.0: 3 ports detected
[   12.996372] input: PS/2 Generic Mouse as /devices/platform/i8042/serio1/input/input10
[   13.071471] udevadm used greatest stack depth: 3984 bytes left
[   13.172129]
[   13.172130] =======================================================
[   13.172425] [ INFO: possible circular locking dependency detected ]
[   13.172650] 2.6.39-rc6-mmotm0506 #1
[   13.172773] -------------------------------------------------------
[   13.172997] blkid/267 is trying to acquire lock:
[   13.173009]  (&p->pi_lock){-.-.-.}, at: [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa
[   13.173009]
[   13.173009] but task is already holding lock:
[   13.173009]  (rcu_node_level_0){..-...}, at: [<ffffffff810901cc>] rcu_cpu_kthread_timer+0x27/0x58
[   13.173009]
[   13.173009] which lock already depends on the new lock.
[   13.173009]
[   13.173009]
[   13.173009] the existing dependency chain (in reverse order) is:
[   13.173009]
[   13.173009] -> #2 (rcu_node_level_0){..-...}:
[   13.173009]        [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
[   13.173009]        [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
[   13.173009]        [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
[   13.173009]        [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
[   13.173009]        [<ffffffff815697f1>] _raw_spin_lock+0x36/0x45
[   13.173009]        [<ffffffff81090794>] rcu_read_unlock_special+0x8c/0x1d5
[   13.173009]        [<ffffffff8109092c>] __rcu_read_unlock+0x4f/0xd7
[   13.173009]        [<ffffffff81027bd3>] rcu_read_unlock+0x21/0x23
[   13.173009]        [<ffffffff8102cc34>] cpuacct_charge+0x6c/0x75
[   13.173009]        [<ffffffff81030cc6>] update_curr+0x101/0x12e
[   13.173009]        [<ffffffff810311d0>] check_preempt_wakeup+0xf7/0x23b
[   13.173009]        [<ffffffff8102acb3>] check_preempt_curr+0x2b/0x68
[   13.173009]        [<ffffffff81031d40>] ttwu_do_wakeup+0x76/0x128
[   13.173009]        [<ffffffff81031e49>] ttwu_do_activate.constprop.63+0x57/0x5c
[   13.173009]        [<ffffffff81031e96>] scheduler_ipi+0x48/0x5d
[   13.173009]        [<ffffffff810177d5>] smp_reschedule_interrupt+0x16/0x18
[   13.173009]        [<ffffffff815710f3>] reschedule_interrupt+0x13/0x20
[   13.173009]        [<ffffffff810b66d1>] rcu_read_unlock+0x21/0x23
[   13.173009]        [<ffffffff810b739c>] find_get_page+0xa9/0xb9
[   13.173009]        [<ffffffff810b8b48>] filemap_fault+0x6a/0x34d
[   13.173009]        [<ffffffff810d1a25>] __do_fault+0x54/0x3e6
[   13.173009]        [<ffffffff810d447a>] handle_pte_fault+0x12c/0x1ed
[   13.173009]        [<ffffffff810d48f7>] handle_mm_fault+0x1cd/0x1e0
[   13.173009]        [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de
[   13.173009]        [<ffffffff8156a75f>] page_fault+0x1f/0x30
[   13.173009]
[   13.173009] -> #1 (&rq->lock){-.-.-.}:
[   13.173009]        [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
[   13.173009]        [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
[   13.173009]        [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
[   13.173009]        [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
[   13.173009]        [<ffffffff815697f1>] _raw_spin_lock+0x36/0x45
[   13.173009]        [<ffffffff81027e19>] __task_rq_lock+0x8b/0xd3
[   13.173009]        [<ffffffff81032f7f>] wake_up_new_task+0x41/0x108
[   13.173009]        [<ffffffff810376c3>] do_fork+0x265/0x33f
[   13.173009]        [<ffffffff81007d02>] kernel_thread+0x6b/0x6d
[   13.173009]        [<ffffffff8153a9dd>] rest_init+0x21/0xd2
[   13.173009]        [<ffffffff81b1db4f>] start_kernel+0x3bb/0x3c6
[   13.173009]        [<ffffffff81b1d29f>] x86_64_start_reservations+0xaf/0xb3
[   13.173009]        [<ffffffff81b1d393>] x86_64_start_kernel+0xf0/0xf7
[   13.173009]
[   13.173009] -> #0 (&p->pi_lock){-.-.-.}:
[   13.173009]        [<ffffffff81067788>] check_prev_add+0x68/0x20e
[   13.173009]        [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
[   13.173009]        [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
[   13.173009]        [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
[   13.173009]        [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
[   13.173009]        [<ffffffff815698ea>] _raw_spin_lock_irqsave+0x44/0x57
[   13.173009]        [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa
[   13.173009]        [<ffffffff81032f3c>] wake_up_process+0x10/0x12
[   13.173009]        [<ffffffff810901e9>] rcu_cpu_kthread_timer+0x44/0x58
[   13.173009]        [<ffffffff81045286>] call_timer_fn+0xac/0x1e9
[   13.173009]        [<ffffffff8104556d>] run_timer_softirq+0x1aa/0x1f2
[   13.173009]        [<ffffffff8103e487>] __do_softirq+0x109/0x26a
[   13.173009]        [<ffffffff8157144c>] call_softirq+0x1c/0x30
[   13.173009]        [<ffffffff81003207>] do_softirq+0x44/0xf1
[   13.173009]        [<ffffffff8103e8b9>] irq_exit+0x58/0xc8
[   13.173009]        [<ffffffff81017f5a>] smp_apic_timer_interrupt+0x79/0x87
[   13.173009]        [<ffffffff81570fd3>] apic_timer_interrupt+0x13/0x20
[   13.173009]        [<ffffffff810bd51a>] get_page_from_freelist+0x2aa/0x310
[   13.173009]        [<ffffffff810bdf03>] __alloc_pages_nodemask+0x178/0x243
[   13.173009]        [<ffffffff8101fe2f>] pte_alloc_one+0x1e/0x3a
[   13.173009]        [<ffffffff810d27fe>] __pte_alloc+0x22/0x14b
[   13.173009]        [<ffffffff810d48a8>] handle_mm_fault+0x17e/0x1e0
[   13.173009]        [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de
[   13.173009]        [<ffffffff8156a75f>] page_fault+0x1f/0x30
[   13.173009]
[   13.173009] other info that might help us debug this:
[   13.173009]
[   13.173009] Chain exists of:
[   13.173009]   &p->pi_lock --> &rq->lock --> rcu_node_level_0
[   13.173009]
[   13.173009]  Possible unsafe locking scenario:
[   13.173009]
[   13.173009]        CPU0                    CPU1
[   13.173009]        ----                    ----
[   13.173009]   lock(rcu_node_level_0);
[   13.173009]                                lock(&rq->lock);
[   13.173009]                                lock(rcu_node_level_0);
[   13.173009]   lock(&p->pi_lock);
[   13.173009]
[   13.173009]  *** DEADLOCK ***
[   13.173009]
[   13.173009] 3 locks held by blkid/267:
[   13.173009]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8156cdb4>] do_page_fault+0x1f3/0x5de
[   13.173009]  #1:  (&yield_timer){+.-...}, at: [<ffffffff810451da>] call_timer_fn+0x0/0x1e9
[   13.173009]  #2:  (rcu_node_level_0){..-...}, at: [<ffffffff810901cc>] rcu_cpu_kthread_timer+0x27/0x58
[   13.173009]
[   13.173009] stack backtrace:
[   13.173009] Pid: 267, comm: blkid Not tainted 2.6.39-rc6-mmotm0506 #1
[   13.173009] Call Trace:
[   13.173009]  <IRQ>  [<ffffffff8154a529>] print_circular_bug+0xc8/0xd9
[   13.173009]  [<ffffffff81067788>] check_prev_add+0x68/0x20e
[   13.173009]  [<ffffffff8100c861>] ? save_stack_trace+0x28/0x46
[   13.173009]  [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
[   13.173009]  [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
[   13.173009]  [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
[   13.173009]  [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa
[   13.173009]  [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
[   13.173009]  [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa
[   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
[   13.173009]  [<ffffffff815698ea>] _raw_spin_lock_irqsave+0x44/0x57
[   13.173009]  [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa
[   13.173009]  [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa
[   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
[   13.173009]  [<ffffffff81032f3c>] wake_up_process+0x10/0x12
[   13.173009]  [<ffffffff810901e9>] rcu_cpu_kthread_timer+0x44/0x58
[   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
[   13.173009]  [<ffffffff81045286>] call_timer_fn+0xac/0x1e9
[   13.173009]  [<ffffffff810451da>] ? del_timer+0x75/0x75
[   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
[   13.173009]  [<ffffffff8104556d>] run_timer_softirq+0x1aa/0x1f2
[   13.173009]  [<ffffffff8103e487>] __do_softirq+0x109/0x26a
[   13.173009]  [<ffffffff8106365f>] ? tick_dev_program_event+0x37/0xf6
[   13.173009]  [<ffffffff810a0e4a>] ? time_hardirqs_off+0x1b/0x2f
[   13.173009]  [<ffffffff8157144c>] call_softirq+0x1c/0x30
[   13.173009]  [<ffffffff81003207>] do_softirq+0x44/0xf1
[   13.173009]  [<ffffffff8103e8b9>] irq_exit+0x58/0xc8
[   13.173009]  [<ffffffff81017f5a>] smp_apic_timer_interrupt+0x79/0x87
[   13.173009]  [<ffffffff81570fd3>] apic_timer_interrupt+0x13/0x20
[   13.173009]  <EOI>  [<ffffffff810bd384>] ? get_page_from_freelist+0x114/0x310
[   13.173009]  [<ffffffff810bd51a>] ? get_page_from_freelist+0x2aa/0x310
[   13.173009]  [<ffffffff812220e7>] ? clear_page_c+0x7/0x10
[   13.173009]  [<ffffffff810bd1ef>] ? prep_new_page+0x14c/0x1cd
[   13.173009]  [<ffffffff810bd51a>] get_page_from_freelist+0x2aa/0x310
[   13.173009]  [<ffffffff810bdf03>] __alloc_pages_nodemask+0x178/0x243
[   13.173009]  [<ffffffff810d46b9>] ? __pmd_alloc+0x87/0x99
[   13.173009]  [<ffffffff8101fe2f>] pte_alloc_one+0x1e/0x3a
[   13.173009]  [<ffffffff810d46b9>] ? __pmd_alloc+0x87/0x99
[   13.173009]  [<ffffffff810d27fe>] __pte_alloc+0x22/0x14b
[   13.173009]  [<ffffffff810d48a8>] handle_mm_fault+0x17e/0x1e0
[   13.173009]  [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de
[   13.173009]  [<ffffffff810d915f>] ? sys_brk+0x32/0x10c
[   13.173009]  [<ffffffff810a0e4a>] ? time_hardirqs_off+0x1b/0x2f
[   13.173009]  [<ffffffff81065c4f>] ? trace_hardirqs_off_caller+0x3f/0x9c
[   13.173009]  [<ffffffff812235dd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   13.173009]  [<ffffffff8156a75f>] page_fault+0x1f/0x30
[   14.010075] usb 5-1: new full speed USB device number 2 using uhci_hcd

Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-28 17:41:49 +02:00
Ingo Molnar
29f742f88a Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/urgent 2011-05-28 17:41:05 +02:00
Peter Zijlstra
f506b3dc0e perf: Fix SIGIO handling
Vince noticed that unless we mmap() a buffer, SIGIO gets lost. So
explicitly push the wakeup (including signals) when requested.

Reported-by: Vince Weaver <vweaver1@eecs.utk.edu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/n/tip-2euus3f3x3dyvdk52cjxw8zu@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-28 17:04:59 +02:00
KOSAKI Motohiro
1e1b6c511d cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed
The rule is, we have to update tsk->rt.nr_cpus_allowed if we change
tsk->cpus_allowed. Otherwise RT scheduler may confuse.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4DD4B3FA.5060901@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-28 17:02:57 +02:00
Peter Zijlstra
1e87623178 sched: Fix ->min_vruntime calculation in dequeue_entity()
Dima Zavin <dima@android.com> reported:

"After pulling the thread off the run-queue during a cgroup change,
the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
then gets normalized to this new value. This can then lead to the thread
getting an unfair boost in the new group if the vruntime of the next
task in the old run-queue was way further ahead."

Reported-by: Dima Zavin <dima@android.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Recalls-having-tested-once-upon-a-time-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1305674470-23727-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-28 17:02:56 +02:00
Peter Zijlstra
d6aa8f85f1 sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW
Marc reported that e4a52bcb9 (sched: Remove rq->lock from the first
half of ttwu()) broke his ARM-SMP machine. Now ARM is one of the few
__ARCH_WANT_INTERRUPTS_ON_CTXSW users, so that exception in the ttwu()
code was suspect.

Yong found that the interrupt could hit after context_switch() changes
current but before it clears p->on_cpu, if that interrupt were to
attempt a wake-up of p we would indeed find ourselves spinning in IRQ
context.

Fix this by reverting to the old behaviour for this situation and
perform a full remote wake-up.

Cc: Frank Rowand <frank.rowand@am.sony.com>
Cc: Yong Zhang <yong.zhang0@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Reported-by: Marc Zyngier <Marc.Zyngier@arm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-28 17:02:55 +02:00
Xiaotian Feng
cd4ae6adf8 sched: More sched_domain iterations fixes
sched_domain iterations needs to be protected by rcu_read_lock() now,
this patch adds another two places which needs the rcu lock, which is
spotted by following suspicious rcu_dereference_check() usage warnings.

kernel/sched_rt.c:1244 invoked rcu_dereference_check() without protection!
kernel/sched_stats.h:41 invoked rcu_dereference_check() without protection!

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1303469634-11678-1-git-send-email-dfeng@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-28 17:02:54 +02:00
Linus Torvalds
f23a5e1405 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
  PM: Fix PM QOS's user mode interface to work with ASCII input
  PM / Hibernate: Update kerneldoc comments in hibernate.c
  PM / Hibernate: Remove arch_prepare_suspend()
  PM / Hibernate: Update some comments in core hibernate code
2011-05-27 14:27:34 -07:00
Linus Torvalds
e52e713ec3 Merge branch 'docs-move' of git://git.kernel.org/pub/scm/linux/kernel/git/rdunlap/linux-docs
* 'docs-move' of git://git.kernel.org/pub/scm/linux/kernel/git/rdunlap/linux-docs:
  Create Documentation/security/, move LSM-, credentials-, and keys-related files from Documentation/   to Documentation/security/, add Documentation/security/00-INDEX, and update all occurrences of Documentation/<moved_file>   to Documentation/security/<moved_file>.
2011-05-27 10:25:02 -07:00
Ingo Molnar
d6a72fe465 Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/urgent 2011-05-27 14:28:09 +02:00
Rakib Mullick
6f7bd76f05 kernel/profile.c: remove some duplicate code from profile_hits()
profile_hits() has a common check for prof_on and prof_buffer regardless
of SMP or !SMP.  So, remove some duplicate code by splitting profile_hits
into two.

[akpm@linux-foundation.org: make do_profile_hits static]
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:37 -07:00
Jiri Slaby
3864601387 mm: extract exe_file handling from procfs
Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
This was because exe_file was needed only for /proc/<pid>/exe.  Since we
will need the exe_file functionality also for core dumps (so core name can
contain full binary path), built this functionality always into the
kernel.

To achieve that move that out of proc FS to the kernel/ where in fact it
should belong.  By doing that we can make dup_mm_exe_file static.  Also we
can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:36 -07:00
Daniel Lezcano
a77aea9201 cgroup: remove the ns_cgroup
The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
leads to some problems:

  * cgroup creation is out-of-control
  * cgroup name can conflict when pids are looping
  * it is not possible to have a single process handling a lot of
    namespaces without falling in a exponential creation time
  * we may want to create a namespace without creating a cgroup

  The ns_cgroup was replaced by a compatibility flag 'clone_children',
  where a newly created cgroup will copy the parent cgroup values.
  The userspace has to manually create a cgroup and add a task to
  the 'tasks' file.

This patch removes the ns_cgroup as suggested in the following thread:

https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

The 'cgroup_clone' function is removed because it is no longer used.

This is a userspace-visible change.  Commit 45531757b4 ("cgroup: notify
ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
printk warning users that the feature is planned for removal.  Since that
time we have heard from XXX users who were affected by this.

Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jamal Hadi Salim <hadi@cyberus.ca>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Acked-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:34 -07:00
Ben Blum
d846687d7f cgroups: use flex_array in attach_proc
Convert cgroup_attach_proc to use flex_array.

The cgroup_attach_proc implementation requires a pre-allocated array to
store task pointers to atomically move a thread-group, but asking for a
monolithic array with kmalloc() may be unreliable for very large groups.
Using flex_array provides the same functionality with less risk of
failure.

This is a post-patch for cgroup-procs-write.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:34 -07:00
Ben Blum
74a1166dfe cgroups: make procs file writable
Make procs file writable to move all threads by tgid at once.

Add functionality that enables users to move all threads in a threadgroup
at once to a cgroup by writing the tgid to the 'cgroup.procs' file.  This
current implementation makes use of a per-threadgroup rwsem that's taken
for reading in the fork() path to prevent newly forking threads within the
threadgroup from "escaping" while the move is in progress.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:34 -07:00
Ben Blum
f780bdb7c1 cgroups: add per-thread subsystem callbacks
Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
for cgroups's subsystem interface.  Unlike can_attach and attach, these
are for per-thread operations, to be called potentially many times when
attaching an entire threadgroup.

Also, the old "bool threadgroup" interface is removed, as replaced by
this.  All subsystems are modified for the new interface - of note is
cpuset, which requires from/to nodemasks for attach to be globally scoped
(though per-cpuset would work too) to persist from its pre_attach to
attach_task and attach.

This is a pre-patch for cgroup-procs-writable.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:34 -07:00