Commit Graph

17836 Commits

Author SHA1 Message Date
Linus Torvalds
06eb4cc2e7 Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull more cgroup fixes from Tejun Heo:
 "Three more patches to fix cgroup_freezer breakage due to the recent
  cgroup internal locking changes - an operation cgroup_freezer was
  using now requires sleepable context and cgroup_freezer was invoking
  that while holding a spin lock.  cgroup_freezer was using an overly
  elaborate hierarchical locking scheme.

  While it's possible to convert the hierarchical spinlocks directly to
  mutexes, this patch simplifies the overall locking so that it uses a
  global mutex.  This has the added benefit of avoiding iterating
  potentially huge number of tasks under a spinlock.  While the patch is
  on the larger side in the devel cycle, the changes made are mostly
  straight-forward and the locking logic is a lot simpler afterwards"

* 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix rcu_read_lock() leak in update_if_frozen()
  cgroup_freezer: replace freezer->lock with freezer_mutex
  cgroup: introduce task_css_is_root()
2014-05-21 18:36:40 +09:00
Linus Torvalds
95d08585e0 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "A single bug fix for a long standing issue:

   - Updating the expiry value of a relative timer _after_ letting the
     idle logic select a target cpu for the timer based on its stale
     expiry value is outright stupid.  Thanks to Viresh for spotting the
     brainfart"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Set expiry time before switch_hrtimer_base()
2014-05-20 14:19:10 +09:00
Tejun Heo
36e9d2ebcc cgroup: fix rcu_read_lock() leak in update_if_frozen()
While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer:
replace freezer->lock with freezer_mutex") introduced a bug in
update_if_frozen() where it returns with rcu_read_lock() held.  Fix it
by adding rcu_read_unlock() before returning.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
2014-05-13 11:28:30 -04:00
Tejun Heo
e5ced8ebb1 cgroup_freezer: replace freezer->lock with freezer_mutex
After 96d365e0b8 ("cgroup: make css_set_lock a rwsem and rename it
to css_set_rwsem"), css task iterators requires sleepable context as
it may block on css_set_rwsem.  I missed that cgroup_freezer was
iterating tasks under IRQ-safe spinlock freezer->lock.  This leads to
errors like the following on freezer state reads and transitions.

  BUG: sleeping function called from invalid context at /work
 /os/work/kernel/locking/rwsem.c:20
  in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
  5 locks held by bash/462:
   #0:  (sb_writers#7){.+.+.+}, at: [<ffffffff811f0843>] vfs_write+0x1a3/0x1c0
   #1:  (&of->mutex){+.+.+.}, at: [<ffffffff8126d78b>] kernfs_fop_write+0xbb/0x170
   #2:  (s_active#70){.+.+.+}, at: [<ffffffff8126d793>] kernfs_fop_write+0xc3/0x170
   #3:  (freezer_mutex){+.+...}, at: [<ffffffff81135981>] freezer_write+0x61/0x1e0
   #4:  (rcu_read_lock){......}, at: [<ffffffff81135973>] freezer_write+0x53/0x1e0
  Preemption disabled at:[<ffffffff81104404>] console_unlock+0x1e4/0x460

  CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
   ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
   ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
   0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
  Call Trace:
   [<ffffffff81cf8c96>] dump_stack+0x4e/0x7a
   [<ffffffff810cf4f2>] __might_sleep+0x162/0x260
   [<ffffffff81d05974>] down_read+0x24/0x60
   [<ffffffff81133e87>] css_task_iter_start+0x27/0x70
   [<ffffffff8113584d>] freezer_apply_state+0x5d/0x130
   [<ffffffff81135a16>] freezer_write+0xf6/0x1e0
   [<ffffffff8112eb88>] cgroup_file_write+0xd8/0x230
   [<ffffffff8126d7b7>] kernfs_fop_write+0xe7/0x170
   [<ffffffff811f0756>] vfs_write+0xb6/0x1c0
   [<ffffffff811f121d>] SyS_write+0x4d/0xc0
   [<ffffffff81d08292>] system_call_fastpath+0x16/0x1b

freezer->lock used to be used in hot paths but that time is long gone
and there's no reason for the lock to be IRQ-safe spinlock or even
per-cgroup.  In fact, given the fact that a cgroup may contain large
number of tasks, it's not a good idea to iterate over them while
holding IRQ-safe spinlock.

Let's simplify locking by replacing per-cgroup freezer->lock with
global freezer_mutex.  This also makes the comments explaining the
intricacies of policy inheritance and the locking around it as the
states are protected by a common mutex.

The conversion is mostly straight-forward.  The followings are worth
mentioning.

* freezer_css_online() no longer needs double locking.

* freezer_attach() now performs propagation simply while holding
  freezer_mutex.  update_if_frozen() race no longer exists and the
  comment is removed.

* freezer_fork() now tests whether the task is in root cgroup using
  the new task_css_is_root() without doing rcu_read_lock/unlock().  If
  not, it grabs freezer_mutex and performs the operation.

* freezer_read() and freezer_change_state() grab freezer_mutex across
  the whole operation and pin the css while iterating so that each
  descendant processing happens in sleepable context.

Fixes: 96d365e0b8 ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 11:26:31 -04:00
Tejun Heo
5024ae29cd cgroup: introduce task_css_is_root()
Determining the css of a task usually requires RCU read lock as that's
the only thing which keeps the returned css accessible till its
reference is acquired; however, testing whether a task belongs to the
root can be performed without dereferencing the returned css by
comparing the returned pointer against the root one in init_css_set[]
which never changes.

Implement task_css_is_root() which can be invoked in any context.
This will be used by the scheduled cgroup_freezer change.

v2: cgroup no longer supports modular controllers.  No need to export
    init_css_set.  Pointed out by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 11:26:27 -04:00
Linus Torvalds
efb2b1d5fd Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
 "Fixes for two bugs in workqueue.

  One is exiting with internal mutex held in a failure path of
  wq_update_unbound_numa().  The other is a subtle and unlikely
  use-after-possible-last-put in the rescuer logic.  Both have been
  around for quite some time now and are unlikely to have triggered
  noticeably often.  All patches are marked for -stable backport"

* 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix a possible race condition between rescuer and pwq-release
  workqueue: make rescuer_thread() empty wq->maydays list before exiting
  workqueue: fix bugs in wq_update_unbound_numa() failure path
2014-05-13 11:24:07 +09:00
Linus Torvalds
26a41cd1ee Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "During recent restructuring, device_cgroup unified config input check
  and enforcement logic; unfortunately, it turned out to share too much.
  Aristeu's patches fix the breakage and marked for -stable backport.

  The other two patches are fallouts from kernfs conversion.  The blkcg
  change is temporary and will go away once kernfs internal locking gets
  simplified (patches pending)"

* 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats()
  device_cgroup: check if exception removal is allowed
  device_cgroup: fix the comment format for recently added functions
  device_cgroup: rework device access check and exception checking
  cgroup: fix the retry path of cgroup_mount()
2014-05-13 11:22:57 +09:00
Viresh Kumar
84ea7fe379 hrtimer: Set expiry time before switch_hrtimer_base()
switch_hrtimer_base() calls hrtimer_check_target() which ensures that
we do not migrate a timer to a remote cpu if the timer expires before
the current programmed expiry time on that remote cpu.

But __hrtimer_start_range_ns() calls switch_hrtimer_base() before the
new expiry time is set. So the sanity check in hrtimer_check_target()
is operating on stale or even uninitialized data.

Update expiry time before calling switch_hrtimer_base().

[ tglx: Rewrote changelog once again ]

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Cc: linaro-networking@linaro.org
Cc: fweisbec@gmail.com
Cc: arvind.chauhan@arm.com
Link: http://lkml.kernel.org/r/81999e148745fc51bbcd0615823fbab9b2e87e23.1399882253.git.viresh.kumar@linaro.org
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-12 10:47:39 +02:00
Linus Torvalds
181da3c34a Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:
 "A somewhat unpleasantly large collection of small fixes.  The big ones
  are the __visible tree sweep and a fix for 'earlyprintk=efi,keep'.  It
  was using __init functions with predictably suboptimal results.

  Another key fix is a build fix which would produce output that simply
  would not decompress correctly in some configuration, due to the
  existing Makefiles picking up an unfortunate local label and mistaking
  it for the global symbol _end.

  Additional fixes include the handling of 64-bit numbers when setting
  the vdso data page (a latent bug which became manifest when i386
  started exporting a vdso with time functions), a fix to the new MSR
  manipulation accessors which would cause features to not get properly
  unblocked, a build fix for 32-bit userland, and a few new platform
  quirks"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, vdso, time: Cast tv_nsec to u64 for proper shifting in update_vsyscall()
  x86: Fix typo in MSR_IA32_MISC_ENABLE_LIMIT_CPUID macro
  x86: Fix typo preventing msr_set/clear_bit from having an effect
  x86/intel: Add quirk to disable HPET for the Baytrail platform
  x86/hpet: Make boot_hpet_disable extern
  x86-64, build: Fix stack protector Makefile breakage with 32-bit userland
  x86/reboot: Add reboot quirk for Certec BPC600
  asmlinkage: Add explicit __visible to drivers/*, lib/*, kernel/*
  asmlinkage, x86: Add explicit __visible to arch/x86/*
  asmlinkage: Revert "lto: Make asmlinkage __visible"
  x86, build: Don't get confused by local symbols
  x86/efi: earlyprintk=efi,keep fix
2014-05-09 12:24:20 -07:00
Linus Torvalds
f322e26238 Merge tag 'trace-fixes-v3.15-rc4-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
 "This contains two fixes.

  The first is a long standing bug that causes bogus data to show up in
  the refcnt field of the module_refcnt tracepoint.  It was introduced
  by a merge conflict resolution back in 2.6.35-rc days.

  The result should be 'refcnt = incs - decs', but instead it did
  'refcnt = incs + decs'.

  The second fix is to a bug that was introduced in this merge window
  that allowed for a tracepoint funcs pointer to be used after it was
  freed.  Moving the location of where the probes are released solved
  the problem"

* tag 'trace-fixes-v3.15-rc4-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracepoint: Fix use of tracepoint funcs after rcu free
  trace: module: Maintain a valid user count
2014-05-08 14:17:13 -07:00
Mathieu Desnoyers
8058bd0faa tracepoint: Fix use of tracepoint funcs after rcu free
Commit de7b297390 "tracepoint: Use struct pointer instead of name hash
for reg/unreg tracepoints" introduces a use after free by calling
release_probes on the old struct tracepoint array before the newly
allocated array is published with rcu_assign_pointer. There is a race
window where tracepoints (RCU readers) can perform a
"use-after-grace-period-after-free", which shows up as a GPF in
stress-tests.

Link: http://lkml.kernel.org/r/53698021.5020108@oracle.com
Link: http://lkml.kernel.org/p/1399549669-25465-1-git-send-email-mathieu.desnoyers@efficios.com

Reported-by: Sasha Levin <sasha.levin@oracle.com>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Dave Jones <davej@redhat.com>
Fixes: de7b297390 "tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints"
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-08 09:10:56 -04:00
Andi Kleen
722a9f9299 asmlinkage: Add explicit __visible to drivers/*, lib/*, kernel/*
As requested by Linus add explicit __visible to the asmlinkage users.
This marks functions visible to assembler.

Tree sweep for rest of tree.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1398984278-29319-4-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-05 16:07:46 -07:00
Linus Torvalds
2080cee435 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) e1000e computes header length incorrectly wrt vlans, fix from Vlad
    Yasevich.

 2) ns_capable() check in sock_diag netlink code, from Andrew
    Lutomirski.

 3) Fix invalid queue pairs handling in virtio_net, from Amos Kong.

 4) Checksum offloading busted in sxgbe driver due to incorrect
    descriptor layout, fix from Byungho An.

 5) Fix build failure with SMC_DEBUG set to 2 or larger, from Zi Shen
    Lim.

 6) Fix uninitialized A and X registers in BPF interpreter, from Alexei
    Starovoitov.

 7) Fix arch dependencies of candence driver.

 8) Fix netlink capabilities checking tree-wide, from Eric W Biederman.

 9) Don't dump IFLA_VF_PORTS if netlink request didn't ask for it in
    IFLA_EXT_MASK, from David Gibson.

10) IPV6 FIB dump restart doesn't handle table changes that happen
    meanwhile, causing the code to loop forever or emit dups, fix from
    Kumar Sandararajan.

11) Memory leak on VF removal in bnx2x, from Yuval Mintz.

12) Bug fixes for new Altera TSE driver from Vince Bridgers.

13) Fix route lookup key in SCTP, from Xugeng Zhang.

14) Use BH blocking spinlocks in SLIP, as per a similar fix to CAN/SLCAN
    driver.  From Oliver Hartkopp.

15) TCP doesn't bump retransmit counters in some code paths, fix from
    Eric Dumazet.

16) Clamp delayed_ack in tcp_cubic to prevent theoretical divides by
    zero.  Fix from Liu Yu.

17) Fix locking imbalance in error paths of HHF packet scheduler, from
    John Fastabend.

18) Properly reference the transport module when vsock_core_init() runs,
    from Andy King.

19) Fix buffer overflow in cdc_ncm driver, from Bjørn Mork.

20) IP_ECN_decapsulate() doesn't see a correct SKB network header in
    ip_tunnel_rcv(), fix from Ying Cai.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (132 commits)
  net: macb: Fix race between HW and driver
  net: macb: Remove 'unlikely' optimization
  net: macb: Re-enable RX interrupt only when RX is done
  net: macb: Clear interrupt flags
  net: macb: Pass same size to DMA_UNMAP as used for DMA_MAP
  ip_tunnel: Set network header properly for IP_ECN_decapsulate()
  e1000e: Restrict MDIO Slow Mode workaround to relevant parts
  e1000e: Fix issue with link flap on 82579
  e1000e: Expand workaround for 10Mb HD throughput bug
  e1000e: Workaround for dropped packets in Gig/100 speeds on 82579
  net/mlx4_core: Don't issue PCIe speed/width checks for VFs
  net/mlx4_core: Load the Eth driver first
  net/mlx4_core: Fix slave id computation for single port VF
  net/mlx4_core: Adjust port number in qp_attach wrapper when detaching
  net: cdc_ncm: fix buffer overflow
  Altera TSE: ALTERA_TSE should depend on HAS_DMA
  vsock: Make transport the proto owner
  net: sched: lock imbalance in hhf qdisc
  net: mvmdio: Check for a valid interrupt instead of an error
  net phy: Check for aneg completion before setting state to PHY_RUNNING
  ...
2014-05-05 15:59:46 -07:00
Linus Torvalds
0384dcae2b Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "This udpate delivers:

   - A fix for dynamic interrupt allocation on x86 which is required to
     exclude the GSI interrupts from the dynamic allocatable range.

     This was detected with the newfangled tablet SoCs which have GPIOs
     and therefor allocate a range of interrupts.  The MSI allocations
     already excluded the GSI range, so we never noticed before.

   - The last missing set_irq_affinity() repair, which was delayed due
     to testing issues

   - A few bug fixes for the armada SoC interrupt controller

   - A memory allocation fix for the TI crossbar interrupt controller

   - A trivial kernel-doc warning fix"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip: irq-crossbar: Not allocating enough memory
  irqchip: armanda: Sanitize set_irq_affinity()
  genirq: x86: Ensure that dynamic irq allocation does not conflict
  linux/interrupt.h: fix new kernel-doc warnings
  irqchip: armada-370-xp: Fix releasing of MSIs
  irqchip: armada-370-xp: implement the ->check_device() msi_chip operation
  irqchip: armada-370-xp: fix invalid cast of signed value into unsigned variable
2014-05-03 08:32:48 -07:00
Linus Torvalds
98facf0e1e Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "This update brings along:

   - Two fixes for long standing bugs in the hrtimer code, one which
     prevents remote enqueuing and the other preventing arbitrary delays
     after a interrupt hang was detected

   - A fix in the timer wheel which prevents math overflow

   - A fix for a long standing issue with the architected ARM timer
     related to the C3STOP mechanism.

   - A trivial compile fix for nspire SoC clocksource"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timer: Prevent overflow in apply_slack
  hrtimer: Prevent remote enqueue of leftmost timers
  hrtimer: Prevent all reprogramming if hang detected
  clocksource: nspire: Fix compiler warning
  clocksource: arch_arm_timer: Fix age-old arch timer C3STOP detection issue
2014-05-03 08:31:45 -07:00
Linus Torvalds
00622e61ed Merge tag 'trace-fixes-v3.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
 "This is a small fix where the trigger code used the wrong
  rcu_dereference().  It required rcu_dereference_sched() instead of the
  normal rcu_dereference().  It produces a nasty RCU lockdep splat due
  to the incorrect rcu notation"

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* tag 'trace-fixes-v3.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Use rcu_dereference_sched() for trace event triggers
2014-05-03 08:30:44 -07:00
Steven Rostedt (Red Hat)
561a4fe851 tracing: Use rcu_dereference_sched() for trace event triggers
As trace event triggers are now part of the mainline kernel, I added
my trace event trigger tests to my test suite I run on all my kernels.
Now these tests get run under different config options, and one of
those options is CONFIG_PROVE_RCU, which checks under lockdep that
the rcu locking primitives are being used correctly. This triggered
the following splat:

===============================
[ INFO: suspicious RCU usage. ]
3.15.0-rc2-test+ #11 Not tainted
-------------------------------
kernel/trace/trace_events_trigger.c:80 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
4 locks held by swapper/1/0:
 #0:  ((&(&j_cdbs->work)->timer)){..-...}, at: [<ffffffff8104d2cc>] call_timer_fn+0x5/0x1be
 #1:  (&(&pool->lock)->rlock){-.-...}, at: [<ffffffff81059856>] __queue_work+0x140/0x283
 #2:  (&p->pi_lock){-.-.-.}, at: [<ffffffff8106e961>] try_to_wake_up+0x2e/0x1e8
 #3:  (&rq->lock){-.-.-.}, at: [<ffffffff8106ead3>] try_to_wake_up+0x1a0/0x1e8

stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.15.0-rc2-test+ #11
Hardware name:                  /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
 0000000000000001 ffff88007e083b98 ffffffff819f53a5 0000000000000006
 ffff88007b0942c0 ffff88007e083bc8 ffffffff81081307 ffff88007ad96d20
 0000000000000000 ffff88007af2d840 ffff88007b2e701c ffff88007e083c18
Call Trace:
 <IRQ>  [<ffffffff819f53a5>] dump_stack+0x4f/0x7c
 [<ffffffff81081307>] lockdep_rcu_suspicious+0x107/0x110
 [<ffffffff810ee51c>] event_triggers_call+0x99/0x108
 [<ffffffff810e8174>] ftrace_event_buffer_commit+0x42/0xa4
 [<ffffffff8106aadc>] ftrace_raw_event_sched_wakeup_template+0x71/0x7c
 [<ffffffff8106bcbf>] ttwu_do_wakeup+0x7f/0xff
 [<ffffffff8106bd9b>] ttwu_do_activate.constprop.126+0x5c/0x61
 [<ffffffff8106eadf>] try_to_wake_up+0x1ac/0x1e8
 [<ffffffff8106eb77>] wake_up_process+0x36/0x3b
 [<ffffffff810575cc>] wake_up_worker+0x24/0x26
 [<ffffffff810578bc>] insert_work+0x5c/0x65
 [<ffffffff81059982>] __queue_work+0x26c/0x283
 [<ffffffff81059999>] ? __queue_work+0x283/0x283
 [<ffffffff810599b7>] delayed_work_timer_fn+0x1e/0x20
 [<ffffffff8104d3a6>] call_timer_fn+0xdf/0x1be^M
 [<ffffffff8104d2cc>] ? call_timer_fn+0x5/0x1be
 [<ffffffff81059999>] ? __queue_work+0x283/0x283
 [<ffffffff8104d823>] run_timer_softirq+0x1a4/0x22f^M
 [<ffffffff8104696d>] __do_softirq+0x17b/0x31b^M
 [<ffffffff81046d03>] irq_exit+0x42/0x97
 [<ffffffff81a08db6>] smp_apic_timer_interrupt+0x37/0x44
 [<ffffffff81a07a2f>] apic_timer_interrupt+0x6f/0x80
 <EOI>  [<ffffffff8100a5d8>] ? default_idle+0x21/0x32
 [<ffffffff8100a5d6>] ? default_idle+0x1f/0x32
 [<ffffffff8100ac10>] arch_cpu_idle+0xf/0x11
 [<ffffffff8107b3a4>] cpu_startup_entry+0x1a3/0x213
 [<ffffffff8102a23c>] start_secondary+0x212/0x219

The cause is that the triggers are protected by rcu_read_lock_sched() but
the data is dereferenced with rcu_dereference() which expects it to
be protected with rcu_read_lock(). The proper reference should be
rcu_dereference_sched().

Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-02 23:12:42 -04:00
Linus Torvalds
60b88f3941 Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module fixes from Rusty Russell:
 "Fixed one missing place for the new taint flag, and remove a warning
  giving only false positives (now we finally figured out why)"

* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  module: remove warning about waiting module removal.
  Fix: tracing: use 'E' instead of 'X' for unsigned module taint flag
2014-05-01 10:35:01 -07:00
Jiri Bohac
98a01e779f timer: Prevent overflow in apply_slack
On architectures with sizeof(int) < sizeof (long), the
computation of mask inside apply_slack() can be undefined if the
computed bit is > 32.

E.g. with: expires = 0xffffe6f5 and slack = 25, we get:

expires_limit = 0x20000000e
bit = 33
mask = (1 << 33) - 1  /* undefined */

On x86, mask becomes 1 and and the slack is not applied properly.
On s390, mask is -1, expires is set to 0 and the timer fires immediately.

Use 1UL << bit to solve that issue.

Suggested-by: Deborah Townsend <dstownse@us.ibm.com>
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20140418152310.GA13654@midget.suse.cz
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-30 13:46:17 +02:00
Leon Ma
012a45e3f4 hrtimer: Prevent remote enqueue of leftmost timers
If a cpu is idle and starts an hrtimer which is not pinned on that
same cpu, the nohz code might target the timer to a different cpu.

In the case that we switch the cpu base of the timer we already have a
sanity check in place, which determines whether the timer is earlier
than the current leftmost timer on the target cpu. In that case we
enqueue the timer on the current cpu because we cannot reprogram the
clock event device on the target.

If the timers base is already the target CPU we do not have this
sanity check in place so we enqueue the timer as the leftmost timer in
the target cpus rb tree, but we cannot reprogram the clock event
device on the target cpu. So the timer expires late and subsequently
prevents the reprogramming of the target cpu clock event device until
the previously programmed event fires or a timer with an earlier
expiry time gets enqueued on the target cpu itself.

Add the same target check as we have for the switch base case and
start the timer on the current cpu if it would become the leftmost
timer on the target.

[ tglx: Rewrote subject and changelog ]

Signed-off-by: Leon Ma <xindong.ma@intel.com>
Link: http://lkml.kernel.org/r/1398847391-5994-1-git-send-email-xindong.ma@intel.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-30 12:34:51 +02:00
Stuart Hayes
6c6c0d5a1c hrtimer: Prevent all reprogramming if hang detected
If the last hrtimer interrupt detected a hang it sets hang_detected=1
and programs the clock event device with a delay to let the system
make progress.

If hang_detected == 1, we prevent reprogramming of the clock event
device in hrtimer_reprogram() but not in hrtimer_force_reprogram().

This can lead to the following situation:

hrtimer_interrupt()
   hang_detected = 1;
   program ce device to Xms from now (hang delay)

We have two timers pending:
   T1 expires 50ms from now
   T2 expires 5s from now

Now T1 gets canceled, which causes hrtimer_force_reprogram() to be
invoked, which in turn programs the clock event device to T2 (5
seconds from now).

Any hrtimer_start after that will not reprogram the hardware due to
hang_detected still being set. So we effectivly block all timers until
the T2 event fires and cleans up the hang situation.

Add a check for hang_detected to hrtimer_force_reprogram() which
prevents the reprogramming of the hang delay in the hardware
timer. The subsequent hrtimer_interrupt will resolve all outstanding
issues.

[ tglx: Rewrote subject and changelog and fixed up the comment in
  	hrtimer_force_reprogram() ]

Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com>
Link: http://lkml.kernel.org/r/53602DC6.2060101@gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-30 12:34:51 +02:00
Linus Torvalds
2aafe1a4d4 Merge tag 'trace-fixes-v3.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace bugfix from Steven Rostedt:
 "Takao Indoh reported that he was able to cause a ftrace bug while
  loading a module and enabling function tracing at the same time.

  He uncovered a race where the module when loaded will convert the
  calls to mcount into nops, and expects the module's text to be RW.
  But when function tracing is enabled, it will convert all kernel text
  (core and module) from RO to RW to convert the nops to calls to ftrace
  to record the function.  After the convertion, it will convert all the
  text back from RW to RO.

  The issue is, it will also convert the module's text that is loading.
  If it converts it to RO before ftrace does its conversion, it will
  cause ftrace to fail and require a reboot to fix it again.

  This patch moves the ftrace module update that converts calls to
  mcount into nops to be done when the module state is still
  MODULE_STATE_UNFORMED.  This will ignore the module when the text is
  being converted from RW back to RO"

* tag 'trace-fixes-v3.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace/module: Hardcode ftrace_module_init() call into load_module()
2014-04-28 16:57:51 -07:00
Steven Rostedt (Red Hat)
a949ae560a ftrace/module: Hardcode ftrace_module_init() call into load_module()
A race exists between module loading and enabling of function tracer.

	CPU 1				CPU 2
	-----				-----
  load_module()
   module->state = MODULE_STATE_COMING

				register_ftrace_function()
				 mutex_lock(&ftrace_lock);
				 ftrace_startup()
				  update_ftrace_function();
				   ftrace_arch_code_modify_prepare()
				    set_all_module_text_rw();
				   <enables-ftrace>
				    ftrace_arch_code_modify_post_process()
				     set_all_module_text_ro();

				[ here all module text is set to RO,
				  including the module that is
				  loading!! ]

   blocking_notifier_call_chain(MODULE_STATE_COMING);
    ftrace_init_module()

     [ tries to modify code, but it's RO, and fails!
       ftrace_bug() is called]

When this race happens, ftrace_bug() will produces a nasty warning and
all of the function tracing features will be disabled until reboot.

The simple solution is to treate module load the same way the core
kernel is treated at boot. To hardcode the ftrace function modification
of converting calls to mcount into nops. This is done in init/main.c
there's no reason it could not be done in load_module(). This gives
a better control of the changes and doesn't tie the state of the
module to its notifiers as much. Ftrace is special, it needs to be
treated as such.

The reason this would work, is that the ftrace_module_init() would be
called while the module is in MODULE_STATE_UNFORMED, which is ignored
by the set_all_module_text_ro() call.

Link: http://lkml.kernel.org/r/1395637826-3312-1-git-send-email-indou.takao@jp.fujitsu.com

Reported-by: Takao Indoh <indou.takao@jp.fujitsu.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@vger.kernel.org # 2.6.38+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-04-28 10:37:21 -04:00
Thomas Gleixner
62a08ae2a5 genirq: x86: Ensure that dynamic irq allocation does not conflict
On x86 the allocation of irq descriptors may allocate interrupts which
are in the range of the GSI interrupts. That's wrong as those
interrupts are hardwired and we don't have the irq domain translation
like PPC. So one of these interrupts can be hooked up later to one of
the devices which are hard wired to it and the io_apic init code for
that particular interrupt line happily reuses that descriptor with a
completely different configuration so hell breaks lose.

Inside x86 we allocate dynamic interrupts from above nr_gsi_irqs,
except for a few usage sites which have not yet blown up in our face
for whatever reason. But for drivers which need an irq range, like the
GPIO drivers, we have no limit in place and we don't want to expose
such a detail to a driver.

To cure this introduce a function which an architecture can implement
to impose a lower bound on the dynamic interrupt allocations.

Implement it for x86 and set the lower bound to nr_gsi_irqs, which is
the end of the hardwired interrupt space, so all dynamic allocations
happen above.

That not only allows the GPIO driver to work sanely, it also protects
the bogus callsites of create_irq_nr() in hpet, uv, irq_remapping and
htirq code. They need to be cleaned up as well, but that's a separate
issue.

Reported-by: Jin Yao <yao.jin@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Mathias Nyman <mathias.nyman@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Krogerus Heikki <heikki.krogerus@intel.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1404241617360.28206@ionos.tec.linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-28 12:20:00 +02:00
Rusty Russell
79465d2fd4 module: remove warning about waiting module removal.
We remove the waiting module removal in commit 3f2b9c9cdf (September
2013), but it turns out that modprobe in kmod (< version 16) was
asking for waiting module removal.  No one noticed since modprobe would
check for 0 usage immediately before trying to remove the module, and
the race is unlikely.

However, it means that anyone running old (but not ancient) kmod
versions is hitting the printk designed to see if anyone was running
"rmmod -w".  All reports so far have been false positives, so remove
the warning.

Fixes: 3f2b9c9cdf
Reported-by: Valerio Vanni <valerio.vanni@inwind.it>
Cc: Elliott, Robert (Server Storage) <Elliott@hp.com>
Cc: stable@kernel.org
Acked-by: Lucas De Marchi <lucas.de.marchi@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2014-04-28 11:06:59 +09:30