Commit Graph

19122 Commits

Author SHA1 Message Date
Paul E. McKenney
3f95aa81d2 rcu: Make TASKS_RCU handle tasks that are almost done exiting
Once a task has passed exit_notify() in the do_exit() code path, it
is no longer on the task lists, and is therefore no longer visible
to rcu_tasks_kthread().  This means that an almost-exited task might
be preempted while within a trampoline, and this task won't be waited
on by rcu_tasks_kthread().  This commit fixes this bug by adding an
srcu_struct.  An exiting task does srcu_read_lock() just before calling
exit_notify(), and does the corresponding srcu_read_unlock() after
doing the final preempt_disable().  This means that rcu_tasks_kthread()
can do synchronize_srcu() to wait for all mostly-exited tasks to reach
their final preempt_disable() region, and then use synchronize_sched()
to wait for those tasks to finish exiting.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:22 -07:00
Paul E. McKenney
53c6d4edf8 rcu: Add synchronous grace-period waiting for RCU-tasks
It turns out to be easier to add the synchronous grace-period waiting
functions to RCU-tasks than to work around their absense in rcutorture,
so this commit adds them.  The key point is that the existence of
call_rcu_tasks() means that rcutorture needs an rcu_barrier_tasks().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:21 -07:00
Paul E. McKenney
bde6c3aa99 rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops
RCU-tasks requires the occasional voluntary context switch
from CPU-bound in-kernel tasks.  In some cases, this requires
instrumenting cond_resched().  However, there is some reluctance
to countenance unconditionally instrumenting cond_resched() (see
http://lwn.net/Articles/603252/), so this commit creates a separate
cond_resched_rcu_qs() that may be used in place of cond_resched() in
locations prone to long-duration in-kernel looping.

This commit currently instruments only RCU-tasks.  Future possibilities
include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
IPI usage.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:20 -07:00
Paul E. McKenney
8315f42295 rcu: Add call_rcu_tasks()
This commit adds a new RCU-tasks flavor of RCU, which provides
call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
context switch (not preemption!) and userspace execution (not the idle
loop -- use some sort of schedule_on_each_cpu() if you need to handle the
idle tasks.  Note that unlike other RCU flavors, these quiescent states
occur in tasks, not necessarily CPUs.  Includes fixes from Steven Rostedt.

This RCU flavor is assumed to have very infrequent latency-tolerant
updaters.  This assumption permits significant simplifications, including
a single global callback list protected by a single global lock, along
with a single task-private linked list containing all tasks that have not
yet passed through a quiescent state.  If experience shows this assumption
to be incorrect, the required additional complexity will be added.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:19 -07:00
Paul E. McKenney
38706bc5a2 rcutorture: Add callback-flood test
Although RCU is designed to handle arbitrary floods of callbacks, this
capability is not routinely tested.   This commit therefore adds a
cbflood capability in which kthreads repeatedly registers large numbers
of callbacks.  One such kthread is created for each four CPUs (rounding
up), and the test may be controlled by several cbflood_* kernel boot
parameters, which control the number of bursts per flood, the number
of callbacks per burst, the time between bursts, and the time between
floods.  The default values are large enough to exercise RCU's emergency
responses to callback flooding.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Miller <davem@davemloft.net>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-09-07 16:24:48 -07:00
Joe Perches
eea203fea3 rcu: Use pr_alert/pr_cont for printing logs
User pr_alert/pr_cont for printing the logs from rcutorture module directly
instead of writing it to a buffer and then printing it. This allows us from not
having to allocate such buffers. Also remove a resulting empty function.

I tested this using the parse-torture.sh script as follows:

$ dmesg | grep torture > log.txt
$ bash parse-torture.sh log.txt test
$

There were no warnings which means that parsing went fine.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:24:47 -07:00
Pranith Kumar
58ade2dbe9 rcutorture: Fix a sparse warning by marking boost_mutex static
This commit fixes the following sparse warning by marking boost_mutex
static:

kernel/rcu/rcutorture.c:185:1: warning: symbol 'boost_mutex' was not declared. Should it be static?

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-09-07 16:24:17 -07:00
Paul E. McKenney
73a860cd58 rcu: Replace flush_signals() with WARN_ON(signal_pending())
Currently, when RCU awakens from a wait_event_interruptible() that
might have awakened prematurely, it does a flush_signals(). This is
done on the off-chance that someone figured out how to deliver a signal
to a kthread, which is supposed to be impossible.  Given that this
is supposed to be impossible, this commit changes the flush_signals()
calls into WARN_ON(signal_pending()).

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:20 -07:00
Pranith Kumar
2aa792e6fa rcu: Use rcu_gp_kthread_wake() to wake up grace period kthreads
The rcu_gp_kthread_wake() function checks for three conditions before
waking up grace period kthreads:

*  Is the thread we are trying to wake up the current thread?
*  Are the gp_flags zero? (all threads wait on non-zero gp_flags condition)
*  Is there no thread created for this flavour, hence nothing to wake up?

If any one of these condition is true, we do not call wake_up().
It was found that there are quite a few avoidable wake ups both during
idle time and under stress induced by rcutorture.

Idle:

Total:66000, unnecessary:66000, case1:61827, case2:66000, case3:0
Total:68000, unnecessary:68000, case1:63696, case2:68000, case3:0

rcutorture:

Total:254000, unnecessary:254000, case1:199913, case2:254000, case3:0
Total:256000, unnecessary:256000, case1:201784, case2:256000, case3:0

Here case{1-3} are the cases listed above. We can avoid these wake
ups by using rcu_gp_kthread_wake() to conditionally wake up the grace
period kthreads.

There is a comment about an implied barrier supplied by the wake_up()
logic.  This barrier is necessary for the awakened thread to see the
updated ->gp_flags.  This flag is always being updated with the root node
lock held. Also, the awakened thread tries to acquire the root node lock
before reading ->gp_flags because of which there is proper ordering.

Hence this commit tries to avoid calling wake_up() whenever we can by
using rcu_gp_kthread_wake() function.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:19 -07:00
Paul E. McKenney
ade9862470 rcu: Make TINY_RCU tinier by putting error checks under #ifdef
The rcu_idle_enter_common() and rcu_idle_exit_common() functions contain
error checks that have to the best of my knowledge have never triggered
over the past several years.  These are nevertheless valuable when
creating new architectures or doing other low-level changes, so the
checks should not be deleted.  This commit instead places these checks
under #ifdef CONFIG_RCU_TRACE so that they are executed only when
specifically requested.

The savings are significant:

	Before:

	   text    data     bss     dec     hex filename
	   1749      39       0    1788     6fc /tmp/b/kernel/rcu/tiny.o
	    632     152       0     784     310 /tmp/b/kernel/rcu/update.o
				   ----
				   2572

	After:

	   text    data     bss     dec     hex filename
	   1281      37       0    1318     526 /tmp/b/kernel/rcu/tiny.o
	    632     152       0     784     310 /tmp/b/kernel/rcu/update.o
				   ----
				   2102

This amounts to 470 bytes, or 18% of the original.

Switched from #ifdef to IS_ENABLED() on Josh Triplett's advice.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-09-07 16:18:18 -07:00
Paul E. McKenney
9fdd3bc900 rcu: Break more call_rcu() deadlock involving scheduler and perf
Commit 96d3fd0d31 (rcu: Break call_rcu() deadlock involving scheduler
and perf) covered the case where __call_rcu_nocb_enqueue() needs to wake
the rcuo kthread due to the queue being initially empty, but did not
do anything for the case where the queue was overflowing.  This commit
therefore also defers wakeup for the overflow case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:17 -07:00
Pranith Kumar
66d701ea7e rcu: Remove stale comment in tree.c
This commit removes a stale comment in rcu/tree.c which was left
out when some code was moved around previously in commit 2036d94a7b
("rcu:  Rework detection of use of RCU by offline CPUs") For reference,
the following updated comment exists a few lines below this which means
the same:

/* Remove the outgoing CPU from the masks in the rcu_node hierarchy. */

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:16 -07:00
Pranith Kumar
fafb6e843f rcu: Update tiny.c references to tree.c
This commit updates the references to rcutree.c which is now rcu/tree.c

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:15 -07:00
Ard Biesheuvel
a8a29b3b7b rcu: Define tracepoint strings only if CONFIG_TRACING is set
Commit f7f7bac9cb ("rcu: Have the RCU tracepoints use the tracepoint_string
infrastructure") unconditionally populates the __tracepoint_str input section,
but this section is not assigned an output section if CONFIG_TRACING is not set.
This results in the __tracepoint_str turning up in unexpected places, i.e.,
after _edata.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:14 -07:00
Oleg Nesterov
85b39d305b rcu: Uninline rcu_read_lock_held()
This commit uninlines rcu_read_lock_held(). According to "size vmlinux"
this saves 28549 in .text:

	- 5541731 3014560 14757888 23314179
	+ 5513182 3026848 14757888 23297918

Note: it looks as if the data grows by 12288 bytes but this is not true,
it does not actually grow. But .data starts with ALIGN(THREAD_SIZE) and
since .text shrinks the padding grows, and thus .data grows too as it
seen by /bin/size. diff System.map:

	- ffffffff81510000 D _sdata
	- ffffffff81510000 D init_thread_union
	+ ffffffff81509000 D _sdata
	+ ffffffff8150c000 D init_thread_union

Perhaps we can change vmlinux.lds.S to .data itself, so that /bin/size
can't "wrongly" report that .data grows if .text shinks.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:13 -07:00
Pranith Kumar
e02b2edfa1 rcu: Use true/false instead of 1/0 for a bool type
This commit uses true/false instead of 1/0 for bool types in rcu_gp_fqs()
and force_qs_rnp().

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:12 -07:00
Pranith Kumar
d0bc90fd37 rcu: Return bool type for rcu_try_advance_all_cbs()
Return a bool type instead of 0 in rcu_try_advance_all_cbs().

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:10 -07:00
Pranith Kumar
f534ed1fd7 rcu: Use bool type for return value in rcu_is_watching()
Use a bool type for return in rcu_is_watching().

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:09 -07:00
Pranith Kumar
bf33eb1aef rcu: Fix sparse warning about rcu_batches_completed_preempt() being non-static
fix sparse warning about rcu_batches_completed_preempt() being non-static by
marking it as static

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:08 -07:00
Pranith Kumar
4de376a1b1 rcu: Remove remaining read-modify-write ACCESS_ONCE() calls
Change the remaining uses of ACCESS_ONCE() so that each ACCESS_ONCE() either does a load or a store, but not both.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:07 -07:00
Linus Torvalds
6fef37c9a7 Merge tag 'pm+acpi-3.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes from Rafael Wysocki:
 "These are regression fixes (ACPI sysfs, ACPI video, suspend test),
  ACPI cpuidle deadlock fix, missing runtime validation of ACPI _DSD
  output, a fix and a new CPU ID for the RAPL driver, new blacklist
  entry for the ACPI EC driver and a couple of trivial cleanups
  (intel_pstate and generic PM domains).

  Specifics:

   - Fix for recently broken test_suspend= command line argument (Rafael
     Wysocki).

   - Fixes for regressions related to the ACPI video driver caused by
     switching the default to native backlight handling in 3.16 from
     Hans de Goede.

   - Fix for a sysfs attribute of ACPI device objects that returns stale
     values sometimes due to the fact that they are cached instead of
     executing the appropriate method (_SUN) every time (broken in
     3.14).  From Yasuaki Ishimatsu.

   - Fix for a deadlock between cpuidle_lock and cpu_hotplug.lock in the
     ACPI processor driver from Jiri Kosina.

   - Runtime output validation for the ACPI _DSD device configuration
     object missing from the support for it that has been introduced
     recently.  From Mika Westerberg.

   - Fix for an unuseful and misleading RAPL (Running Average Power
     Limit) domain detection message in the RAPL driver from Jacob Pan.

   - New Intel Haswell CPU ID for the RAPL driver from Jason Baron.

   - New Clevo W350etq blacklist entry for the ACPI EC driver from Lan
     Tianyu.

   - Cleanup for the intel_pstate driver and the core generic PM domains
     code from Gabriele Mazzotta and Geert Uytterhoeven"

* tag 'pm+acpi-3.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / cpuidle: fix deadlock between cpuidle_lock and cpu_hotplug.lock
  ACPI / scan: not cache _SUN value in struct acpi_device_pnp
  cpufreq: intel_pstate: Remove unneeded variable
  powercap / RAPL: change domain detection message
  powercap / RAPL: add support for CPU model 0x3f
  PM / domains: Make generic_pm_domain.name const
  PM / sleep: Fix test_suspend= command line option
  ACPI / EC: Add msi quirk for Clevo W350etq
  ACPI / video: Disable native_backlight on HP ENVY 15 Notebook PC
  ACPI / video: Add a disable_native_backlight quirk
  ACPI / video: Fix use_native_backlight selection logic
  ACPICA: ACPI 5.1: Add support for runtime validation of _DSD package.
2014-09-07 11:57:27 -07:00
Linus Torvalds
81368f8bb8 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU fix from Ingo Molnar:
 "A boot hang fix for the offloaded callback RCU model (RCU_NOCB_CPU=y
  && (TREE_CPU=y || TREE_PREEMPT_RC)) in certain bootup scenarios"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu: Make nocb leader kthreads process pending callbacks after spawning
2014-09-07 10:51:42 -07:00
Linus Torvalds
ebc54f278f Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "Three fixlets from the timer departement:

   - Update the timekeeper before updating vsyscall and pvclock.  This
     fixes the kvm-clock regression reported by Chris and Paolo.

   - Use the proper irq work interface from NMI.  This fixes the
     regression reported by Catalin and Dave.

   - Clarify the compat_nanosleep error handling mechanism to avoid
     future confusion"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Update timekeeper before updating vsyscall and pvclock
  compat: nanosleep: Clarify error handling
  nohz: Restore NMI safe local irq work for local nohz kick
2014-09-07 10:37:48 -07:00
xiaofeng.yan
177ef2a631 sched/deadline: Fix a precision problem in the microseconds range
An overrun could happen in function start_hrtick_dl()
when a task with SCHED_DEADLINE runs in the microseconds
range.

For example, if a task with SCHED_DEADLINE has the following parameters:

  Task  runtime  deadline  period
   P1   200us     500us    500us

The deadline and period from task P1 are less than 1ms.

In order to achieve microsecond precision, we need to enable HRTICK feature
by the next command:

  PC#echo "HRTICK" > /sys/kernel/debug/sched_features
  PC#trace-cmd record -e sched_switch &
  PC#./schedtool -E -t 200000:500000:500000 -e ./test

The binary test is in an endless while(1) loop here.
Some pieces of trace.dat are as follows:

  <idle>-0   157.603157: sched_switch: :R ==> 2481:4294967295: test
  test-2481  157.603203: sched_switch:  2481:R ==> 0:120: swapper/2
  <idle>-0   157.605657: sched_switch:  :R ==> 2481:4294967295: test
  test-2481  157.608183: sched_switch:  2481:R ==> 2483:120: trace-cmd
  trace-cmd-2483 157.609656: sched_switch:2483:R==>2481:4294967295: test

We can get the runtime of P1 from the information above:

  runtime = 157.608183 - 157.605657
  runtime = 0.002526(2.526ms)

The correct runtime should be less than or equal to 200us at some point.

The problem is caused by a conditional judgment "delta > 10000"
in function start_hrtick_dl().

Because no hrtimer start up to control the rest of runtime
when the reset of runtime is less than 10us.

So the process will continue to run until tick-period is coming.

Move the code with the limit of the least time slice
from hrtick_start_fair() to hrtick_start() because the
EDF schedule class also needs this function in start_hrtick_dl().

To fix this problem, we call hrtimer_start() unconditionally in
start_hrtick_dl(), and make sure the scheduling slice won't be smaller
than 10us in hrtimer_start().

Signed-off-by: Xiaofeng Yan <xiaofeng.yan@huawei.com>
Reviewed-by: Li Zefan <lizefan@huawei.com>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409022941-5880-1-git-send-email-xiaofeng.yan@huawei.com
[ Massaged the changelog and the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-07 11:09:59 +02:00
Thomas Gleixner
9bf2419fa7 timekeeping: Update timekeeper before updating vsyscall and pvclock
The update_walltime() code works on the shadow timekeeper to make the
seqcount protected region as short as possible. But that update to the
shadow timekeeper does not update all timekeeper fields because it's
sufficient to do that once before it becomes life. One of these fields
is tkr.base_mono. That stays stale in the shadow timekeeper unless an
operation happens which copies the real timekeeper to the shadow.

The update function is called after the update calls to vsyscall and
pvclock. While not correct, it did not cause any problems because none
of the invoked update functions used base_mono.

commit cbcf2dd3b3 (x86: kvm: Make kvm_get_time_and_clockread()
nanoseconds based) changed that in the kvm pvclock update function, so
the stale mono_base value got used and caused kvm-clock to malfunction.

Put the update where it belongs and fix the issue.

Reported-by: Chris J Arges <chris.j.arges@canonical.com>
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409050000570.3333@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-09-06 12:58:18 +02:00