Commit Graph

29402 Commits

Author SHA1 Message Date
Steven Rostedt (VMware)
f79b3f3385 ftrace: Allow enabling of filters via index of available_filter_functions
Enabling of large number of functions by echoing in a large subset of the
functions in available_filter_functions can take a very long time. The
process requires testing all functions registered by the function tracer
(which is in the 10s of thousands), and doing a kallsyms lookup to convert
the ip address into a name, then comparing that name with the string passed
in.

When a function causes the function tracer to crash the system, a binary
bisect of the available_filter_functions can be done to find the culprit.
But this requires passing in half of the functions in
available_filter_functions over and over again, which makes it basically a
O(n^2) operation. With 40,000 functions, that ends up bing 1,600,000,000
opertions! And enabling this can take over 20 minutes.

As a quick speed up, if a number is passed into one of the filter files,
instead of doing a search, it just enables the function at the corresponding
line of the available_filter_functions file. That is:

 # echo 50 > set_ftrace_filter
 # cat set_ftrace_filter
 x86_pmu_commit_txn

 # head -50 available_filter_functions | tail -1
 x86_pmu_commit_txn

This allows setting of half the available_filter_functions to take place in
less than a second!

 # time seq 20000 > set_ftrace_filter
 real    0m0.042s
 user    0m0.005s
 sys     0m0.015s

 # wc -l set_ftrace_filter
 20000 set_ftrace_filter

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-02-15 13:10:09 -05:00
Changbin Du
85acbb21b9 tracing: Change the function format to display function names by perf
Here is an example for this change.

$ sudo perf record -e 'ftrace:function' --filter='ip==schedule'
$ sudo perf report

The output of perf before this patch:

\# Samples: 100  of event 'ftrace:function'
\# Event count (approx.): 100
\#
\# Overhead  Trace output
\# ........  ......................................
\#
    51.00%   ffffffff81f6aaa0 <-- ffffffff81158e8d
    29.00%   ffffffff81f6aaa0 <-- ffffffff8116ccb2
     8.00%   ffffffff81f6aaa0 <-- ffffffff81f6f2ed
     4.00%   ffffffff81f6aaa0 <-- ffffffff811628db
     4.00%   ffffffff81f6aaa0 <-- ffffffff81f6ec5b
     2.00%   ffffffff81f6aaa0 <-- ffffffff81f6f21a
     1.00%   ffffffff81f6aaa0 <-- ffffffff811b04af
     1.00%   ffffffff81f6aaa0 <-- ffffffff8143ce17

After this patch:

\# Samples: 36  of event 'ftrace:function'
\# Event count (approx.): 36
\#
\# Overhead  Trace output
\# ........  ............................................
\#
    38.89%   schedule <-- schedule_hrtimeout_range_clock
    27.78%   schedule <-- worker_thread
    13.89%   schedule <-- schedule_timeout
    11.11%   schedule <-- smpboot_thread_fn
     5.56%   schedule <-- rcu_gp_kthread
     2.78%   schedule <-- exit_to_usermode_loop

Link: http://lkml.kernel.org/r/20190209161919.32350-1-changbin.du@gmail.com

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-02-11 14:53:43 -05:00
Miroslav Benes
d325c40296 ring-buffer: Remove unused function ring_buffer_page_len()
Commit 6b7e633fe9 ("tracing: Remove extra zeroing out of the ring
buffer page") removed the only caller of ring_buffer_page_len(). The
function is now unused and may be removed.

Link: http://lkml.kernel.org/r/20181228133847.106177-1-mbenes@suse.cz

Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-02-06 11:58:33 -05:00
Changbin Du
f52d569f3d tracing: Show stacktrace for wakeup tracers
This align the behavior of wakeup tracers with irqsoff latency tracer
that we record stacktrace at the beginning and end of waking up. The
stacktrace shows us what is happening in the kernel.

Link: http://lkml.kernel.org/r/20190116160249.7554-1-changbin.du@gmail.com

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-02-06 11:56:19 -05:00
Changbin Du
afbab501c6 tracing: Put a margin between flags and duration for wakeup tracers
Don't mix context flags with function duration info.

Instead of this:

 # tracer: wakeup_rt
 #
 # wakeup_rt latency trace v1.1.5 on 5.0.0-rc1-test+
 # --------------------------------------------------------------------
 # latency: 177 us, #545/545, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:8)
 #    -----------------
 #    | task: migration/0-11 (uid:0 nice:0 policy:1 rt_prio:99)
 #    -----------------
 #
 #                                       _-----=> irqs-off
 #                                      / _----=> need-resched
 #                                     | / _---=> hardirq/softirq
 #                                     || / _--=> preempt-depth
 #                                     ||| /
 #   REL TIME      CPU  TASK/PID       ||||  DURATION                  FUNCTION CALLS
 #      |          |     |    |        ||||   |   |                     |   |   |   |
         0 us |   0)    <idle>-0    |  dNh5              |  /*      0:120:R   + [000]    11:  0:R migration/0 */
         2 us |   0)    <idle>-0    |  dNh5  0.000 us    |            (null)();
         4 us |   0)    <idle>-0    |  dNh4              |  _raw_spin_unlock() {
         4 us |   0)    <idle>-0    |  dNh4  0.304 us    |    preempt_count_sub();
         5 us |   0)    <idle>-0    |  dNh3  1.063 us    |  }
         5 us |   0)    <idle>-0    |  dNh3  0.266 us    |  ttwu_stat();
         6 us |   0)    <idle>-0    |  dNh3              |  _raw_spin_unlock_irqrestore() {
         6 us |   0)    <idle>-0    |  dNh3  0.273 us    |    preempt_count_sub();
         6 us |   0)    <idle>-0    |  dNh2  0.818 us    |  }

Show this:

 # tracer: wakeup
 #
 # wakeup latency trace v1.1.5 on 4.20.0+
 # --------------------------------------------------------------------
 # latency: 593 us, #674/674, CPU#0 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
 #    -----------------
 #    | task: kworker/0:1H-339 (uid:0 nice:-20 policy:0 rt_prio:0)
 #    -----------------
 #
 #                                      _-----=> irqs-off
 #                                     / _----=> need-resched
 #                                    | / _---=> hardirq/softirq
 #                                    || / _--=> preempt-depth
 #                                    ||| /
 #  REL TIME      CPU  TASK/PID       ||||     DURATION                  FUNCTION CALLS
 #     |          |     |    |        ||||      |   |                     |   |   |   |
        0 us |   0)    <idle>-0    |  dNs. |               |  /*      0:120:R   + [000]   339:100:R kworker/0:1H */
        3 us |   0)    <idle>-0    |  dNs. |   0.000 us    |            (null)();
       67 us |   0)    <idle>-0    |  dNs. |   0.721 us    |  ttwu_stat();
       69 us |   0)    <idle>-0    |  dNs. |   0.607 us    |  _raw_spin_unlock_irqrestore();
       71 us |   0)    <idle>-0    |  .Ns. |   0.598 us    |  _raw_spin_lock_irq();
       72 us |   0)    <idle>-0    |  .Ns. |   0.584 us    |  _raw_spin_lock_irq();
       73 us |   0)    <idle>-0    |  dNs. | + 11.118 us   |  __next_timer_interrupt();
       75 us |   0)    <idle>-0    |  dNs. |               |  call_timer_fn() {
       76 us |   0)    <idle>-0    |  dNs. |               |    delayed_work_timer_fn() {
       76 us |   0)    <idle>-0    |  dNs. |               |      __queue_work() {
       ...

Link: http://lkml.kernel.org/r/20190101154614.8887-4-changbin.du@gmail.com

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-02-06 11:56:19 -05:00
Changbin Du
97f0a3bcdf tracing: Show more info for funcgraph wakeup tracers
Add these info fields to funcgraph wakeup tracers:
  o Show CPU info since the waker could be on a different CPU.
  o Show function duration and overhead.
  o Show IRQ markers.

Link: http://lkml.kernel.org/r/20190101154614.8887-3-changbin.du@gmail.com

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-02-06 11:56:19 -05:00
Steven Rostedt (VMware)
6c6dbce196 tracing: Add comment to predicate_parse() about "&&" or "||"
As the predicat_parse() code is rather complex, commenting subtleties is
important. The switch case statement should be commented to describe that it
is only looking for two '&' or '|' together, which is why the fall through
to an error is after the check.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-02-06 11:56:19 -05:00
Mathieu Malaterre
9399ca21d2 tracing: Annotate implicit fall through in predicate_parse()
There is a plan to build the kernel with -Wimplicit-fallthrough and
this place in the code produced a warning (W=1).

This commit remove the following warning:

  kernel/trace/trace_events_filter.c:494:8: warning: this statement may fall through [-Wimplicit-fallthrough=]

Link: http://lkml.kernel.org/r/20190114203039.16535-2-malat@debian.org

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-02-06 11:56:18 -05:00
Mathieu Malaterre
91457c018f tracing: Annotate implicit fall through in parse_probe_arg()
There is a plan to build the kernel with -Wimplicit-fallthrough and
this place in the code produced a warning (W=1).

This commit remove the following warning:

  kernel/trace/trace_probe.c:302:6: warning: this statement may fall through [-Wimplicit-fallthrough=]

Link: http://lkml.kernel.org/r/20190114203039.16535-1-malat@debian.org

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-02-06 11:56:18 -05:00
Changbin Du
9acd8de69d function_graph: Support displaying relative timestamp
When function_graph is used for latency tracers, relative timestamp
is more straightforward than absolute timestamp as function trace
does. This change adds relative timestamp support to function_graph
and applies to latency tracers (wakeup and irqsoff).

Instead of:

 # tracer: irqsoff
 #
 # irqsoff latency trace v1.1.5 on 5.0.0-rc1-test
 # --------------------------------------------------------------------
 # latency: 521 us, #1125/1125, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:8)
 #    -----------------
 #    | task: swapper/2-0 (uid:0 nice:0 policy:0 rt_prio:0)
 #    -----------------
 #  => started at: __schedule
 #  => ended at:   _raw_spin_unlock_irq
 #
 #
 #                                       _-----=> irqs-off
 #                                      / _----=> need-resched
 #                                     | / _---=> hardirq/softirq
 #                                     || / _--=> preempt-depth
 #                                     ||| /
 #     TIME        CPU  TASK/PID       ||||  DURATION                  FUNCTION CALLS
 #      |          |     |    |        ||||   |   |                     |   |   |   |
   124.974306 |   2)  systemd-693   |  d..1  0.000 us    |  __schedule();
   124.974307 |   2)  systemd-693   |  d..1              |    rcu_note_context_switch() {
   124.974308 |   2)  systemd-693   |  d..1  0.487 us    |      rcu_preempt_deferred_qs();
   124.974309 |   2)  systemd-693   |  d..1  0.451 us    |      rcu_qs();
   124.974310 |   2)  systemd-693   |  d..1  2.301 us    |    }
[..]
   124.974826 |   2)    <idle>-0    |  d..2              |  finish_task_switch() {
   124.974826 |   2)    <idle>-0    |  d..2              |    _raw_spin_unlock_irq() {
   124.974827 |   2)    <idle>-0    |  d..2  0.000 us    |  _raw_spin_unlock_irq();
   124.974828 |   2)    <idle>-0    |  d..2  0.000 us    |  tracer_hardirqs_on();
   <idle>-0       2d..2  552us : <stack trace>
  => __schedule
  => schedule_idle
  => do_idle
  => cpu_startup_entry
  => start_secondary
  => secondary_startup_64

Show:

 # tracer: irqsoff
 #
 # irqsoff latency trace v1.1.5 on 5.0.0-rc1-test+
 # --------------------------------------------------------------------
 # latency: 511 us, #1053/1053, CPU#7 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:8)
 #    -----------------
 #    | task: swapper/7-0 (uid:0 nice:0 policy:0 rt_prio:0)
 #    -----------------
 #  => started at: __schedule
 #  => ended at:   _raw_spin_unlock_irq
 #
 #
 #                                       _-----=> irqs-off
 #                                      / _----=> need-resched
 #                                     | / _---=> hardirq/softirq
 #                                     || / _--=> preempt-depth
 #                                     ||| /
 #   REL TIME      CPU  TASK/PID       ||||  DURATION                  FUNCTION CALLS
 #      |          |     |    |        ||||   |   |                     |   |   |   |
         0 us |   7)   sshd-1704    |  d..1  0.000 us    |  __schedule();
         1 us |   7)   sshd-1704    |  d..1              |    rcu_note_context_switch() {
         1 us |   7)   sshd-1704    |  d..1  0.611 us    |      rcu_preempt_deferred_qs();
         2 us |   7)   sshd-1704    |  d..1  0.484 us    |      rcu_qs();
         3 us |   7)   sshd-1704    |  d..1  2.599 us    |    }
[..]
       509 us |   7)    <idle>-0    |  d..2              |  finish_task_switch() {
       510 us |   7)    <idle>-0    |  d..2              |    _raw_spin_unlock_irq() {
       510 us |   7)    <idle>-0    |  d..2  0.000 us    |  _raw_spin_unlock_irq();
       512 us |   7)    <idle>-0    |  d..2  0.000 us    |  tracer_hardirqs_on();
   <idle>-0       7d..2  543us : <stack trace>
  => __schedule
  => schedule_idle
  => do_idle
  => cpu_startup_entry
  => start_secondary
  => secondary_startup_64

Link: http://lkml.kernel.org/r/20190101154614.8887-2-changbin.du@gmail.com

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-02-06 11:56:18 -05:00
Linus Torvalds
cc6810e36b Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull cpu hotplug fixes from Thomas Gleixner:
 "Two fixes for the cpu hotplug machinery:

   - Replace the overly clever 'SMT disabled by BIOS' detection logic as
     it breaks KVM scenarios and prevents speculation control updates
     when the Hyperthreads are brought online late after boot.

   - Remove a redundant invocation of the speculation control update
     function"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM
  x86/speculation: Remove redundant arch_smt_update() invocation
2019-02-03 09:02:03 -08:00
Linus Torvalds
58f6d4287a Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A pile of perf updates:

   - Fix broken sanity check in the /proc/sys/kernel/perf_cpu_time_max_percent
     write handler

   - Cure a perf script crash which caused by an unitinialized data
     structure

   - Highlight the hottest instruction in perf top and not a random one

   - Cure yet another clang issue when building perf python

   - Handle topology entries with no CPU correctly in the tools

   - Handle perf data which contains both tracepoints and performance
     counter entries correctly.

   - Add a missing NULL pointer check in perf ordered_events_free()"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf script: Fix crash when processing recorded stat data
  perf top: Fix wrong hottest instruction highlighted
  perf tools: Handle TOPOLOGY headers with no CPU
  perf python: Remove -fstack-clash-protection when building with some clang versions
  perf core: Fix perf_proc_update_handler() bug
  perf script: Fix crash with printing mixed trace point and other events
  perf ordered_events: Fix crash in ordered_events__free
2019-02-03 08:59:51 -08:00
Johannes Weiner
1b69ac6b40 psi: fix aggregation idle shut-off
psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating.  However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.

Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again.  This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)

Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep.  To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.

What if the worker is also executing other items before or after?

Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself.  If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.

If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity.  But that
should not be a problem.  The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation.  If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.

Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681d5 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Andrei Vagin
8fb335e078 kernel/exit.c: release ptraced tasks before zap_pid_ns_processes
Currently, exit_ptrace() adds all ptraced tasks in a dead list, then
zap_pid_ns_processes() waits on all tasks in a current pidns, and only
then are tasks from the dead list released.

zap_pid_ns_processes() can get stuck on waiting tasks from the dead
list.  In this case, we will have one unkillable process with one or
more dead children.

Thanks to Oleg for the advice to release tasks in find_child_reaper().

Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com
Fixes: 7c8bd2322c ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Josh Poimboeuf
b284909aba cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM
With the following commit:

  73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")

... the hotplug code attempted to detect when SMT was disabled by BIOS,
in which case it reported SMT as permanently disabled.  However, that
code broke a virt hotplug scenario, where the guest is booted with only
primary CPU threads, and a sibling is brought online later.

The problem is that there doesn't seem to be a way to reliably
distinguish between the HW "SMT disabled by BIOS" case and the virt
"sibling not yet brought online" case.  So the above-mentioned commit
was a bit misguided, as it permanently disabled SMT for both cases,
preventing future virt sibling hotplugs.

Going back and reviewing the original problems which were attempted to
be solved by that commit, when SMT was disabled in BIOS:

  1) /sys/devices/system/cpu/smt/control showed "on" instead of
     "notsupported"; and

  2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.

I'd propose that we instead consider #1 above to not actually be a
problem.  Because, at least in the virt case, it's possible that SMT
wasn't disabled by BIOS and a sibling thread could be brought online
later.  So it makes sense to just always default the smt control to "on"
to allow for that possibility (assuming cpuid indicates that the CPU
supports SMT).

The real problem is #2, which has a simple fix: change vmx_vm_init() to
query the actual current SMT state -- i.e., whether any siblings are
currently online -- instead of looking at the SMT "control" sysfs value.

So fix it by:

  a) reverting the original "fix" and its followup fix:

     73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")
     bc2d8d262c ("cpu/hotplug: Fix SMT supported evaluation")

     and

  b) changing vmx_vm_init() to query the actual current SMT state --
     instead of the sysfs control value -- to determine whether the L1TF
     warning is needed.  This also requires the 'sched_smt_present'
     variable to exported, instead of 'cpu_smt_control'.

Fixes: 73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")
Reported-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joe Mario <jmario@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
2019-01-30 19:27:00 +01:00
Zhenzhong Duan
34d66caf25 x86/speculation: Remove redundant arch_smt_update() invocation
With commit a74cfffb03 ("x86/speculation: Rework SMT state change"),
arch_smt_update() is invoked from each individual CPU hotplug function.

Therefore the extra arch_smt_update() call in the sysfs SMT control is
redundant.

Fixes: a74cfffb03 ("x86/speculation: Rework SMT state change")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <konrad.wilk@oracle.com>
Cc: <dwmw@amazon.co.uk>
Cc: <bp@suse.de>
Cc: <srinivas.eeda@oracle.com>
Cc: <peterz@infradead.org>
Cc: <hpa@zytor.com>
Link: https://lkml.kernel.org/r/e2e064f2-e8ef-42ca-bf4f-76b612964752@default
2019-01-29 22:20:24 +01:00
Linus Torvalds
f907bb4c32 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Glexiner:
 "A single regression fix to address the unintended breakage of posix
  cpu timers.

  This is caused by a new sanity check in the common code, which fails
  for posix cpu timers under certain conditions because the posix cpu
  timer code never updates the variable which is checked"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-cpu-timers: Unbreak timer rearming
2019-01-27 11:55:06 -08:00
Linus Torvalds
9881051828 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
 "A small series of fixes which all address possible missed wakeups:

   - Document and fix the wakeup ordering of wake_q

   - Add the missing barrier in rcuwait_wake_up(), which was documented
     in the comment but missing in the code

   - Fix the possible missed wakeups in the rwsem and futex code"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Fix (possible) missed wakeup
  futex: Fix (possible) missed wakeup
  sched/wake_q: Fix wakeup ordering for wake_q
  sched/wake_q: Document wake_q_add()
  sched/wait: Fix rcuwait_wake_up() ordering
2019-01-27 11:52:50 -08:00
Linus Torvalds
0d484375d7 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A small set of fixes for the interrupt subsystem:

   - Fix a double increment in the irq descriptor allocator which
     resulted in a sanity check only being done for every second
     affinity mask

   - Add a missing device tree translation in the stm32-exti driver.
     Without that the interrupt association is completely wrong.

   - Initialize the mutex in the GIC-V3 MBI driver

   - Fix the alignment for aliasing devices in the GIC-V3-ITS driver so
     multi MSI allocations work correctly

   - Ensure that the initial affinity of a interrupt is not empty at
     startup time.

   - Drop bogus include in the madera irq chip driver

   - Fix KernelDoc regression"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-v3-its: Align PCI Multi-MSI allocation on their size
  genirq/irqdesc: Fix double increment in alloc_descs()
  genirq: Fix the kerneldoc comment for struct irq_affinity_desc
  irqchip/madera: Drop GPIO includes
  irqchip/gic-v3-mbi: Fix uninitialized mbi_lock
  irqchip/stm32-exti: Add domain translate function
  genirq: Make sure the initial affinity is not empty
2019-01-27 11:25:38 -08:00
Ingo Molnar
d3c8c0af75 Merge tag 'perf-urgent-for-mingo-5.0-20190121' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Pull perf/urgent fixes from Arnaldo Carvalho de Melo:

Kernel:

  Stephane Eranian:

  - Fix perf_proc_update_handler() bug.

perf script:

  Andi Kleen:

  - Fix crash with printing mixed trace point and other events.

  Tony Jones:

  - Fix crash when processing recorded stat data.

perf top:

  He Kuang:

  - Fix wrong hottest instruction highlighted.

perf python:

  Arnaldo Carvalho de Melo:

  - Remove -fstack-clash-protection when building with some clang versions.

perf ordered_events:

  Jiri Olsa:

  - Fix out of buffers crash in ordered_events__free().

perf cpu_map:

  Stephane Eranian:

  - Handle TOPOLOGY headers with no CPU.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-22 11:08:47 +01:00
Xie Yongji
e158488be2 locking/rwsem: Fix (possible) missed wakeup
Because wake_q_add() can imply an immediate wakeup (cmpxchg failure
case), we must not rely on the wakeup being delayed. However, commit:

  e38513905e ("locking/rwsem: Rework zeroing reader waiter->task")

relies on exactly that behaviour in that the wakeup must not happen
until after we clear waiter->task.

[ peterz: Added changelog. ]

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: e38513905e ("locking/rwsem: Rework zeroing reader waiter->task")
Link: https://lkml.kernel.org/r/1543495830-2644-1-git-send-email-xieyongji@baidu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-21 11:15:39 +01:00
Peter Zijlstra
b061c38bef futex: Fix (possible) missed wakeup
We must not rely on wake_q_add() to delay the wakeup; in particular
commit:

  1d0dcb3ad9 ("futex: Implement lockless wakeups")

moved wake_q_add() before smp_store_release(&q->lock_ptr, NULL), which
could result in futex_wait() waking before observing ->lock_ptr ==
NULL and going back to sleep again.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 1d0dcb3ad9 ("futex: Implement lockless wakeups")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-21 11:15:38 +01:00
Peter Zijlstra
4c4e373156 sched/wake_q: Fix wakeup ordering for wake_q
Notable cmpxchg() does not provide ordering when it fails, however
wake_q_add() requires ordering in this specific case too. Without this
it would be possible for the concurrent wakeup to not observe our
prior state.

Andrea Parri provided:

  C wake_up_q-wake_q_add

  {
	int next = 0;
	int y = 0;
  }

  P0(int *next, int *y)
  {
	int r0;

	/* in wake_up_q() */

	WRITE_ONCE(*next, 1);   /* node->next = NULL */
	smp_mb();               /* implied by wake_up_process() */
	r0 = READ_ONCE(*y);
  }

  P1(int *next, int *y)
  {
	int r1;

	/* in wake_q_add() */

	WRITE_ONCE(*y, 1);      /* wake_cond = true */
	smp_mb__before_atomic();
	r1 = cmpxchg_relaxed(next, 1, 2);
  }

  exists (0:r0=0 /\ 1:r1=0)

  This "exists" clause cannot be satisfied according to the LKMM:

  Test wake_up_q-wake_q_add Allowed
  States 3
  0:r0=0; 1:r1=1;
  0:r0=1; 1:r1=0;
  0:r0=1; 1:r1=1;
  No
  Witnesses
  Positive: 0 Negative: 3
  Condition exists (0:r0=0 /\ 1:r1=0)
  Observation wake_up_q-wake_q_add Never 0 3

Reported-by: Yongji Xie <elohimes@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-21 11:15:37 +01:00
Peter Zijlstra
e6018c0f5c sched/wake_q: Document wake_q_add()
The only guarantee provided by wake_q_add() is that a wakeup will
happen after it, it does _NOT_ guarantee the wakeup will be delayed
until the matching wake_up_q().

If wake_q_add() fails the cmpxchg() a concurrent wakeup is pending and
that can happen at any time after the cmpxchg(). This means we should
not rely on the wakeup happening at wake_q_up(), but should be ready
for wake_q_add() to issue the wakeup.

The delay; if provided (most likely); should only result in more efficient
behaviour.

Reported-by: Yongji Xie <elohimes@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-21 11:15:36 +01:00
Prateek Sood
6dc080eeb2 sched/wait: Fix rcuwait_wake_up() ordering
For some peculiar reason rcuwait_wake_up() has the right barrier in
the comment, but not in the code.

This mistake has been observed to cause a deadlock in the following
situation:

    P1					P2

    percpu_up_read()			percpu_down_write()
      rcu_sync_is_idle() // false
					  rcu_sync_enter()
					  ...
      __percpu_up_read()

[S] ,-  __this_cpu_dec(*sem->read_count)
    |   smp_rmb();
[L] |   task = rcu_dereference(w->task) // NULL
    |
    |				    [S]	    w->task = current
    |					    smp_mb();
    |				    [L]	    readers_active_check() // fail
    `-> <store happens here>

Where the smp_rmb() (obviously) fails to constrain the store.

[ peterz: Added changelog. ]

Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 8f95c90ceb ("sched/wait, RCU: Introduce rcuwait machinery")
Link: https://lkml.kernel.org/r/1543590656-7157-1-git-send-email-prsood@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-21 11:15:36 +01:00