The hardware latency tracer has been in the PREEMPT_RT patch for some time.
It is used to detect possible SMIs or any other hardware interruptions that
the kernel is unaware of. Note, NMIs may also be detected, but that may be
good to note as well.
The logic is pretty simple. It simply creates a thread that spins on a
single CPU for a specified amount of time (width) within a periodic window
(window). These numbers may be adjusted by their cooresponding names in
/sys/kernel/tracing/hwlat_detector/
The defaults are window = 1000000 us (1 second)
width = 500000 us (1/2 second)
The loop consists of:
t1 = trace_clock_local();
t2 = trace_clock_local();
Where trace_clock_local() is a variant of sched_clock().
The difference of t2 - t1 is recorded as the "inner" timestamp and also the
timestamp t1 - prev_t2 is recorded as the "outer" timestamp. If either of
these differences are greater than the time denoted in
/sys/kernel/tracing/tracing_thresh then it records the event.
When this tracer is started, and tracing_thresh is zero, it changes to the
default threshold of 10 us.
The hwlat tracer in the PREEMPT_RT patch was originally written by
Jon Masters. I have modified it quite a bit and turned it into a
tracer.
Based-on-code-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 345ddcc882 ("ftrace: Have set_ftrace_pid use the bitmap like events
do") placed ftrace_init_tracefs into the instance creation, and encapsulated
the top level updating with an if conditional, as the top level only gets
updated at boot up. Unfortunately, this triggers section mismatch errors as
the init functions are called from a function that can be called later, and
the section mismatch logic is unaware of the if conditional that would
prevent it from happening at run time.
To make everyone happy, create a separate ftrace_init_tracefs_toplevel()
routine that only gets called by init functions, and this will be what calls
other init functions for the toplevel directory.
Link: http://lkml.kernel.org/r/20160704102139.19cbc0d9@gandalf.local.home
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 345ddcc882 ("ftrace: Have set_ftrace_pid use the bitmap like events do")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
# echo 1 > options/stacktrace
# echo 1 > events/sched/sched_switch/enable
# cat trace
<idle>-0 [002] d..2 1982.525169: <stack trace>
=> save_stack_trace
=> __ftrace_trace_stack
=> trace_buffer_unlock_commit_regs
=> event_trigger_unlock_commit
=> trace_event_buffer_commit
=> trace_event_raw_event_sched_switch
=> __schedule
=> schedule
=> schedule_preempt_disabled
=> cpu_startup_entry
=> start_secondary
The above shows that we are seeing 6 functions before ever making it to the
caller of the sched_switch event.
# echo stacktrace > events/sched/sched_switch/trigger
# cat trace
<idle>-0 [002] d..3 2146.335208: <stack trace>
=> trace_event_buffer_commit
=> trace_event_raw_event_sched_switch
=> __schedule
=> schedule
=> schedule_preempt_disabled
=> cpu_startup_entry
=> start_secondary
The stacktrace trigger isn't as bad, because it adds its own skip to the
stacktracing, but still has two events extra.
One issue is that if the stacktrace passes its own "regs" then there should
be no addition to the skip, as the regs will not include the functions being
called. This was an issue that was fixed by commit 7717c6be69 ("tracing:
Fix stacktrace skip depth in trace_buffer_unlock_commit_regs()" as adding
the skip number for kprobes made the probes not have any stack at all.
But since this is only an issue when regs is being used, a skip should be
added if regs is NULL. Now we have:
# echo 1 > options/stacktrace
# echo 1 > events/sched/sched_switch/enable
# cat trace
<idle>-0 [000] d..2 1297.676333: <stack trace>
=> __schedule
=> schedule
=> schedule_preempt_disabled
=> cpu_startup_entry
=> rest_init
=> start_kernel
=> x86_64_start_reservations
=> x86_64_start_kernel
# echo stacktrace > events/sched/sched_switch/trigger
# cat trace
<idle>-0 [002] d..3 1370.759745: <stack trace>
=> __schedule
=> schedule
=> schedule_preempt_disabled
=> cpu_startup_entry
=> start_secondary
And kprobes are not touched.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the trace_printk code chooses which static buffer to use based
on what type of atomic context (NMI, IRQ, etc) it's in. Simplify the
code and make it more robust: simply count the nesting depth and choose
a buffer based on the current nesting depth.
The new code will only drop an event if we nest more than 4 deep,
and the old code was guaranteed to malfunction if that happened.
Link: http://lkml.kernel.org/r/07ab03aecfba25fcce8f9a211b14c9c5e2865c58.1464289095.git.luto@kernel.org
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Convert set_ftrace_pid to use the bitmap like set_event_pid does. This
allows for instances to use the pid filtering as well, and will allow for
function-fork option to set if the children of a traced function should be
traced or not.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The addition of PIDs into a pid_list via the write operation of
set_event_pid is a bit complex. The same operation will be needed for
function tracing pids. Move the code into its own generic function in
trace.c, so that we can avoid duplication of this code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To allow other aspects of ftrace to use the pid_list logic, we need to reuse
the seq_file functions. Making the generic part into functions that can be
called by other files will help in this regard.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the filtered_pid functions are going to be used by function tracer as
well as trace_events, move the code into the generic trace.c file.
The functions moved are:
trace_find_filtered_pid()
trace_ignore_this_task()
trace_filter_add_remove_task()
Kernel Doc text was also added.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Filtering of events requires the data to be written to the ring buffer
before it can be decided to filter or not. This is because the parameters of
the filter are based on the result that is written to the ring buffer and
not on the parameters that are passed into the trace functions.
The ftrace ring buffer is optimized for writing into the ring buffer and
committing. The discard procedure used when filtering decides the event
should be discarded is much more heavy weight. Thus, using a temporary
filter when filtering events can speed things up drastically.
Without a temp buffer we have:
# trace-cmd start -p nop
# perf stat -r 10 hackbench 50
0.790706626 seconds time elapsed ( +- 0.71% )
# trace-cmd start -e all
# perf stat -r 10 hackbench 50
1.566904059 seconds time elapsed ( +- 0.27% )
# trace-cmd start -e all -f 'common_preempt_count==20'
# perf stat -r 10 hackbench 50
1.690598511 seconds time elapsed ( +- 0.19% )
# trace-cmd start -e all -f 'common_preempt_count!=20'
# perf stat -r 10 hackbench 50
1.707486364 seconds time elapsed ( +- 0.30% )
The first run above is without any tracing, just to get a based figure.
hackbench takes ~0.79 seconds to run on the system.
The second run enables tracing all events where nothing is filtered. This
increases the time by 100% and hackbench takes 1.57 seconds to run.
The third run filters all events where the preempt count will equal "20"
(this should never happen) thus all events are discarded. This takes 1.69
seconds to run. This is 10% slower than just committing the events!
The last run enables all events and filters where the filter will commit all
events, and this takes 1.70 seconds to run. The filtering overhead is
approximately 10%. Thus, the discard and commit of an event from the ring
buffer may be about the same time.
With this patch, the numbers change:
# trace-cmd start -p nop
# perf stat -r 10 hackbench 50
0.778233033 seconds time elapsed ( +- 0.38% )
# trace-cmd start -e all
# perf stat -r 10 hackbench 50
1.582102692 seconds time elapsed ( +- 0.28% )
# trace-cmd start -e all -f 'common_preempt_count==20'
# perf stat -r 10 hackbench 50
1.309230710 seconds time elapsed ( +- 0.22% )
# trace-cmd start -e all -f 'common_preempt_count!=20'
# perf stat -r 10 hackbench 50
1.786001924 seconds time elapsed ( +- 0.20% )
The first run is again the base with no tracing.
The second run is all tracing with no filtering. It is a little slower, but
that may be well within the noise.
The third run shows that discarding all events only took 1.3 seconds. This
is a speed up of 23%! The discard is much faster than even the commit.
The one downside is shown in the last run. Events that are not discarded by
the filter will take longer to add, this is due to the extra copy of the
event.
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's no real difference between trace_buffer_unlock_commit() and
trace_buffer_unlock_commit_regs() except that the former passes NULL to
ftrace_stack_trace() instead of regs. Have the former be a static inline of
the latter which passes NULL for regs.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions trace_buffer_unlock_commit() and the _regs() version are only
used within the kernel/trace directory. Move them to the local header and
remove the export as well.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function filter_check_discard() is small and only called by one user,
its code can be folded into that one caller and make the code a bit less
comlplex.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Nothing outside of the tracing directory calls filter_check_discard() or
check_filter_check_discard(). They should not be called by modules. Move
their prototypes into the local tracing header and remove their
EXPORT_SYMBOL() macros.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the following code snippet:
...
char buf[64];
...
if (copy_from_user(&buf, ubuf, cnt))
...
Even though the value of "&buf" equals "buf", but there is no need
to get the address of the "buf" again. Use "buf" instead of "&buf".
Link: http://lkml.kernel.org/r/20160418152329.18b72bea@debian
Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the event-fork option requires doing work when enabled and disabled, it
can not be passed down to created instances. The instance must clear this
flag when it is created, and must clear it when its removed.
As more options may be created with this need, a macro ZEROED_TRACE_FLAGS is
created that holds the flags that must not be inherited by the top level
instance, and must be cleared on removal of instances.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow users to define 'named' hist triggers. All triggers created
with the same 'name=xxx' option will update the same shared histogram
data.
This expands the hist trigger syntax from this:
# echo hist:keys=xxx ... [ if filter] > event/trigger
to this:
# echo hist:name=xxx:keys=xxx ... [ if filter] > event/trigger
Named histograms must use a 'compatible' set of keys and values, which
means each event added to a set of named triggers must have the same
names and types.
Reading the 'hist' file of any of the participating events will
produce the same output as any other participating event, which is to
be expected since they share the same data.
Link: http://lkml.kernel.org/r/1dbc84ee3322a75daaf5b3ef1d0cc0a2fb682fc7.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Similar to enable_event/disable_event triggers, these triggers enable
and disable the aggregation of events into maps rather than enabling
and disabling their writing into the trace buffer.
They can be used to automatically start and stop hist triggers based
on a matching filter condition.
If there's a paused hist trigger on system:event, the following would
start it when the filter condition was hit:
# echo enable_hist:system:event [ if filter] > event/trigger
And the following would disable a running system:event hist trigger:
# echo disable_hist:system:event [ if filter] > event/trigger
See Documentation/trace/events.txt for real examples.
Link: http://lkml.kernel.org/r/f812f086e52c8b7c8ad5443487375e03c96a601f.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>