futex_requeue() can acquire the lock on behalf of a waiter
early on or during the requeue loop if it is uncontended or in
the event of a lock steal or owner died. On wakeup, the waiter
(in futex_wait_requeue_pi()) cleans up the pi_state owner using
the lock_ptr to protect against concurrent access to the
pi_state. The pi_state is hung off futex_q's on the requeue
target futex hash bucket so the lock_ptr needs to be updated
accordingly.
The problem manifested by triggering the WARN_ON in
lookup_pi_state() about the pid != pi_state->owner->pid. With
this patch, the pi_state is properly guarded against concurrent
access via the requeue target hb lock.
The astute reviewer may notice that there is a window of time
between when futex_requeue() unlocks the hb locks and when
futex_wait_requeue_pi() will acquire hb2->lock. During this
time the pi_state and uval are not in sync with the underlying
rtmutex owner (but the uval does indicate there are waiters, so
no atomic changes will occur in userspace). However, this is
not a problem. Should a contending thread enter
lookup_pi_state() and acquire hb2->lock before the ownership is
fixed up, it will find the pi_state hung off a waiter's
(possibly the pending owner's) futex_q and block on the
rtmutex. Once futex_wait_requeue_pi() fixes up the owner, it
will also move the pi_state from the old owner's
task->pi_state_list to its own.
v3: Fix plist lock name for application to mainline (rather
than -rt) Compile tested against tip/v2.6.31-rc5.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@linux.vnet.ibm.com>
LKML-Reference: <4A7F4EFF.6090903@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
posix_cpu_timers_exit_group(): Do not use thread_group_cputimer()
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf_counter: Fix/complete ftrace event records sampling
perf_counter, ftrace: Fix perf_counter integration
tracing/filters: Always free pred on filter_add_subsystem_pred() failure
tracing/filters: Don't use pred on alloc failure
ring-buffer: Fix memleak in ring_buffer_free()
tracing: Fix recordmcount.pl to handle sections with only weak functions
ring-buffer: Fix advance of reader in rb_buffer_peek()
tracing: do not use functions starting with .L in recordmcount.pl
ring-buffer: do not disable ring buffer on oops_in_progress
ring-buffer: fix check of try_to_discard result
* 'core-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
lockdep: Fix typos in documentation
lockdep: Fix file mode of lock_stat
rtmutex: Avoid deadlock in rt_mutex_start_proxy_lock()
This patch implements the kernel side support for ftrace event
record sampling.
A new counter sampling attribute is added:
PERF_SAMPLE_TP_RECORD
which requests ftrace events record sampling. In this case
if a PERF_TYPE_TRACEPOINT counter is active and a tracepoint
fires, we emit the tracepoint binary record to the
perfcounter event buffer, as a sample.
Result, after setting PERF_SAMPLE_TP_RECORD attribute from perf
record:
perf record -f -F 1 -a -e workqueue:workqueue_execution
perf report -D
0x21e18 [0x48]: event: 9
.
. ... raw event: size 72 bytes
. 0000: 09 00 00 00 01 00 48 00 d0 c7 00 81 ff ff ff ff ......H........
. 0010: 0a 00 00 00 0a 00 00 00 21 00 00 00 00 00 00 00 ........!......
. 0020: 2b 00 01 02 0a 00 00 00 0a 00 00 00 65 76 65 6e +...........eve
. 0030: 74 73 2f 31 00 00 00 00 00 00 00 00 0a 00 00 00 ts/1...........
. 0040: e0 b1 31 81 ff ff ff ff .......
.
0x21e18 [0x48]: PERF_EVENT_SAMPLE (IP, 1): 10: 0xffffffff8100c7d0 period: 33
The raw ftrace binary record starts at offset 0020.
Translation:
struct trace_entry {
type = 0x2b = 43;
flags = 1;
preempt_count = 2;
pid = 0xa = 10;
tgid = 0xa = 10;
}
thread_comm = "events/1"
thread_pid = 0xa = 10;
func = 0xffffffff8131b1e0 = flush_to_ldisc()
What will come next?
- Userspace support ('perf trace'), 'flight data recorder' mode
for perf trace, etc.
- The unconditional copy from the profiling callback brings
some costs however if someone wants no such sampling to
occur, and needs to be fixed in the future. For that we need
to have an instant access to the perf counter attribute.
This is a matter of a flag to add in the struct ftrace_event.
- Take care of the events recursivity! Don't ever try to record
a lock event for example, it seems some locking is used in
the profiling fast path and lead to a tracing recursivity.
That will be fixed using raw spinlock or recursivity
protection.
- [...]
- Profit! :-)
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Adds possible second part to the assign argument of TP_EVENT().
TP_perf_assign(
__perf_count(foo);
__perf_addr(bar);
)
Which, when specified make the swcounter increment with @foo instead
of the usual 1, and report @bar for PERF_SAMPLE_ADDR (data address
associated with the event) when this triggers a counter overflow.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge reason: Merge up to almost-rc6 to pick up latest perfcounters
(on which we'll queue up a dependent fix)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When the process exits we don't have to run new cputimer nor
use running one (as it not accounts when tsk->exit_state != 0)
to get process CPU times. As there is only one thread we can
just use CPU times fields from task and signal structs.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Roland McGrath <roland@redhat.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If filter_add_subsystem_pred() fails due to ENOSPC or ENOMEM,
the pred doesn't get freed, while as a side effect it does for
other errors. Make it so the caller always frees the pred for
any error.
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1249746593.6453.32.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dan Carpenter sent me a fix to prevent pred from being used if
it couldn't be allocated. I noticed the same problem also
existed for the create_pred() case and added a fix for that.
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1249746549.6453.29.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf_counter: Fix double list iteration in per task precise stats
perf: Auto-detect libelf
perf symbol: Fix symbol parsing in certain cases: use the build-id as a symlink
perf_counter/powerpc: Check oprofile_cpu_type for NULL before using it
ftrace: Fix perf-tracepoint OOPS
perf report: Add missing command line options to man page
perf: Auto-detect libbfd
perf report: Make --sort comm,dso,symbol the default
While looking at Jens Rosenboom bug report
(http://lkml.org/lkml/2009/7/27/35) about strange sys_futex call done from
a dying "ps" program, we found following problem.
clone() syscall has special support for TID of created threads. This
support includes two features.
One (CLONE_CHILD_SETTID) is to set an integer into user memory with the
TID value.
One (CLONE_CHILD_CLEARTID) is to clear this same integer once the created
thread dies.
The integer location is a user provided pointer, provided at clone()
time.
kernel keeps this pointer value into current->clear_child_tid.
At execve() time, we should make sure kernel doesnt keep this user
provided pointer, as full user memory is replaced by a new one.
As glibc fork() actually uses clone() syscall with CLONE_CHILD_SETTID and
CLONE_CHILD_CLEARTID set, chances are high that we might corrupt user
memory in forked processes.
Following sequence could happen:
1) bash (or any program) starts a new process, by a fork() call that
glibc maps to a clone( ... CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID
...) syscall
2) When new process starts, its current->clear_child_tid is set to a
location that has a meaning only in bash (or initial program) context
(&THREAD_SELF->tid)
3) This new process does the execve() syscall to start a new program.
current->clear_child_tid is left unchanged (a non NULL value)
4) If this new program creates some threads, and initial thread exits,
kernel will attempt to clear the integer pointed by
current->clear_child_tid from mm_release() :
if (tsk->clear_child_tid
&& !(tsk->flags & PF_SIGNALED)
&& atomic_read(&mm->mm_users) > 1) {
u32 __user * tidptr = tsk->clear_child_tid;
tsk->clear_child_tid = NULL;
/*
* We don't check the error code - if userspace has
* not set up a proper pointer then tough luck.
*/
<< here >> put_user(0, tidptr);
sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL, 0);
}
5) OR : if new program is not multi-threaded, but spied by /proc/pid
users (ps command for example), mm_users > 1, and the exiting program
could corrupt 4 bytes in a persistent memory area (shm or memory mapped
file)
If current->clear_child_tid points to a writeable portion of memory of the
new program, kernel happily and silently corrupts 4 bytes of memory, with
unexpected effects.
Fix is straightforward and should not break any sane program.
Reported-by: Jens Rosenboom <jens@mcbone.net>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sonny Rao <sonnyrao@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Brice Goglin reported this crash with per task precise stats:
> I finally managed to test the threaded perfcounter statistics (thanks a
> lot for implementing it). I am running 2.6.31-rc5 (with the AMD
> magny-cours patches but I don't think they matter here). I am trying to
> measure local/remote memory accesses per thread during the well-known
> stream benchmark. It's compiled with OpenMP using 16 threads on a
> quad-socket quad-core barcelona machine.
>
> Command line is:
> /mnt/scratch/bgoglin/cpunode/linux-2.6.31/tools/perf/perf record -f -s
> -e r1000001e0 -e r1000002e0 -e r1000004e0 -e r1000008e0 ./stream
>
> It seems to work fine with a single -e <counter> on the command line
> while it crashes when there are at least 2 of them.
> It seems to work fine without -s as well.
A silly copy-paste resulted in a messed up iteration which would
cause the OOPS.
Reported-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Brice Goglin <Brice.Goglin@inria.fr>
LKML-Reference: <1249574786.32113.550.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When calling rb_buffer_peek() from ring_buffer_consume() and a
padding event is returned, the function rb_advance_reader() is
called twice. This may lead to missing samples or under high
workloads to the warning below. This patch fixes this. If a padding
event is returned by rb_buffer_peek() it will be consumed by the
calling function now.
Also, I simplified some code in ring_buffer_consume().
------------[ cut here ]------------
WARNING: at /dev/shm/.source/linux/kernel/trace/ring_buffer.c:2289 rb_advance_reader+0x2e/0xc5()
Hardware name: Anaheim
Modules linked in:
Pid: 29, comm: events/2 Tainted: G W 2.6.31-rc3-oprofile-x86_64-standard-00059-g5050dc2 #1
Call Trace:
[<ffffffff8106776f>] ? rb_advance_reader+0x2e/0xc5
[<ffffffff81039ffe>] warn_slowpath_common+0x77/0x8f
[<ffffffff8103a025>] warn_slowpath_null+0xf/0x11
[<ffffffff8106776f>] rb_advance_reader+0x2e/0xc5
[<ffffffff81068bda>] ring_buffer_consume+0xa0/0xd2
[<ffffffff81326933>] op_cpu_buffer_read_entry+0x21/0x9e
[<ffffffff810be3af>] ? __find_get_block+0x4b/0x165
[<ffffffff8132749b>] sync_buffer+0xa5/0x401
[<ffffffff810be3af>] ? __find_get_block+0x4b/0x165
[<ffffffff81326c1b>] ? wq_sync_buffer+0x0/0x78
[<ffffffff81326c76>] wq_sync_buffer+0x5b/0x78
[<ffffffff8104aa30>] worker_thread+0x113/0x1ac
[<ffffffff8104dd95>] ? autoremove_wake_function+0x0/0x38
[<ffffffff8104a91d>] ? worker_thread+0x0/0x1ac
[<ffffffff8104dc9a>] kthread+0x88/0x92
[<ffffffff8100bdba>] child_rip+0xa/0x20
[<ffffffff8104dc12>] ? kthread+0x0/0x92
[<ffffffff8100bdb0>] ? child_rip+0x0/0x20
---[ end trace f561c0a58fcc89bd ]---
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@kernel.org>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Not all tracepoints are created equal, in specific the ftrace
tracepoints are created with TRACE_EVENT_FORMAT() which does
not generate the needed bits to tie them into perf counters.
For those events, don't create the 'id' file and fail
->profile_enable when their ID is specified through other
means.
Reported-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1249497664.5890.4.camel@laptop>
[ v2: fix build error in the !CONFIG_EVENT_PROFILE case ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In the event of a lock steal or owner died,
rt_mutex_start_proxy_lock() will give the rt_mutex to the
waiting task, but it fails to release the wait_lock. This leads
to subsequent deadlocks when other tasks try to acquire the
rt_mutex.
I also removed a few extra blank lines that really spaced this
routine out. I must have been high on the \n when I wrote this
originally...
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@linux.vnet.ibm.com>
LKML-Reference: <4A79D7F1.4000405@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The commit:
commit e0fdace10e
Author: David Miller <davem@davemloft.net>
Date: Fri Aug 1 01:11:22 2008 -0700
debug_locks: set oops_in_progress if we will log messages.
Otherwise lock debugging messages on runqueue locks can deadlock the
system due to the wakeups performed by printk().
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Will permanently set oops_in_progress on any lockdep failure.
When this triggers it will cause any read from the ring buffer to
permanently disable the ring buffer (not to mention no locking of
printk).
This patch removes the check. It keeps the print in NMI which makes
sense. This is probably OK, since the ring buffer should not cause
something to set oops_in_progress anyway.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function ring_buffer_discard_commit inversed the code path
of the result of try_to_discard. It should skip incrementing the
entry counter if try_to_discard succeeded. But instead, it increments
the entry conder if it succeeded to discard, and does not increment
it if it fails.
The result of this bug is that filtering will make the stat counters
incorrect.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf_counter: Set the CONFIG_PERF_COUNTERS default to y if CONFIG_PROFILING=y
perf: Fix read buffer overflow
perf top: Add mwait_idle_with_hints to skip_symbols[]
perf tools: Fix faulty check
perf report: Update for the new FORK/EXIT events
perf_counter: Full task tracing
perf_counter: Collapse inherit on read()
tracing, perf_counter: Add help text to CONFIG_EVENT_PROFILE
perf_counter tools: Fix link errors with older toolchains
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix race in cpupri introduced by cpumask_var changes
sched: Fix latencytop and sleep profiling vs group scheduling
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
posix-timers: Fix oops in clock_nanosleep() with CLOCK_MONOTONIC_RAW
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: Fix missing function_graph events when we splice_read from trace_pipe
tracing: Fix invalid function_graph entry
trace: stop tracer in oops_enter()
ftrace: Only update $offset when we update $ref_func
ftrace: Fix the conditional that updates $ref_func
tracing: only truncate ftrace files when O_TRUNC is set
tracing: show proper address for trace-printk format