A few events are interesting not only for a current task.
For example, sched_stat_* events are interesting for a task
which wakes up. For this reason, it will be good if such
events will be delivered to a target task too.
Now a target task can be set by using __perf_task().
The original idea and a draft patch belongs to Peter Zijlstra.
I need these events for profiling sleep times. sched_switch is used for
getting callchains and sched_stat_* is used for getting time periods.
These events are combined in user space, then it can be analyzed by
perf tools.
Inspired-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1342016098-213063-1-git-send-email-avagin@openvz.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Like do_wp_page(), __replace_page() should do munlock_vma_page()
for the case when the old page still has other !VM_LOCKED
mappings. Unfortunately this needs mm/internal.h.
Also, move put_page() outside of ptl lock. This doesn't really
matter but looks a bit better.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182249.GA20372@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1. vma_address() returns loff_t, this looks confusing and this
is unnecessary after the previous change. Make it return "ulong",
all callers truncate the result anyway.
2. Its name conflicts with mm/rmap.c:vma_address(), rename it to
offset_to_vaddr(), this matches vaddr_to_offset().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182247.GA20365@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1. register_for_each_vma() checks that vma_address() == vaddr,
but this is not enough. We should also ensure that
vaddr >= vm_start, find_vma() guarantees "vaddr < vm_end" only.
2. After the prevous changes, register_for_each_vma() is the
only reason why vma_address() has to return loff_t, all other
users know that we have the valid mapping at this offset and
thus the overflow is not possible.
Change the code to use vaddr_to_offset() instead, imho this looks
more clean/understandable and now we can change vma_address().
3. While at it, remove the unnecessary type-cast.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182244.GA20362@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently build_probe_list() builds the list of all uprobes
attached to the given inode, and the caller should filter out
those who don't fall into the [start,end) range, this is
sub-optimal.
This patch turns find_least_offset_node() into
find_node_in_range() which returns the first node inside the
[min,max] range, and changes build_probe_list() to use this node
as a starting point for rb_prev() and rb_next() to find all
other nodes the caller needs. The resulting list is no longer
sorted but we do not care.
This can speed up both build_probe_list() and the callers, but
there is another reason to introduce find_node_in_range(). It
can be used to figure out whether the given vma has uprobes or
not, this will be needed soon.
While at it, shift INIT_LIST_HEAD(tmp_list) into
build_probe_list().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182240.GA20352@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
uprobe_munmap() does get_user_pages() and it is also called from
the final mmput()->exit_mmap() path. This slows down
exit/mmput() for no reason, and I think it is simply
dangerous/wrong to try to fault-in a page into the dying mm. If
nothing else, this happens after the last sync_mm_rss(), afaics
handle_mm_fault() can change the task->rss_stat and make the
subsequent check_mm() unhappy.
Change uprobe_munmap() to check mm->mm_users != 0.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182231.GA20336@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The bug was introduced by me in 449d0d7c ("uprobes: Simplify the
usage of uprobe->pending_list").
Yes, we do not care about uprobe->pending_list after return and
nobody can remove the current list entry, but put_uprobe(uprobe)
can actually free it and thus we need list_for_each_safe().
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Link: http://lkml.kernel.org/r/20120729182229.GA20329@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comment above write_opcode()->lock_page(old_page) tells
about the race with do_wp_page(). I don't really understand
which exactly race it means, but afaics this lock_page() was not
enough to close all races with do_wp_page().
Anyway, since:
77fc4af1b5 uprobes: Change register_for_each_vma() to take mm->mmap_sem for writing
this code is always called with ->mmap_sem held for writing,
so we can forget about do_wp_page().
However, we can't simply remove this lock_page(), and the only
(afaics) reason is __replace_page()->try_to_free_swap().
Nothing in write_opcode() needs it, move it into
__replace_page() and fix the comment.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182220.GA20322@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
write_opcode() does lock_page(new_page) for no reason. Nobody
can see this page until __replace_page() exposes it under ptl
lock, and we do nothing with this page after pte_unmap_unlock().
If nothing else, the similar code in do_wp_page() doesn't lock
the new page for page_add_new_anon_rmap/set_pte_at_notify.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182218.GA20315@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
page_address_in_vma(old_page) in __replace_page() is ugly and
wrong. The caller already knows the correct virtual address,
this page was found by get_user_pages(vaddr).
However, page_address_in_vma() can actually fail if
page->mapping was cleared by __delete_from_page_cache() after
get_user_pages() returns. But this means the race with page
reclaim, write_opcode() should not fail, it should retry and
read this page again. Probably the race with remove_mapping() is
not possible due to page_freeze_refs() logic, but afaics at
least shmem_writepage()->shmem_delete_from_page_cache() can
clear ->mapping.
We could change __replace_page() to return -EAGAIN in this case,
but it would be better to simply use the caller's vaddr and rely
on page_check_address().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182216.GA20311@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
write_opcode() rechecks valid_vma() and ->f_mapping, this is
pointless. The caller, register_for_each_vma() or uprobe_mmap(),
has already done these checks under mmap_sem.
To clarify, uprobe_mmap() checks valid_vma() only, but we can
rely on build_probe_list(vm_file->f_mapping->host).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182212.GA20304@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
An rmdir pushes css's ref count to zero. However, if the associated
directory is open at the time, the dentry ref count is non-zero. If
the fd for this directory is then passed into perf_event_open, it
does a css_get(). This bounces the ref count back up from zero. This
is a problem by itself. But what makes it turn into a crash is the
fact that we end up doing an extra dput, since we perform a dput
when css_put sees the ref count go down to zero.
css_tryget() does not fall into that trap. So, we use that instead.
Reproduction test-case for the bug:
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <linux/unistd.h>
#include <linux/perf_event.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>
#define PERF_FLAG_PID_CGROUP (1U << 2)
int perf_event_open(struct perf_event_attr *hw_event_uptr,
pid_t pid, int cpu, int group_fd, unsigned long flags) {
return syscall(__NR_perf_event_open,hw_event_uptr, pid, cpu,
group_fd, flags);
}
/*
* Directly poke at the perf_event bug, since it's proving hard to repro
* depending on where in the kernel tree. what moved?
*/
int main(int argc, char **argv)
{
int fd;
struct perf_event_attr attr;
memset(&attr, 0, sizeof(attr));
attr.exclude_kernel = 1;
attr.size = sizeof(attr);
mkdir("/dev/cgroup/perf_event/blah", 0777);
fd = open("/dev/cgroup/perf_event/blah", O_RDONLY);
perror("open");
rmdir("/dev/cgroup/perf_event/blah");
sleep(2);
perf_event_open(&attr, fd, 0, -1, PERF_FLAG_PID_CGROUP);
perror("perf_event_open");
close(fd);
return 0;
}
Signed-off-by: Salman Qazi <sqazi@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20120614223108.1025.2503.stgit@dungbeetle.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>