Commit Graph

553 Commits

Author SHA1 Message Date
Linus Torvalds a7fc726bb2 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A set of perf related fixes:

   - fix a CR4.PCE propagation issue caused by usage of mm instead of
     active_mm and therefore propagated the wrong value.

   - perf core fixes, which plug a use-after-free issue and make the
     event inheritance on fork more robust.

   - a tooling fix for symbol handling"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf symbols: Fix symbols__fixup_end heuristic for corner cases
  x86/perf: Clarify why x86_pmu_event_mapped() isn't racy
  x86/perf: Fix CR4.PCE propagation to use active_mm instead of mm
  perf/core: Better explain the inherit magic
  perf/core: Simplify perf_event_free_task()
  perf/core: Fix event inheritance on fork()
  perf/core: Fix use-after-free in perf_release()
2017-03-17 13:59:52 -07:00
Peter Zijlstra d8a8cfc769 perf/core: Better explain the inherit magic
While going through the event inheritance code Oleg got confused.

Add some comments to better explain the silent dissapearance of
orphaned events.

So what happens is that at perf_event_release_kernel() time; when an
event looses its connection to userspace (and ceases to exist from the
user's perspective) we can still have an arbitrary amount of inherited
copies of the event. We want to synchronously find and remove all
these child events.

Since that requires a bit of lock juggling, there is the possibility
that concurrent clone()s will create new child events. Therefore we
first mark the parent event as DEAD, which marks all the extant child
events as orphaned.

We then avoid copying orphaned events; in order to avoid getting more
of them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20170316125823.289567442@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 14:16:53 +01:00
Peter Zijlstra 15121c789e perf/core: Simplify perf_event_free_task()
We have ctx->event_list that contains all events; no need to
repeatedly iterate the group lists to find them all.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20170316125823.239678244@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 14:16:53 +01:00
Peter Zijlstra e7cc4865f0 perf/core: Fix event inheritance on fork()
While hunting for clues to a use-after-free, Oleg spotted that
perf_event_init_context() can loose an error value with the result
that fork() can succeed even though we did not fully inherit the perf
event context.

Spotted-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: oleg@redhat.com
Cc: stable@vger.kernel.org
Fixes: 889ff01506 ("perf/core: Split context's event group list into pinned and non-pinned lists")
Link: http://lkml.kernel.org/r/20170316125823.190342547@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 14:16:52 +01:00
Peter Zijlstra e552a8389a perf/core: Fix use-after-free in perf_release()
Dmitry reported syzcaller tripped a use-after-free in perf_release().

After much puzzlement Oleg spotted the below scenario:

  Task1                           Task2

  fork()
    perf_event_init_task()
    /* ... */
    goto bad_fork_$foo;
    /* ... */
    perf_event_free_task()
      mutex_lock(ctx->lock)
      perf_free_event(B)

                                  perf_event_release_kernel(A)
                                    mutex_lock(A->child_mutex)
                                    list_for_each_entry(child, ...) {
                                      /* child == B */
                                      ctx = B->ctx;
                                      get_ctx(ctx);
                                      mutex_unlock(A->child_mutex);

        mutex_lock(A->child_mutex)
        list_del_init(B->child_list)
        mutex_unlock(A->child_mutex)

        /* ... */

      mutex_unlock(ctx->lock);
      put_ctx() /* >0 */
    free_task();
                                      mutex_lock(ctx->lock);
                                      mutex_lock(A->child_mutex);
                                      /* ... */
                                      mutex_unlock(A->child_mutex);
                                      mutex_unlock(ctx->lock)
                                      put_ctx() /* 0 */
                                        ctx->task && !TOMBSTONE
                                          put_task_struct() /* UAF */

This patch closes the hole by making perf_event_free_task() destroy the
task <-> ctx relation such that perf_event_release_kernel() will no longer
observe the now dead task.

Spotted-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fweisbec@gmail.com
Cc: oleg@redhat.com
Cc: stable@vger.kernel.org
Fixes: c6e5b73242 ("perf: Synchronously clean up child events")
Link: http://lkml.kernel.org/r/20170314155949.GE32474@worktop
Link: http://lkml.kernel.org/r/20170316125823.140295131@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 14:16:52 +01:00
Masahiro Yamada 8a1115ff6b scripts/spelling.txt: add "disble(d)" pattern and fix typo instances
Fix typos and add the following to the scripts/spelling.txt:

  disble||disable
  disbled||disabled

I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c
untouched.  The macro is not referenced at all, but this commit is
touching only comment blocks just in case.

Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
Ingo Molnar 6e84f31522 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h>
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

The APIs that are going to be moved first are:

   mm_alloc()
   __mmdrop()
   mmdrop()
   mmdrop_async_fn()
   mmdrop_async()
   mmget_not_zero()
   mmput()
   mmput_async()
   get_task_mm()
   mm_access()
   mm_release()

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:28 +01:00
Ingo Molnar e601757102 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h>
We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
will have to be picked up from other headers and .c files.

Create a trivial placeholder <linux/sched/clock.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:27 +01:00
Linus Torvalds 3f26b0c876 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes on the kernel and tooling side - nothing in particular
  stands out"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  perf/core: Fix the perf_cpu_time_max_percent check
  perf/core: Fix perf_event_enable_on_exec() timekeeping (again)
  perf/core: Remove confusing comment and move put_ctx()
  perf record: Honor --quiet option properly
  perf annotate: Add -q/--quiet option
  perf diff: Add -q/--quiet option
  perf report: Add -q/--quiet option
  perf utils: Check verbose flag properly
  perf utils: Add perf_quiet_option()
  perf record: Add -a as default target
  perf stat: Add -a as default target
  perf tools: Fail on using multiple bits long terms without value
  perf tools: Move new_term arguments into struct parse_events_term template
  perf build: Add special fixdep cleaning rule
  perf tools: Replace _SC_NPROCESSORS_CONF with max_present_cpu in cpu_topology_map
  perf header: Make build_cpu_topology skip offline/absent CPUs
  perf cpumap: Add cpu__max_present_cpu()
  perf session: Fix DEBUG=1 build with clang
  tools lib traceevent: It's preempt not prempt
  perf python: Filter out -specs=/a/b/c from the python binding cc options
  ...
2017-02-28 11:38:18 -08:00
Linus Torvalds f7878dc3a9 Merge branch 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Several noteworthy changes.

   - Parav's rdma controller is finally merged. It is very straight
     forward and can limit the abosolute numbers of common rdma
     constructs used by different cgroups.

   - kernel/cgroup.c got too chubby and disorganized. Created
     kernel/cgroup/ subdirectory and moved all cgroup related files
     under kernel/ there and reorganized the core code. This hurts for
     backporting patches but was long overdue.

   - cgroup v2 process listing reimplemented so that it no longer
     depends on allocating a buffer large enough to cache the entire
     result to sort and uniq the output. v2 has always mangled the sort
     order to ensure that users don't depend on the sorted output, so
     this shouldn't surprise anybody. This makes the pid listing
     functions use the same iterators that are used internally, which
     have to have the same iterating capabilities anyway.

   - perf cgroup filtering now works automatically on cgroup v2. This
     patch was posted a long time ago but somehow fell through the
     cracks.

   - misc fixes asnd documentation updates"

* 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (27 commits)
  kernfs: fix locking around kernfs_ops->release() callback
  cgroup: drop the matching uid requirement on migration for cgroup v2
  cgroup, perf_event: make perf_event controller work on cgroup2 hierarchy
  cgroup: misc cleanups
  cgroup: call subsys->*attach() only for subsystems which are actually affected by migration
  cgroup: track migration context in cgroup_mgctx
  cgroup: cosmetic update to cgroup_taskset_add()
  rdmacg: Fixed uninitialized current resource usage
  cgroup: Add missing cgroup-v2 PID controller documentation.
  rdmacg: Added documentation for rdmacg
  IB/core: added support to use rdma cgroup controller
  rdmacg: Added rdma cgroup controller
  cgroup: fix a comment typo
  cgroup: fix RCU related sparse warnings
  cgroup: move namespace code to kernel/cgroup/namespace.c
  cgroup: rename functions for consistency
  cgroup: move v1 mount functions to kernel/cgroup/cgroup-v1.c
  cgroup: separate out cgroup1_kf_syscall_ops
  cgroup: refactor mount path and clearly distinguish v1 and v2 paths
  cgroup: move cgroup v1 specific code to kernel/cgroup/cgroup-v1.c
  ...
2017-02-27 21:41:08 -08:00
Dave Jiang 11bac80004 mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmf
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.

Remove the vma parameter to simplify things.

[arnd@arndb.de: fix ARM build]
  Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
Tan Xiaojun 1572e45a92 perf/core: Fix the perf_cpu_time_max_percent check
Use "proc_dointvec_minmax" instead of "proc_dointvec" to check the input
value from user-space.

If not, we can set a big value and some vars will overflow like
"sysctl_perf_event_sample_rate" which will cause a lot of unexpected
problems.

Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <acme@kernel.org>
Cc: <alexander.shishkin@linux.intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1487829879-56237-1-git-send-email-tanxiaojun@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24 08:56:33 +01:00
Peter Zijlstra 7bbba0eb1a perf/core: Fix perf_event_enable_on_exec() timekeeping (again)
Where commit:

  7fce250915 ("perf: Fix scaling vs.  perf_event_enable_on_exec()")

disabled the ctx-time a-priory, such that all events get enabled and
scheduled at the time point in time, there is one hole in that patch,
when no events do get enabled nothing re-enables the ctx-time.

Reported-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 7fce250915 ("perf: Fix scaling vs.  perf_event_enable_on_exec()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24 08:56:32 +01:00
Peter Zijlstra 279b5165ff perf/core: Remove confusing comment and move put_ctx()
Since commit:

  321027c1fe ("perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race")

... the code looks like (assuming move_group==1):

  gctx = __perf_event_ctx_lock_double(group_leader, ctx);

  perf_remove_from_context(group_leader, 0);
  list_for_each_entry(sibling, &group_leader->sibling_list, group_entry) {
	perf_remove_from_context(sibling, 0);
	put_ctx(gctx);
  }

  /* ... */

  /* misleading comment about how this is the last reference */
  put_ctx(gctx);

  perf_event_ctx_unlock(group_leader, gctx);

What that 'last' put_ctx() does is drop @group_leader's reference on
gctx after having dropped all its potential sibling references.

But the thing is that __perf_event_ctx_lock_double() returns with a
reference _and_ a held lock, and perf_event_ctx_unlock() unlocks that
lock and drops that reference. Therefore that put_ctx() cannot be the
'last' of anything, nor is there an unbalance in puts.

To reduce confusion, remove the comment and place the put_ctx() next
to the remove_from_context() call.

Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24 08:56:32 +01:00
Ingo Molnar 210f400d68 Merge tag 'v4.10-rc8' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-14 07:29:14 +01:00
Alexander Shishkin 6ce77bfd6c perf/core: Allow kernel filters on CPU events
While supporting file-based address filters for CPU events requires some
extra context switch handling, kernel address filters are easy, since the
kernel mapping is preserved across address spaces. It is also useful as
it permits tracing scheduling paths of the kernel.

This patch allows setting up kernel filters for CPU events.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170126094057.13805-4-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10 09:08:09 +01:00
Alexander Shishkin 9ccbfbb157 perf/core: Do error out on a kernel filter on an exclude_filter event
It is currently possible to configure a kernel address filter for a
event that excludes kernel from its traces (attr.exclude_kernel==1).

While in reality this doesn't make sense, the SET_FILTER ioctl() should
return a error in such case, currently it does not. Furthermore, it
will still silently discard the filter and any potentially valid filters
that came with it.

This patch makes the SET_FILTER ioctl() error out in such cases.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170126094057.13805-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10 09:08:09 +01:00
Peter Zijlstra 451d24d1e5 perf/core: Fix crash in perf_event_read()
Alexei had his box explode because doing read() on a package
(rapl/uncore) event that isn't currently scheduled in ends up doing an
out-of-bounds load.

Rework the code to more explicitly deal with event->oncpu being -1.

Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Fixes: d6a2f9035b ("perf/core: Introduce PMU_EV_CAP_READ_ACTIVE_PKG")
Link: http://lkml.kernel.org/r/20170131102710.GL6515@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10 09:04:50 +01:00
Tejun Heo 968ebff1ef cgroup, perf_event: make perf_event controller work on cgroup2 hierarchy
perf_event is a utility controller whose primary role is identifying
cgroup membership to filter perf events; however, because it also
tracks some per-css state, it can't be replaced by pure cgroup
membership test.  Mark the controller as implicitly enabled on the
default hierarchy so that perf events can always be filtered based on
cgroup v2 path as long as the controller is not mounted on a legacy
hierarchy.

"perf record" is updated accordingly so that it searches for both v1
and v2 hierarchies.  A v1 hierarchy is used if perf_event is mounted
on it; otherwise, it uses the v2 hierarchy.

v2: Doc updated to reflect more flexible rebinding behavior.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
2017-02-02 13:47:02 -05:00
Kan Liang 40999312c7 perf/core: Try parent PMU first when initializing a child event
perf has additional overhead when monitoring the task which
frequently generates child tasks.

perf_init_event() is one of the hotspots for the additional overhead:

Currently, to get the PMU, it tries to search the type in pmu_idr at
first. But it is not always successful, especially for the widely used
PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events. So it has to go to the
slow path which go through the whole PMUs list.

It will be a big performance issue, if the PMUs list is long (e.g. server
with many uncore boxes) and the task frequently generates child tasks.

The child event inherits its parent event. So the child event should
try its parent PMU first.

Here is some data from the overhead test on Broadwell server:

  perf record -e $TEST_EVENTS -- ./loop.sh 50000

  loop.sh
    start=$(date +%s%N)
    i=0
    while [ "$i" -le "$1" ]
    do
            date > /dev/null
            i=`expr $i + 1`
    done
    end=$(date +%s%N)
    elapsed=`expr $end - $start`

  Event#	Original elapsed time	Elapsed time with patch		delta
  1		196,573,192,397		189,162,029,998			-3.77%
  2		257,567,753,013		241,620,788,683			-6.19%
  4		398,730,726,971		370,518,938,714			-7.08%
  8		824,983,761,120		740,702,489,329			-10.22%
  16		1,883,411,923,498	1,672,027,508,355		-11.22%

... which shows a nice performance improvement.

Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1484745662-15928-2-git-send-email-kan.liang@intel.com
[ Tidied up the changelog and the code comment. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 12:01:16 +01:00
Alexander Shishkin 487f05e18a perf/core: Optimize event rescheduling on active contexts
When new events are added to an active context, we go and reschedule
all cpu groups and all task groups in order to preserve the priority
(cpu pinned, task pinned, cpu flexible, task flexible), but in
reality we only need to reschedule groups of the same priority as
that of the events being added, and below.

This patch changes the behavior so that only groups that need to be
rescheduled are rescheduled.

Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170119164330.22887-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 12:01:15 +01:00
Alexander Shishkin fe45bafbd0 perf/core: Don't re-schedule CPU flexible events needlessly
In the sched-in path, we first remove a CPU's flexible events in order to
give priority to the task's pinned events. However, this step can be safely
skipped if the task doesn't have its own pinned events.

This patch implements this skipping.

Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170119164330.22887-2-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 12:01:14 +01:00
David Carrillo-Cisneros 1fd7e41699 perf/core: Remove perf_cpu_context::unique_pmu
cpuctx->unique_pmu was originally introduced as a way to identify cpuctxs
with shared pmus in order to avoid visiting the same cpuctx more than once
in a for_each_pmu loop.

cpuctx->unique_pmu == cpuctx->pmu in non-software task contexts since they
have only one pmu per cpuctx. Since perf_pmu_sched_task() is only called in
hw contexts, this patch replaces cpuctx->unique_pmu by cpuctx->pmu in it.

The change above, together with the previous patch in this series, removed
the remaining uses of cpuctx->unique_pmu, so we remove it altogether.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20170118192454.58008-3-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 12:01:14 +01:00
David Carrillo-Cisneros 058fe1c044 perf/core: Make cgroup switch visit only cpuctxs with cgroup events
This patch follows from a conversation in CQM/CMT's last series about
speeding up the context switch for cgroup events:

  https://patchwork.kernel.org/patch/9478617/

This is a low-hanging fruit optimization. It replaces the iteration over
the "pmus" list in cgroup switch by an iteration over a new list that
contains only cpuctxs with at least one cgroup event.

This is necessary because the number of PMUs have increased over the years
e.g modern x86 server systems have well above 50 PMUs.

The iteration over the full PMU list is unneccessary and can be costly in
heavy cache contention scenarios.

Below are some instrumentation measurements with 10, 50 and 90 percentiles
of the total cost of context switch before and after this optimization for
a simple array read/write microbenchark.

  Contention
    Level    Nr events      Before (us)            After (us)       Median
  L2    L3     types      (10%, 50%, 90%)       (10%, 50%, 90%     Speedup
  --------------------------------------------------------------------------
  Low   Low       1       (1.72, 2.42, 5.85)    (1.35, 1.64, 5.46)     29%
  High  Low       1       (2.08, 4.56, 19.8)    (1720, 2.20, 13.7)     51%
  High  High      1       (2.86, 10.4, 12.7)    (2.54, 4.32, 12.1)     58%

  Low   Low       2       (1.98, 3.20, 6.89)    (1.68, 2.41, 8.89)     24%
  High  Low       2       (2.48, 5.28, 22.4)    (2150, 3.69, 14.6)     30%
  High  High      2       (3.32, 8.09, 13.9)    (2.80, 5.15, 13.7)     36%

where:

  1 event type  = cycles
  2 event types = cycles,intel_cqm/llc_occupancy/

   Contention L2 Low: workset  <  L2 cache size.
                 High:  "     >>  L2   "     " .
   Contention L3 Low: workset of task on all sockets  <  L3 cache size.
                 High:   "     "   "   "   "    "    >>  L3   "     " .

   Median Speedup is (50%ile Before - 50%ile After) /  50%ile Before

Unsurprisingly, the benefits of this optimization decrease with the number
of cpuctxs with a cgroup events, yet, is never detrimental.

Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20170118192454.58008-2-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 12:01:13 +01:00
Peter Zijlstra 0b3589be9b perf/core: Fix PERF_RECORD_MMAP2 prot/flags for anonymous memory
Andres reported that MMAP2 records for anonymous memory always have
their protection field 0.

Turns out, someone daft put the prot/flags generation code in the file
branch, leaving them unset for anonymous memory.

Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Don Zickus <dzickus@redhat.com
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@gmail.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: anton@ozlabs.org
Cc: namhyung@kernel.org
Cc: stable@vger.kernel.org # v3.16+
Fixes: f972eb63b1 ("perf: Pass protection and flags bits through mmap2 interface")
Link: http://lkml.kernel.org/r/20170126221508.GF6536@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:41:26 +01:00