* 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cgroup: fix to allow mounting a hierarchy by name
cgroup: move assignement out of condition in cgroup_attach_proc()
cgroup: Remove task_lock() from cgroup_post_fork()
cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
cgroup: only need to check oldcgrp==newgrp once
cgroup: remove redundant get/put of task struct
cgroup: remove redundant get/put of old css_set from migrate
cgroup: Remove unnecessary task_lock before fetching css_set on migration
cgroup: Drop task_lock(parent) on cgroup_fork()
cgroups: remove redundant get/put of css_set from css_set_check_fetched()
resource cgroups: remove bogus cast
cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
cgroup, cpuset: don't use ss->pre_attach()
cgroup: don't use subsys->can_attach_task() or ->attach_task()
cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
cgroup: improve old cgroup handling in cgroup_attach_proc()
cgroup: always lock threadgroup during migration
threadgroup: extend threadgroup_lock() to cover exit and exec
threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
...
Fix up conflict in kernel/cgroup.c due to commit e0197aae59: "cgroups:
fix a css_set not found bug in cgroup_attach_proc" that already
mentioned that the bug is fixed (differently) in Tejun's cgroup
patchset. This one, in other words.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (53 commits)
Kconfig: acpi: Fix typo in comment.
misc latin1 to utf8 conversions
devres: Fix a typo in devm_kfree comment
btrfs: free-space-cache.c: remove extra semicolon.
fat: Spelling s/obsolate/obsolete/g
SCSI, pmcraid: Fix spelling error in a pmcraid_err() call
tools/power turbostat: update fields in manpage
mac80211: drop spelling fix
types.h: fix comment spelling for 'architectures'
typo fixes: aera -> area, exntension -> extension
devices.txt: Fix typo of 'VMware'.
sis900: Fix enum typo 'sis900_rx_bufer_status'
decompress_bunzip2: remove invalid vi modeline
treewide: Fix comment and string typo 'bufer'
hyper-v: Update MAINTAINERS
treewide: Fix typos in various parts of the kernel, and fix some comments.
clockevents: drop unknown Kconfig symbol GENERIC_CLOCKEVENTS_MIGR
gpio: Kconfig: drop unknown symbol 'CS5535_GPIO'
leds: Kconfig: Fix typo 'D2NET_V2'
sound: Kconfig: drop unknown symbol ARCH_CLPS7500
...
Fix up trivial conflicts in arch/powerpc/platforms/40x/Kconfig (some new
kconfig additions, close to removed commented-out old ones)
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (106 commits)
perf kvm: Fix copy & paste error in description
perf script: Kill script_spec__delete
perf top: Fix a memory leak
perf stat: Introduce get_ratio_color() helper
perf session: Remove impossible condition check
perf tools: Fix feature-bits rework fallout, remove unused variable
perf script: Add generic perl handler to process events
perf tools: Use for_each_set_bit() to iterate over feature flags
perf tools: Unify handling of features when writing feature section
perf report: Accept fifos as input file
perf tools: Moving code in some files
perf tools: Fix out-of-bound access to struct perf_session
perf tools: Continue processing header on unknown features
perf tools: Improve macros for struct feature_ops
perf: builtin-record: Document and check that mmap_pages must be a power of two.
perf: builtin-record: Provide advice if mmap'ing fails with EPERM.
perf tools: Fix truncated annotation
perf script: look up thread using tid instead of pid
perf tools: Look up thread names for system wide profiling
perf tools: Fix comm for processes with named threads
...
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
cpu: Export cpu_up()
rcu: Apply ACCESS_ONCE() to rcu_boost() return value
Revert "rcu: Permit rt_mutex_unlock() with irqs disabled"
docs: Additional LWN links to RCU API
rcu: Augment rcu_batch_end tracing for idle and callback state
rcu: Add rcutorture tests for srcu_read_lock_raw()
rcu: Make rcutorture test for hotpluggability before offlining CPUs
driver-core/cpu: Expose hotpluggability to the rest of the kernel
rcu: Remove redundant rcu_cpu_stall_suppress declaration
rcu: Adaptive dyntick-idle preparation
rcu: Keep invoking callbacks if CPU otherwise idle
rcu: Irq nesting is always 0 on rcu_enter_idle_common
rcu: Don't check irq nesting from rcu idle entry/exit
rcu: Permit dyntick-idle with callbacks pending
rcu: Document same-context read-side constraints
rcu: Identify dyntick-idle CPUs on first force_quiescent_state() pass
rcu: Remove dynticks false positives and RCU failures
rcu: Reduce latency of rcu_prepare_for_idle()
rcu: Eliminate RCU_FAST_NO_HZ grace-period hang
rcu: Avoid needlessly IPIing CPUs at GP end
...
Now that subsys->can_attach() and attach() take @tset instead of
@task, they can handle per-task operations. Convert
->can_attach_task() and ->attach_task() users to use ->can_attach()
and attach() instead. Most converions are straight-forward.
Noteworthy changes are,
* In cgroup_freezer, remove unnecessary NULL assignments to unused
methods. It's useless and very prone to get out of sync, which
already happened.
* In cpuset, PF_THREAD_BOUND test is checked for each task. This
doesn't make any practical difference but is conceptually cleaner.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
perf_event_sched_in() shouldn't try to schedule task events if there
are none otherwise task's ctx->is_active will be set and will not be
cleared during sched_out. This will prevent newly added events from
being scheduled into the task context.
Fixes a boo-boo in commit 1d5f003f5a ("perf: Do not set task_ctx
pointer in cpuctx if there are no events in the context").
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111122140821.GF2557@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
jump_lable patching is very expensive operation that involves pausing all
cpus. The patching of perf_sched_events jump_label is easily controllable
from userspace by unprivileged user.
When te user runs a loop like this:
"while true; do perf stat -e cycles true; done"
... the performance of my test application that just increments a counter
for one second drops by 4%.
This is on a 16 cpu box with my test application using only one of
them. An impact on a real server doing real work will be worse.
Performance of KVM PMU drops nearly 50% due to jump_lable for "perf
record" since KVM PMU implementation creates and destroys perf event
frequently.
This patch introduces a way to rate limit jump_label patching and uses
it to fix the above problem.
I believe that as jump_label use will spread the problem will become more
common and thus solving it in a generic code is appropriate. Also fixing
it in the perf code would result in moving jump_label accounting logic to
perf code with all the ifdefs in case of JUMP_LABEL=n kernel. With this
patch all details are nicely hidden inside jump_label code.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111127155909.GO2557@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Gleb writes:
> Currently pmu is disabled and re-enabled on each timer interrupt even
> when no rotation or frequency adjustment is needed. On Intel CPU this
> results in two writes into PERF_GLOBAL_CTRL MSR per tick. On bare metal
> it does not cause significant slowdown, but when running perf in a virtual
> machine it leads to 20% slowdown on my machine.
Cure this by keeping a perf_event_context::nr_freq counter that counts the
number of active events that require frequency adjustments and use this in a
similar fashion to the already existing nr_events != nr_active test in
perf_rotate_context().
By being able to exclude both rotation and frequency adjustments a-priory for
the common case we can avoid the otherwise superfluous PMU disable.
Suggested-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-515yhoatehd3gza7we9fapaa@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When you do:
$ perf record -e cycles,cycles,cycles noploop 10
You expect about 10,000 samples for each event, i.e., 10s at
1000samples/sec. However, this is not what's happening. You
get much fewer samples, maybe 3700 samples/event:
$ perf report -D | tail -15
Aggregated stats:
TOTAL events: 10998
MMAP events: 66
COMM events: 2
SAMPLE events: 10930
cycles stats:
TOTAL events: 3644
SAMPLE events: 3644
cycles stats:
TOTAL events: 3642
SAMPLE events: 3642
cycles stats:
TOTAL events: 3644
SAMPLE events: 3644
On a Intel Nehalem or even AMD64, there are 4 counters capable
of measuring cycles, so there is plenty of space to measure those
events without multiplexing (even with the NMI watchdog active).
And even with multiplexing, we'd expect roughly the same number
of samples per event.
The root of the problem was that when the event that caused the buffer
to become full was not the first event passed on the cmdline, the user
notification would get lost. The notification was sent to the file
descriptor of the overflowed event but the perf tool was not polling
on it. The perf tool aggregates all samples into a single buffer,
i.e., the buffer of the first event. Consequently, it assumes
notifications for any event will come via that descriptor.
The seemingly straight forward solution of moving the waitq into the
ringbuffer object doesn't work because of life-time issues. One could
perf_event_set_output() on a fd that you're also blocking on and cause
the old rb object to be freed while its waitq would still be
referenced by the blocked thread -> FAIL.
Therefore link all events to the ringbuffer and broadcast the wakeup
from the ringbuffer object to all possible events that could be waited
upon. This is rather ugly, and we're open to better solutions but it
works for now.
Reported-by: Stephane Eranian <eranian@google.com>
Finished-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111126014731.GA7030@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch solves the following problem:
Now some samples may be lost due to throttling. The number of samples is
restricted by sysctl_perf_event_sample_rate/HZ. A trace event is
divided on some samples according to event's period. I don't sure, that
we should generate more than one sample on each trace event. I think the
better way to use SAMPLE_PERIOD.
E.g.: I want to trace when a process sleeps. I created a process, which
sleeps for 1ms and for 4ms. perf got 100 events in both cases.
swapper 0 [000] 1141.371830: sched_stat_sleep: comm=foo pid=1801 delay=1386750 [ns]
swapper 0 [000] 1141.369444: sched_stat_sleep: comm=foo pid=1801 delay=4499585 [ns]
In the first case a kernel want to send 4499585 events and
in the second case it wants to send 1386750 events.
perf-reports shows that process sleeps in both places equal time. It's
bug.
With this patch kernel generates one event on each "sleep" and the time
slice is saved in the field "period". Perf knows how handle it.
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1320670457-2633428-3-git-send-email-avagin@openvz.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Do not set task_ctx pointer during sched_in if there are no
events associated with the context. Otherwise if during task
execution total number of events in the system will become zero
perf_event_context_sched_out() will not be called and cpuctx->task_ctx
will be left with a stale value.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111023171033.GI17571@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
The legacy x86 nmi watchdog code was removed with the implementation
of the perf based nmi watchdog. This broke Oprofile's nmi timer
mode. To run nmi timer mode we relied on a continuous ticking nmi
source which the nmi watchdog provided. The nmi tick was no longer
available and current watchdog can not be used anymore since it runs
with very long periods in the range of seconds. This patch
reimplements the nmi timer mode using a perf counter nmi source.
V2:
* removing pr_info()
* fix undefined reference to `__udivdi3' for 32 bit build
* fix section mismatch of .cpuinit.data:nmi_timer_cpu_nb
* removed nmi timer setup in arch/x86
* implemented function stubs for op_nmi_init/exit()
* made code more readable in oprofile_init()
V3:
* fix architectural initialization in oprofile_init()
* fix CONFIG_OPROFILE_NMI_TIMER dependencies
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Robert Richter <robert.richter@amd.com>
This reverts commit 144060fee0.
It causes a resume regression for Andi on his Acer Aspire 1830T post
3.1. The screen just stays black after wakeup.
Also, it really looks like the wrong way to suspend and resume perf
events: I think they should be done as part of the CPU suspend and
resume, rather than as a notifier that does smp_call_function().
Reported-by: Andi Kleen <andi@firstfloor.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some kernel components pin user space memory (infiniband and perf) (by
increasing the page count) and account that memory as "mlocked".
The difference between mlocking and pinning is:
A. mlocked pages are marked with PG_mlocked and are exempt from
swapping. Page migration may move them around though.
They are kept on a special LRU list.
B. Pinned pages cannot be moved because something needs to
directly access physical memory. They may not be on any
LRU list.
I recently saw an mlockalled process where mm->locked_vm became
bigger than the virtual size of the process (!) because some
memory was accounted for twice:
Once when the page was mlocked and once when the Infiniband
layer increased the refcount because it needt to pin the RDMA
memory.
This patch introduces a separate counter for pinned pages and
accounts them seperately.
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Mike Marciniszyn <infinipath@qlogic.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These files were getting <linux/module.h> via an implicit non-obvious
path, but we want to crush those out of existence since they cost
time during compiles of processing thousands of lines of headers
for no reason. Give them the lightweight header that just contains
the EXPORT_SYMBOL infrastructure.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>