jump_lable patching is very expensive operation that involves pausing all
cpus. The patching of perf_sched_events jump_label is easily controllable
from userspace by unprivileged user.
When te user runs a loop like this:
"while true; do perf stat -e cycles true; done"
... the performance of my test application that just increments a counter
for one second drops by 4%.
This is on a 16 cpu box with my test application using only one of
them. An impact on a real server doing real work will be worse.
Performance of KVM PMU drops nearly 50% due to jump_lable for "perf
record" since KVM PMU implementation creates and destroys perf event
frequently.
This patch introduces a way to rate limit jump_label patching and uses
it to fix the above problem.
I believe that as jump_label use will spread the problem will become more
common and thus solving it in a generic code is appropriate. Also fixing
it in the perf code would result in moving jump_label accounting logic to
perf code with all the ifdefs in case of JUMP_LABEL=n kernel. With this
patch all details are nicely hidden inside jump_label code.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111127155909.GO2557@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch introduces x86 perf scheduler code helper functions. We
need this to later add more complex functionality to support
overlapping counter constraints (next patch).
The algorithm is modified so that the range of weight values is now
generated from the constraints. There shouldn't be other functional
changes.
With the helper functions the scheduler is controlled. There are
functions to initialize, traverse the event list, find unused counters
etc. The scheduler keeps its own state.
V3:
* Added macro for_each_set_bit_cont().
* Changed functions interfaces of perf_sched_find_counter() and
perf_sched_next_event() to use bool as return value.
* Added some comments to make code better understandable.
V4:
* Fix broken event assignment if weight of the first event is not
wmin (perf_sched_init()).
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1321616122-1533-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Gleb writes:
> Currently pmu is disabled and re-enabled on each timer interrupt even
> when no rotation or frequency adjustment is needed. On Intel CPU this
> results in two writes into PERF_GLOBAL_CTRL MSR per tick. On bare metal
> it does not cause significant slowdown, but when running perf in a virtual
> machine it leads to 20% slowdown on my machine.
Cure this by keeping a perf_event_context::nr_freq counter that counts the
number of active events that require frequency adjustments and use this in a
similar fashion to the already existing nr_events != nr_active test in
perf_rotate_context().
By being able to exclude both rotation and frequency adjustments a-priory for
the common case we can avoid the otherwise superfluous PMU disable.
Suggested-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-515yhoatehd3gza7we9fapaa@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When you do:
$ perf record -e cycles,cycles,cycles noploop 10
You expect about 10,000 samples for each event, i.e., 10s at
1000samples/sec. However, this is not what's happening. You
get much fewer samples, maybe 3700 samples/event:
$ perf report -D | tail -15
Aggregated stats:
TOTAL events: 10998
MMAP events: 66
COMM events: 2
SAMPLE events: 10930
cycles stats:
TOTAL events: 3644
SAMPLE events: 3644
cycles stats:
TOTAL events: 3642
SAMPLE events: 3642
cycles stats:
TOTAL events: 3644
SAMPLE events: 3644
On a Intel Nehalem or even AMD64, there are 4 counters capable
of measuring cycles, so there is plenty of space to measure those
events without multiplexing (even with the NMI watchdog active).
And even with multiplexing, we'd expect roughly the same number
of samples per event.
The root of the problem was that when the event that caused the buffer
to become full was not the first event passed on the cmdline, the user
notification would get lost. The notification was sent to the file
descriptor of the overflowed event but the perf tool was not polling
on it. The perf tool aggregates all samples into a single buffer,
i.e., the buffer of the first event. Consequently, it assumes
notifications for any event will come via that descriptor.
The seemingly straight forward solution of moving the waitq into the
ringbuffer object doesn't work because of life-time issues. One could
perf_event_set_output() on a fd that you're also blocking on and cause
the old rb object to be freed while its waitq would still be
referenced by the blocked thread -> FAIL.
Therefore link all events to the ringbuffer and broadcast the wakeup
from the ringbuffer object to all possible events that could be waited
upon. This is rather ugly, and we're open to better solutions but it
works for now.
Reported-by: Stephane Eranian <eranian@google.com>
Finished-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111126014731.GA7030@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'rmobile-fixes-for-linus' of git://github.com/pmundt/linux-sh:
ARM: mach-shmobile: cpuidle single/global and last_state fixes
ARM: mach-shmobile: move helper macro PORTCR to sh_pfc.h
ARM: mach-shmobile: move helper macro PORT_xx to sh_pfc.h
ARM: mach-shmobile: move helper macro PORT_DATA_xx to sh_pfc.h
ARM: mach-shmobile: ap4evb: remove white space from end of line
ARM: mach-shmobile: clock-sh7372: remove un-necessary index
ARM: mach-shmobile: kota2: add comment out separator
ARM: mach-shmobile: sh73a0: add MMC data pin pull-up
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: usb-audio: Use kmemdup rather than duplicating its implementation
ALSA: hda - Re-enable the check NO_PRESENCE misc bit
ALSA: vmaster - Free slave-links when freeing the master element
ALSA: hda - Don't add elements of other codecs to vmaster slave
ALSA: intel8x0: improve virtual environment detection
ALSA: intel8x0: move virtual environment detection code into one place
ALSA: snd_usb_audio: add Logitech HD Webcam c510 to quirk-384
ALSA: hda - fix internal mic on Dell Vostro 3500 laptop
ALSA: HDA: Remove quirk for Toshiba T110
ALSA: usb-audio - Fix the missing volume quirks at delayed init
ALSA: hda - Mute unused capture sources for Realtek codecs
ALSA: intel8x0: Improve comments for VM optimization
ASoC: Ensure we get an impedence reported for WM8958 jack detect
ASoC: Don't use wm8994->control_data when requesting IRQs
ASoC: Don't use wm8994->control_data in wm8994_readable_register()
ASoC: Update git repository URL
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (42 commits)
drm/radeon/kms/pm: switch to dynamically allocating clock mode array
drm/radeon/kms: optimize r600_pm_profile_init
drm/radeon/kms/pm: add a proper pm profile init function for fusion
drm/radeon/kms: remove extraneous calls to radeon_pm_compute_clocks()
drm/exynos: added padding to be 64-bit align.
drm: fix kconfig unmet dependency warning
drm: add some comments to drm_wait_vblank and drm_queue_vblank_event
drm/radeon/benchmark: signedness bug in radeon_benchmark_move()
drm: do not sleep on vblank while holding a mutex
MAINTAINERS: exynos: Add EXYNOS DRM maintainer entry
drm: try to restore previous CRTC config if mode set fails
drm/radeon/kms: make an aux failure debug only
drm: drop select of SLOW_WORK
drm: serialize access to list of debugfs files
drm/radeon/kms: fix use of vram scratch page on evergreen/ni
drm/radeon: Make sure CS mutex is held across GPU reset.
drm: Ensure string is null terminated.
vmwgfx: Only allow 64x64 cursors
vmwgfx: Initialize clip rect loop correctly in surface dirty
vmwgfx: Close screen object system
...
Nouveau, when configured with debugfs, creates debugfs files for every
channel, so structure holding list of files needs to be protected from
simultaneous changes by multiple threads.
Without this patch it's possible to hit kernel oops in
drm_debugfs_remove_files just by running a couple of xterms with
looped glxinfo.
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Dave Airlie <airlied@redhat.com>
This patch moves PORT_xx helper macro to sh_pfc.h,
and it expects CPU_ALL_PORT() macro for each CPU
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Now that all of the named string association with clocks has been
migrated to clkdev lookups there's no meaningful named topology that can
be constructed for a debugfs tree view. Get rid of the left over bits,
and shrink struct clk a bit in the process.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Fix below build warning:
CC arch/arm/mach-omap2/hwspinlock.o
In file included from arch/arm/mach-omap2/hwspinlock.c:22:
include/linux/hwspinlock.h: In function '__hwspin_unlock':
include/linux/hwspinlock.h:121: warning: 'return' with a value, in function returning void
Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Ohad Ben-Cohen <ohad@wizery.com>
The "private_date" field in struct devfreq_dev_status almost certainly
wants to be "private_data"; since there are no in-tree users of this
functionality, now seems like an easy time to make the fix.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits)
forcedeth: fix a few sparse warnings (variable shadowing)
forcedeth: Improve stats counters
forcedeth: remove unneeded stats updates
forcedeth: Acknowledge only interrupts that are being processed
forcedeth: fix race when unloading module
MAINTAINERS/rds: update maintainer
wanrouter: Remove kernel_lock annotations
usbnet: fix oops in usbnet_start_xmit
ixgbe: Fix compile for kernel without CONFIG_PCI_IOV defined
etherh: Add MAINTAINERS entry for etherh
bonding: comparing a u8 with -1 is always false
sky2: fix regression on Yukon Optima
netlink: clarify attribute length check documentation
netlink: validate NLA_MSECS length
i825xx:xscale:8390:freescale: Fix Kconfig dependancies
macvlan: receive multicast with local address
tg3: Update version to 3.121
tg3: Eliminate timer race with reset_task
tg3: Schedule at most one tg3_reset_task run
tg3: Obtain PCI function number from device
...
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
cpuidle: Single/Global registration of idle states
cpuidle: Split cpuidle_state structure and move per-cpu statistics fields
cpuidle: Remove CPUIDLE_FLAG_IGNORE and dev->prepare()
cpuidle: Move dev->last_residency update to driver enter routine; remove dev->last_state
ACPI: Fix CONFIG_ACPI_DOCK=n compiler warning
ACPI: Export FADT pm_profile integer value to userspace
thermal: Prevent polling from happening during system suspend
ACPI: Drop ACPI_NO_HARDWARE_INIT
ACPI atomicio: Convert width in bits to bytes in __acpi_ioremap_fast()
PNPACPI: Simplify disabled resource registration
ACPI: Fix possible recursive locking in hwregs.c
ACPI: use kstrdup()
mrst pmu: update comment
tools/power turbostat: less verbose debugging