Currently, when the ring buffer drops events, it does not record
the fact that it did so. It does inform the writer that the event
was dropped by returning a NULL event, but it does not put in any
place holder where the event was dropped.
This is not a trivial thing to add because the ring buffer mostly
runs in overwrite (flight recorder) mode. That is, when the ring
buffer is full, new data will overwrite old data.
In a produce/consumer mode, where new data is simply dropped when
the ring buffer is full, it is trivial to add the placeholder
for dropped events. When there's more room to write new data, then
a special event can be added to notify the reader about the dropped
events.
But in overwrite mode, any new write can overwrite events. A place
holder can not be inserted into the ring buffer since there never
may be room. A reader could also come in at anytime and miss the
placeholder.
Luckily, the way the ring buffer works, the read side can find out
if events were lost or not, and how many events. Everytime a write
takes place, if it overwrites the header page (the next read) it
updates a "overrun" variable that keeps track of the number of
lost events. When a reader swaps out a page from the ring buffer,
it can record this number, perfom the swap, and then check to
see if the number changed, and take the diff if it has, which would be
the number of events dropped. This can be stored by the reader
and returned to callers of the reader.
Since the reader page swap will fail if the writer moved the head
page since the time the reader page set up the swap, this gives room
to record the overruns without worrying about races. If the reader
sets up the pages, records the overrun, than performs the swap,
if the swap succeeds, then the overrun variable has not been
updated since the setup before the swap.
For binary readers of the ring buffer, a flag is set in the header
of each sub page (sub buffer) of the ring buffer. This flag is embedded
in the size field of the data on the sub buffer, in the 31st bit (the size
can be 32 or 64 bits depending on the architecture), but only 27
bits needs to be used for the actual size (less actually).
We could add a new field in the sub buffer header to also record the
number of events dropped since the last read, but this will change the
format of the binary ring buffer a bit too much. Perhaps this change can
be made if the information on the number of events dropped is considered
important enough.
Note, the notification of dropped events is only used by consuming reads
or peeking at the ring buffer. Iterating over the ring buffer does not
keep this information because the necessary data is only available when
a page swap is made, and the iterator does not swap out pages.
Cc: Robert Richter <robert.richter@amd.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: "Luis Claudio R. Goncalves" <lclaudio@uudg.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If modules are configured in the build but unloading of modules is not,
then the refcnt is not defined. Place the get/put module tracepoints
under CONFIG_MODULE_UNLOAD since it references this field in the module
structure.
As a side-effect, this patch also reduces the code when MODULE_UNLOAD
is not set, because these unused tracepoints are not created.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove the @refcnt argument, because it has side-effects, and arguments with
side-effects are not skipped by the jump over disabled instrumentation and are
executed even when the tracepoint is disabled.
This was also causing a GPF as found by Randy Dunlap:
Subject: 2.6.33 GP fault only when built with tracing
LKML-Reference: <4BA2B69D.3000309@oracle.com>
Note, the current 2.6.34-rc has a fix for the actual cause of the GPF,
but this fixes one of its triggers.
Tested-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BA97FA7.6040406@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Do not free zero sized per cpu areas
x86: Make sure free_init_pages() frees pages on page boundary
x86: Make smp_locks end with page alignment
CONFIG_SLOW_WORK_PROC was changed to CONFIG_SLOW_WORK_DEBUG, but not in all
instances. Change the remaining instances. This makes the debugfs file
display the time mark and the owner's description again.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
x86/PCI: truncate _CRS windows with _LEN > _MAX - _MIN + 1
x86/PCI: for host bridge address space collisions, show conflicting resource
frv/PCI: remove redundant warnings
x86/PCI: remove redundant warnings
PCI: don't say we claimed a resource if we failed
PCI quirk: Disable MSI on VIA K8T890 systems
PCI quirk: RS780/RS880: work around missing MSI initialization
PCI quirk: only apply CX700 PCI bus parking quirk if external VT6212L is present
PCI: complain about devices that seem to be broken
PCI: print resources consistently with %pR
PCI: make disabled window printk style match the enabled ones
PCI: break out primary/secondary/subordinate for readability
PCI: for address space collisions, show conflicting resource
resources: add interfaces that return conflict information
PCI: cleanup error return for pcix get and set mmrbc functions
PCI: fix access of PCI_X_CMD by pcix get and set mmrbc functions
PCI: kill off pci_register_set_vga_state() symbol export.
PCI: fix return value from pcix_get_max_mmrbc()
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
time: Fix accumulation bug triggered by long delay.
posix-cpu-timers: Reset expire cache when no timer is running
timer stats: Fix del_timer_sync() and try_to_del_timer_sync()
clockevents: Sanitize min_delta_ns adjustment and prevent overflows
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Use proper type in sched_getaffinity()
kernel/sched.c: Suppress unused var warning
sched: sched_getaffinity(): Allow less than NR_CPUS length
* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
genirq: Move two IRQ functions from .init.text to .text
genirq: Protect access to irq_desc->action in can_request_irq()
genirq: Prevent oneshot irq thread race
can_request_irq() accesses and dereferences irq_desc->action w/o
holding irq_desc->lock. So action can be freed on another CPU before
it's dereferenced. Unlikely, but ...
Protect it with desc->lock.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
request_resource() and insert_resource() only return success or failure,
which no information about what existing resource conflicted with the
proposed new reservation. This patch adds request_resource_conflict()
and insert_resource_conflict(), which return the conflicting resource.
Callers may use this for better error messages or to adjust the new
resource and retry the request.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
The logarithmic accumulation done in the timekeeping has some overflow
protection that limits the max shift value. That means it will take
more then shift loops to accumulate all of the cycles. This causes
the shift decrement to underflow, which causes the loop to never exit.
The simplest fix would be simply to do a:
if (shift)
shift--;
However that is not optimal, as we know the cycle offset is larger
then the interval << shift, the above would make shift drop to zero,
then we would be spinning for quite awhile accumulating at interval
chunks at a time.
Instead, this patch only decreases shift if the offset is smaller
then cycle_interval << shift. This makes sure we accumulate using
the largest chunks possible without overflowing tick_length, and limits
the number of iterations through the loop.
This issue was found and reported by Sonic Zhang, who also tested the fix.
Many thanks your explanation and testing!
Reported-by: Sonic Zhang <sonic.adi@gmail.com>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Tested-by: Sonic Zhang <sonic.adi@gmail.com>
LKML-Reference: <1268948850-5225-1-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Ensure additions on touch_ts do not overflow. This can occur
when the top 32 bits of the TSC reach 0xffffffff causing
additions to touch_ts to overflow and this in turn generates
spurious softlockup warnings.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: <stable@kernel.org>
LKML-Reference: <1268994482.1798.6.camel@lenovo>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The ring buffer uses 4 byte alignment while recording events into the
buffer, even on 64bit machines. This saves space when there are lots
of events being recorded at 4 byte boundaries.
The ring buffer has a zero copy method to write into the buffer, with
the reserving of space and then committing it. This may cause problems
when writing an 8 byte word into a 4 byte alignment (not 8). For x86 and
PPC this is not an issue, but on some architectures this would cause an
out-of-alignment exception.
This patch uses CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to determine
if it is OK to use 4 byte alignments on 64 bit machines. If it is not,
it forces the ring buffer event header to be 8 bytes and not 4,
and will align the length of the data to be 8 byte aligned.
This keeps the data payload at 8 byte alignments and will allow these
machines to run without issue.
The trick to this is that the header can be either 4 bytes or 8 bytes
depending on the length of the data payload. The 4 byte header
has a length field that supports up to 112 bytes. If the length of
the data is more than 112, the length field is set to zero, and the actual
length is stored in the next 4 bytes after the header.
When CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set, the code forces
zero in the 4 byte header forcing the length to be stored in the 4 byte
array, even with a small data load. It also forces the length of the
data load to be 8 byte aligned. The combination of these two guarantee
that the data is always at 8 byte alignment.
Tested-by: Frederic Weisbecker <fweisbec@gmail.com>
(on sparc64)
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits)
perf: Fix unexported generic perf_arch_fetch_caller_regs
perf record: Don't try to find buildids in a zero sized file
perf: export perf_trace_regs and perf_arch_fetch_caller_regs
perf, x86: Fix hw_perf_enable() event assignment
perf, ppc: Fix compile error due to new cpu notifiers
perf: Make the install relative to DESTDIR if specified
kprobes: Calculate the index correctly when freeing the out-of-line execution slot
perf tools: Fix sparse CPU numbering related bugs
perf_event: Fix oops triggered by cpu offline/online
perf: Drop the obsolete profile naming for trace events
perf: Take a hot regs snapshot for trace events
perf: Introduce new perf_fetch_caller_regs() for hot regs snapshot
perf/x86-64: Use frame pointer to walk on irq and process stacks
lockdep: Move lock events under lockdep recursion protection
perf report: Print the map table just after samples for which no map was found
perf report: Add multiple event support
perf session: Change perf_session post processing functions to take histogram tree
perf session: Add storage for seperating event types in report
perf session: Change add_hist_entry to take the tree root instead of session
perf record: Add ID and to recorded event data when recording multiple events
...