Merge tag 'perf-tools-for-v5.12-2020-02-19' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

Pull perf tool updates from Arnaldo Carvalho de Melo:
 "New features:

   - Support instruction latency in 'perf report', with both memory
     latency (weight) and instruction latency information, users can
     locate expensive load instructions and understand time spent in
     different stages.

   - Extend 'perf c2c' to display the number of loads which were blocked
     by data or address conflict.

   - Add 'perf stat' support for L2 topdown events in systems such as
     Intel's Sapphire rapids server.

   - Add support for PERF_SAMPLE_CODE_PAGE_SIZE in various tools, as a
     sort key, for instance:

        perf report --stdio --sort=comm,symbol,code_page_size

   - New 'perf daemon' command to run long running sessions while
     providing a way to control the enablement of events without
     restarting a traditional 'perf record' session.

   - Enable counting events for BPF programs in 'perf stat' just like
     for other targets (tid, cgroup, cpu, etc), e.g.:

        # perf stat -e ref-cycles,cycles -b 254 -I 1000
           1.487903822            115,200      ref-cycles
           1.487903822             86,012      cycles
           2.489147029             80,560      ref-cycles
           2.489147029             73,784      cycles
        ^C

     The example above counts 'cycles' and 'ref-cycles' of BPF program
     of id 254. It is similar to bpftool-prog-profile command, but more
     flexible.

   - Support the new layout for PERF_RECORD_MMAP2 to carry the DSO
     build-id using infrastructure generalised from the eBPF subsystem,
     removing the need for traversing the perf.data file to collect
     build-ids at the end of 'perf record' sessions and helping with
     long running sessions where binaries can get replaced in updates,
     leading to possible mis-resolution of symbols.

   - Support filtering by hex address in 'perf script'.

   - Support DSO filter in 'perf script', like in other perf tools.

   - Add namespaces support to 'perf inject'

   - Add support for SDT (Dtrace Style Markers) events on ARM64.

  perf record:

   - Fix handling of eventfd() when draining a buffer in 'perf record'.

   - Improvements to the generation of metadata events for pre-existing
     threads (mmaps, comm, etc), speeding up the work done at the start
     of system wide or per CPU 'perf record' sessions.

  Hardware tracing:

   - Initial support for tracing KVM with Intel PT.

   - Intel PT fixes for IPC

   - Support Intel PT PSB (synchronization packets) events.

   - Automatically group aux-output events to overcome --filter syntax.

   - Enable PERF_SAMPLE_DATA_SRC on ARMs SPE.

   - Update ARM's CoreSight hardware tracing OpenCSD library to v1.0.0.

  perf annotate TUI:

   - Fix handling of 'k' ("show line number") hotkey

   - Fix jump parsing for C++ code.

  perf probe:

   - Add protection to avoid endless loop.

  cgroups:

   - Avoid reading cgroup mountpoint multiple times, caching it.

   - Fix handling of cgroup v1/v2 in mixed hierarchy.

  Symbol resolving:

   - Add OCaml symbol demangling.

   - Further fixes for handling PE executables when using perf with Wine
     and .exe/.dll files.

   - Fix 'perf unwind' DSO handling.

   - Resolve symbols against debug file first, to deal with artifacts
     related to LTO.

   - Fix gap between kernel end and module start on powerpc.

  Reporting tools:

   - The DSO filter shouldn't show samples in unresolved maps.

   - Improve debuginfod support in various tools.

  build ids:

   - Fix 16-byte build ids in 'perf buildid-cache', add a 'perf test'
     entry for that case.

  perf test:

   - Support for PERF_SAMPLE_WEIGHT_STRUCT.

   - Add test case for PERF_SAMPLE_CODE_PAGE_SIZE.

   - Shell based tests for 'perf daemon's commands ('start', 'stop,
     'reconfig', 'list', etc).

   - ARM cs-etm 'perf test' fixes.

   - Add parse-metric memory bandwidth testcase.

  Compiler related:

   - Fix 'perf probe' kretprobe issue caused by gcc 11 bug when used
     with -fpatchable-function-entry.

   - Fix ARM64 build with gcc 11's -Wformat-overflow.

   - Fix unaligned access in sample parsing test.

   - Fix printf conversion specifier for IP addresses on arm64, s390 and
     powerpc.

  Arch specific:

   - Support exposing Performance Monitor Counter SPRs as part of
     extended regs on powerpc.

   - Add JSON 'perf stat' metrics for ARM64's imx8mp, imx8mq and imx8mn
     DDR, fix imx8mm ones.

   - Fix common and uarch events for ARM64's A76 and Ampere eMag"

* tag 'perf-tools-for-v5.12-2020-02-19' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (148 commits)
  perf buildid-cache: Don't skip 16-byte build-ids
  perf buildid-cache: Add test for 16-byte build-id
  perf symbol: Remove redundant libbfd checks
  perf test: Output the sub testing result in cs-etm
  perf test: Suppress logs in cs-etm testing
  perf tools: Fix arm64 build error with gcc-11
  perf intel-pt: Add documentation for tracing virtual machines
  perf intel-pt: Split VM-Entry and VM-Exit branches
  perf intel-pt: Adjust sample flags for VM-Exit
  perf intel-pt: Allow for a guest kernel address filter
  perf intel-pt: Support decoding of guest kernel
  perf machine: Factor out machine__idle_thread()
  perf machine: Factor out machines__find_guest()
  perf intel-pt: Amend decoder to track the NR flag
  perf intel-pt: Retain the last PIP packet payload as is
  perf intel_pt: Add vmlaunch and vmresume as branches
  perf script: Add branch types for VM-Entry and VM-Exit
  perf auxtrace: Automatically group aux-output events
  perf test: Fix unaligned access in sample parsing test
  perf tools: Support arch specific PERF_SAMPLE_WEIGHT_STRUCT processing
  ...
This commit is contained in:
Linus Torvalds
2021-02-22 13:59:43 -08:00
183 changed files with 6941 additions and 977 deletions

View File

@@ -55,17 +55,33 @@ enum perf_event_powerpc_regs {
PERF_REG_POWERPC_MMCR3,
PERF_REG_POWERPC_SIER2,
PERF_REG_POWERPC_SIER3,
PERF_REG_POWERPC_PMC1,
PERF_REG_POWERPC_PMC2,
PERF_REG_POWERPC_PMC3,
PERF_REG_POWERPC_PMC4,
PERF_REG_POWERPC_PMC5,
PERF_REG_POWERPC_PMC6,
/* Max regs without the extended regs */
PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
};
#define PERF_REG_PMU_MASK ((1ULL << PERF_REG_POWERPC_MAX) - 1)
/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
#define PERF_REG_PMU_MASK_300 (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) - PERF_REG_PMU_MASK)
/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31 */
#define PERF_REG_PMU_MASK_31 (((1ULL << (PERF_REG_POWERPC_SIER3 + 1)) - 1) - PERF_REG_PMU_MASK)
/* Exclude MMCR3, SIER2, SIER3 for CPU_FTR_ARCH_300 */
#define PERF_EXCLUDE_REG_EXT_300 (7ULL << PERF_REG_POWERPC_MMCR3)
#define PERF_REG_MAX_ISA_300 (PERF_REG_POWERPC_MMCR2 + 1)
#define PERF_REG_MAX_ISA_31 (PERF_REG_POWERPC_SIER3 + 1)
/*
* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300
* includes 9 SPRS from MMCR0 to PMC6 excluding the
* unsupported SPRS in PERF_EXCLUDE_REG_EXT_300.
*/
#define PERF_REG_PMU_MASK_300 ((0xfffULL << PERF_REG_POWERPC_MMCR0) - PERF_EXCLUDE_REG_EXT_300)
/*
* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31
* includes 12 SPRs from MMCR0 to PMC6.
*/
#define PERF_REG_PMU_MASK_31 (0xfffULL << PERF_REG_POWERPC_MMCR0)
#define PERF_REG_EXTENDED_MAX (PERF_REG_POWERPC_PMC6 + 1)
#endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */

View File

@@ -146,6 +146,8 @@ VMLINUX_BTF_PATHS ?= $(if $(O),$(O)/vmlinux) \
/boot/vmlinux-$(shell uname -r)
VMLINUX_BTF ?= $(abspath $(firstword $(wildcard $(VMLINUX_BTF_PATHS))))
bootstrap: $(BPFTOOL_BOOTSTRAP)
ifneq ($(VMLINUX_BTF)$(VMLINUX_H),)
ifeq ($(feature-clang-bpf-co-re),1)

View File

@@ -99,7 +99,9 @@ FEATURE_TESTS_EXTRA := \
clang \
libbpf \
libpfm4 \
libdebuginfod
libdebuginfod \
clang-bpf-co-re
FEATURE_TESTS ?= $(FEATURE_TESTS_BASIC)

View File

@@ -4,9 +4,9 @@
/*
* Check OpenCSD library version is sufficient to provide required features
*/
#define OCSD_MIN_VER ((0 << 16) | (14 << 8) | (0))
#define OCSD_MIN_VER ((1 << 16) | (0 << 8) | (0))
#if !defined(OCSD_VER_NUM) || (OCSD_VER_NUM < OCSD_MIN_VER)
#error "OpenCSD >= 0.14.0 is required"
#error "OpenCSD >= 1.0.0 is required"
#endif
int main(void)

View File

@@ -145,12 +145,14 @@ enum perf_event_sample_format {
PERF_SAMPLE_CGROUP = 1U << 21,
PERF_SAMPLE_DATA_PAGE_SIZE = 1U << 22,
PERF_SAMPLE_CODE_PAGE_SIZE = 1U << 23,
PERF_SAMPLE_WEIGHT_STRUCT = 1U << 24,
PERF_SAMPLE_MAX = 1U << 24, /* non-ABI */
PERF_SAMPLE_MAX = 1U << 25, /* non-ABI */
__PERF_SAMPLE_CALLCHAIN_EARLY = 1ULL << 63, /* non-ABI; internal use */
};
#define PERF_SAMPLE_WEIGHT_TYPE (PERF_SAMPLE_WEIGHT | PERF_SAMPLE_WEIGHT_STRUCT)
/*
* values to program into branch_sample_type when PERF_SAMPLE_BRANCH is set
*
@@ -386,7 +388,8 @@ struct perf_event_attr {
aux_output : 1, /* generate AUX records instead of events */
cgroup : 1, /* include cgroup events */
text_poke : 1, /* include text poke events */
__reserved_1 : 30;
build_id : 1, /* use build id in mmap2 events */
__reserved_1 : 29;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -659,6 +662,22 @@ struct perf_event_mmap_page {
__u64 aux_size;
};
/*
* The current state of perf_event_header::misc bits usage:
* ('|' used bit, '-' unused bit)
*
* 012 CDEF
* |||---------||||
*
* Where:
* 0-2 CPUMODE_MASK
*
* C PROC_MAP_PARSE_TIMEOUT
* D MMAP_DATA / COMM_EXEC / FORK_EXEC / SWITCH_OUT
* E MMAP_BUILD_ID / EXACT_IP / SCHED_OUT_PREEMPT
* F (reserved)
*/
#define PERF_RECORD_MISC_CPUMODE_MASK (7 << 0)
#define PERF_RECORD_MISC_CPUMODE_UNKNOWN (0 << 0)
#define PERF_RECORD_MISC_KERNEL (1 << 0)
@@ -690,6 +709,7 @@ struct perf_event_mmap_page {
*
* PERF_RECORD_MISC_EXACT_IP - PERF_RECORD_SAMPLE of precise events
* PERF_RECORD_MISC_SWITCH_OUT_PREEMPT - PERF_RECORD_SWITCH* events
* PERF_RECORD_MISC_MMAP_BUILD_ID - PERF_RECORD_MMAP2 event
*
*
* PERF_RECORD_MISC_EXACT_IP:
@@ -699,9 +719,13 @@ struct perf_event_mmap_page {
*
* PERF_RECORD_MISC_SWITCH_OUT_PREEMPT:
* Indicates that thread was preempted in TASK_RUNNING state.
*
* PERF_RECORD_MISC_MMAP_BUILD_ID:
* Indicates that mmap2 event carries build id data.
*/
#define PERF_RECORD_MISC_EXACT_IP (1 << 14)
#define PERF_RECORD_MISC_SWITCH_OUT_PREEMPT (1 << 14)
#define PERF_RECORD_MISC_MMAP_BUILD_ID (1 << 14)
/*
* Reserve the last bit to indicate some extended misc field
*/
@@ -890,7 +914,24 @@ enum perf_event_type {
* char data[size];
* u64 dyn_size; } && PERF_SAMPLE_STACK_USER
*
* { u64 weight; } && PERF_SAMPLE_WEIGHT
* { union perf_sample_weight
* {
* u64 full; && PERF_SAMPLE_WEIGHT
* #if defined(__LITTLE_ENDIAN_BITFIELD)
* struct {
* u32 var1_dw;
* u16 var2_w;
* u16 var3_w;
* } && PERF_SAMPLE_WEIGHT_STRUCT
* #elif defined(__BIG_ENDIAN_BITFIELD)
* struct {
* u16 var3_w;
* u16 var2_w;
* u32 var1_dw;
* } && PERF_SAMPLE_WEIGHT_STRUCT
* #endif
* }
* }
* { u64 data_src; } && PERF_SAMPLE_DATA_SRC
* { u64 transaction; } && PERF_SAMPLE_TRANSACTION
* { u64 abi; # enum perf_sample_regs_abi
@@ -915,10 +956,20 @@ enum perf_event_type {
* u64 addr;
* u64 len;
* u64 pgoff;
* u32 maj;
* u32 min;
* u64 ino;
* u64 ino_generation;
* union {
* struct {
* u32 maj;
* u32 min;
* u64 ino;
* u64 ino_generation;
* };
* struct {
* u8 build_id_size;
* u8 __reserved_1;
* u16 __reserved_2;
* u8 build_id[20];
* };
* };
* u32 prot, flags;
* char filename[];
* struct sample_id sample_id;
@@ -1127,14 +1178,16 @@ union perf_mem_data_src {
mem_lvl_num:4, /* memory hierarchy level number */
mem_remote:1, /* remote */
mem_snoopx:2, /* snoop mode, ext */
mem_rsvd:24;
mem_blk:3, /* access blocked */
mem_rsvd:21;
};
};
#elif defined(__BIG_ENDIAN_BITFIELD)
union perf_mem_data_src {
__u64 val;
struct {
__u64 mem_rsvd:24,
__u64 mem_rsvd:21,
mem_blk:3, /* access blocked */
mem_snoopx:2, /* snoop mode, ext */
mem_remote:1, /* remote */
mem_lvl_num:4, /* memory hierarchy level number */
@@ -1217,6 +1270,12 @@ union perf_mem_data_src {
#define PERF_MEM_TLB_OS 0x40 /* OS fault handler */
#define PERF_MEM_TLB_SHIFT 26
/* Access blocked */
#define PERF_MEM_BLK_NA 0x01 /* not available */
#define PERF_MEM_BLK_DATA 0x02 /* data could not be forwarded */
#define PERF_MEM_BLK_ADDR 0x04 /* address conflict */
#define PERF_MEM_BLK_SHIFT 40
#define PERF_MEM_S(a, s) \
(((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
@@ -1248,4 +1307,23 @@ struct perf_branch_entry {
reserved:40;
};
union perf_sample_weight {
__u64 full;
#if defined(__LITTLE_ENDIAN_BITFIELD)
struct {
__u32 var1_dw;
__u16 var2_w;
__u16 var3_w;
};
#elif defined(__BIG_ENDIAN_BITFIELD)
struct {
__u16 var3_w;
__u16 var2_w;
__u32 var1_dw;
};
#else
#error "Unknown endianness"
#endif
};
#endif /* _UAPI_LINUX_PERF_EVENT_H */

View File

@@ -251,5 +251,8 @@ struct prctl_mm_map {
#define PR_SET_SYSCALL_USER_DISPATCH 59
# define PR_SYS_DISPATCH_OFF 0
# define PR_SYS_DISPATCH_ON 1
/* The control values for the user space selector when dispatch is enabled */
# define SYSCALL_DISPATCH_FILTER_ALLOW 0
# define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif /* _LINUX_PRCTL_H */

View File

@@ -8,12 +8,29 @@
#include <string.h>
#include "fs.h"
struct cgroupfs_cache_entry {
char subsys[32];
char mountpoint[PATH_MAX];
};
/* just cache last used one */
static struct cgroupfs_cache_entry cached;
int cgroupfs_find_mountpoint(char *buf, size_t maxlen, const char *subsys)
{
FILE *fp;
char mountpoint[PATH_MAX + 1], tokens[PATH_MAX + 1], type[PATH_MAX + 1];
char path_v1[PATH_MAX + 1], path_v2[PATH_MAX + 2], *path;
char *token, *saved_ptr = NULL;
char *line = NULL;
size_t len = 0;
char *p, *path;
char mountpoint[PATH_MAX];
if (!strcmp(cached.subsys, subsys)) {
if (strlen(cached.mountpoint) < maxlen) {
strcpy(buf, cached.mountpoint);
return 0;
}
return -1;
}
fp = fopen("/proc/mounts", "r");
if (!fp)
@@ -22,45 +39,63 @@ int cgroupfs_find_mountpoint(char *buf, size_t maxlen, const char *subsys)
/*
* in order to handle split hierarchy, we need to scan /proc/mounts
* and inspect every cgroupfs mount point to find one that has
* perf_event subsystem
* the given subsystem. If we found v1, just use it. If not we can
* use v2 path as a fallback.
*/
path_v1[0] = '\0';
path_v2[0] = '\0';
mountpoint[0] = '\0';
while (fscanf(fp, "%*s %"__stringify(PATH_MAX)"s %"__stringify(PATH_MAX)"s %"
__stringify(PATH_MAX)"s %*d %*d\n",
mountpoint, type, tokens) == 3) {
/*
* The /proc/mounts has the follow format:
*
* <devname> <mount point> <fs type> <options> ...
*
*/
while (getline(&line, &len, fp) != -1) {
/* skip devname */
p = strchr(line, ' ');
if (p == NULL)
continue;
if (!path_v1[0] && !strcmp(type, "cgroup")) {
/* save the mount point */
path = ++p;
p = strchr(p, ' ');
if (p == NULL)
continue;
token = strtok_r(tokens, ",", &saved_ptr);
*p++ = '\0';
while (token != NULL) {
if (subsys && !strcmp(token, subsys)) {
strcpy(path_v1, mountpoint);
break;
}
token = strtok_r(NULL, ",", &saved_ptr);
}
/* check filesystem type */
if (strncmp(p, "cgroup", 6))
continue;
if (p[6] == '2') {
/* save cgroup v2 path */
strcpy(mountpoint, path);
continue;
}
if (!path_v2[0] && !strcmp(type, "cgroup2"))
strcpy(path_v2, mountpoint);
/* now we have cgroup v1, check the options for subsystem */
p += 7;
if (path_v1[0] && path_v2[0])
break;
p = strstr(p, subsys);
if (p == NULL)
continue;
/* sanity check: it should be separated by a space or a comma */
if (!strchr(" ,", p[-1]) || !strchr(" ,", p[strlen(subsys)]))
continue;
strcpy(mountpoint, path);
break;
}
free(line);
fclose(fp);
if (path_v1[0])
path = path_v1;
else if (path_v2[0])
path = path_v2;
else
return -1;
strncpy(cached.subsys, subsys, sizeof(cached.subsys) - 1);
strcpy(cached.mountpoint, mountpoint);
if (strlen(path) < maxlen) {
strcpy(buf, path);
if (mountpoint[0] && strlen(mountpoint) < maxlen) {
strcpy(buf, mountpoint);
return 0;
}
return -1;

View File

@@ -23,10 +23,20 @@ struct perf_record_mmap2 {
__u64 start;
__u64 len;
__u64 pgoff;
__u32 maj;
__u32 min;
__u64 ino;
__u64 ino_generation;
union {
struct {
__u32 maj;
__u32 min;
__u64 ino;
__u64 ino_generation;
};
struct {
__u8 build_id_size;
__u8 __reserved_1;
__u16 __reserved_2;
__u8 build_id[20];
};
};
__u32 prot;
__u32 flags;
char filename[PATH_MAX];

View File

@@ -24,6 +24,7 @@ perf-y += builtin-mem.o
perf-y += builtin-data.o
perf-y += builtin-version.o
perf-y += builtin-c2c.o
perf-y += builtin-daemon.o
perf-$(CONFIG_TRACE) += builtin-trace.o
perf-$(CONFIG_LIBELF) += builtin-probe.o

View File

@@ -3,7 +3,7 @@
****** perf by examples ******
------------------------------
[ From an e-mail by Ingo Molnar, http://lkml.org/lkml/2009/8/4/346 ]
[ From an e-mail by Ingo Molnar, https://lore.kernel.org/lkml/20090804195717.GA5998@elte.hu ]
First, discovery/enumeration of available counters can be done via

View File

@@ -4,7 +4,7 @@
r synthesize branches events (returns only)
x synthesize transactions events
w synthesize ptwrite events
p synthesize power events
p synthesize power events (incl. PSB events for Intel PT)
o synthesize other events recorded due to the use
of aux-output (refer to perf record)
e synthesize error events

View File

@@ -74,6 +74,12 @@ OPTIONS
used when creating a uprobe for a process that resides in a
different mount namespace from the perf(1) utility.
--debuginfod=URLs::
Specify debuginfod URL to be used when retrieving perf.data binaries,
it follows the same syntax as the DEBUGINFOD_URLS variable, like:
buildid-cache.debuginfod=http://192.168.122.174:8002
SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-report[1], linkperf:perf-buildid-list[1]

View File

@@ -238,6 +238,13 @@ buildid.*::
cache location, or to disable it altogether. If you want to disable it,
set buildid.dir to /dev/null. The default is $HOME/.debug
buildid-cache.*::
buildid-cache.debuginfod=URLs
Specify debuginfod URLs to be used when retrieving perf.data binaries,
it follows the same syntax as the DEBUGINFOD_URLS variable, like:
buildid-cache.debuginfod=http://192.168.122.174:8002
annotate.*::
These are in control of addresses, jump function, source code
in lines of assembly code from a specific program.
@@ -552,11 +559,12 @@ kmem.*::
record.*::
record.build-id::
This option can be 'cache', 'no-cache' or 'skip'.
This option can be 'cache', 'no-cache', 'skip' or 'mmap'.
'cache' is to post-process data and save/update the binaries into
the build-id cache (in ~/.debug). This is the default.
But if this option is 'no-cache', it will not update the build-id cache.
'skip' skips post-processing and does not update the cache.
'mmap' skips post-processing and reads build-ids from MMAP events.
record.call-graph::
This is identical to 'call-graph.record-mode', except it is
@@ -695,6 +703,20 @@ auxtrace.*::
If the directory does not exist or has the wrong file type,
the current directory is used.
daemon.*::
daemon.base::
Base path for daemon data. All sessions data are stored under
this path.
session-<NAME>.*::
session-<NAME>.run::
Defines new record session for daemon. The value is record's
command line without the 'record' keyword.
SEE ALSO
--------
linkperf:perf[1]

View File

@@ -0,0 +1,208 @@
perf-daemon(1)
==============
NAME
----
perf-daemon - Run record sessions on background
SYNOPSIS
--------
[verse]
'perf daemon'
'perf daemon' [<options>]
'perf daemon start' [<options>]
'perf daemon stop' [<options>]
'perf daemon signal' [<options>]
'perf daemon ping' [<options>]
DESCRIPTION
-----------
This command allows to run simple daemon process that starts and
monitors configured record sessions.
You can imagine 'perf daemon' of background process with several
'perf record' child tasks, like:
# ps axjf
...
1 916507 ... perf daemon start
916507 916508 ... \_ perf record --control=fifo:control,ack -m 10M -e cycles --overwrite --switch-output -a
916507 916509 ... \_ perf record --control=fifo:control,ack -m 20M -e sched:* --overwrite --switch-output -a
Not every 'perf record' session is suitable for running under daemon.
User need perf session that either produces data on query, like the
flight recorder sessions in above example or session that is configured
to produce data periodically, like with --switch-output configuration
for time and size.
Each session is started with control setup (with perf record --control
options).
Sessions are configured through config file, see CONFIG FILE section
with EXAMPLES.
OPTIONS
-------
-v::
--verbose::
Be more verbose.
--config=<PATH>::
Config file path. If not provided, perf will check system and default
locations (/etc/perfconfig, $HOME/.perfconfig).
--base=<PATH>::
Base directory path. Each daemon instance is running on top
of base directory. Only one instance of server can run on
top of one directory at the time.
All generic options are available also under commands.
START COMMAND
-------------
The start command creates the daemon process.
-f::
--foreground::
Do not put the process in background.
STOP COMMAND
------------
The stop command stops all the session and the daemon process.
SIGNAL COMMAND
--------------
The signal command sends signal to configured sessions.
--session::
Send signal to specific session.
PING COMMAND
------------
The ping command sends control ping to configured sessions.
--session::
Send ping to specific session.
CONFIG FILE
-----------
The daemon is configured within standard perf config file by
following new variables:
daemon.base:
Base path for daemon data. All sessions data are
stored under this path.
session-<NAME>.run:
Defines new record session. The value is record's command
line without the 'record' keyword.
Each perf record session is run in daemon.base/<NAME> directory.
EXAMPLES
--------
Example with 2 record sessions:
# cat ~/.perfconfig
[daemon]
base=/opt/perfdata
[session-cycles]
run = -m 10M -e cycles --overwrite --switch-output -a
[session-sched]
run = -m 20M -e sched:* --overwrite --switch-output -a
Starting the daemon:
# perf daemon start
Check sessions:
# perf daemon
[603349:daemon] base: /opt/perfdata
[603350:cycles] perf record -m 10M -e cycles --overwrite --switch-output -a
[603351:sched] perf record -m 20M -e sched:* --overwrite --switch-output -a
First line is daemon process info with configured daemon base.
Check sessions with more info:
# perf daemon -v
[603349:daemon] base: /opt/perfdata
output: /opt/perfdata/output
lock: /opt/perfdata/lock
up: 1 minutes
[603350:cycles] perf record -m 10M -e cycles --overwrite --switch-output -a
base: /opt/perfdata/session-cycles
output: /opt/perfdata/session-cycles/output
control: /opt/perfdata/session-cycles/control
ack: /opt/perfdata/session-cycles/ack
up: 1 minutes
[603351:sched] perf record -m 20M -e sched:* --overwrite --switch-output -a
base: /opt/perfdata/session-sched
output: /opt/perfdata/session-sched/output
control: /opt/perfdata/session-sched/control
ack: /opt/perfdata/session-sched/ack
up: 1 minutes
The 'base' path is daemon/session base.
The 'lock' file is daemon's lock file guarding that no other
daemon is running on top of the base.
The 'output' file is perf record output for specific session.
The 'control' and 'ack' files are perf control files.
The 'up' number shows minutes daemon/session is running.
Make sure control session is online:
# perf daemon ping
OK cycles
OK sched
Send USR2 signal to session 'cycles' to generate perf.data file:
# perf daemon signal --session cycles
signal 12 sent to session 'cycles [603452]'
# tail -2 /opt/perfdata/session-cycles/output
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2020123017013149 ]
Send USR2 signal to all sessions:
# perf daemon signal
signal 12 sent to session 'cycles [603452]'
signal 12 sent to session 'sched [603453]'
# tail -2 /opt/perfdata/session-cycles/output
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2020123017024689 ]
# tail -2 /opt/perfdata/session-sched/output
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2020123017024713 ]
Stop daemon:
# perf daemon stop
SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-config[1]

View File

@@ -858,7 +858,7 @@ The letters are:
b synthesize "branches" events
x synthesize "transactions" events
w synthesize "ptwrite" events
p synthesize "power" events
p synthesize "power" events (incl. PSB events)
c synthesize branches events (calls only)
r synthesize branches events (returns only)
e synthesize tracing error events
@@ -913,6 +913,11 @@ Where:
For more details refer to the Intel 64 and IA-32 Architectures Software
Developer Manuals.
PSB events show when a PSB+ occurred and also the byte-offset in the trace.
Emitting a PSB+ can cause a CPU a slight delay. When doing timing analysis
of code with Intel PT, it is useful to know if a timing bubble was caused
by Intel PT or not.
Error events show where the decoder lost the trace. Error events
are quite important. Users must know if what they are seeing is a complete
picture or not. The "e" option may be followed by flags which affect what errors
@@ -1141,6 +1146,88 @@ XED
include::build-xed.txt[]
Tracing Virtual Machines
------------------------
Currently, only kernel tracing is supported and only with "timeless" decoding
i.e. no TSC timestamps
Other limitations and caveats
VMX controls may suppress packets needed for decoding resulting in decoding errors
VMX controls may block the perf NMI to the host potentially resulting in lost trace data
Guest kernel self-modifying code (e.g. jump labels or JIT-compiled eBPF) will result in decoding errors
Guest thread information is unknown
Guest VCPU is unknown but may be able to be inferred from the host thread
Callchains are not supported
Example
Start VM
$ sudo virsh start kubuntu20.04
Domain kubuntu20.04 started
Mount the guest file system. Note sshfs needs -o direct_io to enable reading of proc files. root access is needed to read /proc/kcore.
$ mkdir vm0
$ sshfs -o direct_io root@vm0:/ vm0
Copy the guest /proc/kallsyms, /proc/modules and /proc/kcore
$ perf buildid-cache -v --kcore vm0/proc/kcore
kcore added to build-id cache directory /home/user/.debug/[kernel.kcore]/9600f316a53a0f54278885e8d9710538ec5f6a08/2021021807494306
$ KALLSYMS=/home/user/.debug/[kernel.kcore]/9600f316a53a0f54278885e8d9710538ec5f6a08/2021021807494306/kallsyms
Find the VM process
$ ps -eLl | grep 'KVM\|PID'
F S UID PID PPID LWP C PRI NI ADDR SZ WCHAN TTY TIME CMD
3 S 64055 1430 1 1440 1 80 0 - 1921718 - ? 00:02:47 CPU 0/KVM
3 S 64055 1430 1 1441 1 80 0 - 1921718 - ? 00:02:41 CPU 1/KVM
3 S 64055 1430 1 1442 1 80 0 - 1921718 - ? 00:02:38 CPU 2/KVM
3 S 64055 1430 1 1443 2 80 0 - 1921718 - ? 00:03:18 CPU 3/KVM
Start an open-ended perf record, tracing the VM process, do something on the VM, and then ctrl-C to stop.
TSC is not supported and tsc=0 must be specified. That means mtc is useless, so add mtc=0.
However, IPC can still be determined, hence cyc=1 can be added.
Only kernel decoding is supported, so 'k' must be specified.
Intel PT traces both the host and the guest so --guest and --host need to be specified.
Without timestamps, --per-thread must be specified to distinguish threads.
$ sudo perf kvm --guest --host --guestkallsyms $KALLSYMS record --kcore -e intel_pt/tsc=0,mtc=0,cyc=1/k -p 1430 --per-thread
^C
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 5.829 MB ]
perf script can be used to provide an instruction trace
$ perf script --guestkallsyms $KALLSYMS --insn-trace --xed -F+ipc | grep -C10 vmresume | head -21
CPU 0/KVM 1440 ffffffff82133cdd __vmx_vcpu_run+0x3d ([kernel.kallsyms]) movq 0x48(%rax), %r9
CPU 0/KVM 1440 ffffffff82133ce1 __vmx_vcpu_run+0x41 ([kernel.kallsyms]) movq 0x50(%rax), %r10
CPU 0/KVM 1440 ffffffff82133ce5 __vmx_vcpu_run+0x45 ([kernel.kallsyms]) movq 0x58(%rax), %r11
CPU 0/KVM 1440 ffffffff82133ce9 __vmx_vcpu_run+0x49 ([kernel.kallsyms]) movq 0x60(%rax), %r12
CPU 0/KVM 1440 ffffffff82133ced __vmx_vcpu_run+0x4d ([kernel.kallsyms]) movq 0x68(%rax), %r13
CPU 0/KVM 1440 ffffffff82133cf1 __vmx_vcpu_run+0x51 ([kernel.kallsyms]) movq 0x70(%rax), %r14
CPU 0/KVM 1440 ffffffff82133cf5 __vmx_vcpu_run+0x55 ([kernel.kallsyms]) movq 0x78(%rax), %r15
CPU 0/KVM 1440 ffffffff82133cf9 __vmx_vcpu_run+0x59 ([kernel.kallsyms]) movq (%rax), %rax
CPU 0/KVM 1440 ffffffff82133cfc __vmx_vcpu_run+0x5c ([kernel.kallsyms]) callq 0xffffffff82133c40
CPU 0/KVM 1440 ffffffff82133c40 vmx_vmenter+0x0 ([kernel.kallsyms]) jz 0xffffffff82133c46
CPU 0/KVM 1440 ffffffff82133c42 vmx_vmenter+0x2 ([kernel.kallsyms]) vmresume IPC: 0.11 (50/445)
:1440 1440 ffffffffbb678b06 native_write_msr+0x6 ([guest.kernel.kallsyms]) nopl %eax, (%rax,%rax,1)
:1440 1440 ffffffffbb678b0b native_write_msr+0xb ([guest.kernel.kallsyms]) retq IPC: 0.04 (2/41)
:1440 1440 ffffffffbb666646 lapic_next_deadline+0x26 ([guest.kernel.kallsyms]) data16 nop
:1440 1440 ffffffffbb666648 lapic_next_deadline+0x28 ([guest.kernel.kallsyms]) xor %eax, %eax
:1440 1440 ffffffffbb66664a lapic_next_deadline+0x2a ([guest.kernel.kallsyms]) popq %rbp
:1440 1440 ffffffffbb66664b lapic_next_deadline+0x2b ([guest.kernel.kallsyms]) retq IPC: 0.16 (4/25)
:1440 1440 ffffffffbb74607f clockevents_program_event+0x8f ([guest.kernel.kallsyms]) test %eax, %eax
:1440 1440 ffffffffbb746081 clockevents_program_event+0x91 ([guest.kernel.kallsyms]) jz 0xffffffffbb74603c IPC: 0.06 (2/30)
:1440 1440 ffffffffbb74603c clockevents_program_event+0x4c ([guest.kernel.kallsyms]) popq %rbx
:1440 1440 ffffffffbb74603d clockevents_program_event+0x4d ([guest.kernel.kallsyms]) popq %r12
SEE ALSO
--------

View File

@@ -63,6 +63,9 @@ OPTIONS
--phys-data::
Record/Report sample physical addresses
--data-page-size::
Record/Report sample data address page size
RECORD OPTIONS
--------------
-e::

View File

@@ -296,6 +296,9 @@ OPTIONS
--data-page-size::
Record the sampled data address data page size.
--code-page-size::
Record the sampled code address (ip) page size
-T::
--timestamp::
Record the sample timestamps. Use it with 'perf report -D' to see the
@@ -485,6 +488,9 @@ Specify vmlinux path which has debuginfo.
--buildid-all::
Record build-id of all DSOs regardless whether it's actually hit or not.
--buildid-mmap::
Record build ids in mmap2 events, disables build id cache (implies --no-buildid).
--aio[=n]::
Use <n> control blocks in asynchronous (Posix AIO) trace writing mode (default: 1, max: 4).
Asynchronous mode is supported only when linking Perf tool with libc library
@@ -640,9 +646,18 @@ ctl-fifo / ack-fifo are opened and used as ctl-fd / ack-fd as follows.
Listen on ctl-fd descriptor for command to control measurement.
Available commands:
'enable' : enable events
'disable' : disable events
'snapshot': AUX area tracing snapshot).
'enable' : enable events
'disable' : disable events
'enable name' : enable event 'name'
'disable name' : disable event 'name'
'snapshot' : AUX area tracing snapshot).
'stop' : stop perf record
'ping' : ping
'evlist [-v|-g|-F] : display all events
-F Show just the sample frequency used for each event.
-v Show all fields.
-g Show event group information.
Measurements can be started with events disabled using --delay=-1 option. Optionally
send control command completion ('ack\n') to ack-fd descriptor to synchronize with the

View File

@@ -108,6 +108,10 @@ OPTIONS
- period: Raw number of event count of sample
- time: Separate the samples by time stamp with the resolution specified by
--time-quantum (default 100ms). Specify with overhead and before it.
- code_page_size: the code page size of sampled code address (ip)
- ins_lat: Instruction latency in core cycles. This is the global instruction
latency
- local_ins_lat: Local instruction latency version
By default, comm, dso and symbol keys are used.
(i.e. --sort comm,dso,symbol)
@@ -139,7 +143,7 @@ OPTIONS
If the --mem-mode option is used, the following sort keys are also available
(incompatible with --branch-stack):
symbol_daddr, dso_daddr, locked, tlb, mem, snoop, dcacheline.
symbol_daddr, dso_daddr, locked, tlb, mem, snoop, dcacheline, blocked.
- symbol_daddr: name of data symbol being executed on at the time of sample
- dso_daddr: name of library or module containing the data being executed
@@ -151,9 +155,11 @@ OPTIONS
- dcacheline: the cacheline the data address is on at the time of the sample
- phys_daddr: physical address of data being executed on at the time of sample
- data_page_size: the data page size of data being executed on at the time of sample
- blocked: reason of blocked load access for the data at the time of the sample
And the default sort keys are changed to local_weight, mem, sym, dso,
symbol_daddr, dso_daddr, snoop, tlb, locked, see '--mem-mode'.
symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, local_ins_lat,
see '--mem-mode'.
If the data file has tracepoint event(s), following (dynamic) sort keys
are also available:

View File

@@ -118,7 +118,7 @@ OPTIONS
comm, tid, pid, time, cpu, event, trace, ip, sym, dso, addr, symoff,
srcline, period, iregs, uregs, brstack, brstacksym, flags, bpf-output,
brstackinsn, brstackoff, callindent, insn, insnlen, synth, phys_addr,
metric, misc, srccode, ipc, data_page_size.
metric, misc, srccode, ipc, data_page_size, code_page_size.
Field list can be prepended with the type, trace, sw or hw,
to indicate to which event type the field list applies.
e.g., -F sw:comm,tid,time,ip,sym and -F trace:time,cpu,trace
@@ -422,9 +422,32 @@ include::itrace.txt[]
Only consider the listed symbols. Symbols are typically a name
but they may also be hexadecimal address.
The hexadecimal address may be the start address of a symbol or
any other address to filter the trace records
For example, to select the symbol noploop or the address 0x4007a0:
perf script --symbols=noploop,0x4007a0
Support filtering trace records by symbol name, start address of
symbol, any hexadecimal address and address range.
The comparison order is:
1. symbol name comparison
2. symbol start address comparison.
3. any hexadecimal address comparison.
4. address range comparison (see --addr-range).
--addr-range::
Use with -S or --symbols to list traced records within address range.
For example, to list the traced records within the address range
[0x4007a0, 0x0x4007a9]:
perf script -S 0x4007a0 --addr-range 10
--dsos=::
Only consider symbols in these DSOs.
--call-trace::
Show call stream for intel_pt traces. The CPUs are interleaved, but
can be filtered with -C.

View File

@@ -75,6 +75,24 @@ report::
--tid=<tid>::
stat events on existing thread id (comma separated list)
-b::
--bpf-prog::
stat events on existing bpf program id (comma separated list),
requiring root rights. bpftool-prog could be used to find program
id all bpf programs in the system. For example:
# bpftool prog | head -n 1
17247: tracepoint name sys_enter tag 192d548b9d754067 gpl
# perf stat -e cycles,instructions --bpf-prog 17247 --timeout 1000
Performance counter stats for 'BPF program(s) 17247':
85,967 cycles
28,982 instructions # 0.34 insn per cycle
1.102235068 seconds time elapsed
ifdef::HAVE_LIBPFM[]
--pfm-events events::
Select a PMU event using libpfm4 syntax (see http://perfmon2.sf.net)
@@ -358,7 +376,7 @@ See perf list output for the possble metrics and metricgroups.
Do not aggregate counts across all monitored CPUs.
--topdown::
Print top down level 1 metrics if supported by the CPU. This allows to
Print complete top-down metrics supported by the CPU. This allows to
determine bottle necks in the CPU pipeline for CPU bound workloads,
by breaking the cycles consumed down into frontend bound, backend bound,
bad speculation and retiring.
@@ -393,6 +411,18 @@ To interpret the results it is usually needed to know on which
CPUs the workload runs on. If needed the CPUs can be forced using
taskset.
--td-level::
Print the top-down statistics that equal to or lower than the input level.
It allows users to print the interested top-down metrics level instead of
the complete top-down metrics.
The availability of the top-down metrics level depends on the hardware. For
example, Ice Lake only supports L1 top-down metrics. The Sapphire Rapids
supports both L1 and L2 top-down metrics.
Default: 0 means the max level that the current hardware support.
Error out if the input is higher than the supported max level.
--no-merge::
Do not merge results from same PMUs.

Some files were not shown because too many files have changed in this diff Show More