Commit Graph

2605 Commits

Author SHA1 Message Date
Linus Torvalds
9d64bf433c Merge tag 'perf-tools-for-v6.8-1-2024-01-09' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools updates from Arnaldo Carvalho de Melo:
 "Add Namhyung Kim as tools/perf/ co-maintainer, we're taking turns
  processing patches, switching roles from perf-tools to perf-tools-next
  at each Linux release.

  Data profiling:

   - Associate samples that identify loads and stores with data
     structures. This uses events available on Intel, AMD and others and
     DWARF info:

       # To get memory access samples in kernel for 1 second (on Intel)
       $ perf mem record -a -K --ldlat=4 -- sleep 1

       # Similar for the AMD (but it requires 6.3+ kernel for BPF filters)
       $ perf mem record -a --filter 'mem_op == load || mem_op == store, ip > 0x8000000000000000' -- sleep 1

     Then, amongst several modes of post processing, one can do things like:

       $ perf report -s type,typeoff --hierarchy --group --stdio
       ...
       #
       # Samples: 10K of events 'cpu/mem-loads,ldlat=4/P, cpu/mem-stores/P, dummy:u'
       # Event count (approx.): 602758064
       #
       #                    Overhead  Data Type / Data Type Offset
       # ...........................  ............................
       #
           26.09%   3.28%   0.00%     long unsigned int
              26.09%   3.28%   0.00%     long unsigned int +0 (no field)
           18.48%   0.73%   0.00%     struct page
              10.83%   0.02%   0.00%     struct page +8 (lru.next)
               3.90%   0.28%   0.00%     struct page +0 (flags)
               3.45%   0.06%   0.00%     struct page +24 (mapping)
               0.25%   0.28%   0.00%     struct page +48 (_mapcount.counter)
               0.02%   0.06%   0.00%     struct page +32 (index)
               0.02%   0.00%   0.00%     struct page +52 (_refcount.counter)
               0.02%   0.01%   0.00%     struct page +56 (memcg_data)
               0.00%   0.01%   0.00%     struct page +16 (lru.prev)
           15.37%  17.54%   0.00%     (stack operation)
              15.37%  17.54%   0.00%     (stack operation) +0 (no field)
           11.71%  50.27%   0.00%     (unknown)
              11.71%  50.27%   0.00%     (unknown) +0 (no field)

       $ perf annotate --data-type
       ...
       Annotate type: 'struct cfs_rq' in [kernel.kallsyms] (13 samples):
       ============================================================================
           samples     offset       size  field
                13          0        640  struct cfs_rq         {
                 2          0         16      struct load_weight       load {
                 2          0          8          unsigned long        weight;
                 0          8          4          u32  inv_weight;
                                              };
                 0         16          8      unsigned long    runnable_weight;
                 0         24          4      unsigned int     nr_running;
                 1         28          4      unsigned int     h_nr_running;
       ...

       $ perf annotate --data-type=page --group
       Annotate type: 'struct page' in [kernel.kallsyms] (480 samples):
        event[0] = cpu/mem-loads,ldlat=4/P
        event[1] = cpu/mem-stores/P
        event[2] = dummy:u
       ===================================================================================
                samples  offset  size  field
       447  33        0       0    64  struct page     {
       108   8        0       0     8	 long unsigned int  flags;
       319  13        0       8    40	 union       {
       319  13        0       8    40          struct          {
       236   2        0       8    16              union       {
       236   2        0       8    16                  struct list_head       lru {
       236   1        0       8     8                      struct list_head*  next;
         0   1        0      16     8                      struct list_head*  prev;
                                                       };
       236   2        0       8    16                  struct          {
       236   1        0       8     8                      void*      __filler;
         0   1        0      16     4                      unsigned int       mlock_count;
                                                       };
       236   2        0       8    16                  struct list_head       buddy_list {
       236   1        0       8     8                      struct list_head*  next;
         0   1        0      16     8                      struct list_head*  prev;
                                                       };
       236   2        0       8    16                  struct list_head       pcp_list {
       236   1        0       8     8                      struct list_head*  next;
         0   1        0      16     8                      struct list_head*  prev;
                                                       };
                                                   };
        82   4        0      24     8              struct address_space*      mapping;
         1   7        0      32     8              union       {
         1   7        0      32     8                  long unsigned int      index;
         1   7        0      32     8                  long unsigned int      share;
                                                   };
         0   0        0      40     8              long unsigned int  private;
                                                                 };

     This uses the existing annotate code, calling objdump to do the
     disassembly, with improvements to avoid having this take too long,
     but longer term a switch to a disassembler library, possibly
     reusing code in the kernel will be pursued.

     This is the initial implementation, please use it and report
     impressions and bugs. Make sure the kernel-debuginfo packages match
     the running kernel. The 'perf report' phase for non short perf.data
     files may take a while.

     There is a great article about it on LWN:

       https://lwn.net/Articles/955709/ - "Data-type profiling for perf"

     One last test I did while writing this text, on an AMD Ryzen 5950X,
     using a distro kernel, while doing a simple 'find /' on an
     otherwise idle system resulted in:

     # uname -r
     6.6.9-100.fc38.x86_64
     # perf -vv | grep BPF_
                      bpf: [ on  ]  # HAVE_LIBBPF_SUPPORT
            bpf_skeletons: [ on  ]  # HAVE_BPF_SKEL
     # rpm -qa | grep kernel-debuginfo
     kernel-debuginfo-common-x86_64-6.6.9-100.fc38.x86_64
     kernel-debuginfo-6.6.9-100.fc38.x86_64
     #
     # perf mem record -a --filter 'mem_op == load || mem_op == store, ip > 0x8000000000000000'
     ^C[ perf record: Woken up 1 times to write data ]
     [ perf record: Captured and wrote 2.199 MB perf.data (2913 samples) ]
     #
     # ls -la perf.data
     -rw-------. 1 root root 2346486 Jan  9 18:36 perf.data
     # perf evlist
     ibs_op//
     dummy:u
     # perf evlist -v
     ibs_op//: type: 11, size: 136, config: 0, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|ADDR|CPU|PERIOD|IDENTIFIER|DATA_SRC|WEIGHT, read_format: ID, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1
     dummy:u: type: 1 (PERF_TYPE_SOFTWARE), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|ADDR|CPU|IDENTIFIER|DATA_SRC|WEIGHT, read_format: ID, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, mmap_data: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
     #
     # perf report -s type,typeoff --hierarchy --group --stdio
     # Total Lost Samples: 0
     #
     # Samples: 2K of events 'ibs_op//, dummy:u'
     # Event count (approx.): 1904553038
     #
     #            Overhead  Data Type / Data Type Offset
     # ...................  ............................
     #
         73.70%   0.00%     (unknown)
            73.70%   0.00%     (unknown) +0 (no field)
          3.01%   0.00%     long unsigned int
             3.00%   0.00%     long unsigned int +0 (no field)
             0.01%   0.00%     long unsigned int +2 (no field)
          2.73%   0.00%     struct task_struct
             1.71%   0.00%     struct task_struct +52 (on_cpu)
             0.38%   0.00%     struct task_struct +2104 (rcu_read_unlock_special.b.blocked)
             0.23%   0.00%     struct task_struct +2100 (rcu_read_lock_nesting)
             0.14%   0.00%     struct task_struct +2384 ()
             0.06%   0.00%     struct task_struct +3096 (signal)
             0.05%   0.00%     struct task_struct +3616 (cgroups)
             0.05%   0.00%     struct task_struct +2344 (active_mm)
             0.02%   0.00%     struct task_struct +46 (flags)
             0.02%   0.00%     struct task_struct +2096 (migration_disabled)
             0.01%   0.00%     struct task_struct +24 (__state)
             0.01%   0.00%     struct task_struct +3956 (mm_cid_active)
             0.01%   0.00%     struct task_struct +1048 (cpus_ptr)
             0.01%   0.00%     struct task_struct +184 (se.group_node.next)
             0.01%   0.00%     struct task_struct +20 (thread_info.cpu)
             0.00%   0.00%     struct task_struct +104 (on_rq)
             0.00%   0.00%     struct task_struct +2456 (pid)
          1.36%   0.00%     struct module
             0.59%   0.00%     struct module +952 (kallsyms)
             0.42%   0.00%     struct module +0 (state)
             0.23%   0.00%     struct module +8 (list.next)
             0.12%   0.00%     struct module +216 (syms)
          0.95%   0.00%     struct inode
             0.41%   0.00%     struct inode +40 (i_sb)
             0.22%   0.00%     struct inode +0 (i_mode)
             0.06%   0.00%     struct inode +76 (i_rdev)
             0.06%   0.00%     struct inode +56 (i_security)
     <SNIP>

  perf top/report:

   - Don't ignore job control, allowing control+Z + bg to work.

   - Add s390 raw data interpretation for PAI (Processor Activity
     Instrumentation) counters.

  perf archive:

   - Add new option '--all' to pack perf.data with DSOs.

   - Add new option '--unpack' to expand tarballs.

  Initialization speedups:

   - Lazily initialize zstd streams to save memory when not using it.

   - Lazily allocate/size mmap event copy.

   - Lazy load kernel symbols in 'perf record'.

   - Be lazier in allocating lost samples buffer in 'perf record'.

   - Don't synthesize BPF events when disabled via the command line
     (perf record --no-bpf-event).

  Assorted improvements:

   - Show note on AMD systems that the :p, :pp, :ppp and :P are all the
     same, as IBS (Instruction Based Sampling) is used and it is
     inherently precise, not having levels of precision like in Intel
     systems.

   - When 'cycles' isn't available, fall back to the "task-clock" event
     when not system wide, not to 'cpu-clock'.

   - Add --debug-file option to redirect debug output, e.g.:

       $ perf --debug-file /tmp/perf.log record -v true

   - Shrink 'struct map' to under one cacheline by avoiding function
     pointers for selecting if addresses are identity or DSO relative,
     and using just a byte for some boolean struct members.

   - Resolve the arch specific strerrno just once to use in
     perf_env__arch_strerrno().

   - Reduce memory for recording PERF_RECORD_LOST_SAMPLES event.

  Assorted fixes:

   - Fix the default 'perf top' usage on Intel hybrid systems, now it
     starts with a browser showing the number of samples for Efficiency
     (cpu_atom/cycles/P) and Performance (cpu_core/cycles/P). This
     behaviour is similar on ARM64, with its respective set of
     big.LITTLE processors.

   - Fix segfault on build_mem_topology() error path.

   - Fix 'perf mem' error on hybrid related to availability of mem event
     in a PMU.

   - Fix missing reference count gets (map, maps) in the db-export code.

   - Avoid recursively taking env->bpf_progs.lock in the 'perf_env'
     code.

   - Use the newly introduced maps__for_each_map() to add missing
     locking around iteration of 'struct map' entries.

   - Parse NOTE segments until the build id is found, don't stop on the
     first one, ELF files may have several such NOTE segments.

   - Remove 'egrep' usage, it's deprecated, use 'grep -E' instead.

   - Warn first about missing libelf, not libbpf, that depends on
     libelf.

   - Use alternative to 'find ... -printf' as this isn't supported in
     busybox.

   - Address python 3.6 DeprecationWarning for string escapes.

   - Fix memory leak in uniq() in libsubcmd.

   - Fix man page formatting for 'perf lock'

   - Fix some spelling mistakes.

  perf tests:

   - Fail shell tests that need some symbol in perf itself if it is
     stripped. These tests check if a symbol is resolved, if some hot
     function is indeed detected by profiling, etc.

   - The 'perf test sigtrap' test is currently failing on PREEMPT_RT,
     skip it if sleeping spinlocks are detected (using BTF) and point to
     the mailing list discussion about it. This test is also being
     skipped on several architectures (powerpc, s390x, arm and aarch64)
     due to other pending issues with instruction breakpoints.

   - Adjust test case perf record offcpu profiling tests for s390.

   - Fix 'Setup struct perf_event_attr' fails on s390 on z/VM guest,
     addressing issues caused by the fallback from cycles to task-clock
     done in this release.

   - Fix mask for VG register in the user-regs test.

   - Use shellcheck on 'perf test' shell scripts automatically to make
     sure changes don't introduce things it flags as problematic.

   - Add option to change objdump binary and allow it to be set via
     'perf config'.

   - Add basic 'perf script', 'perf list --json' and 'perf diff' tests.

   - Basic branch counter support.

   - Make DSO tests a suite rather than individual.

   - Remove atomics from test_loop to avoid test failures.

   - Fix call chain match on powerpc for the record+probe_libc_inet_pton
     test.

   - Improve Intel hybrid tests.

  Vendor event files (JSON):

  powerpc:

   - Update datasource event name to fix duplicate events on IBM's
     Power10.

   - Add PVN for HX-C2000 CPU with Power8 Architecture.

  Intel:

   - Alderlake/rocketlake metric fixes.

   - Update emeraldrapids events to v1.02.

   - Update icelakex events to v1.23.

   - Update sapphirerapids events to v1.17.

   - Add skx, clx, icx and spr upi bandwidth metric.

  AMD:

   - Add Zen 4 memory controller events.

  RISC-V:

   - Add StarFive Dubhe-80 and Dubhe-90 JSON files.
       https://www.starfivetech.com/en/site/cpu-u

   - Add T-HEAD C9xx JSON file.
       https://github.com/riscv-software-src/opensbi/blob/master/docs/platform/thead-c9xx.md

  ARM64:

   - Remove UTF-8 characters from cmn.json, that were causing build
     failure in some distros.

   - Add core PMU events and metrics for Ampere One X.

   - Rename Ampere One's BPU_FLUSH_MEM_FAULT to GPC_FLUSH_MEM_FAULT

  libperf:

   - Rename several perf_cpu_map constructor names to clarify what they
     really do.

   - Ditto for some other methods, coping with some issues in their
     semantics, like perf_cpu_map__empty() ->
     perf_cpu_map__has_any_cpu_or_is_empty().

   - Document perf_cpu_map__nr()'s behavior

  perf stat:

   - Exit if parse groups fails.

   - Combine the -A/--no-aggr and --no-merge options.

   - Fix help message for --metric-no-threshold option.

  Hardware tracing:

  ARM64 CoreSight:

   - Bump minimum OpenCSD version to ensure a bugfix is present.

   - Add 'T' itrace option for timestamp trace

   - Set start vm addr of executable file to 0 and don't ignore first
     sample on the arm-cs-trace-disasm.py 'perf script'"

* tag 'perf-tools-for-v6.8-1-2024-01-09' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (179 commits)
  MAINTAINERS: Add Namhyung as tools/perf/ co-maintainer
  perf test: test case 'Setup struct perf_event_attr' fails on s390 on z/vm
  perf db-export: Fix missing reference count get in call_path_from_sample()
  perf tests: Add perf script test
  libsubcmd: Fix memory leak in uniq()
  perf TUI: Don't ignore job control
  perf vendor events intel: Update sapphirerapids events to v1.17
  perf vendor events intel: Update icelakex events to v1.23
  perf vendor events intel: Update emeraldrapids events to v1.02
  perf vendor events intel: Alderlake/rocketlake metric fixes
  perf x86 test: Add hybrid test for conflicting legacy/sysfs event
  perf x86 test: Update hybrid expectations
  perf vendor events amd: Add Zen 4 memory controller events
  perf stat: Fix hard coded LL miss units
  perf record: Reduce memory for recording PERF_RECORD_LOST_SAMPLES event
  perf env: Avoid recursively taking env->bpf_progs.lock
  perf annotate: Add --insn-stat option for debugging
  perf annotate: Add --type-stat option for debugging
  perf annotate: Support event group display
  perf annotate: Add --data-type option
  ...
2024-01-19 14:25:23 -08:00
Andrii Nakryiko
76ec90a996 libbpf: warn on unexpected __arg_ctx type when rewriting BTF
On kernels that don't support arg:ctx tag, before adjusting global
subprog BTF information to match kernel's expected canonical type names,
make sure that types used by user are meaningful, and if not, warn and
don't do BTF adjustments.

This is similar to checks that kernel performs, but narrower in scope,
as only a small subset of BPF program types can be accommodated by
libbpf using canonical type names.

Libbpf unconditionally allows `struct pt_regs *` for perf_event program
types, unlike kernel, which supports that conditionally on architecture.
This is done to keep things simple and not cause unnecessary false
positives. This seems like a minor and harmless deviation, which in
real-world programs will be caught by kernels with arg:ctx tag support
anyways. So KISS principle.

This logic is hard to test (especially on latest kernels), so manual
testing was performed instead. Libbpf emitted the following warning for
perf_event program with wrong context argument type:

  libbpf: prog 'arg_tag_ctx_perf': subprog 'subprog_ctx_tag' arg#0 is expected to be of `struct bpf_perf_event_data *` type

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240118033143.3384355-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-17 20:20:06 -08:00
Andrii Nakryiko
01b55f4f0c libbpf: feature-detect arg:ctx tag support in kernel
Add feature detector of kernel-side arg:ctx (__arg_ctx) tag support. If
this is detected, libbpf will avoid doing any __arg_ctx-related BTF
rewriting and checks in favor of letting kernel handle this completely.

test_global_funcs/ctx_arg_rewrite subtest is adjusted to do the same
feature detection (albeit in a much simpler, though roundabout and
inefficient, way), and skip the tests. This is done to still be able to
execute this test on older kernels (like in libbpf CI).

Note, BPF token series ([0]) does a major refactor and code moving of
libbpf-internal feature detection "framework", so to avoid unnecessary
conflicts we keep newly added feature detection stand-alone with ad-hoc
result caching. Once things settle, there will be a small follow up to
re-integrate everything back and move code into its final place in
newly-added (by BPF token series) features.c file.

  [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=814209&state=*
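
The "stand-alone with ad-hoc result caching" approach mentioned above can be sketched roughly as follows. This is only an illustration: the enum, the function names and the stub probe are hypothetical and do not come from libbpf, where the real probe would try loading a tiny BPF program that uses the arg:ctx tag.

```c
#include <stdbool.h>

/* Hypothetical tri-state cache value, not libbpf's own type. */
enum feat_state { FEAT_UNKNOWN = 0, FEAT_SUPPORTED, FEAT_MISSING };

/* Counts how often the (expensive) probe actually runs. */
static int probe_calls;

/* Stub standing in for the real kernel probe; always reports support. */
static bool probe_arg_ctx_stub(void)
{
	probe_calls++;
	return true;
}

/* Run the probe once and remember the answer for later callers. */
static bool kernel_supports_arg_ctx_sketch(void)
{
	static enum feat_state cached = FEAT_UNKNOWN;

	if (cached == FEAT_UNKNOWN)
		cached = probe_arg_ctx_stub() ? FEAT_SUPPORTED : FEAT_MISSING;
	return cached == FEAT_SUPPORTED;
}
```

The function-local static keeps the cache private to the detector, which matches the "stand-alone" intent; note that this sketch is not thread-safe.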

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240118033143.3384355-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-17 20:20:05 -08:00
Ian Rogers
ad30469a84 libsubcmd: Fix memory leak in uniq()
uniq() will write one command name over another causing the overwritten
string to be leaked. Fix by doing a pass that removes duplicates and a
second that removes the holes.
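
A minimal sketch of the two-pass idea, assuming a sorted array of heap-allocated strings. The helper name and the plain `char **` layout are illustrative only; the real uniq() operates on libsubcmd's own command-name structures.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not the libsubcmd code: deduplicate a sorted array
 * of heap-allocated strings in place and return the new element count.
 */
static size_t dedup_sorted(char **names, size_t cnt)
{
	size_t i, j;

	if (cnt == 0)
		return 0;

	/* Pass 1: free the earlier entry of each equal adjacent pair. */
	for (i = 1; i < cnt; i++) {
		if (strcmp(names[i], names[i - 1]) == 0) {
			free(names[i - 1]);
			names[i - 1] = NULL; /* leave a hole */
		}
	}

	/* Pass 2: slide the surviving entries down over the holes. */
	for (i = 0, j = 0; i < cnt; i++) {
		if (names[i])
			names[j++] = names[i];
	}

	/* Clear the unused tail so no pointer is referenced twice. */
	for (i = j; i < cnt; i++)
		names[i] = NULL;

	return j;
}
```

Because every duplicate is freed before anything is moved, no string is ever overwritten while it is still the only reference to its allocation.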

Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Chenyuan Mi <cymi20@fudan.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231208000515.1693746-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-01-04 18:29:17 -03:00
Andrii Nakryiko
2f38fe6894 libbpf: implement __arg_ctx fallback logic
Out of all special global func arg tag annotations, __arg_ctx is
practically the most immediately useful and the most critical to have
working across a multitude of kernel versions, if possible. This would allow
end users to write much simpler code if __arg_ctx semantics worked for
older kernels that don't natively understand btf_decl_tag("arg:ctx") in
verifier logic.

Luckily, it is possible to ensure __arg_ctx works on old kernels through
a bit of extra work done by libbpf, at least in a lot of common cases.

To explain the overall idea, we need to go back to how the context argument
was supported in global funcs before __arg_ctx support was added. This
was done based on special struct name checks in the kernel. E.g., for
BPF_PROG_TYPE_PERF_EVENT the expectation is that argument type `struct
bpf_perf_event_data *` marks that argument as PTR_TO_CTX. This is all
good as long as global function is used from the same BPF program types
only, which is often not the case. If the same subprog has to be called
from, say, kprobe and perf_event program types, there is no single
definition that would satisfy BPF verifier. Subprog will have context
argument either for kprobe (if using bpf_user_pt_regs_t struct name) or
perf_event (with bpf_perf_event_data struct name), but not both.

This limitation was the reason to add btf_decl_tag("arg:ctx"), making
the actual argument type not important, so that user can just define
"generic" signature:

  __noinline int global_subprog(void *ctx __arg_ctx) { ... }

I won't belabor how libbpf is implementing subprograms, see a huge
comment next to bpf_object_relocate_calls() function. The idea is that
each main/entry BPF program gets its own copy of global_subprog's code
appended.

This per-program copy of global subprog code *and* associated func_info
.BTF.ext information, pointing to FUNC -> FUNC_PROTO BTF type chain
allows libbpf to simulate __arg_ctx behavior transparently, even if the
kernel doesn't yet support __arg_ctx annotation natively.

The idea is straightforward: each time we append global subprog's code
and func_info information, we adjust its FUNC -> FUNC_PROTO type
information, if necessary (that is, libbpf can detect the presence of
btf_decl_tag("arg:ctx") just like BPF verifier would do it).

The rest is just mechanical and somewhat painful BTF manipulation code.
It's painful because we need to clone FUNC -> FUNC_PROTO, instead of
reusing it, as same FUNC -> FUNC_PROTO chain might be used by another
main BPF program within the same BPF object, so we can't just modify it
in-place (and cloning BTF types within the same struct btf object is
painful due to constant memory invalidation, see comments in code).
Uploaded BPF object's BTF information has to work for all BPF
programs at the same time.

Once we have FUNC -> FUNC_PROTO clones, we make sure that instead of
using some `void *ctx` parameter definition, we have an expected `struct
bpf_perf_event_data *ctx` definition (as far as BPF verifier and kernel
is concerned), which will mark it as context for BPF verifier. Same
global subprog relocated and copied into another main BPF program will
get different type information according to main program's type. It all
works out in the end in a completely transparent way for end user.

Libbpf maintains an internal program type -> expected context struct name
mapping. Note, not all BPF program types have a named context
struct, so this approach won't work for such programs (just like it
didn't before __arg_ctx). So native __arg_ctx is still important to have
in kernel to have generic context support across all BPF program types.
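
The program type -> context struct name mapping can be pictured as a small lookup table. The sketch below is an assumption-laden illustration: the table layout, the string keys and the function name are hypothetical, and only the two struct names mentioned in this message are listed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry; libbpf's real mapping is keyed differently. */
struct ctx_name_entry {
	const char *prog_type;
	const char *ctx_struct;
};

/* Only the two pairs named in the commit message are included here. */
static const struct ctx_name_entry ctx_names[] = {
	{ "kprobe",     "bpf_user_pt_regs_t" },
	{ "perf_event", "bpf_perf_event_data" },
};

/* Return the expected context struct name, or NULL if none is known. */
static const char *expected_ctx_name(const char *prog_type)
{
	size_t i;

	for (i = 0; i < sizeof(ctx_names) / sizeof(ctx_names[0]); i++) {
		if (strcmp(ctx_names[i].prog_type, prog_type) == 0)
			return ctx_names[i].ctx_struct;
	}
	return NULL; /* no named context struct: fallback does not apply */
}
```

A NULL result corresponds to the case described above, where a program type has no named context struct and only native kernel-side __arg_ctx support can help.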

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:22:49 -08:00
Andrii Nakryiko
1004742d7f libbpf: move BTF loading step after relocation step
With all the preparations in previous patches done we are ready to
postpone BTF loading and sanitization step until after all the
relocations are performed.

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:22:49 -08:00
Andrii Nakryiko
fb03be7c4a libbpf: move exception callbacks assignment logic into relocation step
Move the logic of finding and assigning exception callback indices from
BTF sanitization step to program relocations step, which seems more
logical and will unblock moving BTF loading to after relocation step.

Exception callbacks discovery and assignment has no dependency on BTF
being loaded into the kernel, it only uses BTF information. It does need
to happen before subprogram relocations happen, though. Which is why the
split.

No functional changes.

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:22:49 -08:00
Andrii Nakryiko
dac645b950 libbpf: use stable map placeholder FDs
Move map creation to later during BPF object loading by pre-creating
stable placeholder FDs (utilizing memfd_create()). Use dup2()
syscall to then atomically make those placeholder FDs point to real
kernel BPF map objects.

This change allows delaying BPF map creation until after all the BPF
program relocations. That, in turn, allows delaying BTF finalization and
loading into the kernel until after all the relocations as well. We'll take
advantage of the latter in subsequent patches to allow libbpf to adjust
BTF in a way that helps with BPF global function usage.

Clean up a few places where we close map->fd, which now shouldn't
happen, because map->fd should be a valid FD regardless of whether map
was created or not. Surprisingly and nicely it simplifies a bunch of
error handling code. If this change doesn't backfire, I'm tempted to
pre-create such stable FDs for other entities (progs, maybe even BTF).
We previously did some manipulations to make gen_loader work with fake
map FDs, with stable map FDs this hack is not necessary for maps (we
still have it for BTF, but I left it as is for now).
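
The placeholder trick can be sketched with two small helpers. The helper names are hypothetical and this is not the libbpf implementation; it only shows memfd_create() reserving an FD number and dup2() later making that same number refer to the real object.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for memfd_create() */
#endif
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: reserve an FD number before the real object exists. */
static int reserve_placeholder_fd(void)
{
	return memfd_create("placeholder", MFD_CLOEXEC);
}

/*
 * Hypothetical helper: make the reserved FD number refer to the real
 * object, then drop the temporary FD. Returns fixed_fd or -errno.
 */
static int install_real_fd(int fixed_fd, int tmp_fd)
{
	if (tmp_fd == fixed_fd)
		return fixed_fd; /* nothing to do; closing would lose the FD */
	if (dup2(tmp_fd, fixed_fd) < 0)
		return -errno;
	close(tmp_fd);
	return fixed_fd;
}
```

Anything that recorded the placeholder number earlier keeps working unchanged, since dup2() swaps the underlying object atomically while the number stays the same.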

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:22:49 -08:00
Andrii Nakryiko
f08c18e083 libbpf: don't rely on map->fd as an indicator of map being created
With the upcoming switch to preallocated placeholder FDs for maps,
switch various getters/setters away from checking map->fd. Use
map_is_created() helper that detects whether BPF map can be modified based
on map->obj->loaded state, with special provision for maps set up with
bpf_map__reuse_fd().

For backwards compatibility, we take map_is_created() into account in
bpf_map__fd() getter as well. This way before bpf_object__load() phase
bpf_map__fd() will always return -1, just as before the changes in
subsequent patches adding stable map->fd placeholders.

We also get rid of all internal uses of bpf_map__fd() getter, as it's
more oriented for uses external to libbpf. The above map_is_created()
check actually interferes with some of the internal uses, if map FD is
fetched through bpf_map__fd().
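
A rough sketch of the described behavior, using simplified stand-in structs. The struct layouts, the `reused` field and the `_sketch` names are assumptions for illustration and are not libbpf's actual definitions.

```c
#include <stdbool.h>

/* Simplified stand-ins for libbpf's internal structs (assumed fields). */
struct obj_sketch {
	bool loaded;
};

struct map_sketch {
	struct obj_sketch *obj;
	int fd;      /* may already hold a placeholder FD before load */
	bool reused; /* set when the user supplied an existing map */
};

/* A map counts as created once its object is loaded or it was reused. */
static bool map_is_created_sketch(const struct map_sketch *map)
{
	return map->obj->loaded || map->reused;
}

/* External getter: hide the placeholder FD until the map really exists. */
static int map_fd_sketch(const struct map_sketch *map)
{
	if (!map_is_created_sketch(map))
		return -1;
	return map->fd;
}
```

This keeps the externally visible contract (-1 before load) independent of whatever value map->fd happens to hold internally.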

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:22:49 -08:00
Andrii Nakryiko
fa98b54bff libbpf: use explicit map reuse flag to skip map creation steps
Instead of inferring whether a map already points to a previously
created/pinned BPF map (which the user can specify with the bpf_map__reuse_fd() API),
use an explicit map->reused flag that is set in such a case.

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:22:49 -08:00
Andrii Nakryiko
df7c3f7d3a libbpf: make uniform use of btf__fd() accessor inside libbpf
It makes future grepping and code analysis a bit easier.

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:22:48 -08:00
Mingyi Zhang
fc3a5534e2 libbpf: Fix NULL pointer dereference in bpf_object__collect_prog_relos
An issue occurred while reading an ELF file in libbpf.c during fuzzing:

	Program received signal SIGSEGV, Segmentation fault.
	0x0000000000958e97 in bpf_object.collect_prog_relos () at libbpf.c:4206
	4206 in libbpf.c
	(gdb) bt
	#0 0x0000000000958e97 in bpf_object.collect_prog_relos () at libbpf.c:4206
	#1 0x000000000094f9d6 in bpf_object.collect_relos () at libbpf.c:6706
	#2 0x000000000092bef3 in bpf_object_open () at libbpf.c:7437
	#3 0x000000000092c046 in bpf_object.open_mem () at libbpf.c:7497
	#4 0x0000000000924afa in LLVMFuzzerTestOneInput () at fuzz/bpf-object-fuzzer.c:16
	#5 0x000000000060be11 in testblitz_engine::fuzzer::Fuzzer::run_one ()
	#6 0x000000000087ad92 in tracing::span::Span::in_scope ()
	#7 0x00000000006078aa in testblitz_engine::fuzzer::util::walkdir ()
	#8 0x00000000005f3217 in testblitz_engine::entrypoint::main::{{closure}} ()
	#9 0x00000000005f2601 in main ()
	(gdb)

scn_data was NULL at this line of code (tools/lib/bpf/src/libbpf.c):

	if (rel->r_offset % BPF_INSN_SZ || rel->r_offset >= scn_data->d_size) {

The scn_data is derived from the code above:

	scn = elf_sec_by_idx(obj, sec_idx);
	scn_data = elf_sec_data(obj, scn);

	relo_sec_name = elf_sec_str(obj, shdr->sh_name);
	sec_name = elf_sec_name(obj, scn);
	if (!relo_sec_name || !sec_name)// don't check whether scn_data is NULL
		return -EINVAL;

In certain special scenarios, such as reading a malformed ELF file,
it is possible that scn_data may be a null pointer.
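
The kind of guard this calls for can be sketched in isolation. The struct, the helper name and the choice of -EINVAL are assumptions made for this illustration; the actual patch may return a different error code and lives inside bpf_object__collect_prog_relos().

```c
#include <errno.h>
#include <stddef.h>

#define INSN_SZ 8 /* size of one BPF instruction in bytes */

/* Stand-in for the ELF section data descriptor (only the size is used). */
struct sec_data_sketch {
	size_t d_size;
};

/* Validate a relocation offset, tolerating missing section data. */
static int check_relo_offset(const struct sec_data_sketch *scn_data,
			     size_t r_offset)
{
	if (!scn_data)
		return -EINVAL; /* malformed ELF: no data for this section */
	if (r_offset % INSN_SZ || r_offset >= scn_data->d_size)
		return -EINVAL; /* misaligned or out of bounds */
	return 0;
}
```

The NULL test has to come before the bounds check, since the bounds check is exactly the dereference that crashed in the backtrace above.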

Signed-off-by: Mingyi Zhang <zhangmingyi5@huawei.com>
Signed-off-by: Xin Liu <liuxin350@huawei.com>
Signed-off-by: Changye Wu <wuchangye@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231221033947.154564-1-liuxin350@huawei.com
2023-12-21 10:05:42 +01:00
Alyssa Ross
812d8bf876 libbpf: Skip DWARF sections in linker sanity check
clang can generate (with -g -Wa,--compress-debug-sections) 4-byte
aligned DWARF sections that declare themselves to be 8-byte aligned in
the section header.  Since DWARF sections are dropped during linking
anyway, just skip running the sanity checks on them.
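
One plausible way to recognize such sections is by name prefix. The sketch below assumes that DWARF sections are identified by a ".debug_" prefix; both the helper name and that matching rule are assumptions here, not a quote of the linker code.

```c
#include <stdbool.h>
#include <string.h>

/* Assumed rule: DWARF sections are those named ".debug_*". */
static bool looks_like_dwarf_sec(const char *name)
{
	static const char prefix[] = ".debug_";

	return strncmp(name, prefix, sizeof(prefix) - 1) == 0;
}
```

A sanity-check loop would then simply `continue` past any section for which this returns true, leaving the alignment checks in place for everything else.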

Reported-by: Sergei Trofimovich <slyich@gmail.com>
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Closes: https://lore.kernel.org/bpf/ZXcFRJVKbKxtEL5t@nz.home/
Link: https://lore.kernel.org/bpf/20231219110324.8989-1-hi@alyssa.is
2023-12-21 10:05:15 +01:00
Andrii Nakryiko
aae9c25dda libbpf: add __arg_xxx macros for annotating global func args
Add a set of __arg_xxx macros which can be used to augment BPF global
subprogs/functions with extra information for use by BPF verifier.
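
As a hedged illustration of how such annotations might look in use: the macro names and the "arg:ctx" / "arg:nonnull" tag strings below reflect my understanding of bpf_helpers.h and should be treated as assumptions. The ARG_TAG fallback, struct demo_ctx and subprog_len() are hypothetical and exist only so the sketch also compiles with compilers that lack the btf_decl_tag attribute.

```c
/* Expand to a BTF declaration tag where the compiler supports it. */
#if defined(__has_attribute)
# if __has_attribute(btf_decl_tag)
#  define ARG_TAG(tag) __attribute__((btf_decl_tag(tag)))
# endif
#endif
#ifndef ARG_TAG
# define ARG_TAG(tag) /* attribute unavailable: annotation is dropped */
#endif

#define __arg_ctx     ARG_TAG("arg:ctx")
#define __arg_nonnull ARG_TAG("arg:nonnull")

/* Hypothetical argument type for the example subprog. */
struct demo_ctx {
	int len;
};

/*
 * A global (non-static) function whose pointer argument is annotated as
 * never NULL, so the body does not need a NULL check of its own.
 */
__attribute__((noinline)) int subprog_len(struct demo_ctx *ctx __arg_nonnull)
{
	return ctx->len;
}
```

The annotation is placed after the parameter name, so it attaches to that one argument rather than to the function as a whole.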

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 18:06:47 -08:00
Andrii Nakryiko
d17aff807f Revert BPF token-related functionality
This patch includes the following reverts (one conflicting BPF FS
patch and three token patch sets, represented by merge commits):
  - revert 0f5d5454c7 "Merge branch 'bpf-fs-mount-options-parsing-follow-ups'";
  - revert 750e785796 "bpf: Support uid and gid when mounting bpffs";
  - revert 733763285a "Merge branch 'bpf-token-support-in-libbpf-s-bpf-object'";
  - revert c35919dcce "Merge branch 'bpf-token-and-bpf-fs-based-delegation'".

Link: https://lore.kernel.org/bpf/CAHk-=wg7JuFYwGy=GOMbRCtOL+jwSQsdUaBsRWkDVYbxipbM5A@mail.gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2023-12-19 08:23:03 -08:00
Arnaldo Carvalho de Melo
ab1c247094 Merge remote-tracking branch 'torvalds/master' into perf-tools-next
To pick up fixes that went thru perf-tools for v6.7 and to get in sync
with upstream to check for drift in the copies of headers, etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-12-18 21:37:07 -03:00
Ian Rogers
67bc993446 libperf cpumap: Document perf_cpu_map__nr()'s behavior
perf_cpu_map__nr()'s behavior around an empty CPU map is strange as it
returns that there is 1 CPU. Changing code that may rely on this
behavior is hard, but we can at least document the behavior.
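
The documented quirk can be sketched as follows, assuming an empty map is represented by a NULL pointer. struct cpu_map and cpu_map_nr() are simplified, hypothetical stand-ins for the libperf types.

```c
#include <stddef.h>

/* Simplified, hypothetical CPU map: a count plus a small array of CPUs. */
struct cpu_map {
	int nr;
	int cpus[8];
};

/*
 * Number of entries in the map. An empty (NULL) map reports 1 rather
 * than 0, which is the surprising behavior callers must keep in mind.
 */
static int cpu_map_nr(const struct cpu_map *map)
{
	return map ? map->nr : 1;
}
```

A caller that sizes an array from this count gets one slot even for an empty map, so it cannot use the count alone to tell "no CPUs" from "one CPU".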

Reviewed-by: James Clark <james.clark@arm.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andrew Jones <ajones@ventanamicro.com>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Atish Patra <atishp@rivosinc.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paran Lee <p4ranlee@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Steinar H. Gunderson <sesse@google.com>
Cc: Suzuki Poulouse <suzuki.poulose@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Jihong <yangjihong1@huawei.com>
Cc: Yang Li <yang.lee@linux.alibaba.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Link: https://lore.kernel.org/r/20231129060211.1890454-15-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-12-18 21:34:46 -03:00
Andrii Nakryiko
ed54124b88 libbpf: support BPF token path setting through LIBBPF_BPF_TOKEN_PATH envvar
To allow external admin authority to override default BPF FS location
(/sys/fs/bpf) for implicit BPF token creation, teach libbpf to recognize
LIBBPF_BPF_TOKEN_PATH envvar. If it is specified and user application
didn't explicitly specify either bpf_token_path or bpf_token_fd
option, it will be treated exactly like bpf_token_path option,
overriding default /sys/fs/bpf location and making BPF token mandatory.
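
The precedence described above might be sketched like this. struct token_opts, struct token_source and resolve_token_source() are hypothetical, and treating a token FD of 0 as "not specified" is an assumption of this sketch; only the envvar name and the default path come from the text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define DEFAULT_BPFFS "/sys/fs/bpf"

/* Hypothetical subset of the open options relevant to token setup. */
struct token_opts {
	const char *token_path; /* NULL: not specified */
	int token_fd;           /* 0: not specified (assumption) */
};

/* Where the token should come from, and whether failure is fatal. */
struct token_source {
	const char *path; /* NULL when an explicit FD is used instead */
	bool mandatory;
};

static struct token_source resolve_token_source(const struct token_opts *opts)
{
	struct token_source src = { DEFAULT_BPFFS, false };
	const char *env;

	if (opts->token_path) {          /* explicit path wins over the envvar */
		src.path = opts->token_path;
		src.mandatory = true;
		return src;
	}
	if (opts->token_fd) {            /* explicit FD wins over the envvar */
		src.path = NULL;
		src.mandatory = true;
		return src;
	}
	env = getenv("LIBBPF_BPF_TOKEN_PATH");
	if (env) {                       /* treated exactly like an explicit path */
		src.path = env;
		src.mandatory = true;
	}
	return src;                      /* otherwise: optional default location */
}
```

With no option and no envvar the result is the optional /sys/fs/bpf default; setting the envvar switches both the location and the mandatory flag.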

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231213190842.3844987-10-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 15:47:05 -08:00
Andrii Nakryiko
1d0dd6ea2e libbpf: wire up BPF token support at BPF object level
Add BPF token support to BPF object-level functionality.

BPF token is supported by BPF object logic either as an explicitly
provided BPF token from outside (through BPF FS path or explicit BPF
token FD), or implicitly (unless prevented through
bpf_object_open_opts).

Implicit mode is assumed to be the most common one for user namespaced
unprivileged workloads. The assumption is that privileged container
manager sets up default BPF FS mount point at /sys/fs/bpf with BPF token
delegation options (delegate_{cmds,maps,progs,attachs} mount options).
BPF object during loading will attempt to create BPF token from
/sys/fs/bpf location, and pass it for all relevant operations
(currently, map creation, BTF load, and program load).

In this implicit mode, if BPF token creation fails due to whatever
reason (BPF FS is not mounted, or kernel doesn't support BPF token,
etc), this is not considered an error. BPF object loading sequence will
proceed with no BPF token.

In explicit BPF token mode, user provides explicitly either custom BPF
FS mount point path or creates BPF token on their own and just passes
token FD directly. In such case, BPF object will either dup() token FD
(to not require caller to hold onto it for entire duration of BPF object
lifetime) or will attempt to create BPF token from provided BPF FS
location. If BPF token creation fails, that is considered a critical
error and BPF object load fails with an error.

Libbpf provides a way to disable implicit BPF token creation, if it
causes any troubles (BPF token is designed to be completely optional and
shouldn't cause any problems even if provided, but in the world of BPF
LSM, custom security logic can be installed that might change outcome
depending on the presence of BPF token). To disable libbpf's default BPF
token creation behavior user should provide either invalid BPF token FD
(negative), or empty bpf_token_path option.
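
The modes and their failure handling can be summarized in a small sketch. The enum and both functions are hypothetical, and the encoding of "not specified" (NULL path, token FD of 0) is an assumption; the rules themselves follow the description above.

```c
#include <stddef.h>

enum token_mode {
	TOKEN_DISABLED,      /* negative FD or empty path */
	TOKEN_IMPLICIT,      /* nothing specified: try the default BPF FS */
	TOKEN_EXPLICIT_PATH, /* custom BPF FS mount point given */
	TOKEN_EXPLICIT_FD,   /* caller-created token FD given */
};

static enum token_mode classify_token_mode(const char *token_path, int token_fd)
{
	if (token_fd < 0)
		return TOKEN_DISABLED;
	if (token_path && token_path[0] == '\0')
		return TOKEN_DISABLED;
	if (token_fd > 0)
		return TOKEN_EXPLICIT_FD;
	if (token_path)
		return TOKEN_EXPLICIT_PATH;
	return TOKEN_IMPLICIT;
}

/*
 * Map the result of token setup (0 or a negative error) to the outcome
 * of object loading: failures are ignored only in implicit mode.
 */
static int token_setup_outcome(enum token_mode mode, int setup_err)
{
	if (setup_err == 0 || mode == TOKEN_IMPLICIT)
		return 0;
	return setup_err;
}
```

The asymmetry is in token_setup_outcome(): the same error is swallowed in implicit mode and propagated in either explicit mode.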

BPF token presence can influence libbpf's feature probing, so if BPF
object has associated BPF token, feature probing is instructed to use
BPF object-specific feature detection cache and token FD.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231213190842.3844987-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 15:47:05 -08:00
Andrii Nakryiko
a75bb6a165 libbpf: wire up token_fd into feature probing logic
Adjust feature probing callbacks to take into account optional token_fd.
In unprivileged contexts, some feature detectors would fail to detect
kernel support just because BPF program, BPF map, or BTF object can't be
loaded due to privileged nature of those operations. So when BPF object
is loaded with BPF token, this token should be used for feature probing.

This patch sets up support for this scenario, but we don't yet pass
non-zero token FD. This will be added in the next patch.

We also switched BPF cookie detector from using kprobe program to
tracepoint one, as tracepoint is somewhat less dangerous BPF program
type and has higher likelihood of being allowed through BPF token in the
future. This change has no effect on detection behavior.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231213190842.3844987-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 15:47:05 -08:00
Andrii Nakryiko
ab8fc393b2 libbpf: move feature detection code into its own file
It's quite a lot of well isolated code, so it seems like a good
candidate to move it out of libbpf.c to reduce its size.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231213190842.3844987-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 15:47:05 -08:00
Andrii Nakryiko
29c302a2e2 libbpf: further decouple feature checking logic from bpf_object
Add feat_supported() helper that accepts feature cache instead of
bpf_object. This allows low-level code in bpf.c to not know or care
about higher-level concept of bpf_object, yet it will be able to utilize
custom feature checking in cases where BPF token might influence the
outcome.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231213190842.3844987-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 15:47:05 -08:00
Andrii Nakryiko
c6c5be3eee libbpf: split feature detectors definitions from cached results
Split a list of supported feature detectors with their corresponding
callbacks from actual cached supported/missing values. This will allow
more flexible per-token or per-object feature detectors in
subsequent refactorings.
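
One way to picture the split is a shared, read-only table of detectors next to a separate results cache that can exist per object or per token. All names below are hypothetical and the two probes are dummies; this is a sketch of the data layout, not the libbpf implementation.

```c
#include <stdbool.h>

enum feat_id { FEAT_DEMO_A, FEAT_DEMO_B, FEAT_CNT };
enum feat_result { FEAT_UNKNOWN = 0, FEAT_SUPPORTED, FEAT_MISSING };

/* Detector definition: description plus probing callback (shared). */
struct feat_detector {
	const char *desc;
	int (*probe)(int token_fd); /* > 0 supported, otherwise missing */
};

/* Cached results, kept apart so each object/token can own a copy. */
struct feat_cache {
	enum feat_result res[FEAT_CNT];
	int token_fd;
};

/* Dummy probes: A always works, B only works when a token is present. */
static int probe_demo_a(int token_fd) { (void)token_fd; return 1; }
static int probe_demo_b(int token_fd) { return token_fd > 0; }

static const struct feat_detector detectors[FEAT_CNT] = {
	[FEAT_DEMO_A] = { "demo feature A", probe_demo_a },
	[FEAT_DEMO_B] = { "demo feature B (needs token)", probe_demo_b },
};

/* Probe on first use, then answer from the given cache. */
static bool feat_cache_supported(struct feat_cache *cache, enum feat_id id)
{
	if (cache->res[id] == FEAT_UNKNOWN)
		cache->res[id] = detectors[id].probe(cache->token_fd) > 0 ?
				 FEAT_SUPPORTED : FEAT_MISSING;
	return cache->res[id] == FEAT_SUPPORTED;
}
```

Two caches with different token FDs can then report different answers for the same detector without interfering with each other.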

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231213190842.3844987-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 15:47:04 -08:00
Daniel Xu
2f70803532 libbpf: Add BPF_CORE_WRITE_BITFIELD() macro
=== Motivation ===

Similar to reading from CO-RE bitfields, we need a CO-RE aware bitfield
writing wrapper to make the verifier happy.

Two alternatives to this approach are:

1. Use the upcoming `preserve_static_offset` [0] attribute to disable
   CO-RE on specific structs.
2. Use broader byte-sized writes to write to bitfields.

(1) is a bit hard to use. It requires specific and not-very-obvious
annotations to bpftool generated vmlinux.h. It's also not generally
available in released LLVM versions yet.

(2) makes the code quite hard to read and write. And especially if
BPF_CORE_READ_BITFIELD() is already being used, it makes more sense to
have an inverse helper for writing.

=== Implementation details ===

Since the logic is a bit non-obvious, I thought it would be helpful
to explain exactly what's going on.

To start, it helps to explain what LSHIFT_U64 (lshift) and RSHIFT_U64
(rshift) are designed to mean. Consider the core of the
BPF_CORE_READ_BITFIELD() algorithm:

        val <<= __CORE_RELO(s, field, LSHIFT_U64);
        val = val >> __CORE_RELO(s, field, RSHIFT_U64);

Basically what happens is we lshift to clear the non-relevant (blank)
higher order bits. Then we rshift to bring the relevant bits (bitfield)
down to LSB position (while also clearing blank lower order bits). To
illustrate:

        Start:    ........XXX......
        Lshift:   XXX......00000000
        Rshift:   00000000000000XXX

where `.` means blank bit, `0` means 0 bit, and `X` means bitfield bit.

After the two operations, the bitfield is ready to be interpreted as a
regular integer.
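
The two shifts can be tried out in plain C. bitfield_read() is a hypothetical helper that takes lshift and rshift as ordinary arguments instead of CO-RE relocations, and it covers unsigned fields only.

```c
#include <stdint.h>

/*
 * Extract an unsigned bitfield from a 64-bit word.
 * lshift: number of blank bits above the field.
 * rshift: 64 minus the field width.
 */
static uint64_t bitfield_read(uint64_t val, unsigned int lshift,
			      unsigned int rshift)
{
	val <<= lshift;       /* push the higher order blank bits out */
	return val >> rshift; /* bring the field down to the LSB position */
}
```

For a 3-bit field with 8 blank bits above it, lshift is 8 and rshift is 61; the surrounding bits do not affect the result.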

Next, we want to build an alternative (but more helpful) mental model
on lshift and rshift. That is, to consider:

* rshift as the total number of blank bits in the u64
* lshift as number of blank bits left of the bitfield in the u64

Take a moment to consider why that is true by consulting the above
diagram.

With this insight, we can now define the following relationship:

              bitfield
                 _
                | |
        0.....00XXX0...00
        |      |   |    |
        |______|   |    |
         lshift    |    |
                   |____|
              (rshift - lshift)

That is, we know the number of higher order blank bits is just lshift.
And the number of lower order blank bits is (rshift - lshift).

Finally, we can examine the core of the write side algorithm:

        mask = (~0ULL << rshift) >> lshift;              // 1
        val = (val & ~mask) | ((nval << rpad) & mask);   // 2

1. Compute a mask where the set bits are the bitfield bits. The first
   left shift zeros out exactly the number of blank bits, leaving a
   bitfield sized set of 1s. The subsequent right shift inserts the
   correct amount of higher order blank bits.

2. On the left of the `|`, mask out the bitfield bits. This creates
   0s where the new bitfield bits will go. On the right of the `|`,
   shift nval left by rpad, the number of lower order blank bits
   (rshift - lshift), to bring it into the correct bit position, and
   mask out any bits that fall outside of the bitfield. Finally, by
   OR-ing the two halves, we get the final set of bits to write back.
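
The same two steps as a standalone sketch. bitfield_write() is a hypothetical helper operating on a plain 64-bit value, with lshift and rshift passed in as arguments and rpad computed as (rshift - lshift).

```c
#include <stdint.h>

/*
 * Return val with the bitfield replaced by nval.
 * lshift: number of blank bits above the field.
 * rshift: 64 minus the field width (total number of blank bits).
 */
static uint64_t bitfield_write(uint64_t val, uint64_t nval,
			       unsigned int lshift, unsigned int rshift)
{
	unsigned int rpad = rshift - lshift;         /* blank bits below the field */
	uint64_t mask = (~0ULL << rshift) >> lshift; /* 1s exactly over the field */

	/* Clear the old field bits, then merge in the positioned new value. */
	return (val & ~mask) | ((nval << rpad) & mask);
}
```

For a 3-bit field with 8 blank bits above it (lshift 8, rshift 61) the mask covers bits 53 to 55, and a new value wider than the field is truncated to its low 3 bits.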

[0]: https://reviews.llvm.org/D133361
Co-developed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Co-developed-by: Jonathan Lemon <jlemon@aviatrix.com>
Signed-off-by: Jonathan Lemon <jlemon@aviatrix.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/4d3dd215a4fd57d980733886f9c11a45e1a9adf3.1702325874.git.dxu@dxuuu.xyz
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-12-13 15:42:19 -08:00
Ian Rogers
5805c82513 libperf cpumap: Add for_each_cpu() that skips the "any CPU" case
When iterating CPUs in a CPU map it is often desirable to skip the "any
CPU" (aka dummy) case. Add a helper for this and use in builtin-record.
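
A minimal sketch of such an iterator, assuming the "any CPU" entry is stored as -1. struct cpu_map, the macro and count_real_cpus() are simplified, hypothetical stand-ins rather than the libperf definitions.

```c
/* Simplified, hypothetical CPU map: a count plus a small array of CPUs. */
struct cpu_map {
	int nr;
	int cpus[8];
};

/*
 * Visit every entry of the map except the "any CPU" (-1) placeholder.
 * The trailing if makes the macro body conditional, so avoid attaching
 * an else directly after a use of this macro.
 */
#define cpu_map_for_each_cpu_skip_any(cpu, idx, map)		\
	for ((idx) = 0; (idx) < (map)->nr; (idx)++)		\
		if (((cpu) = (map)->cpus[(idx)]) != -1)

/* Count only the entries that name a real CPU. */
static int count_real_cpus(const struct cpu_map *map)
{
	int cpu, idx, n = 0;

	cpu_map_for_each_cpu_skip_any(cpu, idx, map)
		n++;
	return n;
}
```

For a map holding { -1, 0, 2 } the loop body runs twice, once for CPU 0 and once for CPU 2.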

Reviewed-by: James Clark <james.clark@arm.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andrew Jones <ajones@ventanamicro.com>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Atish Patra <atishp@rivosinc.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paran Lee <p4ranlee@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Steinar H. Gunderson <sesse@google.com>
Cc: Suzuki Poulouse <suzuki.poulose@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Jihong <yangjihong1@huawei.com>
Cc: Yang Li <yang.lee@linux.alibaba.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: bpf@vger.kernel.org
Cc: coresight@lists.linaro.org
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lore.kernel.org/r/20231129060211.1890454-6-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-12-12 14:55:13 -03:00