Commit Graph

43260 Commits

Author SHA1 Message Date
Frederic Weisbecker
dd5403869a sched/cpuidle: Comment about timers requirements VS idle handler
Add missing explanation concerning IRQs re-enablement constraints in
the cpuidle path against timers.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lkml.kernel.org/r/20231114193840.4041-2-frederic@kernel.org
2023-11-15 09:57:51 +01:00
Peter Zijlstra
63ba8422f8 sched/deadline: Introduce deadline servers
Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).

RT Throttling has been introduced a while ago as a (mostly debug)
countermeasure one can utilize to reserve some CPU time for low priority
tasks (usually background type of work, e.g. workqueues, timers, etc.).
It however has its own problems (see documentation) and the undesired
effect of unconditionally throttling FIFO tasks even when no lower
priority activity needs to run (there are mechanisms to fix this issue
as well, but, again, with their own problems).

Introduce deadline servers to service low priority tasks needs under
starvation conditions. Deadline servers are built extending SCHED_DEADLINE
implementation to allow 2-level scheduling (a sched_deadline entity
becomes a container for lower priority scheduling entities).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/4968601859d920335cf85822eb573a5f179f04b8.1699095159.git.bristot@kernel.org
2023-11-15 09:57:51 +01:00
Peter Zijlstra
2f7a0f5894 sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
In preparation of introducing !task sched_dl_entity; move the
bandwidth accounting into {en.de}queue_dl_entity().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/a86dccbbe44e021b8771627e1dae01a69b73466d.1699095159.git.bristot@kernel.org
2023-11-15 09:57:50 +01:00
Peter Zijlstra
9e07d45c52 sched/deadline: Collect sched_dl_entity initialization
Create a single function that initializes a sched_dl_entity.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/51acc695eecf0a1a2f78f9a044e11ffd9b316bcf.1699095159.git.bristot@kernel.org
2023-11-15 09:57:50 +01:00
Peter Zijlstra
c708a4dc5a sched: Unify more update_curr*()
Now that trace_sched_stat_runtime() no longer takes a vruntime
argument, the task specific bits are identical between
update_curr_common() and update_curr().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-11-15 09:57:49 +01:00
Peter Zijlstra
5fe6ec8f6a sched: Remove vruntime from trace_sched_stat_runtime()
Tracing the runtime delta makes sense, observer can sum over time.
Tracing the absolute vruntime makes less sense, inconsistent:
absolute-vs-delta, but also vruntime delta can be computed from
runtime delta.

Removing the vruntime thing also makes the two tracepoint sites
identical, allowing to unify the code in a later patch.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-11-15 09:57:49 +01:00
Peter Zijlstra
5d69eca542 sched: Unify runtime accounting across classes
All classes use sched_entity::exec_start to track runtime and have
copies of the exact same code around to compute runtime.

Collapse all that.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/54d148a144f26d9559698c4dd82d8859038a7380.1699095159.git.bristot@kernel.org
2023-11-15 09:57:48 +01:00
Abel Wu
ee4373dc90 sched/eevdf: O(1) fastpath for task selection
Since the RB-tree is now sorted by deadline, let's first try the
leftmost entity which has the earliest virtual deadline. I've done
some benchmarks to see its effectiveness.

All the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled, on a dual-CPU Intel Xeon(R)
Platinum 8260 with 2 NUMA nodes each of which has 24C/48T.

  hackbench: process/thread + pipe/socket + 1/2/4/8 groups
  netperf:   TCP/UDP + STREAM/RR + 24/48/72/96/192 threads
  tbench:    loopback 24/48/72/96/192 threads
  schbench:  1/2/4/8 mthreads

  direct:    cfs_rq has only one entity
  parity:    RUN_TO_PARITY
  fast:      O(1) fastpath
  slow:	     heap search

    (%)		direct	parity	fast	slow
  hackbench	92.95	2.02	4.91	0.12
  netperf	68.08	6.60	24.18	1.14
  tbench	67.55	11.22	20.61	0.62
  schbench	69.91	2.65	25.73	1.71

The above results indicate that this fastpath really makes task
selection more efficient.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231115033647.80785-4-wuyun.abel@bytedance.com
2023-11-15 09:57:47 +01:00
Abel Wu
2227a957e1 sched/eevdf: Sort the rbtree by virtual deadline
Sort the task timeline by virtual deadline and keep the min_vruntime
in the augmented tree, so we can avoid doubling the worst case cost
and make full use of the cached leftmost node to enable O(1) fastpath
picking in next patch.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231115033647.80785-3-wuyun.abel@bytedance.com
2023-11-15 09:57:47 +01:00
Raghavendra K T
84db47ca71 sched/numa: Fix mm numa_scan_seq based unconditional scan
Since commit fc137c0dda ("sched/numa: enhance vma scanning logic")

NUMA Balancing allows updating PTEs to trap NUMA hinting faults if the
task had previously accessed VMA. However unconditional scan of VMAs are
allowed during initial phase of VMA creation until process's
mm numa_scan_seq reaches 2 even though current task had not accessed VMA.

Rationale:
 - Without initial scan subsequent PTE update may never happen.
 - Give fair opportunity to all the VMAs to be scanned and subsequently
understand the access pattern of all the VMAs.

But it has a corner case where, if a VMA is created after some time,
process's mm numa_scan_seq could be already greater than 2.

For e.g., values of mm numa_scan_seq when VMAs are created by running
mmtest autonuma benchmark briefly looks like:
start_seq=0 : 459
start_seq=2 : 138
start_seq=3 : 144
start_seq=4 : 8
start_seq=8 : 1
start_seq=9 : 1
This results in no unconditional PTE updates for those VMAs created after
some time.

Fix:
 - Note down the initial value of mm numa_scan_seq in per VMA start_seq.
 - Allow unconditional scan till start_seq + 2.

Result:
SUT: AMD EPYC Milan with 2 NUMA nodes 256 cpus.
base kernel: upstream 6.6-rc6 with Mels patches [1] applied.

kernbench
==========		base                  patched %gain
Amean    elsp-128      165.09 ( 0.00%)      164.78 *   0.19%*

Duration User       41404.28    41375.08
Duration System      9862.22     9768.48
Duration Elapsed      519.87      518.72

Ops NUMA PTE updates           1041416.00      831536.00
Ops NUMA hint faults            263296.00      220966.00
Ops NUMA pages migrated         258021.00      212769.00
Ops AutoNUMA cost                 1328.67        1114.69

autonumabench

NUMA01_THREADLOCAL
==================
Amean  elsp-NUMA01_THREADLOCAL   81.79 (0.00%)  67.74 *  17.18%*

Duration User       54832.73    47379.67
Duration System        75.00      185.75
Duration Elapsed      576.72      476.09

Ops NUMA PTE updates                  394429.00    11121044.00
Ops NUMA hint faults                    1001.00     8906404.00
Ops NUMA pages migrated                  288.00     2998694.00
Ops AutoNUMA cost                          7.77       44666.84

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/2ea7cbce80ac7c62e90cbfb9653a7972f902439f.1697816692.git.raghavendra.kt@amd.com
2023-11-15 09:57:46 +01:00
Paul E. McKenney
d6111cf45c sched: Use WRITE_ONCE() for p->on_rq
Since RCU-tasks uses READ_ONCE(p->on_rq), ensure the write-side
matches with WRITE_ONCE().

Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/e4896e0b-eacc-45a2-a7a8-de2280a51ecc@paulmck-laptop
2023-11-15 09:57:45 +01:00
Keisuke Nishimura
6d7e4782bc sched/fair: Fix the decision for load balance
should_we_balance is called for the decision to do load-balancing.
When sched ticks invoke this function, only one CPU should return
true. However, in the current code, two CPUs can return true. The
following situation, where b means busy and i means idle, is an
example, because CPU 0 and CPU 2 return true.

        [0, 1] [2, 3]
         b  b   i  b

This fix checks if there exists an idle CPU with busy sibling(s)
after looking for a CPU on an idle core. If some idle CPUs with busy
siblings are found, just the first one should do load-balancing.

Fixes: b1bfeab9b0 ("sched/fair: Consider the idle state of the whole core for load balance")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231031133821.1570861-1-keisuke.nishimura@inria.fr
2023-11-14 22:27:01 +01:00
Johannes Weiner
8b39d20ece sched: psi: fix unprivileged polling against cgroups
519fabc7aa ("psi: remove 500ms min window size limitation for
triggers") breaks unprivileged psi polling on cgroups.

Historically, we had a privilege check for polling in the open() of a
pressure file in /proc, but were erroneously missing it for the open()
of cgroup pressure files.

When unprivileged polling was introduced in d82caa2735 ("sched/psi:
Allow unprivileged polling of N*2s period"), it needed to filter
privileges depending on the exact polling parameters, and as such
moved the CAP_SYS_RESOURCE check from the proc open() callback to
psi_trigger_create(). Both the proc files as well as cgroup files go
through this during write(). This implicitly added the missing check
for privileges required for HT polling for cgroups.

When 519fabc7aa ("psi: remove 500ms min window size limitation for
triggers") followed right after to remove further restrictions on the
RT polling window, it incorrectly assumed the cgroup privilege check
was still missing and added it to the cgroup open(), mirroring what we
used to do for proc files in the past.

As a result, unprivileged poll requests that would be supported now
get rejected when opening the cgroup pressure file for writing.

Remove the cgroup open() check. psi_trigger_create() handles it.

Fixes: 519fabc7aa ("psi: remove 500ms min window size limitation for triggers")
Reported-by: Luca Boccassi <bluca@debian.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Luca Boccassi <bluca@debian.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Cc: stable@vger.kernel.org # 6.5+
Link: https://lore.kernel.org/r/20231026164114.2488682-1-hannes@cmpxchg.org
2023-11-14 22:27:00 +01:00
Abel Wu
eab03c23c2 sched/eevdf: Fix vruntime adjustment on reweight
vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
it gets re-weighted, and the calculations can be simplified based
on the fact that re-weight won't change the w-average of all the
entities. Please check the proofs in comments.

But adjusting vruntime can also cause position change in RB-tree
hence require re-queue to fix up which might be costly. This might
be avoided by deferring adjustment to the time the entity actually
leaves tree (dequeue/pick), but that will negatively affect task
selection and probably not good enough either.

Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.abel@bytedance.com
2023-11-14 22:27:00 +01:00
Linus Torvalds
3ca112b71f Merge tag 'probes-fixes-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:

 - Documentation update: Add a note about argument and return value
   fetching is the best effort because it depends on the type.

 - objpool: Fix to make internal global variables static in
   test_objpool.c.

 - kprobes: Unify kprobes_exceptions_nofify() prototypes. There are the
   same prototypes in asm/kprobes.h for some architectures, but some of
   them are missing the prototype and it causes a warning. So move the
   prototype into linux/kprobes.h.

 - tracing: Fix to check the tracepoint event and return event at
   parsing stage. The tracepoint event doesn't support %return but if
   $retval exists, it will be converted to %return silently. This finds
   that case and rejects it.

 - tracing: Fix the order of the descriptions about the parameters of
   __kprobe_event_gen_cmd_start() to be consistent with the argument
   list of the function.

* tag 'probes-fixes-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/kprobes: Fix the order of argument descriptions
  tracing: fprobe-event: Fix to check tracepoint event and return
  kprobes: unify kprobes_exceptions_nofify() prototypes
  lib: test_objpool: make global variables static
  Documentation: tracing: Add a note about argument and retval access
2023-11-10 16:35:04 -08:00
Yujie Liu
f032c53bea tracing/kprobes: Fix the order of argument descriptions
The order of descriptions should be consistent with the argument list of
the function, so "kretprobe" should be the second one.

int __kprobe_event_gen_cmd_start(struct dynevent_cmd *cmd, bool kretprobe,
                                 const char *name, const char *loc, ...)

Link: https://lore.kernel.org/all/20231031041305.3363712-1-yujie.liu@intel.com/

Fixes: 2a588dd1d5 ("tracing: Add kprobe event command generation functions")
Suggested-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-11-11 08:00:43 +09:00
Linus Torvalds
391ce5b9c4 Merge tag 'dma-mapping-6.7-2023-11-10' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:

 - don't leave pages decrypted for DMA in encrypted memory setups linger
   around on failure (Petr Tesarik)

 - fix an out of bounds access in the new dynamic swiotlb code (Petr
   Tesarik)

 - fix dma_addressing_limited for systems with weird physical memory
   layouts (Jia He)

* tag 'dma-mapping-6.7-2023-11-10' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: fix out-of-bounds TLB allocations with CONFIG_SWIOTLB_DYNAMIC
  dma-mapping: fix dma_addressing_limited() if dma_range_map can't cover all system RAM
  dma-mapping: move dma_addressing_limited() out of line
  swiotlb: do not free decrypted pages if dynamic
2023-11-10 11:09:07 -08:00
Masami Hiramatsu (Google)
ce51e6153f tracing: fprobe-event: Fix to check tracepoint event and return
Fix to check the tracepoint event is not valid with $retval.
The commit 08c9306fc2 ("tracing/fprobe-event: Assume fprobe is
a return event by $retval") introduced automatic return probe
conversion with $retval. But since tracepoint event does not
support return probe, $retval is not acceptable.

Without this fix, ftracetest, tprobe_syntax_errors.tc fails;

[22] Tracepoint probe event parser error log check      [FAIL]
 ----
 # tail 22-tprobe_syntax_errors.tc-log.mRKroL
 + ftrace_errlog_check trace_fprobe t kfree ^$retval dynamic_events
 + printf %s t kfree
 + wc -c
 + pos=8
 + printf %s t kfree ^$retval
 + tr -d ^
 + command=t kfree $retval
 + echo Test command: t kfree $retval
 Test command: t kfree $retval
 + echo
 ----

So 't kfree $retval' should fail (tracepoint doesn't support
return probe) but passed it.

Link: https://lore.kernel.org/all/169944555933.45057.12831706585287704173.stgit@devnote2/

Fixes: 08c9306fc2 ("tracing/fprobe-event: Assume fprobe is a return event by $retval")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-11-10 20:06:12 +09:00
Linus Torvalds
89cdf9d556 Merge tag 'net-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter and bpf.

  Current release - regressions:

   - sched: fix SKB_NOT_DROPPED_YET splat under debug config

  Current release - new code bugs:

   - tcp:
       - fix usec timestamps with TCP fastopen
       - fix possible out-of-bounds reads in tcp_hash_fail()
       - fix SYN option room calculation for TCP-AO

   - tcp_sigpool: fix some off by one bugs

   - bpf: fix compilation error without CGROUPS

   - ptp:
       - ptp_read() should not release queue
       - fix tsevqs corruption

  Previous releases - regressions:

   - llc: verify mac len before reading mac header

  Previous releases - always broken:

   - bpf:
       - fix check_stack_write_fixed_off() to correctly spill imm
       - fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_END
       - check map->usercnt after timer->timer is assigned

   - dsa: lan9303: consequently nested-lock physical MDIO

   - dccp/tcp: call security_inet_conn_request() after setting IP addr

   - tg3: fix the TX ring stall due to incorrect full ring handling

   - phylink: initialize carrier state at creation

   - ice: fix direction of VF rules in switchdev mode

  Misc:

   - fill in a bunch of missing MODULE_DESCRIPTION()s, more to come"

* tag 'net-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
  net: ti: icss-iep: fix setting counter value
  ptp: fix corrupted list in ptp_open
  ptp: ptp_read should not release queue
  net_sched: sch_fq: better validate TCA_FQ_WEIGHTS and TCA_FQ_PRIOMAP
  net: kcm: fill in MODULE_DESCRIPTION()
  net/sched: act_ct: Always fill offloading tuple iifidx
  netfilter: nat: fix ipv6 nat redirect with mapped and scoped addresses
  netfilter: xt_recent: fix (increase) ipv6 literal buffer length
  ipvs: add missing module descriptions
  netfilter: nf_tables: remove catchall element in GC sync path
  netfilter: add missing module descriptions
  drivers/net/ppp: use standard array-copy-function
  net: enetc: shorten enetc_setup_xdp_prog() error message to fit NETLINK_MAX_FMTMSG_LEN
  virtio/vsock: Fix uninit-value in virtio_transport_recv_pkt()
  r8169: respect userspace disabling IFF_MULTICAST
  selftests/bpf: get trusted cgrp from bpf_iter__cgroup directly
  bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg
  net: phylink: initialize carrier state at creation
  test/vsock: add dobule bind connect test
  test/vsock: refactor vsock_accept
  ...
2023-11-09 17:09:35 -08:00
Linus Torvalds
90450a0616 Merge tag 'rcu-fixes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull RCU fixes from Frederic Weisbecker:

 - Fix a lock inversion between scheduler and RCU introduced in
   v6.2-rc4. The scenario could trigger on any user of RCU_NOCB
   (mostly Android but also nohz_full)

 - Fix PF_IDLE semantic changes introduced in v6.6-rc3 breaking
   some RCU-Tasks and RCU-Tasks-Trace expectations as to what
   exactly is an idle task. This resulted in potential spurious
   stalls and warnings.

* tag 'rcu-fixes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
  rcu/tasks-trace: Handle new PF_IDLE semantics
  rcu/tasks: Handle new PF_IDLE semantics
  rcu: Introduce rcu_cpu_online()
  rcu: Break rcu_node_0 --> &rq->__lock order
2023-11-08 09:47:52 -08:00
Linus Torvalds
c1ef4df14e Merge tag 'kgdb-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb updates from Daniel Thompson:
 "Just two patches for you this time!

   - During a panic, flush the console before entering kgdb.

     This makes things a little easier to comprehend, especially if an
     NMI backtrace was triggered on all CPUs just before we enter the
     panic routines

   - Correcting a couple of misleading (a.k.a. plain wrong) comments"

* tag 'kgdb-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kdb: Corrects comment for kdballocenv
  kgdb: Flush console before entering kgdb on panic
2023-11-08 09:28:38 -08:00
Petr Tesarik
53c87e846e swiotlb: fix out-of-bounds TLB allocations with CONFIG_SWIOTLB_DYNAMIC
Limit the free list length to the size of the IO TLB. Transient pool can be
smaller than IO_TLB_SEGSIZE, but the free list is initialized with the
assumption that the total number of slots is a multiple of IO_TLB_SEGSIZE.
As a result, swiotlb_area_find_slots() may allocate slots past the end of
a transient IO TLB buffer.

Reported-by: Niklas Schnelle <schnelle@linux.ibm.com>
Closes: https://lore.kernel.org/linux-iommu/104a8c8fedffd1ff8a2890983e2ec1c26bff6810.camel@linux.ibm.com/
Fixes: 79636caad3 ("swiotlb: if swiotlb is full, fall back to a transient memory pool")
Cc: stable@vger.kernel.org
Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-11-08 16:27:05 +01:00
Chuyi Zhou
0de4f50de2 bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg
BTF_TYPE_SAFE_TRUSTED(struct bpf_iter__task) in verifier.c wanted to
teach BPF verifier that bpf_iter__task -> task is a trusted ptr. But it
doesn't work well.

The reason is, bpf_iter__task -> task would go through btf_ctx_access()
which enforces the reg_type of 'task' is ctx_arg_info->reg_type, and in
task_iter.c, we actually explicitly declare that the
ctx_arg_info->reg_type is PTR_TO_BTF_ID_OR_NULL.

Actually we have a previous case like this[1] where PTR_TRUSTED is added to
the arg flag for map_iter.

This patch sets ctx_arg_info->reg_type is PTR_TO_BTF_ID_OR_NULL |
PTR_TRUSTED in task_reg_info.

Similarly, bpf_cgroup_reg_info -> cgroup is also PTR_TRUSTED since we are
under the protection of cgroup_mutex and we would check cgroup_is_dead()
in __cgroup_iter_seq_show().

This patch is to improve the user experience of the newly introduced
bpf_iter_css_task kfunc before hitting the mainline. The Fixes tag is
pointing to the commit introduced the bpf_iter_css_task kfunc.

Link[1]:https://lore.kernel.org/all/20230706133932.45883-3-aspsk@isovalent.com/

Fixes: 9c66dc94b6 ("bpf: Introduce css_task open-coded iterator kfuncs")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231107132204.912120-2-zhouchuyi@bytedance.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-07 15:24:25 -08:00
Yuran Pereira
23816724fd kdb: Corrects comment for kdballocenv
This patch corrects the comment for the kdballocenv function.
The previous comment incorrectly described the function's
parameters and return values.

Signed-off-by: Yuran Pereira <yuran.pereira@hotmail.com>
Link: https://lore.kernel.org/r/DB3PR10MB6835B383B596133EDECEA98AE8ABA@DB3PR10MB6835.EURPRD10.PROD.OUTLOOK.COM
[daniel.thompson@linaro.org: fixed whitespace alignment in new lines]
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2023-11-06 17:13:55 +00:00
Jia He
a409d96009 dma-mapping: fix dma_addressing_limited() if dma_range_map can't cover all system RAM
There is an unusual case that the range map covers right up to the top
of system RAM, but leaves a hole somewhere lower down. Then it prevents
the nvme device dma mapping in the checking path of phys_to_dma() and
causes the hangs at boot.

E.g. On an Armv8 Ampere server, the dsdt ACPI table is:
 Method (_DMA, 0, Serialized)  // _DMA: Direct Memory Access
            {
                Name (RBUF, ResourceTemplate ()
                {
                    QWordMemory (ResourceConsumer, PosDecode, MinFixed,
MaxFixed, Cacheable, ReadWrite,
                        0x0000000000000000, // Granularity
                        0x0000000000000000, // Range Minimum
                        0x00000000FFFFFFFF, // Range Maximum
                        0x0000000000000000, // Translation Offset
                        0x0000000100000000, // Length
                        ,, , AddressRangeMemory, TypeStatic)
                    QWordMemory (ResourceConsumer, PosDecode, MinFixed,
MaxFixed, Cacheable, ReadWrite,
                        0x0000000000000000, // Granularity
                        0x0000006010200000, // Range Minimum
                        0x000000602FFFFFFF, // Range Maximum
                        0x0000000000000000, // Translation Offset
                        0x000000001FE00000, // Length
                        ,, , AddressRangeMemory, TypeStatic)
                    QWordMemory (ResourceConsumer, PosDecode, MinFixed,
MaxFixed, Cacheable, ReadWrite,
                        0x0000000000000000, // Granularity
                        0x00000060F0000000, // Range Minimum
                        0x00000060FFFFFFFF, // Range Maximum
                        0x0000000000000000, // Translation Offset
                        0x0000000010000000, // Length
                        ,, , AddressRangeMemory, TypeStatic)
                    QWordMemory (ResourceConsumer, PosDecode, MinFixed,
MaxFixed, Cacheable, ReadWrite,
                        0x0000000000000000, // Granularity
                        0x0000007000000000, // Range Minimum
                        0x000003FFFFFFFFFF, // Range Maximum
                        0x0000000000000000, // Translation Offset
                        0x0000039000000000, // Length
                        ,, , AddressRangeMemory, TypeStatic)
                })

But the System RAM ranges are:
cat /proc/iomem |grep -i ram
90000000-91ffffff : System RAM
92900000-fffbffff : System RAM
880000000-fffffffff : System RAM
8800000000-bff5990fff : System RAM
bff59d0000-bff5a4ffff : System RAM
bff8000000-bfffffffff : System RAM
So some RAM ranges are out of dma_range_map.

Fix it by checking whether each of the system RAM resources can be
properly encompassed within the dma_range_map.

Signed-off-by: Jia He <justin.he@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-11-06 08:38:16 +01:00