Pull scheduler fixes from Ingo Molnar:
"Misc fixes all over the place:
- Fix NUMA over-balancing between lightly loaded nodes. This is
fallout of the big load-balancer rewrite.
- Fix the NOHZ remote loadavg update logic, which fixes anomalies
like reported 150 loadavg on mostly idle CPUs.
- Fix XFS performance/scalability
- Fix throttled groups unbound task-execution bug
- Fix PSI procfs boundary condition
- Fix the cpu.uclamp.{min,max} cgroup configuration write checks
- Fix DocBook annotations
- Fix RCU annotations
- Fix overly CPU-intensive housekeeper CPU logic loop on large CPU
counts"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix kernel-doc warning in attach_entity_load_avg()
sched/core: Annotate curr pointer in rq with __rcu
sched/psi: Fix OOB write when writing 0 bytes to PSI files
sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression
sched/fair: Prevent unlimited runtime on throttled group
sched/nohz: Optimize get_nohz_timer_target()
sched/uclamp: Reject negative values in cpu_uclamp_write()
sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains
timers/nohz: Update NOHZ load in remote tick
sched/core: Don't skip remote tick for idle CPUs
The following XFS commit:
8ab39f11d9 ("xfs: prevent CIL push holdoff in log recovery")
changed the logic from using bound workqueues to using unbound
workqueues. Functionally this makes sense but it was observed at the
time that the dbench performance dropped quite a lot and CPU migrations
were increased.
The current pattern of the task migration is straight-forward. With XFS,
an IO issuer delegates work to xlog_cil_push_work ()on an unbound kworker.
This runs on a nearby CPU and on completion, dbench wakes up on its old CPU
as it is still idle and no migration occurs. dbench then queues the real
IO on the blk_mq_requeue_work() work item which runs on a bound kworker
which is forced to run on the same CPU as dbench. When IO completes,
the bound kworker wakes dbench but as the kworker is a bound but,
real task, the CPU is not considered idle and dbench gets migrated by
select_idle_sibling() to a new CPU. dbench may ping-pong between two CPUs
for a while but ultimately it starts a round-robin of all CPUs sharing
the same LLC. High-frequency migration on each IO completion has poor
performance overall. It has negative implications both in commication
costs and power management. mpstat confirmed that at low thread counts
that all CPUs sharing an LLC has low level of activity.
Note that even if the CIL patch was reverted, there still would
be migrations but the impact is less noticeable. It turns out that
individually the scheduler, XFS, blk-mq and workqueues all made sensible
decisions but in combination, the overall effect was sub-optimal.
This patch special cases the IO issue/completion pattern and allows
a bound kworker waker and a task wakee to stack on the same CPU if
there is a strong chance they are directly related. The expectation
is that the kworker is likely going back to sleep shortly. This is not
guaranteed as the IO could be queued asynchronously but there is a very
strong relationship between the task and kworker in this case that would
justify stacking on the same CPU instead of migrating. There should be
few concerns about kworker starvation given that the special casing is
only when the kworker is the waker.
DBench on XFS
MMTests config: io-dbench4-async modified to run on a fresh XFS filesystem
UMA machine with 8 cores sharing LLC
5.5.0-rc7 5.5.0-rc7
tipsched-20200124 kworkerstack
Amean 1 22.63 ( 0.00%) 20.54 * 9.23%*
Amean 2 25.56 ( 0.00%) 23.40 * 8.44%*
Amean 4 28.63 ( 0.00%) 27.85 * 2.70%*
Amean 8 37.66 ( 0.00%) 37.68 ( -0.05%)
Amean 64 469.47 ( 0.00%) 468.26 ( 0.26%)
Stddev 1 1.00 ( 0.00%) 0.72 ( 28.12%)
Stddev 2 1.62 ( 0.00%) 1.97 ( -21.54%)
Stddev 4 2.53 ( 0.00%) 3.58 ( -41.19%)
Stddev 8 5.30 ( 0.00%) 5.20 ( 1.92%)
Stddev 64 86.36 ( 0.00%) 94.53 ( -9.46%)
NUMA machine, 48 CPUs total, 24 CPUs share cache
5.5.0-rc7 5.5.0-rc7
tipsched-20200124 kworkerstack-v1r2
Amean 1 58.69 ( 0.00%) 30.21 * 48.53%*
Amean 2 60.90 ( 0.00%) 35.29 * 42.05%*
Amean 4 66.77 ( 0.00%) 46.55 * 30.28%*
Amean 8 81.41 ( 0.00%) 68.46 * 15.91%*
Amean 16 113.29 ( 0.00%) 107.79 * 4.85%*
Amean 32 199.10 ( 0.00%) 198.22 * 0.44%*
Amean 64 478.99 ( 0.00%) 477.06 * 0.40%*
Amean 128 1345.26 ( 0.00%) 1372.64 * -2.04%*
Stddev 1 2.64 ( 0.00%) 4.17 ( -58.08%)
Stddev 2 4.35 ( 0.00%) 5.38 ( -23.73%)
Stddev 4 6.77 ( 0.00%) 6.56 ( 3.00%)
Stddev 8 11.61 ( 0.00%) 10.91 ( 6.04%)
Stddev 16 18.63 ( 0.00%) 19.19 ( -3.01%)
Stddev 32 38.71 ( 0.00%) 38.30 ( 1.06%)
Stddev 64 100.28 ( 0.00%) 91.24 ( 9.02%)
Stddev 128 186.87 ( 0.00%) 160.34 ( 14.20%)
Dbench has been modified to report the time to complete a single "load
file". This is a more meaningful metric for dbench that a throughput
metric as the benchmark makes many different system calls that are not
throughput-related
Patch shows a 9.23% and 48.53% reduction in the time to process a load
file with the difference partially explained by the number of CPUs sharing
a LLC. In a separate run, task migrations were almost eliminated by the
patch for low client counts. In case people have issue with the metric
used for the benchmark, this is a comparison of the throughputs as
reported by dbench on the NUMA machine.
dbench4 Throughput (misleading but traditional)
5.5.0-rc7 5.5.0-rc7
tipsched-20200124 kworkerstack-v1r2
Hmean 1 321.41 ( 0.00%) 617.82 * 92.22%*
Hmean 2 622.87 ( 0.00%) 1066.80 * 71.27%*
Hmean 4 1134.56 ( 0.00%) 1623.74 * 43.12%*
Hmean 8 1869.96 ( 0.00%) 2212.67 * 18.33%*
Hmean 16 2673.11 ( 0.00%) 2806.13 * 4.98%*
Hmean 32 3032.74 ( 0.00%) 3039.54 ( 0.22%)
Hmean 64 2514.25 ( 0.00%) 2498.96 * -0.61%*
Hmean 128 1778.49 ( 0.00%) 1746.05 * -1.82%*
Note that this is somewhat specific to XFS and ext4 shows no performance
difference as it does not rely on kworkers in the same way. No major
problem was observed running other workloads on different machines although
not all tests have completed yet.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200128154006.GD3466@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a running task is moved on a throttled task group and there is no
other task enqueued on the CPU, the task can keep running using 100% CPU
whatever the allocated bandwidth for the group and although its cfs rq is
throttled. Furthermore, the group entity of the cfs_rq and its parents are
not enqueued but only set as curr on their respective cfs_rqs.
We have the following sequence:
sched_move_task
-dequeue_task: dequeue task and group_entities.
-put_prev_task: put task and group entities.
-sched_change_group: move task to new group.
-enqueue_task: enqueue only task but not group entities because cfs_rq is
throttled.
-set_next_task : set task and group_entities as current sched_entity of
their cfs_rq.
Another impact is that the root cfs_rq runnable_load_avg at root rq stays
null because the group_entities are not enqueued. This situation will stay
the same until an "external" event triggers a reschedule. Let trigger it
immediately instead.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/1579011236-31256-1-git-send-email-vincent.guittot@linaro.org
On a machine, CPU 0 is used for housekeeping, the other 39 CPUs in the
same socket are in nohz_full mode. We can observe huge time burn in the
loop for seaching nearest busy housekeeper cpu by ftrace.
2) | get_nohz_timer_target() {
2) 0.240 us | housekeeping_test_cpu();
2) 0.458 us | housekeeping_test_cpu();
...
2) 0.292 us | housekeeping_test_cpu();
2) 0.240 us | housekeeping_test_cpu();
2) 0.227 us | housekeeping_any_cpu();
2) + 43.460 us | }
This patch optimizes the searching logic by finding a nearest housekeeper
CPU in the housekeeping cpumask, it can minimize the worst searching time
from ~44us to < 10us in my testing. In addition, the last iterated busy
housekeeper can become a random candidate while current CPU is a better
fallback if it is a housekeeper.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/1578876627-11938-1-git-send-email-wanpengli@tencent.com
The check to ensure that the new written value into cpu.uclamp.{min,max}
is within range, [0:100], wasn't working because of the signed
comparison
7301 if (req.percent > UCLAMP_PERCENT_SCALE) {
7302 req.ret = -ERANGE;
7303 return req;
7304 }
# echo -1 > cpu.uclamp.min
# cat cpu.uclamp.min
42949671.96
Cast req.percent into u64 to force the comparison to be unsigned and
work as intended in capacity_from_percent().
# echo -1 > cpu.uclamp.min
sh: write error: Numerical result out of range
Fixes: 2480c09313 ("sched/uclamp: Extend CPU's cgroup controller")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200114210947.14083-1-qais.yousef@arm.com
The CPU load balancer balances between different domains to spread load
and strives to have equal balance everywhere. Communicating tasks can
migrate so they are topologically close to each other but these decisions
are independent. On a lightly loaded NUMA machine, two communicating tasks
pulled together at wakeup time can be pushed apart by the load balancer.
In isolation, the load balancer decision is fine but it ignores the tasks
data locality and the wakeup/LB paths continually conflict. NUMA balancing
is also a factor but it also simply conflicts with the load balancer.
This patch allows a fixed degree of imbalance of two tasks to exist
between NUMA domains regardless of utilisation levels. In many cases,
this prevents communicating tasks being pulled apart. It was evaluated
whether the imbalance should be scaled to the domain size. However, no
additional benefit was measured across a range of workloads and machines
and scaling adds the risk that lower domains have to be rebalanced. While
this could change again in the future, such a change should specify the
use case and benefit.
The most obvious impact is on netperf TCP_STREAM -- two simple
communicating tasks with some softirq offload depending on the
transmission rate.
2-socket Haswell machine 48 core, HT enabled
netperf-tcp -- mmtests config config-network-netperf-unbound
baseline lbnuma-v3
Hmean 64 568.73 ( 0.00%) 577.56 * 1.55%*
Hmean 128 1089.98 ( 0.00%) 1128.06 * 3.49%*
Hmean 256 2061.72 ( 0.00%) 2104.39 * 2.07%*
Hmean 1024 7254.27 ( 0.00%) 7557.52 * 4.18%*
Hmean 2048 11729.20 ( 0.00%) 13350.67 * 13.82%*
Hmean 3312 15309.08 ( 0.00%) 18058.95 * 17.96%*
Hmean 4096 17338.75 ( 0.00%) 20483.66 * 18.14%*
Hmean 8192 25047.12 ( 0.00%) 27806.84 * 11.02%*
Hmean 16384 27359.55 ( 0.00%) 33071.88 * 20.88%*
Stddev 64 2.16 ( 0.00%) 2.02 ( 6.53%)
Stddev 128 2.31 ( 0.00%) 2.19 ( 5.05%)
Stddev 256 11.88 ( 0.00%) 3.22 ( 72.88%)
Stddev 1024 23.68 ( 0.00%) 7.24 ( 69.43%)
Stddev 2048 79.46 ( 0.00%) 71.49 ( 10.03%)
Stddev 3312 26.71 ( 0.00%) 57.80 (-116.41%)
Stddev 4096 185.57 ( 0.00%) 96.15 ( 48.19%)
Stddev 8192 245.80 ( 0.00%) 100.73 ( 59.02%)
Stddev 16384 207.31 ( 0.00%) 141.65 ( 31.67%)
In this case, there was a sizable improvement to performance and
a general reduction in variance. However, this is not univeral.
For most machines, the impact was roughly a 3% performance gain.
Ops NUMA base-page range updates 19796.00 292.00
Ops NUMA PTE updates 19796.00 292.00
Ops NUMA PMD updates 0.00 0.00
Ops NUMA hint faults 16113.00 143.00
Ops NUMA hint local faults % 8407.00 142.00
Ops NUMA hint local percent 52.18 99.30
Ops NUMA pages migrated 4244.00 1.00
Without the patch, only 52.18% of sampled accesses are local. In an
earlier changelog, 100% of sampled accesses are local and indeed on
most machines, this was still the case. In this specific case, the
local sampled rates was 99.3% but note the "base-page range updates"
and "PTE updates". The activity with the patch is negligible as were
the number of faults. The small number of pages migrated were related to
shared libraries. A 2-socket Broadwell showed better results on average
but are not presented for brevity as the performance was similar except
it showed 100% of the sampled NUMA hints were local. The patch holds up
for a 4-socket Haswell, an AMD EPYC and AMD Epyc 2 machine.
For dbench, the impact depends on the filesystem used and the number of
clients. On XFS, there is little difference as the clients typically
communicate with workqueues which have a separate class of scheduler
problem at the moment. For ext4, performance is generally better,
particularly for small numbers of clients as NUMA balancing activity is
negligible with the patch applied.
A more interesting example is the Facebook schbench which uses a
number of messaging threads to communicate with worker threads. In this
configuration, one messaging thread is used per NUMA node and the number of
worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
for response latency is then reported.
Lat 50.00th-qrtle-1 44.00 ( 0.00%) 37.00 ( 15.91%)
Lat 75.00th-qrtle-1 53.00 ( 0.00%) 41.00 ( 22.64%)
Lat 90.00th-qrtle-1 57.00 ( 0.00%) 42.00 ( 26.32%)
Lat 95.00th-qrtle-1 63.00 ( 0.00%) 43.00 ( 31.75%)
Lat 99.00th-qrtle-1 76.00 ( 0.00%) 51.00 ( 32.89%)
Lat 99.50th-qrtle-1 89.00 ( 0.00%) 52.00 ( 41.57%)
Lat 99.90th-qrtle-1 98.00 ( 0.00%) 55.00 ( 43.88%)
Lat 50.00th-qrtle-2 42.00 ( 0.00%) 42.00 ( 0.00%)
Lat 75.00th-qrtle-2 48.00 ( 0.00%) 47.00 ( 2.08%)
Lat 90.00th-qrtle-2 53.00 ( 0.00%) 52.00 ( 1.89%)
Lat 95.00th-qrtle-2 55.00 ( 0.00%) 53.00 ( 3.64%)
Lat 99.00th-qrtle-2 62.00 ( 0.00%) 60.00 ( 3.23%)
Lat 99.50th-qrtle-2 63.00 ( 0.00%) 63.00 ( 0.00%)
Lat 99.90th-qrtle-2 68.00 ( 0.00%) 66.00 ( 2.94%
For higher worker threads, the differences become negligible but it's
interesting to note the difference in wakeup latency at low utilisation
and mpstat confirms that activity was almost all on one node until
the number of worker threads increase.
Hackbench generally showed neutral results across a range of machines.
This is different to earlier versions of the patch which allowed imbalances
for higher degrees of utilisation. perf bench pipe showed negligible
differences in overall performance as the differences are very close to
the noise.
An earlier prototype of the patch showed major regressions for NAS C-class
when running with only half of the available CPUs -- 20-30% performance
hits were measured at the time. With this version of the patch, the impact
is negligible with small gains/losses within the noise measured. This is
because the number of threads far exceeds the small imbalance the aptch
cares about. Similarly, there were report of regressions for the autonuma
benchmark against earlier versions but again, normal load balancing now
applies for that workload.
In general, the patch simply seeks to avoid unnecessary cross-node
migrations in the basic case where imbalances are very small. For low
utilisation communicating workloads, this patch generally behaves better
with less NUMA balancing activity. For high utilisation, there is no
change in behaviour.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Phil Auld <pauld@redhat.com>
Tested-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200114101319.GO3466@techsingularity.net
The way loadavg is tracked during nohz only pays attention to the load
upon entering nohz. This can be particularly noticeable if full nohz is
entered while non-idle, and then the cpu goes idle and stays that way for
a long time.
Use the remote tick to ensure that full nohz cpus report their deltas
within a reasonable time.
[ swood: Added changelog and removed recheck of stopped tick. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/1578736419-14628-3-git-send-email-swood@redhat.com
Pull scheduler updates from Ingo Molnar:
"These were the main changes in this cycle:
- More -rt motivated separation of CONFIG_PREEMPT and
CONFIG_PREEMPTION.
- Add more low level scheduling topology sanity checks and warnings
to filter out nonsensical topologies that break scheduling.
- Extend uclamp constraints to influence wakeup CPU placement
- Make the RT scheduler more aware of asymmetric topologies and CPU
capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y
- Make idle CPU selection more consistent
- Various fixes, smaller cleanups, updates and enhancements - please
see the git log for details"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
sched/fair: Define sched_idle_cpu() only for SMP configurations
sched/topology: Assert non-NUMA topology masks don't (partially) overlap
idle: fix spelling mistake "iterrupts" -> "interrupts"
sched/fair: Remove redundant call to cpufreq_update_util()
sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
sched/fair: calculate delta runnable load only when it's needed
sched/cputime: move rq parameter in irqtime_account_process_tick
stop_machine: Make stop_cpus() static
sched/debug: Reset watchdog on all CPUs while processing sysrq-t
sched/core: Fix size of rq::uclamp initialization
sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
sched/fair: Load balance aggressively for SCHED_IDLE CPUs
sched/fair : Improve update_sd_pick_busiest for spare capacity case
watchdog: Remove soft_lockup_hrtimer_cnt and related code
sched/rt: Make RT capacity-aware
sched/fair: Make EAS wakeup placement consider uclamp restrictions
sched/fair: Make task_fits_capacity() consider uclamp restrictions
sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
sched/uclamp: Make uclamp util helpers use and return UL values
...
The affinity of managed interrupts is completely handled in the kernel and
cannot be changed via the /proc/irq/* interfaces from user space. As the
kernel tries to spread out interrupts evenly accross CPUs on x86 to prevent
vector exhaustion, it can happen that a managed interrupt whose affinity
mask contains both isolated and housekeeping CPUs is routed to an isolated
CPU. As a consequence IO submitted on a housekeeping CPU causes interrupts
on the isolated CPU.
Add a new sub-parameter 'managed_irq' for 'isolcpus' and the corresponding
logic in the interrupt affinity selection code.
The subparameter indicates to the interrupt affinity selection logic that
it should try to avoid the above scenario.
This isolation is best effort and only effective if the automatically
assigned interrupt mask of a device queue contains isolated and
housekeeping CPUs. If housekeeping CPUs are online then such interrupts are
directed to the housekeeping CPU so that IO submitted on the housekeeping
CPU cannot disturb the isolated CPU.
If a queue's affinity mask contains only isolated CPUs then this parameter
has no effect on the interrupt routing decision, though interrupts are only
happening when tasks running on those isolated CPUs submit IO. IO submitted
on housekeeping CPUs has no influence on those queues.
If the affinity mask contains both housekeeping and isolated CPUs, but none
of the contained housekeeping CPUs is online, then the interrupt is also
routed to an isolated CPU. Interrupts are only delivered when one of the
isolated CPUs in the affinity mask submits IO. If one of the contained
housekeeping CPUs comes online, the CPU hotplug logic migrates the
interrupt automatically back to the upcoming housekeeping CPU. Depending on
the type of interrupt controller, this can require that at least one
interrupt is delivered to the isolated CPU in order to complete the
migration.
[ tglx: Removed unused parameter, added and edited comments/documentation
and rephrased the changelog so it contains more details. ]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200120091625.17912-1-ming.lei@redhat.com
topology.c::get_group() relies on the assumption that non-NUMA domains do
not partially overlap. Zeng Tao pointed out in [1] that such topology
descriptions, while completely bogus, can end up being exposed to the
scheduler.
In his example (8 CPUs, 2-node system), we end up with:
MC span for CPU3 == 3-7
MC span for CPU4 == 4-7
The first pass through get_group(3, sdd@MC) will result in the following
sched_group list:
3 -> 4 -> 5 -> 6 -> 7
^ /
`----------------'
And a later pass through get_group(4, sdd@MC) will "corrupt" that to:
3 -> 4 -> 5 -> 6 -> 7
^ /
`-----------'
which will completely break things like 'while (sg != sd->groups)' when
using CPU3's base sched_domain.
There already are some architecture-specific checks in place such as
x86/kernel/smpboot.c::topology.sane(), but this is something we can detect
in the core scheduler, so it seems worthwhile to do so.
Warn and abort the construction of the sched domains if such a broken
topology description is detected. Note that this is somewhat
expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be
gated under SCHED_DEBUG if deemed necessary.
Testing
=======
Dietmar managed to reproduce this using the following qemu incantation:
$ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \
-append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp \
cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa \
node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1
alongside the following drivers/base/arch_topology.c hack (AIUI wouldn't be
needed if '-smp cores=X, sockets=Y' would work with qemu):
8<---
@@ -465,6 +465,9 @@ void update_siblings_masks(unsigned int cpuid)
if (cpuid_topo->package_id != cpu_topo->package_id)
continue;
+ if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4))
+ continue;
+
cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
8<---
[1]: https://lkml.kernel.org/r/1577088979-8545-1-git-send-email-prime.zeng@hisilicon.com
Reported-by: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200115160915.22575-1-valentin.schneider@arm.com
With commit
bef69dd878 ("sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()")
update_load_avg() has become the central point for calling cpufreq
(not including the update of blocked load). This change helps to
simplify further the number of calls to cpufreq_update_util() and to
remove last redundant ones. With update_load_avg(), we are now sure
that cpufreq_update_util() will be called after every task attachment
to a cfs_rq and especially after propagating this event down to the
util_avg of the root cfs_rq, which is the level that is used by
cpufreq governors like schedutil to set the frequency of a CPU.
The SCHED_CPUFREQ_MIGRATION flag forces an early call to cpufreq when
the migration happens in a cgroup whereas util_avg of root cfs_rq is
not yet updated and this call is duplicated with the one that happens
immediately after when the migration event reaches the root cfs_rq.
The dedicated flag SCHED_CPUFREQ_MIGRATION is now useless and can be
removed. The interface of attach_entity_load_avg() can also be
simplified accordingly.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/1579083620-24943-1-git-send-email-vincent.guittot@linaro.org
when CONFIG_PSI_DEFAULT_DISABLED set to N or the command line set psi=0,
I think we should not create /proc/pressure and
/proc/pressure/{io|memory|cpu}.
In the future, user maybe determine whether the psi feature is enabled by
checking the existence of the /proc/pressure dir or
/proc/pressure/{io|memory|cpu} files.
Signed-off-by: Wang Long <w@laoqinren.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1576672698-32504-1-git-send-email-w@laoqinren.net
commit bf475ce0a3 ("sched/fair: Add per-CPU min capacity to
sched_group_capacity") introduced per-cpu min_capacity.
commit e3d6d0cb66 ("sched/fair: Add sched_group per-CPU max capacity")
introduced per-cpu max_capacity.
In the SD_OVERLAP case, the local variable 'capacity' represents the sum
of CPU capacity of all CPUs in the first sched group (sg) of the sched
domain (sd).
It is erroneously used to calculate sg's min and max CPU capacity.
To fix this use capacity_of(cpu) instead of 'capacity'.
The code which achieves this via cpu_rq(cpu)->sd->groups->sgc->capacity
(for rq->sd != NULL) can be removed since it delivers the same value as
capacity_of(cpu) which is currently only used for the (!rq->sd) case
(see update_cpu_capacity()).
An sg of the lowest sd (rq->sd or sd->child == NULL) represents a single
CPU (and hence sg->sgc->capacity == capacity_of(cpu)).
Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20200104130828.GA7718@iZj6chx1xj0e0buvshuecpZ
Every time we call irqtime_account_process_tick() is in a interrupt,
Every caller will get and assign a parameter rq = this_rq(), This is
unnecessary and increase the code size a little bit. Move the rq getting
action to irqtime_account_process_tick internally is better.
base with this patch
cputime.o 578792 bytes 577888 bytes
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1577959674-255537-1-git-send-email-alex.shi@linux.alibaba.com
Lengthy output of sysrq-t may take a lot of time on slow serial console
with lots of processes and CPUs.
So we need to reset NMI-watchdog to avoid spurious lockup messages, and
we also reset softlockup watchdogs on all other CPUs since another CPU
might be blocked waiting for us to process an IPI or stop_machine.
Add to sysrq_sched_debug_show() as what we did in show_state_filter().
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20191226085224.48942-1-liwei391@huawei.com
When a new cgroup is created, the effective uclamp value wasn't updated
with a call to cpu_util_update_eff() that looks at the hierarchy and
update to the most restrictive values.
Fix it by ensuring to call cpu_util_update_eff() when a new cgroup
becomes online.
Without this change, the newly created cgroup uses the default
root_task_group uclamp values, which is 1024 for both uclamp_{min, max},
which will cause the rq to to be clamped to max, hence cause the
system to run at max frequency.
The problem was observed on Ubuntu server and was reproduced on Debian
and Buildroot rootfs.
By default, Ubuntu and Debian create a cpu controller cgroup hierarchy
and add all tasks to it - which creates enough noise to keep the rq
uclamp value at max most of the time. Imitating this behavior makes the
problem visible in Buildroot too which otherwise looks fine since it's a
minimal userspace.
Fixes: 0b60ba2dd3 ("sched/uclamp: Propagate parent clamps")
Reported-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Doug Smythies <dsmythies@telus.net>
Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/