Now BPF only supports bpf_list_push_{front,back}_impl kfunc, not bpf_list_
push_{front,back}.
This patch fix this issue. Without this patch, if we use bpf_list kfunc
in scx, the BPF verifier would complain:
libbpf: extern (func ksym) 'bpf_list_push_back': not found in kernel or
module BTFs
libbpf: failed to load object 'scx_foo'
libbpf: failed to load BPF skeleton 'scx_foo': -EINVAL
With this patch, the bpf list kfunc will work as expected.
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Fixes: 2a52ca7c98 ("sched_ext: Add scx_simple and scx_example_qmap example schedulers")
Signed-off-by: Tejun Heo <tj@kernel.org>
While fixing migration disabled task handling, 3296682157 ("sched_ext: Fix
migration disabled handling in targeted dispatches") assumed that a
migration disabled task's ->cpus_ptr would only have the pinned CPU. While
this is eventually true for migration disabled tasks that are switched out,
->cpus_ptr update is performed by migrate_disable_switch() which is called
right before context_switch() in __scheduler(). However, the task is
enqueued earlier during pick_next_task() via put_prev_task_scx(), so there
is a race window where another CPU can see the task on a DSQ.
If the CPU tries to dispatch the migration disabled task while in that
window, task_allowed_on_cpu() will succeed and task_can_run_on_remote_rq()
will subsequently trigger SCHED_WARN(is_migration_disabled()).
WARNING: CPU: 8 PID: 1837 at kernel/sched/ext.c:2466 task_can_run_on_remote_rq+0x12e/0x140
Sched_ext: layered (enabled+all), task: runnable_at=-10ms
RIP: 0010:task_can_run_on_remote_rq+0x12e/0x140
...
<TASK>
consume_dispatch_q+0xab/0x220
scx_bpf_dsq_move_to_local+0x58/0xd0
bpf_prog_84dd17b0654b6cf0_layered_dispatch+0x290/0x1cfa
bpf__sched_ext_ops_dispatch+0x4b/0xab
balance_one+0x1fe/0x3b0
balance_scx+0x61/0x1d0
prev_balance+0x46/0xc0
__pick_next_task+0x73/0x1c0
__schedule+0x206/0x1730
schedule+0x3a/0x160
__do_sys_sched_yield+0xe/0x20
do_syscall_64+0xbb/0x1e0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fix it by converting the SCHED_WARN() back to a regular failure path. Also,
perform the migration disabled test before task_allowed_on_cpu() test so
that BPF schedulers which fail to handle migration disabled tasks can be
noticed easily.
While at it, adjust scx_ops_error() message for !task_allowed_on_cpu() case
for brevity and consistency.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3296682157 ("sched_ext: Fix migration disabled handling in targeted dispatches")
Acked-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Jake Hillion <jakehillion@meta.com>
A dispatch operation that can target a specific local DSQ -
scx_bpf_dsq_move_to_local() or scx_bpf_dsq_move() - checks whether the task
can be migrated to the target CPU using task_can_run_on_remote_rq(). If the
task can't be migrated to the targeted CPU, it is bounced through a global
DSQ.
task_can_run_on_remote_rq() assumes that the task is on a CPU that's
different from the targeted CPU but the callers doesn't uphold the
assumption and may call the function when the task is already on the target
CPU. When such task has migration disabled, task_can_run_on_remote_rq() ends
up returning %false incorrectly unnecessarily bouncing the task to a global
DSQ.
Fix it by updating the callers to only call task_can_run_on_remote_rq() when
the task is on a different CPU than the target CPU. As this is a bit subtle,
for clarity and documentation:
- Make task_can_run_on_remote_rq() trigger SCHED_WARN_ON() if the task is on
the same CPU as the target CPU.
- is_migration_disabled() test in task_can_run_on_remote_rq() cannot trigger
if the task is on a different CPU than the target CPU as the preceding
task_allowed_on_cpu() test should fail beforehand. Convert the test into
SCHED_WARN_ON().
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Fixes: 0366017e09 ("sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()")
Cc: stable@vger.kernel.org # v6.12+
Migration disabled tasks are special and pinned to their previous CPUs. They
tripped up some unsuspecting BPF schedulers as their ->nr_cpus_allowed may
not agree with the bits set in ->cpus_ptr. Make it easier for BPF schedulers
by automatically dispatching them to the pinned local DSQs by default. If a
BPF scheduler wants to handle migration disabled tasks explicitly, it can
set SCX_OPS_ENQ_MIGRATION_DISABLED.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
When (s64)(after - before) > 0, the code returns the result of
(s64)(after - before) > 0 while the intended result should be
(s64)(after - before). That happens because the middle operand of
the ternary operator was omitted incorrectly, returning the result of
(s64)(after - before) > 0. Thus, add the middle operand
-- (s64)(after - before) -- to return the correct time calculation.
Fixes: d07be814fc ("sched_ext: Add time helpers for BPF schedulers")
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
While performing the rq locking dance in dispatch_to_local_dsq(), we may
trigger the following lock imbalance condition, in particular when
multiple tasks are rapidly changing CPU affinity (i.e., running a
`stress-ng --race-sched 0`):
[ 13.413579] =====================================
[ 13.413660] WARNING: bad unlock balance detected!
[ 13.413729] 6.13.0-virtme #15 Not tainted
[ 13.413792] -------------------------------------
[ 13.413859] kworker/1:1/80 is trying to release lock (&rq->__lock) at:
[ 13.413954] [<ffffffff873c6c48>] dispatch_to_local_dsq+0x108/0x1a0
[ 13.414111] but there are no more locks to release!
[ 13.414176]
[ 13.414176] other info that might help us debug this:
[ 13.414258] 1 lock held by kworker/1:1/80:
[ 13.414318] #0: ffff8b66feb41698 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
[ 13.414612]
[ 13.414612] stack backtrace:
[ 13.415255] CPU: 1 UID: 0 PID: 80 Comm: kworker/1:1 Not tainted 6.13.0-virtme #15
[ 13.415505] Workqueue: 0x0 (events)
[ 13.415567] Sched_ext: dsp_local_on (enabled+all), task: runnable_at=-2ms
[ 13.415570] Call Trace:
[ 13.415700] <TASK>
[ 13.415744] dump_stack_lvl+0x78/0xe0
[ 13.415806] ? dispatch_to_local_dsq+0x108/0x1a0
[ 13.415884] print_unlock_imbalance_bug+0x11b/0x130
[ 13.415965] ? dispatch_to_local_dsq+0x108/0x1a0
[ 13.416226] lock_release+0x231/0x2c0
[ 13.416326] _raw_spin_unlock+0x1b/0x40
[ 13.416422] dispatch_to_local_dsq+0x108/0x1a0
[ 13.416554] flush_dispatch_buf+0x199/0x1d0
[ 13.416652] balance_one+0x194/0x370
[ 13.416751] balance_scx+0x61/0x1e0
[ 13.416848] prev_balance+0x43/0xb0
[ 13.416947] __pick_next_task+0x6b/0x1b0
[ 13.417052] __schedule+0x20d/0x1740
This happens because dispatch_to_local_dsq() is racing with
dispatch_dequeue() and, when the latter wins, we incorrectly assume that
the task has been moved to dst_rq.
Fix by properly tracking the currently locked rq.
Fixes: 4d3ca89bdd ("sched_ext: Refactor consume_remote_task()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In UP systems p->migration_disabled is not available. Fix this by using
the portable helper is_migration_disabled(p).
Fixes: e9fe182772 ("sched_ext: selftests/dsp_local_on: Fix sporadic failures")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Introduce a new helper for BPF schedulers to determine whether a task
can migrate or not (supporting both SMP and UP systems).
Fixes: e9fe182772 ("sched_ext: selftests/dsp_local_on: Fix sporadic failures")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_move_task() is called from sched_move_task() and tells the BPF scheduler
that cgroup migration is being committed. sched_move_task() is used by both
cgroup and autogroup migrations and scx_move_task() tried to filter out
autogroup migrations by testing the destination cgroup and PF_EXITING but
this is not enough. In fact, without explicitly tagging the thread which is
doing the cgroup migration, there is no good way to tell apart
scx_move_task() invocations for racing migration to the root cgroup and an
autogroup migration.
This led to scx_move_task() incorrectly ignoring a migration from non-root
cgroup to an autogroup of the root cgroup triggering the following warning:
WARNING: CPU: 7 PID: 1 at kernel/sched/ext.c:3725 scx_cgroup_can_attach+0x196/0x340
...
Call Trace:
<TASK>
cgroup_migrate_execute+0x5b1/0x700
cgroup_attach_task+0x296/0x400
__cgroup_procs_write+0x128/0x140
cgroup_procs_write+0x17/0x30
kernfs_fop_write_iter+0x141/0x1f0
vfs_write+0x31d/0x4a0
__x64_sys_write+0x72/0xf0
do_syscall_64+0x82/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix it by adding an argument to sched_move_task() that indicates whether the
moving is for a cgroup or autogroup migration. After the change,
scx_move_task() is called only for cgroup migrations and renamed to
scx_cgroup_move_task().
Link: https://github.com/sched-ext/scx/issues/370
Fixes: 8195136669 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
dsp_local_on has several incorrect assumptions, one of which is that
p->nr_cpus_allowed always tracks p->cpus_ptr. This is not true when a task
is scheduled out while migration is disabled - p->cpus_ptr is temporarily
overridden to the previous CPU while p->nr_cpus_allowed remains unchanged.
This led to sporadic test faliures when dsp_local_on_dispatch() tries to put
a migration disabled task to a different CPU. Fix it by keeping the previous
CPU when migration is disabled.
There are SCX schedulers that make use of p->nr_cpus_allowed. They should
also implement explicit handling for p->migration_disabled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ihor Solodrai <ihor.solodrai@pm.me>
Cc: Andrea Righi <arighi@nvidia.com>
Cc: Changwoo Min <changwoo@igalia.com>
Report the task weight when dumping the task state during an error exit.
Moreover, adjust the output format to display dsq_vtime, slice, and
weight on the same line.
This can help identify whether certain tasks were excessively
prioritized or de-prioritized due to large niceness gaps.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull auxdisplay updates from Andy Shevchenko:
- A couple of cleanups to img-ascii-lcd driver
* tag 'auxdisplay-v6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/andy/linux-auxdisplay:
auxdisplay: img-ascii-lcd: Constify struct img_ascii_lcd_config
auxdisplay: img-ascii-lcd: Remove an unused field in struct img_ascii_lcd_ctx
Pull sound updates from Takashi Iwai:
"This was a relatively calm cycle, and most of changes are rather small
device-specific fixes. Here are highlights:
Core:
- Further enhancements of ALSA rawmidi and sequencer APIs for MIDI
2.0
- compress-offload API extensions for ASRC support
ASoC:
- Allow clocking on each DAI in an audio graph card to be configured
separately
- Improved power management for Renesas RZ-SSI
- KUnit testing for the Cirrus DSP framework
- Memory to meory operation support for Freescale/NXP platforms
- Support for pause operations in SOF
- Support for Allwinner suinv F1C100s, Awinc AW88083, Realtek
ALC5682I-VE
HD- and USB-audio:
- Add support for Focusrite Scarlett 4th Gen 16i16, 18i16, and 18i20
interfaces via new FCP driver
- TAS2781 SPI HD-audio sub-codec support
- Various device-specific quirks as usual"
* tag 'sound-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (235 commits)
ALSA: hda: tas2781-spi: Fix bogus error handling in tas2781_hda_spi_probe()
ALSA: hda: tas2781-spi: Fix error code in tas2781_read_acpi()
ALSA: hda: tas2781-spi: Delete some dead code
ALSA: usb: fcp: Fix return code from poll ops
ALSA: usb: fcp: Fix incorrect resp->opcode retrieval
ALSA: usb: fcp: Fix meter_levels type to __le32
ALSA: hda/realtek: Enable Mute LED on HP Laptop 14s-fq1xxx
ALSA: hda: tas2781-spi: Fix -Wsometimes-uninitialized in tasdevice_spi_switch_book()
ALSA: ctxfi: Simplify dao_clear_{left,right}_input() functions
ALSA: hda: tas2781-spi: select CRC32 instead of CRC32_SARWATE
ALSA: usb: fcp: Fix hwdep read ops types
ALSA: scarlett2: Add device_setup option to use FCP driver
ALSA: FCP: Add Focusrite Control Protocol driver
ALSA: hda/tas2781: Add tas2781 hda SPI driver
ALSA: hda/realtek - Fixed headphone distorted sound on Acer Aspire A115-31 laptop
ASoC: xilinx: xlnx_spdif: Simpify using devm_clk_get_enabled()
ALSA: hda: Support for Ideapad hotkey mute LEDs
ASoC: Intel: sof_sdw: Fix DMI match for Lenovo 83JX, 83MC and 83NM
ASoC: Intel: sof_sdw: Fix DMI match for Lenovo 83LC
ASoC: dapm: add support for preparing streams
...
Pull TPM update from Jarkko Sakkinen.
* tag 'tpmdd-next-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
tpm: Change to kvalloc() in eventlog/acpi.c
Pull pmdomain updates from Ulf Hansson:
"pmdomain core:
- Add support for naming idlestates through DT
pmdomain providers:
- arm: Explicitly request the current state at init for the SCMI PM
domain
- mediatek: Add Airoha CPU PM Domain support for CPU frequency
scaling
- ti: Add per-device latency constraint management to the ti_sci PM
domain
cpuidle-psci:
- Enable system-wakeup through GENPD_FLAG_ACTIVE_WAKEUP"
* tag 'pmdomain-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
pmdomain: airoha: Fix compilation error with Clang-20 and Thumb2 mode
pmdomain: arm: scmi_pm_domain: Send an explicit request to set the current state
pmdomain: airoha: Add Airoha CPU PM Domain support
pmdomain: ti_sci: handle wake IRQs for IO daisy chain wakeups
pmdomain: ti_sci: add wakeup constraint management
pmdomain: ti_sci: add per-device latency constraint management
pmdomain: imx-gpcv2: Suppress bind attrs
pmdomain: imx8m[p]-blk-ctrl: Suppress bind attrs
pmdomain: core: Support naming idle states
dt-bindings: power: domain-idle-state: Allow idle-state-name
cpuidle: psci: Activate GENPD_FLAG_ACTIVE_WAKEUP with OSI
Pull pin control updates from Linus Walleij:
"No core changes this time
New drivers:
- New subdriver for the Qualcomm MSM8917 SoC TLMM
- New subdriver for the Mediatek MT7988 SoC
- New subdriver for the Rockchip RK3562 SoC
- New subdriver for the Renesas RZ/G3E SoC
Improvements:
- Fix some missing pins in the Qualcomm IPQ5424 TLMM
- Fix some missing LVDS pins in the Sunxi A100/A133
- Support Sunxi V853 (simple compatible string)
- Cleanups in the Samsung driver
- Fix some AMD suspend behaviour
- Cleanups"
* tag 'pinctrl-v6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (29 commits)
dt-bindings: pinctrl: sunxi: add compatible for V853
pinctrl: Use str_enable_disable-like helpers
dt-bindings: pinctrl: Correct indentation and style in DTS example
pinctrl: amd: Take suspend type into consideration which pins are non-wake
pinctrl: stm32: Add check for clk_enable()
pinctrl: renesas: rzg2l: Fix PFC_MASK for RZ/V2H and RZ/G3E
pinctrl: sunxi: add missed lvds pins for a100/a133
pinctrl: mediatek: Drop mtk_pinconf_bias_set_pd()
pinctrl: renesas: rzg2l: Add support for RZ/G3E SoC
pinctrl: renesas: rzg2l: Update r9a09g057_variable_pin_cfg table
dt-bindings: pinctrl: renesas: Document RZ/G3E SoC
dt-bindings: pinctrl: renesas: Add alpha-numerical port support for RZ/V2H
pinctrl: rockchip: add rk3562 support
dt-bindings: pinctrl: Add rk3562 pinctrl support
pinctrl: Fix the clean up on pinconf_apply_setting failure
dt-bindings: pinctrl: add binding for MT7988 SoC
pinctrl: mediatek: add MT7988 pinctrl driver
pinctrl: mediatek: add support for MTK_PULL_PD_TYPE
pinctrl: ocelot: Constify some structures
pinctrl: renesas: rzg2l: Add audio clock pins on RZ/G3S
...
Pull iommu updates from Joerg Roedel:
"Core changes:
- PASID support for the blocked_domain
ARM-SMMU Updates:
- SMMUv2:
- Implement per-client prefetcher configuration on Qualcomm SoCs
- Support for the Adreno SMMU on Qualcomm's SDM670 SOC
- SMMUv3:
- Pretty-printing of event records
- Drop the ->domain_alloc_paging implementation in favour of
domain_alloc_paging_flags(flags==0)
- IO-PGTable:
- Generalisation of the page-table walker to enable external
walkers (e.g. for debugging unexpected page-faults from the GPU)
- Minor fix for handling concatenated PGDs at stage-2 with 16KiB
pages
- Misc:
- Clean-up device probing and replace the crufty probe-deferral
hack with a more robust implementation of
arm_smmu_get_by_fwnode()
- Device-tree binding updates for a bunch of Qualcomm platforms
Intel VT-d Updates:
- Remove domain_alloc_paging()
- Remove capability audit code
- Draining PRQ in sva unbind path when FPD bit set
- Link cache tags of same iommu unit together
AMD-Vi Updates:
- Use CMPXCHG128 to update DTE
- Cleanups of the domain_alloc_paging() path
RiscV IOMMU:
- Platform MSI support
- Shutdown support
Rockchip IOMMU:
- Add DT bindings for Rockchip RK3576
More smaller fixes and cleanups"
* tag 'iommu-updates-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (66 commits)
iommu: Use str_enable_disable-like helpers
iommu/amd: Fully decode all combinations of alloc_paging_flags
iommu/amd: Move the nid to pdom_setup_pgtable()
iommu/amd: Change amd_iommu_pgtable to use enum protection_domain_mode
iommu/amd: Remove type argument from do_iommu_domain_alloc() and related
iommu/amd: Remove dev == NULL checks
iommu/amd: Remove domain_alloc()
iommu/amd: Remove unused amd_iommu_domain_update()
iommu/riscv: Fixup compile warning
iommu/arm-smmu-v3: Add missing #include of linux/string_choices.h
iommu/arm-smmu-v3: Use str_read_write helper w/ logs
iommu/io-pgtable-arm: Add way to debug pgtable walk
iommu/io-pgtable-arm: Re-use the pgtable walk for iova_to_phys
iommu/io-pgtable-arm: Make pgtable walker more generic
iommu/arm-smmu: Add ACTLR data and support for qcom_smmu_500
iommu/arm-smmu: Introduce ACTLR custom prefetcher settings
iommu/arm-smmu: Add support for PRR bit setup
iommu/arm-smmu: Refactor qcom_smmu structure to include single pointer
iommu/arm-smmu: Re-enable context caching in smmu reset operation
iommu/vt-d: Link cache tags of same iommu unit together
...
Pull x86 platform driver updates from Ilpo Järvinen:
"acer-wmi:
- Add support for PH14-51, PH16-72, and Nitro AN515-58
- Add proper hwmon support
- Improve error handling when reading "gaming system info"
- Replace direct EC reads for the current platform profile with WMI
calls to handle EC address variations
- Replace custom platform_profile cycling with the generic one
ACPI:
- platform_profile: Major refactoring and improvements
- Support registering multiple platform_profile handlers concurrently
to avoid the need to quirk which handler takes precedence
- Support reporting "custom" profile for cases where the current
profile is ambiguous or when settings tweaks are done outside the
pre-defined profile
- Abstract and layer platform_profile API better using the class_dev
and drvdata
- Various minor improvements
- Add Documentation and kerneldoc
amd/hsmp:
- Add support for HSMP protocol v7
amd/pmc:
- Support AMD 1Ah family 70h
- Support STB with Ryzen desktop SoCs
amd/pmf:
- Support Custom BIOS inputs for PMF TA
- Support passing SRA sensor data from AMD SFH (HID) to PMF TA
dell-smo8800:
- Move SMO88xx quirk away from the generic i2c-i801 driver
- Add accelerometer support for Dell Latitude E6330/E6430 and XPS
9550
- Support probing accelerometer for models yet to be listed in the
DMI mapping table because ACPI lacks i2c-address for the
accelerometer (behind a module parameter because probing might be
dangerous)
HID:
- amd_sfh: Add support for exporting SRA sensor data
hp-wmi:
- Add fan and thermal support for Victus 16-s1000
input:
- Add key for phone linking
- i8042: Add context for the i8042 filter to enable cleaning up the
filter related global variables from pdx86 drivers
lenovo-wmi-camera:
- Use SW_CAMERA_LENS_COVER instead of KEY_CAMERA_ACCESS
mellanox mlxbf-pmc:
- Add support for monitoring cycle count
- Add Documentation
thinkpad_acpi:
- Add support for phone link key
tools/power/x86/intel-speed-select:
- Fix Turbo Ratio Limit restore
x86-android-tables:
- Add support for Vexia EDU ATLA 10 Bluetooth and EC battery driver
And miscellaneous cleanups / refactoring / improvements"
* tag 'platform-drivers-x86-v6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (133 commits)
platform/x86: acer-wmi: Fix initialization of last_non_turbo_profile
platform/x86: acer-wmi: Ignore AC events
platform/mellanox: mlxreg-io: use sysfs_emit() instead of sprintf()
platform/mellanox: mlxreg-hotplug: use sysfs_emit() instead of sprintf()
platform/mellanox: mlxbf-bootctl: use sysfs_emit() instead of sprintf()
platform/x86: hp-wmi: Add fan and thermal profile support for Victus 16-s1000
ACPI: platform_profile: Add a prefix to log messages
ACPI: platform_profile: Add documentation
ACPI: platform_profile: Clean platform_profile_handler
ACPI: platform_profile: Move platform_profile_handler
ACPI: platform_profile: Remove platform_profile_handler from exported symbols
platform/x86: thinkpad_acpi: Use devm_platform_profile_register()
platform/x86: inspur_platform_profile: Use devm_platform_profile_register()
platform/x86: hp-wmi: Use devm_platform_profile_register()
platform/x86: ideapad-laptop: Use devm_platform_profile_register()
platform/x86: dell-pc: Use devm_platform_profile_register()
platform/x86: asus-wmi: Use devm_platform_profile_register()
platform/x86: amd: pmf: sps: Use devm_platform_profile_register()
platform/x86: acer-wmi: Use devm_platform_profile_register()
platform/surface: surface_platform_profile: Use devm_platform_profile_register()
...
Pull x86 TDX updates from Dave Hansen:
"Intel Trust Domain updates.
The existing TDX code needs a _bit_ of metadata from the TDX module.
But KVM is going to need a bunch more very shortly. Rework the
interface with the TDX module to be more consistent and handle the new
higher volume.
The TDX module has added a few new features. The first is a promise
not to clobber RBP under any circumstances. Basically the kernel now
will refuse to use any modules that don't have this promise. Second,
enable the new "REDUCE_VE" feature. This ensures that the TDX module
will not send some silly virtualization exceptions that the guest had
no good way to handle anyway.
- Centralize global metadata infrastructure
- Use new TDX module features for exception suppression and RBP
clobbering"
* tag 'x86_tdx_for_6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/virt/tdx: Require the module to assert it has the NO_RBP_MOD mitigation
x86/virt/tdx: Switch to use auto-generated global metadata reading code
x86/virt/tdx: Use dedicated struct members for PAMT entry sizes
x86/virt/tdx: Use auto-generated code to read global metadata
x86/virt/tdx: Start to track all global metadata in one structure
x86/virt/tdx: Rename 'struct tdx_tdmr_sysinfo' to reflect the spec better
x86/tdx: Dump attributes and TD_CTLS on boot
x86/tdx: Disable unnecessary virtualization exceptions
Pull x86 boot updates from Ingo Molnar:
- A large and involved preparatory series to pave the way to add
exception handling for relocate_kernel - which will be a debugging
facility that has aided in the field to debug an exceptionally hard
to debug early boot bug. Plus assorted cleanups and fixes that were
discovered along the way, by David Woodhouse:
- Clean up and document register use in relocate_kernel_64.S
- Use named labels in swap_pages in relocate_kernel_64.S
- Only swap pages for ::preserve_context mode
- Allocate PGD for x86_64 transition page tables separately
- Copy control page into place in machine_kexec_prepare()
- Invoke copy of relocate_kernel() instead of the original
- Move relocate_kernel to kernel .data section
- Add data section to relocate_kernel
- Drop page_list argument from relocate_kernel()
- Eliminate writes through kernel mapping of relocate_kernel page
- Clean up register usage in relocate_kernel()
- Mark relocate_kernel page as ROX instead of RWX
- Disable global pages before writing to control page
- Ensure preserve_context flag is set on return to kernel
- Use correct swap page in swap_pages function
- Fix stack and handling of re-entry point for ::preserve_context
- Mark machine_kexec() with __nocfi
- Cope with relocate_kernel() not being at the start of the page
- Use typedef for relocate_kernel_fn function prototype
- Fix location of relocate_kernel with -ffunction-sections (fix by Nathan Chancellor)
- A series to remove the last remaining absolute symbol references from
.head.text, and enforce this at build time, by Ard Biesheuvel:
- Avoid WARN()s and panic()s in early boot code
- Don't hang but terminate on failure to remap SVSM CA
- Determine VA/PA offset before entering C code
- Avoid intentional absolute symbol references in .head.text
- Disable UBSAN in early boot code
- Move ENTRY_TEXT to the start of the image
- Move .head.text into its own output section
- Reject absolute references in .head.text
- The above build-time enforcement uncovered a handful of bugs of
essentially non-working code, and a wrokaround for a toolchain bug,
fixed by Ard Biesheuvel as well:
- Fix spurious undefined reference when CONFIG_X86_5LEVEL=n, on GCC-12
- Disable UBSAN on SEV code that may execute very early
- Disable ftrace branch profiling in SEV startup code
- And miscellaneous cleanups:
- kexec_core: Add and update comments regarding the KEXEC_JUMP flow (Rafael J. Wysocki)
- x86/sysfs: Constify 'struct bin_attribute' (Thomas Weißschuh)"
* tag 'x86-boot-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
x86/sev: Disable ftrace branch profiling in SEV startup code
x86/kexec: Use typedef for relocate_kernel_fn function prototype
x86/kexec: Cope with relocate_kernel() not being at the start of the page
kexec_core: Add and update comments regarding the KEXEC_JUMP flow
x86/kexec: Mark machine_kexec() with __nocfi
x86/kexec: Fix location of relocate_kernel with -ffunction-sections
x86/kexec: Fix stack and handling of re-entry point for ::preserve_context
x86/kexec: Use correct swap page in swap_pages function
x86/kexec: Ensure preserve_context flag is set on return to kernel
x86/kexec: Disable global pages before writing to control page
x86/sev: Don't hang but terminate on failure to remap SVSM CA
x86/sev: Disable UBSAN on SEV code that may execute very early
x86/boot/64: Fix spurious undefined reference when CONFIG_X86_5LEVEL=n, on GCC-12
x86/sysfs: Constify 'struct bin_attribute'
x86/kexec: Mark relocate_kernel page as ROX instead of RWX
x86/kexec: Clean up register usage in relocate_kernel()
x86/kexec: Eliminate writes through kernel mapping of relocate_kernel page
x86/kexec: Drop page_list argument from relocate_kernel()
x86/kexec: Add data section to relocate_kernel
x86/kexec: Move relocate_kernel to kernel .data section
...
Pull perf-tools updates from Namhyung Kim:
"There are a lot of changes in the perf tools in this cycle.
build:
- Use generic syscall table to generate syscall numbers on supported
archs
- This also enables to get rid of libaudit which was used for syscall
numbers
- Remove python2 support as it's deprecated for years
- Fix issues on static build with libzstd
perf record:
- Intel-PT supports "aux-action" config term to pause or resume
tracing in the aux-buffer. Users can start the intel_pt event as
"started-paused" and configure other events to control the Intel-PT
tracing:
# perf record --kcore -e intel_pt/aux-action=start-paused/ \
-e syscalls:sys_enter_newuname/aux-action=resume/ \
-e syscalls:sys_exit_newuname/aux-action=pause/ -- uname
This requires kernel support (which was added in v6.13)
perf lock:
- 'perf lock contention' command has an ability to symbolize locks in
dynamically allocated objects using slab cache name when it runs
with BPF. Those dynamic locks would have "&" prefix in the name to
distinguish them from ordinary (static) locks
# perf lock con -abl -E 5 sleep 1
contended total wait max wait avg wait address symbol
2 1.95 us 1.77 us 975 ns ffff9d5e852d3498 &task_struct (mutex)
1 1.18 us 1.18 us 1.18 us ffff9d5e852d3538 &task_struct (mutex)
4 1.12 us 354 ns 279 ns ffff9d5e841ca800 &kmalloc-cg-512 (mutex)
2 859 ns 617 ns 429 ns ffffffffa41c3620 delayed_uprobe_lock (mutex)
3 691 ns 388 ns 230 ns ffffffffa41c0940 pack_mutex (mutex)
This also requires kernel/BPF support (which was added in v6.13)
perf ftrace:
- 'perf ftrace latency' command gets a couple of options to support
linear buckets instead of exponential. Also it's possible to
specify max and min latency for the linear buckets:
# perf ftrace latency -abn -T switch_mm_irqs_off --bucket-range=100 \
--min-latency=200 --max-latency=800 -- sleep 1
# DURATION | COUNT | GRAPH |
0 - 200 ns | 186 | ### |
200 - 300 ns | 256 | ##### |
300 - 400 ns | 364 | ####### |
400 - 500 ns | 223 | #### |
500 - 600 ns | 111 | ## |
600 - 700 ns | 41 | |
700 - 800 ns | 141 | ## |
800 - ... ns | 169 | ### |
# statistics (in nsec)
total time: 2162212
avg time: 967
max time: 16817
min time: 132
count: 2236
- As you can see in the above example, it nows shows the statistics
at the end so that users can see the avg/max/min latencies easily
- 'perf ftrace profile' command has --graph-opts option like 'perf
ftrace trace' so that it can control the tracing behaviors in the
same way. For example, it can limit the function call depth or
threshold
perf script:
- Improve physical memory resolution in 'mem-phys-addr' script by
parsing /proc/iomem file
# perf script mem-phys-addr -- find /
...
Event: mem_inst_retired.all_loads:P
Memory type count percentage
---------------------------------------- ---------- ----------
100000000-85f7fffff : System RAM 8929 69.7
547600000-54785d23f : Kernel data 1240 9.7
546a00000-5474bdfff : Kernel rodata 490 3.8
5480ce000-5485fffff : Kernel bss 121 0.9
0-fff : Reserved 3860 30.1
100000-89c01fff : System RAM 18 0.1
8a22c000-8df6efff : System RAM 5 0.0
Others:
- 'perf test' gets --runs-per-test option to run the test cases
repeatedly. This would be helpful to see if it's flaky
- Add 'parse_events' method to Python perf extension module, so that
users can use the same event parsing logic in the python code. One
more step towards implementing perf tools in Python. :)
- Support opening tracepoint events without libtraceevent. This will
be helpful if it won't use the tracing data like in 'perf stat'
- Update ARM Neoverse N2/V2 JSON events and metrics"
* tag 'perf-tools-for-v6.14-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (176 commits)
perf test: Update event_groups test to use instructions
perf bench: Fix undefined behavior in cmpworker()
perf annotate: Prefer passing evsel to evsel->core.idx
perf lock: Rename fields in lock_type_table
perf lock: Add percpu-rwsem for type filter
perf lock: Fix parse_lock_type which only retrieve one lock flag
perf lock: Fix return code for functions in __cmd_contention
perf hist: Fix width calculation in hpp__fmt()
perf hist: Fix bogus profiles when filters are enabled
perf hist: Deduplicate cmp/sort/collapse code
perf test: Improve verbose documentation
perf test: Add a runs-per-test flag
perf test: Fix parallel/sequential option documentation
perf test: Send list output to stdout rather than stderr
perf test: Rename functions and variables for better clarity
perf tools: Expose quiet/verbose variables in Makefile.perf
perf config: Add a function to set one variable in .perfconfig
perf test perftool_testsuite: Return correct value for skipping
perf test perftool_testsuite: Add missing description
perf test record+probe_libc_inet_pton: Make test resilient
...