Commit bf5835bcdb ("intel_idle: Disable IBRS during long idle")
disables IBRS when the cstate is 6 or lower. However, there are
some use cases where a customer may want to use max_cstate=1 to
lower latency. Such use cases will suffer from the performance
degradation caused by the enabling of IBRS in the sibling idle thread.
Add a "ibrs_off" module parameter to force disable IBRS and the
CPUIDLE_FLAG_IRQ_ENABLE flag if set.
In the case of a Skylake server with max_cstate=1, this new ibrs_off
option will likely increase the IRQ response latency as IRQ will now
be disabled.
When running SPECjbb2015 with cstates set to C1 on a Skylake system.
First test when the kernel is booted with: "intel_idle.ibrs_off":
max-jOPS = 117828, critical-jOPS = 66047
Then retest when the kernel is booted without the "intel_idle.ibrs_off"
added:
max-jOPS = 116408, critical-jOPS = 58958
That means booting with "intel_idle.ibrs_off" improves performance by:
max-jOPS: +1.2%, which could be considered noise range.
critical-jOPS: +12%, which is definitely a solid improvement.
The admin-guide/pm/intel_idle.rst file is updated to add a description
about the new "ibrs_off" module parameter.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20230727184600.26768-5-longman@redhat.com
When intel_idle_ibrs() is called, it modifies the SPEC_CTRL MSR to 0
in order disable IBRS. However, the new MSR value isn't reflected in
x86_spec_ctrl_current which is at odd with the other code that keep track
of its state in that percpu variable. Use the new __update_spec_ctrl()
to have the x86_spec_ctrl_current percpu value properly updated.
Since spec-ctrl.h includes both msr.h and nospec-branch.h, we can remove
those from the include file list.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20230727184600.26768-4-longman@redhat.com
This reverts commit b2918089d5 ("intel_idle: Add __init annotation to
matchup_vm_state_with_baremetal()"), because the commit fixed by it will
be reverted.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The caller of (recently added) matchup_vm_state_with_baremetal() is an
__init function and it uses some __initdata data structures, so add the
__init annotation to it for consistency.
This addresses the following build warnings:
WARNING: modpost: vmlinux: section mismatch in reference: matchup_vm_state_with_baremetal+0x51 (section: .text) -> intel_idle_max_cstate_reached (section: .init.text)
WARNING: modpost: vmlinux: section mismatch in reference: matchup_vm_state_with_baremetal+0x62 (section: .text) -> cpuidle_state_table (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: matchup_vm_state_with_baremetal+0x79 (section: .text) -> icpu (section: .init.data)
Fixes: 0fac214bb7 ("intel_idle: Add a "Long HLT" C1 state for the VM guest mode")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
intel_idle will, for the bare metal case, usually have one or more deep
power states that have the CPUIDLE_FLAG_TLB_FLUSHED flag set. When
a state with this flag is selected by the cpuidle framework, it will also
flush the TLBs as part of entering this state. The benefit of doing this is
that the kernel does not need to wake the cpu out of this deep power state
just to flush the TLBs... for which the latency can be very high due to
the exit latency of deep power states.
In a VM guest currently, this benefit of avoiding the wakeup does not exist,
while the problem (long exit latency) is even more severe. Linux will need
to wake up a vCPU (causing the host to either come out of a deep C state,
or the VMM to have to deschedule something else to schedule the vCPU) which
can take a very long time.. adding a lot of latency to tlb flush operations
(including munmap and others).
To solve this, add a "Long HLT" C state to the state table for the VM guest
case that has the CPUIDLE_FLAG_TLB_FLUSHED flag set. The result of that is
that for long idle periods (where the VMM is likely to do things that cause
large latency) the cpuidle framework will flush the TLBs (and avoid the
wakeups), while for short/quick idle durations, the existing behavior is
retained.
Now, there is still only "hlt" available in the guest, but for long idle,
the host can go to a deeper state (say C6). There is a reasonable debate
one can have to what to set for the exit_latency and break even point for
this "Long HLT" state. The good news is that intel_idle has these values
available for the underlying CPU (even when mwait is not exposed). The
solution thus is to just use the latency and break even of the deepest state
from the bare metal CPU. This is under the assumption that this is a pretty
reasonable estimate of what the VMM would do to cause latency.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In a typical VM guest, the mwait instruction is not available, leaving
only the 'hlt' instruction (which causes a VMEXIT to the host).
So for this common case, intel_idle will detect the lack of mwait, and
fail to initialize (after which another idle method would step in which
will just use hlt always).
Other (non-common) cases exist; the table below shows the before/after
for these:
+------------+--------------------------+-------------------------+
| Hypervisor | Idle method before patch | Idle method after patch |
| exposes | | |
+============+==========================+=========================+
| nothing | default_idle fallback | intel_idle VM table |
| (common) | (straight "hlt") | |
+------------+--------------------------+-------------------------+
| mwait | intel_idle mwait table | intel_idle mwait table |
+------------+--------------------------+-------------------------+
| ACPI | ACPI C1 state ("hlt") | intel_idle VM table |
+------------+--------------------------+-------------------------+
This is only applicable to CPUs known by intel_idle. For the bare metal
case, unknown CPU models will use the ACPI tables (when available) to
get estimates for latency and break even point for longer idle states.
In guests, the common case is that ACPI tables are not available, but
even when they are available, they can't and don't provide the latency
information for the longer (mwait based) states. For this scenario
(unknown CPU model), the default_idle mode (no ACPI) or ACPI C1 (ACPI
avaible) will be used.
By providing capability to do this with the intel_idle driver, we can
do better than the fallback or ACPI table methods. While this current
change only gets us to the existing behavior, later patches in this
series will add new capabilities such as optimized TLB flushing.
In order to do this, a simplified version of the initialization
function for VM guests is created, and this will be called if the CPU
is recognized, but mwait is not supported, and we're in a VM guest.
One thing to note is that the max latency (and break even) of this C1
state is higher than the typical bare metal C1 state. Because hlt causes
a vmexit, and the cost of vmexit + hypervisor overhead + vmenter is
typically in the order of upto 5 microseconds... even if the hypervisor
does not actually goes into a hardware power saving state.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
[ rjw: Dropped redundant checks from should_verify_mwait() ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Now that the logic for state_update_enter_method() is in its own
function, the long if .. else if .. else if .. else if chain
can be simplified by just returning from the function
at the various places. This does not change functionality,
but it makes the logic much simpler to read or modify later.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since the 6.4 kernel, the logic for updating a state's enter method
based on "environmental conditions" (command line options, cpu sidechannel
workarounds etc etc) has gotten pretty complex.
This patch refactors this into a seperate small, self contained function
(no behavior changes) for improved readability and to make future
changes to this logic easier to do and understand.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The intention is to clean up the code and make it look a bit more
consistent.
Mark all unitialized module parameter variables as __read_mostly,
not just one of them. The other parameters are read-mostly too.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This is a cleanup which improves code consistency. Move the force_irq_on
module parameter variable and definition to the same place where we have
variables and definitions for other module parameters.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
By default, all non-POLL C-states are entered with interrupts disabled.
There are 2 ways to make 'intel_idle' enter C-states with interrupts
enabled:
1. Mark the C-state with the CPUIDLE_FLAG_IRQ_ENABLE flag.
2. Use the force_irq_on module parameter.
The former is the "proper" way of doing it, it is per-C-state and
per-platform. The latter is for debugging purposes only.
The problem is that intel_idle prints the "forced intel_idle_irq"
message in both cases, even though the former case does not needed
this message, because nothing is forced there. This patch addresses the
problem.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The following C-state flags are currently mutually-exclusive and should not
be combined:
* IRQ_ENABLE
* IBRS
* XSTATE
There is a warning for the situation when the IRQ_ENABLE flag
is combined with the IBRS flag, but no warnings for other combinations.
This is inconsistent and prone to errors.
Improve the situation by adding warnings for all the unexpected
combinations. Add a couple of helpful commentaries too.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Introduce a temporary 'state' variable for referencing the currently
processed C-state in the intel_idle_init_cstates_icpu() function.
This makes code lines shorter and easier to read.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The intel_idle_init_cstates_icpu() function includes a loop that iterates
over every C-state. Inside the loop, the same C-state data is referenced 2
ways:
1. as cpuidle_state_table[cstate]
2. as drv->states[drv->state_count] (but it is a copy of #1, not the same
object).
Make the code be more consistent and easier to read by using only the 2nd
way. So the code structure would be as follows:
1. Use cpuidle_state_table[cstate]
2. Copy cpuidle_state_table[cstate] to drv->states[drv->state_count]
3. Use only drv->states[drv->state_count] from this point.
Note, this change introduces a checkpatch.pl warning (too long line), but it
will be addressed in the next patch.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Substitute 'printk()' with 'pr_info()', because 'intel_idle' already uses
'pr_debug()', so using 'pr_info()' will be more consistent.
In addition to this, this patch addresses the following checkpatch.pl
warning:
WARNING: printk() should include KERN_<LEVEL> facility level
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull power management updates from Rafael Wysocki:
"These add EPP support to the AMD P-state cpufreq driver, add support
for new platforms to the Intel RAPL power capping driver, intel_idle
and the Qualcomm cpufreq driver, enable thermal cooling for Tegra194,
drop the custom cpufreq driver for loongson1 that is not necessary any
more (and the corresponding cpufreq platform device), fix assorted
issues and clean up code.
Specifics:
- Add EPP support to the AMD P-state cpufreq driver (Perry Yuan, Wyes
Karny, Arnd Bergmann, Bagas Sanjaya)
- Drop the custom cpufreq driver for loongson1 that is not necessary
any more and the corresponding cpufreq platform device (Keguang
Zhang)
- Remove "select SRCU" from system sleep, cpufreq and OPP Kconfig
entries (Paul E. McKenney)
- Enable thermal cooling for Tegra194 (Yi-Wei Wang)
- Register module device table and add missing compatibles for
cpufreq-qcom-hw (Nícolas F. R. A. Prado, Abel Vesa and Luca Weiss)
- Various dt binding updates for qcom-cpufreq-nvmem and
opp-v2-kryo-cpu (Christian Marangi)
- Make kobj_type structure in the cpufreq core constant (Thomas
Weißschuh)
- Make cpufreq_unregister_driver() return void (Uwe Kleine-König)
- Make the TEO cpuidle governor check CPU utilization in order to
refine idle state selection (Kajetan Puchalski)
- Make Kconfig select the haltpoll cpuidle governor when the haltpoll
cpuidle driver is selected and replace a default_idle() call in
that driver with arch_cpu_idle() to allow MWAIT to be used (Li
RongQing)
- Add Emerald Rapids Xeon support to the intel_idle driver (Artem
Bityutskiy)
- Add ARCH_SUSPEND_POSSIBLE dependencies for ARMv4 cpuidle drivers to
avoid randconfig build failures (Arnd Bergmann)
- Make kobj_type structures used in the cpuidle sysfs interface
constant (Thomas Weißschuh)
- Make the cpuidle driver registration code update microsecond values
of idle state parameters in accordance with their nanosecond values
if they are provided (Rafael Wysocki)
- Make the PSCI cpuidle driver prevent topology CPUs from being
suspended on PREEMPT_RT (Krzysztof Kozlowski)
- Document that pm_runtime_force_suspend() cannot be used with
DPM_FLAG_SMART_SUSPEND (Richard Fitzgerald)
- Add EXPORT macros for exporting PM functions from drivers (Richard
Fitzgerald)
- Remove /** from non-kernel-doc comments in hibernation code (Randy
Dunlap)
- Fix possible name leak in powercap_register_zone() (Yang Yingliang)
- Add Meteor Lake and Emerald Rapids support to the intel_rapl power
capping driver (Zhang Rui)
- Modify the idle_inject power capping facility to support 100% idle
injection (Srinivas Pandruvada)
- Fix large time windows handling in the intel_rapl power capping
driver (Zhang Rui)
- Fix memory leaks with using debugfs_lookup() in the generic PM
domains and Energy Model code (Greg Kroah-Hartman)
- Add missing 'cache-unified' property in the example for kryo OPP
bindings (Rob Herring)
- Fix error checking in opp_migrate_dentry() (Qi Zheng)
- Let qcom,opp-fuse-level be a 2-long array for qcom SoCs (Konrad
Dybcio)
- Modify some power management utilities to use the canonical ftrace
path (Ross Zwisler)
- Correct spelling problems for Documentation/power/ as reported by
codespell (Randy Dunlap)"
* tag 'pm-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (53 commits)
Documentation: amd-pstate: disambiguate user space sections
cpufreq: amd-pstate: Fix invalid write to MSR_AMD_CPPC_REQ
dt-bindings: opp: opp-v2-kryo-cpu: enlarge opp-supported-hw maximum
dt-bindings: cpufreq: qcom-cpufreq-nvmem: make cpr bindings optional
dt-bindings: cpufreq: qcom-cpufreq-nvmem: specify supported opp tables
PM: Add EXPORT macros for exporting PM functions
cpuidle: psci: Do not suspend topology CPUs on PREEMPT_RT
MIPS: loongson32: Drop obsolete cpufreq platform device
powercap: intel_rapl: Fix handling for large time window
cpuidle: driver: Update microsecond values of state parameters as needed
cpuidle: sysfs: make kobj_type structures constant
cpuidle: add ARCH_SUSPEND_POSSIBLE dependencies
PM: EM: fix memory leak with using debugfs_lookup()
PM: domains: fix memory leak with using debugfs_lookup()
cpufreq: Make kobj_type structure constant
cpufreq: davinci: Fix clk use after free
cpufreq: amd-pstate: avoid uninitialized variable use
cpufreq: Make cpufreq_unregister_driver() return void
OPP: fix error checking in opp_migrate_dentry()
dt-bindings: cpufreq: cpufreq-qcom-hw: Add SM8550 compatible
...
Emerald Rapids (EMR) is the next Intel Xeon processor after Sapphire
Rapids (SPR).
EMR C-states are the same as SPR C-states, and we expect that EMR
C-state characteristics (latency and target residency) will be the
same as in SPR. Therefore, add EMR support by using SPR C-states table.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
So objtool found this bug:
vmlinux.o: warning: objtool: intel_idle_irq+0x10c: call to trace_hardirqs_off() leaves .noinstr.text section
As per commit 32d4fd5751 ("cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE"):
"must not have tracing in idle functions"
Clearly people can't read and tinker along until splat dissapears.
This straight up reverts commit d295ad34f2 ("intel_idle: Fix false
positive RCU splats due to incorrect hardirqs state").
It doesn't re-introduce the problem because preceding patches fixed it
properly.
Fixes: d295ad34f2 ("intel_idle: Fix false positive RCU splats due to incorrect hardirqs state")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195540.434302128@infradead.org
Similar to the other other AlderLake platforms, the C1 and C1E states on
ADL-N are mutually exclusive. Only one of them can be enabled at a time.
C1E is preferred on ADL-N for better energy efficiency.
C6S is also supported on this platform. Its latency is far bigger than
C6, but really close to C8 (PC8), thus it is not exposed as a separate
state.
Suggested-by: Baieswara Reddy Sagili <baieswara.reddy.sagili@intel.com>
Suggested-by: Vinay Kumar <vinay.kumar@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>