Use the observation that cpufreq_update_util() is only called
by the scheduler with rq->lock held, so the callers of
cpufreq_set_update_util_data() can use synchronize_sched()
instead of synchronize_rcu() to wait for cpufreq_update_util()
to complete. Moreover, if they are updated to do that,
rcu_read_(un)lock() calls in cpufreq_update_util() might be
replaced with rcu_read_(un)lock_sched(), respectively, but
those aren't really necessary, because the scheduler calls
that function from RCU-sched read-side critical sections
already.
In addition to that, if cpufreq_set_update_util_data() checks
the func field in the struct update_util_data before setting
the per-CPU pointer to it, the data->func check may be dropped
from cpufreq_update_util() as well.
Make the above changes to reduce the overhead from
cpufreq_update_util() in the scheduler paths invoking it
and to make the cleanup after removing its callbacks less
heavy-weight somewhat.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Instead of using a per-CPU deferrable timer for utilization sampling
and P-states adjustments, register a utilization update callback that
will be invoked from the scheduler on utilization changes.
The sampling rate is still the same as what was used for the deferrable
timers, so the functional impact of this patch should not be significant.
Based on an earlier patch from Srinivas Pandruvada.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Disable HWP Interrupt notification before enabling HWP. Since we don't
have HWP interrupt handling for possible performance interrupts, there
is not much use of enabling HWP interrupts.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If the processor supports HWP, enable it by default without checking
for the cpu model. This will allow to enable HWP in all supported
processors without driver change.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The intel-pstate driver is using intel_pstate_hwp_set() from two
separate paths, i.e. ->set_policy() callback and sysfs update path for
the files present in /sys/devices/system/cpu/intel_pstate/ directory.
While an update to the sysfs path applies to all the CPUs being managed
by the driver (which essentially means all the online CPUs), the update
via the ->set_policy() callback applies to a smaller group of CPUs
managed by the policy for which ->set_policy() is called.
And so, intel_pstate_hwp_set() should update frequencies of only the
CPUs that are part of policy->cpus mask, while it is called from
->set_policy() callback.
In order to do that, add a parameter (cpumask) to intel_pstate_hwp_set()
and apply the frequency changes only to the concerned CPUs.
For ->set_policy() path, we are only concerned about policy->cpus, and
so policy->rwsem lock taken by the core prior to calling ->set_policy()
is enough to take care of any races. The larger lock acquired by
get_online_cpus() is required only for the updates to sysfs files.
Add another routine, intel_pstate_hwp_set_online_cpus(), and call it
from the sysfs update paths.
This also fixes a lockdep reported recently, where policy->rwsem and
get_online_cpus() could have been acquired in any order causing an ABBA
deadlock. The sequence of events leading to that was:
intel_pstate_init(...)
...cpufreq_online(...)
down_write(&policy->rwsem); // Locks policy->rwsem
...
cpufreq_init_policy(policy);
...intel_pstate_hwp_set();
get_online_cpus(); // Temporarily locks cpu_hotplug.lock
...
up_write(&policy->rwsem);
pm_suspend(...)
...disable_nonboot_cpus()
_cpu_down()
cpu_hotplug_begin(); // Locks cpu_hotplug.lock
__cpu_notify(CPU_DOWN_PREPARE, ...);
...cpufreq_offline_prepare();
down_write(&policy->rwsem); // Locks policy->rwsem
Reported-and-tested-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
785ee27 ("cpufreq: intel_pstate: Fix limits->max_perf rounding error")
hardcodes the value of FRAC_BITS. This patch fixes that minor issue.
Fixes: 785ee27881 (cpufreq: intel_pstate: Fix limits->max_perf rounding error)
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In cases where we have many IOs, the global load becomes low and the
load algorithm will decrease the requested P-State. Because of that,
the IOs overheads will increase and impact the IO performances.
To improve IO bound work, we can count the io-wait time as busy time
in calculating CPU busy.
This change uses get_cpu_iowait_time_us() to obtain the IO wait time value
and converts time into number of cycles spent waiting on IO at the TSC
rate. At the moment, this trick is only used for Atom.
Signed-off-by: Philippe Longepe <philippe.longepe@intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The current function to calculate cpu utilization uses the average P-state
ratio (APerf/Mperf) scaled by the ratio of the current P-state to the
max available non-turbo one. This leads to an overestimation of
utilization which causes higher-performance P-states to be selected more
often and that leads to increased energy consumption.
This is a problem for low-power systems, so it is better to use a
different utilization calculation algorithm for them.
Namely, the Percent Busy value (or load) can be estimated as the ratio of the
MPERF counter that runs at a constant rate only during active periods (C0) to
the time stamp counter (TSC) that also runs (at the same rate) during idle.
That is:
Percent Busy = 100 * (delta_mperf / delta_tsc)
Use this algorithm for platforms with SoCs based on the Airmont and Silvermont
Atom cores.
Signed-off-by: Philippe Longepe <philippe.longepe@intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Target systems using different cpus have different power and performance
requirements. They may use different algorithms to get the next P-state
based on their power or performance preference.
For example, power-constrained systems may not want to use
high-performance P-states as aggressively as a full-size desktop or a
server platform. A server platform may want to run close to the max to
achieve better performance, while laptop-like systems may prefer
sacrificing performance for longer battery lifes.
For the above reasons, modify intel_pstate to allow the target P-state
selection algorithm to be depend on the CPU ID.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Philippe Longepe <philippe.longepe@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If hardware-driven P-state selection (HWP) is enabled, the
"performance" mode of intel_pstate should only allow the processor
to use the highest-performance P-state available. That is not
the case currently, so make it actually happen.
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Alexandra Yates <alexandra.yates@linux.intel.com>
[ rjw: Subject and changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
A rounding error was found in the calculation of limits->max_perf
in intel_pstate_set_policy(), which is used to calculate the max and min
pstate values in intel_pstate_get_min_max(). In that code,
limits->max_perf is truncated to 2 hex digits such that, for example,
0x169 was incorrectly calculated to 0x16 instead of 0x17. This resulted in
the pstate being set one level too low. This patch rounds the value of
limits->max_perf up instead of down so that the correct max pstate can
be reached.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
I have a Intel (6,63) processor with a "marketing" frequency (from
/proc/cpuinfo) of 2100MHz, and a max turbo frequency of 2600MHz. I
can execute
cpupower frequency-set -g powersave --min 1200MHz --max 2100MHz
and the max_freq_pct is set to 80. When adding load to the system I noticed
that the cpu frequency only reached 2000MHZ and not 2100MHz as expected.
This is because limits->max_policy_pct is calculated as 2100 * 100 /2600 = 80.7
and is rounded down to 80 when it should be rounded up to 81. This patch
adds a DIV_ROUND_UP() which will return the correct value.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There are two flavors of Atom cores to be supported by intel_pstate,
Silvermont and Airmont, so make the driver distinguish between them by
adding separate frequency tables.
Separate the CPU defaults params for each of them and match the CPU IDs
against them as appropriate.
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject and changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Rename symbol and function names starting with "BYT" or "byt" to
start with "ATOM" or "atom", respectively, so as to make it clear
that they may apply to Atom in general and not just to Baytrail
(the goal is to support several Atoms architectures eventually).
This should not lead to any functional changes.
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw : Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Revert commit 37afb00032 (cpufreq: intel_pstate: Use ACPI perf
configuration) that is reported to cause a regression to happen
on a system where invalid data are returned by the ACPI _PSS object.
Since that commit makes assumptions regarding the _PSS output
correctness that may turn out to be overly optimistic in general,
there is a concern that it may introduce regression on more
systems, so it's better to revert it now and we'll revisit the
underlying issue in the next cycle with a more robust solution.
Conflicts:
drivers/cpufreq/intel_pstate.c
Fixes: 37afb00032 (cpufreq: intel_pstate: Use ACPI perf configuration)
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Revert commit 4ef4514870 (cpufreq: intel_pstate: Avoid calculation for
max/min) as it depends on commit 37afb00032 (cpufreq: intel_pstate: Use
ACPI perf configuration) that causes problems to happen and needs to be
reverted.
Conflicts:
drivers/cpufreq/intel_pstate.c
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When booting an HWP enabled system the kernel displays one "HWP enabled"
message for each cpu. The messages are superfluous since HWP is globally
enabled across all CPUs. This patch also adds an informational message
when HWP is disabled via intel_pstate=no_hwp.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
On systems that initialize the intel_pstate driver with the performance
governor, and then switch to the powersave governor will not transition to
lower cpu frequencies until /sys/devices/system/cpu/intel_pstate/min_perf_pct
is set to a low value.
The behavior of governor switching changed after commit a04759924e
("[cpufreq] intel_pstate: honor user space min_perf_pct override on
resume"). The commit introduced tracking of performance percentage
changes via sysfs in order to restore userspace changes during
suspend/resume. The problem occurs because the global values of the newly
introduced max_sysfs_pct and min_sysfs_pct are not lowered on the governor
change and this causes the powersave governor to inherit the performance
governor's settings.
A simple change would have been to reset max_sysfs_pct to 100 and
min_sysfs_pct to 0 on a governor change, which fixes the problem with
governor switching. However, since we cannot break userspace[1] the fix
is now to give each governor its own limits storage area so that governor
specific changes are tracked.
I successfully tested this by booting with both the performance governor
and the powersave governor by default, and switching between the two
governors (while monitoring /sys/devices/system/cpu/intel_pstate/ values,
and looking at the output of cpupower frequency-info). Suspend/Resume
testing was performed by Doug Smythies.
[1] Systems which suspend/resume using the unmaintained pm-utils package
will always transition to the performance governor before the suspend and
after the resume. This means a system using the powersave governor will
go from powersave to performance, then suspend/resume, performance to
powersave. The simple change during governor changes would have been
overwritten when the governor changed before and after the suspend/resume.
I have submitted https://bugzilla.redhat.com/show_bug.cgi?id=1271225
against Fedora to remove the 94cpufreq file that causes the problem. It
should be noted that pm-utils is obsoleted with newer versions of systemd.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This is a workaround for KNL platform, where in some cases MPERF counter
will not have updated value before next read of MSR_IA32_MPERF. In this
case divide by zero will occur. This change ignores current sample for
busy calculation in this case.
Fixes: b34ef932d7 (intel_pstate: Knights Landing support)
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: 4.1+ <stable@vger.kernel.org> # 4.1+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When requested from cpufreq to set policy, look into _pss and get
control values, instead of using max/min perf calculations. These
calculation misses next control state in boundary conditions.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use ACPI _PSS to limit the Intel P State turbo, max and min ratios.
This driver uses acpi processor perf lib calls to register performance.
The following logic is used to adjust Intel P state driver limits:
- If there is no turbo entry in _PSS, then disable Intel P state turbo
and limit to non turbo max
- If the non turbo max ratio is more than _PSS max non turbo value, then
set the max non turbo ratio to _PSS non turbo max
- If the min ratio is less than _PSS min then change the min ratio
matching _PSS min
- Scale the _PSS turbo frequency to max turbo frequency based on control
value.
This feature can be disabled by using kernel parameters:
intel_pstate=no_acpi
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Systems with configurable TDP have multiple max non turbo p state. Intel
P state uses max non turbo P state for scaling. But using the real max
non turbo p state causes underestimation of next P state. So using
the physical max non turbo P state as before for scaling.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>