docs: power: convert docs to ReST and rename to *.rst

Convert the PM documents to ReST, in order to allow them to
build with Sphinx.

The conversion consists of:
  - adding blank lines and indentation to identify paragraphs;
  - fixing table markup;
  - adding list markup where needed;
  - marking literal blocks;
  - adjusting title markup.

In the new index.rst, add :orphan: for now, since the file is not yet linked
from the main index.rst, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Mark Brown <broonie@kernel.org>
Acked-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Author: Mauro Carvalho Chehab
Date: 2019-06-13 07:10:36 -03:00
Committed by: Bjorn Helgaas
Parent: 9595aee2a3
Commit: 151f4e2bdc
51 changed files with 2127 additions and 1706 deletions

@@ -5,7 +5,7 @@ Contact: linux-pm@vger.kernel.org
Description:
The powercap/ class sub directory belongs to the power cap
subsystem. Refer to
Documentation/power/powercap/powercap.txt for details.
Documentation/power/powercap/powercap.rst for details.
What: /sys/class/powercap/<control type>
Date: September 2013

@@ -13,7 +13,7 @@
For ARM64, ONLY "acpi=off", "acpi=on" or "acpi=force"
are available
See also Documentation/power/runtime_pm.txt, pci=noacpi
See also Documentation/power/runtime_pm.rst, pci=noacpi
acpi_apic_instance= [ACPI, IOAPIC]
Format: <int>
@@ -223,7 +223,7 @@
acpi_sleep= [HW,ACPI] Sleep options
Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig,
old_ordering, nonvs, sci_force_enable, nobl }
See Documentation/power/video.txt for information on
See Documentation/power/video.rst for information on
s3_bios and s3_mode.
s3_beep is for debugging; it makes the PC's speaker beep
as soon as the kernel's real-mode entry point is called.
@@ -4108,7 +4108,7 @@
Specify the offset from the beginning of the partition
given by "resume=" at which the swap header is located,
in <PAGE_SIZE> units (needed only for swap files).
See Documentation/power/swsusp-and-swap-files.txt
See Documentation/power/swsusp-and-swap-files.rst
resumedelay= [HIBERNATION] Delay (in seconds) to pause before attempting to
read the resume files

@@ -95,7 +95,7 @@ flags - flags of the cpufreq driver
3. CPUFreq Table Generation with Operating Performance Point (OPP)
==================================================================
For details about OPP, see Documentation/power/opp.txt
For details about OPP, see Documentation/power/opp.rst
dev_pm_opp_init_cpufreq_table -
This function provides a ready to use conversion routine to translate

@@ -225,7 +225,7 @@ system-wide transition to a sleep state even though its :c:member:`runtime_auto`
flag is clear.
For more information about the runtime power management framework, refer to
:file:`Documentation/power/runtime_pm.txt`.
:file:`Documentation/power/runtime_pm.rst`.
Calling Drivers to Enter and Leave System Sleep States
@@ -728,7 +728,7 @@ it into account in any way.
Devices may be defined as IRQ-safe which indicates to the PM core that their
runtime PM callbacks may be invoked with disabled interrupts (see
:file:`Documentation/power/runtime_pm.txt` for more information). If an
:file:`Documentation/power/runtime_pm.rst` for more information). If an
IRQ-safe device belongs to a PM domain, the runtime PM of the domain will be
disallowed, unless the domain itself is defined as IRQ-safe. However, it
makes sense to define a PM domain as IRQ-safe only if all the devices in it
@@ -795,7 +795,7 @@ so on) and the final state of the device must reflect the "active" runtime PM
status in that case.
During system-wide resume from a sleep state it's easiest to put devices into
the full-power state, as explained in :file:`Documentation/power/runtime_pm.txt`.
the full-power state, as explained in :file:`Documentation/power/runtime_pm.rst`.
[Refer to that document for more information regarding this particular issue as
well as for information on the device runtime power management framework in
general.]

@@ -46,7 +46,7 @@ device is turned off while the system as a whole remains running, we
call it a "dynamic suspend" (also known as a "runtime suspend" or
"selective suspend"). This document concentrates mostly on how
dynamic PM is implemented in the USB subsystem, although system PM is
covered to some extent (see ``Documentation/power/*.txt`` for more
covered to some extent (see ``Documentation/power/*.rst`` for more
information about system PM).
System PM support is present only if the kernel was built with

@@ -1,5 +1,7 @@
============
APM or ACPI?
------------
============
If you have a relatively recent x86 mobile, desktop, or server system,
odds are it supports either Advanced Power Management (APM) or
Advanced Configuration and Power Interface (ACPI). ACPI is the newer
@@ -28,5 +30,7 @@ and be sure that they are started sometime in the system boot process.
Go ahead and start both. If ACPI or APM is not available on your
system the associated daemon will exit gracefully.
apmd: http://ftp.debian.org/pool/main/a/apmd/
acpid: http://acpid.sf.net/
===== =======================================
apmd http://ftp.debian.org/pool/main/a/apmd/
acpid http://acpid.sf.net/
===== =======================================

@@ -1,12 +1,16 @@
=================================
Debugging hibernation and suspend
=================================
(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL
1. Testing hibernation (aka suspend to disk or STD)
===================================================
To check if hibernation works, you can try to hibernate in the "reboot" mode:
To check if hibernation works, you can try to hibernate in the "reboot" mode::
# echo reboot > /sys/power/disk
# echo disk > /sys/power/state
# echo reboot > /sys/power/disk
# echo disk > /sys/power/state
and the system should create a hibernation image, reboot, resume and get back to
the command prompt where you have started the transition. If that happens,
@@ -15,20 +19,21 @@ test at least a couple of times in a row for confidence. [This is necessary,
because some problems only show up on a second attempt at suspending and
resuming the system.] Moreover, hibernating in the "reboot" and "shutdown"
modes causes the PM core to skip some platform-related callbacks which on ACPI
systems might be necessary to make hibernation work. Thus, if your machine fails
to hibernate or resume in the "reboot" mode, you should try the "platform" mode:
systems might be necessary to make hibernation work. Thus, if your machine
fails to hibernate or resume in the "reboot" mode, you should try the
"platform" mode::
# echo platform > /sys/power/disk
# echo disk > /sys/power/state
# echo platform > /sys/power/disk
# echo disk > /sys/power/state
which is the default and recommended mode of hibernation.
Unfortunately, the "platform" mode of hibernation does not work on some systems
with broken BIOSes. In such cases the "shutdown" mode of hibernation might
work:
work::
# echo shutdown > /sys/power/disk
# echo disk > /sys/power/state
# echo shutdown > /sys/power/disk
# echo disk > /sys/power/state
(it is similar to the "reboot" mode, but it requires you to press the power
button to make the system resume).
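The shell commands above can also be issued from a small C helper. The following is a minimal sketch, not part of the original document: the function names are made up for illustration, the list of modes covers only the three discussed so far, and the file paths are passed in as parameters so that the helper can be tried against ordinary files. On a real system the paths would be /sys/power/disk and /sys/power/state, and root privileges are needed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write one string to a sysfs-style file; returns 0 on success, -1 on error. */
int write_sysfs_string(const char *path, const char *value)
{
	FILE *f;

	assert(path != NULL && value != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(value, f) == EOF) {
		fclose(f);
		return -1;
	}
	/* Buffered write errors only show up at flush time, so check fclose(). */
	return fclose(f) == 0 ? 0 : -1;
}

/* Returns 1 if mode is one of the hibernation modes discussed above. */
int hibernation_mode_is_known(const char *mode)
{
	static const char *const modes[] = { "platform", "shutdown", "reboot" };
	size_t i;

	assert(mode != NULL);
	for (i = 0; i < sizeof(modes) / sizeof(modes[0]); i++)
		if (strcmp(mode, modes[i]) == 0)
			return 1;
	return 0;
}

/*
 * Select a hibernation mode, then start hibernation by writing "disk".
 * disk_path and state_path are assumptions of this sketch; they stand
 * for /sys/power/disk and /sys/power/state.
 */
int hibernate_in_mode(const char *disk_path, const char *state_path,
		      const char *mode)
{
	if (!hibernation_mode_is_known(mode))
		return -1;
	if (write_sysfs_string(disk_path, mode) != 0)
		return -1;
	return write_sysfs_string(state_path, "disk");
}
```

Calling hibernate_in_mode("/sys/power/disk", "/sys/power/state", "reboot") would correspond to the first pair of echo commands in this section.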
@@ -37,6 +42,7 @@ If neither "platform" nor "shutdown" hibernation mode works, you will need to
identify what goes wrong.
a) Test modes of hibernation
----------------------------
To find out why hibernation fails on your system, you can use a special testing
facility available if the kernel is compiled with CONFIG_PM_DEBUG set. Then,
@@ -44,36 +50,38 @@ there is the file /sys/power/pm_test that can be used to make the hibernation
core run in a test mode. There are 5 test modes available:
freezer
- test the freezing of processes
- test the freezing of processes
devices
- test the freezing of processes and suspending of devices
- test the freezing of processes and suspending of devices
platform
- test the freezing of processes, suspending of devices and platform
global control methods(*)
- test the freezing of processes, suspending of devices and platform
global control methods [1]_
processors
- test the freezing of processes, suspending of devices, platform
global control methods(*) and the disabling of nonboot CPUs
- test the freezing of processes, suspending of devices, platform
global control methods [1]_ and the disabling of nonboot CPUs
core
- test the freezing of processes, suspending of devices, platform global
control methods(*), the disabling of nonboot CPUs and suspending of
platform/system devices
- test the freezing of processes, suspending of devices, platform global
control methods\ [1]_, the disabling of nonboot CPUs and suspending
of platform/system devices
(*) the platform global control methods are only available on ACPI systems
.. [1]
the platform global control methods are only available on ACPI systems
and are only tested if the hibernation mode is set to "platform"
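Each test mode in the list above covers everything the previous mode covers plus one more stage. A small sketch of that cumulative relationship is given below; the enum and function names are hypothetical and exist only for this illustration, they are not kernel identifiers.

```c
#include <assert.h>
#include <string.h>

/* One bit per stage, in the order the stages are listed above. */
enum pm_test_stage {
	STAGE_FREEZE_PROCESSES = 1 << 0,
	STAGE_SUSPEND_DEVICES  = 1 << 1,
	STAGE_PLATFORM_METHODS = 1 << 2, /* ACPI only, "platform" mode only */
	STAGE_DISABLE_NONBOOT  = 1 << 3,
	STAGE_SYSTEM_DEVICES   = 1 << 4,
};

/*
 * Return the set of stages exercised by a pm_test mode name,
 * or 0 if the name is not one of the five modes.
 */
unsigned int pm_test_stages(const char *mode)
{
	static const char *const modes[] = {
		"freezer", "devices", "platform", "processors", "core",
	};
	size_t i;

	assert(mode != NULL);
	for (i = 0; i < sizeof(modes) / sizeof(modes[0]); i++)
		if (strcmp(mode, modes[i]) == 0)
			return (1u << (i + 1)) - 1; /* this stage and all earlier ones */
	return 0;
}
```

For example, pm_test_stages("devices") has both STAGE_FREEZE_PROCESSES and STAGE_SUSPEND_DEVICES set, matching the description of the "devices" mode.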
To use one of them it is necessary to write the corresponding string to
/sys/power/pm_test (eg. "devices" to test the freezing of processes and
suspending devices) and issue the standard hibernation commands. For example,
to use the "devices" test mode along with the "platform" mode of hibernation,
you should do the following:
you should do the following::
# echo devices > /sys/power/pm_test
# echo platform > /sys/power/disk
# echo disk > /sys/power/state
# echo devices > /sys/power/pm_test
# echo platform > /sys/power/disk
# echo disk > /sys/power/state
Then, the kernel will try to freeze processes, suspend devices, wait a few
seconds (5 by default, but configurable by the suspend.pm_test_delay module
@@ -108,11 +116,12 @@ If the "devices" test fails, most likely there is a driver that cannot suspend
or resume its device (in the latter case the system may hang or become unstable
after the test, so please take that into consideration). To find this driver,
you can carry out a binary search according to the rules:
- if the test fails, unload a half of the drivers currently loaded and repeat
(that would probably involve rebooting the system, so always note what drivers
have been loaded before the test),
(that would probably involve rebooting the system, so always note what drivers
have been loaded before the test),
- if the test succeeds, load a half of the drivers you have unloaded most
recently and repeat.
recently and repeat.
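The search rules above can be sketched as an ordinary bisection. This is a simplified illustration under two assumptions that the text does not guarantee: exactly one driver is at fault, and drivers can be loaded independently of each other. The callback type and all names are hypothetical; in practice each "call" of the callback is a manual reboot, driver load and "devices" test run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Callback: run the "devices" test with only drivers [lo, hi) loaded
 * and return true if the test FAILS.
 */
typedef bool (*devices_test_fails_fn)(size_t lo, size_t hi, void *ctx);

/* Return the index of the single failing driver among n drivers. */
size_t find_failing_driver(size_t n, devices_test_fails_fn test_fails,
			   void *ctx)
{
	size_t lo = 0, hi = n;

	assert(n > 0 && test_fails != NULL);
	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (test_fails(lo, mid, ctx))
			hi = mid;	/* culprit is in the half still loaded */
		else
			lo = mid;	/* culprit is in the half that was unloaded */
	}
	return lo;
}

/* Stand-in for a real test run: ctx points to the index of the bad driver. */
bool fake_devices_test_fails(size_t lo, size_t hi, void *ctx)
{
	size_t bad = *(const size_t *)ctx;

	return lo <= bad && bad < hi;
}
```

With n drivers this needs about log2(n) test runs. If more than one driver is broken, the search has to be repeated after the first culprit is excluded.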
Once you have found the failing driver (there can be more than just one of
them), you have to unload it every time before hibernation. In that case please
@@ -146,6 +155,7 @@ indicates a serious problem that very well may be related to the hardware, but
please report it anyway.
b) Testing minimal configuration
--------------------------------
If all of the hibernation test modes work, you can boot the system with the
"init=/bin/bash" command line parameter and attempt to hibernate in the
@@ -165,14 +175,15 @@ Again, if you find the offending module(s), it(they) must be unloaded every time
before hibernation, and please report the problem with it(them).
c) Using the "test_resume" hibernation option
---------------------------------------------
/sys/power/disk generally tells the kernel what to do after creating a
hibernation image. One of the available options is "test_resume" which
causes the just created image to be used for immediate restoration. Namely,
after doing:
after doing::
# echo test_resume > /sys/power/disk
# echo disk > /sys/power/state
# echo test_resume > /sys/power/disk
# echo disk > /sys/power/state
a hibernation image will be created and a resume from it will be triggered
immediately without involving the platform firmware in any way.
@@ -190,6 +201,7 @@ to resume may be related to the differences between the restore and image
kernels.
d) Advanced debugging
---------------------
In case that hibernation does not work on your system even in the minimal
configuration and compiling more drivers as modules is not practical or some
@@ -200,9 +212,10 @@ kernel messages using the serial console. This may provide you with some
information about the reasons of the suspend (resume) failure. Alternatively,
it may be possible to use a FireWire port for debugging with firescope
(http://v3.sk/~lkundrak/firescope/). On x86 it is also possible to
use the PM_TRACE mechanism documented in Documentation/power/s2ram.txt .
use the PM_TRACE mechanism documented in Documentation/power/s2ram.rst .
2. Testing suspend to RAM (STR)
===============================
To verify that the STR works, it is generally more convenient to use the s2ram
tool available from http://suspend.sf.net and documented at
@@ -230,7 +243,8 @@ you will have to unload them every time before an STR transition (ie. before
you run s2ram), and please report the problems with them.
There is a debugfs entry which shows the suspend to RAM statistics. Here is an
example of its output.
example of its output::
# mount -t debugfs none /sys/kernel/debug
# cat /sys/kernel/debug/suspend_stats
success: 20
@@ -248,6 +262,7 @@ example of its output.
-16
last_failed_step: suspend
suspend
The success field counts successful suspend-to-RAM transitions and the fail field
counts failed ones. The other fields count failures in the individual steps of
suspend to RAM. suspend_stats lists only the last 2 failed devices, error number and

@@ -1,4 +1,7 @@
===============
Charger Manager
===============
(C) 2011 MyungJoo Ham <myungjoo.ham@samsung.com>, GPL
Charger Manager provides in-kernel battery charger management that
@@ -55,41 +58,39 @@ Charger Manager supports the following:
notification to users with UEVENT.
2. Global Charger-Manager Data related with suspend_again
========================================================
=========================================================
In order to setup Charger Manager with suspend-again feature
(in-suspend monitoring), the user should provide charger_global_desc
with setup_charger_manager(struct charger_global_desc *).
with setup_charger_manager(`struct charger_global_desc *`).
This charger_global_desc data for in-suspend monitoring is global
as the name suggests. Thus, the user needs to provide only once even
if there are multiple batteries. If there are multiple batteries, the
multiple instances of Charger Manager share the same charger_global_desc
and it will manage in-suspend monitoring for all instances of Charger Manager.
The user needs to provide all the three entries properly in order to activate
in-suspend monitoring:
The user needs to provide all the three entries to `struct charger_global_desc`
properly in order to activate in-suspend monitoring:
struct charger_global_desc {
char *rtc_name;
: The name of rtc (e.g., "rtc0") used to wakeup the system from
`char *rtc_name;`
The name of rtc (e.g., "rtc0") used to wakeup the system from
suspend for Charger Manager. The alarm interrupt (AIE) of the rtc
should be able to wake up the system from suspend. Charger Manager
saves and restores the alarm value and uses the previously-defined
alarm if it is going to go off earlier than Charger Manager so that
Charger Manager does not interfere with previously-defined alarms.
bool (*rtc_only_wakeup)(void);
: This callback should let CM know whether
`bool (*rtc_only_wakeup)(void);`
This callback should let CM know whether
the wakeup-from-suspend is caused only by the alarm of "rtc" in the
same struct. If there is any other wakeup source triggered the
wakeup, it should return false. If the "rtc" is the only wakeup
reason, it should return true.
bool assume_timer_stops_in_suspend;
: if true, Charger Manager assumes that
`bool assume_timer_stops_in_suspend;`
if true, Charger Manager assumes that
the timer (CM uses jiffies as timer) stops during suspend. Then, CM
assumes that the suspend-duration is same as the alarm length.
};
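As an illustration of the three entries described above, the sketch below fills in a userspace mirror of the structure. The struct is redeclared here only so that the example is self-contained; the real definition lives in the kernel headers. The wakeup bookkeeping variable, the callback body and the checking helper are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace mirror of the three fields described above (illustration only). */
struct charger_global_desc {
	char *rtc_name;
	bool (*rtc_only_wakeup)(void);
	bool assume_timer_stops_in_suspend;
};

/* Hypothetical platform state: set when a non-RTC wakeup source fired. */
bool other_wakeup_source_fired;

/* Return true only if the RTC alarm was the sole wakeup reason. */
bool example_rtc_only_wakeup(void)
{
	return !other_wakeup_source_fired;
}

struct charger_global_desc example_global_desc = {
	.rtc_name = "rtc0",
	.rtc_only_wakeup = example_rtc_only_wakeup,
	.assume_timer_stops_in_suspend = true,
};

/* Hypothetical sanity check: the name and the callback must both be set. */
bool global_desc_is_usable(const struct charger_global_desc *d)
{
	return d != NULL && d->rtc_name != NULL && d->rtc_name[0] != '\0' &&
	       d->rtc_only_wakeup != NULL;
}
```

In a kernel driver, a descriptor like example_global_desc would be passed once to setup_charger_manager(), regardless of how many batteries are managed.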
3. How to setup suspend_again
=============================
@@ -109,26 +110,28 @@ if the system was woken up by Charger Manager and the polling
=============================================
For each battery charged independently from other batteries (if a series of
batteries are charged by a single charger, they are counted as one independent
battery), an instance of Charger Manager is attached to it.
battery), an instance of Charger Manager is attached to it. The following
struct charger_desc {
struct charger_desc elements:
char *psy_name;
: The power-supply-class name of the battery. Default is
`char *psy_name;`
The power-supply-class name of the battery. Default is
"battery" if psy_name is NULL. Users can access the psy entries
at "/sys/class/power_supply/[psy_name]/".
enum polling_modes polling_mode;
: CM_POLL_DISABLE: do not poll this battery.
CM_POLL_ALWAYS: always poll this battery.
CM_POLL_EXTERNAL_POWER_ONLY: poll this battery if and only if
an external power source is attached.
CM_POLL_CHARGING_ONLY: poll this battery if and only if the
battery is being charged.
`enum polling_modes polling_mode;`
CM_POLL_DISABLE:
do not poll this battery.
CM_POLL_ALWAYS:
always poll this battery.
CM_POLL_EXTERNAL_POWER_ONLY:
poll this battery if and only if an external power
source is attached.
CM_POLL_CHARGING_ONLY:
poll this battery if and only if the battery is being charged.
unsigned int fullbatt_vchkdrop_ms;
unsigned int fullbatt_vchkdrop_uV;
: If both have non-zero values, Charger Manager will check the
`unsigned int fullbatt_vchkdrop_ms; / unsigned int fullbatt_vchkdrop_uV;`
If both have non-zero values, Charger Manager will check the
battery voltage drop fullbatt_vchkdrop_ms after the battery is fully
charged. If the voltage drop is over fullbatt_vchkdrop_uV, Charger
Manager will try to recharge the battery by disabling and enabling
@@ -136,50 +139,52 @@ unsigned int fullbatt_vchkdrop_uV;
condition) is needed to be implemented with hardware interrupts from
fuel gauges or charger devices/chips.
unsigned int fullbatt_uV;
: If specified with a non-zero value, Charger Manager assumes
`unsigned int fullbatt_uV;`
If specified with a non-zero value, Charger Manager assumes
that the battery is full (capacity = 100) if the battery is not being
charged and the battery voltage is equal to or greater than
fullbatt_uV.
unsigned int polling_interval_ms;
: Required polling interval in ms. Charger Manager will poll
`unsigned int polling_interval_ms;`
Required polling interval in ms. Charger Manager will poll
this battery every polling_interval_ms or more frequently.
enum data_source battery_present;
: CM_BATTERY_PRESENT: assume that the battery exists.
CM_NO_BATTERY: assume that the battery does not exists.
CM_FUEL_GAUGE: get battery presence information from fuel gauge.
CM_CHARGER_STAT: get battery presence from chargers.
`enum data_source battery_present;`
CM_BATTERY_PRESENT:
assume that the battery exists.
CM_NO_BATTERY:
assume that the battery does not exist.
CM_FUEL_GAUGE:
get battery presence information from fuel gauge.
CM_CHARGER_STAT:
get battery presence from chargers.
char **psy_charger_stat;
: An array ending with NULL that has power-supply-class names of
`char **psy_charger_stat;`
An array ending with NULL that has power-supply-class names of
chargers. Each power-supply-class should provide "PRESENT" (if
battery_present is "CM_CHARGER_STAT"), "ONLINE" (shows whether an
external power source is attached or not), and "STATUS" (shows whether
the battery is {"FULL" or not FULL} or {"FULL", "Charging",
"Discharging", "NotCharging"}).
int num_charger_regulators;
struct regulator_bulk_data *charger_regulators;
: Regulators representing the chargers in the form for
`int num_charger_regulators; / struct regulator_bulk_data *charger_regulators;`
Regulators representing the chargers in the form for
regulator framework's bulk functions.
char *psy_fuel_gauge;
: Power-supply-class name of the fuel gauge.
`char *psy_fuel_gauge;`
Power-supply-class name of the fuel gauge.
int (*temperature_out_of_range)(int *mC);
bool measure_battery_temp;
: This callback returns 0 if the temperature is safe for charging,
`int (*temperature_out_of_range)(int *mC); / bool measure_battery_temp;`
This callback returns 0 if the temperature is safe for charging,
a positive number if it is too hot to charge, and a negative number
if it is too cold to charge. With the variable mC, the callback returns
the temperature in 1/1000 of centigrade.
The source of temperature can be battery or ambient one according to
the value of measure_battery_temp.
};
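The return convention of temperature_out_of_range() described above (0 when safe, positive when too hot, negative when too cold, temperature reported through mC in thousandths of a degree) can be sketched as follows. The thresholds and the fake sensor variable are invented for this example; a real callback would read a fuel gauge or thermistor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical charging window, in 1/1000 of a degree centigrade. */
#define EXAMPLE_MIN_CHARGE_MC	0	/* 0 degrees C */
#define EXAMPLE_MAX_CHARGE_MC	45000	/* 45 degrees C */

/* Stand-in for a real sensor reading. */
int fake_sensor_mC = 25000;

/*
 * Report the temperature through *mC and classify it:
 * 0 = safe to charge, > 0 = too hot, < 0 = too cold.
 */
int example_temperature_out_of_range(int *mC)
{
	assert(mC != NULL);
	*mC = fake_sensor_mC;

	if (*mC > EXAMPLE_MAX_CHARGE_MC)
		return 1;
	if (*mC < EXAMPLE_MIN_CHARGE_MC)
		return -1;
	return 0;
}
```

The callback always stores the measured value in *mC, so the caller can log the temperature even when charging has to stop.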
5. Notify Charger-Manager of charger events: cm_notify_event()
=========================================================
==============================================================
If a charger event needs to be reported to Charger Manager, the charger
device driver that triggers the event can call
cm_notify_event(psy, type, msg) to notify the corresponding Charger Manager.

@@ -1,7 +1,11 @@
====================================================
Testing suspend and resume support in device drivers
====================================================
(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL
1. Preparing the test system
============================
Unfortunately, to effectively test the support for the system-wide suspend and
resume transitions in a driver, it is necessary to suspend and resume a fully
@@ -14,19 +18,20 @@ the machine's BIOS.
Of course, for this purpose the test system has to be known to suspend and
resume without the driver being tested. Thus, if possible, you should first
resolve all suspend/resume-related problems in the test system before you start
testing the new driver. Please see Documentation/power/basic-pm-debugging.txt
testing the new driver. Please see Documentation/power/basic-pm-debugging.rst
for more information about the debugging of suspend/resume functionality.
2. Testing the driver
=====================
Once you have resolved the suspend/resume-related problems with your test system
without the new driver, you are ready to test it:
a) Build the driver as a module, load it and try the test modes of hibernation
(see: Documentation/power/basic-pm-debugging.txt, 1).
(see: Documentation/power/basic-pm-debugging.rst, 1).
b) Load the driver and attempt to hibernate in the "reboot", "shutdown" and
"platform" modes (see: Documentation/power/basic-pm-debugging.txt, 1).
"platform" modes (see: Documentation/power/basic-pm-debugging.rst, 1).
c) Compile the driver directly into the kernel and try the test modes of
hibernation.
@@ -34,12 +39,12 @@ c) Compile the driver directly into the kernel and try the test modes of
d) Attempt to hibernate with the driver compiled directly into the kernel
in the "reboot", "shutdown" and "platform" modes.
e) Try the test modes of suspend (see: Documentation/power/basic-pm-debugging.txt,
e) Try the test modes of suspend (see: Documentation/power/basic-pm-debugging.rst,
2). [As far as the STR tests are concerned, it should not matter whether or
not the driver is built as a module.]
f) Attempt to suspend to RAM using the s2ram tool with the driver loaded
(see: Documentation/power/basic-pm-debugging.txt, 2).
(see: Documentation/power/basic-pm-debugging.rst, 2).
Each of the above tests should be repeated several times and the STD tests
should be mixed with the STR tests. If any of them fails, the driver cannot be

@@ -1,6 +1,6 @@
====================
Energy Model of CPUs
====================
====================
Energy Model of CPUs
====================
1. Overview
-----------
@@ -20,7 +20,7 @@ kernel, hence enabling to avoid redundant work.
The figure below depicts an example of drivers (Arm-specific here, but the
approach is applicable to any architecture) providing power costs to the EM
framework, and interested clients reading the data from it.
framework, and interested clients reading the data from it::
+---------------+ +-----------------+ +---------------+
| Thermal (IPA) | | Scheduler (EAS) | | Other |
@@ -58,15 +58,17 @@ micro-architectures.
2. Core APIs
------------
2.1 Config options
2.1 Config options
^^^^^^^^^^^^^^^^^^
CONFIG_ENERGY_MODEL must be enabled to use the EM framework.
2.2 Registration of performance domains
2.2 Registration of performance domains
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Drivers are expected to register performance domains into the EM framework by
calling the following API:
calling the following API::
int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
struct em_data_callback *cb);
@@ -80,7 +82,8 @@ callback, and kernel/power/energy_model.c for further documentation on this
API.
2.3 Accessing performance domains
2.3 Accessing performance domains
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Subsystems interested in the energy model of a CPU can retrieve it using the
em_cpu_get() API. The energy model tables are allocated once upon creation of
@@ -99,46 +102,46 @@ More details about the above APIs can be found in include/linux/energy_model.h.
This section provides a simple example of a CPUFreq driver registering a
performance domain in the Energy Model framework using the (fake) 'foo'
protocol. The driver implements an est_power() function to be provided to the
EM framework.
EM framework::
-> drivers/cpufreq/foo_cpufreq.c
-> drivers/cpufreq/foo_cpufreq.c
01 static int est_power(unsigned long *mW, unsigned long *KHz, int cpu)
02 {
03 long freq, power;
04
05 /* Use the 'foo' protocol to ceil the frequency */
06 freq = foo_get_freq_ceil(cpu, *KHz);
07 if (freq < 0);
08 return freq;
09
10 /* Estimate the power cost for the CPU at the relevant freq. */
11 power = foo_estimate_power(cpu, freq);
12 if (power < 0);
13 return power;
14
15 /* Return the values to the EM framework */
16 *mW = power;
17 *KHz = freq;
18
19 return 0;
20 }
21
22 static int foo_cpufreq_init(struct cpufreq_policy *policy)
23 {
24 struct em_data_callback em_cb = EM_DATA_CB(est_power);
25 int nr_opp, ret;
26
27 /* Do the actual CPUFreq init work ... */
28 ret = do_foo_cpufreq_init(policy);
29 if (ret)
30 return ret;
31
32 /* Find the number of OPPs for this policy */
33 nr_opp = foo_get_nr_opp(policy);
34
35 /* And register the new performance domain */
36 em_register_perf_domain(policy->cpus, nr_opp, &em_cb);
37
38 return 0;
39 }
01 static int est_power(unsigned long *mW, unsigned long *KHz, int cpu)
02 {
03 long freq, power;
04
05 /* Use the 'foo' protocol to ceil the frequency */
06 freq = foo_get_freq_ceil(cpu, *KHz);
07 if (freq < 0)
08 return freq;
09
10 /* Estimate the power cost for the CPU at the relevant freq. */
11 power = foo_estimate_power(cpu, freq);
12 if (power < 0)
13 return power;
14
15 /* Return the values to the EM framework */
16 *mW = power;
17 *KHz = freq;
18
19 return 0;
20 }
21
22 static int foo_cpufreq_init(struct cpufreq_policy *policy)
23 {
24 struct em_data_callback em_cb = EM_DATA_CB(est_power);
25 int nr_opp, ret;
26
27 /* Do the actual CPUFreq init work ... */
28 ret = do_foo_cpufreq_init(policy);
29 if (ret)
30 return ret;
31
32 /* Find the number of OPPs for this policy */
33 nr_opp = foo_get_nr_opp(policy);
34
35 /* And register the new performance domain */
36 em_register_perf_domain(policy->cpus, nr_opp, &em_cb);
37
38 return 0;
39 }

@@ -1,13 +1,18 @@
=================
Freezing of tasks
(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL
=================
(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL
I. What is the freezing of tasks?
=================================
The freezing of tasks is a mechanism by which user space processes and some
kernel threads are controlled during hibernation or system-wide suspend (on some
architectures).
II. How does it work?
=====================
There are three per-task flags used for that, PF_NOFREEZE, PF_FROZEN
and PF_FREEZER_SKIP (the last one is auxiliary). The tasks that have
@@ -41,7 +46,7 @@ explicitly in suitable places or use the wait_event_freezable() or
wait_event_freezable_timeout() macros (defined in include/linux/freezer.h)
that combine interruptible sleep with checking if the task is to be frozen and
calling try_to_freeze(). The main loop of a freezable kernel thread may look
like the following one:
like the following one::
set_freezable();
do {
@@ -65,7 +70,7 @@ order to clear the PF_FROZEN flag for each frozen task. Then, the tasks that
have been frozen leave __refrigerator() and continue running.
Rationale behind the functions dealing with freezing and thawing of tasks:
Rationale behind the functions dealing with freezing and thawing of tasks
-------------------------------------------------------------------------
freeze_processes():
@@ -86,6 +91,7 @@ thaw_processes():
III. Which kernel threads are freezable?
========================================
Kernel threads are not freezable by default. However, a kernel thread may clear
PF_NOFREEZE for itself by calling set_freezable() (the resetting of PF_NOFREEZE
@@ -93,37 +99,39 @@ directly is not allowed). From this point it is regarded as freezable
and must call try_to_freeze() in a suitable place.
IV. Why do we do that?
======================
Generally speaking, there are a couple of reasons to use the freezing of tasks:
1. The principal reason is to prevent filesystems from being damaged after
hibernation. At the moment we have no simple means of checkpointing
filesystems, so if there are any modifications made to filesystem data and/or
metadata on disks, we cannot bring them back to the state from before the
modifications. At the same time each hibernation image contains some
filesystem-related information that must be consistent with the state of the
on-disk data and metadata after the system memory state has been restored from
the image (otherwise the filesystems will be damaged in a nasty way, usually
making them almost impossible to repair). We therefore freeze tasks that might
cause the on-disk filesystems' data and metadata to be modified after the
hibernation image has been created and before the system is finally powered off.
The majority of these are user space processes, but if any of the kernel threads
may cause something like this to happen, they have to be freezable.
hibernation. At the moment we have no simple means of checkpointing
filesystems, so if there are any modifications made to filesystem data and/or
metadata on disks, we cannot bring them back to the state from before the
modifications. At the same time each hibernation image contains some
filesystem-related information that must be consistent with the state of the
on-disk data and metadata after the system memory state has been restored
from the image (otherwise the filesystems will be damaged in a nasty way,
usually making them almost impossible to repair). We therefore freeze
tasks that might cause the on-disk filesystems' data and metadata to be
modified after the hibernation image has been created and before the
system is finally powered off. The majority of these are user space
processes, but if any of the kernel threads may cause something like this
to happen, they have to be freezable.
2. Next, to create the hibernation image we need to free a sufficient amount of
memory (approximately 50% of available RAM) and we need to do that before
devices are deactivated, because we generally need them for swapping out. Then,
after the memory for the image has been freed, we don't want tasks to allocate
additional memory and we prevent them from doing that by freezing them earlier.
[Of course, this also means that device drivers should not allocate substantial
amounts of memory from their .suspend() callbacks before hibernation, but this
is a separate issue.]
memory (approximately 50% of available RAM) and we need to do that before
devices are deactivated, because we generally need them for swapping out.
Then, after the memory for the image has been freed, we don't want tasks
to allocate additional memory and we prevent them from doing that by
freezing them earlier. [Of course, this also means that device drivers
should not allocate substantial amounts of memory from their .suspend()
callbacks before hibernation, but this is a separate issue.]
3. The third reason is to prevent user space processes and some kernel threads
from interfering with the suspending and resuming of devices. A user space
process running on a second CPU while we are suspending devices may, for
example, be troublesome and without the freezing of tasks we would need some
safeguards against race conditions that might occur in such a case.
from interfering with the suspending and resuming of devices. A user space
process running on a second CPU while we are suspending devices may, for
example, be troublesome and without the freezing of tasks we would need some
safeguards against race conditions that might occur in such a case.
Although Linus Torvalds doesn't like the freezing of tasks, he said this in one
of the discussions on LKML (http://lkml.org/lkml/2007/4/27/608):
@@ -132,7 +140,7 @@ of the discussions on LKML (http://lkml.org/lkml/2007/4/27/608):
Linus: In many ways, 'at all'.
I _do_ realize the IO request queue issues, and that we cannot actually do
I **do** realize the IO request queue issues, and that we cannot actually do
s2ram with some devices in the middle of a DMA. So we want to be able to
avoid *that*, there's no question about that. And I suspect that stopping
user threads and then waiting for a sync is practically one of the easier
@@ -150,17 +158,18 @@ thawed after the driver's .resume() callback has run, so it won't be accessing
the device while it's suspended.
4. Another reason for freezing tasks is to prevent user space processes from
realizing that hibernation (or suspend) operation takes place. Ideally, user
space processes should not notice that such a system-wide operation has occurred
and should continue running without any problems after the restore (or resume
from suspend). Unfortunately, in the most general case this is quite difficult
to achieve without the freezing of tasks. Consider, for example, a process
that depends on all CPUs being online while it's running. Since we need to
disable nonboot CPUs during the hibernation, if this process is not frozen, it
may notice that the number of CPUs has changed and may start to work incorrectly
because of that.
realizing that hibernation (or suspend) operation takes place. Ideally, user
space processes should not notice that such a system-wide operation has
occurred and should continue running without any problems after the restore
(or resume from suspend). Unfortunately, in the most general case this
is quite difficult to achieve without the freezing of tasks. Consider,
for example, a process that depends on all CPUs being online while it's
running. Since we need to disable nonboot CPUs during the hibernation,
if this process is not frozen, it may notice that the number of CPUs has
changed and may start to work incorrectly because of that.
V. Are there any problems related to the freezing of tasks?
===========================================================
Yes, there are.
@@ -172,11 +181,12 @@ may be undesirable. That's why kernel threads are not freezable by default.
Second, there are the following two problems related to the freezing of user
space processes:
1. Putting processes into an uninterruptible sleep distorts the load average.
2. Now that we have FUSE, plus the framework for doing device drivers in
userspace, it gets even more complicated because some userspace processes are
now doing the sorts of things that kernel threads do
(https://lists.linux-foundation.org/pipermail/linux-pm/2007-May/012309.html).
userspace, it gets even more complicated because some userspace processes are
now doing the sorts of things that kernel threads do
(https://lists.linux-foundation.org/pipermail/linux-pm/2007-May/012309.html).
The problem 1. seems to be fixable, although it hasn't been fixed so far. The
other one is more serious, but it seems that we can work around it by using
@@ -201,6 +211,7 @@ requested early enough using the suspend notifier API described in
Documentation/driver-api/pm/notifiers.rst.
VI. Are there any precautions to be taken to prevent freezing failures?
=======================================================================
Yes, there are.
@@ -226,6 +237,8 @@ So, to summarize, use [un]lock_system_sleep() instead of directly using
mutex_[un]lock(&system_transition_mutex). That would prevent freezing failures.
VII. Miscellaneous
==================
/sys/power/pm_freeze_timeout sets the maximum time allowed for freezing all
user space processes or all freezable kernel threads, in milliseconds.
The default value is 20000, and any unsigned integer value is accepted.
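As a rough user-space sketch of consuming this value, the text read from
/sys/power/pm_freeze_timeout can be validated as an unsigned integer number of
milliseconds. The helper name parse_freeze_timeout_ms() is hypothetical (it is
not a kernel or libc interface), and reading the file itself is left to the
caller.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse the text read from /sys/power/pm_freeze_timeout into milliseconds.
 * Returns 0 on success, -1 if the text is not an unsigned integer in range.
 * parse_freeze_timeout_ms() is a hypothetical helper, not a kernel API.
 */
static int parse_freeze_timeout_ms(const char *text, unsigned int *ms)
{
	char *end;
	unsigned long val;

	if (text[0] < '0' || text[0] > '9')
		return -1;	/* reject empty input, signs and whitespace */

	errno = 0;
	val = strtoul(text, &end, 10);
	if (errno || val > UINT_MAX)
		return -1;	/* outside the unsigned integer range */
	if (*end != '\0' && *end != '\n')
		return -1;	/* trailing garbage after the number */

	*ms = (unsigned int)val;
	return 0;
}
```

Passing the string "20000\n" (the documented default followed by a newline)
yields 20000 in *ms.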

@@ -0,0 +1,46 @@
:orphan:
================
Power Management
================
.. toctree::
:maxdepth: 1
apm-acpi
basic-pm-debugging
charger-manager
drivers-testing
energy-model
freezing-of-tasks
interface
opp
pci
pm_qos_interface
power_supply_class
runtime_pm
s2ram
suspend-and-cpuhotplug
suspend-and-interrupts
swsusp-and-swap-files
swsusp-dmcrypt
swsusp
video
tricks
userland-swsusp
powercap/powercap
regulator/consumer
regulator/design
regulator/machine
regulator/overview
regulator/regulator
.. only:: subproject and html
Indices
=======
* :ref:`genindex`

@@ -1,4 +1,6 @@
===========================================
Power Management Interface for System Sleep
===========================================
Copyright (c) 2016 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
@@ -11,10 +13,10 @@ mounted at /sys).
Reading from it returns a list of supported sleep states, encoded as:
'freeze' (Suspend-to-Idle)
'standby' (Power-On Suspend)
'mem' (Suspend-to-RAM)
'disk' (Suspend-to-Disk)
- 'freeze' (Suspend-to-Idle)
- 'standby' (Power-On Suspend)
- 'mem' (Suspend-to-RAM)
- 'disk' (Suspend-to-Disk)
Suspend-to-Idle is always supported. Suspend-to-Disk is always supported
too as long as the kernel has been configured to support hibernation at all
@@ -32,18 +34,18 @@ Specifically, it tells the kernel what to do after creating a hibernation image.
Reading from it returns a list of supported options encoded as:
'platform' (put the system into sleep using a platform-provided method)
'shutdown' (shut the system down)
'reboot' (reboot the system)
'suspend' (trigger a Suspend-to-RAM transition)
'test_resume' (resume-after-hibernation test mode)
- 'platform' (put the system into sleep using a platform-provided method)
- 'shutdown' (shut the system down)
- 'reboot' (reboot the system)
- 'suspend' (trigger a Suspend-to-RAM transition)
- 'test_resume' (resume-after-hibernation test mode)
The currently selected option is printed in square brackets.
The 'platform' option is only available if the platform provides a special
mechanism to put the system to sleep after creating a hibernation image (ACPI
does that, for example). The 'suspend' option is available if Suspend-to-RAM
is supported. Refer to Documentation/power/basic-pm-debugging.txt for the
is supported. Refer to Documentation/power/basic-pm-debugging.rst for the
description of the 'test_resume' option.
To select an option, write the string representing it to /sys/power/disk.
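Since the selected option is marked with square brackets, a user-space tool can
pick it out of the line read from /sys/power/disk. The following is a minimal
sketch; selected_option() is a hypothetical helper and the sample line in the
usage note is made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (currently selected) option from a line such as
 * "platform [shutdown] reboot suspend test_resume" into out.
 * Returns 0 on success, -1 if there is no complete, non-empty "[...]"
 * token or if it does not fit into out (including the terminating NUL).
 */
static int selected_option(const char *line, char *out, size_t outlen)
{
	const char *lb = strchr(line, '[');
	const char *rb;
	size_t len;

	if (!lb)
		return -1;
	rb = strchr(lb + 1, ']');
	if (!rb)
		return -1;

	len = (size_t)(rb - lb - 1);
	if (len == 0 || len >= outlen)
		return -1;

	memcpy(out, lb + 1, len);
	out[len] = '\0';
	return 0;
}
```

For the line "platform [shutdown] reboot suspend test_resume" the helper stores
"shutdown" in out.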
@@ -71,7 +73,7 @@ If /sys/power/pm_trace contains '1', the fingerprint of each suspend/resume
event point in turn will be stored in the RTC memory (overwriting the actual
RTC information), so it will survive a system crash if one occurs right after
storing it and it can be used later to identify the driver that caused the crash
to happen (see Documentation/power/s2ram.txt for more information).
to happen (see Documentation/power/s2ram.rst for more information).
Initially it contains '0' which may be changed to '1' by writing a string
representing a nonzero integer into it.

@@ -1,20 +1,23 @@
==========================================
Operating Performance Points (OPP) Library
==========================================
(C) 2009-2010 Nishanth Menon <nm@ti.com>, Texas Instruments Incorporated
Contents
--------
1. Introduction
2. Initial OPP List Registration
3. OPP Search Functions
4. OPP Availability Control Functions
5. OPP Data Retrieval Functions
6. Data Structures
.. Contents
1. Introduction
2. Initial OPP List Registration
3. OPP Search Functions
4. OPP Availability Control Functions
5. OPP Data Retrieval Functions
6. Data Structures
1. Introduction
===============
1.1 What is an Operating Performance Point (OPP)?
-------------------------------------------------
Complex SoCs of today consist of multiple sub-modules working in conjunction.
In an operational system executing varied use cases, not all modules in the SoC
@@ -28,16 +31,19 @@ the device will support per domain are called Operating Performance Points or
OPPs.
As an example:
Let us consider an MPU device which supports the following:
{300MHz at minimum voltage of 1V}, {800MHz at minimum voltage of 1.2V},
{1GHz at minimum voltage of 1.3V}
We can represent these as three OPPs as the following {Hz, uV} tuples:
{300000000, 1000000}
{800000000, 1200000}
{1000000000, 1300000}
- {300000000, 1000000}
- {800000000, 1200000}
- {1000000000, 1300000}
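The conversion from the MHz/V figures to {Hz, uV} tuples can be sketched as a
small C table. The struct opp_tuple type, the two conversion macros and the
mpu_opps[] array are illustrative names only; they are not the OPP library's
own data structures, which are internal to the library.

```c
#include <stddef.h>

/* Illustrative model of one {Hz, uV} tuple; not the kernel's struct dev_pm_opp. */
struct opp_tuple {
	unsigned long freq_hz;	/* frequency in Hz */
	unsigned long volt_uv;	/* minimum voltage in microvolts */
};

#define MHZ_TO_HZ(mhz)	((mhz) * 1000000UL)
#define MV_TO_UV(mv)	((mv) * 1000UL)

/* The three MPU OPPs from the example above. */
static const struct opp_tuple mpu_opps[] = {
	{ MHZ_TO_HZ(300),  MV_TO_UV(1000) },	/* 300MHz at 1V   */
	{ MHZ_TO_HZ(800),  MV_TO_UV(1200) },	/* 800MHz at 1.2V */
	{ MHZ_TO_HZ(1000), MV_TO_UV(1300) },	/* 1GHz at 1.3V   */
};

static const size_t mpu_opp_count = sizeof(mpu_opps) / sizeof(mpu_opps[0]);
```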
1.2 Operating Performance Points Library
----------------------------------------
OPP library provides a set of helper functions to organize and query the OPP
information. The library is located in drivers/base/power/opp.c and the header
@@ -46,9 +52,10 @@ CONFIG_PM_OPP from power management menuconfig menu. OPP library depends on
CONFIG_PM as certain SoCs such as Texas Instruments' OMAP framework allow
optionally booting at a certain OPP without needing cpufreq.
Typical usage of the OPP library is as follows:
(users) -> registers a set of default OPPs -> (library)
SoC framework -> modifies on required cases certain OPPs -> OPP layer
Typical usage of the OPP library is as follows::
(users) -> registers a set of default OPPs -> (library)
SoC framework -> modifies on required cases certain OPPs -> OPP layer
-> queries to search/retrieve information ->
OPP layer expects each domain to be represented by a unique device pointer. SoC
@@ -57,8 +64,9 @@ list is expected to be an optimally small number typically around 5 per device.
This initial list contains a set of OPPs that the framework expects to be safely
enabled by default in the system.
Note on OPP Availability:
------------------------
Note on OPP Availability
^^^^^^^^^^^^^^^^^^^^^^^^
As the system proceeds to operate, SoC framework may choose to make certain
OPPs available or not available on each device based on various external
factors. Example usage: Thermal management or other exceptional situations where
@@ -88,7 +96,8 @@ registering the OPPs is maintained by OPP library throughout the device
operation. The SoC framework can subsequently control the availability of the
OPPs dynamically using the dev_pm_opp_enable / disable functions.
dev_pm_opp_add - Add a new OPP for a specific domain represented by the device pointer.
dev_pm_opp_add
Add a new OPP for a specific domain represented by the device pointer.
The OPP is defined using the frequency and voltage. Once added, the OPP
is assumed to be available and control of its availability can be done
with the dev_pm_opp_enable/disable functions. OPP library internally stores
@@ -96,9 +105,11 @@ dev_pm_opp_add - Add a new OPP for a specific domain represented by the device p
used by SoC framework to define an optimal list as per the demands of
SoC usage environment.
WARNING: Do not use this function in interrupt context.
WARNING:
Do not use this function in interrupt context.
Example::
Example:
soc_pm_init()
{
/* Do things */
@@ -125,12 +136,15 @@ Callers of these functions shall call dev_pm_opp_put() after they have used the
OPP. Otherwise the memory for the OPP will never be freed, resulting in a
memory leak.
dev_pm_opp_find_freq_exact - Search for an OPP based on an *exact* frequency and
dev_pm_opp_find_freq_exact
Search for an OPP based on an *exact* frequency and
availability. This function is especially useful to enable an OPP which
is not available by default.
Example: In a case when SoC framework detects a situation where a
higher frequency could be made available, it can use this function to
find the OPP prior to call the dev_pm_opp_enable to actually make it available.
find the OPP prior to call the dev_pm_opp_enable to actually make
it available::
opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false);
dev_pm_opp_put(opp);
/* don't operate on the pointer; just do a sanity check */
@@ -141,27 +155,34 @@ dev_pm_opp_find_freq_exact - Search for an OPP based on an *exact* frequency and
dev_pm_opp_enable(dev,1000000000);
}
NOTE: This is the only search function that operates on OPPs which are
not available.
NOTE:
This is the only search function that operates on OPPs which are
not available.
dev_pm_opp_find_freq_floor - Search for an available OPP which is *at most* the
dev_pm_opp_find_freq_floor
Search for an available OPP which is *at most* the
provided frequency. This function is useful while searching for a lesser
match OR operating on OPP information in the order of decreasing
frequency.
Example: To find the highest opp for a device:
Example: To find the highest opp for a device::
freq = ULONG_MAX;
opp = dev_pm_opp_find_freq_floor(dev, &freq);
dev_pm_opp_put(opp);
dev_pm_opp_find_freq_ceil - Search for an available OPP which is *at least* the
dev_pm_opp_find_freq_ceil
Search for an available OPP which is *at least* the
provided frequency. This function is useful while searching for a
higher match OR operating on OPP information in the order of increasing
frequency.
Example 1: To find the lowest opp for a device:
Example 1: To find the lowest opp for a device::
freq = 0;
opp = dev_pm_opp_find_freq_ceil(dev, &freq);
dev_pm_opp_put(opp);
Example 2: A simplified implementation of a SoC cpufreq_driver->target:
Example 2: A simplified implementation of a SoC cpufreq_driver->target::
soc_cpufreq_target(..)
{
/* Do stuff like policy checks etc. */
@@ -184,12 +205,15 @@ fine grained dynamic control of which sets of OPPs are operationally available.
These functions are intended to *temporarily* remove an OPP in conditions such
as thermal considerations (e.g. don't use OPPx until the temperature drops).
WARNING: Do not use these functions in interrupt context.
WARNING:
Do not use these functions in interrupt context.
dev_pm_opp_enable - Make a OPP available for operation.
dev_pm_opp_enable
Make a OPP available for operation.
Example: Let's say that the 1GHz OPP is to be made available only if the
SoC temperature is lower than a certain threshold. The SoC framework
implementation might choose to do something as follows:
implementation might choose to do something as follows::
if (cur_temp < temp_low_thresh) {
/* Enable 1GHz if it was disabled */
opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false);
@@ -201,10 +225,12 @@ dev_pm_opp_enable - Make a OPP available for operation.
goto try_something_else;
}
dev_pm_opp_disable - Make an OPP to be not available for operation
dev_pm_opp_disable
Make an OPP to be not available for operation
Example: Let's say that the 1GHz OPP is to be disabled if the temperature
exceeds a threshold value. The SoC framework implementation might
choose to do something as follows:
choose to do something as follows::
if (cur_temp > temp_high_thresh) {
/* Disable 1GHz if it was enabled */
opp = dev_pm_opp_find_freq_exact(dev, 1000000000, true);
@@ -223,11 +249,13 @@ information from the OPP structure is necessary. Once an OPP pointer is
retrieved using the search functions, the following functions can be used by SoC
framework to retrieve the information represented inside the OPP layer.
dev_pm_opp_get_voltage - Retrieve the voltage represented by the opp pointer.
dev_pm_opp_get_voltage
Retrieve the voltage represented by the opp pointer.
Example: At a cpufreq transition to a different frequency, the SoC
framework needs to set the voltage represented by the OPP, using
the regulator framework, on the Power Management chip providing the
voltage.
voltage::
soc_switch_to_freq_voltage(freq)
{
/* do things */
@@ -239,10 +267,12 @@ dev_pm_opp_get_voltage - Retrieve the voltage represented by the opp pointer.
/* do other things */
}
dev_pm_opp_get_freq - Retrieve the freq represented by the opp pointer.
dev_pm_opp_get_freq
Retrieve the freq represented by the opp pointer.
Example: Let's say the SoC framework uses a couple of helper functions;
we could pass opp pointers instead of doing additional parameters to
handle quiet a bit of data parameters.
handle quiet a bit of data parameters::
soc_cpufreq_target(..)
{
/* do things.. */
@@ -264,9 +294,11 @@ dev_pm_opp_get_freq - Retrieve the freq represented by the opp pointer.
/* do things.. */
}
dev_pm_opp_get_opp_count - Retrieve the number of available opps for a device
dev_pm_opp_get_opp_count
Retrieve the number of available opps for a device
Example: Let's say a co-processor in the SoC needs to know the available
frequencies in a table, the main processor can notify as following:
frequencies in a table, the main processor can notify as following::
soc_notify_coproc_available_frequencies()
{
/* Do things */
@@ -289,54 +321,59 @@ dev_pm_opp_get_opp_count - Retrieve the number of available opps for a device
==================
Typically an SoC contains multiple voltage domains which are variable. Each
domain is represented by a device pointer. The relationship to OPP can be
represented as follows:
SoC
|- device 1
| |- opp 1 (availability, freq, voltage)
| |- opp 2 ..
... ...
| `- opp n ..
|- device 2
...
`- device m
represented as follows::
SoC
|- device 1
| |- opp 1 (availability, freq, voltage)
| |- opp 2 ..
... ...
| `- opp n ..
|- device 2
...
`- device m
The OPP library maintains an internal list that the SoC framework populates and
that is accessed by various functions as described above. However, the structures
representing the actual OPPs and domains are internal to the OPP library itself
to allow for suitable abstraction reusable across systems.
struct dev_pm_opp - The internal data structure of OPP library which is used to
struct dev_pm_opp
The internal data structure of OPP library which is used to
represent an OPP. In addition to the freq, voltage, availability
information, it also contains internal book keeping information required
for the OPP library to operate on. Pointer to this structure is
provided back to the users such as SoC framework to be used as an
identifier for OPP in the interactions with OPP layer.
WARNING: The struct dev_pm_opp pointer should not be parsed or modified by the
users. The defaults of for an instance is populated by dev_pm_opp_add, but the
availability of the OPP can be modified by dev_pm_opp_enable/disable functions.
WARNING:
The struct dev_pm_opp pointer should not be parsed or modified by the
users. The defaults of for an instance is populated by
dev_pm_opp_add, but the availability of the OPP can be modified
by dev_pm_opp_enable/disable functions.
struct device - This is used to identify a domain to the OPP layer. The
struct device
This is used to identify a domain to the OPP layer. The
nature of the device and its implementation is left to the user of
OPP library such as the SoC framework.
Overall, in a simplistic view, the data structure operations are represented as
following:
following::
Initialization / modification:
+-----+ /- dev_pm_opp_enable
dev_pm_opp_add --> | opp | <-------
| +-----+ \- dev_pm_opp_disable
\-------> domain_info(device)
Initialization / modification:
+-----+ /- dev_pm_opp_enable
dev_pm_opp_add --> | opp | <-------
| +-----+ \- dev_pm_opp_disable
\-------> domain_info(device)
Search functions:
/-- dev_pm_opp_find_freq_ceil ---\ +-----+
domain_info<---- dev_pm_opp_find_freq_exact -----> | opp |
\-- dev_pm_opp_find_freq_floor ---/ +-----+
Search functions:
/-- dev_pm_opp_find_freq_ceil ---\ +-----+
domain_info<---- dev_pm_opp_find_freq_exact -----> | opp |
\-- dev_pm_opp_find_freq_floor ---/ +-----+
Retrieval functions:
+-----+ /- dev_pm_opp_get_voltage
| opp | <---
+-----+ \- dev_pm_opp_get_freq
Retrieval functions:
+-----+ /- dev_pm_opp_get_voltage
| opp | <---
+-----+ \- dev_pm_opp_get_freq
domain_info <- dev_pm_opp_get_opp_count
domain_info <- dev_pm_opp_get_opp_count
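The search semantics described in this document (floor, ceil and exact lookups
over a per-device list with an availability flag) can be modeled in user space
as follows. This is only a sketch under stated assumptions: struct model_opp
and the model_find_freq_*() helpers are hypothetical names, the table is a
plain array rather than the library's internal list, and a failed lookup
returns NULL here rather than whatever error value the real functions use.

```c
#include <stdbool.h>
#include <stddef.h>

/* User-space model only; the kernel's struct dev_pm_opp is internal to the library. */
struct model_opp {
	unsigned long freq_hz;
	unsigned long volt_uv;
	bool available;
};

/* Highest available OPP whose frequency is at most *freq; updates *freq. */
static struct model_opp *model_find_freq_floor(struct model_opp *tbl, size_t n,
					       unsigned long *freq)
{
	struct model_opp *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!tbl[i].available || tbl[i].freq_hz > *freq)
			continue;
		if (!best || tbl[i].freq_hz > best->freq_hz)
			best = &tbl[i];
	}
	if (best)
		*freq = best->freq_hz;
	return best;
}

/* Lowest available OPP whose frequency is at least *freq; updates *freq. */
static struct model_opp *model_find_freq_ceil(struct model_opp *tbl, size_t n,
					      unsigned long *freq)
{
	struct model_opp *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!tbl[i].available || tbl[i].freq_hz < *freq)
			continue;
		if (!best || tbl[i].freq_hz < best->freq_hz)
			best = &tbl[i];
	}
	if (best)
		*freq = best->freq_hz;
	return best;
}

/* Exact frequency match with the requested availability state. */
static struct model_opp *model_find_freq_exact(struct model_opp *tbl, size_t n,
					       unsigned long freq, bool available)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tbl[i].freq_hz == freq && tbl[i].available == available)
			return &tbl[i];
	return NULL;
}
```

In this model, only the exact lookup can return an entry that is not available,
which mirrors the note that dev_pm_opp_find_freq_exact is the only search
function operating on OPPs which are not available.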

@@ -1,4 +1,6 @@
====================
PCI Power Management
====================
Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
@@ -9,14 +11,14 @@ management. Based on previous work by Patrick Mochel <mochel@transmeta.com>
This document only covers the aspects of power management specific to PCI
devices. For a general description of the kernel's interfaces related to device
power management refer to Documentation/driver-api/pm/devices.rst and
Documentation/power/runtime_pm.txt.
Documentation/power/runtime_pm.rst.
---------------------------------------------------------------------------
.. contents:
1. Hardware and Platform Support for PCI Power Management
2. PCI Subsystem and Device Power Management
3. PCI Device Drivers and Power Management
4. Resources
1. Hardware and Platform Support for PCI Power Management
2. PCI Subsystem and Device Power Management
3. PCI Device Drivers and Power Management
4. Resources
1. Hardware and Platform Support for PCI Power Management
@@ -24,6 +26,7 @@ Documentation/power/runtime_pm.txt.
1.1. Native and Platform-Based Power Management
-----------------------------------------------
In general, power management is a feature allowing one to save energy by putting
devices into states in which they draw less power (low-power states) at the
price of reduced functionality or performance.
@@ -67,6 +70,7 @@ mechanisms have to be used simultaneously to obtain the desired result.
1.2. Native PCI Power Management
--------------------------------
The PCI Bus Power Management Interface Specification (PCI PM Spec) was
introduced between the PCI 2.1 and PCI 2.2 Specifications. It defined a
standard interface for performing various operations related to power
@@ -134,6 +138,7 @@ sufficiently active to generate a wakeup signal.
1.3. ACPI Device Power Management
---------------------------------
The platform firmware support for the power management of PCI devices is
system-specific. However, if the system in question is compliant with the
Advanced Configuration and Power Interface (ACPI) Specification, like the
@@ -194,6 +199,7 @@ enabled for the device to be able to generate wakeup signals.
1.4. Wakeup Signaling
---------------------
Wakeup signals generated by PCI devices, either as native PCI PMEs, or as
a result of the execution of the _DSW (or _PSW) ACPI control method before
putting the device into a low-power state, have to be caught and handled as
@@ -265,14 +271,15 @@ the native PCI Express PME signaling cannot be used by the kernel in that case.
2.1. Device Power Management Callbacks
--------------------------------------
The PCI Subsystem participates in the power management of PCI devices in a
number of ways. First of all, it provides an intermediate code layer between
the device power management core (PM core) and PCI device drivers.
Specifically, the pm field of the PCI subsystem's struct bus_type object,
pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing
pointers to several device power management callbacks:
pointers to several device power management callbacks::
const struct dev_pm_ops pci_dev_pm_ops = {
const struct dev_pm_ops pci_dev_pm_ops = {
.prepare = pci_pm_prepare,
.complete = pci_pm_complete,
.suspend = pci_pm_suspend,
@@ -290,7 +297,7 @@ const struct dev_pm_ops pci_dev_pm_ops = {
.runtime_suspend = pci_pm_runtime_suspend,
.runtime_resume = pci_pm_runtime_resume,
.runtime_idle = pci_pm_runtime_idle,
};
};
These callbacks are executed by the PM core in various situations related to
device power management and they, in turn, execute power management callbacks
@@ -299,9 +306,9 @@ involving some standard configuration registers of PCI devices that device
drivers need not know or care about.
The structure representing a PCI device, struct pci_dev, contains several fields
that these callbacks operate on:
that these callbacks operate on::
struct pci_dev {
struct pci_dev {
...
pci_power_t current_state; /* Current operating state. */
int pm_cap; /* PM capability offset in the
@@ -315,13 +322,14 @@ struct pci_dev {
unsigned int wakeup_prepared:1; /* Device prepared for wake up */
unsigned int d3_delay; /* D3->D0 transition time in ms */
...
};
};
They also indirectly use some fields of the struct device that is embedded in
struct pci_dev.
2.2. Device Initialization
--------------------------
The PCI subsystem's first task related to device power management is to
prepare the device for power management and initialize the fields of struct
pci_dev used for this purpose. This happens in two functions defined in
@@ -348,10 +356,11 @@ during system-wide transitions to a sleep state and back to the working state.
2.3. Runtime Device Power Management
------------------------------------
The PCI subsystem plays a vital role in the runtime power management of PCI
devices. For this purpose it uses the general runtime power management
(runtime PM) framework described in Documentation/power/runtime_pm.txt.
Namely, it provides subsystem-level callbacks:
(runtime PM) framework described in Documentation/power/runtime_pm.rst.
Namely, it provides subsystem-level callbacks::
pci_pm_runtime_suspend()
pci_pm_runtime_resume()
@@ -425,13 +434,14 @@ to the given subsystem before the next phase begins. These phases always run
after tasks have been frozen.
2.4.1. System Suspend
^^^^^^^^^^^^^^^^^^^^^
When the system is going into a sleep state in which the contents of memory will
be preserved, such as one of the ACPI sleep states S1-S3, the phases are:
prepare, suspend, suspend_noirq.
The following PCI bus type's callbacks, respectively, are used in these phases:
The following PCI bus type's callbacks, respectively, are used in these phases::
pci_pm_prepare()
pci_pm_suspend()
@@ -492,6 +502,7 @@ this purpose). PCI device drivers are not encouraged to do that, but in some
rare cases doing that in the driver may be the optimum approach.
2.4.2. System Resume
^^^^^^^^^^^^^^^^^^^^
When the system is undergoing a transition from a sleep state in which the
contents of memory have been preserved, such as one of the ACPI sleep states
@@ -500,7 +511,7 @@ S1-S3, into the working state (ACPI S0), the phases are:
resume_noirq, resume, complete.
The following PCI bus type's callbacks, respectively, are executed in these
phases:
phases::
pci_pm_resume_noirq()
pci_pm_resume()
@@ -539,6 +550,7 @@ The pci_pm_complete() routine only executes the device driver's pm->complete()
callback, if defined.
2.4.3. System Hibernation
^^^^^^^^^^^^^^^^^^^^^^^^^
System hibernation is more complicated than system suspend, because it requires
a system image to be created and written into a persistent storage medium. The
@@ -551,7 +563,7 @@ to be free) in the following three phases:
prepare, freeze, freeze_noirq
that correspond to the PCI bus type's callbacks:
that correspond to the PCI bus type's callbacks::
pci_pm_prepare()
pci_pm_freeze()
@@ -580,7 +592,7 @@ back to the fully functional state and this is done in the following phases:
thaw_noirq, thaw, complete
using the following PCI bus type's callbacks:
using the following PCI bus type's callbacks::
pci_pm_thaw_noirq()
pci_pm_thaw()
@@ -608,7 +620,7 @@ three phases:
where the prepare phase is exactly the same as for system suspend. The other
two phases are analogous to the suspend and suspend_noirq phases, respectively.
The PCI subsystem-level callbacks they correspond to
The PCI subsystem-level callbacks they correspond to::
pci_pm_poweroff()
pci_pm_poweroff_noirq()
@@ -618,6 +630,7 @@ although they don't attempt to save the device's standard configuration
registers.
2.4.4. System Restore
^^^^^^^^^^^^^^^^^^^^^
System restore requires a hibernation image to be loaded into memory and the
pre-hibernation memory contents to be restored before the pre-hibernation system
@@ -653,7 +666,7 @@ phases:
The first two of these are analogous to the resume_noirq and resume phases
described above, respectively, and correspond to the following PCI subsystem
callbacks:
callbacks::
pci_pm_restore_noirq()
pci_pm_restore()
@@ -671,6 +684,7 @@ resume.
3.1. Power Management Callbacks
-------------------------------
PCI device drivers participate in power management by providing callbacks to be
executed by the PCI subsystem's power management routines described above and by
controlling the runtime power management of their devices.
@@ -698,6 +712,7 @@ defined, though, they are expected to behave as described in the following
subsections.
3.1.1. prepare()
^^^^^^^^^^^^^^^^
The prepare() callback is executed during system suspend, during hibernation
(when a hibernation image is about to be created), during power-off after
@@ -716,6 +731,7 @@ preallocated earlier, for example in a suspend/hibernate notifier as described
in Documentation/driver-api/pm/notifiers.rst).
3.1.2. suspend()
^^^^^^^^^^^^^^^^
The suspend() callback is only executed during system suspend, after prepare()
callbacks have been executed for all devices in the system.
@@ -742,6 +758,7 @@ operations relying on the driver's ability to handle interrupts should be
carried out in this callback.
3.1.3. suspend_noirq()
^^^^^^^^^^^^^^^^^^^^^^
The suspend_noirq() callback is only executed during system suspend, after
suspend() callbacks have been executed for all devices in the system and
@@ -753,6 +770,7 @@ suspend_noirq() can carry out operations that would cause race conditions to
arise if they were performed in suspend().
3.1.4. freeze()
^^^^^^^^^^^^^^^
The freeze() callback is hibernation-specific and is executed in two situations,
during hibernation, after prepare() callbacks have been executed for all devices
@@ -770,6 +788,7 @@ or put it into a low-power state. Still, either it or freeze_noirq() should
save the device's standard configuration registers using pci_save_state().
3.1.5. freeze_noirq()
^^^^^^^^^^^^^^^^^^^^^
The freeze_noirq() callback is hibernation-specific. It is executed during
hibernation, after prepare() and freeze() callbacks have been executed for all
@@ -786,6 +805,7 @@ The difference between freeze_noirq() and freeze() is analogous to the
difference between suspend_noirq() and suspend().
3.1.6. poweroff()
^^^^^^^^^^^^^^^^^
The poweroff() callback is hibernation-specific. It is executed when the system
is about to be powered off after saving a hibernation image to a persistent
@@ -802,6 +822,7 @@ into a low-power state, respectively, but it need not save the device's standard
configuration registers.
3.1.7. poweroff_noirq()
^^^^^^^^^^^^^^^^^^^^^^^
The poweroff_noirq() callback is hibernation-specific. It is executed after
poweroff() callbacks have been executed for all devices in the system.
@@ -814,6 +835,7 @@ The difference between poweroff_noirq() and poweroff() is analogous to the
difference between suspend_noirq() and suspend().
3.1.8. resume_noirq()
^^^^^^^^^^^^^^^^^^^^^
The resume_noirq() callback is only executed during system resume, after the
PM core has enabled the non-boot CPUs. The driver's interrupt handler will not
@@ -827,6 +849,7 @@ it should only be used for performing operations that would lead to race
conditions if carried out by resume().
3.1.9. resume()
^^^^^^^^^^^^^^^
The resume() callback is only executed during system resume, after
resume_noirq() callbacks have been executed for all devices in the system and
@@ -837,6 +860,7 @@ device and bringing it back to the fully functional state. The device should be
able to process I/O in a usual way after resume() has returned.
3.1.10. thaw_noirq()
^^^^^^^^^^^^^^^^^^^^
The thaw_noirq() callback is hibernation-specific. It is executed after a
system image has been created and the non-boot CPUs have been enabled by the PM
@@ -851,6 +875,7 @@ freeze() and freeze_noirq(), so in general it does not need to modify the
contents of the device's registers.
3.1.11. thaw()
^^^^^^^^^^^^^^
The thaw() callback is hibernation-specific. It is executed after thaw_noirq()
callbacks have been executed for all devices in the system and after device
@@ -860,6 +885,7 @@ This callback is responsible for restoring the pre-freeze configuration of
the device, so that it will work in a usual way after thaw() has returned.
3.1.12. restore_noirq()
^^^^^^^^^^^^^^^^^^^^^^^
The restore_noirq() callback is hibernation-specific. It is executed in the
restore_noirq phase of hibernation, when the boot kernel has passed control to
@@ -875,6 +901,7 @@ For the vast majority of PCI device drivers there is no difference between
resume_noirq() and restore_noirq().
3.1.13. restore()
^^^^^^^^^^^^^^^^^
The restore() callback is hibernation-specific. It is executed after
restore_noirq() callbacks have been executed for all devices in the system and
@@ -888,14 +915,17 @@ For the vast majority of PCI device drivers there is no difference between
resume() and restore().
3.1.14. complete()
^^^^^^^^^^^^^^^^^^
The complete() callback is executed in the following situations:
- during system resume, after resume() callbacks have been executed for all
devices,
- during hibernation, before saving the system image, after thaw() callbacks
have been executed for all devices,
- during system restore, when the system is going back to its pre-hibernation
state, after restore() callbacks have been executed for all devices.
It also may be executed if the loading of a hibernation image into memory fails
(in that case it is run after thaw() callbacks have been executed for all
devices that have drivers in the boot kernel).
@@ -904,6 +934,7 @@ This callback is entirely optional, although it may be necessary if the
prepare() callback performs operations that need to be reversed.
3.1.15. runtime_suspend()
^^^^^^^^^^^^^^^^^^^^^^^^^
The runtime_suspend() callback is specific to device runtime power management
(runtime PM). It is executed by the PM core's runtime PM framework when the
@@ -915,6 +946,7 @@ put into a low-power state, but it must allow the PCI subsystem to perform all
of the PCI-specific actions necessary for suspending the device.
3.1.16. runtime_resume()
^^^^^^^^^^^^^^^^^^^^^^^^
The runtime_resume() callback is specific to device runtime PM. It is executed
by the PM core's runtime PM framework when the device is about to be resumed
@@ -927,6 +959,7 @@ The device is expected to be able to process I/O in the usual way after
runtime_resume() has returned.
3.1.17. runtime_idle()
^^^^^^^^^^^^^^^^^^^^^^
The runtime_idle() callback is specific to device runtime PM. It is executed
by the PM core's runtime PM framework whenever it may be desirable to suspend
@@ -939,6 +972,7 @@ PCI subsystem will call pm_runtime_suspend() for the device, which in turn will
cause the driver's runtime_suspend() callback to be executed.
3.1.18. Pointing Multiple Callback Pointers to One Routine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Although in principle each of the callbacks described in the previous
subsections can be defined as a separate function, it often is convenient to
@@ -962,6 +996,7 @@ dev_pm_ops to indicate that one suspend routine is to be pointed to by the
be pointed to by the .resume(), .thaw(), and .restore() members.
3.1.19. Driver Flags for Power Management
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The PM core allows device drivers to set flags that influence the handling of
power management for the devices by the core itself and by middle layer code
@@ -1007,6 +1042,7 @@ it.
3.2. Device Runtime Power Management
------------------------------------
In addition to providing device power management callbacks PCI device drivers
are responsible for controlling the runtime power management (runtime PM) of
their devices.
@@ -1073,22 +1109,27 @@ device the PM core automatically queues a request to check if the device is
idle), device drivers are generally responsible for queuing power management
requests for their devices. For this purpose they should use the runtime PM
helper functions provided by the PM core, discussed in
Documentation/power/runtime_pm.txt.
Documentation/power/runtime_pm.rst.
Devices can also be suspended and resumed synchronously, without placing a
request into pm_wq. In the majority of cases this also is done by their
drivers that use helper functions provided by the PM core for this purpose.
For more information on the runtime PM of devices refer to
Documentation/power/runtime_pm.txt.
Documentation/power/runtime_pm.rst.
4. Resources
============
PCI Local Bus Specification, Rev. 3.0
PCI Bus Power Management Interface Specification, Rev. 1.2
Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b
PCI Express Base Specification, Rev. 2.0
Documentation/driver-api/pm/devices.rst
Documentation/power/runtime_pm.txt
Documentation/power/runtime_pm.rst


@@ -1,4 +1,6 @@
PM Quality Of Service Interface.
===============================
PM Quality Of Service Interface
===============================
This interface provides a kernel and user mode interface for registering
performance expectations by drivers, subsystems and user space applications on
@@ -11,6 +13,7 @@ memory_bandwidth.
constraints and PM QoS flags.
Each parameter has defined units:
* latency: usec
* timeout: usec
* throughput: kbs (kilo bit / sec)
@@ -18,6 +21,7 @@ Each parameters have defined units:
1. PM QoS framework
===================
The infrastructure exposes multiple misc device nodes one per implemented
parameter. The set of parameters implemented is defined by pm_qos_power_init()
@@ -37,38 +41,39 @@ reading the aggregated value does not require any locking mechanism.
From kernel mode the use of this interface is simple:
void pm_qos_add_request(handle, param_class, target_value):
Will insert an element into the list for that identified PM QoS class with the
target value. Upon change to this list the new target is recomputed and any
registered notifiers are called only if the target value is now different.
Clients of pm_qos need to save the returned handle for future use in other
pm_qos API functions.
Will insert an element into the list for that identified PM QoS class with the
target value. Upon change to this list the new target is recomputed and any
registered notifiers are called only if the target value is now different.
Clients of pm_qos need to save the returned handle for future use in other
pm_qos API functions.
void pm_qos_update_request(handle, new_target_value):
Will update the list element pointed to by the handle with the new target value
and recompute the new aggregated target, calling the notification tree if the
target is changed.
Will update the list element pointed to by the handle with the new target value
and recompute the new aggregated target, calling the notification tree if the
target is changed.
void pm_qos_remove_request(handle):
Will remove the element. After removal it will update the aggregate target and
call the notification tree if the target was changed as a result of removing
the request.
Will remove the element. After removal it will update the aggregate target and
call the notification tree if the target was changed as a result of removing
the request.
int pm_qos_request(param_class):
Returns the aggregated value for a given PM QoS class.
Returns the aggregated value for a given PM QoS class.
int pm_qos_request_active(handle):
Returns if the request is still active, i.e. it has not been removed from a
PM QoS class constraints list.
Returns if the request is still active, i.e. it has not been removed from a
PM QoS class constraints list.
int pm_qos_add_notifier(param_class, notifier):
Adds a notification callback function to the PM QoS class. The callback is
called when the aggregated value for the PM QoS class is changed.
Adds a notification callback function to the PM QoS class. The callback is
called when the aggregated value for the PM QoS class is changed.
int pm_qos_remove_notifier(int param_class, notifier):
Removes the notification callback function for the PM QoS class.
Removes the notification callback function for the PM QoS class.
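The aggregation behaviour described for the functions above can be sketched in
user space. This is not kernel code and does not use the pm_qos API: the names
(struct qos_class, qos_add() and so on) are hypothetical, and the sketch assumes
a latency-style class whose aggregated target is the minimum of all active
requests, with the notifier counted only when the target actually changes.

```c
#include <limits.h>
#include <stddef.h>

#define QOS_MAX_REQUESTS 8
#define QOS_DEFAULT_VALUE INT_MAX	/* assumed "no constraint" value */

/* Hypothetical user-space model of one PM QoS class (not the kernel API). */
struct qos_class {
	int values[QOS_MAX_REQUESTS];
	int active[QOS_MAX_REQUESTS];
	int target;		/* current aggregated value */
	int notifications;	/* how many times the notifier would have run */
};

static void qos_init(struct qos_class *c)
{
	for (size_t i = 0; i < QOS_MAX_REQUESTS; i++)
		c->active[i] = 0;
	c->target = QOS_DEFAULT_VALUE;
	c->notifications = 0;
}

/* Recompute the aggregate (minimum) and "notify" only if it changed. */
static void qos_recompute(struct qos_class *c)
{
	int min = QOS_DEFAULT_VALUE;

	for (size_t i = 0; i < QOS_MAX_REQUESTS; i++)
		if (c->active[i] && c->values[i] < min)
			min = c->values[i];

	if (min != c->target) {
		c->target = min;
		c->notifications++;
	}
}

/* Returns a handle (slot index) to keep for later calls, or -1 if full. */
static int qos_add(struct qos_class *c, int value)
{
	for (int i = 0; i < QOS_MAX_REQUESTS; i++) {
		if (!c->active[i]) {
			c->active[i] = 1;
			c->values[i] = value;
			qos_recompute(c);
			return i;
		}
	}
	return -1;
}

/* Change the value behind an existing handle and re-aggregate. */
static void qos_update(struct qos_class *c, int handle, int value)
{
	c->values[handle] = value;
	qos_recompute(c);
}

/* Drop the request behind a handle and re-aggregate. */
static void qos_remove(struct qos_class *c, int handle)
{
	c->active[handle] = 0;
	qos_recompute(c);
}
```

In this model, adding a looser request than the current minimum leaves the
target untouched and therefore triggers no notification, which mirrors the
"only if the target value is now different" wording above.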
From user mode:
Only processes can register a pm_qos request. To provide for automatic
cleanup of a process, the interface requires the process to register its
parameter requests in the following way:
@@ -89,6 +94,7 @@ node.
2. PM QoS per-device latency and flags framework
================================================
For each device, there are three lists of PM QoS requests. Two of them are
maintained along with the aggregated targets of resume latency and active
@@ -107,73 +113,80 @@ the aggregated value does not require any locking mechanism.
From kernel mode the use of this interface is the following:
int dev_pm_qos_add_request(device, handle, type, value):
Will insert an element into the list for that identified device with the
target value. Upon change to this list the new target is recomputed and any
registered notifiers are called only if the target value is now different.
Clients of dev_pm_qos need to save the handle for future use in other
dev_pm_qos API functions.
Will insert an element into the list for that identified device with the
target value. Upon change to this list the new target is recomputed and any
registered notifiers are called only if the target value is now different.
Clients of dev_pm_qos need to save the handle for future use in other
dev_pm_qos API functions.
int dev_pm_qos_update_request(handle, new_value):
Will update the list element pointed to by the handle with the new target value
and recompute the new aggregated target, calling the notification trees if the
target is changed.
Will update the list element pointed to by the handle with the new target
value and recompute the new aggregated target, calling the notification
trees if the target is changed.
int dev_pm_qos_remove_request(handle):
Will remove the element. After removal it will update the aggregate target and
call the notification trees if the target was changed as a result of removing
the request.
Will remove the element. After removal it will update the aggregate target
and call the notification trees if the target was changed as a result of
removing the request.
s32 dev_pm_qos_read_value(device):
Returns the aggregated value for a given device's constraints list.
Returns the aggregated value for a given device's constraints list.
enum pm_qos_flags_status dev_pm_qos_flags(device, mask)
Check PM QoS flags of the given device against the given mask of flags.
The meaning of the return values is as follows:
PM_QOS_FLAGS_ALL: All flags from the mask are set
PM_QOS_FLAGS_SOME: Some flags from the mask are set
PM_QOS_FLAGS_NONE: No flags from the mask are set
PM_QOS_FLAGS_UNDEFINED: The device's PM QoS structure has not been
initialized or the list of requests is empty.
Check PM QoS flags of the given device against the given mask of flags.
The meaning of the return values is as follows:
PM_QOS_FLAGS_ALL:
All flags from the mask are set
PM_QOS_FLAGS_SOME:
Some flags from the mask are set
PM_QOS_FLAGS_NONE:
No flags from the mask are set
PM_QOS_FLAGS_UNDEFINED:
The device's PM QoS structure has not been initialized
or the list of requests is empty.
int dev_pm_qos_add_ancestor_request(dev, handle, type, value)
Add a PM QoS request for the first direct ancestor of the given device whose
power.ignore_children flag is unset (for DEV_PM_QOS_RESUME_LATENCY requests)
or whose power.set_latency_tolerance callback pointer is not NULL (for
DEV_PM_QOS_LATENCY_TOLERANCE requests).
Add a PM QoS request for the first direct ancestor of the given device whose
power.ignore_children flag is unset (for DEV_PM_QOS_RESUME_LATENCY requests)
or whose power.set_latency_tolerance callback pointer is not NULL (for
DEV_PM_QOS_LATENCY_TOLERANCE requests).
int dev_pm_qos_expose_latency_limit(device, value)
Add a request to the device's PM QoS list of resume latency constraints and
create a sysfs attribute pm_qos_resume_latency_us under the device's power
directory allowing user space to manipulate that request.
Add a request to the device's PM QoS list of resume latency constraints and
create a sysfs attribute pm_qos_resume_latency_us under the device's power
directory allowing user space to manipulate that request.
void dev_pm_qos_hide_latency_limit(device)
Drop the request added by dev_pm_qos_expose_latency_limit() from the device's
PM QoS list of resume latency constraints and remove sysfs attribute
pm_qos_resume_latency_us from the device's power directory.
Drop the request added by dev_pm_qos_expose_latency_limit() from the device's
PM QoS list of resume latency constraints and remove sysfs attribute
pm_qos_resume_latency_us from the device's power directory.
int dev_pm_qos_expose_flags(device, value)
Add a request to the device's PM QoS list of flags and create sysfs attribute
pm_qos_no_power_off under the device's power directory allowing user space to
change the value of the PM_QOS_FLAG_NO_POWER_OFF flag.
Add a request to the device's PM QoS list of flags and create sysfs attribute
pm_qos_no_power_off under the device's power directory allowing user space to
change the value of the PM_QOS_FLAG_NO_POWER_OFF flag.
void dev_pm_qos_hide_flags(device)
Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS list
of flags and remove sysfs attribute pm_qos_no_power_off from the device's power
directory.
Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS list
of flags and remove sysfs attribute pm_qos_no_power_off from the device's power
directory.
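The four return values of dev_pm_qos_flags() listed earlier in this section can
be illustrated with a small user-space model. The enum and function names below
are hypothetical stand-ins, and the sketch assumes that the device's effective
flags are the bitwise OR of all requested flags; it is meant only to show how a
mask is classified, not how the kernel implements the check.

```c
#include <stdbool.h>

/* Hypothetical mirror of the four return values described above. */
enum flags_status {
	FLAGS_UNDEFINED = -1,
	FLAGS_NONE,
	FLAGS_SOME,
	FLAGS_ALL,
};

/*
 * have_requests: false models an uninitialized PM QoS structure or an
 *                empty list of requests
 * effective:     assumed to be the OR of all flags currently requested
 * mask:          the flags the caller is asking about
 */
static enum flags_status check_flags(bool have_requests,
				     unsigned int effective,
				     unsigned int mask)
{
	unsigned int set;

	if (!have_requests)
		return FLAGS_UNDEFINED;

	set = effective & mask;
	if (set == 0)
		return FLAGS_NONE;

	return set == mask ? FLAGS_ALL : FLAGS_SOME;
}
```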
Notification mechanisms:
The per-device PM QoS framework has a per-device notification tree.
int dev_pm_qos_add_notifier(device, notifier):
Adds a notification callback function for the device.
The callback is called when the aggregated value of the device constraints list
is changed (for resume latency device PM QoS only).
Adds a notification callback function for the device.
The callback is called when the aggregated value of the device constraints list
is changed (for resume latency device PM QoS only).
int dev_pm_qos_remove_notifier(device, notifier):
Removes the notification callback function for the device.
Removes the notification callback function for the device.
Active state latency tolerance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This device PM QoS type is used to support systems in which hardware may switch
to energy-saving operation modes on the fly. In those systems, if the operation


@@ -0,0 +1,282 @@
========================
Linux power supply class
========================
Synopsis
~~~~~~~~
Power supply class used to represent battery, UPS, AC or DC power supply
properties to user-space.
It defines core set of attributes, which should be applicable to (almost)
every power supply out there. Attributes are available via sysfs and uevent
interfaces.
Each attribute has well defined meaning, up to unit of measure used. While
the attributes provided are believed to be universally applicable to any
power supply, specific monitoring hardware may not be able to provide them
all, so any of them may be skipped.
Power supply class is extensible, and allows drivers to define their own attributes.
The core attribute set is subject to the standard Linux evolution (i.e.
if it will be found that some attribute is applicable to many power supply
types or their drivers, it can be added to the core set).
It also integrates with LED framework, for the purpose of providing
typically expected feedback of battery charging/fully charged status and
AC/USB power supply online status. (Note that specific details of the
indication (including whether to use it at all) are fully controllable by
user and/or specific machine defaults, per design principles of LED
framework).
Attributes/properties
~~~~~~~~~~~~~~~~~~~~~
Power supply class has predefined set of attributes, which eliminates code
duplication across drivers. Power supply class insists on reusing its
predefined attributes *and* their units.
So, userspace gets predictable set of attributes and their units for any
kind of power supply, and can process/present them to a user in consistent
manner. Results for different power supplies and machines are also directly
comparable.
See drivers/power/supply/ds2760_battery.c and drivers/power/supply/pda_power.c
for examples of how to declare and handle attributes.
Units
~~~~~
Quoting include/linux/power_supply.h:
All voltages, currents, charges, energies, time and temperatures in µV,
µA, µAh, µWh, seconds and tenths of degree Celsius unless otherwise
stated. It's driver's job to convert its raw values to units in which
this class operates.
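As an illustration of these units from the consumer side, the sketch below
converts raw attribute values into more familiar ones. The helper names are
hypothetical, and the input string stands in for the text a user-space program
would read from a sysfs attribute file; only the scaling factors come from the
quoted header comment.

```c
#include <stdlib.h>

/* Parse the text of an attribute such as "3870000\n" into a long. */
static long parse_attr(const char *text)
{
	return strtol(text, NULL, 10);
}

/* µV -> V, µA -> A, µAh -> Ah, µWh -> Wh */
static double micro_to_base(long micro)
{
	return (double)micro / 1e6;
}

/* Tenths of a degree Celsius -> degrees Celsius */
static double decidegrees_to_celsius(long tenths)
{
	return (double)tenths / 10.0;
}
```

Keeping the conversion in the consumer matches the rule above: drivers report
in the class units, and presentation is left to user space.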
Attributes/properties detailed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+--------------------------------------------------------------------------+
|               **Charge/Energy/Capacity - how to not confuse**            |
+--------------------------------------------------------------------------+
| **Because both "charge" (µAh) and "energy" (µWh) represent "capacity"    |
| of battery, this class distinguishes these terms. Don't mix them!**      |
|                                                                          |
| - `CHARGE_*`                                                             |
|       attributes represent capacity in µAh only.                         |
| - `ENERGY_*`                                                             |
|       attributes represent capacity in µWh only.                         |
| - `CAPACITY`                                                             |
|       attribute represents capacity in *percents*, from 0 to 100.        |
+--------------------------------------------------------------------------+
Postfixes:
_AVG
*hardware* averaged value, use it if your hardware is really able to
report averaged values.
_NOW
momentary/instantaneous values.
STATUS
this attribute represents operating status (charging, full,
discharging (i.e. powering a load), etc.). This corresponds to
`BATTERY_STATUS_*` values, as defined in battery.h.
CHARGE_TYPE
batteries can typically charge at different rates.
This defines trickle and fast charges. For batteries that
are already charged or discharging, 'n/a' can be displayed (or
'unknown', if the status is not known).
AUTHENTIC
indicates the power supply (battery or charger) connected
to the platform is authentic(1) or non authentic(0).
HEALTH
represents health of the battery, values correspond to
POWER_SUPPLY_HEALTH_*, defined in battery.h.
VOLTAGE_OCV
open circuit voltage of the battery.
VOLTAGE_MAX_DESIGN, VOLTAGE_MIN_DESIGN
design values for maximal and minimal power supply voltages.
Maximal/minimal means values of voltages when battery considered
"full"/"empty" at normal conditions. Yes, there is no direct relation
between voltage and battery capacity, but some dumb
batteries use voltage for very approximated calculation of capacity.
Battery driver also can use this attribute just to inform userspace
about maximal and minimal voltage thresholds of a given battery.
VOLTAGE_MAX, VOLTAGE_MIN
same as _DESIGN voltage values except that these ones should be used
if hardware could only guess (measure and retain) the thresholds of a
given power supply.
VOLTAGE_BOOT
Reports the voltage measured during boot
CURRENT_BOOT
Reports the current measured during boot
CHARGE_FULL_DESIGN, CHARGE_EMPTY_DESIGN
design charge values, when battery considered full/empty.
ENERGY_FULL_DESIGN, ENERGY_EMPTY_DESIGN
same as above but for energy.
CHARGE_FULL, CHARGE_EMPTY
These attributes mean "last remembered value of charge when battery
became full/empty". It also could mean "value of charge when battery
considered full/empty at given conditions (temperature, age)".
I.e. these attributes represent real thresholds, not design values.
ENERGY_FULL, ENERGY_EMPTY
same as above but for energy.
CHARGE_COUNTER
the current charge counter (in µAh). This could easily
be negative; there is no empty or full value. It is only useful for
relative, time-based measurements.
PRECHARGE_CURRENT
the maximum charge current during precharge phase of charge cycle
(typically 20% of battery capacity).
CHARGE_TERM_CURRENT
Charge termination current. The charge cycle terminates when battery
voltage is above recharge threshold, and charge current is below
this setting (typically 10% of battery capacity).
CONSTANT_CHARGE_CURRENT
constant charge current programmed by charger.
CONSTANT_CHARGE_CURRENT_MAX
maximum charge current supported by the power supply object.
CONSTANT_CHARGE_VOLTAGE
constant charge voltage programmed by charger.
CONSTANT_CHARGE_VOLTAGE_MAX
maximum charge voltage supported by the power supply object.
INPUT_CURRENT_LIMIT
input current limit programmed by charger. Indicates
the current drawn from a charging source.
CHARGE_CONTROL_LIMIT
current charge control limit setting
CHARGE_CONTROL_LIMIT_MAX
maximum charge control limit setting
CALIBRATE
battery or coulomb counter calibration status
CAPACITY
capacity in percents.
CAPACITY_ALERT_MIN
minimum capacity alert value in percents.
CAPACITY_ALERT_MAX
maximum capacity alert value in percents.
CAPACITY_LEVEL
capacity level. This corresponds to POWER_SUPPLY_CAPACITY_LEVEL_*.
TEMP
temperature of the power supply.
TEMP_ALERT_MIN
minimum battery temperature alert.
TEMP_ALERT_MAX
maximum battery temperature alert.
TEMP_AMBIENT
ambient temperature.
TEMP_AMBIENT_ALERT_MIN
minimum ambient temperature alert.
TEMP_AMBIENT_ALERT_MAX
maximum ambient temperature alert.
TEMP_MIN
minimum operatable temperature
TEMP_MAX
maximum operatable temperature
TIME_TO_EMPTY
seconds left for battery to be considered empty
(i.e. while battery powers a load)
TIME_TO_FULL
seconds left for battery to be considered full
(i.e. while battery is charging)
Battery <-> external power supply interaction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Often power supplies are acting as supplies and supplicants at the same
time. Batteries are good example. So, batteries usually care if they're
externally powered or not.
For that case, power supply class implements notification mechanism for
batteries.
External power supply (AC) lists supplicants (batteries) names in
"supplied_to" struct member, and each power_supply_changed() call
issued by external power supply will notify supplicants via
external_power_changed callback.
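The notification flow just described can be modelled in a few lines of
user-space C. This is a simplified sketch, not the kernel's struct power_supply:
the struct layout, the supply_changed() helper and the times_notified counter
are all hypothetical and exist only to show how a change on an external supply
fans out to the supplicants named in its "supplied_to" list.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a power supply object. */
struct supply {
	const char *name;
	const char **supplied_to;	/* names of supplicants */
	size_t num_supplicants;
	void (*external_power_changed)(struct supply *self);
	int times_notified;		/* bookkeeping for this model only */
};

/* Example callback: just record that the supplicant was notified. */
static void count_notification(struct supply *self)
{
	self->times_notified++;
}

/*
 * Model of a "changed" event on src: every registered supply whose name
 * appears in src->supplied_to gets its callback invoked.
 */
static void supply_changed(struct supply *src, struct supply **all, size_t n)
{
	for (size_t i = 0; i < src->num_supplicants; i++)
		for (size_t j = 0; j < n; j++)
			if (all[j]->external_power_changed &&
			    strcmp(src->supplied_to[i], all[j]->name) == 0)
				all[j]->external_power_changed(all[j]);
}
```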
Devicetree battery characteristics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Drivers should call power_supply_get_battery_info() to obtain battery
characteristics from a devicetree battery node, defined in
Documentation/devicetree/bindings/power/supply/battery.txt. This is
implemented in drivers/power/supply/bq27xxx_battery.c.
Properties in struct power_supply_battery_info and their counterparts in the
battery node have names corresponding to elements in enum power_supply_property,
for naming consistency between sysfs attributes and battery node properties.
QA
~~
Q:
Where is POWER_SUPPLY_PROP_XYZ attribute?
A:
If you cannot find attribute suitable for your driver needs, feel free
to add it and send patch along with your driver.
The attributes available currently are the ones currently provided by the
drivers written.
Good candidates to add in future: model/part#, cycle_time, manufacturer,
etc.
Q:
I have some very specific attribute (e.g. battery color), should I add
this attribute to standard ones?
A:
Most likely, no. Such attribute can be placed in the driver itself, if
it is useful. Of course, if the attribute in question is applicable to
large set of batteries, provided by many drivers, and/or comes from
some general battery specification/standard, it may be a candidate to
be added to the core attribute set.
Q:
Suppose, my battery monitoring chip/firmware does not provide capacity
in percents, but provides charge_{now,full,empty}. Should I calculate
percentage capacity manually, inside the driver, and register CAPACITY
attribute? The same question about time_to_empty/time_to_full.
A:
Most likely, no. This class is designed to export properties which are
directly measurable by the specific hardware available.
Inferring not available properties using some heuristics or mathematical
model is not subject of work for a battery driver. Such functionality
should be factored out, and in fact, apm_power, the driver to serve
legacy APM API on top of power supply class, uses a simple heuristic of
approximating remaining battery capacity based on its charge, current,
voltage and so on. But full-fledged battery model is likely not subject
for kernel at all, as it would require floating point calculation to deal
with things like differential equations and Kalman filters. This is
better handled by batteryd/libbattery, yet to be written.
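To make the last answer concrete, here is a sketch of the kind of derivation
that belongs outside the driver. It assumes a user-space consumer that has
already read CHARGE_NOW, CHARGE_FULL and CHARGE_EMPTY (all in µAh) and wants a
rough percentage; the function name is hypothetical and the linear
interpolation is only a crude heuristic, not a battery model.

```c
/*
 * Hypothetical user-space helper: derive a percentage from charge values
 * given in µAh. Returns -1 if full/empty do not describe a valid range.
 */
static int charge_to_percent(long long now, long long full, long long empty)
{
	if (full <= empty)
		return -1;

	/* Clamp readings that fall outside the remembered thresholds. */
	if (now < empty)
		now = empty;
	if (now > full)
		now = full;

	return (int)((now - empty) * 100 / (full - empty));
}
```

Integer arithmetic is used on purpose, in keeping with the point above that
floating point models are better kept out of the kernel.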


@@ -1,231 +0,0 @@
Linux power supply class
========================
Synopsis
~~~~~~~~
Power supply class used to represent battery, UPS, AC or DC power supply
properties to user-space.
It defines core set of attributes, which should be applicable to (almost)
every power supply out there. Attributes are available via sysfs and uevent
interfaces.
Each attribute has well defined meaning, up to unit of measure used. While
the attributes provided are believed to be universally applicable to any
power supply, specific monitoring hardware may not be able to provide them
all, so any of them may be skipped.
Power supply class is extensible, and allows to define drivers own attributes.
The core attribute set is subject to the standard Linux evolution (i.e.
if it will be found that some attribute is applicable to many power supply
types or their drivers, it can be added to the core set).
It also integrates with LED framework, for the purpose of providing
typically expected feedback of battery charging/fully charged status and
AC/USB power supply online status. (Note that specific details of the
indication (including whether to use it at all) are fully controllable by
user and/or specific machine defaults, per design principles of LED
framework).
Attributes/properties
~~~~~~~~~~~~~~~~~~~~~
Power supply class has predefined set of attributes, this eliminates code
duplication across drivers. Power supply class insist on reusing its
predefined attributes *and* their units.
So, userspace gets predictable set of attributes and their units for any
kind of power supply, and can process/present them to a user in consistent
manner. Results for different power supplies and machines are also directly
comparable.
See drivers/power/supply/ds2760_battery.c and drivers/power/supply/pda_power.c
for the example how to declare and handle attributes.
Units
~~~~~
Quoting include/linux/power_supply.h:
All voltages, currents, charges, energies, time and temperatures in µV,
µA, µAh, µWh, seconds and tenths of degree Celsius unless otherwise
stated. It's driver's job to convert its raw values to units in which
this class operates.
Attributes/properties detailed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ ~ ~ ~ ~ ~ ~ Charge/Energy/Capacity - how to not confuse ~ ~ ~ ~ ~ ~ ~
~ ~
~ Because both "charge" (µAh) and "energy" (µWh) represents "capacity" ~
~ of battery, this class distinguish these terms. Don't mix them! ~
~ ~
~ CHARGE_* attributes represents capacity in µAh only. ~
~ ENERGY_* attributes represents capacity in µWh only. ~
~ CAPACITY attribute represents capacity in *percents*, from 0 to 100. ~
~ ~
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Postfixes:
_AVG - *hardware* averaged value, use it if your hardware is really able to
report averaged values.
_NOW - momentary/instantaneous values.
STATUS - this attribute represents operating status (charging, full,
discharging (i.e. powering a load), etc.). This corresponds to
BATTERY_STATUS_* values, as defined in battery.h.
CHARGE_TYPE - batteries can typically charge at different rates.
This defines trickle and fast charges. For batteries that
are already charged or discharging, 'n/a' can be displayed (or
'unknown', if the status is not known).
AUTHENTIC - indicates the power supply (battery or charger) connected
to the platform is authentic(1) or non authentic(0).
HEALTH - represents health of the battery, values corresponds to
POWER_SUPPLY_HEALTH_*, defined in battery.h.
VOLTAGE_OCV - open circuit voltage of the battery.
VOLTAGE_MAX_DESIGN, VOLTAGE_MIN_DESIGN - design values for maximal and
minimal power supply voltages. Maximal/minimal means values of voltages
when battery considered "full"/"empty" at normal conditions. Yes, there is
no direct relation between voltage and battery capacity, but some dumb
batteries use voltage for very approximated calculation of capacity.
Battery driver also can use this attribute just to inform userspace
about maximal and minimal voltage thresholds of a given battery.
VOLTAGE_MAX, VOLTAGE_MIN - same as _DESIGN voltage values except that
these ones should be used if hardware could only guess (measure and
retain) the thresholds of a given power supply.
VOLTAGE_BOOT - Reports the voltage measured during boot
CURRENT_BOOT - Reports the current measured during boot
CHARGE_FULL_DESIGN, CHARGE_EMPTY_DESIGN - design charge values, when
battery considered full/empty.
ENERGY_FULL_DESIGN, ENERGY_EMPTY_DESIGN - same as above but for energy.
CHARGE_FULL, CHARGE_EMPTY - These attributes means "last remembered value
of charge when battery became full/empty". It also could mean "value of
charge when battery considered full/empty at given conditions (temperature,
age)". I.e. these attributes represents real thresholds, not design values.
ENERGY_FULL, ENERGY_EMPTY - same as above but for energy.
CHARGE_COUNTER - the current charge counter (in µAh). This could easily
be negative; there is no empty or full value. It is only useful for
relative, time-based measurements.
PRECHARGE_CURRENT - the maximum charge current during precharge phase
of charge cycle (typically 20% of battery capacity).
CHARGE_TERM_CURRENT - Charge termination current. The charge cycle
terminates when battery voltage is above recharge threshold, and charge
current is below this setting (typically 10% of battery capacity).
CONSTANT_CHARGE_CURRENT - constant charge current programmed by charger.
CONSTANT_CHARGE_CURRENT_MAX - maximum charge current supported by the
power supply object.
CONSTANT_CHARGE_VOLTAGE - constant charge voltage programmed by charger.
CONSTANT_CHARGE_VOLTAGE_MAX - maximum charge voltage supported by the
power supply object.
INPUT_CURRENT_LIMIT - input current limit programmed by charger. Indicates
the current drawn from a charging source.
CHARGE_CONTROL_LIMIT - current charge control limit setting
CHARGE_CONTROL_LIMIT_MAX - maximum charge control limit setting
CALIBRATE - battery or coulomb counter calibration status
CAPACITY - capacity in percents.
CAPACITY_ALERT_MIN - minimum capacity alert value in percents.
CAPACITY_ALERT_MAX - maximum capacity alert value in percents.
CAPACITY_LEVEL - capacity level. This corresponds to
POWER_SUPPLY_CAPACITY_LEVEL_*.
TEMP - temperature of the power supply.
TEMP_ALERT_MIN - minimum battery temperature alert.
TEMP_ALERT_MAX - maximum battery temperature alert.
TEMP_AMBIENT - ambient temperature.
TEMP_AMBIENT_ALERT_MIN - minimum ambient temperature alert.
TEMP_AMBIENT_ALERT_MAX - maximum ambient temperature alert.
TEMP_MIN - minimum operatable temperature
TEMP_MAX - maximum operatable temperature
TIME_TO_EMPTY - seconds left for battery to be considered empty (i.e.
while battery powers a load)
TIME_TO_FULL - seconds left for battery to be considered full (i.e.
while battery is charging)
Battery <-> external power supply interaction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Often power supplies are acting as supplies and supplicants at the same
time. Batteries are good example. So, batteries usually care if they're
externally powered or not.
For that case, power supply class implements notification mechanism for
batteries.
External power supply (AC) lists supplicants (batteries) names in
"supplied_to" struct member, and each power_supply_changed() call
issued by external power supply will notify supplicants via
external_power_changed callback.
Devicetree battery characteristics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Drivers should call power_supply_get_battery_info() to obtain battery
characteristics from a devicetree battery node, defined in
Documentation/devicetree/bindings/power/supply/battery.txt. This is
implemented in drivers/power/supply/bq27xxx_battery.c.
Properties in struct power_supply_battery_info and their counterparts in the
battery node have names corresponding to elements in enum power_supply_property,
for naming consistency between sysfs attributes and battery node properties.
QA
~~
Q: Where is the POWER_SUPPLY_PROP_XYZ attribute?
A: If you cannot find an attribute suitable for your driver's needs, feel free
to add it and send a patch along with your driver.
The attributes currently available are the ones provided by the
drivers written so far.
Good candidates to add in the future: model/part#, cycle_time, manufacturer,
etc.
Q: I have a very specific attribute (e.g. battery color). Should I add
this attribute to the standard ones?
A: Most likely, no. Such an attribute can be placed in the driver itself, if
it is useful. Of course, if the attribute in question is applicable to
a large set of batteries, is provided by many drivers, and/or comes from
some general battery specification/standard, it may be a candidate to
be added to the core attribute set.
Q: Suppose my battery monitoring chip/firmware does not provide capacity
in percent, but provides charge_{now,full,empty}. Should I calculate
percentage capacity manually, inside the driver, and register the CAPACITY
attribute? The same question applies to time_to_empty/time_to_full.
A: Most likely, no. This class is designed to export properties which are
directly measurable by the specific hardware available.
Inferring unavailable properties using heuristics or a mathematical
model is not the job of a battery driver. Such functionality
should be factored out; in fact, apm_power, the driver that serves the
legacy APM API on top of the power supply class, uses a simple heuristic to
approximate the remaining battery capacity based on its charge, current,
voltage and so on. A full-fledged battery model is likely not a subject
for the kernel at all, as it would require floating point calculations to deal
with things like differential equations and Kalman filters. This is
better handled by batteryd/libbattery, yet to be written.
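As an illustration of the kind of calculation that belongs outside the
driver, the following is a minimal user space sketch that estimates the
percentage capacity from charge readings. It assumes a simple linear
relation between charge and capacity, which is only a rough approximation;
the function name capacity_percent() is hypothetical and not part of any
kernel or library API.

```python
def capacity_percent(charge_now, charge_full, charge_empty=0):
    """Estimate capacity in percent from charge readings.

    All three values must use the same unit (for example the values
    read from the charge_now, charge_full and charge_empty attributes).
    Assumption: capacity scales linearly with charge.
    """
    span = charge_full - charge_empty
    if span <= 0:
        raise ValueError("charge_full must be greater than charge_empty")
    percent = 100.0 * (charge_now - charge_empty) / span
    # Readings may briefly fall outside the nominal range, so clamp.
    return max(0.0, min(100.0, percent))
```

A caller would read the charge attributes from the battery's sysfs
directory and pass the integer values to this function.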

@@ -0,0 +1,257 @@
=======================
Power Capping Framework
=======================
The power capping framework provides a consistent interface between the kernel
and the user space that allows power capping drivers to expose the settings to
user space in a uniform way.
Terminology
===========
The framework exposes power capping devices to user space via sysfs in the
form of a tree of objects. The objects at the root level of the tree represent
'control types', which correspond to different methods of power capping. For
example, the intel-rapl control type represents the Intel "Running Average
Power Limit" (RAPL) technology, whereas the 'idle-injection' control type
corresponds to the use of idle injection for controlling power.
Power zones represent different parts of the system, which can be controlled and
monitored using the power capping method determined by the control type the
given zone belongs to. They each contain attributes for monitoring power, as
well as controls represented in the form of power constraints. If the parts of
the system represented by different power zones are hierarchical (that is, one
bigger part consists of multiple smaller parts that each have their own power
controls), those power zones may also be organized in a hierarchy with one
parent power zone containing multiple subzones and so on to reflect the power
control topology of the system. In that case, it is possible to apply power
capping to a set of devices together using the parent power zone and, if more
fine-grained control is required, it can be applied through the subzones.
Example sysfs interface tree::
/sys/devices/virtual/powercap
└──intel-rapl
├──intel-rapl:0
│   ├──constraint_0_name
│   ├──constraint_0_power_limit_uw
│   ├──constraint_0_time_window_us
│   ├──constraint_1_name
│   ├──constraint_1_power_limit_uw
│   ├──constraint_1_time_window_us
│   ├──device -> ../../intel-rapl
│   ├──energy_uj
│   ├──intel-rapl:0:0
│   │   ├──constraint_0_name
│   │   ├──constraint_0_power_limit_uw
│   │   ├──constraint_0_time_window_us
│   │   ├──constraint_1_name
│   │   ├──constraint_1_power_limit_uw
│   │   ├──constraint_1_time_window_us
│   │   ├──device -> ../../intel-rapl:0
│   │   ├──energy_uj
│   │   ├──max_energy_range_uj
│   │   ├──name
│   │   ├──enabled
│   │   ├──power
│   │   │   ├──async
│   │   │   []
│   │   ├──subsystem -> ../../../../../../class/power_cap
│   │   └──uevent
│   ├──intel-rapl:0:1
│   │   ├──constraint_0_name
│   │   ├──constraint_0_power_limit_uw
│   │   ├──constraint_0_time_window_us
│   │   ├──constraint_1_name
│   │   ├──constraint_1_power_limit_uw
│   │   ├──constraint_1_time_window_us
│   │   ├──device -> ../../intel-rapl:0
│   │   ├──energy_uj
│   │   ├──max_energy_range_uj
│   │   ├──name
│   │   ├──enabled
│   │   ├──power
│   │   │   ├──async
│   │   │   []
│   │   ├──subsystem -> ../../../../../../class/power_cap
│   │   └──uevent
│   ├──max_energy_range_uj
│   ├──max_power_range_uw
│   ├──name
│   ├──enabled
│   ├──power
│   │   ├──async
│   │   []
│   ├──subsystem -> ../../../../../class/power_cap
│   ├──enabled
│   ├──uevent
├──intel-rapl:1
│   ├──constraint_0_name
│   ├──constraint_0_power_limit_uw
│   ├──constraint_0_time_window_us
│   ├──constraint_1_name
│   ├──constraint_1_power_limit_uw
│   ├──constraint_1_time_window_us
│   ├──device -> ../../intel-rapl
│   ├──energy_uj
│   ├──intel-rapl:1:0
│   │   ├──constraint_0_name
│   │   ├──constraint_0_power_limit_uw
│   │   ├──constraint_0_time_window_us
│   │   ├──constraint_1_name
│   │   ├──constraint_1_power_limit_uw
│   │   ├──constraint_1_time_window_us
│   │   ├──device -> ../../intel-rapl:1
│   │   ├──energy_uj
│   │   ├──max_energy_range_uj
│   │   ├──name
│   │   ├──enabled
│   │   ├──power
│   │   │   ├──async
│   │   │   []
│   │   ├──subsystem -> ../../../../../../class/power_cap
│   │   └──uevent
│   ├──intel-rapl:1:1
│   │   ├──constraint_0_name
│   │   ├──constraint_0_power_limit_uw
│   │   ├──constraint_0_time_window_us
│   │   ├──constraint_1_name
│   │   ├──constraint_1_power_limit_uw
│   │   ├──constraint_1_time_window_us
│   │   ├──device -> ../../intel-rapl:1
│   │   ├──energy_uj
│   │   ├──max_energy_range_uj
│   │   ├──name
│   │   ├──enabled
│   │   ├──power
│   │   │   ├──async
│   │   │   []
│   │   ├──subsystem -> ../../../../../../class/power_cap
│   │   └──uevent
│   ├──max_energy_range_uj
│   ├──max_power_range_uw
│   ├──name
│   ├──enabled
│   ├──power
│   │   ├──async
│   │   []
│   ├──subsystem -> ../../../../../class/power_cap
│   ├──uevent
├──power
│   ├──async
│   []
├──subsystem -> ../../../../class/power_cap
├──enabled
└──uevent
The above example illustrates a case in which the Intel RAPL technology,
available in Intel® 64 and IA-32 Processor Architectures, is used. There is one
control type called intel-rapl which contains two power zones, intel-rapl:0 and
intel-rapl:1, representing CPU packages. Each of these power zones contains
two subzones, intel-rapl:j:0 and intel-rapl:j:1 (j = 0, 1), representing the
"core" and the "uncore" parts of the given CPU package, respectively. All of
the zones and subzones contain energy monitoring attributes (energy_uj,
max_energy_range_uj) and constraint attributes (constraint_*) allowing controls
to be applied (the constraints in the 'package' power zones apply to the whole
CPU packages and the subzone constraints only apply to the respective parts of
the given package individually). Since Intel RAPL doesn't provide an
instantaneous power value, there is no power_uw attribute.
In addition to that, each power zone contains a name attribute, allowing the
part of the system represented by that zone to be identified.
For example::
cat /sys/class/power_cap/intel-rapl/intel-rapl:0/name
package-0
---------
The Intel RAPL technology allows two constraints, short term and long term,
with two different time windows to be applied to each power zone. Thus for
each zone there are 2 attributes representing the constraint names, 2 power
limits and 2 attributes representing the sizes of the time windows. The
constraint_j_* attributes correspond to the jth constraint (j = 0, 1).
For example::
constraint_0_name
constraint_0_power_limit_uw
constraint_0_time_window_us
constraint_1_name
constraint_1_power_limit_uw
constraint_1_time_window_us
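The constraint_j_* naming scheme makes it straightforward to group the
attributes of a zone by constraint index from user space. The following is
a minimal sketch of that grouping; the function name read_constraints() is
hypothetical, and the sketch assumes that every constraint attribute in the
zone directory is a readable text file.

```python
import os
import re

# Matches attribute names such as "constraint_0_power_limit_uw".
_CONSTRAINT_RE = re.compile(r"^constraint_(\d+)_(\w+)$")


def read_constraints(zone_dir):
    """Return {index: {field: value}} for all constraint_* files in zone_dir."""
    constraints = {}
    for entry in sorted(os.listdir(zone_dir)):
        match = _CONSTRAINT_RE.match(entry)
        if not match:
            continue  # Not a constraint attribute (e.g. energy_uj, subzones).
        index, field = int(match.group(1)), match.group(2)
        with open(os.path.join(zone_dir, entry)) as attr:
            constraints.setdefault(index, {})[field] = attr.read().strip()
    return constraints
```

Called on a zone directory such as the intel-rapl:0 directory from the
example tree, this would return one dictionary per constraint index, keyed
by the attribute suffix (name, power_limit_uw, time_window_us).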
Power Zone Attributes
=====================
Monitoring attributes
---------------------
energy_uj (rw)
Current energy counter in microjoules. Write "0" to reset.
If the counter cannot be reset, then this attribute is read-only.
max_energy_range_uj (ro)
Range of the above energy counter in microjoules.
power_uw (ro)
Current power in microwatts.
max_power_range_uw (ro)
Range of the above power value in microwatts.
name (ro)
Name of this power zone.
It is possible that some domains have both power ranges and energy counter ranges;
however, only one is mandatory.
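When a zone exposes only an energy counter, user space can derive an
average power value from two energy_uj samples. The following is a minimal
sketch of that arithmetic. It assumes that the counter wraps around at most
once between the two samples and that it restarts near zero after reaching
max_energy_range_uj; both function names are hypothetical.

```python
def energy_delta_uj(prev_uj, curr_uj, max_energy_range_uj):
    """Energy consumed between two samples, in microjoules."""
    if curr_uj >= prev_uj:
        return curr_uj - prev_uj
    # Assumption: the counter wrapped exactly once since the previous sample.
    return curr_uj + (max_energy_range_uj - prev_uj)


def average_power_uw(prev_uj, curr_uj, interval_s, max_energy_range_uj):
    """Average power over the sampling interval, in microwatts."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Microjoules divided by seconds gives microwatts.
    return energy_delta_uj(prev_uj, curr_uj, max_energy_range_uj) / interval_s
```

The sampling interval should be short enough that the counter cannot wrap
more than once, otherwise the result underestimates the consumed energy.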
Constraints
-----------
constraint_X_power_limit_uw (rw)
Power limit in microwatts, which should be applicable for the
time window specified by "constraint_X_time_window_us".
constraint_X_time_window_us (rw)
Time window in microseconds.
constraint_X_name (ro)
An optional name of the constraint.
constraint_X_max_power_uw (ro)
Maximum allowed power in microwatts.
constraint_X_min_power_uw (ro)
Minimum allowed power in microwatts.
constraint_X_max_time_window_us (ro)
Maximum allowed time window in microseconds.
constraint_X_min_time_window_us (ro)
Minimum allowed time window in microseconds.
Except for power_limit_uw and time_window_us, the other fields are optional.
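Because the min/max attributes are optional, a user space tool that sets a
power limit has to cope with their absence. The following is a minimal
sketch of one way to do that: it clamps the requested limit to whichever
bounds are present before writing constraint_X_power_limit_uw. Clamping in
user space is a design choice of this sketch, not something the framework
requires, and the function names are hypothetical. Writing to the real
sysfs attribute typically requires sufficient privileges.

```python
import os


def _read_optional_int(path):
    """Return the integer stored in path, or None if the file is absent."""
    try:
        with open(path) as attr:
            return int(attr.read().strip())
    except FileNotFoundError:
        return None


def set_power_limit(zone_dir, index, limit_uw):
    """Write a power limit (microwatts) for constraint `index` of a zone.

    The value is clamped to constraint_X_min_power_uw and
    constraint_X_max_power_uw when those optional attributes exist.
    Returns the value that was actually written.
    """
    prefix = os.path.join(zone_dir, f"constraint_{index}_")
    lower = _read_optional_int(prefix + "min_power_uw")
    upper = _read_optional_int(prefix + "max_power_uw")
    if lower is not None:
        limit_uw = max(limit_uw, lower)
    if upper is not None:
        limit_uw = min(limit_uw, upper)
    with open(prefix + "power_limit_uw", "w") as attr:
        attr.write(str(limit_uw))
    return limit_uw
```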
Common zone and control type attributes
---------------------------------------
enabled (rw): Enable/Disable controls at zone level or for all zones using
a control type.
Power Cap Client Driver Interface
=================================
The API summary:
Call powercap_register_control_type() to register a control type object.
Call powercap_register_zone() to register a power zone (under a given
control type), either as a top-level power zone or as a subzone of another
power zone registered earlier.
The number of constraints in a power zone and the corresponding callbacks have
to be defined prior to calling powercap_register_zone() to register that zone.
To free a power zone, call powercap_unregister_zone().
To free a control type object, call powercap_unregister_control_type().
Detailed API documentation can be generated using kernel-doc on
include/linux/powercap.h.

@@ -1,236 +0,0 @@
Power Capping Framework
==================================
The power capping framework provides a consistent interface between the kernel
and the user space that allows power capping drivers to expose the settings to
user space in a uniform way.
Terminology
=========================
The framework exposes power capping devices to user space via sysfs in the
form of a tree of objects. The objects at the root level of the tree represent
'control types', which correspond to different methods of power capping. For
example, the intel-rapl control type represents the Intel "Running Average
Power Limit" (RAPL) technology, whereas the 'idle-injection' control type
corresponds to the use of idle injection for controlling power.
Power zones represent different parts of the system, which can be controlled and
monitored using the power capping method determined by the control type the
given zone belongs to. They each contain attributes for monitoring power, as
well as controls represented in the form of power constraints. If the parts of
the system represented by different power zones are hierarchical (that is, one
bigger part consists of multiple smaller parts that each have their own power
controls), those power zones may also be organized in a hierarchy with one
parent power zone containing multiple subzones and so on to reflect the power
control topology of the system. In that case, it is possible to apply power
capping to a set of devices together using the parent power zone and if more
fine grained control is required, it can be applied through the subzones.
Example sysfs interface tree:
/sys/devices/virtual/powercap
??? intel-rapl
??? intel-rapl:0
?   ??? constraint_0_name
?   ??? constraint_0_power_limit_uw
?   ??? constraint_0_time_window_us
?   ??? constraint_1_name
?   ??? constraint_1_power_limit_uw
?   ??? constraint_1_time_window_us
?   ??? device -> ../../intel-rapl
?   ??? energy_uj
?   ??? intel-rapl:0:0
?   ?   ??? constraint_0_name
?   ?   ??? constraint_0_power_limit_uw
?   ?   ??? constraint_0_time_window_us
?   ?   ??? constraint_1_name
?   ?   ??? constraint_1_power_limit_uw
?   ?   ??? constraint_1_time_window_us
?   ?   ??? device -> ../../intel-rapl:0
?   ?   ??? energy_uj
?   ?   ??? max_energy_range_uj
?   ?   ??? name
?   ?   ??? enabled
?   ?   ??? power
?   ?   ?   ??? async
?   ?   ?   []
?   ?   ??? subsystem -> ../../../../../../class/power_cap
?   ?   ??? uevent
?   ??? intel-rapl:0:1
?   ?   ??? constraint_0_name
?   ?   ??? constraint_0_power_limit_uw
?   ?   ??? constraint_0_time_window_us
?   ?   ??? constraint_1_name
?   ?   ??? constraint_1_power_limit_uw
?   ?   ??? constraint_1_time_window_us
?   ?   ??? device -> ../../intel-rapl:0
?   ?   ??? energy_uj
?   ?   ??? max_energy_range_uj
?   ?   ??? name
?   ?   ??? enabled
?   ?   ??? power
?   ?   ?   ??? async
?   ?   ?   []
?   ?   ??? subsystem -> ../../../../../../class/power_cap
?   ?   ??? uevent
?   ??? max_energy_range_uj
?   ??? max_power_range_uw
?   ??? name
?   ??? enabled
?   ??? power
?   ?   ??? async
?   ?   []
?   ??? subsystem -> ../../../../../class/power_cap
?   ??? enabled
?   ??? uevent
??? intel-rapl:1
?   ??? constraint_0_name
?   ??? constraint_0_power_limit_uw
?   ??? constraint_0_time_window_us
?   ??? constraint_1_name
?   ??? constraint_1_power_limit_uw
?   ??? constraint_1_time_window_us
?   ??? device -> ../../intel-rapl
?   ??? energy_uj
?   ??? intel-rapl:1:0
?   ?   ??? constraint_0_name
?   ?   ??? constraint_0_power_limit_uw
?   ?   ??? constraint_0_time_window_us
?   ?   ??? constraint_1_name
?   ?   ??? constraint_1_power_limit_uw
?   ?   ??? constraint_1_time_window_us
?   ?   ??? device -> ../../intel-rapl:1
?   ?   ??? energy_uj
?   ?   ??? max_energy_range_uj
?   ?   ??? name
?   ?   ??? enabled
?   ?   ??? power
?   ?   ?   ??? async
?   ?   ?   []
?   ?   ??? subsystem -> ../../../../../../class/power_cap
?   ?   ??? uevent
?   ??? intel-rapl:1:1
?   ?   ??? constraint_0_name
?   ?   ??? constraint_0_power_limit_uw
?   ?   ??? constraint_0_time_window_us
?   ?   ??? constraint_1_name
?   ?   ??? constraint_1_power_limit_uw
?   ?   ??? constraint_1_time_window_us
?   ?   ??? device -> ../../intel-rapl:1
?   ?   ??? energy_uj
?   ?   ??? max_energy_range_uj
?   ?   ??? name
?   ?   ??? enabled
?   ?   ??? power
?   ?   ?   ??? async
?   ?   ?   []
?   ?   ??? subsystem -> ../../../../../../class/power_cap
?   ?   ??? uevent
?   ??? max_energy_range_uj
?   ??? max_power_range_uw
?   ??? name
?   ??? enabled
?   ??? power
?   ?   ??? async
?   ?   []
?   ??? subsystem -> ../../../../../class/power_cap
?   ??? uevent
??? power
?   ??? async
?   []
??? subsystem -> ../../../../class/power_cap
??? enabled
??? uevent
The above example illustrates a case in which the Intel RAPL technology,
available in Intel® IA-64 and IA-32 Processor Architectures, is used. There is one
control type called intel-rapl which contains two power zones, intel-rapl:0 and
intel-rapl:1, representing CPU packages. Each of these power zones contains
two subzones, intel-rapl:j:0 and intel-rapl:j:1 (j = 0, 1), representing the
"core" and the "uncore" parts of the given CPU package, respectively. All of
the zones and subzones contain energy monitoring attributes (energy_uj,
max_energy_range_uj) and constraint attributes (constraint_*) allowing controls
to be applied (the constraints in the 'package' power zones apply to the whole
CPU packages and the subzone constraints only apply to the respective parts of
the given package individually). Since Intel RAPL doesn't provide instantaneous
power value, there is no power_uw attribute.
In addition to that, each power zone contains a name attribute, allowing the
part of the system represented by that zone to be identified.
For example:
cat /sys/class/power_cap/intel-rapl/intel-rapl:0/name
package-0
The Intel RAPL technology allows two constraints, short term and long term,
with two different time windows to be applied to each power zone. Thus for
each zone there are 2 attributes representing the constraint names, 2 power
limits and 2 attributes representing the sizes of the time windows. Such that,
constraint_j_* attributes correspond to the jth constraint (j = 0,1).
For example:
constraint_0_name
constraint_0_power_limit_uw
constraint_0_time_window_us
constraint_1_name
constraint_1_power_limit_uw
constraint_1_time_window_us
Power Zone Attributes
=================================
Monitoring attributes
----------------------
energy_uj (rw): Current energy counter in micro joules. Write "0" to reset.
If the counter can not be reset, then this attribute is read only.
max_energy_range_uj (ro): Range of the above energy counter in micro-joules.
power_uw (ro): Current power in micro watts.
max_power_range_uw (ro): Range of the above power value in micro-watts.
name (ro): Name of this power zone.
It is possible that some domains have both power ranges and energy counter ranges;
however, only one is mandatory.
Constraints
----------------
constraint_X_power_limit_uw (rw): Power limit in micro watts, which should be
applicable for the time window specified by "constraint_X_time_window_us".
constraint_X_time_window_us (rw): Time window in micro seconds.
constraint_X_name (ro): An optional name of the constraint
constraint_X_max_power_uw(ro): Maximum allowed power in micro watts.
constraint_X_min_power_uw(ro): Minimum allowed power in micro watts.
constraint_X_max_time_window_us(ro): Maximum allowed time window in micro seconds.
constraint_X_min_time_window_us(ro): Minimum allowed time window in micro seconds.
Except power_limit_uw and time_window_us other fields are optional.
Common zone and control type attributes
----------------------------------------
enabled (rw): Enable/Disable controls at zone level or for all zones using
a control type.
Power Cap Client Driver Interface
==================================
The API summary:
Call powercap_register_control_type() to register control type object.
Call powercap_register_zone() to register a power zone (under a given
control type), either as a top-level power zone or as a subzone of another
power zone registered earlier.
The number of constraints in a power zone and the corresponding callbacks have
to be defined prior to calling powercap_register_zone() to register that zone.
To Free a power zone call powercap_unregister_zone().
To free a control type object call powercap_unregister_control_type().
Detailed API can be generated using kernel-doc on include/linux/powercap.h.

Some files were not shown because too many files have changed in this diff