Pull cxl updates from Dan Williams:
"Compute Express Link (CXL) updates for this cycle.
The highlight is new driver-core infrastructure and CXL subsystem
changes for allowing lockdep to validate device_lock() usage. Thanks
to PeterZ for setting me straight on the current capabilities of the
lockdep API, and Greg acked it as well.
On the CXL ACPI side this update adds support for CXL _OSC so that
platform firmware knows that it is safe to still grant Linux native
control of PCIe hotplug and error handling in the presence of CXL
devices. A circular dependency problem was discovered between suspend
and CXL memory for cases where the suspend image might be stored in
CXL memory where that image also contains the PCI register state to
restore to re-enable the device. Disable suspend for now until an
architecture is defined to clarify that conflict.
Lastly a collection of reworks, fixes, and cleanups to the CXL
subsystem where support for snooping mailbox commands and properly
handling the "mem_enable" flow are the highlights.
Summary:
- Add driver-core infrastructure for lockdep validation of
device_lock(), and fixup a deadlock report that was previously
hidden behind the 'lockdep no validate' policy.
- Add CXL _OSC support for claiming native control of CXL hotplug and
error handling.
- Disable suspend in the presence of CXL memory unless and until a
protocol is identified for restoring PCI device context from memory
hosted on CXL PCI devices.
- Add support for snooping CXL mailbox commands to protect against
inopportune changes, like set-partition with the 'immediate' flag
set.
- Rework how the driver detects legacy CXL 1.1 configurations (CXL
DVSEC / 'mem_enable') before enabling new CXL 2.0 decode
configurations (CXL HDM Capability).
- Miscellaneous cleanups and fixes from -next exposure"
* tag 'cxl-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (47 commits)
cxl/port: Enable HDM Capability after validating DVSEC Ranges
cxl/port: Reuse 'struct cxl_hdm' context for hdm init
cxl/port: Move endpoint HDM Decoder Capability init to port driver
cxl/pci: Drop @info argument to cxl_hdm_decode_init()
cxl/mem: Merge cxl_dvsec_ranges() and cxl_hdm_decode_init()
cxl/mem: Skip range enumeration if mem_enable clear
cxl/mem: Consolidate CXL DVSEC Range enumeration in the core
cxl/pci: Move cxl_await_media_ready() to the core
cxl/mem: Validate port connectivity before dvsec ranges
cxl/mem: Fix cxl_mem_probe() error exit
cxl/pci: Drop wait_for_valid() from cxl_await_media_ready()
cxl/pci: Consolidate wait_for_media() and wait_for_media_ready()
cxl/mem: Drop mem_enabled check from wait_for_media()
nvdimm: Fix firmware activation deadlock scenarios
device-core: Kill the lockdep_mutex
nvdimm: Drop nd_device_lock()
ACPI: NFIT: Drop nfit_device_lock()
nvdimm: Replace lockdep_mutex with local lock classes
cxl: Drop cxl_device_lock()
cxl/acpi: Add root device lockdep validation
...
The CXL specification claims S3 support at a hardware level, but at a
system software level there are some missing pieces. Section 9.4 (CXL
2.0) rightly claims that "CXL mem adapters may need aux power to retain
memory context across S3", but there is no enumeration mechanism for the
OS to determine if a given adapter has that support. Moreover the save
state and resume image for the system may inadvertantly end up in a CXL
device that needs to be restored before the save state is recoverable.
I.e. a circular dependency that is not resolvable without a third party
save-area.
Arrange for the cxl_mem driver to fail S3 attempts. This still nominaly
allows for suspend, but requires unbinding all CXL memory devices before
the suspend to ensure the typical DRAM flow is taken. The cxl_mem unbind
flow is intended to also tear down all CXL memory regions associated
with a given cxl_memdev.
It is reasonable to assume that any device participating in a System RAM
range published in the EFI memory map is covered by aux power and
save-area outside the device itself. So this restriction can be
minimized in the future once pre-existing region enumeration support
arrives, and perhaps a spec update to clarify if the EFI memory map is
sufficent for determining the range of devices managed by
platform-firmware for S3 support.
Per Rafael, if the CXL configuration prevents suspend then it should
fail early before tasks are frozen, and mem_sleep should stop showing
'mem' as an option [1]. Effectively CXL augments the platform suspend
->valid() op since, for example, the ACPI ops are not aware of the CXL /
PCI dependencies. Given the split role of platform firmware vs OS
provisioned CXL memory it is up to the cxl_mem driver to determine if
the CXL configuration has elements that platform firmware may not be
prepared to restore.
Link: https://lore.kernel.org/r/CAJZ5v0hGVN_=3iU8OLpHY3Ak35T5+JcBM-qs8SbojKrpd0VXsA@mail.gmail.com [1]
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <len.brown@intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/165066828317.3907920.5690432272182042556.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Currently pm_pr_dbg() is used to filter kernel pm debug messages based
on pm_debug_messages_on flag. The problem is if we enable/disable this
flag it will affect all pm_pr_dbg() calls at once, so we can't
individually control them.
This patch changes pm_pr_dbg() implementation as such:
- If pm_debug_messages_on is enabled, print the message.
- If pm_debug_messages_on is disabled and CONFIG_DYNAMIC_DEBUG is
enabled, only print the messages explicitly enabled on
/sys/kernel/debug/dynamic_debug/control.
- If pm_debug_messages_on is disabled and CONFIG_DYNAMIC_DEBUG is
disabled, don't print the message.
Signed-off-by: David Cohen <dacohen@pm.me>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After commit e3728b50cd ("ACPI: PM: s2idle: Avoid possible race
related to the EC GPE") wakeup interrupts occurring immediately after
the one discarded by acpi_s2idle_wake() may be missed. Moreover, if
the SCI triggers again immediately after the rearming in
acpi_s2idle_wake(), that wakeup may be missed too.
The problem is that pm_system_irq_wakeup() only calls pm_system_wakeup()
when pm_wakeup_irq is 0, but that's not the case any more after the
interrupt causing acpi_s2idle_wake() to run until pm_wakeup_irq is
cleared by the pm_wakeup_clear() call in s2idle_loop(). However,
there may be wakeup interrupts occurring in that time frame and if
that happens, they will be missed.
To address that issue first move the clearing of pm_wakeup_irq to
the point at which it is known that the interrupt causing
acpi_s2idle_wake() to tun will be discarded, before rearming the SCI
for wakeup. Moreover, because that only reduces the size of the
time window in which the issue may manifest itself, allow
pm_system_irq_wakeup() to register two second wakeup interrupts in
a row and, when discarding the first one, replace it with the second
one. [Of course, this assumes that only one wakeup interrupt can be
discarded in one go, but currently that is the case and I am not
aware of any plans to change that.]
Fixes: e3728b50cd ("ACPI: PM: s2idle: Avoid possible race related to the EC GPE")
Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fix kernel-doc warnings in kernel/power/main.c by unmarking the
comment block as kernel-doc notation. This eliminates the following
kernel-doc warnings:
kernel/power/main.c:593: warning: expecting prototype for state(). Prototype was for state_show() instead
kernel/power/main.c:593: warning: Function parameter or member 'kobj' not described in 'state_show'
kernel/power/main.c:593: warning: Function parameter or member 'attr' not described in 'state_show'
kernel/power/main.c:593: warning: Function parameter or member 'buf' not described in 'state_show'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The only usage of suspend_attr_group is to put its address in an
array of pointers to const attribute_group structs.
Make it const to allow the compiler to put it into read-only memory.
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The current notifiers have the following error handling pattern all
over the place:
int err, nr;
err = __foo_notifier_call_chain(&chain, val_up, v, -1, &nr);
if (err & NOTIFIER_STOP_MASK)
__foo_notifier_call_chain(&chain, val_down, v, nr-1, NULL)
And aside from the endless repetition thereof, it is broken. Consider
blocking notifiers; both calls take and drop the rwsem, this means
that the notifier list can change in between the two calls, making @nr
meaningless.
Fix this by replacing all the __foo_notifier_call_chain() functions
with foo_notifier_call_chain_robust() that embeds the above pattern,
but ensures it is inside a single lock region.
Note: I switched atomic_notifier_call_chain_robust() to use
the spinlock, since RCU cannot provide the guarantee
required for the recovery.
Note: software_resume() error handling was broken afaict.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20200818135804.325626653@infradead.org
Debug messages from the system suspend/hibernation infrastructure
are disabled by default, and can only be enabled after the system
has boot up via /sys/power/pm_debug_messages.
This makes the hibernation resume hard to track as it involves system
boot up across hibernation. There's no chance for software_resume()
to track the resume process, for example.
Add a kernel command line option to set pm_debug_messages during
boot up.
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The sysfs attribute `/sys/power/sync_on_suspend` controls, whether or not
filesystems are synced by the kernel before system suspend.
Congruously, the behaviour of build-time switch CONFIG_SUSPEND_SKIP_SYNC
is slightly changed: It now defines the run-tim default for the new sysfs
attribute `/sys/power/sync_on_suspend`.
The run-time attribute is added because the existing corresponding
build-time Kconfig flag for (`CONFIG_SUSPEND_SKIP_SYNC`) is not flexible
enough. E.g. Linux distributions that provide pre-compiled kernels
usually want to stick with the default (sync filesystems before suspend)
but under special conditions this needs to be changed.
One example for such a special condition is user-space handling of
suspending block devices (e.g. using `cryptsetup luksSuspend` or `dmsetup
suspend`) before system suspend. The Kernel trying to sync filesystems
after the underlying block device already got suspended obviously leads
to dead-locks. Be aware that you have to take care of the filesystem sync
yourself before suspending the system in those scenarios.
Signed-off-by: Jonas Meurer <jonas@freesources.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Include the <linux/runtime_pm.h> for the definition of
pm_wq to avoid the following warning:
kernel/power/main.c:890:25: warning: symbol 'pm_wq' was not declared. Should it be static?
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use str_has_prefix() instead of strncmp() which is less
straightforward in decode_state().
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Userspace can get suspend stats from the suspend stats debugfs node.
Since debugfs doesn't have stable ABI, expose suspend stats in
sysfs under /sys/power/suspend_stats.
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Measure the filesystems sync time during system sleep more precisely.
Among other things, this allows the pr_cont() to be dropped from
ksys_sync_helper() and makes automatic system suspend and hibernation
profiling somewhat more straightforward.
Signed-off-by: Harry Pan <harry.pan@intel.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Create a common helper to sync filesystems for system suspend and
hibernation.
Signed-off-by: Harry Pan <harry.pan@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
At present, "systemctl suspend" and "shutdown" can run in parrallel. A
system can suspend after devices_shutdown(), and resume. Then the shutdown
task goes on to power off. This causes many devices are not really shut
off. Hence replacing reboot_mutex with system_transition_mutex (renamed
from pm_mutex) to achieve the exclusion. The renaming of pm_mutex as
system_transition_mutex can be better to reflect the purpose of the mutex.
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since pm_mutex is not exported using lock/unlock_system_sleep() from
inside a kernel module causes a "pm_mutex undefined" linker error.
Hence move lock/unlock_system_sleep() into kernel/power/main.c and
export these.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
For SoC to achieve its lowest power platform idle state a set of hardware
preconditions must be met. These preconditions or constraints can be
obtained by issuing a device specific method (_DSM) with function "1".
Refer to the document provided in the link below.
Here during initialization (from attach() callback of LPS0 device), invoke
function 1 to get the device constraints. Each enabled constraint is
stored in a table.
The devices in this table are used to check whether they were in required
minimum state, while entering suspend. This check is done from platform
freeze wake() callback, only when /sys/power/pm_debug_messages attribute
is non zero.
If any constraint is not met and device is ACPI power managed then it
prints the device information to kernel logs.
Also if debug is enabled in acpi/sleep.c, the constraint table and state
of each device on wake is dumped in kernel logs.
Since pm_debug_messages_on setting is used as condition to check
constraints outside kernel/power/main.c, pm_debug_messages_on is changed
to a global variable.
Link: http://www.uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The pm_test sysfs attribute is under CONFIG_PM_DEBUG, but it doesn't
make sense to provide it if CONFIG_PM_SLEEP is unset, so put it under
CONFIG_PM_SLEEP_DEBUG instead.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The messages printed by tk_debug_account_sleep_time() are basically
useful for system sleep debugging, so print them only when the other
debug messages from the core suspend/hibernate code are enabled.
While at it, make it clear that the messages from
tk_debug_account_sleep_time() are about timekeeping suspend
duration, because in general timekeeping may be suspeded and
resumed for multiple times during one system suspend-resume cycle.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Debug messages from the system suspend/hibernation infrastructure can
fill up the entire kernel log buffer in some cases and anyway they
are only useful for debugging. They depend on CONFIG_PM_DEBUG, but
that is set as a rule as some generally useful diagnostic facilities
depend on it too.
For this reason, avoid printing those messages by default, but make
it possible to turn them on as needed with the help of a new sysfs
attribute under /sys/power/.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/sysfs.h> work with const
attribute_group. So mark the non-const structs as const.
File size before:
text data bss dec hex filename
3802 624 32 4458 116a kernel/power/main.o
File size After adding 'const':
text data bss dec hex filename
3866 560 32 4458 116a kernel/power/main.o
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There are systems in which the platform doesn't support any special
sleep states, so suspend-to-idle (PM_SUSPEND_FREEZE) is the only
available system sleep state. However, some user space frameworks
only use the "mem" and (sometimes) "standby" sleep state labels, so
the users of those systems need to modify user space in order to be
able to use system suspend at all and that may be a pain in practice.
Commit 0399d4db3e (PM / sleep: Introduce command line argument for
sleep state enumeration) attempted to address this problem by adding
a command line argument to change the meaning of the "mem" string in
/sys/power/state to make it trigger suspend-to-idle (instead of
suspend-to-RAM).
However, there also are systems in which the platform does support
special sleep states, but suspend-to-idle is the preferred one anyway
(it even may save more energy than the platform-provided sleep states
in some cases) and the above commit doesn't help in those cases.
For this reason, rework the system sleep state selection interface
again (but preserve backwards compatibiliby). Namely, add a new
sysfs file, /sys/power/mem_sleep, that will control the system
suspend mode triggered by writing "mem" to /sys/power/state (in
analogy with what /sys/power/disk does for hibernation). Make it
select suspend-to-RAM ("deep" sleep) by default (if supported) and
fall back to suspend-to-idle ("s2idle") otherwise and add a new
command line argument, mem_sleep_default, allowing that default to
be overridden if need be.
At the same time, drop the relative_sleep_states command line
argument that doesn't make sense any more.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Mario Limonciello <mario.limonciello@dell.com>