misc: smpro-errmon: Add Ampere's SMpro error monitor driver

Add Ampere's SMpro error monitor driver for monitoring and reporting
RAS-related errors as reported by SMpro co-processor found on Ampere's
Altra processor family.

Signed-off-by: Quan Nguyen <quan@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20221031024442.2490881-3-quan@os.amperecomputing.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This commit is contained in:
Quan Nguyen
2022-10-31 09:44:41 +07:00
committed by Greg Kroah-Hartman
parent c08645ea21
commit 4a4a4e9eba
4 changed files with 806 additions and 0 deletions

View File

@@ -0,0 +1,264 @@
What: /sys/bus/platform/devices/smpro-errmon.*/error_[core|mem|pcie|other]_[ce|ue]
KernelVersion: 6.1
Contact: Quan Nguyen <quan@os.amperecomputing.com>
Description:
(RO) Contains the 48-byte Ampere (Vendor-Specific) Error Record printed
in hex format according to the table below:
+--------+---------------+-------------+------------------------------------------------------------+
| Offset | Field | Size (byte) | Description |
+--------+---------------+-------------+------------------------------------------------------------+
| 00 | Error Type | 1 | See :ref:`the table below <smpro-error-types>` for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 01 | Subtype | 1 | See :ref:`the table below <smpro-error-types>` for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 02 | Instance | 2 | See :ref:`the table below <smpro-error-types>` for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 04 | Error status | 4 | See ARM RAS specification for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 08 | Error Address | 8 | See ARM RAS specification for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 16 | Error Misc 0 | 8 | See ARM RAS specification for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 24 | Error Misc 1 | 8 | See ARM RAS specification for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 32 | Error Misc 2 | 8 | See ARM RAS specification for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 40 | Error Misc 3 | 8 | See ARM RAS specification for details |
+--------+---------------+-------------+------------------------------------------------------------+
The table below defines the value of error types, their subtype, subcomponent and instance:
.. _smpro-error-types:
+-----------------+------------+----------+----------------+----------------------------------------+
| Error Group | Error Type | Sub type | Sub component | Instance |
+-----------------+------------+----------+----------------+----------------------------------------+
| CPM (core) | 0 | 0 | Snoop-Logic | CPM # |
+-----------------+------------+----------+----------------+----------------------------------------+
| CPM (core) | 0 | 2 | Armv8 Core 1 | CPM # |
+-----------------+------------+----------+----------------+----------------------------------------+
| MCU (mem) | 1 | 1 | ERR1 | MCU # \| SLOT << 11 |
+-----------------+------------+----------+----------------+----------------------------------------+
| MCU (mem) | 1 | 2 | ERR2 | MCU # \| SLOT << 11 |
+-----------------+------------+----------+----------------+----------------------------------------+
| MCU (mem) | 1 | 3 | ERR3 | MCU # |
+-----------------+------------+----------+----------------+----------------------------------------+
| MCU (mem) | 1 | 4 | ERR4 | MCU # |
+-----------------+------------+----------+----------------+----------------------------------------+
| MCU (mem) | 1 | 5 | ERR5 | MCU # |
+-----------------+------------+----------+----------------+----------------------------------------+
| MCU (mem) | 1 | 6 | ERR6 | MCU # |
+-----------------+------------+----------+----------------+----------------------------------------+
| MCU (mem) | 1 | 7 | Link Error | MCU # |
+-----------------+------------+----------+----------------+----------------------------------------+
| Mesh (other) | 2 | 0 | Cross Point | X \| (Y << 5) \| NS <<11 |
+-----------------+------------+----------+----------------+----------------------------------------+
| Mesh (other) | 2 | 1 | Home Node(IO) | X \| (Y << 5) \| NS <<11 |
+-----------------+------------+----------+----------------+----------------------------------------+
| Mesh (other) | 2 | 2 | Home Node(Mem) | X \| (Y << 5) \| NS <<11 \| device<<12 |
+-----------------+------------+----------+----------------+----------------------------------------+
| Mesh (other) | 2 | 4 | CCIX Node | X \| (Y << 5) \| NS <<11 |
+-----------------+------------+----------+----------------+----------------------------------------+
| 2P Link (other) | 3 | 0 | N/A | Altra 2P Link # |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 0 | ERR0 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 1 | ERR1 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 2 | ERR2 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 3 | ERR3 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 4 | ERR4 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 5 | ERR5 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 6 | ERR6 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 7 | ERR7 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 8 | ERR8 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 9 | ERR9 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 10 | ERR10 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 11 | ERR11 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 12 | ERR12 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 13-21 | ERR13 | RC # + 1 |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TCU | 100 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU0 | 0 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU1 | 1 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU2 | 2 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU3 | 3 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU4 | 4 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU5 | 5 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU6 | 6 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU7 | 7 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU8 | 8 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU9 | 9 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| PCIe AER (pcie) | 7 | Root | 0 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| PCIe AER (pcie) | 7 | Device | 1 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| PCIe RC (pcie) | 8 | RCA HB | 0 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| PCIe RC (pcie) | 8 | RCB HB | 1 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| PCIe RC (pcie) | 8 | RASDP | 8 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| OCM (other) | 9 | ERR0 | 0 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| OCM (other) | 9 | ERR1 | 1 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| OCM (other) | 9 | ERR2 | 2 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMpro (other) | 10 | ERR0 | 0 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMpro (other) | 10 | ERR1 | 1 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMpro (other) | 10 | MPA_ERR | 2 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| PMpro (other) | 11 | ERR0 | 0 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| PMpro (other) | 11 | ERR1 | 1 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| PMpro (other) | 11 | MPA_ERR | 2 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
Example::
# cat error_other_ue
880807001e004010401040101500000001004010401040100c0000000000000000000000000000000000000000000000
The detail of each sysfs entries is as below:
+-------------+---------------------------------------------------------+----------------------------------+
| Error | Sysfs entry | Description (when triggered) |
+-------------+---------------------------------------------------------+----------------------------------+
| Core's CE | /sys/bus/platform/devices/smpro-errmon.*/error_core_ce | Core has CE error |
+-------------+---------------------------------------------------------+----------------------------------+
| Core's UE | /sys/bus/platform/devices/smpro-errmon.*/error_core_ue | Core has UE error |
+-------------+---------------------------------------------------------+----------------------------------+
| Memory's CE | /sys/bus/platform/devices/smpro-errmon.*/error_mem_ce | Memory has CE error |
+-------------+---------------------------------------------------------+----------------------------------+
| Memory's UE | /sys/bus/platform/devices/smpro-errmon.*/error_mem_ue | Memory has UE error |
+-------------+---------------------------------------------------------+----------------------------------+
| PCIe's CE | /sys/bus/platform/devices/smpro-errmon.*/error_pcie_ce | any PCIe controller has CE error |
+-------------+---------------------------------------------------------+----------------------------------+
| PCIe's UE | /sys/bus/platform/devices/smpro-errmon.*/error_pcie_ue | any PCIe controller has UE error |
+-------------+---------------------------------------------------------+----------------------------------+
| Other's CE | /sys/bus/platform/devices/smpro-errmon.*/error_other_ce | any other CE error |
+-------------+---------------------------------------------------------+----------------------------------+
| Other's UE | /sys/bus/platform/devices/smpro-errmon.*/error_other_ue | any other UE error |
+-------------+---------------------------------------------------------+----------------------------------+
UE: Uncorrect-able Error
CE: Correct-able Error
For details, see section `3.3 Ampere (Vendor-Specific) Error Record Formats,
Altra Family RAS Supplement`.
What: /sys/bus/platform/devices/smpro-errmon.*/overflow_[core|mem|pcie|other]_[ce|ue]
KernelVersion: 6.1
Contact: Quan Nguyen <quan@os.amperecomputing.com>
Description:
(RO) Return the overflow status of each type HW error reported:
- 0 : No overflow
- 1 : There is an overflow and the oldest HW errors are dropped
The detail of each sysfs entries is as below:
+-------------+-----------------------------------------------------------+---------------------------------------+
| Overflow | Sysfs entry | Description |
+-------------+-----------------------------------------------------------+---------------------------------------+
| Core's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_core_ce | Core CE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
| Core's UE | /sys/bus/platform/devices/smpro-errmon.*/overflow_core_ue | Core UE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
| Memory's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_mem_ce | Memory CE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
| Memory's UE | /sys/bus/platform/devices/smpro-errmon.*/overflow_mem_ue | Memory UE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
| PCIe's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_pcie_ce | any PCIe controller CE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
| PCIe's UE | /sys/bus/platform/devices/smpro-errmon.*/overflow_pcie_ue | any PCIe controller UE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
| Other's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_other_ce| any other CE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
| Other's UE | /sys/bus/platform/devices/smpro-errmon.*/overflow_other_ue| other UE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
where:
- UE: Uncorrect-able Error
- CE: Correct-able Error
What: /sys/bus/platform/devices/smpro-errmon.*/[error|warn]_[smpro|pmpro]
KernelVersion: 6.1
Contact: Quan Nguyen <quan@os.amperecomputing.com>
Description:
(RO) Contains the internal firmware error/warning printed as hex format.
The detail of each sysfs entries is as below:
+---------------+------------------------------------------------------+--------------------------+
| Error | Sysfs entry | Description |
+---------------+------------------------------------------------------+--------------------------+
| SMpro error | /sys/bus/platform/devices/smpro-errmon.*/error_smpro | system has SMpro error |
+---------------+------------------------------------------------------+--------------------------+
| SMpro warning | /sys/bus/platform/devices/smpro-errmon.*/warn_smpro | system has SMpro warning |
+---------------+------------------------------------------------------+--------------------------+
| PMpro error | /sys/bus/platform/devices/smpro-errmon.*/error_pmpro | system has PMpro error |
+---------------+------------------------------------------------------+--------------------------+
| PMpro warning | /sys/bus/platform/devices/smpro-errmon.*/warn_pmpro | system has PMpro warning |
+---------------+------------------------------------------------------+--------------------------+
For details, see section `5.10 RAS Internal Error Register Definitions,
Altra Family Soc BMC Interface Specification`.
What: /sys/bus/platform/devices/smpro-errmon.*/event_[vrd_warn_fault|vrd_hot|dimm_hot]
KernelVersion: 6.1
Contact: Quan Nguyen <quan@os.amperecomputing.com>
Description:
(RO) Contains the detail information in case of VRD/DIMM warning/hot events
in hex format as below::
AAAA
where:
- ``AAAA``: The event detail information data
The detail of each sysfs entries is as below:
+---------------+---------------------------------------------------------------+---------------------+
| Event | Sysfs entry | Description |
+---------------+---------------------------------------------------------------+---------------------+
| VRD HOT | /sys/bus/platform/devices/smpro-errmon.*/event_vrd_hot | VRD Hot |
+---------------+---------------------------------------------------------------+---------------------+
| VR Warn/Fault | /sys/bus/platform/devices/smpro-errmon.*/event_vrd_warn_fault | VR Warning or Fault |
+---------------+---------------------------------------------------------------+---------------------+
| DIMM HOT | /sys/bus/platform/devices/smpro-errmon.*/event_dimm_hot | DIMM Hot |
+---------------+---------------------------------------------------------------+---------------------+
For more details, see section `5.7 GPI Status Registers,
Altra Family Soc BMC Interface Specification`.

View File

@@ -176,6 +176,18 @@ config SGI_XP
this feature will allow for direct communication between SSIs
based on a network adapter and DMA messaging.
config SMPRO_ERRMON
tristate "Ampere Computing SMPro error monitor driver"
depends on MFD_SMPRO || COMPILE_TEST
help
Say Y here to get support for the SMpro error monitor function
provided by Ampere Computing's Altra and Altra Max SoCs. Upon
loading, the driver creates sysfs files which can be use to gather
multiple HW error data reported via read and write system calls.
To compile this driver as a module, say M here. The driver will be
called smpro-errmon.
config CS5535_MFGPT
tristate "CS5535/CS5536 Geode Multi-Function General Purpose Timer (MFGPT) support"
depends on MFD_CS5535

View File

@@ -23,6 +23,7 @@ obj-$(CONFIG_ENCLOSURE_SERVICES) += enclosure.o
obj-$(CONFIG_KGDB_TESTS) += kgdbts.o
obj-$(CONFIG_SGI_XP) += sgi-xp/
obj-$(CONFIG_SGI_GRU) += sgi-gru/
obj-$(CONFIG_SMPRO_ERRMON) += smpro-errmon.o
obj-$(CONFIG_CS5535_MFGPT) += cs5535-mfgpt.o
obj-$(CONFIG_GEHC_ACHC) += gehc-achc.o
obj-$(CONFIG_HP_ILO) += hpilo.o

529
drivers/misc/smpro-errmon.c Normal file

File diff suppressed because it is too large Load Diff