Commit Graph

892 Commits

Author SHA1 Message Date
Mauro Carvalho Chehab 9713faecff EDAC: Merge mci.mem_is_per_rank with mci.csbased
Both mci.mem_is_per_rank and mci.csbased denote the same thing: the
memory controller is csrows based. Merge both fields into one.

There's no need for the driver to actually fill it, as the core detects
it by checking if one of the layers has the csrows type as part of the
memory hierarchy:

	if (layers[i].type == EDAC_MC_LAYER_CHIP_SELECT)
			per_rank = true;

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
2013-03-16 06:32:30 +01:00
Mauro Carvalho Chehab 1eef128254 amd64_edac: Correct DIMM sizes
We were filling the csrow size with a wrong value. 16a528ee39 ("EDAC:
Fix csrow size reported in sysfs") tried to address the issue. It fixed
the report with the old API but not with the new one. Correct it for the
new API too.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
[ make it a per-csrow accounting regardless of ->channel_count ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2013-03-16 06:32:02 +01:00
Stephen Hemminger fbe2d3616c EDAC: Make sysfs functions static
Fixes lots of sparse warnings here.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
2013-03-05 11:33:57 +01:00
Linus Torvalds ad6c2c2eb3 Merge branch 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac
Pull EDAC fixes and ghes-edac from Mauro Carvalho Chehab:
 "For:

   - Some fixes at edac drivers (i7core_edac, sb_edac, i3200_edac);
   - error injection support for i5100, when EDAC debug is enabled;
   - fix edac when it is loaded builtin (early init for the subsystem);
   - a "Firmware First" EDAC driver, allowing ghes to report errors via
     EDAC (ghes-edac).

  With regards to ghes-edac, this fixes a longstanding BZ at Red Hat
  that happens with Nehalem and Sandy Bridge CPUs: when both GHES and
  i7core_edac or sb_edac are running, the error reports are
  unpredictable, as both BIOS and OS race to access the registers.  With
  ghes-edac, the EDAC core will refuse to register any other concurrent
  memory error driver.

  This patchset moves the ghes struct definitions to a separate header
  file (include/acpi/ghes.h) and adds 3 hooks at apei/ghes.c to
  register/unregister and to report errors via ghes-edac.  Those changes
  were acked by ghes driver maintainer (Huang)."

* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac: (30 commits)
  i5100_edac: convert to use simple_open()
  ghes_edac: fix to use list_for_each_entry_safe() when delete list items
  ghes_edac: Fix RAS tracing
  ghes_edac: Make it compliant with UEFI spec 2.3.1
  ghes_edac: Improve driver's printk messages
  ghes_edac: Don't credit the same memory dimm twice
  ghes_edac: do a better job of filling EDAC DIMM info
  ghes_edac: add support for reporting errors via EDAC
  ghes_edac: Register at EDAC core the BIOS report
  ghes: add the needed hooks for EDAC error report
  ghes: move structures/enum to a header file
  edac: add support for error type "Info"
  edac: add support for raw error reports
  edac: reduce stack pressure by using a pre-allocated buffer
  edac: lock module owner to avoid error report conflicts
  edac: remove proc_name from mci structure
  edac: add a new memory layer type
  edac: initialize the core earlier
  edac: better report error conditions in debug mode
  i5100_edac: Remove two checkpatch warnings
  ...
2013-02-28 20:42:33 -08:00
Wei Yongjun b0769891ba i5100_edac: convert to use simple_open()
This removes an open coded simple_open() function and
replaces file operations references to the function
with simple_open() instead.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-26 10:06:18 -03:00
Wei Yongjun 5dae92a718 ghes_edac: fix to use list_for_each_entry_safe() when delete list items
Since we will remove items off the list using list_del() we need
to use a safe version of the list_for_each_entry() macro aptly named
list_for_each_entry_safe().

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-26 10:05:35 -03:00
Mauro Carvalho Chehab 8ae8f50ad8 ghes_edac: Fix RAS tracing
With the current version of CPER, there's no way to associate an
error with the memory error. So, the error location in EDAC
layers is unused.

As CPER has its own idea about memory architectural layers, just
output whatever is there inside the driver's detail at the RAS
tracepoint.

The EDAC location keeps untouched, in the case that, in some future,
we could actually map the error into the dimm labels.

Now, the error message:

[   72.396625] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[   72.396627] {1}[Hardware Error]: APEI generic hardware error status
[   72.396628] {1}[Hardware Error]: severity: 2, corrected
[   72.396630] {1}[Hardware Error]: section: 0, severity: 2, corrected
[   72.396632] {1}[Hardware Error]: flags: 0x01
[   72.396634] {1}[Hardware Error]: primary
[   72.396635] {1}[Hardware Error]: section_type: memory error
[   72.396637] {1}[Hardware Error]: error_status: 0x0000000000000400
[   72.396638] {1}[Hardware Error]: node: 3
[   72.396639] {1}[Hardware Error]: card: 0
[   72.396640] {1}[Hardware Error]: module: 0
[   72.396641] {1}[Hardware Error]: device: 0
[   72.396643] {1}[Hardware Error]: error_type: 18, unknown
[   72.396666] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:0 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory)

Is properly represented on the trace event:

     kworker/0:2-584   [000] ....    72.396657: mc_event: 1 Corrected error: reserved error (18) on unknown label (mc:0 location👎-1:-1 address:0x00000000 grain:1 syndrome:0x00000000 APEI location: node:3 card:0 module:0 status(0x0000000000000400): Storage error in DRAM memory)

Tested on a 4 sockets E5-4650 Sandy Bridge machine.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-25 19:42:17 -03:00
Mauro Carvalho Chehab 689c9cd812 ghes_edac: Make it compliant with UEFI spec 2.3.1
The UEFI spec defines the memory error types ans the bits that
validate each field on the memory error record, at
Appendix N om items N.2.5 (Memory Error Section) and
N.2.11 (Error Status). Make the error description compliant with
it, only showing the valid fields.

The EDAC error log is now properly reporting the error:

[  281.556854] mce: [Hardware Error]: Machine check events logged
[  281.557042] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[  281.557044] {2}[Hardware Error]: APEI generic hardware error status
[  281.557046] {2}[Hardware Error]: severity: 2, corrected
[  281.557048] {2}[Hardware Error]: section: 0, severity: 2, corrected
[  281.557050] {2}[Hardware Error]: flags: 0x01
[  281.557052] {2}[Hardware Error]: primary
[  281.557053] {2}[Hardware Error]: section_type: memory error
[  281.557055] {2}[Hardware Error]: error_status: 0x0000000000000400
[  281.557056] {2}[Hardware Error]: node: 3
[  281.557057] {2}[Hardware Error]: card: 0
[  281.557058] {2}[Hardware Error]: module: 1
[  281.557059] {2}[Hardware Error]: device: 0
[  281.557061] {2}[Hardware Error]: error_type: 18, unknown
[  281.557067] EDAC DEBUG: ghes_edac_report_mem_error: error validation_bits: 0x000040b9
[  281.557084] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:1 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory)

Tested on a 4 CPUs E5-4650 Sandy Bridge machine.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-25 19:42:16 -03:00
Mauro Carvalho Chehab d2a6856614 ghes_edac: Improve driver's printk messages
Provide a better infrastructure for printk's inside the driver:
	- use edac_dbg() for debug messages;
	- standardize the usage of pr_info();
	- provide warning about the risk of relying on this
	  driver.

While here, changes the size of a fake memory to 1 page. This is
as good or as bad as 1000 pages, but it is easier for userspace to
detect, as I don't expect that any machine implementing GHES would
provide just 1 page available ;)

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

Conflicts:
	drivers/edac/ghes_edac.c
2013-02-25 19:42:15 -03:00
Mauro Carvalho Chehab 5ee726db52 ghes_edac: Don't credit the same memory dimm twice
On my tests on a 4xE5-4650 CPU's system, the GHES
EDAC driver is called twice. As the SMBIOS DMI enumeration
call will seek for the entire DIMM sockets in the system, on
this machine, equipped with 128 GB of RAM, the memory is
displayed twice:

          +-----------------------+
          |    mc0    |    mc1    |
----------+-----------------------+
memory45: |  8192 MB  |  8192 MB  |
memory44: |     0 MB  |     0 MB  |
----------+-----------------------+
memory43: |     0 MB  |     0 MB  |
memory42: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory41: |     0 MB  |     0 MB  |
memory40: |     0 MB  |     0 MB  |
----------+-----------------------+
memory39: |  8192 MB  |  8192 MB  |
memory38: |     0 MB  |     0 MB  |
----------+-----------------------+
memory37: |     0 MB  |     0 MB  |
memory36: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory35: |     0 MB  |     0 MB  |
memory34: |     0 MB  |     0 MB  |
----------+-----------------------+
memory33: |  8192 MB  |  8192 MB  |
memory32: |     0 MB  |     0 MB  |
----------+-----------------------+
memory31: |     0 MB  |     0 MB  |
memory30: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory29: |     0 MB  |     0 MB  |
memory28: |     0 MB  |     0 MB  |
----------+-----------------------+
memory27: |  8192 MB  |  8192 MB  |
memory26: |     0 MB  |     0 MB  |
----------+-----------------------+
memory25: |     0 MB  |     0 MB  |
memory24: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory23: |     0 MB  |     0 MB  |
memory22: |     0 MB  |     0 MB  |
----------+-----------------------+
memory21: |  8192 MB  |  8192 MB  |
memory20: |     0 MB  |     0 MB  |
----------+-----------------------+
memory19: |     0 MB  |     0 MB  |
memory18: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory17: |     0 MB  |     0 MB  |
memory16: |     0 MB  |     0 MB  |
----------+-----------------------+
memory15: |  8192 MB  |  8192 MB  |
memory14: |     0 MB  |     0 MB  |
----------+-----------------------+
memory13: |     0 MB  |     0 MB  |
memory12: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory11: |     0 MB  |     0 MB  |
memory10: |     0 MB  |     0 MB  |
----------+-----------------------+
memory9:  |  8192 MB  |  8192 MB  |
memory8:  |     0 MB  |     0 MB  |
----------+-----------------------+
memory7:  |     0 MB  |     0 MB  |
memory6:  |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory5:  |     0 MB  |     0 MB  |
memory4:  |     0 MB  |     0 MB  |
----------+-----------------------+
memory3:  |  8192 MB  |  8192 MB  |
memory2:  |     0 MB  |     0 MB  |
----------+-----------------------+
memory1:  |     0 MB  |     0 MB  |
memory0:  |  8192 MB  |  8192 MB  |
----------+-----------------------+

Total sum of 256 GB.

As there's no reliable way to credit DIMMS to the right memory
controller, just put everything on memory controller 0 (with should
always exist).

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-25 19:42:14 -03:00
Mauro Carvalho Chehab 32fa1f53c2 ghes_edac: do a better job of filling EDAC DIMM info
Instead of just faking a random value for the DIMM data, get
the information that it is available via DMI table.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-25 19:42:13 -03:00
Mauro Carvalho Chehab f04c62a703 ghes_edac: add support for reporting errors via EDAC
Now that the EDAC core is capable of just forward the errors via
the userspace API, add a report mechanism for the GHES errors.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-25 19:42:13 -03:00
Mauro Carvalho Chehab 77c5f5d2f2 ghes_edac: Register at EDAC core the BIOS report
Register GHES at EDAC MC core, in order to avoid other
drivers to also handle errors and mangle with error data.

The edac core will warrant that just one driver will be used,
so the first one to register (BIOS first) will be the one that
will be reporting the hardware errors.

For now, the EDAC driver does nothing but to register at the
EDAC core, preventing the hardware-driven mechanism to
interfere with GHES.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-25 19:42:12 -03:00
Linus Torvalds 06991c28f3 Merge tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core patches from Greg Kroah-Hartman:
 "Here is the big driver core merge for 3.9-rc1

  There are two major series here, both of which touch lots of drivers
  all over the kernel, and will cause you some merge conflicts:

   - add a new function called devm_ioremap_resource() to properly be
     able to check return values.

   - remove CONFIG_EXPERIMENTAL

  Other than those patches, there's not much here, some minor fixes and
  updates"

Fix up trivial conflicts

* tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits)
  base: memory: fix soft/hard_offline_page permissions
  drivercore: Fix ordering between deferred_probe and exiting initcalls
  backlight: fix class_find_device() arguments
  TTY: mark tty_get_device call with the proper const values
  driver-core: constify data for class_find_device()
  firmware: Ignore abort check when no user-helper is used
  firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER
  firmware: Make user-mode helper optional
  firmware: Refactoring for splitting user-mode helper code
  Driver core: treat unregistered bus_types as having no devices
  watchdog: Convert to devm_ioremap_resource()
  thermal: Convert to devm_ioremap_resource()
  spi: Convert to devm_ioremap_resource()
  power: Convert to devm_ioremap_resource()
  mtd: Convert to devm_ioremap_resource()
  mmc: Convert to devm_ioremap_resource()
  mfd: Convert to devm_ioremap_resource()
  media: Convert to devm_ioremap_resource()
  iommu: Convert to devm_ioremap_resource()
  drm: Convert to devm_ioremap_resource()
  ...
2013-02-21 12:05:51 -08:00
Mauro Carvalho Chehab e7e248304c edac: add support for raw error reports
That allows APEI GHES driver to report errors directly, using
the EDAC error report API.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-21 14:16:03 -03:00
Mauro Carvalho Chehab c7ef764554 edac: reduce stack pressure by using a pre-allocated buffer
The number of variables at the stack is too big.
Reduces the stack usage by using a pre-allocated error
buffer.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-21 13:48:45 -03:00
Mauro Carvalho Chehab 80cc7d87d5 edac: lock module owner to avoid error report conflicts
APEI GHES and i7core_edac/sb_edac currently can be loaded at
the same time, but those are Highlander modules:
	"There can be only one".

There are two reasons for that:

1) Each driver assumes that it is the only one registering at
   the EDAC core, as it is driver's responsibility to number
   the memory controllers, and all of them start from 0;

2) If BIOS is handling the memory errors, the OS can't also be
   doing it, as one will mangle with the other.

So, we need to add an module owner's lock at the EDAC core,
in order to avoid having two different modules handling memory
errors at the same time. The best way for doing this lock seems
to use the driver's name, as this is unique, and won't require
changes on every driver.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-21 11:06:38 -03:00
Mauro Carvalho Chehab c66b5a79a9 edac: add a new memory layer type
There are some cases where the memory controller layout is
completely hidden. This is the case of firmware-driven error
code, like the one provided by GHES. Add a new layer to be
used on such memory error report mechanisms.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-21 11:06:37 -03:00
Mauro Carvalho Chehab 4ab19b06ac edac: initialize the core earlier
In order for it to work with it builtin, the EDAC core should
be initialized earlier, otherwise the ghes_edac driver initializes
before edac_mc_sysfs_init() being called:

...
[    4.998373] EDAC MC0: Giving out device to 'ghes_edac.c' 'ghes_edac': DEV ghes
...
[    4.998373] EDAC MC1: Giving out device to 'ghes_edac.c' 'ghes_edac': DEV ghes
[    6.519495] EDAC MC: Ver: 3.0.0
[    6.523749] EDAC DEBUG: edac_mc_sysfs_init: device mc created

The net result is that no EDAC sysfs nodes will appear.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-21 11:06:36 -03:00
Mauro Carvalho Chehab 3d958823e2 edac: better report error conditions in debug mode
It is hard to find what's wrong without a proper error
report. Improve it, in debug mode.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-21 11:06:35 -03:00
Mauro Carvalho Chehab 59b9796d1e i5100_edac: Remove two checkpatch warnings
The last changeset introduced a few checkpatch warnings:

WARNING: debugfs_remove_recursive(NULL) is safe this check is probably not required
261: FILE: drivers/edac/i5100_edac.c:1207:
+       if (priv->debugfs)
+               debugfs_remove_recursive(priv->debugfs);

WARNING: debugfs_remove(NULL) is safe this check is probably not required
290: FILE: drivers/edac/i5100_edac.c:1250:
+       if (i5100_debugfs)
+               debugfs_remove(i5100_debugfs);

Get rid of them.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-21 11:06:34 -03:00
Niklas Söderlund 9cbc6d38f2 i5100_edac: connect fault injection to debugfs node
Create a debugfs direcotry i5100_edac/mcX for each memory controller and
add nodes to control how fault injection is preformed.

After configuring an injection using inject_channel, inject_deviceptr1,
inject_deviceptr2, inject_eccmask1, inject_eccmask2 and inject_hlinesel
trigger the injection by writing anything to inject_enable.

Example of a CE injection:

echo 0 > /sys/kernel/debug/i5100_edac/mc0/inject_channel
echo 1 > /sys/kernel/debug/i5100_edac/mc0/inject_hlinesel
echo 61440 > /sys/kernel/debug/i5100_edac/mc0/inject_eccmask1
echo 1 > /sys/kernel/debug/i5100_edac/mc0/inject_enable

Example of UE injection:

echo 0 > /sys/kernel/debug/i5100_edac/mc0/inject_channel
echo 2 > /sys/kernel/debug/i5100_edac/mc0/inject_hlinesel
echo 65535 > /sys/kernel/debug/i5100_edac/mc0/inject_eccmask1
echo 65535 > /sys/kernel/debug/i5100_edac/mc0/inject_eccmask2
echo 17 > /sys/kernel/debug/i5100_edac/mc0/inject_deviceptr1
echo 0 > /sys/kernel/debug/i5100_edac/mc0/inject_deviceptr2
echo 1 > /sys/kernel/debug/i5100_edac/mc0/inject_enable

Sometimes it is needed to enable the injection more then once (echo to
the inject_enable node) for the injection to happen, I am not sure why.

Signed-off-by: Niklas Söderlund <niklas.soderlund@ericsson.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-21 11:06:34 -03:00
Niklas Söderlund 53ceafd6a2 i5100_edac: add fault injection code
Add fault injection based on information datasheet for i5100, see 1. In
addition to the i5100 datasheet some missing information on injection
functions where found through experimentation and the i7300 datasheet,
see 2.

[1] Intel 5100 Memory Controller Hub Chipset
    Doc.Nr: 318378
    http://www.intel.com/content/dam/doc/datasheet/5100-
    memory-controller-hub-chipset-datasheet.pdf

[2] Intel 7300 Chipset MemoryController Hub (MCH)
    Doc.Nr: 318082
	http://www.intel.com/assets/pdf/datasheet/318082.pdf

Signed-off-by: Niklas Söderlund <niklas.soderlund@ericsson.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-21 11:06:33 -03:00
Niklas Söderlund 52608ba205 i5100_edac: probe for device 19 function 0
Probe and store the device handle for the device 19 function 0 during
driver initialization. The device is used during fault injection.

Signed-off-by: Niklas Söderlund <niklas.soderlund@ericsson.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-21 11:06:32 -03:00
Mauro Carvalho Chehab e7100478fa edac: only create sdram_scrub_rate where supported
Currently, sdram_scrub_rate sysfs node is created even if the device
doesn't support get/set the scub rate. Change the logic to only
create this device node when the operation is supported.

Reported-by: Felipe Balbi <balbi@ti.com>
Acked-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-21 11:06:31 -03:00