Commit Graph

289 Commits

Author SHA1 Message Date
Borislav Petkov
6dc36b37f6 edac, mce: Fix wrong mask and macro usage
commit 35d824b28f upstream.

Correct two mishaps which prevented reporting error type (CECC vs UECC)
and extended error description.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-12 14:57:05 -07:00
Borislav Petkov
ceb2ef9f73 edac, mce: Filter out invalid values
commit 5b89d2f9ac upstream.

Print the CPU associated with the error only when the field is valid.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-04-01 15:58:40 -07:00
Borislav Petkov
dc5b9893d1 amd64_edac: Do not falsely trigger kerneloops
commit cab4d27764 upstream.

An unfortunate "WARNING" in the message amd64_edac dumps when the system
doesn't support DRAM ECC or ECC checking is not enabled in the BIOS
used to trigger kerneloops which qualified the message as an OOPS thus
misleading the users. See, e.g.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/422536
http://bugzilla.kernel.org/show_bug.cgi?id=15238

Downgrade the message level to KERN_NOTICE and fix the formulation.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-02-23 07:37:52 -08:00
Tamas Vincze
7f40c6b6ad edac: i5000_edac critical fix panic out of bounds
commit 118f3e1afd upstream.

EDAC MC0: INTERNAL ERROR: channel-b out of range (4 >= 4)
Kernel panic - not syncing: EDAC MC0: Uncorrected Error  (XEN) Domain 0 crashed: 'noreboot' set - not rebooting.

This happens because FERR_NF_FBD bit 28 is not updated on i5000.  Due to
that, both bits 28 and 29 may be equal to one, returning channel = 3.  As
this value is invalid, EDAC core generates the panic.

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14568

Signed-off-by: Tamas Vincze <tom@vincze.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-22 15:18:15 -08:00
Borislav Petkov
9127720087 amd64_edac: fix forcing module load/unload
commit 43f5e68733 upstream.

Clear the override flag after force-loading the module.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-06 15:05:16 -08:00
Borislav Petkov
15383230c0 amd64_edac: make driver loading more robust
commit 56b34b91e2 upstream.

Currently, the module does not initialize fully when the DIMMs aren't
ECC but remains still loaded. Propagate the error when no instance of
the driver is properly initialized and prevent further loading.

Reorganize and polish error handling in amd64_edac_init() while at it.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-06 15:05:14 -08:00
Borislav Petkov
44a529c6b3 amd64_edac: fix driver instance freeing
commit 8f68ed9728 upstream.

Fix use-after-free errors by pushing all memory-freeing calls to the end
of amd64_remove_one_instance().

Reported-by: Darren Jenkins <darrenrjenkins@gmail.com>
LKML-Reference: <1261370306.11354.52.camel@ICE-BOX>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-06 15:05:13 -08:00
Borislav Petkov
eb21839600 x86, msr: Add support for non-contiguous cpumasks
commit 505422517d upstream.

The current rd/wrmsr_on_cpus helpers assume that the supplied
cpumasks are contiguous. However, there are machines out there
like some K8 multinode Opterons which have a non-contiguous core
enumeration on each node (e.g. cores 0,2 on node 0 instead of 0,1), see
http://www.gossamer-threads.com/lists/linux/kernel/1160268.

This patch fixes out-of-bounds writes (see URL above) by adding per-CPU
msr structs which are used on the respective cores.

Additionally, two helpers, msrs_{alloc,free}, are provided for use by
the callers of the MSR accessors.

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <20091211171440.GD31998@aftab>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-06 15:05:11 -08:00
Borislav Petkov
26eb2ac67f amd64_edac: unify MCGCTL ECC switching
commit f6d6ae9657 upstream.

Unify almost identical code into one function and remove NUMA-specific
usage (specifically cpumask_of_node()) in favor of generic topology
methods.

Remove unused defines, while at it.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-06 15:05:10 -08:00
Rusty Russell
ebd2802865 cpumask: use modern cpumask style in drivers/edac/amd64_edac.c
commit ba578cb34a upstream.

cpumask_t -> struct cpumask, and don't put one on the stack.  (Note: this
is actually on the stack unless CONFIG_CPUMASK_OFFSTACK=y).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-06 15:05:07 -08:00
Borislav Petkov
17adea01b9 amd64_edac: fix CECCs reporting
Shift error type bits properly.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-11-04 14:04:06 +01:00
Li Hong
a3c4c58085 amd64_edac: fix a wrong goto clause in amd64_edac.c
In amd64_edac_init(void) in amd64_edac.c, cache_k8_northbridges() is
called before pci_register_driver. If it fails, should exit with err
directly.

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Acked-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-11-04 14:02:32 +01:00
Keith Mannthey
c2494ace99 edac: i5100 fix initialization code
Allow csrows to properly initialize when the topology only has active
channels on 2 and 3.  This new check allows proper detection and
initialization in this topology.  Only checking the first mrt that
represented channels 0 and 1 is not sufficient.

I also fixed up the related debug information path.  I can submit as a 2nd
patch if needed.

Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Acked-by: Aristeu Rozanski <aris@ruivo.org>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-29 07:39:30 -07:00
Ira W. Snyder
0616fb003d edac: i5400 fix missing CONFIG_PCI define
When building without CONFIG_PCI the edac_pci_idx variable is unused,
causing a build-time warning.  Wrap the variable in #ifdef CONFIG_PCI,
just like the rest of the PCI support.

Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-29 07:39:30 -07:00
Jeff Roberson
156edd4aaa edac: i5400 fix csrow mapping
The i5400 EDAC driver has several bugs with chip-select row computation
which most likely lead to bugs in detailed error reporting.  Attempts to
contact the authors have gone mostly unanswered so I am presenting my diff
here.  I do not subscribe to lkml and would appreciate being kept in the
cc.

The most egregious problem was miscalculating the addresses of MTR
registers after register 0 by assuming they are 32bit rather than 16.
This caused the driver to miss half of the memories.  Most motherboards
tend to have only 8 dimm slots and not 16, so this may not have been
noticed before.

Further, the row calculations multiplied the number of dimms several
times, ultimately ending up with a maximum row of 32.  The chipset only
supports 4 dimms in each of 4 channels, so csrow could not be higher than
4 unless you use a row per-rank with dual-rank dimms.  I opted to
eliminate this behavior as it is confusing to the user and the error
reporting works by slot and not rank.  This gives a much clearer view of
memory by slot and channel in /sys.

Signed-off-by: Jeff Roberson <jroberson@jroberson.net>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-29 07:39:30 -07:00
Borislav Petkov
4997811e3b amd64_edac: fix DRAM base and limit extraction masks, v2
This is a proper fix as a follow-up to 66216a7 and 916d11b.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-10-16 18:51:22 +02:00
Linus Torvalds
624235c5b3 Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, pci: Correct spelling in a comment
  x86: Simplify bound checks in the MTRR code
  x86: EDAC: carve out AMD MCE decoding logic
  initcalls: Add early_initcall() for modules
  x86: EDAC: MCE: Fix MCE decoding callback logic
2009-10-08 12:06:36 -07:00
Borislav Petkov
94baaee494 amd64_edac: beef up DRAM error injection
When injecting DRAM ECC errors (F3xBC_x8), EccVector[15:0] is a bitmask
of which bits should be error injected when written to and holds the
payload of 16-bit DRAM word when read, respectively.

Add /sysfs members to show the DRAM ECC section/word/vector.

Fail wrong injection values entered over /sysfs instead of truncating
them.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-10-07 16:51:28 +02:00
Borislav Petkov
66216a7a15 amd64_edac: fix DRAM base and limit extraction
On Fam10h and above, F1x[1, 0][7C:40] are DRAM Base/Limit registers
which specify the destination node of a DRAM address. Those address
boundaries are being extracted into ->dram_base[] and ->dram_limit[].
Correct the extraction masks to match the respective address bits.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-10-07 16:51:15 +02:00
Borislav Petkov
9d858bb10a amd64_edac: fix chip select handling
Different processor families support a different number of chip selects.
Handle this in a family-dependent way with the proper values assigned at
init time (see amd64_set_dct_base_and_mask).

Remove _DCSM_COUNT defines since they're used at one place and originate
from public documentation.

CC: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-10-07 16:50:50 +02:00
Keith Mannthey
2cff18c22c amd64_edac: simple fix to allow reporting of CECC errors
This allows the errors to be further decoded and mapped to csrows.
Tested with ECC debug dimms and an Rev F cpu based system.

Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-10-07 16:49:58 +02:00
Borislav Petkov
8edc544589 amd64_edac: fix K8 intlv_sel check
The check when DRAM interleaving is enabled should be done against the
pvt->dram_IntlvSel field and not against the ->dram_limit.

Simplify first loop and fixup printk formatting while at it.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-10-07 16:49:43 +02:00
Borislav Petkov
72f158fe6f amd64_edac: fix interleave enable tests
The pvt->dram_IntlvEn saves the 3 "Interleave Enable" bits already
right-shifted by 8 so the check in find_mc_by_sys_addr() by shifting the
values to the left 8 bits is wrong.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-10-07 16:48:08 +02:00
Borislav Petkov
916d11b2b5 amd64_edac: fix DRAM base and limit address extraction
K8 DRAM base and limit addresses from F1x40 +8*i and F1x44 + 8*i, where
i in (0..7) are both bits 39-24 and therefore the shifting should be
done by 24 and not by 8.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-10-07 16:47:51 +02:00
Borislav Petkov
3011b20da9 amd64_edac: fix driver instance lookup table allocation
Allocate memory statically for 8-node machines max for simplicity
instead of relying on MAX_NUMNODES which is 0 on !CONFIG_NUMA builds.

Spotted by Jan Beulich.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-10-07 16:47:34 +02:00