It is very useful to have the family/model/stepping along with the reported
error, so dump it. This saves us from having to ask the bug reporter for it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
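To illustrate what "family/model/stepping" means here, the sketch below decodes them from the EAX value of CPUID leaf 1, following the usual x86 rules: the extended family is added only when the base family is 0xF, and the extended model is prepended for families 6 and up. The struct and function names are hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for a decoded CPU signature. */
struct cpu_sig {
	unsigned int family;
	unsigned int model;
	unsigned int stepping;
};

/* Decode CPUID leaf 1 EAX: stepping [3:0], model [7:4], family [11:8],
 * extended model [19:16], extended family [27:20]. */
static struct cpu_sig decode_cpu_sig(uint32_t eax)
{
	struct cpu_sig s;

	s.stepping = eax & 0xF;
	s.model    = (eax >> 4) & 0xF;
	s.family   = (eax >> 8) & 0xF;

	/* Extended family only counts when the base family is 0xF. */
	if (s.family == 0xF)
		s.family += (eax >> 20) & 0xFF;

	/* Extended model is prepended for families >= 6. */
	if (s.family >= 0x6)
		s.model += ((eax >> 16) & 0xF) << 4;

	return s;
}
```

For example, an EAX value of 0x00100F42 decodes to family 0x10, model 0x4, stepping 0x2, which is the triple a bug report would carry.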
Having the functional unit names in each bank decode is only misleading,
as this code supports multiple families and there's no guarantee that the
mapping between FUs and MCE banks will stay the same.
Also, knowing the functional unit name doesn't help much, since you
end up looking at the respective BKDG anyway.
So drop all FU references and use the MC bank numbers instead.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
This removes an open-coded simple_open() function and replaces the file
operations references to it with simple_open() instead.
The dpatch engine was used to auto-generate this patch.
(https://github.com/weiyj/dpatch)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
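As a rough userspace model of what the shared helper does: simple_open() copies the inode's private pointer into the file's private_data so later read/write handlers can find their data. The structs below are minimal stand-ins that model only the two fields involved; they are not the kernel types, and simple_open_model and file_ops_model are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for struct inode / struct file (NOT the real types). */
struct inode { void *i_private; };
struct file  { void *private_data; };

/* Model of simple_open(): hand the inode's private pointer to the file
 * so handlers can reach their data through file->private_data. */
static int simple_open_model(struct inode *inode, struct file *file)
{
	if (inode->i_private)
		file->private_data = inode->i_private;
	return 0;
}

/* Hypothetical ops table: instead of each driver carrying its own copy
 * of the open routine, it points at the shared helper. */
struct file_ops_model {
	int (*open)(struct inode *, struct file *);
};

static const struct file_ops_model example_fops = {
	.open = simple_open_model,
};
```

The design point is that every open-coded copy was byte-for-byte the same few lines, so a single shared helper removes duplication without changing behaviour.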
This is the complement to the previous commit "EDAC: Fix csrow size
reported in sysfs". It fixes the memory controller size reporting on
csrow-based memory controllers. The csrow size is already combined for
both channels; without this patch, the memory size is reported doubled.
Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
On csrow-based memory controllers, we combine the csrow size from both
channels, so there's no need to do that again in csrow_size_show();
doing so reports double the actual size of a csrow.
Fix it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
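The double counting described in the two commits above can be modelled as follows. This is a simplified sketch: the struct layout and the csbased flag passed as a parameter are assumptions for illustration. On a csrow-based controller each channel entry already holds the combined csrow size, so only one of them should be counted, both for the csrow and for the controller total.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CHANNELS 2

/* Simplified csrow model (hypothetical layout). */
struct csrow_model {
	unsigned int nr_channels;
	unsigned long nr_pages[MAX_CHANNELS];
};

/* Size of one csrow in pages. On csrow-based controllers every channel
 * already carries the combined size, so summing them would double it. */
static unsigned long csrow_pages(const struct csrow_model *csrow, bool csbased)
{
	unsigned long pages = 0;
	unsigned int i;

	if (csbased)
		return csrow->nr_pages[0];	/* already combined */

	for (i = 0; i < csrow->nr_channels; i++)
		pages += csrow->nr_pages[i];
	return pages;
}

/* Memory controller size: the sum of its csrow sizes. */
static unsigned long mc_pages(const struct csrow_model *rows, unsigned int n,
			      bool csbased)
{
	unsigned long total = 0;
	unsigned int i;

	for (i = 0; i < n; i++)
		total += csrow_pages(&rows[i], csbased);
	return total;
}
```

With two csrows of 1024 and 512 combined pages, the csrow-based total is 1536 pages; summing every channel instead would report 3072, i.e. the doubled figure the commits fix.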
Make sure the code accounts for K8 having only one DCT; reformat and
clean up the code, correct debug messages and remove unused code.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Instead of open-coding it, use the DBAM_DIMM macro, which we already
have, in amd64_csrow_nr_pages().
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
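A sketch of the kind of field extraction DBAM_DIMM performs: the DBAM register packs one 4-bit address-mapping field per DIMM, so DIMM i lives at bits [4i+3:4i]. The macro body below is our assumption of the driver's definition, and csrow_cs_mode() is a hypothetical helper that assumes two chip-select rows per DIMM.

```c
#include <assert.h>
#include <stdint.h>

/* Extract the 4-bit mapping field for DIMM i from a DBAM value.
 * Assumed to mirror the driver's DBAM_DIMM() macro. */
#define DBAM_DIMM(i, reg)	((((reg) >> (4 * (i)))) & 0xF)

/* Hypothetical helper: chip-select rows come in pairs per DIMM,
 * so csrow n maps to DIMM n / 2. */
static unsigned int csrow_cs_mode(unsigned int csrow_nr, uint32_t dbam)
{
	return DBAM_DIMM(csrow_nr / 2, dbam);
}
```

Using the macro instead of repeating the shift-and-mask keeps the bit layout defined in one place.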
This basically reverts 603adaf6b3 ("amd64_edac: fix K8 chip select
reporting") because it was a clumsy workaround for DIMM size reporting
on K8 which was superseded by a much more correct one in 41d8bfaba7
("amd64_edac: Improve DRAM address mapping"), which did not remove the
earlier workaround. Remove it now, finally.
Reported-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Rewrite the CE/UE paths so that they use the same code, and drop the
additional code duplication in handle_ue. Add a struct err_info which
collects the info required for error reporting. This, in turn, helps slim
all edac_mc_handle_error() calls down to one.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
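The idea of collecting everything in one struct and having a single reporting path for both CE and UE can be sketched as below. The field set, the err_type enum and report_error() are illustrative assumptions, not the driver's exact definitions; report_error() stands in for the one remaining edac_mc_handle_error() call.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum err_type { ERR_CE, ERR_UE };	/* hypothetical */

/* Everything the reporting step needs, filled in once by the decoder.
 * Illustrative field set only. */
struct err_info {
	int err_code;		/* 0 = decoded OK, nonzero = decode failure */
	int csrow;
	int channel;
	uint16_t syndrome;
	uint32_t page;
	uint32_t offset;
};

/* Single reporting path shared by CE and UE: only the type differs. */
static int report_error(char *buf, size_t len, enum err_type type,
			const struct err_info *err)
{
	const char *t = (type == ERR_CE) ? "CE" : "UE";

	if (err->err_code)
		return snprintf(buf, len, "%s: failed to decode error (%d)",
				t, err->err_code);

	return snprintf(buf, len,
			"%s csrow:%d channel:%d page:0x%x offset:0x%x syndrome:0x%x",
			t, err->csrow, err->channel, (unsigned)err->page,
			(unsigned)err->offset, (unsigned)err->syndrome);
}
```

Because both error types flow through the same function, a fix to the formatting or the decode-failure handling lands in one place instead of two.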
All families report a valid error address when encountering a DRAM ECC
error, so there's no need to check it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
When injecting DRAM ECC errors over the F3xB[8,C] interface, the machine
injects the error on the next non-cached access. On a normal system this
can take a relatively long time, so to expedite it, we disable the caches
around the injection.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Invert the kstrtoul() return value test to save one indentation level.
Also, shorten the macro names so that the lines fit into 80 columns. No
functional change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
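The "invert the return value test" pattern looks roughly like this in a sysfs-style store handler. Since kstrtoul() is kernel-only, parse_ulong() below is a hypothetical userspace stand-in built on strtoul(), and store_value() is a hypothetical handler. Checking the error case first and returning early keeps the success path at the outer indentation level instead of nesting it under "if (!ret) { ... }".

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Userspace stand-in for kstrtoul(): 0 on success, -EINVAL on failure.
 * Rejects empty input and trailing garbage (a single '\n' is allowed). */
static int parse_ulong(const char *s, unsigned long *res)
{
	char *end;

	errno = 0;
	*res = strtoul(s, &end, 0);
	if (errno || end == s || (*end && !(*end == '\n' && end[1] == '\0')))
		return -EINVAL;
	return 0;
}

/* Hypothetical store handler: bail out early on error, then do the
 * real work without an extra level of indentation. */
static long store_value(unsigned long *target, const char *buf, size_t count)
{
	unsigned long value;
	int ret = parse_ulong(buf, &value);

	if (ret < 0)
		return ret;

	*target = value;
	return (long)count;
}
```

On bad input the target is left untouched and the error code is propagated, which is the same behaviour as the nested form, just flatter.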
amd64_get_dram_hole_info: remove the local variable 'base'.
sys_addr_to_dram_addr: do not clear the local variable 'ret'. Also, sanitize
the formatting of constants.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
A reported error could look like this:
[ 226.178315] EDAC MC0: 1 CE on mc#0csrow#0channel#0 (csrow:0 channel:0 page:0x427c0d offset:0xde0 grain:0 syndrome:0x1c6)
with two spaces back-to-back because the specific drivers pass an empty
msg argument to edac_mc_handle_error().
Handle that.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
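One way to handle an empty msg is to emit the separating space only when there is something to separate. The formatter below is a hypothetical sketch of that idea, not the EDAC core's actual code: a naive "%s %s (%s)" format produces two back-to-back spaces whenever msg is "".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical formatter: the space before msg is added only when msg
 * is non-empty, so an empty msg no longer leaves a double space. */
static int format_error(char *buf, size_t len, const char *label,
			const char *msg, const char *detail)
{
	return snprintf(buf, len, "%s%s%s (%s)", label,
			*msg ? " " : "", msg, detail);
}
```

With an empty msg this yields "1 CE on mc#0csrow#0channel#0 (csrow:0 channel:0)", with a single space before the parenthesis.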
The tracepoint decodes the error type later anyway, so remove a useless
assignment to the temporary p, which gets overwritten later.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Only levels [0:4] are allowed, so enforce that. Also, while at it,
massage the Kconfig text and add the valid debug level range to the
module parameter description.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
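Enforcing the [0:4] range amounts to rejecting out-of-range values instead of storing them. The setter and the debug_level variable below are hypothetical stand-ins for the module parameter handling.

```c
#include <assert.h>
#include <errno.h>

#define DEBUG_LEVEL_MAX 4	/* highest allowed level, per the text */

static int debug_level;		/* hypothetical stand-in for the parameter */

/* Hypothetical setter: accept only levels in [0, DEBUG_LEVEL_MAX]
 * and leave the current level unchanged otherwise. */
static int set_debug_level(int val)
{
	if (val < 0 || val > DEBUG_LEVEL_MAX)
		return -EINVAL;

	debug_level = val;
	return 0;
}
```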
Currently, we unconditionally enable PCI polling and we don't look at
the edac_op_state module parameter. Make this dependent on the parameter
setting supplied on the command line.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
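A sketch of making polling depend on the operating-state parameter: the periodic check callback is only installed when polling was requested, and the polling loop skips controllers without one. The enum values, struct and function names are hypothetical; the real EDAC constants and structures may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operating states mirroring the edac_op_state idea. */
enum op_state { OPSTATE_POLL, OPSTATE_NMI, OPSTATE_INT };

struct pci_ctl_model {
	void (*check)(struct pci_ctl_model *);	/* NULL = no polling */
	unsigned int polls;
};

static void generic_check(struct pci_ctl_model *ctl)
{
	ctl->polls++;	/* stands in for reading PCI error status */
}

/* Install the poll callback only when polling was requested. */
static void init_pci_ctl(struct pci_ctl_model *ctl, enum op_state state)
{
	ctl->polls = 0;
	ctl->check = (state == OPSTATE_POLL) ? generic_check : NULL;
}

/* One tick of a hypothetical polling loop. */
static void poll_tick(struct pci_ctl_model *ctl)
{
	if (ctl->check)
		ctl->check(ctl);
}
```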
Pull powerpc EEH bugfixes from Benjamin Herrenschmidt.
Two one-liner fixes for the new EEH code.
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/eeh: Do not invalidate PE properly
powerpc/pseries: Fix oops with MSIs when missing EEH PEs
Pull MIPS fixes from Ralf Baechle:
"Three issues fixed across the field:
- Some functions that were recently outlined as part of a preemption
fix were causing problems with function tracing.
- The recently merged in-kernel MPI library uses very outdated
headers that contain MIPS-specific code which won't build with
gcc 4.4 or newer.
- The MIPS non-NUMA memory initialization was making only a very
half-baked attempt at merging adjacent memory ranges. This kept
the code simple enough but is now causing issues with kexec."
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MPI: Fix compilation on MIPS with GCC 4.4 and newer
MIPS: Fix crash that occurs when function tracing is enabled
MIPS: Merge overlapping bootmem ranges
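For reference, a complete merge of adjacent or overlapping memory ranges (as opposed to a "half-baked attempt") is a standard interval merge: sort by start address, then coalesce any range that overlaps or touches the previous one. This is a generic sketch with a hypothetical [start, end) representation, not the MIPS patch itself.

```c
#include <assert.h>
#include <stdlib.h>

/* A physical memory range [start, end) (hypothetical representation). */
struct mem_range {
	unsigned long start;
	unsigned long end;
};

static int cmp_start(const void *a, const void *b)
{
	const struct mem_range *x = a, *y = b;

	return (x->start > y->start) - (x->start < y->start);
}

/* Sort ranges by start address and coalesce any that overlap or touch.
 * The result is written in place; returns the number of ranges left. */
static size_t merge_ranges(struct mem_range *r, size_t n)
{
	size_t out = 0, i;

	if (n == 0)
		return 0;

	qsort(r, n, sizeof(*r), cmp_start);
	for (i = 1; i < n; i++) {
		if (r[i].start <= r[out].end) {
			/* Overlapping or adjacent: extend the current range. */
			if (r[i].end > r[out].end)
				r[out].end = r[i].end;
		} else {
			r[++out] = r[i];
		}
	}
	return out + 1;
}
```

Sorting first is what makes a single linear pass sufficient; without it, ranges that only overlap via a third range can be missed.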
While EEH performs recovery on the specific PE that has PCI errors,
the PCI devices belonging to the PE are removed and the PE is marked
as invalid, since we still need the information stored in the PE.
We only invalidate the PE when it has no associated EEH devices and
no valid child PEs. However, the code used to check that was wrong.
This patch fixes that.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
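The invalidation condition stated above (no associated EEH devices and no valid child PEs) can be expressed as a single predicate. The PE model below is a hypothetical simplification, not the powerpc EEH data structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified PE: attached EEH devices, child PEs and a
 * validity flag. */
struct pe_model {
	unsigned int nr_devs;		/* EEH devices still attached */
	struct pe_model **children;
	size_t nr_children;
	bool invalid;
};

/* A PE may be invalidated only when it has no associated EEH devices
 * AND none of its child PEs is still valid. */
static bool pe_can_invalidate(const struct pe_model *pe)
{
	size_t i;

	if (pe->nr_devs)
		return false;

	for (i = 0; i < pe->nr_children; i++)
		if (!pe->children[i]->invalid)
			return false;

	return true;
}
```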