Commit Graph

141 Commits

Author SHA1 Message Date
Linus Torvalds b20c99eb66 Merge branch 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS changes from Ingo Molnar:
 "[ The reason for drivers/ updates is that Boris asked for the
    drivers/edac/ changes to go via x86/ras in this cycle ]

  Main changes:

   - AMD CPUs:
      . Add ECC event decoding support for new F15h models
      . Various erratum fixes
      . Fix single-channel on dual-channel-controllers bug.

   - Intel CPUs:
      . UC uncorrectable memory error parsing fix
      . Add support for CMC (Corrected Machine Check) 'FF' (Firmware
        First) flag in the APEI HEST

   - Various cleanups and fixes"

* 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  amd64_edac: Fix incorrect wraparounds
  amd64_edac: Correct erratum 505 range
  cpc925_edac: Use proper array termination
  x86/mce, acpi/apei: Only disable banks listed in HEST if mce is configured
  amd64_edac: Get rid of boot_cpu_data accesses
  amd64_edac: Add ECC decoding support for newer F15h models
  x86, amd_nb: Clarify F15h, model 30h GART and L3 support
  pci_ids: Add PCI device ID functions 3 and 4 for newer F15h models.
  x38_edac: Make a local function static
  i3200_edac: Make a local function static
  x86/mce: Pay no attention to 'F' bit in MCACOD when parsing 'UC' errors
  APEI/ERST: Fix error message formatting
  amd64_edac: Fix single-channel setups
  EDAC: Replace strict_strtol() with kstrtol()
  mce: acpi/apei: Soft-offline a page on firmware GHES notification
  mce: acpi/apei: Add a boot option to disable ff mode for corrected errors
  mce: acpi/apei: Honour Firmware First for MCA banks listed in APEI HEST CMC
2013-09-04 11:07:04 -07:00
Aruna Balakrishnaiah 901037ba31 erst: Read and write to the 'compressed' flag of pstore
In pstore write, set the section type to CPER_SECTION_TYPE_DMESG_COMPR
if the data is compressed. In pstore read, read the section type and
update the 'compressed' flag accordingly.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-08-19 11:53:40 -07:00
Aruna Balakrishnaiah 9a4e139820 pstore: Introduce new argument 'compressed' in the read callback
Backends will set the flag 'compressed' after reading the log from
persistent store to indicate the data being returned to pstore is
compressed or not.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-08-19 10:18:11 -07:00
Aruna Balakrishnaiah b3b515bbd6 pstore: Add new argument 'compressed' in pstore write callback
Addition of new argument 'compressed' in the write call back will
help the backend to know if the data passed from pstore is compressed
or not (In case where compression fails.). If compressed, the backend
can add a tag indicating the data is compressed while writing to
persistent store.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-08-19 10:18:10 -07:00
Wei Yongjun 08b326d071 acpi/apei/erst: Add missing iounmap() on error in erst_exec_move_data()
Add the missing iounmap() before return from erst_exec_move_data()
in the error handling case.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-08-19 10:13:30 -07:00
Naveen N. Rao 7781544e7c x86/mce, acpi/apei: Only disable banks listed in HEST if mce is configured
Randconfig testing found this build error:

 >> hest.c(.init.text+0x6004): undefined reference to 'mce_disable_bank'

Fix by wrapping body of hest_parse_cmc() inside #ifdef
CONFIG_X86_MCE

Reported-by: "Wu, Fengguang" <fengguang.wu@intel.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/0129220@agluck-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-12 17:55:12 +02:00
Ingo Molnar 0237d7f355 Merge branch 'x86/mce' into x86/ras
Pursue a single RAS/MCE topic branch on x86.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-12 17:54:05 +02:00
Borislav Petkov cb82a2e4f9 APEI/ERST: Fix error message formatting
... according to acpi/apei/ conventions. Use standard pr_fmt prefix
while at it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
2013-07-31 11:23:10 +02:00
Naveen N. Rao cf870c70a1 mce: acpi/apei: Soft-offline a page on firmware GHES notification
If the firmware indicates in GHES error data entry that the error threshold
has exceeded for a corrected error event, then we try to soft-offline the
page. This could be called in interrupt context, so we queue this up similar
to how we handle memory failure scenarios.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-07-10 11:35:02 -07:00
Naveen N. Rao 9ad95879cd mce: acpi/apei: Add a boot option to disable ff mode for corrected errors
Add a boot option to disable firmware first mode for corrected errors.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-07-08 11:54:28 -07:00
Naveen N. Rao c3d1fb567a mce: acpi/apei: Honour Firmware First for MCA banks listed in APEI HEST CMC
The Corrected Machine Check structure (CMC) in HEST has a flag which can be
set by the firmware to indicate to the OS that it prefers to process the
corrected error events first. In this scenario, the OS is expected to not
monitor for corrected errors (through CMCI/polling). Instead, the firmware
notifies the OS on corrected error events through GHES.

Linux already has support for GHES. This patch adds support for parsing CMC
structure and to disable CMCI/polling if the firmware first flag is set.

Further, the list of machine check bank structures at the end of CMC is used
to determine which MCA banks function in FF mode, so that we continue to
monitor error events on the other banks.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-07-08 11:53:01 -07:00
Linus Torvalds 65b97fb730 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Ben Herrenschmidt:
 "This is the powerpc changes for the 3.11 merge window.  In addition to
  the usual bug fixes and small updates, the main highlights are:

   - Support for transparent huge pages by Aneesh Kumar for 64-bit
     server processors.  This allows the use of 16M pages as transparent
     huge pages on kernels compiled with a 64K base page size.

   - Base VFIO support for KVM on power by Alexey Kardashevskiy

   - Wiring up of our nvram to the pstore infrastructure, including
     putting compressed oopses in there by Aruna Balakrishnaiah

   - Move, rework and improve our "EEH" (basically PCI error handling
     and recovery) infrastructure.  It is no longer specific to pseries
     but is now usable by the new "powernv" platform as well (no
     hypervisor) by Gavin Shan.

   - I fixed some bugs in our math-emu instruction decoding and made it
     usable to emulate some optional FP instructions on processors with
     hard FP that lack them (such as fsqrt on Freescale embedded
     processors).

   - Support for Power8 "Event Based Branch" facility by Michael
     Ellerman.  This facility allows what is basically "userspace
     interrupts" for performance monitor events.

   - A bunch of Transactional Memory vs.  Signals bug fixes and HW
     breakpoint/watchpoint fixes by Michael Neuling.

  And more ...  I appologize in advance if I've failed to highlight
  something that somebody deemed worth it."

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
  pstore: Add hsize argument in write_buf call of pstore_ftrace_call
  powerpc/fsl: add MPIC timer wakeup support
  powerpc/mpic: create mpic subsystem object
  powerpc/mpic: add global timer support
  powerpc/mpic: add irq_set_wake support
  powerpc/85xx: enable coreint for all the 64bit boards
  powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx
  powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig
  powerpc/mpic: Add get_version API both for internal and external use
  powerpc: Handle both new style and old style reserve maps
  powerpc/hw_brk: Fix off by one error when validating DAWR region end
  powerpc/pseries: Support compression of oops text via pstore
  powerpc/pseries: Re-organise the oops compression code
  pstore: Pass header size in the pstore write callback
  powerpc/powernv: Fix iommu initialization again
  powerpc/pseries: Inform the hypervisor we are using EBB regs
  powerpc/perf: Add power8 EBB support
  powerpc/perf: Core EBB support for 64-bit book3s
  powerpc/perf: Drop MMCRA from thread_struct
  powerpc/perf: Don't enable if we have zero events
  ...
2013-07-04 10:29:23 -07:00
Linus Torvalds 862f001254 Merge tag 'pci-v3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI changes from Bjorn Helgaas:
 "PCI device hotplug
    - Add pci_alloc_dev() interface (Gu Zheng)
    - Add pci_bus_get()/put() for reference counting (Jiang Liu)
    - Fix SR-IOV reference count issues (Jiang Liu)
    - Remove unused acpi_pci_roots list (Jiang Liu)

  MSI
    - Conserve interrupt resources on x86 (Alexander Gordeev)

  AER
    - Force fatal severity when component has been reset (Betty Dall)
    - Reset link below Root Port as well as Downstream Port (Betty Dall)
    - Fix "Firmware first" flag setting (Bjorn Helgaas)
    - Don't parse HEST for non-PCIe devices (Bjorn Helgaas)

  ASPM
    - Warn when we can't disable ASPM as driver requests (Bjorn Helgaas)

  Miscellaneous
    - Add CircuitCo PCI IDs (Darren Hart)
    - Add AMD CZ SATA and SMBus PCI IDs (Shane Huang)
    - Work around Ivytown NTB BAR size issue (Jon Mason)
    - Detect invalid initial BAR values (Kevin Hao)
    - Add pcibios_release_device() (Sebastian Ott)
    - Fix powerpc & sparc PCI_UNKNOWN power state usage (Bjorn Helgaas)"

* tag 'pci-v3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (51 commits)
  MAINTAINERS: Add ACPI folks for ACPI-related things under drivers/pci
  PCI: Add CircuitCo vendor ID and subsystem ID
  PCI: Use pdev->pm_cap instead of pci_find_capability(..,PCI_CAP_ID_PM)
  PCI: Return early on allocation failures to unindent mainline code
  PCI: Simplify IOV implementation and fix reference count races
  PCI: Drop redundant setting of bus->is_added in virtfn_add_bus()
  unicore32/PCI: Remove redundant call of pci_bus_add_devices()
  m68k/PCI: Remove redundant call of pci_bus_add_devices()
  PCI / ACPI / PM: Use correct power state strings in messages
  PCI: Fix comment typo for pcie_pme_remove()
  PCI: Rename pci_release_bus_bridge_dev() to pci_release_host_bridge_dev()
  PCI: Fix refcount issue in pci_create_root_bus() error recovery path
  ia64/PCI: Clean up pci_scan_root_bus() usage
  PCI/AER: Reset link for devices below Root Port or Downstream Port
  ACPI / APEI: Force fatal AER severity when component has been reset
  PCI/AER: Remove "extern" from function declarations
  PCI/AER: Move AER severity defines to aer.h
  PCI/AER: Set dev->__aer_firmware_first only for matching devices
  PCI/AER: Factor out HEST device type matching
  PCI/AER: Don't parse HEST table for non-PCIe devices
  ...
2013-07-03 16:31:35 -07:00
Linus Torvalds 04bbc8e1f6 Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull pstore update from Tony Luck:
 "Fixes for pstore for 3.11 merge window"

* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  efivars: If pstore_register fails, free unneeded pstore buffer
  acpi: Eliminate console msg if pstore.backend excludes ERST
  pstore: Return unique error if backend registration excluded by kernel param
  pstore: Fail to unlink if a driver has not defined pstore_erase
  pstore/ram: remove the power of buffer size limitation
  pstore/ram: avoid atomic accesses for ioremapped regions
  efi, pstore: Cocci spatch "memdup.spatch"
2013-07-03 11:14:44 -07:00
Aruna Balakrishnaiah 6bbbca7359 pstore: Pass header size in the pstore write callback
Header size is needed to distinguish between header and the dump data.
Incorporate the addition of new argument (hsize) in the pstore write
callback.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01 18:10:48 +10:00
Lenny Szubowicz 74fd6c6f84 acpi: Eliminate console msg if pstore.backend excludes ERST
This is patch 2/3 of a patch set that avoids what misleadingly appears
to be a error during boot:

ERST: Could not register with persistent store

This message is displayed if the system has a valid ACPI ERST table and the
pstore.backend kernel parameter has been used to disable use of ERST by
pstore. But this same message is used for errors that preclude registration.

In erst_init don't complain if the setting of kernel parameter pstore.backend
precludes use of ACPI ERST for pstore. Routine pstore_register will inform
about the facility that does register.

Also, don't leave a dangling pointer to deallocated mem for the pstore
buffer when registration fails.

Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com>
Reported-by: Naotaka Hamaguchi <n.hamaguchi@jp.fujitsu.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-06-28 15:22:31 -07:00
Ingo Molnar d908e1ebbc Merge tag 'please-pull-einj' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras
Pull miscellaneous fixes for ACPI EINJ (error injection) code, from Tony Luck.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 13:54:04 +02:00
Rafael J. Wysocki 2314b69253 Merge branch 'acpi-fixes'
* acpi-fixes:
  ACPI / PM: Do not execute _PS0 for devices without _PSC during initialization
  ACPI / scan: do not match drivers against objects having scan handlers
  ACPI / APEI: fix error return code in ghes_probe()
  ACPI / video: ignore BIOS initial backlight value for HP Pavilion g6
  ACPI / video: ignore BIOS initial backlight value for HP m4
  x86 / platform / hp_wmi: Fix bluetooth_rfkill misuse in hp_wmi_rfkill_setup()
2013-06-07 12:35:23 +02:00
Chen Gong c5a130325f ACPI/APEI: Add parameter check before error injection
When param1 is enabled in EINJ but not assigned with a valid
value, sometimes it will cause the error like below:

APEI: Can not request [mem 0x7aaa7000-0x7aaa7007] for APEI EINJ Trigger registers

It is because some firmware will access target address specified in
param1 to trigger the error when injecting memory error. This will
cause resource conflict with regular memory. So It must be removed
from trigger table resources, but incorrect param1/param2
combination will stop this action. Add extra check to avoid
this kind of error.

Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-06-06 15:20:51 -07:00
Betty Dall 0ba98ec919 ACPI / APEI: Force fatal AER severity when component has been reset
The CPER error record has a reset bit that indicates that the platform
has reset the component. The reset bit can be set for any severity
error including recoverable.  From the AER code path's perspective,
any error is fatal if the component has been reset.  This patch
upgrades the severity of the AER recovery to AER_FATAL whenever the
CPER error record indicates that the component has been reset.

[bhelgaas: s/bus has been reset/component has been reset/]
Signed-off-by: Betty Dall <betty.dall@hp.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2013-06-06 14:39:16 -06:00
Wei Yongjun b8edb64119 ACPI, APEI, EINJ: Fix error return code in einj_init()
Fix to return -ENOMEM in the debugfs_create_xxx() error handling
case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-06-05 16:18:12 -07:00
Wei Yongjun a98d4f64a2 ACPI / APEI: fix error return code in ghes_probe()
Fix to return a negative error code in the acpi_gsi_to_irq() and
request_irq() error handling case instead of 0, as done elsewhere
in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-06-05 13:11:47 +02:00
Lance Ortiz 37448adfc7 aerdrv: Move cper_print_aer() call out of interrupt context
The following warning was seen on 3.9 when a corrected PCIe error was being
handled by the AER subsystem.

WARNING: at .../drivers/pci/search.c:214 pci_get_dev_by_id+0x8a/0x90()

This occurred because a call to pci_get_domain_bus_and_slot() was added to
cper_print_pcie() to setup for the call to cper_print_aer().  The warning
showed up because cper_print_pcie() is called in an interrupt context and
pci_get* functions are not supposed to be called in that context.

The solution is to move the cper_print_aer() call out of the interrupt
context and into aer_recover_work_func() to avoid any warnings when calling
pci_get* functions.

Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-05-30 10:51:20 -07:00
Chen Gong aaf9d93be7 ACPI / APEI: fix error status check condition for CPER
In Table 18-289, ACPI5.0 SPEC, the error data length in CPER
Generic Error Data Entry can be 0, which means this generic
error data entry can have only one header. So fix the check
conditon for it.

Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Reviewed-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-03-27 00:05:46 +01:00
Linus Torvalds ad6c2c2eb3 Merge branch 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac
Pull EDAC fixes and ghes-edac from Mauro Carvalho Chehab:
 "For:

   - Some fixes at edac drivers (i7core_edac, sb_edac, i3200_edac);
   - error injection support for i5100, when EDAC debug is enabled;
   - fix edac when it is loaded builtin (early init for the subsystem);
   - a "Firmware First" EDAC driver, allowing ghes to report errors via
     EDAC (ghes-edac).

  With regards to ghes-edac, this fixes a longstanding BZ at Red Hat
  that happens with Nehalem and Sandy Bridge CPUs: when both GHES and
  i7core_edac or sb_edac are running, the error reports are
  unpredictable, as both BIOS and OS race to access the registers.  With
  ghes-edac, the EDAC core will refuse to register any other concurrent
  memory error driver.

  This patchset moves the ghes struct definitions to a separate header
  file (include/acpi/ghes.h) and adds 3 hooks at apei/ghes.c to
  register/unregister and to report errors via ghes-edac.  Those changes
  were acked by ghes driver maintainer (Huang)."

* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac: (30 commits)
  i5100_edac: convert to use simple_open()
  ghes_edac: fix to use list_for_each_entry_safe() when delete list items
  ghes_edac: Fix RAS tracing
  ghes_edac: Make it compliant with UEFI spec 2.3.1
  ghes_edac: Improve driver's printk messages
  ghes_edac: Don't credit the same memory dimm twice
  ghes_edac: do a better job of filling EDAC DIMM info
  ghes_edac: add support for reporting errors via EDAC
  ghes_edac: Register at EDAC core the BIOS report
  ghes: add the needed hooks for EDAC error report
  ghes: move structures/enum to a header file
  edac: add support for error type "Info"
  edac: add support for raw error reports
  edac: reduce stack pressure by using a pre-allocated buffer
  edac: lock module owner to avoid error report conflicts
  edac: remove proc_name from mci structure
  edac: add a new memory layer type
  edac: initialize the core earlier
  edac: better report error conditions in debug mode
  i5100_edac: Remove two checkpatch warnings
  ...
2013-02-28 20:42:33 -08:00