Commit Graph

47 Commits

Author SHA1 Message Date
Robin Murphy
5deb67f77a libnvdimm, nd_blk: remove mmio_flush_range()
mmio_flush_range() suffers from a lack of clearly-defined semantics,
and is somewhat ambiguous to port to other architectures where the
scope of the writeback implied by "flush" and ordering might matter,
but MMIO would tend to imply non-cacheable anyway. Per the rationale
in 67a3e8fe90 ("nd_blk: change aperture mapping from WC to WB"), the
only existing use is actually to invalidate clean cache lines for
ARCH_MEMREMAP_PMEM type mappings *without* writeback. Since the recent
cleanup of the pmem API, that also now happens to be the exact purpose
of arch_invalidate_pmem(), which would be a far more well-defined tool
for the job.

Rather than risk potentially inconsistent implementations of
mmio_flush_range() for the sake of one callsite, streamline things by
removing it entirely and instead move the ARCH_MEMREMAP_PMEM related
definitions up to the libnvdimm level, so they can be shared by NFIT
as well. This allows NFIT to be enabled for arm64.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31 15:05:10 -07:00
Dan Williams
401c0a19c6 nfit, libnvdimm, region: export 'position' in mapping info
It is useful to be able to know the position of a DIMM in an
interleave-set. Consider the case where the order of the DIMMs changes
causing a namespace to be invalidated because the interleave-set cookie no
longer matches. If the before and after state of each DIMM position is
known this state debugged by the system owner.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-04 17:20:16 -07:00
Dan Williams
9d92573fff Merge branch 'for-4.13/dax' into libnvdimm-for-next 2017-07-03 16:54:58 -07:00
Jerry Hoemann
7db5bb33ad libnvdimm, acpi, nfit: Add bus level dsm mask for pass thru.
Add a bus level dsm_mask to nvdimm_bus_descriptor to allow the passthru
calling mechanism to specify a different mask from the cmd_mask.

Populate bus_dsm_mask and use it to filter dsm calls that user can
make through the pass thru interface.

Signed-off-by: Jerry Hoemann <jerry.hoemann@hpe.com>
[djbw: use command number constants instead of a magic mask value]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-07-01 08:49:59 -07:00
Dan Williams
0b277961f4 libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
The pmem driver attaches to both persistent and volatile memory ranges
advertised by the ACPI NFIT. When the region is volatile it is redundant
to spend cycles flushing caches at fsync(). Check if the hosting region
is volatile and do not set dax_write_cache() if it is.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-29 09:29:50 -07:00
Dan Williams
ca6a4657e5 x86, libnvdimm, pmem: remove global pmem api
Now that all callers of the pmem api have been converted to dax helpers that
call back to the pmem driver, we can remove include/linux/pmem.h and
asm/pmem.h.

Cc: <x86@kernel.org>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-27 16:29:54 -07:00
Dan Williams
faec6f8a1c libnvdimm, label: populate the type_guid property for v1.2 namespaces
The type_guid refers to the "Address Range Type GUID" for the region
backing a namespace as defined the ACPI NFIT (NVDIMM Firmware Interface
Table). This 'type' identifier specifies an access mechanism for the
given namespace. This capability replaces the confusing usage of the
'NSLABEL_FLAG_LOCAL' flag to indicate a block-aperture-mode namespace.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15 14:31:40 -07:00
Dan Williams
c12c48ce86 libnvdimm, label: add v1.2 interleave-set-cookie algorithm
The interleave-set-cookie algorithm is extended to incorporate all the
same components that are used to generate an nvdimm unique-id. For
backwards compatibility we still maintain the old v1.1 definition.

Reported-by: Nicholas Moulin <nicholas.w.moulin@intel.com>
Reported-by: Kaushik Kanetkar <kaushik.a.kanetkar@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15 14:31:39 -07:00
Dan Williams
8f078b38dd libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
This is a preparation patch for handling locked nvdimm label regions, a
new concept as introduced by the latest DSM document on pmem.io [1]. A
future patch will leverage nvdimm_set_locked() at DIMM probe time to
flag regions that can not be enabled. There should be no functional
difference resulting from this change.

[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example-V1.3.pdf

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-04 14:01:24 -07:00
Dan Williams
23f4984483 libnvdimm: rework region badblocks clearing
Toshi noticed that the new support for a region-level badblocks missed
the case where errors are cleared due to BTT I/O.

An initial attempt to fix this ran into a "sleeping while atomic"
warning due to taking the nvdimm_bus_lock() in the BTT I/O path to
satisfy the locking requirements of __nvdimm_bus_badblocks_clear().
However, that lock is not needed since we are not acting on any data that
is subject to change under that lock. The badblocks instance has its own
internal lock to handle mutations of the error list.

So, in order to make it clear that we are just acting on region devices,
rename __nvdimm_bus_badblocks_clear() to nvdimm_clear_badblocks_regions().
Eliminate the lock and consolidate all support routines for the new
nvdimm_account_cleared_poison() in drivers/nvdimm/bus.c. Finally, to the
opportunity to cleanup to some unnecessary casts, make the calling
convention of nvdimm_clear_badblocks_regions() clearer by replacing struct
resource with the minimal struct clear_badblocks_context, and use the
DEVICE_ATTR macro.

Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-29 15:24:03 -07:00
Dave Jiang
b3b454f694 libnvdimm: fix clear poison locking with spinlock and GFP_NOWAIT allocation
The following warning results from holding a lane spinlock,
preempt_disable(), or the btt map spinlock and then trying to take the
reconfig_mutex to walk the poison list and potentially add new entries.

BUG: sleeping function called from invalid context at kernel/locking/mutex.
c:747
in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd
[..]
Call Trace:
dump_stack+0x85/0xc8
___might_sleep+0x184/0x250
__might_sleep+0x4a/0x90
__mutex_lock+0x58/0x9b0
? nvdimm_bus_lock+0x21/0x30 [libnvdimm]
? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm]
? acpi_nfit_forget_poison+0x79/0x80 [nfit]
? _raw_spin_unlock+0x27/0x40
mutex_lock_nested+0x1b/0x20
nvdimm_bus_lock+0x21/0x30 [libnvdimm]
nvdimm_forget_poison+0x25/0x50 [libnvdimm]
nvdimm_clear_poison+0x106/0x140 [libnvdimm]
nsio_rw_bytes+0x164/0x270 [libnvdimm]
btt_write_pg+0x1de/0x3e0 [nd_btt]
? blk_queue_enter+0x30/0x290
btt_make_request+0x11a/0x310 [nd_btt]
? blk_queue_enter+0xb7/0x290
? blk_queue_enter+0x30/0x290
generic_make_request+0x118/0x3b0

A spinlock is introduced to protect the poison list. This allows us to not
having to acquire the reconfig_mutex for touching the poison list. The
add_poison() function has been broken out into two helper functions. One to
allocate the poison entry and the other to apppend the entry. This allows us
to unlock the poison_lock in non-I/O path and continue to be able to allocate
the poison entry with GFP_KERNEL. We will use GFP_NOWAIT in the I/O path in
order to satisfy being in atomic context.

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-13 14:23:51 -07:00
Dave Jiang
006358b35c libnvdimm: add support for clear poison list and badblocks for device dax
Providing mechanism to clear poison list via the ndctl ND_CMD_CLEAR_ERROR
call. We will update the poison list and also the badblocks at region level
if the region is in dax mode or in pmem mode and not active. In other
words we force badblocks to be cleared through write requests if the
address is currently accessed through a block device, otherwise it can
only be done via the ioctl+dsm path.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-12 21:56:43 -07:00
Dan Williams
86ef58a4e3 nfit, libnvdimm: fix interleave set cookie calculation
The interleave-set cookie is a sum that sanity checks the composition of
an interleave set has not changed from when the namespace was initially
created.  The checksum is calculated by sorting the DIMMs by their
location in the interleave-set. The comparison for the sort must be
64-bit wide, not byte-by-byte as performed by memcmp() in the broken
case.

Fix the implementation to accept correct cookie values in addition to
the Linux "memcmp" order cookies, but only allow correct cookies to be
generated going forward. It does mean that namespaces created by
third-party-tooling, or created by newer kernels with this fix, will not
validate on older kernels. However, there are a couple mitigating
conditions:

    1/ platforms with namespace-label capable NVDIMMs are not widely
       available.

    2/ interleave-sets with a single-dimm are by definition not affected
       (nothing to sort). This covers the QEMU-KVM NVDIMM emulation case.

The cookie stored in the namespace label will be fixed by any write the
namespace label, the most straightforward way to achieve this is to
write to the "alt_name" attribute of a namespace in sysfs.

Cc: <stable@vger.kernel.org>
Fixes: eaf961536e ("libnvdimm, nfit: add interleave-set state-tracking infrastructure")
Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Tested-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-03-01 00:49:42 -08:00
Dan Williams
efda1b5d87 acpi, nfit, libnvdimm: fix / harden ars_status output length handling
Given ambiguities in the ACPI 6.1 definition of the "Output (Size)"
field of the ARS (Address Range Scrub) Status command, a firmware
implementation may in practice return 0, 4, or 8 to indicate that there
is no output payload to process.

The specification states "Size of Output Buffer in bytes, including this
field.". However, 'Output Buffer' is also the name of the entire
payload, and earlier in the specification it states "Max Query ARS
Status Output Buffer Size: Maximum size of buffer (including the Status
and Extended Status fields)".

Without this fix if the BIOS happens to return 0 it causes memory
corruption as evidenced by this result from the acpi_nfit_ctl() unit
test.

 ars_status00000000: 00020000 00000000                    ........
 BUG: stack guard page was hit at ffffc90001750000 (stack is ffffc9000174c000..ffffc9000174ffff)
 kernel stack overflow (page fault): 0000 [#1] SMP DEBUG_PAGEALLOC
 task: ffff8803332d2ec0 task.stack: ffffc9000174c000
 RIP: 0010:[<ffffffff814cfe72>]  [<ffffffff814cfe72>] __memcpy+0x12/0x20
 RSP: 0018:ffffc9000174f9a8  EFLAGS: 00010246
 RAX: ffffc9000174fab8 RBX: 0000000000000000 RCX: 000000001fffff56
 RDX: 0000000000000000 RSI: ffff8803231f5a08 RDI: ffffc90001750000
 RBP: ffffc9000174fa88 R08: ffffc9000174fab0 R09: ffff8803231f54b8
 R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000000003 R15: ffff8803231f54a0
 FS:  00007f3a611af640(0000) GS:ffff88033ed00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffc90001750000 CR3: 0000000325b20000 CR4: 00000000000406e0
 Stack:
  ffffffffa00bc60d 0000000000000008 ffffc90000000001 ffffc9000174faac
  0000000000000292 ffffffffa00c24e4 ffffffffa00c2914 0000000000000000
  0000000000000000 ffffffff00000003 ffff880331ae8ad0 0000000800000246
 Call Trace:
  [<ffffffffa00bc60d>] ? acpi_nfit_ctl+0x49d/0x750 [nfit]
  [<ffffffffa01f4fe0>] nfit_test_probe+0x670/0xb1b [nfit_test]

Cc: <stable@vger.kernel.org>
Fixes: 747ffe11b4 ("libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-06 16:08:10 -08:00
Dan Williams
44c462eb9e libnvdimm, region: move region-mapping input-paramters to nd_mapping_desc
Before we add more libnvdimm-private fields to nd_mapping make it clear
which parameters are input vs libnvdimm internals. Use struct
nd_mapping_desc instead of struct nd_mapping in nd_region_desc and make
struct nd_mapping private to libnvdimm.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-30 19:13:42 -07:00
Vishal Verma
e046114af5 libnvdimm: clear the internal poison_list when clearing badblocks
nvdimm_clear_poison cleared the user-visible badblocks, and sent
commands to the NVDIMM to clear the areas marked as 'poison', but it
neglected to clear the same areas from the internal poison_list which is
used to marshal ARS results before sorting them by namespace. As a
result, once on-demand ARS functionality was added:

37b137f nfit, libnvdimm: allow an ARS scrub to be triggered on demand

A scrub triggered from either sysfs or an MCE was found to be adding
stale entries that had been cleared from gendisk->badblocks, but were
still present in nvdimm_bus->poison_list. Additionally, the stale entries
could be triggered into producing stale disk->badblocks by simply disabling
and re-enabling the namespace or region.

This adds the missing step of clearing poison_list entries when clearing
poison, so that it is always in sync with badblocks.

Fixes: 37b137f ("nfit, libnvdimm: allow an ARS scrub to be triggered on demand")
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-30 17:03:45 -07:00
Dan Williams
ba9c8dd3c2 acpi, nfit: add dimm device notification support
Per "ACPI 6.1 Section 9.20.3" NVDIMM devices, children of the ACPI0012
NVDIMM Root device, can receive health event notifications.

Given that these devices are precluded from registering a notification
handler via acpi_driver.acpi_device_ops (due to no _HID), we use
acpi_install_notify_handler() directly.  The registered handler,
acpi_nvdimm_notify(), triggers a poll(2) event on the nmemX/nfit/flags
sysfs attribute when a health event notification is received.

Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-08-29 14:55:17 -07:00
Vishal Verma
37b137ff8c nfit, libnvdimm: allow an ARS scrub to be triggered on demand
Normally, an ARS (Address Range Scrub) only happens at
boot/initialization time. There can however arise situations where a
bus-wide rescan is needed - notably, in the case of discovering a latent
media error, we should do a full rescan to figure out what other sectors
are bad, and thus potentially avoid triggering an mce on them in the
future. Also provide a sysfs trigger to start a bus-wide scrub.

Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-23 21:51:42 -07:00
Dan Williams
bc9775d869 libnvdimm: move ->module to struct nvdimm_bus_descriptor
Let the provider module be explicitly passed in rather than implicitly
assumed by the module that calls nvdimm_bus_register().  This is in
preparation for unifying the nfit and nfit_test driver teardown paths.

Reviewed-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-21 20:03:19 -07:00
Dan Williams
f284a4f237 libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush()
nvdimm_flush() is a replacement for the x86 'pcommit' instruction.  It is
an optional write flushing mechanism that an nvdimm bus can provide for
the pmem driver to consume.  In the case of the NFIT nvdimm-bus-provider
nvdimm_flush() is implemented as a series of flush-hint-address [1]
writes to each dimm in the interleave set (region) that backs the
namespace.

The nvdimm_has_flush() routine relies on platform firmware to describe
the flushing capabilities of a platform.  It uses the heuristic of
whether an nvdimm bus provider provides flush address data to return a
ternary result:

      1: flush addresses defined
      0: dimm topology described without flush addresses (assume ADR)
 -errno: no topology information, unable to determine flush mechanism

The pmem driver is expected to take the following actions on this ternary
result:

      1: nvdimm_flush() in response to REQ_FUA / REQ_FLUSH and shutdown
      0: do not set, WC or FUA on the queue, take no further action
 -errno: warn and then operate as if nvdimm_has_flush() returned '0'

The caveat of this heuristic is that it can not distinguish the "dimm
does not have flush address" case from the "platform firmware is broken
and failed to describe a flush address".  Given we are already
explicitly trusting the NFIT there's not much more we can do beyond
blacklisting broken firmwares if they are ever encountered.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-11 16:13:42 -07:00
Dan Williams
e5ae3b252c libnvdimm, nfit: move flush hint mapping to region-device driver-data
In preparation for triggering flushes of a DIMM's writes-posted-queue
(WPQ) via the pmem driver move mapping of flush hint addresses to the
region driver.  Since this uses devm_nvdimm_memremap() the flush
addresses will remain mapped while any region to which the dimm belongs
is active.

We need to communicate more information to the nvdimm core to facilitate
this mapping, namely each dimm object now carries an array of flush hint
address resources.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-11 15:09:26 -07:00
Dan Williams
a8a6d2e04c libnvdimm, nfit: remove nfit_spa_map() infrastructure
Now that all shared mappings are handled by devm_nvdimm_memremap() we no
longer need nfit_spa_map() nor do we need to trigger a callback to the
bus provider at region disable time.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-11 15:09:26 -07:00
Dan Williams
29b9aa0aa3 libnvdimm: introduce devm_nvdimm_memremap(), convert nfit_spa_map() users
In preparation for generically mapping flush hint addresses for both the
BLK and PMEM use case, provide a generic / reference counted mapping
api.  Given the fact that a dimm may belong to multiple regions (PMEM
and BLK), the flush hint addresses need to be held valid as long as any
region associated with the dimm is active.  This is similar to the
existing BLK-region case where multiple BLK-regions may share an
aperture mapping.  Up-level this shared / reference-counted mapping
capability from the nfit driver to a core nvdimm capability.

This eliminates the need for the nd_blk_region.disable() callback.  Note
that the removal of nfit_spa_map() and related infrastructure is
deferred to a later patch.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-07 17:11:09 -07:00
Dan Williams
e3654eca70 nfit, libnvdimm: clarify "commands" vs "_DSMs"
Clarify the distinction between "commands", the ioctls userspace calls
to request the kernel take some action on a given dimm device, and
"_DSMs", the actual function numbers used in the firmware interface to
the DIMM.  _DSMs are ACPI specific whereas commands are Linux kernel
generic.

This is in preparation for breaking the 1:1 implicit relationship
between the kernel ioctl number space and the firmware specific function
numbers.

Cc: Jerry Hoemann <jerry.hoemann@hpe.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-28 16:23:16 -07:00
Jerry Hoemann
40abf9be8f libnvdimm: increase max envelope size for ioctl
nd_ioctl() must first read in the fixed sized portion of an ioctl so
that it can then determine the size of the variable part.

Prepare for ND_CMD_CALL calls which have larger fixed portion
envelope.

Signed-off-by: Jerry Hoemann <jerry.hoemann@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-28 16:23:16 -07:00