HMM (heterogeneous memory management) need struct page to support
migration from system main memory to device memory. Reasons for HMM and
migration to device memory is explained with HMM core patch.
This patch deals with device memory that is un-addressable memory (ie CPU
can not access it). Hence we do not want those struct page to be manage
like regular memory. That is why we extend ZONE_DEVICE to support
different types of memory.
A persistent memory type is define for existing user of ZONE_DEVICE and a
new device un-addressable type is added for the un-addressable memory
type. There is a clear separation between what is expected from each
memory type and existing user of ZONE_DEVICE are un-affected by new
requirement and new use of the un-addressable type. All specific code
path are protect with test against the memory type.
Because memory is un-addressable we use a new special swap type for when a
page is migrated to device memory (this reduces the number of maximum swap
file).
The main two additions beside memory type to ZONE_DEVICE is two callbacks.
First one, page_free() is call whenever page refcount reach 1 (which
means the page is free as ZONE_DEVICE page never reach a refcount of 0).
This allow device driver to manage its memory and associated struct page.
The second callback page_fault() happens when there is a CPU access to an
address that is back by a device page (which are un-addressable by the
CPU). This callback is responsible to migrate the page back to system
main memory. Device driver can not block migration back to system memory,
HMM make sure that such page can not be pin into device memory.
If device is in some error condition and can not migrate memory back then
a CPU page fault to device memory should end with SIGBUS.
[arnd@arndb.de: fix warning]
Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Resource flags are exposed to userspace via the sysfs "resource" file.
lspci reads the sysfs file to determine resource properties.
Add a "BAR Equivalent Indicator" flag so lspci can distinguish between
[virtual] and [enhanced] resources.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sean O. Stalley <sean.stalley@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Pull libnvdimm updates from Dan Williams:
- Asynchronous address range scrub:
Given the capacities of next generation persistent memory devices a
scrub operation to find all poison may take 10s of seconds. We
want this scrub work to be done asynchronously with the rest of
system initialization, so we move it out of line from the NFIT
probing, i.e. acpi_nfit_add().
- Clear poison:
ACPI 6.1 introduces the ability to send "clear error" commands to
the ACPI0012:00 device representing the root of an "nvdimm bus".
Similar to relocating a bad block on a disk, this support clears
media errors in response to a write.
- Persistent memory resource tracking:
A persistent memory range may be designated as simply "reserved" by
platform firmware in the efi/e820 memory map. Later when the NFIT
driver loads it discovers that the range is "Persistent Memory".
The NFIT bus driver inserts a resource to advertise that
"persistent" attribute in the system resource tree for /proc/iomem
and kernel-internal usages.
- Miscellaneous cleanups and fixes:
Workaround section misaligned pmem ranges when allocating a struct
page memmap, fix handling of the read-only case in the ioctl path,
and clean up block device major number allocation.
* tag 'libnvdimm-for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
libnvdimm, pmem: clear poison on write
libnvdimm, pmem: fix kmap_atomic() leak in error path
nvdimm/btt: don't allocate unused major device number
nvdimm/blk: don't allocate unused major device number
pmem: don't allocate unused major device number
ACPI: Change NFIT driver to insert new resource
resource: Export insert_resource and remove_resource
resource: Add remove_resource interface
resource: Change __request_region to inherit from immediate parent
libnvdimm, pmem: fix ia64 build, use PHYS_PFN
nfit, libnvdimm: clear poison command support
libnvdimm, pfn: 'resource'-address and 'size' attributes for pfn devices
libnvdimm, pmem: adjust for section collisions with 'System RAM'
libnvdimm, pmem: fix 'pfn' support for section-misaligned namespaces
libnvdimm: Fix security issue with DSM IOCTL.
libnvdimm: Clean-up access mode check.
tools/testing/nvdimm: expand ars unit testing
nfit: disable userspace initiated ars during scrub
nfit: scrub and register regions in a workqueue
nfit, libnvdimm: async region scrub workqueue
...
The IORESOURCE_ROM_COPY and IORESOURCE_ROM_BIOS_COPY bits are unused.
Remove them and code that depends on them.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
insert_resource() and insert_resource_conflict() are called
by resource producers to insert a new resource. When there
is any conflict, they move conflicting resources down to the
children of the new resource. There is no destructor of these
interfaces, however.
Add remove_resource(), which removes a resource previously
inserted by insert_resource() or insert_resource_conflict(),
and moves the children up to where they were before.
__release_resource() is changed to have @release_child, so
that this function can be used for remove_resource() as well.
Also add comments to clarify that these functions are intended
for producers of resources to avoid any confusion with
request/release_resource() for consumers.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
IORESOURCE_ROM_SHADOW means there is a copy of a device's option ROM in
RAM. The existence of such a copy and its location are arch-specific.
Previously the IORESOURCE_ROM_SHADOW flag was set in arch code, but the
0xC0000-0xDFFFF location was hard-coded into the PCI core.
If we're using a shadow copy in RAM, disable the ROM BAR and release the
address space it was consuming. Move the location information from the PCI
core to the arch code that sets IORESOURCE_ROM_SHADOW. Save the location
of the RAM copy in the struct resource for PCI_ROM_RESOURCE.
After this change, pci_map_rom() will call pci_assign_resource() and
pci_enable_rom() for these IORESOURCE_ROM_SHADOW resources, which we did
not do before. This is safe because:
- pci_assign_resource() will do nothing because the resource is marked
IORESOURCE_PCI_FIXED, which means we can't move it, and
- pci_enable_rom() will not turn on the ROM BAR's enable bit because the
resource is marked IORESOURCE_ROM_SHADOW, which means it is in RAM
rather than in PCI memory space.
Storing the location in the struct resource means "lspci" will show the
shadow location, not the value from the ROM BAR.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
walk_iomem_res() and region_intersects() still need to use
strcmp() for searching a resource entry by @name in the iomem
table.
This patch introduces I/O resource descriptor 'desc' in struct
resource for the iomem search interfaces. Drivers can assign
their unique descriptor to a range when they support the search
interfaces.
Otherwise, 'desc' is set to IORES_DESC_NONE (0). This avoids
changing most of the drivers as they typically allocate resource
entries statically, or by calling alloc_resource(), kzalloc(),
or alloc_bootmem_low(), which set the field to zero by default.
A later patch will address some drivers that use kmalloc()
without zero'ing the field.
Also change release_mem_region_adjustable() to set 'desc' when
its resource entry gets separated. Other resource interfaces are
also changed to initialize 'desc' explicitly although
alloc_resource() sets it to 0.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/1453841853-11383-4-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
All users of __check_region(), check_region(), and check_mem_region() are
gone. We got rid of the last user in v4.0-rc1. Remove them.
bloat-o-meter on x86_64 shows:
add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-102 (-102)
function old new delta
__kstrtab___check_region 15 - -15
__ksymtab___check_region 16 - -16
__check_region 71 - -71
Signed-off-by: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide device-managed implementations of the request_resource() and
release_resource() functions. Upon failure to request a resource, the new
devm_request_resource() function will output an error message for
consistent error reporting.
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
I have added two more functions to walk through resources.
Currently walk_system_ram_range() deals with pfn and /proc/iomem can
contain partial pages. By dealing in pfn, callback function loses the
info that last page of a memory range is a partial page and not the full
page. So I implemented walk_system_ram_res() which returns u64 values to
callback functions and now it properly return start and end address.
walk_system_ram_range() uses find_next_system_ram() to find the next ram
resource. This in turn only travels through siblings of top level child
and does not travers through all the nodes of the resoruce tree. I also
need another function where I can walk through all the resources, for
example figure out where "GART" aperture is. Figure out where ACPI memory
is.
So I wrote another function walk_iomem_res() which walks through all
/proc/iomem resources and returns matches as asked by caller. Caller can
specify "name" of resource, start and end and flags.
Got rid of find_next_system_ram_res() and instead implemented more generic
find_next_iomem_res() which can be used to traverse top level children
only based on an argument.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: WANG Chao <chaowang@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sometimes we have a struct resource where we know the type (MEM/IO/etc.)
and the size, but we haven't assigned address space for it. The
IORESOURCE_UNSET flag is a way to indicate this situation. For these
"unset" resources, the start address is meaningless, so print only the
size, e.g.,
- pci 0000:0c:00.0: reg 184: [mem 0x00000000-0x00001fff 64bit]
+ pci 0000:0c:00.0: reg 184: [mem size 0x2000 64bit]
For %pr (printing with raw flags), we still print the address range,
because %pr is mostly used for debugging anyway.
Thanks to Fengguang Wu <fengguang.wu@intel.com> for suggesting
resource_size().
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
We have two identical copies of resource_contains() already, and more
places that could use it. This moves it to ioport.h where it can be
shared.
resource_contains(struct resource *r1, struct resource *r2) returns true
iff r1 and r2 are the same type (most callers already checked this
separately) and the r1 address range completely contains r2.
In addition, the new resource_contains() checks that both r1 and r2 have
addresses assigned to them. If a resource is IORESOURCE_UNSET, it doesn't
have a valid address and can't contain or be contained by another resource.
Some callers already check this or for res->start.
No functional change.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Add release_mem_region_adjustable(), which releases a requested region
from a currently busy memory resource. This interface adjusts the
matched memory resource accordingly even if the requested region does
not match exactly but still fits into.
This new interface is intended for memory hot-delete. During bootup,
memory resources are inserted from the boot descriptor table, such as
EFI Memory Table and e820. Each memory resource entry usually covers
the whole contigous memory range. Memory hot-delete request, on the
other hand, may target to a particular range of memory resource, and its
size can be much smaller than the whole contiguous memory. Since the
existing release interfaces like __release_region() require a requested
region to be exactly matched to a resource entry, they do not allow a
partial resource to be released.
This new interface is restrictive (i.e. release under certain
conditions), which is consistent with other release interfaces,
__release_region() and __release_resource(). Additional release
conditions, such as an overlapping region to a resource entry, can be
supported after they are confirmed as valid cases.
There is no change to the existing interfaces since their restriction is
valid for I/O resources.
[akpm@linux-foundation.org: use GFP_ATOMIC under write_lock()]
[akpm@linux-foundation.org: switch back to GFP_KERNEL, less buggily]
[akpm@linux-foundation.org: remove unneeded and wrong kfree(), per Toshi]
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by : Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Cc: T Makphaibulchoke <tmac@hp.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently a bunch of I2C/SPI MFD drivers are using IORESOURCE_IO for
register address ranges. Since this causes some confusion due to the
primary use of this resource type for PCI/ISA I/O ports create a new
resource type IORESOURCE_REG.
Unfortunately the current resource types are specified as bitmasks and
there are no free bitmasks even though they really shouldn't be used as
such so we define the new type as IORESOURCE_IO | IORESOURCE_MEM.
Benjamin Herrenschmidt and Russell King have both verified that none of
the users in this series will have a problem with this, and no new code
should be affected.
This patch was written by Russell King but he found himself unable to
take the patch further.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com>
Tested-by: Haojian Zhuang <haojian.zhuang@gmail.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Add resource_overlaps(), which returns true if two resources overlap at all.
Use this to replace the complicated check in coalesce_windows().
Signed-Off-By: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k/math-emu: Remove unnecessary code
m68k/math-emu: Remove commented out old code
m68k: Kill warning in setup_arch() when compiling for Sun3
m68k/atari: Prefix GPIO_{IN,OUT} with CODEC_
sparc: iounmap() and *_free_coherent() - Use lookup_resource()
m68k/atari: Reserve some ST-RAM early on for device buffer use
m68k/amiga: Chip RAM - Use lookup_resource()
resources: Add lookup_resource()
sparc: _sparc_find_resource() should check for exact matches
m68k/amiga: Chip RAM - Offset resource end by CHIP_PHYSADDR
m68k/amiga: Chip RAM - Use resource_size() to fix off-by-one error
m68k/amiga: Chip RAM - Change chipavail to an atomic_t
m68k/amiga: Chip RAM - Always allocate from the start of memory
m68k/amiga: Chip RAM - Convert from printk() to pr_*()
m68k/amiga: Chip RAM - Use tabs for indentation
Add a function to find an existing resource by a resource start address.
This allows to implement simple allocators (with a malloc/free-alike API)
on top of the resource system.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>