mirror of
https://github.com/armbian/linux-rockchip.git
synced 2026-01-06 11:08:10 -08:00
ce80098db2439ee44403ec6fccd3a10be21c7aff
152 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
a9e7b8d4f6 |
kernel/resource: disallow access to exclusive system RAM regions
virtio-mem dynamically exposes memory inside a device memory region as
system RAM to Linux, coordinating with the hypervisor which parts are
actually "plugged" and consequently usable/accessible.
On the one hand, the virtio-mem driver adds/removes whole memory blocks,
creating/removing busy IORESOURCE_SYSTEM_RAM resources, on the other
hand, it logically (un)plugs memory inside added memory blocks,
dynamically either exposing them to the buddy or hiding them from the
buddy and marking them PG_offline.
In contrast to physical devices, like a DIMM, the virtio-mem driver is
required to actually make use of any of the device-provided memory,
because it performs the handshake with the hypervisor. virtio-mem
memory cannot simply be access via /dev/mem without a driver.
There is no safe way to:
a) Access plugged memory blocks via /dev/mem, as they might contain
unplugged holes or might get silently unplugged by the virtio-mem
driver and consequently turned inaccessible.
b) Access unplugged memory blocks via /dev/mem because the virtio-mem
driver is required to make them actually accessible first.
The virtio-spec states that unplugged memory blocks MUST NOT be written,
and only selected unplugged memory blocks MAY be read. We want to make
sure, this is the case in sane environments -- where the virtio-mem driver
was loaded.
We want to make sure that in a sane environment, nobody "accidentially"
accesses unplugged memory inside the device managed region. For example,
a user might spot a memory region in /proc/iomem and try accessing it via
/dev/mem via gdb or dumping it via something else. By the time the mmap()
happens, the memory might already have been removed by the virtio-mem
driver silently: the mmap() would succeeed and user space might
accidentially access unplugged memory.
So once the driver was loaded and detected the device along the
device-managed region, we just want to disallow any access via /dev/mem to
it.
In an ideal world, we would mark the whole region as busy ("owned by a
driver") and exclude it; however, that would be wrong, as we don't really
have actual system RAM at these ranges added to Linux ("busy system RAM").
Instead, we want to mark such ranges as "not actual busy system RAM but
still soft-reserved and prepared by a driver for future use."
Let's teach iomem_is_exclusive() to reject access to any range with
"IORESOURCE_SYSTEM_RAM | IORESOURCE_EXCLUSIVE", even if not busy and even
if "iomem=relaxed" is set. Introduce EXCLUSIVE_SYSTEM_RAM to make it
easier for applicable drivers to depend on this setting in their Kconfig.
For now, there are no applicable ranges and we'll modify virtio-mem next
to properly set IORESOURCE_EXCLUSIVE on the parent resource container it
creates to contain all actual busy system RAM added via
add_memory_driver_managed().
Link: https://lkml.kernel.org/r/20210920142856.17758-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
b78dfa059f |
kernel/resource: clean up and optimize iomem_is_exclusive()
Patch series "virtio-mem: disallow mapping virtio-mem memory via /dev/mem", v5.
Let's add the basic infrastructure to exclude some physical memory regions
marked as "IORESOURCE_SYSTEM_RAM" completely from /dev/mem access, even
though they are not marked IORESOURCE_BUSY and even though "iomem=relaxed"
is set. Resource IORESOURCE_EXCLUSIVE for that purpose instead of adding
new flags to express something similar to "soft-busy" or "not busy yet,
but already prepared by a driver and not to be mapped by user space".
Use it for virtio-mem, to disallow mapping any virtio-mem memory via
/dev/mem to user space after the virtio-mem driver was loaded.
This patch (of 3):
We end up traversing subtrees of ranges we are not interested in; let's
optimize this case, skipping such subtrees, cleaning up the function a
bit.
For example, in the following configuration (/proc/iomem):
00000000-00000fff : Reserved
00001000-00057fff : System RAM
00058000-00058fff : Reserved
00059000-0009cfff : System RAM
0009d000-000fffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c3fff : PCI Bus 0000:00
000c4000-000c7fff : PCI Bus 0000:00
000c8000-000cbfff : PCI Bus 0000:00
000cc000-000cffff : PCI Bus 0000:00
000d0000-000d3fff : PCI Bus 0000:00
000d4000-000d7fff : PCI Bus 0000:00
000d8000-000dbfff : PCI Bus 0000:00
000dc000-000dffff : PCI Bus 0000:00
000e0000-000e3fff : PCI Bus 0000:00
000e4000-000e7fff : PCI Bus 0000:00
000e8000-000ebfff : PCI Bus 0000:00
000ec000-000effff : PCI Bus 0000:00
000f0000-000fffff : PCI Bus 0000:00
000f0000-000fffff : System ROM
00100000-3fffffff : System RAM
40000000-403fffff : Reserved
40000000-403fffff : pnp 00:00
40400000-80a79fff : System RAM
...
We don't have to look at any children of "0009d000-000fffff : Reserved"
if we can just skip these 15 items directly because the parent range is
not of interest.
Link: https://lkml.kernel.org/r/20210920142856.17758-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210920142856.17758-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
eb1f065f90 |
kernel/resource: fix return code check in __request_free_mem_region
Splitting an earlier version of a patch that allowed calling __request_region() while holding the resource lock into a series of patches required changing the return code for the newly introduced __request_region_locked(). Unfortunately this change was not carried through to a subsequent commit |
||
|
|
56fd94919b |
kernel/resource: fix locking in request_free_mem_region
request_free_mem_region() is used to find an empty range of physical
addresses for hotplugging ZONE_DEVICE memory. It does this by iterating
over the range of possible addresses using region_intersects() to see if
the range is free before calling request_mem_region() to allocate the
region.
However the resource_lock is dropped between these two calls meaning by
the time request_mem_region() is called in request_free_mem_region()
another thread may have already reserved the requested region. This
results in unexpected failures and a message in the kernel log from
hitting this condition:
/*
* mm/hmm.c reserves physical addresses which then
* become unavailable to other users. Conflicts are
* not expected. Warn to aid debugging if encountered.
*/
if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
pr_warn("Unaddressable device %s %pR conflicts with %pR",
conflict->name, conflict, res);
These unexpected failures can be corrected by holding resource_lock across
the two calls. This also requires memory allocation to be performed prior
to taking the lock.
Link: https://lkml.kernel.org/r/20210419070109.4780-3-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Muchun Song <smuchun@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
63cdafe0af |
kernel/resource: refactor __request_region to allow external locking
Refactor the portion of __request_region() done whilst holding the resource_lock into a separate function to allow callers to hold the lock. Link: https://lkml.kernel.org/r/20210419070109.4780-2-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Muchun Song <smuchun@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d486ccb252 |
kernel/resource: allow region_intersects users to hold resource_lock
Introduce a version of region_intersects() that can be called with the resource_lock already held. This will be used in a future fix to __request_free_mem_region(). [akpm@linux-foundation.org: make __region_intersects static] Link: https://lkml.kernel.org/r/20210419070109.4780-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Muchun Song <smuchun@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
97523a4edb |
kernel/resource: remove first_lvl / siblings_only logic
All functions that search for IORESOURCE_SYSTEM_RAM or IORESOURCE_MEM resources now properly consider the whole resource tree, not just the first level. Let's drop the unused first_lvl / siblings_only logic. Remove documentation that indicates that some functions behave differently, all consider the full resource tree now. Link: https://lkml.kernel.org/r/20210325115326.7826-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Cc: Dave Young <dyoung@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: Oscar Salvador <osalvador@suse.de> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
3c9c797534 |
kernel/resource: make walk_mem_res() find all busy IORESOURCE_MEM resources
It used to be true that we can have system RAM (IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY) only on the first level in the resource tree. However, this is no longer holds for driver-managed system RAM (i.e., added via dax/kmem and virtio-mem), which gets added on lower levels, for example, inside device containers. IORESOURCE_SYSTEM_RAM is defined as IORESOURCE_MEM | IORESOURCE_SYSRAM and just a special type of IORESOURCE_MEM. The function walk_mem_res() only considers the first level and is used in arch/x86/mm/ioremap.c:__ioremap_check_mem() only. We currently fail to identify System RAM added by dax/kmem and virtio-mem as "IORES_MAP_SYSTEM_RAM", for example, allowing for remapping of such "normal RAM" in __ioremap_caller(). Let's find all IORESOURCE_MEM | IORESOURCE_BUSY resources, making the function behave similar to walk_system_ram_res(). Link: https://lkml.kernel.org/r/20210325115326.7826-3-david@redhat.com Fixes: |
||
|
|
97f61c8f44 |
kernel/resource: make walk_system_ram_res() find all busy IORESOURCE_SYSTEM_RAM resources
Patch series "kernel/resource: make walk_system_ram_res() and walk_mem_res() search the whole tree", v2. Playing with kdump+virtio-mem I noticed that kexec_file_load() does not consider System RAM added via dax/kmem and virtio-mem when preparing the elf header for kdump. Looking into the details, the logic used in walk_system_ram_res() and walk_mem_res() seems to be outdated. walk_system_ram_range() already does the right thing, let's change walk_system_ram_res() and walk_mem_res(), and clean up. Loading a kdump kernel via "kexec -p -s" ... will result in the kdump kernel to also dump dax/kmem and virtio-mem added System RAM now. Note: kexec-tools on x86-64 also have to be updated to consider this memory in the kexec_load() case when processing /proc/iomem. This patch (of 3): It used to be true that we can have system RAM (IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY) only on the first level in the resource tree. However, this is no longer holds for driver-managed system RAM (i.e., added via dax/kmem and virtio-mem), which gets added on lower levels, for example, inside device containers. We have two users of walk_system_ram_res(), which currently only consideres the first level: a) kernel/kexec_file.c:kexec_walk_resources() -- We properly skip IORESOURCE_SYSRAM_DRIVER_MANAGED resources via locate_mem_hole_callback(), so even after this change, we won't be placing kexec images onto dax/kmem and virtio-mem added memory. No change. b) arch/x86/kernel/crash.c:fill_up_crash_elf_data() -- we're currently not adding relevant ranges to the crash elf header, resulting in them not getting dumped via kdump. This change fixes loading a crashkernel via kexec_file_load() and including dax/kmem and virtio-mem added System RAM in the crashdump on x86-64. Note that e.g,, arm64 relies on memblock data and, therefore, always considers all added System RAM already. Let's find all IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY resources, making the function behave like walk_system_ram_range(). Link: https://lkml.kernel.org/r/20210325115326.7826-1-david@redhat.com Link: https://lkml.kernel.org/r/20210325115326.7826-2-david@redhat.com Fixes: |
||
|
|
71a1d8ed90 |
resource: Move devmem revoke code to resource framework
We want all iomem mmaps to consistently revoke ptes when the kernel takes over and CONFIG_IO_STRICT_DEVMEM is enabled. This includes the pci bar mmaps available through procfs and sysfs, which currently do not revoke mappings. To prepare for this, move the code from the /dev/kmem driver to kernel/resource.c. During review Jason spotted that barriers are used somewhat inconsistently. Fix that up while we shuffle this code, since it doesn't have an actual impact at runtime. Otherwise no semantic and behavioural changes intended, just code extraction and adjusting comments and names. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kees Cook <keescook@chromium.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Hildenbrand <david@redhat.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20201127164131.2244124-11-daniel.vetter@ffwll.ch |
||
|
|
3be8da5708 |
kernel/resource.c: fix kernel-doc markups
Kernel-doc markups should use this format:
identifier - description
While here, fix a kernel-doc tag that was using, instead,
a normal comment block.
[akpm@linux-foundation.org: coding style fixes]
Link: https://lkml.kernel.org/r/c5e38e1070f8dbe2f9607a10b44afe2875bd966c.1605521731.git.mchehab+huawei@kernel.org
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: "Jonathan Corbet" <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
66f4fa32eb |
resource: Simplify region_intersects() by reducing conditionals
Now we have for 'other' and 'type' variables other type return 0 0 REGION_DISJOINT 0 x REGION_INTERSECTS x 0 REGION_DISJOINT x x REGION_MIXED Obviously it's easier to check 'type' for 0 first instead of currently checked 'other'. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
|
|
cb8e3c8b4f |
kernel/resource: make iomem_resource implicit in release_mem_region_adjustable()
"mem" in the name already indicates the root, similar to release_mem_region() and devm_request_mem_region(). Make it implicit. The only single caller always passes iomem_resource, other parents are not applicable. Suggested-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Baoquan He <bhe@redhat.com> Link: https://lkml.kernel.org/r/20200916073041.10355-1-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
9ca6551ee2 |
mm/memory_hotplug: MEMHP_MERGE_RESOURCE to specify merging of System RAM resources
Some add_memory*() users add memory in small, contiguous memory blocks. Examples include virtio-mem, hyper-v balloon, and the XEN balloon. This can quickly result in a lot of memory resources, whereby the actual resource boundaries are not of interest (e.g., it might be relevant for DIMMs, exposed via /proc/iomem to user space). We really want to merge added resources in this scenario where possible. Let's provide a flag (MEMHP_MERGE_RESOURCE) to specify that a resource either created within add_memory*() or passed via add_memory_resource() shall be marked mergeable and merged with applicable siblings. To implement that, we need a kernel/resource interface to mark selected System RAM resources mergeable (IORESOURCE_SYSRAM_MERGEABLE) and trigger merging. Note: We really want to merge after the whole operation succeeded, not directly when adding a resource to the resource tree (it would break add_memory_resource() and require splitting resources again when the operation failed - e.g., due to -ENOMEM). Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Julien Grall <julien@xen.org> Cc: Baoquan He <bhe@redhat.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Leonardo Bras <leobras.c@gmail.com> Cc: Libor Pechacek <lpechacek@suse.cz> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Link: https://lkml.kernel.org/r/20200911103459.10306-6-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ec62d04e3f |
kernel/resource: make release_mem_region_adjustable() never fail
Patch series "selective merging of system ram resources", v4. Some add_memory*() users add memory in small, contiguous memory blocks. Examples include virtio-mem, hyper-v balloon, and the XEN balloon. This can quickly result in a lot of memory resources, whereby the actual resource boundaries are not of interest (e.g., it might be relevant for DIMMs, exposed via /proc/iomem to user space). We really want to merge added resources in this scenario where possible. Resources are effectively stored in a list-based tree. Having a lot of resources not only wastes memory, it also makes traversing that tree more expensive, and makes /proc/iomem explode in size (e.g., requiring kexec-tools to manually merge resources when creating a kdump header. The current kexec-tools resource count limit does not allow for more than ~100GB of memory with a memory block size of 128MB on x86-64). Let's allow to selectively merge system ram resources by specifying a new flag for add_memory*(). Patch #5 contains a /proc/iomem example. Only tested with virtio-mem. This patch (of 8): Let's make sure splitting a resource on memory hotunplug will never fail. This will become more relevant once we merge selected System RAM resources - then, we'll trigger that case more often on memory hotunplug. In general, this function is already unlikely to fail. When we remove memory, we free up quite a lot of metadata (memmap, page tables, memory block device, etc.). The only reason it could really fail would be when injecting allocation errors. All other error cases inside release_mem_region_adjustable() seem to be sanity checks if the function would be abused in different context - let's add WARN_ON_ONCE() in these cases so we can catch them. [natechancellor@gmail.com: fix use of ternary condition in release_mem_region_adjustable] Link: https://lkml.kernel.org/r/20200922060748.2452056-1-natechancellor@gmail.com Link: https://github.com/ClangBuiltLinux/linux/issues/1159 Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Julien Grall <julien@xen.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Len Brown <lenb@kernel.org> Cc: Leonardo Bras <leobras.c@gmail.com> Cc: Libor Pechacek <lpechacek@suse.cz> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Roger Pau Monn <roger.pau@citrix.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Wei Liu <wei.liu@kernel.org> Link: https://lkml.kernel.org/r/20200911103459.10306-2-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
73fb952d83 |
resource: report parent to walk_iomem_res_desc() callback
In support of detecting whether a resource might have been been claimed, report the parent to the walk_iomem_res_desc() callback. For example, the ACPI HMAT parser publishes "hmem" platform devices per target range. However, if the HMAT is disabled / missing a fallback driver can attach devices to the raw memory ranges as a fallback if it sees unclaimed / orphan "Soft Reserved" resources in the resource tree. Otherwise, find_next_iomem_res() returns a resource with garbage data from the stack allocation in __walk_iomem_res_desc() for the res->parent field. There are currently no users that expect ->child and ->sibling to be valid, and the resource_lock would be needed to traverse them. Use a compound literal to implicitly zero initialize the fields that are not being returned in addition to setting ->parent. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brice Goglin <Brice.Goglin@inria.fr> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Airlie <airlied@linux.ie> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Will Deacon <will@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Hulk Robot <hulkci@huawei.com> Cc: Jason Yan <yanaijie@huawei.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: kernel test robot <lkp@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Link: https://lkml.kernel.org/r/159643097166.4062302.11875688887228572793.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
3234ac664a |
/dev/mem: Revoke mappings when a driver claims the region
Close the hole of holding a mapping over kernel driver takeover event of a given address range. Commit |
||
|
|
00ff9a91bd |
mm/memory_hotplug.c: use PFN_UP / PFN_DOWN in walk_system_ram_range()
Patch series "mm/memory_hotplug: online_pages() cleanups", v2. Some cleanups (+ one fix for a special case) in the context of online_pages(). This patch (of 5): This makes it clearer that we will never call func() with duplicate PFNs in case we have multiple sub-page memory resources. All unaligned parts of PFNs are completely discarded. Link: http://lkml.kernel.org/r/20190814154109.3448-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Wei Yang <richardw.yang@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Nadav Amit <namit@vmware.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Arun KS <arunks@codeaurora.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
0c38519039 |
resource: add a not device managed request_free_mem_region variant
Factor out the guts of devm_request_free_mem_region so that we can implement both a device managed and a manually release version as tiny wrappers around it. Link: https://lore.kernel.org/r/20190818090557.17853-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Bharata B Rao <bharata@linux.ibm.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> |
||
|
|
756398750e |
resource: avoid unnecessary lookups in find_next_iomem_res()
find_next_iomem_res() shows up to be a source for overhead in dax benchmarks. Improve performance by not considering children of the tree if the top level does not match. Since the range of the parents should include the range of the children such check is redundant. Running sysbench on dax (pmem emulation, with write_cache disabled): sysbench fileio --file-total-size=3G --file-test-mode=rndwr \ --file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run Provides the following results: events (avg/stddev) ------------------- 5.2-rc3: 1247669.0000/16075.39 w/patch: 1286320.5000/16402.72 (+3%) Link: http://lkml.kernel.org/r/20190613045903.4922-3-namit@vmware.com Signed-off-by: Nadav Amit <namit@vmware.com> Cc: Borislav Petkov <bp@suse.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
49f17c26c1 |
resource: fix locking in find_next_iomem_res()
Since resources can be removed, locking should ensure that the resource
is not removed while accessing it. However, find_next_iomem_res() does
not hold the lock while copying the data of the resource.
Keep holding the lock while the data is copied. While at it, change the
return value to a more informative value. It is disregarded by the
callers.
[akpm@linux-foundation.org: fix find_next_iomem_res() documentation]
Link: http://lkml.kernel.org/r/20190613045903.4922-2-namit@vmware.com
Fixes:
|
||
|
|
0092908d16 |
mm: factor out a devm_request_free_mem_region helper
Keep the physical address allocation that hmm_add_device does with the rest of the resource code, and allow future reuse of it without the hmm wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> |
||
|
|
457c899653 |
treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
f6c6010a07 |
mm/resource: Use resource_overlaps() to simplify region_intersects()
The three checks in region_intersects() are basically an open-coded version of resource_overlaps() - so use the real thing. Also fix typos in comments while at it. Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Reviewed-by: Like Xu <like.xu@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: bhelgaas@google.com Cc: bp@suse.de Cc: dan.j.williams@intel.com Cc: jack@suse.cz Cc: rdunlap@infradead.org Cc: tiwai@suse.de Link: http://lkml.kernel.org/r/20190305083432.23675-1-richardw.yang@linux.intel.com [ Rewrote the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
f67e3fb489 |
Merge tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull device-dax updates from Dan Williams:
"New device-dax infrastructure to allow persistent memory and other
"reserved" / performance differentiated memories, to be assigned to
the core-mm as "System RAM".
Some users want to use persistent memory as additional volatile
memory. They are willing to cope with potential performance
differences, for example between DRAM and 3D Xpoint, and want to use
typical Linux memory management apis rather than a userspace memory
allocator layered over an mmap() of a dax file. The administration
model is to decide how much Persistent Memory (pmem) to use as System
RAM, create a device-dax-mode namespace of that size, and then assign
it to the core-mm. The rationale for device-dax is that it is a
generic memory-mapping driver that can be layered over any "special
purpose" memory, not just pmem. On subsequent boots udev rules can be
used to restore the memory assignment.
One implication of using pmem as RAM is that mlock() no longer keeps
data off persistent media. For this reason it is recommended to enable
NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
at rest. We considered making this recommendation an actively enforced
requirement, but in the end decided to leave it as a distribution /
administrator policy to allow for emulation and test environments that
lack security capable NVDIMMs.
Summary:
- Replace the /sys/class/dax device model with /sys/bus/dax, and
include a compat driver so distributions can opt-in to the new ABI.
- Allow for an alternative driver for the device-dax address-range
- Introduce the 'kmem' driver to hotplug / assign a device-dax
address-range to the core-mm.
- Arrange for the device-dax target-node to be onlined so that the
newly added memory range can be uniquely referenced by numa apis"
NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because
we currently have special - and very annoying rules in the kernel about
accessing PMEM only with the "MC safe" accessors, because machine checks
inside the regular repeat string copy functions can be fatal in some
(not described) circumstances.
And apparently the PMEM modules can cause that a lot more than regular
RAM. The argument is that this happens because PMEM doesn't necessarily
get scrubbed at boot like RAM does, but that is planned to be added for
the user space tooling.
Quoting Dan from another email:
"The exposure can be reduced in the volatile-RAM case by scanning for
and clearing errors before it is onlined as RAM. The userspace tooling
for that can be in place before v5.1-final. There's also runtime
notifications of errors via acpi_nfit_uc_error_notify() from
background scrubbers on the DIMM devices. With that mechanism the
kernel could proactively clear newly discovered poison in the volatile
case, but that would be additional development more suitable for v5.2.
I understand the concern, and the need to highlight this issue by
tapping the brakes on feature development, but I don't see PMEM as RAM
making the situation worse when the exposure is also there via DAX in
the PMEM case. Volatile-RAM is arguably a safer use case since it's
possible to repair pages where the persistent case needs active
application coordination"
* tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
device-dax: "Hotplug" persistent memory for use like normal RAM
mm/resource: Let walk_system_ram_range() search child resources
mm/memory-hotplug: Allow memory resources to be children
mm/resource: Move HMM pr_debug() deeper into resource code
mm/resource: Return real error codes from walk failures
device-dax: Add a 'modalias' attribute to DAX 'bus' devices
device-dax: Add a 'target_node' attribute
device-dax: Auto-bind device after successful new_id
acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
device-dax: Add /sys/class/dax backwards compatibility
device-dax: Add support for a dax override driver
device-dax: Move resource pinning+mapping into the common driver
device-dax: Introduce bus + driver model
device-dax: Start defining a dax bus model
device-dax: Remove multi-resource infrastructure
device-dax: Kill dax_region base
device-dax: Kill dax_region ida
|