Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
  linux-next for a couple of months without, to my knowledge, any negative
  reports (or any positive ones, come to that).

- Also the Maple Tree from Liam Howlett. An overlapping range-based tree for
  vmas. It is apparently slightly more efficient in its own right, but is
  mainly targeted at enabling work to reduce mmap_lock contention.

  Liam has identified a number of other tree users in the kernel which could
  be beneficially converted to maple trees.

  Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1].
  This has yet to be addressed due to Liam's unfortunately timed vacation. He
  is now back and we'll get this fixed up.

- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
  clang-generated instrumentation to detect used-uninitialized bugs down to
  the single bit level.

  KMSAN keeps finding bugs. New ones, as well as the legacy ones.

- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
  memory into THPs.

- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
  file/shmem-backed pages.

- userfaultfd updates from Axel Rasmussen

- zsmalloc cleanups from Alexey Romanov

- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
  memory-failure

- Huang Ying adds enhancements to NUMA balancing memory tiering mode's page
  promotion, with a new way of detecting hot pages.

- memcg updates from Shakeel Butt: charging optimizations and reduced memory
  consumption.

- memcg cleanups from Kairui Song.

- memcg fixes and cleanups from Johannes Weiner.

- Vishal Moola provides more folio conversions

- Zhang Yi removed ll_rw_block() :(

- migration enhancements from Peter Xu

- migration error-path bugfixes from Huang Ying

- Aneesh Kumar added ability for a device driver to alter the memory tiering
  promotion paths. For optimizations by PMEM drivers, DRM drivers, etc.

- vma merging improvements from Jakub Matěna.

- NUMA hinting cleanups from David Hildenbrand.

- xu xin added additional userspace visibility into KSM merging activity.

- THP & KSM code consolidation from Qi Zheng.

- more folio work from Matthew Wilcox.

- KASAN updates from Andrey Konovalov.

- DAMON cleanups from Kaixu Xia.

- DAMON work from SeongJae Park: fixes and cleanups.

- hugetlb sysfs cleanups from Muchun Song.

- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]

* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
  hugetlb: allocate vma lock for all sharable vmas
  hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
  hugetlb: fix vma lock handling during split vma and range unmapping
  mglru: mm/vmscan.c: fix imprecise comments
  mm/mglru: don't sync disk for each aging cycle
  mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
  mm: memcontrol: use do_memsw_account() in a few more places
  mm: memcontrol: deprecate swapaccounting=0 mode
  mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
  mm/secretmem: remove reduntant return value
  mm/hugetlb: add available_huge_pages() func
  mm: remove unused inline functions from include/linux/mm_inline.h
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: dedup THP helpers
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  ...
Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers | 25 (new file)

@@ -0,0 +1,25 @@
+What:		/sys/devices/virtual/memory_tiering/
+Date:		August 2022
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	A collection of all the memory tiers allocated.
+
+		Individual memory tier details are contained in subdirectories
+		named by the abstract distance of the memory tier.
+
+		/sys/devices/virtual/memory_tiering/memory_tierN/
+
+
+What:		/sys/devices/virtual/memory_tiering/memory_tierN/
+		/sys/devices/virtual/memory_tiering/memory_tierN/nodes
+Date:		August 2022
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Directory with details of a specific memory tier
+
+		This is the directory containing information about a particular
+		memory tier, memtierN, where N is derived based on abstract
+		distance.
+
+		A smaller value of N implies a higher (faster) memory tier in
+		the hierarchy.
+
+		nodes: NUMA nodes that are part of this memory tier.
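For illustration, the tiers and their nodes can be inspected from a shell once at least one tier is registered (the ``memory_tier*`` directory names below depend on the system)::

    grep . /sys/devices/virtual/memory_tiering/memory_tier*/nodes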
Documentation/accounting/delay-accounting.rst
@@ -13,7 +13,7 @@ a) waiting for a CPU (while being runnable)
 b) completion of synchronous block I/O initiated by the task
 c) swapping in pages
 d) memory reclaim
-e) thrashing page cache
+e) thrashing
 f) direct compact
 g) write-protect copy
Documentation/admin-guide/cgroup-v1/memory.rst
@@ -299,7 +299,7 @@ Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
 lruvec->lru_lock; PG_lru bit of page->flags is cleared before
 isolating a page from its LRU under lruvec->lru_lock.

-2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
+2.7 Kernel Memory Extension
 -----------------------------------------------

 With the Kernel memory extension, the Memory Controller is able to limit
@@ -386,8 +386,6 @@ U != 0, K >= U:

 a. Enable CONFIG_CGROUPS
 b. Enable CONFIG_MEMCG
-c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
-d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)

 3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
 -------------------------------------------------------------------
Documentation/admin-guide/kernel-parameters.txt
@@ -1469,6 +1469,14 @@
 			Permit 'security.evm' to be updated regardless of
 			current integrity status.

+	early_page_ext	[KNL] Enforces page_ext initialization to earlier
+			stages so cover more early boot allocations.
+			Please note that as side effect some optimizations
+			might be disabled to achieve that (e.g. parallelized
+			memory initialization is disabled) so the boot process
+			might take longer, especially on systems with a lot of
+			memory. Available with CONFIG_PAGE_EXTENSION=y.
+
 	failslab=
 	fail_usercopy=
 	fail_page_alloc=
@@ -6041,12 +6049,6 @@
 			This parameter controls use of the Protected
 			Execution Facility on pSeries.

-	swapaccount=	[KNL]
-			Format: [0|1]
-			Enable accounting of swap in memory resource
-			controller if no parameter or 1 is given or disable
-			it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
-
 	swiotlb=	[ARM,IA-64,PPC,MIPS,X86]
 			Format: { <int> [,<int>] | force | noforce }
 			<int> -- Number of I/O TLB slabs
Documentation/admin-guide/mm/cma_debugfs.rst
@@ -5,10 +5,10 @@ CMA Debugfs Interface
 The CMA debugfs interface is useful to retrieve basic information out of the
 different CMA areas and to test allocation/release in each of the areas.

-Each CMA zone represents a directory under <debugfs>/cma/, indexed by the
-kernel's CMA index. So the first CMA zone would be:
+Each CMA area represents a directory under <debugfs>/cma/, represented by
+its CMA name like below:

-	<debugfs>/cma/cma-0
+	<debugfs>/cma/<cma_name>

 The structure of the files created under that directory is as follows:

@@ -18,8 +18,8 @@ The structure of the files created under that directory is as follows:
 - [RO] bitmap: The bitmap of page states in the zone.
 - [WO] alloc: Allocate N pages from that CMA area. For example::

-	echo 5 > <debugfs>/cma/cma-2/alloc
+	echo 5 > <debugfs>/cma/<cma_name>/alloc

-would try to allocate 5 pages from the cma-2 area.
+would try to allocate 5 pages from the 'cma_name' area.

 - [WO] free: Free N pages from that CMA area, similar to the above.
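For illustration, releasing pages goes through the sibling ``free`` file in the same way (a sketch; substitute the CMA area name on your system)::

    echo 5 > <debugfs>/cma/<cma_name>/free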
Documentation/admin-guide/mm/damon/start.rst
@@ -1,8 +1,8 @@
 .. SPDX-License-Identifier: GPL-2.0

-========================
-Monitoring Data Accesses
-========================
+==========================
+DAMON: Data Access MONitor
+==========================

 :doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring.
 Using DAMON, users can analyze the memory access patterns of their systems and
@@ -29,16 +29,9 @@ called DAMON Operator (DAMO). It is available at
 https://github.com/awslabs/damo. The examples below assume that ``damo`` is on
 your ``$PATH``. It's not mandatory, though.

-Because DAMO is using the debugfs interface (refer to :doc:`usage` for the
-detail) of DAMON, you should ensure debugfs is mounted. Mount it manually as
-below::
-
-    # mount -t debugfs none /sys/kernel/debug/
-
-or append the following line to your ``/etc/fstab`` file so that your system
-can automatically mount debugfs upon booting::
-
-    debugfs /sys/kernel/debug debugfs defaults 0 0
+Because DAMO is using the sysfs interface (refer to :doc:`usage` for the
+detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is
+mounted.


 Recording Data Access Patterns
Documentation/admin-guide/mm/damon/usage.rst
@@ -393,6 +393,11 @@ the files as above. Above is only for an example.
 debugfs Interface
 =================

+.. note::
+
+  DAMON debugfs interface will be removed after next LTS kernel is released, so
+  users should move to the :ref:`sysfs interface <sysfs_interface>`.
+
 DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``,
 ``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and
 ``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``.
Documentation/admin-guide/mm/index.rst
@@ -32,6 +32,7 @@ the Linux memory management.
    idle_page_tracking
    ksm
    memory-hotplug
+   multigen_lru
    nommu-mmap
    numa_memory_policy
    numaperf
Documentation/admin-guide/mm/ksm.rst
@@ -184,6 +184,42 @@ The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
 ``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
 be increased accordingly.

+Monitoring KSM profit
+=====================
+
+KSM can save memory by merging identical pages, but also can consume
+additional memory, because it needs to generate a number of rmap_items to
+save each scanned page's brief rmap information. Some of these pages may
+be merged, but some may not be able to be merged after being checked
+several times, which are unprofitable memory consumed.
+
+1) How to determine whether KSM save memory or consume memory in system-wide
+   range? Here is a simple approximate calculation for reference::
+
+	general_profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
+			  sizeof(rmap_item);
+
+   where all_rmap_items can be easily obtained by summing ``pages_sharing``,
+   ``pages_shared``, ``pages_unshared`` and ``pages_volatile``.
+
+2) The KSM profit inner a single process can be similarly obtained by the
+   following approximate calculation::
+
+	process_profit =~ ksm_merging_pages * sizeof(page) -
+			  ksm_rmap_items * sizeof(rmap_item).
+
+   where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
+   and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``.
+
+From the perspective of application, a high ratio of ``ksm_rmap_items`` to
+``ksm_merging_pages`` means a bad madvise-applied policy, so developers or
+administrators have to rethink how to change madvise policy. Giving an example
+for reference, a page's size is usually 4K, and the rmap_item's size is
+separately 32B on 32-bit CPU architecture and 64B on 64-bit CPU architecture.
+so if the ``ksm_rmap_items/ksm_merging_pages`` ratio exceeds 64 on 64-bit CPU
+or exceeds 128 on 32-bit CPU, then the app's madvise policy should be dropped,
+because the ksm profit is approximately zero or negative.
+
 Monitoring KSM events
 =====================
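For illustration, the system-wide estimate above can be computed directly from the ``/sys/kernel/mm/ksm`` counters; a minimal sketch assuming a 4K page and a 64B rmap_item (64-bit CPU)::

    cd /sys/kernel/mm/ksm
    sharing=$(cat pages_sharing)
    # all_rmap_items = pages_sharing + pages_shared + pages_unshared + pages_volatile
    all_rmap_items=$(( sharing + $(cat pages_shared) + $(cat pages_unshared) + $(cat pages_volatile) ))
    echo "general_profit ~ $(( sharing * 4096 - all_rmap_items * 64 )) bytes"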
Documentation/admin-guide/mm/multigen_lru.rst | 162 (new file)

@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Multi-Gen LRU
+=============
+The multi-gen LRU is an alternative LRU implementation that optimizes
+page reclaim and improves performance under memory pressure. Page
+reclaim decides the kernel's caching policy and ability to overcommit
+memory. It directly impacts the kswapd CPU usage and RAM efficiency.
+
+Quick start
+===========
+Build the kernel with the following configurations.
+
+* ``CONFIG_LRU_GEN=y``
+* ``CONFIG_LRU_GEN_ENABLED=y``
+
+All set!
+
+Runtime options
+===============
+``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
+following subsections.
+
+Kill switch
+-----------
+``enabled`` accepts different values to enable or disable the
+following components. Its default value depends on
+``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
+unless some of them have unforeseen side effects. Writing to
+``enabled`` has no effect when a component is not supported by the
+hardware, and valid values will be accepted even when the main switch
+is off.
+
+====== ===============================================================
+Values Components
+====== ===============================================================
+0x0001 The main switch for the multi-gen LRU.
+0x0002 Clearing the accessed bit in leaf page table entries in large
+       batches, when MMU sets it (e.g., on x86). This behavior can
+       theoretically worsen lock contention (mmap_lock). If it is
+       disabled, the multi-gen LRU will suffer a minor performance
+       degradation for workloads that contiguously map hot pages,
+       whose accessed bits can be otherwise cleared by fewer larger
+       batches.
+0x0004 Clearing the accessed bit in non-leaf page table entries as
+       well, when MMU sets it (e.g., on x86). This behavior was not
+       verified on x86 varieties other than Intel and AMD. If it is
+       disabled, the multi-gen LRU will suffer a negligible
+       performance degradation.
+[yYnN] Apply to all the components above.
+====== ===============================================================
+
+E.g.,
+::
+
+    echo y >/sys/kernel/mm/lru_gen/enabled
+    cat /sys/kernel/mm/lru_gen/enabled
+    0x0007
+    echo 5 >/sys/kernel/mm/lru_gen/enabled
+    cat /sys/kernel/mm/lru_gen/enabled
+    0x0005
+
+Thrashing prevention
+--------------------
+Personal computers are more sensitive to thrashing because it can
+cause janks (lags when rendering UI) and negatively impact user
+experience. The multi-gen LRU offers thrashing prevention to the
+majority of laptop and desktop users who do not have ``oomd``.
+
+Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
+``N`` milliseconds from getting evicted. The OOM killer is triggered
+if this working set cannot be kept in memory. In other words, this
+option works as an adjustable pressure relief valve, and when open, it
+terminates applications that are hopefully not being used.
+
+Based on the average human detectable lag (~100ms), ``N=1000`` usually
+eliminates intolerable janks due to thrashing. Larger values like
+``N=3000`` make janks less noticeable at the risk of premature OOM
+kills.
+
+The default value ``0`` means disabled.
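For instance, to protect roughly the last second's working set (a sketch; tune ``N`` as described above)::

    echo 1000 >/sys/kernel/mm/lru_gen/min_ttl_ms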
+Experimental features
+=====================
+``/sys/kernel/debug/lru_gen`` accepts commands described in the
+following subsections. Multiple command lines are supported, so does
+concatenation with delimiters ``,`` and ``;``.
+
+``/sys/kernel/debug/lru_gen_full`` provides additional stats for
+debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
+evicted generations in this file.
+
+Working set estimation
+----------------------
+Working set estimation measures how much memory an application needs
+in a given time interval, and it is usually done with little impact on
+the performance of the application. E.g., data centers want to
+optimize job scheduling (bin packing) to improve memory utilizations.
+When a new job comes in, the job scheduler needs to find out whether
+each server it manages can allocate a certain amount of memory for
+this new job before it can pick a candidate. To do so, the job
+scheduler needs to estimate the working sets of the existing jobs.
+
+When it is read, ``lru_gen`` returns a histogram of numbers of pages
+accessed over different time intervals for each memcg and node.
+``MAX_NR_GENS`` decides the number of bins for each histogram. The
+histograms are noncumulative.
+::
+
+    memcg  memcg_id  memcg_path
+       node  node_id
+           min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
+           ...
+           max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
+
+Each bin contains an estimated number of pages that have been accessed
+within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
+and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
+the former is the largest and that of the latter is the smallest.
+
+Users can write the following command to ``lru_gen`` to create a new
+generation ``max_gen_nr+1``:
+
+    ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``
+
+``can_swap`` defaults to the swap setting and, if it is set to ``1``,
+it forces the scan of anon pages when swap is off, and vice versa.
+``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
+employs heuristics to reduce the overhead, which is likely to reduce
+the coverage as well.
+
+A typical use case is that a job scheduler runs this command at a
+certain time interval to create new generations, and it ranks the
+servers it manages based on the sizes of their cold pages defined by
+this time interval.
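As a concrete sketch of the aging step (the IDs and generation number here are hypothetical; read the histogram first to learn the real ``max_gen_nr``)::

    cat /sys/kernel/debug/lru_gen
    echo '+ 1 0 7 1 1' >/sys/kernel/debug/lru_gen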
+Proactive reclaim
+-----------------
+Proactive reclaim induces page reclaim when there is no memory
+pressure. It usually targets cold pages only. E.g., when a new job
+comes in, the job scheduler wants to proactively reclaim cold pages on
+the server it selected, to improve the chance of successfully landing
+this new job.
+
+Users can write the following command to ``lru_gen`` to evict
+generations less than or equal to ``min_gen_nr``.
+
+    ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``
+
+``min_gen_nr`` should be less than ``max_gen_nr-1``, since
+``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
+the active list) and therefore cannot be evicted. ``swappiness``
+overrides the default value in ``/proc/sys/vm/swappiness``.
+``nr_to_reclaim`` limits the number of pages to evict.
+
+A typical use case is that a job scheduler runs this command before it
+tries to land a new job on a server. If it fails to materialize enough
+cold pages because of the overestimation, it retries on the next
+server according to the ranking result obtained from the working set
+estimation step. This less forceful approach limits the impacts on the
+existing jobs.
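And a matching eviction sketch (again with hypothetical values: swappiness ``200``, at most 1048576 pages)::

    echo '- 1 0 4 200 1048576' >/sys/kernel/debug/lru_gen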
Documentation/admin-guide/mm/transhuge.rst
@@ -191,7 +191,14 @@ allocation failure to throttle the next allocation attempt::

	/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

-The khugepaged progress can be seen in the number of pages collapsed::
+The khugepaged progress can be seen in the number of pages collapsed (note
+that this counter may not be an exact count of the number of pages
+collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
+being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
+one 2M hugepage. Each may happen independently, or together, depending on
+the type of memory and the failures that occur. As such, this value should
+be interpreted roughly as a sign of progress, and counters in /proc/vmstat
+consulted for more accurate accounting)::

	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
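For illustration, rough progress can be watched alongside the more precise vmstat counters like so::

    cat /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
    grep thp_collapse /proc/vmstat   # thp_collapse_alloc, thp_collapse_alloc_failed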
@@ -366,10 +373,9 @@ thp_split_pmd
	page table entry.

 thp_zero_page_alloc
-	is incremented every time a huge zero page is
-	successfully allocated. It includes allocations which where
-	dropped due race with other allocation. Note, it doesn't count
-	every map of the huge zero page, only its allocation.
+	is incremented every time a huge zero page used for thp is
+	successfully allocated. Note, it doesn't count every map of
+	the huge zero page, only its allocation.

 thp_zero_page_alloc_failed
	is incremented if kernel fails to allocate
Documentation/admin-guide/mm/userfaultfd.rst
@@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick.
 Design
 ======

-Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
+Userspace creates a new userfaultfd, initializes it, and registers one or more
+regions of virtual memory with it. Then, any page faults which occur within the
+region(s) result in a message being delivered to the userfaultfd, notifying
+userspace of the fault.

 The ``userfaultfd`` (aside from registering and unregistering virtual
 memory ranges) provides two primary functionalities:
@@ -34,12 +37,11 @@ The real advantage of userfaults if compared to regular virtual memory
 management of mremap/mprotect is that the userfaults in all their
 operations never involve heavyweight structures like vmas (in fact the
 ``userfaultfd`` runtime load never takes the mmap_lock for writing).

 Vmas are not suitable for page- (or hugepage) granular fault tracking
 when dealing with virtual address spaces that could span
 Terabytes. Too many vmas would be needed for that.

-The ``userfaultfd`` once opened by invoking the syscall, can also be
+The ``userfaultfd``, once created, can also be
 passed using unix domain sockets to a manager process, so the same
 manager process could handle the userfaults of a multitude of
 different processes without them being aware about what is going on
@@ -50,6 +52,39 @@ is a corner case that would currently return ``-EBUSY``).
 API
 ===

+Creating a userfaultfd
+----------------------
+
+There are two ways to create a new userfaultfd, each of which provide ways to
+restrict access to this functionality (since historically userfaultfds which
+handle kernel page faults have been a useful tool for exploiting the kernel).
+
+The first way, supported since userfaultfd was introduced, is the
+userfaultfd(2) syscall. Access to this is controlled in several ways:
+
+- Any user can always create a userfaultfd which traps userspace page faults
+  only. Such a userfaultfd can be created using the userfaultfd(2) syscall
+  with the flag UFFD_USER_MODE_ONLY.
+
+- In order to also trap kernel page faults for the address space, either the
+  process needs the CAP_SYS_PTRACE capability, or the system must have
+  vm.unprivileged_userfaultfd set to 1. By default, vm.unprivileged_userfaultfd
+  is set to 0.
+
+The second way, added to the kernel more recently, is by opening
+/dev/userfaultfd and issuing a USERFAULTFD_IOC_NEW ioctl to it. This method
+yields equivalent userfaultfds to the userfaultfd(2) syscall.
+
+Unlike userfaultfd(2), access to /dev/userfaultfd is controlled via normal
+filesystem permissions (user/group/mode), which gives fine grained access to
+userfaultfd specifically, without also granting other unrelated privileges at
+the same time (as e.g. granting CAP_SYS_PTRACE would do). Users who have access
+to /dev/userfaultfd can always create userfaultfds that trap kernel page faults;
+vm.unprivileged_userfaultfd is not considered.
+
+Initializing a userfaultfd
+--------------------------
+
 When first opened the ``userfaultfd`` must be enabled invoking the
 ``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
 a later API version) which will specify the ``read/POLLIN`` protocol
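Tying the two steps together, a minimal userspace sketch (assuming uapi headers recent enough to ship USERFAULTFD_IOC_NEW; error handling trimmed to perror)::

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    int main(void)
    {
        /* Access is gated by normal file permissions on the device node. */
        int dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
        if (dev < 0) { perror("open(/dev/userfaultfd)"); return 1; }

        /* Mint a new userfaultfd; the argument takes userfaultfd(2) flags. */
        int uffd = ioctl(dev, USERFAULTFD_IOC_NEW, O_CLOEXEC | O_NONBLOCK);
        if (uffd < 0) { perror("USERFAULTFD_IOC_NEW"); return 1; }

        /* Handshake: enable the fd before issuing any other ioctl. */
        struct uffdio_api api = { .api = UFFD_API };
        if (ioctl(uffd, UFFDIO_API, &api) < 0) { perror("UFFDIO_API"); return 1; }

        printf("userfaultfd ready, features=0x%llx\n",
               (unsigned long long)api.features);
        return 0;
    }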
Documentation/admin-guide/sysctl/kernel.rst
@@ -635,6 +635,17 @@ different types of memory (represented as different NUMA nodes) to
 place the hot pages in the fast memory. This is implemented based on
 unmapping and page fault too.

+numa_balancing_promote_rate_limit_MBps
+======================================
+
+Too high promotion/demotion throughput between different memory types
+may hurt application latency. This can be used to rate limit the
+promotion throughput. The per-node max promotion throughput in MB/s
+will be limited to be no more than the set value.
+
+A rule of thumb is to set this to less than 1/10 of the PMEM node
+write bandwidth.
+
 oops_all_cpu_backtrace
 ======================
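For example (hypothetical numbers: a PMEM node with ~2000 MB/s of write bandwidth suggests a cap around 200)::

    sysctl -w kernel.numa_balancing_promote_rate_limit_MBps=200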
Documentation/admin-guide/sysctl/vm.rst
@@ -926,6 +926,9 @@ calls without any restrictions.

 The default value is 0.

+Another way to control permissions for userfaultfd is to use
+/dev/userfaultfd instead of userfaultfd(2). See
+Documentation/admin-guide/mm/userfaultfd.rst.
+
 user_reserve_kbytes
 ===================
Documentation/core-api/index.rst
@@ -37,6 +37,7 @@ Library functionality that is used throughout the kernel.
    kref
    assoc_array
    xarray
+   maple_tree
    idr
    circular-buffers
    rbtree
Documentation/core-api/maple_tree.rst | 217 (new file)

@@ -0,0 +1,217 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+
+==========
+Maple Tree
+==========
+
+:Author: Liam R. Howlett
+
+Overview
+========
+
+The Maple Tree is a B-Tree data type which is optimized for storing
+non-overlapping ranges, including ranges of size 1. The tree was designed to
+be simple to use and does not require a user written search method. It
+supports iterating over a range of entries and going to the previous or next
+entry in a cache-efficient manner. The tree can also be put into an RCU-safe
+mode of operation which allows reading and writing concurrently. Writers must
+synchronize on a lock, which can be the default spinlock, or the user can set
+the lock to an external lock of a different type.
+
+The Maple Tree maintains a small memory footprint and was designed to use
+modern processor cache efficiently. The majority of the users will be able to
+use the normal API. An :ref:`maple-tree-advanced-api` exists for more complex
+scenarios. The most important usage of the Maple Tree is the tracking of the
+virtual memory areas.
+
+The Maple Tree can store values between ``0`` and ``ULONG_MAX``. The Maple
+Tree reserves values with the bottom two bits set to '10' which are below 4096
+(ie 2, 6, 10 .. 4094) for internal use. If the entries may use reserved
+entries then the users can convert the entries using xa_mk_value() and convert
+them back by calling xa_to_value(). If the user needs to use a reserved
+value, then the user can convert the value when using the
+:ref:`maple-tree-advanced-api`, but are blocked by the normal API.
+
+The Maple Tree can also be configured to support searching for a gap of a given
+size (or larger).
+
+Pre-allocating of nodes is also supported using the
+:ref:`maple-tree-advanced-api`. This is useful for users who must guarantee a
+successful store operation within a given
+code segment when allocating cannot be done. Allocations of nodes are
+relatively small at around 256 bytes.
+
+.. _maple-tree-normal-api:
+
+Normal API
+==========
+
+Start by initialising a maple tree, either with DEFINE_MTREE() for statically
+allocated maple trees or mt_init() for dynamically allocated ones. A
+freshly-initialised maple tree contains a ``NULL`` pointer for the range ``0``
+- ``ULONG_MAX``. There are currently two types of maple trees supported: the
+allocation tree and the regular tree. The regular tree has a higher branching
+factor for internal nodes. The allocation tree has a lower branching factor
+but allows the user to search for a gap of a given size or larger from either
+``0`` upwards or ``ULONG_MAX`` down. An allocation tree can be used by
+passing in the ``MT_FLAGS_ALLOC_RANGE`` flag when initialising the tree.
+
+You can then set entries using mtree_store() or mtree_store_range().
+mtree_store() will overwrite any entry with the new entry and return 0 on
+success or an error code otherwise. mtree_store_range() works in the same way
+but takes a range. mtree_load() is used to retrieve the entry stored at a
+given index. You can use mtree_erase() to erase an entire range by only
+knowing one value within that range, or mtree_store() call with an entry of
+NULL may be used to partially erase a range or many ranges at once.
+
+If you want to only store a new entry to a range (or index) if that range is
+currently ``NULL``, you can use mtree_insert_range() or mtree_insert() which
+return -EEXIST if the range is not empty.
+
+You can search for an entry from an index upwards by using mt_find().
+
+You can walk each entry within a range by calling mt_for_each(). You must
+provide a temporary variable to store a cursor. If you want to walk each
+element of the tree then ``0`` and ``ULONG_MAX`` may be used as the range. If
+the caller is going to hold the lock for the duration of the walk then it is
+worth looking at the mas_for_each() API in the :ref:`maple-tree-advanced-api`
+section.
+
+Sometimes it is necessary to ensure the next call to store to a maple tree does
+not allocate memory, please see :ref:`maple-tree-advanced-api` for this use case.
+
+Finally, you can remove all entries from a maple tree by calling
+mtree_destroy(). If the maple tree entries are pointers, you may wish to free
+the entries first.
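Pulling the normal API together, a minimal sketch (not part of the patch; a toy module that stores an xa_mk_value() over a range)::

    #include <linux/init.h>
    #include <linux/maple_tree.h>
    #include <linux/module.h>
    #include <linux/xarray.h>

    static DEFINE_MTREE(mt);        /* statically initialised, all NULL */

    static int __init mtree_demo_init(void)
    {
            unsigned long index = 0;
            void *entry;
            int ret;

            /* Associate the whole range 12-127 with one entry. */
            ret = mtree_store_range(&mt, 12, 127, xa_mk_value(42), GFP_KERNEL);
            if (ret)
                    return ret;

            /* Any index inside the range finds that entry. */
            entry = mtree_load(&mt, 100);
            pr_info("loaded %lu\n", xa_to_value(entry));    /* prints 42 */

            /* Walk every non-NULL entry; index tracks the cursor. */
            mt_for_each(&mt, entry, index, ULONG_MAX)
                    pr_info("entry found at index %lu\n", index);

            mtree_destroy(&mt);
            return 0;
    }
    module_init(mtree_demo_init);
    MODULE_LICENSE("GPL");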
+Allocating Nodes
+----------------
+
+The allocations are handled by the internal tree code. See
+:ref:`maple-tree-advanced-alloc` for other options.
+
+Locking
+-------
+
+You do not have to worry about locking. See :ref:`maple-tree-advanced-locks`
+for other options.
+
+The Maple Tree uses RCU and an internal spinlock to synchronise access:
+
+Takes RCU read lock:
+ * mtree_load()
+ * mt_find()
+ * mt_for_each()
+ * mt_next()
+ * mt_prev()
+
+Takes ma_lock internally:
+ * mtree_store()
+ * mtree_store_range()
+ * mtree_insert()
+ * mtree_insert_range()
+ * mtree_erase()
+ * mtree_destroy()
+ * mt_set_in_rcu()
+ * mt_clear_in_rcu()
+
+If you want to take advantage of the internal lock to protect the data
+structures that you are storing in the Maple Tree, you can call mtree_lock()
+before calling mtree_load(), then take a reference count on the object you
+have found before calling mtree_unlock(). This will prevent stores from
+removing the object from the tree between looking up the object and
+incrementing the refcount. You can also use RCU to avoid dereferencing
+freed memory, but an explanation of that is beyond the scope of this
+document.
+
+.. _maple-tree-advanced-api:
+
+Advanced API
+============
+
+The advanced API offers more flexibility and better performance at the
+cost of an interface which can be harder to use and has fewer safeguards.
+You must take care of your own locking while using the advanced API.
+You can use the ma_lock, RCU or an external lock for protection.
+You can mix advanced and normal operations on the same array, as long
+as the locking is compatible. The :ref:`maple-tree-normal-api` is implemented
+in terms of the advanced API.
+
+The advanced API is based around the ma_state, this is where the 'mas'
+prefix originates. The ma_state struct keeps track of tree operations to make
+life easier for both internal and external tree users.
+
+Initialising the maple tree is the same as in the :ref:`maple-tree-normal-api`.
+Please see above.
+
+The maple state keeps track of the range start and end in mas->index and
+mas->last, respectively.
+
+mas_walk() will walk the tree to the location of mas->index and set the
+mas->index and mas->last according to the range for the entry.
+
+You can set entries using mas_store(). mas_store() will overwrite any entry
+with the new entry and return the first existing entry that is overwritten.
+The range is passed in as members of the maple state: index and last.
+
+You can use mas_erase() to erase an entire range by setting index and
+last of the maple state to the desired range to erase. This will erase
+the first range that is found in that range, set the maple state index
+and last as the range that was erased and return the entry that existed
+at that location.
+
+You can walk each entry within a range by using mas_for_each(). If you want
+to walk each element of the tree then ``0`` and ``ULONG_MAX`` may be used as
+the range. If the lock needs to be periodically dropped, see the locking
+section mas_pause().
+
+Using a maple state allows mas_next() and mas_prev() to function as if the
+tree was a linked list. With such a high branching factor the amortized
+performance penalty is outweighed by cache optimization. mas_next() will
+return the next entry which occurs after the entry at index. mas_prev()
+will return the previous entry which occurs before the entry at index.
+
+mas_find() will find the first entry which exists at or above index on
+the first call, and the next entry from every subsequent call.
+
+mas_find_rev() will find the first entry which exists at or below the last on
+the first call, and the previous entry from every subsequent call.
+
+If the user needs to yield the lock during an operation, then the maple state
+must be paused using mas_pause().
+
+There are a few extra interfaces provided when using an allocation tree.
+If you wish to search for a gap within a range, then mas_empty_area()
+or mas_empty_area_rev() can be used. mas_empty_area() searches for a gap
+starting at the lowest index given up to the maximum of the range.
+mas_empty_area_rev() searches for a gap starting at the highest index given
+and continues downward to the lower bound of the range.
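For illustration, an advanced-API read pass over the same tree as the sketch above (under RCU rather than the internal spinlock)::

    MA_STATE(mas, &mt, 0, 0);       /* cursor starting at index 0 */
    void *entry;

    rcu_read_lock();
    mas_for_each(&mas, entry, ULONG_MAX)
            pr_info("range [%lu, %lu]\n", mas.index, mas.last);
    rcu_read_unlock();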
+.. _maple-tree-advanced-alloc:
+
+Advanced Allocating Nodes
+-------------------------
+
+Allocations are usually handled internally to the tree, however if allocations
+need to occur before a write occurs then calling mas_expected_entries() will
+allocate the worst-case number of needed nodes to insert the provided number of
+ranges. This also causes the tree to enter mass insertion mode. Once
+insertions are complete calling mas_destroy() on the maple state will free the
+unused allocations.
+
+.. _maple-tree-advanced-locks:
+
+Advanced Locking
+----------------
+
+The maple tree uses a spinlock by default, but external locks can be used for
+tree updates as well. To use an external lock, the tree must be initialized
+with the ``MT_FLAGS_LOCK_EXTERN`` flag; this is usually done with the
+MTREE_INIT_EXT() #define, which takes an external lock as an argument.
+
+Functions and structures
+========================
+
+.. kernel-doc:: include/linux/maple_tree.h
+.. kernel-doc:: lib/maple_tree.c
Documentation/core-api/mm-api.rst
@@ -19,9 +19,6 @@ User Space Memory Access
 Memory Allocation Controls
 ==========================

-.. kernel-doc:: include/linux/gfp.h
-   :internal:
-
 .. kernel-doc:: include/linux/gfp_types.h
    :doc: Page mobility and placement hints
Documentation/dev-tools/index.rst
@@ -24,6 +24,7 @@ Documentation/dev-tools/testing-overview.rst
    kcov
    gcov
    kasan
+   kmsan
    ubsan
    kmemleak
    kcsan
Documentation/dev-tools/kasan.rst
@@ -111,9 +111,17 @@ parameter can be used to control panic and reporting behaviour:
 report or also panic the kernel (default: ``report``). The panic happens even
 if ``kasan_multi_shot`` is enabled.

-Hardware Tag-Based KASAN mode (see the section about various modes below) is
-intended for use in production as a security mitigation. Therefore, it supports
-additional boot parameters that allow disabling KASAN or controlling features:
+Software and Hardware Tag-Based KASAN modes (see the section about various
+modes below) support altering stack trace collection behavior:
+
+- ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack
+  traces collection (default: ``on``).
+- ``kasan.stack_ring_size=<number of entries>`` specifies the number of entries
+  in the stack ring (default: ``32768``).
+
+Hardware Tag-Based KASAN mode is intended for use in production as a security
+mitigation. Therefore, it supports additional boot parameters that allow
+disabling KASAN altogether or controlling its features:

 - ``kasan=off`` or ``=on`` controls whether KASAN is enabled (default: ``on``).
@@ -132,9 +140,6 @@ additional boot parameters that allow disabling KASAN or controlling features:
 - ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc
   allocations (default: ``on``).

-- ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack
-  traces collection (default: ``on``).
-
 Error reports
 ~~~~~~~~~~~~~
Some files were not shown because too many files have changed in this diff.