Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
   linux-next for a couple of months without, to my knowledge, any
   negative reports (or any positive ones, come to that).

 - Also the Maple Tree from Liam Howlett. An overlapping range-based
   tree for vmas. It it apparently slightly more efficient in its own
   right, but is mainly targeted at enabling work to reduce mmap_lock
   contention.

   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.

   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   at [1]. This has yet to be addressed due to Liam's unfortunately
   timed vacation. He is now back and we'll get this fixed up.

 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
   clang-generated instrumentation to detect used-unintialized bugs down
   to the single bit level.

   KMSAN keeps finding bugs. New ones, as well as the legacy ones.

 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.

 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
   support file/shmem-backed pages.

 - userfaultfd updates from Axel Rasmussen

 - zsmalloc cleanups from Alexey Romanov

 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
   memory-failure

 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.

 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.

 - memcg cleanups from Kairui Song.

 - memcg fixes and cleanups from Johannes Weiner.

 - Vishal Moola provides more folio conversions

 - Zhang Yi removed ll_rw_block() :(

 - migration enhancements from Peter Xu

 - migration error-path bugfixes from Huang Ying

 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths. For optimizations by PMEM drivers, DRM
   drivers, etc.

 - vma merging improvements from Jakub Matěn.

 - NUMA hinting cleanups from David Hildenbrand.

 - xu xin added aditional userspace visibility into KSM merging
   activity.

 - THP & KSM code consolidation from Qi Zheng.

 - more folio work from Matthew Wilcox.

 - KASAN updates from Andrey Konovalov.

 - DAMON cleanups from Kaixu Xia.

 - DAMON work from SeongJae Park: fixes, cleanups.

 - hugetlb sysfs cleanups from Muchun Song.

 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.

Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]

* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
  hugetlb: allocate vma lock for all sharable vmas
  hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
  hugetlb: fix vma lock handling during split vma and range unmapping
  mglru: mm/vmscan.c: fix imprecise comments
  mm/mglru: don't sync disk for each aging cycle
  mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
  mm: memcontrol: use do_memsw_account() in a few more places
  mm: memcontrol: deprecate swapaccounting=0 mode
  mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
  mm/secretmem: remove reduntant return value
  mm/hugetlb: add available_huge_pages() func
  mm: remove unused inline functions from include/linux/mm_inline.h
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: dedup THP helpers
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  ...
This commit is contained in:
Linus Torvalds
2022-10-10 17:53:04 -07:00
409 changed files with 65691 additions and 7933 deletions

View File

@@ -0,0 +1,25 @@
What: /sys/devices/virtual/memory_tiering/
Date: August 2022
Contact: Linux memory management mailing list <linux-mm@kvack.org>
Description: A collection of all the memory tiers allocated.
Individual memory tier details are contained in subdirectories
named by the abstract distance of the memory tier.
/sys/devices/virtual/memory_tiering/memory_tierN/
What: /sys/devices/virtual/memory_tiering/memory_tierN/
/sys/devices/virtual/memory_tiering/memory_tierN/nodes
Date: August 2022
Contact: Linux memory management mailing list <linux-mm@kvack.org>
Description: Directory with details of a specific memory tier
This is the directory containing information about a particular
memory tier, memtierN, where N is derived based on abstract distance.
A smaller value of N implies a higher (faster) memory tier in the
hierarchy.
nodes: NUMA nodes that are part of this memory tier.

View File

@@ -13,7 +13,7 @@ a) waiting for a CPU (while being runnable)
b) completion of synchronous block I/O initiated by the task
c) swapping in pages
d) memory reclaim
e) thrashing page cache
e) thrashing
f) direct compact
g) write-protect copy

View File

@@ -299,7 +299,7 @@ Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
lruvec->lru_lock; PG_lru bit of page->flags is cleared before
isolating a page from its LRU under lruvec->lru_lock.
2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
2.7 Kernel Memory Extension
-----------------------------------------------
With the Kernel memory extension, the Memory Controller is able to limit
@@ -386,8 +386,6 @@ U != 0, K >= U:
a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG
c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
-------------------------------------------------------------------

View File

@@ -1469,6 +1469,14 @@
Permit 'security.evm' to be updated regardless of
current integrity status.
early_page_ext [KNL] Enforces page_ext initialization to earlier
stages so cover more early boot allocations.
Please note that as side effect some optimizations
might be disabled to achieve that (e.g. parallelized
memory initialization is disabled) so the boot process
might take longer, especially on systems with a lot of
memory. Available with CONFIG_PAGE_EXTENSION=y.
failslab=
fail_usercopy=
fail_page_alloc=
@@ -6041,12 +6049,6 @@
This parameter controls use of the Protected
Execution Facility on pSeries.
swapaccount= [KNL]
Format: [0|1]
Enable accounting of swap in memory resource
controller if no parameter or 1 is given or disable
it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
swiotlb= [ARM,IA-64,PPC,MIPS,X86]
Format: { <int> [,<int>] | force | noforce }
<int> -- Number of I/O TLB slabs

View File

@@ -5,10 +5,10 @@ CMA Debugfs Interface
The CMA debugfs interface is useful to retrieve basic information out of the
different CMA areas and to test allocation/release in each of the areas.
Each CMA zone represents a directory under <debugfs>/cma/, indexed by the
kernel's CMA index. So the first CMA zone would be:
Each CMA area represents a directory under <debugfs>/cma/, represented by
its CMA name like below:
<debugfs>/cma/cma-0
<debugfs>/cma/<cma_name>
The structure of the files created under that directory is as follows:
@@ -18,8 +18,8 @@ The structure of the files created under that directory is as follows:
- [RO] bitmap: The bitmap of page states in the zone.
- [WO] alloc: Allocate N pages from that CMA area. For example::
echo 5 > <debugfs>/cma/cma-2/alloc
echo 5 > <debugfs>/cma/<cma_name>/alloc
would try to allocate 5 pages from the cma-2 area.
would try to allocate 5 pages from the 'cma_name' area.
- [WO] free: Free N pages from that CMA area, similar to the above.

View File

@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
========================
Monitoring Data Accesses
========================
==========================
DAMON: Data Access MONitor
==========================
:doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring.
Using DAMON, users can analyze the memory access patterns of their systems and

View File

@@ -29,16 +29,9 @@ called DAMON Operator (DAMO). It is available at
https://github.com/awslabs/damo. The examples below assume that ``damo`` is on
your ``$PATH``. It's not mandatory, though.
Because DAMO is using the debugfs interface (refer to :doc:`usage` for the
detail) of DAMON, you should ensure debugfs is mounted. Mount it manually as
below::
# mount -t debugfs none /sys/kernel/debug/
or append the following line to your ``/etc/fstab`` file so that your system
can automatically mount debugfs upon booting::
debugfs /sys/kernel/debug debugfs defaults 0 0
Because DAMO is using the sysfs interface (refer to :doc:`usage` for the
detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is
mounted.
Recording Data Access Patterns

View File

@@ -393,6 +393,11 @@ the files as above. Above is only for an example.
debugfs Interface
=================
.. note::
DAMON debugfs interface will be removed after next LTS kernel is released, so
users should move to the :ref:`sysfs interface <sysfs_interface>`.
DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``,
``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and
``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``.

View File

@@ -32,6 +32,7 @@ the Linux memory management.
idle_page_tracking
ksm
memory-hotplug
multigen_lru
nommu-mmap
numa_memory_policy
numaperf

View File

@@ -184,6 +184,42 @@ The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
be increased accordingly.
Monitoring KSM profit
=====================
KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate a number of rmap_items to
save each scanned page's brief rmap information. Some of these pages may
be merged, but some may not be abled to be merged after being checked
several times, which are unprofitable memory consumed.
1) How to determine whether KSM save memory or consume memory in system-wide
range? Here is a simple approximate calculation for reference::
general_profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
sizeof(rmap_item);
where all_rmap_items can be easily obtained by summing ``pages_sharing``,
``pages_shared``, ``pages_unshared`` and ``pages_volatile``.
2) The KSM profit inner a single process can be similarly obtained by the
following approximate calculation::
process_profit =~ ksm_merging_pages * sizeof(page) -
ksm_rmap_items * sizeof(rmap_item).
where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``.
From the perspective of application, a high ratio of ``ksm_rmap_items`` to
``ksm_merging_pages`` means a bad madvise-applied policy, so developers or
administrators have to rethink how to change madvise policy. Giving an example
for reference, a page's size is usually 4K, and the rmap_item's size is
separately 32B on 32-bit CPU architecture and 64B on 64-bit CPU architecture.
so if the ``ksm_rmap_items/ksm_merging_pages`` ratio exceeds 64 on 64-bit CPU
or exceeds 128 on 32-bit CPU, then the app's madvise policy should be dropped,
because the ksm profit is approximately zero or negative.
Monitoring KSM events
=====================

View File

@@ -0,0 +1,162 @@
.. SPDX-License-Identifier: GPL-2.0
=============
Multi-Gen LRU
=============
The multi-gen LRU is an alternative LRU implementation that optimizes
page reclaim and improves performance under memory pressure. Page
reclaim decides the kernel's caching policy and ability to overcommit
memory. It directly impacts the kswapd CPU usage and RAM efficiency.
Quick start
===========
Build the kernel with the following configurations.
* ``CONFIG_LRU_GEN=y``
* ``CONFIG_LRU_GEN_ENABLED=y``
All set!
Runtime options
===============
``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
following subsections.
Kill switch
-----------
``enabled`` accepts different values to enable or disable the
following components. Its default value depends on
``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
unless some of them have unforeseen side effects. Writing to
``enabled`` has no effect when a component is not supported by the
hardware, and valid values will be accepted even when the main switch
is off.
====== ===============================================================
Values Components
====== ===============================================================
0x0001 The main switch for the multi-gen LRU.
0x0002 Clearing the accessed bit in leaf page table entries in large
batches, when MMU sets it (e.g., on x86). This behavior can
theoretically worsen lock contention (mmap_lock). If it is
disabled, the multi-gen LRU will suffer a minor performance
degradation for workloads that contiguously map hot pages,
whose accessed bits can be otherwise cleared by fewer larger
batches.
0x0004 Clearing the accessed bit in non-leaf page table entries as
well, when MMU sets it (e.g., on x86). This behavior was not
verified on x86 varieties other than Intel and AMD. If it is
disabled, the multi-gen LRU will suffer a negligible
performance degradation.
[yYnN] Apply to all the components above.
====== ===============================================================
E.g.,
::
echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0007
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005
Thrashing prevention
--------------------
Personal computers are more sensitive to thrashing because it can
cause janks (lags when rendering UI) and negatively impact user
experience. The multi-gen LRU offers thrashing prevention to the
majority of laptop and desktop users who do not have ``oomd``.
Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
``N`` milliseconds from getting evicted. The OOM killer is triggered
if this working set cannot be kept in memory. In other words, this
option works as an adjustable pressure relief valve, and when open, it
terminates applications that are hopefully not being used.
Based on the average human detectable lag (~100ms), ``N=1000`` usually
eliminates intolerable janks due to thrashing. Larger values like
``N=3000`` make janks less noticeable at the risk of premature OOM
kills.
The default value ``0`` means disabled.
Experimental features
=====================
``/sys/kernel/debug/lru_gen`` accepts commands described in the
following subsections. Multiple command lines are supported, so does
concatenation with delimiters ``,`` and ``;``.
``/sys/kernel/debug/lru_gen_full`` provides additional stats for
debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
evicted generations in this file.
Working set estimation
----------------------
Working set estimation measures how much memory an application needs
in a given time interval, and it is usually done with little impact on
the performance of the application. E.g., data centers want to
optimize job scheduling (bin packing) to improve memory utilizations.
When a new job comes in, the job scheduler needs to find out whether
each server it manages can allocate a certain amount of memory for
this new job before it can pick a candidate. To do so, the job
scheduler needs to estimate the working sets of the existing jobs.
When it is read, ``lru_gen`` returns a histogram of numbers of pages
accessed over different time intervals for each memcg and node.
``MAX_NR_GENS`` decides the number of bins for each histogram. The
histograms are noncumulative.
::
memcg memcg_id memcg_path
node node_id
min_gen_nr age_in_ms nr_anon_pages nr_file_pages
...
max_gen_nr age_in_ms nr_anon_pages nr_file_pages
Each bin contains an estimated number of pages that have been accessed
within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
the former is the largest and that of the latter is the smallest.
Users can write the following command to ``lru_gen`` to create a new
generation ``max_gen_nr+1``:
``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``
``can_swap`` defaults to the swap setting and, if it is set to ``1``,
it forces the scan of anon pages when swap is off, and vice versa.
``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
employs heuristics to reduce the overhead, which is likely to reduce
the coverage as well.
A typical use case is that a job scheduler runs this command at a
certain time interval to create new generations, and it ranks the
servers it manages based on the sizes of their cold pages defined by
this time interval.
Proactive reclaim
-----------------
Proactive reclaim induces page reclaim when there is no memory
pressure. It usually targets cold pages only. E.g., when a new job
comes in, the job scheduler wants to proactively reclaim cold pages on
the server it selected, to improve the chance of successfully landing
this new job.
Users can write the following command to ``lru_gen`` to evict
generations less than or equal to ``min_gen_nr``.
``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``
``min_gen_nr`` should be less than ``max_gen_nr-1``, since
``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
the active list) and therefore cannot be evicted. ``swappiness``
overrides the default value in ``/proc/sys/vm/swappiness``.
``nr_to_reclaim`` limits the number of pages to evict.
A typical use case is that a job scheduler runs this command before it
tries to land a new job on a server. If it fails to materialize enough
cold pages because of the overestimation, it retries on the next
server according to the ranking result obtained from the working set
estimation step. This less forceful approach limits the impacts on the
existing jobs.

View File

@@ -191,7 +191,14 @@ allocation failure to throttle the next allocation attempt::
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
The khugepaged progress can be seen in the number of pages collapsed::
The khugepaged progress can be seen in the number of pages collapsed (note
that this counter may not be an exact count of the number of pages
collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
one 2M hugepage. Each may happen independently, or together, depending on
the type of memory and the failures that occur. As such, this value should
be interpreted roughly as a sign of progress, and counters in /proc/vmstat
consulted for more accurate accounting)::
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
@@ -366,10 +373,9 @@ thp_split_pmd
page table entry.
thp_zero_page_alloc
is incremented every time a huge zero page is
successfully allocated. It includes allocations which where
dropped due race with other allocation. Note, it doesn't count
every map of the huge zero page, only its allocation.
is incremented every time a huge zero page used for thp is
successfully allocated. Note, it doesn't count every map of
the huge zero page, only its allocation.
thp_zero_page_alloc_failed
is incremented if kernel fails to allocate

View File

@@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick.
Design
======
Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
Userspace creates a new userfaultfd, initializes it, and registers one or more
regions of virtual memory with it. Then, any page faults which occur within the
region(s) result in a message being delivered to the userfaultfd, notifying
userspace of the fault.
The ``userfaultfd`` (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:
@@ -34,12 +37,11 @@ The real advantage of userfaults if compared to regular virtual memory
management of mremap/mprotect is that the userfaults in all their
operations never involve heavyweight structures like vmas (in fact the
``userfaultfd`` runtime load never takes the mmap_lock for writing).
Vmas are not suitable for page- (or hugepage) granular fault tracking
when dealing with virtual address spaces that could span
Terabytes. Too many vmas would be needed for that.
The ``userfaultfd`` once opened by invoking the syscall, can also be
The ``userfaultfd``, once created, can also be
passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware about what is going on
@@ -50,6 +52,39 @@ is a corner case that would currently return ``-EBUSY``).
API
===
Creating a userfaultfd
----------------------
There are two ways to create a new userfaultfd, each of which provide ways to
restrict access to this functionality (since historically userfaultfds which
handle kernel page faults have been a useful tool for exploiting the kernel).
The first way, supported since userfaultfd was introduced, is the
userfaultfd(2) syscall. Access to this is controlled in several ways:
- Any user can always create a userfaultfd which traps userspace page faults
only. Such a userfaultfd can be created using the userfaultfd(2) syscall
with the flag UFFD_USER_MODE_ONLY.
- In order to also trap kernel page faults for the address space, either the
process needs the CAP_SYS_PTRACE capability, or the system must have
vm.unprivileged_userfaultfd set to 1. By default, vm.unprivileged_userfaultfd
is set to 0.
The second way, added to the kernel more recently, is by opening
/dev/userfaultfd and issuing a USERFAULTFD_IOC_NEW ioctl to it. This method
yields equivalent userfaultfds to the userfaultfd(2) syscall.
Unlike userfaultfd(2), access to /dev/userfaultfd is controlled via normal
filesystem permissions (user/group/mode), which gives fine grained access to
userfaultfd specifically, without also granting other unrelated privileges at
the same time (as e.g. granting CAP_SYS_PTRACE would do). Users who have access
to /dev/userfaultfd can always create userfaultfds that trap kernel page faults;
vm.unprivileged_userfaultfd is not considered.
Initializing a userfaultfd
--------------------------
When first opened the ``userfaultfd`` must be enabled invoking the
``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
a later API version) which will specify the ``read/POLLIN`` protocol

View File

@@ -635,6 +635,17 @@ different types of memory (represented as different NUMA nodes) to
place the hot pages in the fast memory. This is implemented based on
unmapping and page fault too.
numa_balancing_promote_rate_limit_MBps
======================================
Too high promotion/demotion throughput between different memory types
may hurt application latency. This can be used to rate limit the
promotion throughput. The per-node max promotion throughput in MB/s
will be limited to be no more than the set value.
A rule of thumb is to set this to less than 1/10 of the PMEM node
write bandwidth.
oops_all_cpu_backtrace
======================

View File

@@ -926,6 +926,9 @@ calls without any restrictions.
The default value is 0.
Another way to control permissions for userfaultfd is to use
/dev/userfaultfd instead of userfaultfd(2). See
Documentation/admin-guide/mm/userfaultfd.rst.
user_reserve_kbytes
===================

View File

@@ -37,6 +37,7 @@ Library functionality that is used throughout the kernel.
kref
assoc_array
xarray
maple_tree
idr
circular-buffers
rbtree

View File

@@ -0,0 +1,217 @@
.. SPDX-License-Identifier: GPL-2.0+
==========
Maple Tree
==========
:Author: Liam R. Howlett
Overview
========
The Maple Tree is a B-Tree data type which is optimized for storing
non-overlapping ranges, including ranges of size 1. The tree was designed to
be simple to use and does not require a user written search method. It
supports iterating over a range of entries and going to the previous or next
entry in a cache-efficient manner. The tree can also be put into an RCU-safe
mode of operation which allows reading and writing concurrently. Writers must
synchronize on a lock, which can be the default spinlock, or the user can set
the lock to an external lock of a different type.
The Maple Tree maintains a small memory footprint and was designed to use
modern processor cache efficiently. The majority of the users will be able to
use the normal API. An :ref:`maple-tree-advanced-api` exists for more complex
scenarios. The most important usage of the Maple Tree is the tracking of the
virtual memory areas.
The Maple Tree can store values between ``0`` and ``ULONG_MAX``. The Maple
Tree reserves values with the bottom two bits set to '10' which are below 4096
(ie 2, 6, 10 .. 4094) for internal use. If the entries may use reserved
entries then the users can convert the entries using xa_mk_value() and convert
them back by calling xa_to_value(). If the user needs to use a reserved
value, then the user can convert the value when using the
:ref:`maple-tree-advanced-api`, but are blocked by the normal API.
The Maple Tree can also be configured to support searching for a gap of a given
size (or larger).
Pre-allocating of nodes is also supported using the
:ref:`maple-tree-advanced-api`. This is useful for users who must guarantee a
successful store operation within a given
code segment when allocating cannot be done. Allocations of nodes are
relatively small at around 256 bytes.
.. _maple-tree-normal-api:
Normal API
==========
Start by initialising a maple tree, either with DEFINE_MTREE() for statically
allocated maple trees or mt_init() for dynamically allocated ones. A
freshly-initialised maple tree contains a ``NULL`` pointer for the range ``0``
- ``ULONG_MAX``. There are currently two types of maple trees supported: the
allocation tree and the regular tree. The regular tree has a higher branching
factor for internal nodes. The allocation tree has a lower branching factor
but allows the user to search for a gap of a given size or larger from either
``0`` upwards or ``ULONG_MAX`` down. An allocation tree can be used by
passing in the ``MT_FLAGS_ALLOC_RANGE`` flag when initialising the tree.
You can then set entries using mtree_store() or mtree_store_range().
mtree_store() will overwrite any entry with the new entry and return 0 on
success or an error code otherwise. mtree_store_range() works in the same way
but takes a range. mtree_load() is used to retrieve the entry stored at a
given index. You can use mtree_erase() to erase an entire range by only
knowing one value within that range, or mtree_store() call with an entry of
NULL may be used to partially erase a range or many ranges at once.
If you want to only store a new entry to a range (or index) if that range is
currently ``NULL``, you can use mtree_insert_range() or mtree_insert() which
return -EEXIST if the range is not empty.
You can search for an entry from an index upwards by using mt_find().
You can walk each entry within a range by calling mt_for_each(). You must
provide a temporary variable to store a cursor. If you want to walk each
element of the tree then ``0`` and ``ULONG_MAX`` may be used as the range. If
the caller is going to hold the lock for the duration of the walk then it is
worth looking at the mas_for_each() API in the :ref:`maple-tree-advanced-api`
section.
Sometimes it is necessary to ensure the next call to store to a maple tree does
not allocate memory, please see :ref:`maple-tree-advanced-api` for this use case.
Finally, you can remove all entries from a maple tree by calling
mtree_destroy(). If the maple tree entries are pointers, you may wish to free
the entries first.
Allocating Nodes
----------------
The allocations are handled by the internal tree code. See
:ref:`maple-tree-advanced-alloc` for other options.
Locking
-------
You do not have to worry about locking. See :ref:`maple-tree-advanced-locks`
for other options.
The Maple Tree uses RCU and an internal spinlock to synchronise access:
Takes RCU read lock:
* mtree_load()
* mt_find()
* mt_for_each()
* mt_next()
* mt_prev()
Takes ma_lock internally:
* mtree_store()
* mtree_store_range()
* mtree_insert()
* mtree_insert_range()
* mtree_erase()
* mtree_destroy()
* mt_set_in_rcu()
* mt_clear_in_rcu()
If you want to take advantage of the internal lock to protect the data
structures that you are storing in the Maple Tree, you can call mtree_lock()
before calling mtree_load(), then take a reference count on the object you
have found before calling mtree_unlock(). This will prevent stores from
removing the object from the tree between looking up the object and
incrementing the refcount. You can also use RCU to avoid dereferencing
freed memory, but an explanation of that is beyond the scope of this
document.
.. _maple-tree-advanced-api:
Advanced API
============
The advanced API offers more flexibility and better performance at the
cost of an interface which can be harder to use and has fewer safeguards.
You must take care of your own locking while using the advanced API.
You can use the ma_lock, RCU or an external lock for protection.
You can mix advanced and normal operations on the same array, as long
as the locking is compatible. The :ref:`maple-tree-normal-api` is implemented
in terms of the advanced API.
The advanced API is based around the ma_state, this is where the 'mas'
prefix originates. The ma_state struct keeps track of tree operations to make
life easier for both internal and external tree users.
Initialising the maple tree is the same as in the :ref:`maple-tree-normal-api`.
Please see above.
The maple state keeps track of the range start and end in mas->index and
mas->last, respectively.
mas_walk() will walk the tree to the location of mas->index and set the
mas->index and mas->last according to the range for the entry.
You can set entries using mas_store(). mas_store() will overwrite any entry
with the new entry and return the first existing entry that is overwritten.
The range is passed in as members of the maple state: index and last.
You can use mas_erase() to erase an entire range by setting index and
last of the maple state to the desired range to erase. This will erase
the first range that is found in that range, set the maple state index
and last as the range that was erased and return the entry that existed
at that location.
You can walk each entry within a range by using mas_for_each(). If you want
to walk each element of the tree then ``0`` and ``ULONG_MAX`` may be used as
the range. If the lock needs to be periodically dropped, see the locking
section mas_pause().
Using a maple state allows mas_next() and mas_prev() to function as if the
tree was a linked list. With such a high branching factor the amortized
performance penalty is outweighed by cache optimization. mas_next() will
return the next entry which occurs after the entry at index. mas_prev()
will return the previous entry which occurs before the entry at index.
mas_find() will find the first entry which exists at or above index on
the first call, and the next entry from every subsequent calls.
mas_find_rev() will find the fist entry which exists at or below the last on
the first call, and the previous entry from every subsequent calls.
If the user needs to yield the lock during an operation, then the maple state
must be paused using mas_pause().
There are a few extra interfaces provided when using an allocation tree.
If you wish to search for a gap within a range, then mas_empty_area()
or mas_empty_area_rev() can be used. mas_empty_area() searches for a gap
starting at the lowest index given up to the maximum of the range.
mas_empty_area_rev() searches for a gap starting at the highest index given
and continues downward to the lower bound of the range.
.. _maple-tree-advanced-alloc:
Advanced Allocating Nodes
-------------------------
Allocations are usually handled internally to the tree, however if allocations
need to occur before a write occurs then calling mas_expected_entries() will
allocate the worst-case number of needed nodes to insert the provided number of
ranges. This also causes the tree to enter mass insertion mode. Once
insertions are complete calling mas_destroy() on the maple state will free the
unused allocations.
.. _maple-tree-advanced-locks:
Advanced Locking
----------------
The maple tree uses a spinlock by default, but external locks can be used for
tree updates as well. To use an external lock, the tree must be initialized
with the ``MT_FLAGS_LOCK_EXTERN flag``, this is usually done with the
MTREE_INIT_EXT() #define, which takes an external lock as an argument.
Functions and structures
========================
.. kernel-doc:: include/linux/maple_tree.h
.. kernel-doc:: lib/maple_tree.c

View File

@@ -19,9 +19,6 @@ User Space Memory Access
Memory Allocation Controls
==========================
.. kernel-doc:: include/linux/gfp.h
:internal:
.. kernel-doc:: include/linux/gfp_types.h
:doc: Page mobility and placement hints

View File

@@ -24,6 +24,7 @@ Documentation/dev-tools/testing-overview.rst
kcov
gcov
kasan
kmsan
ubsan
kmemleak
kcsan

View File

@@ -111,9 +111,17 @@ parameter can be used to control panic and reporting behaviour:
report or also panic the kernel (default: ``report``). The panic happens even
if ``kasan_multi_shot`` is enabled.
Hardware Tag-Based KASAN mode (see the section about various modes below) is
intended for use in production as a security mitigation. Therefore, it supports
additional boot parameters that allow disabling KASAN or controlling features:
Software and Hardware Tag-Based KASAN modes (see the section about various
modes below) support altering stack trace collection behavior:
- ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack
traces collection (default: ``on``).
- ``kasan.stack_ring_size=<number of entries>`` specifies the number of entries
in the stack ring (default: ``32768``).
Hardware Tag-Based KASAN mode is intended for use in production as a security
mitigation. Therefore, it supports additional boot parameters that allow
disabling KASAN altogether or controlling its features:
- ``kasan=off`` or ``=on`` controls whether KASAN is enabled (default: ``on``).
@@ -132,9 +140,6 @@ additional boot parameters that allow disabling KASAN or controlling features:
- ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc
allocations (default: ``on``).
- ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack
traces collection (default: ``on``).
Error reports
~~~~~~~~~~~~~

Some files were not shown because too many files have changed in this diff Show More