Merge tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Almost all of MM here. A few things are still getting finished off,
  reviewed, etc.

   - Yang Shi has improved the behaviour of khugepaged collapsing of
     readonly file-backed transparent hugepages.

   - Johannes Weiner has arranged for zswap memory use to be tracked and
     managed on a per-cgroup basis.

- Muchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for
     runtime enablement of the recent huge page vmemmap optimization
     feature.

   - Baolin Wang contributes a series to fix some issues around hugetlb
     pagetable invalidation.

   - Zhenwei Pi has fixed some interactions between hwpoisoned pages and
     virtualization.

   - Tong Tiangen has enabled the use of the presently x86-only
     page_table_check debugging feature on arm64 and riscv.

   - David Vernet has done some fixup work on the memcg selftests.

   - Peter Xu has taught userfaultfd to handle write protection faults
     against shmem- and hugetlbfs-backed files.

   - More DAMON development from SeongJae Park - adding online tuning of
     the feature and support for monitoring of fixed virtual address
     ranges. Also easier discovery of which monitoring operations are
     available.

   - Nadav Amit has done some optimization of TLB flushing during
     mprotect().

   - Neil Brown continues to labor away at improving our swap-over-NFS
     support.

   - David Hildenbrand has some fixes to anon page COWing versus
     get_user_pages().

   - Peng Liu fixed some errors in the core hugetlb code.

   - Joao Martins has reduced the amount of memory consumed by
     device-dax's compound devmaps.

   - Some cleanups of the arch-specific pagemap code from Anshuman
     Khandual.

   - Muchun Song has found and fixed some errors in the TLB flushing of
     transparent hugepages.

   - Roman Gushchin has done more work on the memcg selftests.

  ... and, of course, many smaller fixes and cleanups. Notably, the
  customary million cleanup serieses from Miaohe Lin"

* tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits)
  mm: kfence: use PAGE_ALIGNED helper
  selftests: vm: add the "settings" file with timeout variable
  selftests: vm: add "test_hmm.sh" to TEST_FILES
  selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests
  selftests: vm: add migration to the .gitignore
  selftests/vm/pkeys: fix typo in comment
  ksm: fix typo in comment
  selftests: vm: add process_mrelease tests
  Revert "mm/vmscan: never demote for memcg reclaim"
  mm/kfence: print disabling or re-enabling message
  include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace"
  include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion"
  mm: fix a potential infinite loop in start_isolate_page_range()
  MAINTAINERS: add Muchun as co-maintainer for HugeTLB
  zram: fix Kconfig dependency warning
  mm/shmem: fix shmem folio swapoff hang
  cgroup: fix an error handling path in alloc_pagecache_max_30M()
  mm: damon: use HPAGE_PMD_SIZE
  tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
  nodemask.h: fix compilation error with GCC12
  ...
Merged by Linus Torvalds
2022-05-26 12:32:41 -07:00
240 changed files with 9223 additions and 4680 deletions


@@ -23,9 +23,10 @@ Date: Mar 2022
Contact: SeongJae Park <sj@kernel.org>
Description: Writing 'on' or 'off' to this file makes the kdamond starts or
stops, respectively. Reading the file returns the keywords
- based on the current status. Writing 'update_schemes_stats' to
- the file updates contents of schemes stats files of the
- kdamond.
+ based on the current status. Writing 'commit' to this file
+ makes the kdamond read the user inputs in the sysfs files
+ except 'state' again. Writing 'update_schemes_stats' to the
+ file updates contents of schemes stats files of the kdamond.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/pid
Date: Mar 2022
@@ -40,14 +41,24 @@ Description: Writing a number 'N' to this file creates the number of
directories for controlling each DAMON context named '0' to
'N-1' under the contexts/ directory.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/avail_operations
Date: Apr 2022
Contact: SeongJae Park <sj@kernel.org>
Description: Reading this file returns the available monitoring operations
sets on the currently running kernel.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/operations
Date: Mar 2022
Contact: SeongJae Park <sj@kernel.org>
Description: Writing a keyword for a monitoring operations set ('vaddr' for
- virtual address spaces monitoring, and 'paddr' for the physical
- address space monitoring) to this file makes the context to use
- the operations set. Reading the file returns the keyword for
- the operations set the context is set to use.
+ virtual address spaces monitoring, 'fvaddr' for fixed virtual
+ address ranges monitoring, and 'paddr' for the physical address
+ space monitoring) to this file makes the context use the
+ operations set. Reading the file returns the keyword for the
+ operations set the context is set to use.
+ Note that only the operations sets that are listed in the
+ 'avail_operations' file are valid inputs.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/intervals/sample_us
Date: Mar 2022


@@ -343,6 +343,11 @@ Admin can request writeback of those idle pages at right timing via::
With the command, zram will writeback idle pages from memory to the storage.
Additionally, if a user chooses to write back only huge and idle pages,
this can be accomplished with::
echo huge_idle > /sys/block/zramX/writeback
If an admin wants to write a specific page in the zram device to the backing
device, they could write a page index into the interface.
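For example, to request writeback of the page at index 1251 (the index value
here is purely illustrative)::

    echo "page_index=1251" > /sys/block/zramX/writeback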


@@ -1208,6 +1208,34 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
memory.reclaim
A write-only nested-keyed file which exists for all cgroups.
This is a simple interface to trigger memory reclaim in the
target cgroup.
This file accepts a single key, the number of bytes to reclaim.
No nested keys are currently supported.
Example::
echo "1G" > memory.reclaim
The interface can be later extended with nested keys to
configure the reclaim behavior. For example, specify the
type of memory to reclaim from (anon, file, ..).
Please note that the kernel can over- or under-reclaim from
the target cgroup. If fewer bytes are reclaimed than the
specified amount, -EAGAIN is returned.
memory.peak
A read-only single value file which exists on non-root
cgroups.
The max memory usage recorded for the cgroup and its
descendants since the creation of the cgroup.
memory.oom.group
A read-write single value file which exists on non-root
cgroups. The default value is "0".
@@ -1326,6 +1354,12 @@ PAGE_SIZE multiple when read back.
Amount of cached filesystem data that is swap-backed,
such as tmpfs, shm segments, shared anonymous mmap()s
zswap
Amount of memory consumed by the zswap compression backend.
zswapped
Amount of application memory swapped out to zswap.
file_mapped
Amount of cached filesystem data mapped with mmap()
@@ -1516,6 +1550,21 @@ PAGE_SIZE multiple when read back.
higher than the limit for an extended period of time. This
reduces the impact on the workload and memory management.
memory.zswap.current
A read-only single value file which exists on non-root
cgroups.
The total amount of memory consumed by the zswap compression
backend.
memory.zswap.max
A read-write single value file which exists on non-root
cgroups. The default is "max".
Zswap usage hard limit. If a cgroup's zswap pool reaches this
limit, it will refuse to take any more stores before existing
entries fault back in or are written out to disk.
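For example, to cap a cgroup's zswap pool at 100M (a sketch; the cgroup
name "foo" is hypothetical)::

    echo 100M > /sys/fs/cgroup/foo/memory.zswap.max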
memory.pressure
A read-only nested-keyed file.


@@ -1705,16 +1705,16 @@
boot-time allocation of gigantic hugepages is skipped.
hugetlb_free_vmemmap=
- [KNL] Requires CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+ [KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
enabled.
Allows heavy hugetlb users to free up some more
memory (7 * PAGE_SIZE for each 2MB hugetlb page).
- Format: { on | off (default) }
+ Format: { [oO][Nn]/Y/y/1 | [oO][Ff]/N/n/0 (default) }
- on: enable the feature
- off: disable the feature
+ [oO][Nn]/Y/y/1: enable the feature
+ [oO][Ff]/N/n/0: disable the feature
- Built with CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON=y,
+ Built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y,
the default is on.
This is not compatible with memory_hotplug.memmap_on_memory.
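For example, appending the following to the kernel command line enables the
feature at boot (an illustration of the accepted format)::

    hugetlb_free_vmemmap=on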


@@ -66,6 +66,17 @@ Setting it as ``N`` disables DAMON_RECLAIM. Note that DAMON_RECLAIM could do
no real monitoring and reclamation due to the watermarks-based activation
condition. Refer to below descriptions for the watermarks parameter for this.
commit_inputs
-------------
Make DAMON_RECLAIM read the input parameters again, except ``enabled``.
Input parameters that are updated while DAMON_RECLAIM is running are not
applied by default. Once this parameter is set as ``Y``, DAMON_RECLAIM reads
values of parameters except ``enabled`` again. Once the re-reading is done,
this parameter is set as ``N``. If invalid parameters are found during the
re-reading, DAMON_RECLAIM will be disabled.
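For example (a sketch, assuming the usual DAMON_RECLAIM module parameter
path)::

    echo Y > /sys/module/damon_reclaim/parameters/commit_inputs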
min_age
-------


@@ -68,7 +68,7 @@ comma (","). ::
│ kdamonds/nr_kdamonds
│ │ 0/state,pid
│ │ │ contexts/nr_contexts
- │ │ │ │ 0/operations
+ │ │ │ │ 0/avail_operations,operations
│ │ │ │ │ monitoring_attrs/
│ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
│ │ │ │ │ │ nr_regions/min,max
@@ -121,10 +121,11 @@ In each kdamond directory, two files (``state`` and ``pid``) and one directory
Reading ``state`` returns ``on`` if the kdamond is currently running, or
``off`` if it is not running. Writing ``on`` or ``off`` makes the kdamond be
- in the state. Writing ``update_schemes_stats`` to ``state`` file updates the
- contents of stats files for each DAMON-based operation scheme of the kdamond.
- For details of the stats, please refer to :ref:`stats section
- <sysfs_schemes_stats>`.
+ in the state. Writing ``commit`` to the ``state`` file makes kdamond read the
+ user inputs in the sysfs files except ``state`` file again. Writing
+ ``update_schemes_stats`` to ``state`` file updates the contents of stats files
+ for each DAMON-based operation scheme of the kdamond. For details of the
+ stats, please refer to :ref:`stats section <sysfs_schemes_stats>`.
If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
@@ -143,17 +144,28 @@ be written to the file.
contexts/<N>/
-------------
- In each context directory, one file (``operations``) and three directories
- (``monitoring_attrs``, ``targets``, and ``schemes``) exist.
+ In each context directory, two files (``avail_operations`` and ``operations``)
+ and three directories (``monitoring_attrs``, ``targets``, and ``schemes``)
+ exist.
DAMON supports multiple types of monitoring operations, including those for
- virtual address space and the physical address space. You can set and get what
- type of monitoring operations DAMON will use for the context by writing one of
- below keywords to, and reading from the file.
+ virtual address space and the physical address space. You can get the list of
+ available monitoring operations sets on the currently running kernel by reading
+ the ``avail_operations`` file. Based on the kernel configuration, the file will
+ list some or all of the below keywords.
- vaddr: Monitor virtual address spaces of specific processes
- fvaddr: Monitor fixed virtual address ranges
- paddr: Monitor the physical address space of the system
Please refer to :ref:`regions sysfs directory <sysfs_regions>` for detailed
differences between the operations sets in terms of the monitoring target
regions.
You can set and get what type of monitoring operations DAMON will use for the
context by writing one of the keywords listed in the ``avail_operations`` file
to, and reading from, the ``operations`` file, as in the example below.
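For example (an illustrative sketch; kdamond ``0`` and context ``0`` are
assumed to exist, and the output format is illustrative)::

    # cat /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/avail_operations
    vaddr
    fvaddr
    paddr
    # echo vaddr > /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/operations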
contexts/<N>/monitoring_attrs/
------------------------------
@@ -192,6 +204,8 @@ If you wrote ``vaddr`` to the ``contexts/<N>/operations``, each target should
be a process. You can specify the process to DAMON by writing the pid of the
process to the ``pid_target`` file.
.. _sysfs_regions:
targets/<N>/regions
-------------------
@@ -202,9 +216,10 @@ can be covered. However, users could want to set the initial monitoring region
to specific address ranges.
In contrast, DAMON do not automatically sets and updates the monitoring target
- regions when ``paddr`` monitoring operations set is being used (``paddr`` is
- written to the ``contexts/<N>/operations``). Therefore, users should set the
- monitoring target regions by themselves in the case.
+ regions when ``fvaddr`` or ``paddr`` monitoring operations sets are being used
+ (``fvaddr`` or ``paddr`` have been written to the ``contexts/<N>/operations``).
+ Therefore, users should set the monitoring target regions by themselves in
+ these cases.
For such cases, users can explicitly set the initial monitoring target regions
as they want, by writing proper values to the files under this directory.


@@ -164,7 +164,7 @@ default_hugepagesz
will all result in 256 2M huge pages being allocated. Valid default
huge page size is architecture dependent.
hugetlb_free_vmemmap
- When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set, this enables freeing
+ When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables optimizing
unused vmemmap pages associated with each HugeTLB page.
When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``


@@ -184,6 +184,24 @@ The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
be increased accordingly.
Monitoring KSM events
=====================
There are some counters in /proc/vmstat that may be used to monitor KSM events.
KSM might help save memory, but it is a tradeoff: one may suffer delay on KSM
COW or on swapping in a copy. Those events could help users evaluate whether
or how to use KSM. For example, if cow_ksm increases too fast, the user may
decrease the range of madvise(, , MADV_MERGEABLE).
cow_ksm
is incremented every time a KSM page triggers copy-on-write (COW):
when users try to write to a KSM page, we have to make a copy.
ksm_swpin_copy
is incremented every time a KSM page is copied when swapping in.
Note that a KSM page might be copied when swapping in because do_swap_page()
cannot do all the locking needed to reconstitute a cross-anon_vma KSM page.
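For example, both counters can be read with (output values illustrative)::

    # grep -E 'cow_ksm|ksm_swpin_copy' /proc/vmstat
    cow_ksm 0
    ksm_swpin_copy 0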
--
Izik Eidus,
Hugh Dickins, 17 Nov 2009


@@ -62,6 +62,7 @@ Currently, these files are in /proc/sys/vm:
- overcommit_memory
- overcommit_ratio
- page-cluster
- page_lock_unfairness
- panic_on_oom
- percpu_pagelist_high_fraction
- stat_interval
@@ -561,6 +562,45 @@ Change the minimum size of the hugepage pool.
See Documentation/admin-guide/mm/hugetlbpage.rst
hugetlb_optimize_vmemmap
========================
This knob is not available when memory_hotplug.memmap_on_memory (kernel
parameter) is configured or the size of 'struct page' (a structure defined in
include/linux/mm_types.h) is not a power of two (an unusual system
configuration could result in this).
Enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap pages
associated with each HugeTLB page.
Once enabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
the buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095
pages per 1GB HugeTLB page), whereas already allocated HugeTLB pages will not
be optimized. When those optimized HugeTLB pages are freed from the HugeTLB
pool to the buddy allocator, the vmemmap pages representing that range need to
be remapped again and the vmemmap pages discarded earlier need to be
reallocated. If your use case is that HugeTLB pages are allocated 'on the fly'
(e.g. never explicitly allocating HugeTLB pages with 'nr_hugepages' but only
setting 'nr_overcommit_hugepages', so that those overcommitted HugeTLB pages
are allocated 'on the fly') instead of being pulled from the HugeTLB pool, you
should weigh the benefits of memory savings against the extra overhead (~2x
slower than before) of allocating or freeing HugeTLB pages between the HugeTLB
pool and the buddy allocator. Another behavior to note is that if the system
is under heavy memory pressure, it could prevent the user from freeing HugeTLB
pages from the HugeTLB pool to the buddy allocator since the allocation of
vmemmap pages could fail; you have to retry later if your system encounters
this situation.
Once disabled, the vmemmap pages of subsequent allocation of HugeTLB pages
from the buddy allocator will not be optimized, meaning the extra overhead at
allocation time from the buddy allocator disappears, whereas already optimized HugeTLB pages
will not be affected. If you want to make sure there are no optimized HugeTLB
pages, you can set "nr_hugepages" to 0 first and then disable this. Note that
writing 0 to nr_hugepages will make any "in use" HugeTLB pages become surplus
pages. So, those surplus pages are still optimized until they are no longer
in use. You would need to wait for those surplus pages to be released before
there are no optimized pages in the system.
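For example, to enable the optimization at runtime (illustrative)::

    echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap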
nr_hugepages_mempolicy
======================
@@ -754,6 +794,14 @@ extra faults and I/O delays for following faults if they would have been part of
that consecutive pages readahead would have brought in.
page_lock_unfairness
====================
This value determines the number of times that the page lock can be
stolen from under a waiter. After the lock is stolen the number of times
specified in this file (default is 5), the "fair lock handoff" semantics
will apply, and the waiter will only be awakened if the lock can be taken.
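For example, reading the current value (the output shown is the default)::

    # cat /proc/sys/vm/page_lock_unfairness
    5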
panic_on_oom
============


@@ -4,39 +4,76 @@ The Kernel Address Sanitizer (KASAN)
Overview
--------
- KernelAddressSANitizer (KASAN) is a dynamic memory safety error detector
- designed to find out-of-bound and use-after-free bugs. KASAN has three modes:
+ Kernel Address Sanitizer (KASAN) is a dynamic memory safety error detector
+ designed to find out-of-bounds and use-after-free bugs.
- 1. generic KASAN (similar to userspace ASan),
- 2. software tag-based KASAN (similar to userspace HWASan),
- 3. hardware tag-based KASAN (based on hardware memory tagging).
+ KASAN has three modes:
- Generic KASAN is mainly used for debugging due to a large memory overhead.
- Software tag-based KASAN can be used for dogfood testing as it has a lower
- memory overhead that allows using it with real workloads. Hardware tag-based
- KASAN comes with low memory and performance overheads and, therefore, can be
- used in production. Either as an in-field memory bug detector or as a security
- mitigation.
+ 1. Generic KASAN
+ 2. Software Tag-Based KASAN
+ 3. Hardware Tag-Based KASAN
- Software KASAN modes (#1 and #2) use compile-time instrumentation to insert
- validity checks before every memory access and, therefore, require a compiler
- version that supports that.
+ Generic KASAN, enabled with CONFIG_KASAN_GENERIC, is the mode intended for
+ debugging, similar to userspace ASan. This mode is supported on many CPU
+ architectures, but it has significant performance and memory overheads.
- Generic KASAN is supported in GCC and Clang. With GCC, it requires version
- 8.3.0 or later. Any supported Clang version is compatible, but detection of
- out-of-bounds accesses for global variables is only supported since Clang 11.
+ Software Tag-Based KASAN or SW_TAGS KASAN, enabled with CONFIG_KASAN_SW_TAGS,
+ can be used for both debugging and dogfood testing, similar to userspace HWASan.
+ This mode is only supported for arm64, but its moderate memory overhead allows
+ using it for testing on memory-restricted devices with real workloads.
- Software tag-based KASAN mode is only supported in Clang.
+ Hardware Tag-Based KASAN or HW_TAGS KASAN, enabled with CONFIG_KASAN_HW_TAGS,
+ is the mode intended to be used as an in-field memory bug detector or as a
+ security mitigation. This mode only works on arm64 CPUs that support MTE
+ (Memory Tagging Extension), but it has low memory and performance overheads and
+ thus can be used in production.
- The hardware KASAN mode (#3) relies on hardware to perform the checks but
- still requires a compiler version that supports memory tagging instructions.
- This mode is supported in GCC 10+ and Clang 12+.
+ For details about the memory and performance impact of each KASAN mode, see the
+ descriptions of the corresponding Kconfig options.
- Both software KASAN modes work with SLUB and SLAB memory allocators,
- while the hardware tag-based KASAN currently only supports SLUB.
+ The Generic and the Software Tag-Based modes are commonly referred to as the
+ software modes. The Software Tag-Based and the Hardware Tag-Based modes are
+ referred to as the tag-based modes.
- Currently, generic KASAN is supported for the x86_64, arm, arm64, xtensa, s390,
- and riscv architectures, and tag-based KASAN modes are supported only for arm64.
+ Support
+ -------
+ Architectures
+ ~~~~~~~~~~~~~
+ Generic KASAN is supported on x86_64, arm, arm64, powerpc, riscv, s390, and
+ xtensa, and the tag-based KASAN modes are supported only on arm64.
+ Compilers
+ ~~~~~~~~~
+ Software KASAN modes use compile-time instrumentation to insert validity checks
+ before every memory access and thus require a compiler version that provides
+ support for that. The Hardware Tag-Based mode relies on hardware to perform
+ these checks but still requires a compiler version that supports the memory
+ tagging instructions.
+ Generic KASAN requires GCC version 8.3.0 or later
+ or any Clang version supported by the kernel.
+ Software Tag-Based KASAN requires GCC 11+
+ or any Clang version supported by the kernel.
+ Hardware Tag-Based KASAN requires GCC 10+ or Clang 12+.
+ Memory types
+ ~~~~~~~~~~~~
+ Generic KASAN supports finding bugs in all of slab, page_alloc, vmap, vmalloc,
+ stack, and global memory.
+ Software Tag-Based KASAN supports slab, page_alloc, vmalloc, and stack memory.
+ Hardware Tag-Based KASAN supports slab, page_alloc, and non-executable vmalloc
+ memory.
+ For slab, both software KASAN modes support SLUB and SLAB allocators, while
+ Hardware Tag-Based KASAN only supports SLUB.
Usage
-----
@@ -45,18 +82,59 @@ To enable KASAN, configure the kernel with::
CONFIG_KASAN=y
- and choose between ``CONFIG_KASAN_GENERIC`` (to enable generic KASAN),
- ``CONFIG_KASAN_SW_TAGS`` (to enable software tag-based KASAN), and
- ``CONFIG_KASAN_HW_TAGS`` (to enable hardware tag-based KASAN).
+ and choose between ``CONFIG_KASAN_GENERIC`` (to enable Generic KASAN),
+ ``CONFIG_KASAN_SW_TAGS`` (to enable Software Tag-Based KASAN), and
+ ``CONFIG_KASAN_HW_TAGS`` (to enable Hardware Tag-Based KASAN).
- For software modes, also choose between ``CONFIG_KASAN_OUTLINE`` and
+ For the software modes, also choose between ``CONFIG_KASAN_OUTLINE`` and
``CONFIG_KASAN_INLINE``. Outline and inline are compiler instrumentation types.
- The former produces a smaller binary while the latter is 1.1-2 times faster.
+ The former produces a smaller binary while the latter is up to 2 times faster.
To include alloc and free stack traces of affected slab objects into reports,
enable ``CONFIG_STACKTRACE``. To include alloc and free stack traces of affected
physical pages, enable ``CONFIG_PAGE_OWNER`` and boot with ``page_owner=on``.
Boot parameters
~~~~~~~~~~~~~~~
KASAN is affected by the generic ``panic_on_warn`` command line parameter.
When it is enabled, KASAN panics the kernel after printing a bug report.
By default, KASAN prints a bug report only for the first invalid memory access.
With ``kasan_multi_shot``, KASAN prints a report on every invalid access. This
effectively disables ``panic_on_warn`` for KASAN reports.
Alternatively, independent of ``panic_on_warn``, the ``kasan.fault=`` boot
parameter can be used to control panic and reporting behaviour:
- ``kasan.fault=report`` or ``=panic`` controls whether to only print a KASAN
report or also panic the kernel (default: ``report``). The panic happens even
if ``kasan_multi_shot`` is enabled.
Hardware Tag-Based KASAN mode (see the section about various modes below) is
intended for use in production as a security mitigation. Therefore, it supports
additional boot parameters that allow disabling KASAN or controlling features:
- ``kasan=off`` or ``=on`` controls whether KASAN is enabled (default: ``on``).
- ``kasan.mode=sync``, ``=async`` or ``=asymm`` controls whether KASAN
is configured in synchronous, asynchronous or asymmetric mode of
execution (default: ``sync``).
Synchronous mode: a bad access is detected immediately when a tag
check fault occurs.
Asynchronous mode: a bad access detection is delayed. When a tag check
fault occurs, the information is stored in hardware (in the TFSR_EL1
register for arm64). The kernel periodically checks the hardware and
only reports tag faults during these checks.
Asymmetric mode: a bad access is detected synchronously on reads and
asynchronously on writes.
- ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc
allocations (default: ``on``).
- ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack
traces collection (default: ``on``).
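For example, a boot command line combining the parameters above might contain
(an illustrative combination)::

    kasan=on kasan.mode=sync kasan.vmalloc=on kasan.stacktrace=off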
Error reports
~~~~~~~~~~~~~
@@ -146,7 +224,7 @@ is either 8 or 16 aligned bytes depending on KASAN mode. Each number in the
memory state section of the report shows the state of one of the memory
granules that surround the accessed address.
- For generic KASAN, the size of each memory granule is 8. The state of each
+ For Generic KASAN, the size of each memory granule is 8. The state of each
granule is encoded in one shadow byte. Those 8 bytes can be accessible,
partially accessible, freed, or be a part of a redzone. KASAN uses the following
encoding for each shadow byte: 00 means that all 8 bytes of the corresponding
@@ -171,47 +249,6 @@ traces point to places in code that interacted with the object but that are not
directly present in the bad access stack trace. Currently, this includes
call_rcu() and workqueue queuing.
- Boot parameters
- ~~~~~~~~~~~~~~~
- KASAN is affected by the generic ``panic_on_warn`` command line parameter.
- When it is enabled, KASAN panics the kernel after printing a bug report.
- By default, KASAN prints a bug report only for the first invalid memory access.
- With ``kasan_multi_shot``, KASAN prints a report on every invalid access. This
- effectively disables ``panic_on_warn`` for KASAN reports.
- Alternatively, independent of ``panic_on_warn`` the ``kasan.fault=`` boot
- parameter can be used to control panic and reporting behaviour:
- - ``kasan.fault=report`` or ``=panic`` controls whether to only print a KASAN
- report or also panic the kernel (default: ``report``). The panic happens even
- if ``kasan_multi_shot`` is enabled.
- Hardware tag-based KASAN mode (see the section about various modes below) is
- intended for use in production as a security mitigation. Therefore, it supports
- additional boot parameters that allow disabling KASAN or controlling features:
- - ``kasan=off`` or ``=on`` controls whether KASAN is enabled (default: ``on``).
- - ``kasan.mode=sync``, ``=async`` or ``=asymm`` controls whether KASAN
- is configured in synchronous, asynchronous or asymmetric mode of
- execution (default: ``sync``).
- Synchronous mode: a bad access is detected immediately when a tag
- check fault occurs.
- Asynchronous mode: a bad access detection is delayed. When a tag check
- fault occurs, the information is stored in hardware (in the TFSR_EL1
- register for arm64). The kernel periodically checks the hardware and
- only reports tag faults during these checks.
- Asymmetric mode: a bad access is detected synchronously on reads and
- asynchronously on writes.
- - ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc
- allocations (default: ``on``).
- - ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack
- traces collection (default: ``on``).
Implementation details
----------------------
@@ -250,49 +287,46 @@ outline-instrumented kernel.
Generic KASAN is the only mode that delays the reuse of freed objects via
quarantine (see mm/kasan/quarantine.c for implementation).
- Software tag-based KASAN
+ Software Tag-Based KASAN
~~~~~~~~~~~~~~~~~~~~~~~~
- Software tag-based KASAN uses a software memory tagging approach to checking
+ Software Tag-Based KASAN uses a software memory tagging approach to checking
access validity. It is currently only implemented for the arm64 architecture.
- Software tag-based KASAN uses the Top Byte Ignore (TBI) feature of arm64 CPUs
+ Software Tag-Based KASAN uses the Top Byte Ignore (TBI) feature of arm64 CPUs
to store a pointer tag in the top byte of kernel pointers. It uses shadow memory
to store memory tags associated with each 16-byte memory cell (therefore, it
dedicates 1/16th of the kernel memory for shadow memory).
- On each memory allocation, software tag-based KASAN generates a random tag, tags
+ On each memory allocation, Software Tag-Based KASAN generates a random tag, tags
the allocated memory with this tag, and embeds the same tag into the returned
pointer.
- Software tag-based KASAN uses compile-time instrumentation to insert checks
+ Software Tag-Based KASAN uses compile-time instrumentation to insert checks
before each memory access. These checks make sure that the tag of the memory
that is being accessed is equal to the tag of the pointer that is used to access
- this memory. In case of a tag mismatch, software tag-based KASAN prints a bug
+ this memory. In case of a tag mismatch, Software Tag-Based KASAN prints a bug
report.
- Software tag-based KASAN also has two instrumentation modes (outline, which
+ Software Tag-Based KASAN also has two instrumentation modes (outline, which
emits callbacks to check memory accesses; and inline, which performs the shadow
memory checks inline). With outline instrumentation mode, a bug report is
printed from the function that performs the access check. With inline
instrumentation, a ``brk`` instruction is emitted by the compiler, and a
dedicated ``brk`` handler is used to print bug reports.
- Software tag-based KASAN uses 0xFF as a match-all pointer tag (accesses through
+ Software Tag-Based KASAN uses 0xFF as a match-all pointer tag (accesses through
pointers with the 0xFF pointer tag are not checked). The value 0xFE is currently
reserved to tag freed memory regions.
- Software tag-based KASAN currently only supports tagging of slab, page_alloc,
- and vmalloc memory.
- Hardware tag-based KASAN
+ Hardware Tag-Based KASAN
~~~~~~~~~~~~~~~~~~~~~~~~
- Hardware tag-based KASAN is similar to the software mode in concept but uses
+ Hardware Tag-Based KASAN is similar to the software mode in concept but uses
hardware memory tagging support instead of compiler instrumentation and
shadow memory.
- Hardware tag-based KASAN is currently only implemented for arm64 architecture
+ Hardware Tag-Based KASAN is currently only implemented for arm64 architecture
and based on both arm64 Memory Tagging Extension (MTE) introduced in ARMv8.5
Instruction Set Architecture and Top Byte Ignore (TBI).
@@ -302,21 +336,18 @@ access, hardware makes sure that the tag of the memory that is being accessed is
equal to the tag of the pointer that is used to access this memory. In case of a
tag mismatch, a fault is generated, and a report is printed.
- Hardware tag-based KASAN uses 0xFF as a match-all pointer tag (accesses through
+ Hardware Tag-Based KASAN uses 0xFF as a match-all pointer tag (accesses through
pointers with the 0xFF pointer tag are not checked). The value 0xFE is currently
reserved to tag freed memory regions.
- Hardware tag-based KASAN currently only supports tagging of slab, page_alloc,
- and VM_ALLOC-based vmalloc memory.
- If the hardware does not support MTE (pre ARMv8.5), hardware tag-based KASAN
+ If the hardware does not support MTE (pre ARMv8.5), Hardware Tag-Based KASAN
will not be enabled. In this case, all KASAN boot parameters are ignored.
Note that enabling CONFIG_KASAN_HW_TAGS always results in in-kernel TBI being
enabled. Even when ``kasan.mode=off`` is provided or when the hardware does not
support MTE (but supports TBI).
- Hardware tag-based KASAN only reports the first found bug. After that, MTE tag
+ Hardware Tag-Based KASAN only reports the first found bug. After that, MTE tag
checking gets disabled.
Shadow memory
@@ -414,19 +445,18 @@ generic ``noinstr`` one.
Note that disabling compiler instrumentation (either on a per-file or a
per-function basis) makes KASAN ignore the accesses that happen directly in
that code for software KASAN modes. It does not help when the accesses happen
- indirectly (through calls to instrumented functions) or with the hardware
- tag-based mode that does not use compiler instrumentation.
+ indirectly (through calls to instrumented functions) or with Hardware
+ Tag-Based KASAN, which does not use compiler instrumentation.
For software KASAN modes, to disable KASAN reports in a part of the kernel code
for the current task, annotate this part of the code with a
``kasan_disable_current()``/``kasan_enable_current()`` section. This also
disables the reports for indirect accesses that happen through function calls.
- For tag-based KASAN modes (include the hardware one), to disable access
- checking, use ``kasan_reset_tag()`` or ``page_kasan_tag_reset()``. Note that
- temporarily disabling access checking via ``page_kasan_tag_reset()`` requires
- saving and restoring the per-page KASAN tag via
- ``page_kasan_tag``/``page_kasan_tag_set``.
+ For tag-based KASAN modes, to disable access checking, use
+ ``kasan_reset_tag()`` or ``page_kasan_tag_reset()``. Note that temporarily
+ disabling access checking via ``page_kasan_tag_reset()`` requires saving and
+ restoring the per-page KASAN tag via ``page_kasan_tag``/``page_kasan_tag_set``.
Tests
~~~~~


@@ -258,8 +258,9 @@ prototypes::
int (*launder_folio)(struct folio *);
bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
int (*error_remove_page)(struct address_space *, struct page *);
- int (*swap_activate)(struct file *);
+ int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
int (*swap_deactivate)(struct file *);
+ int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
locking rules:
All except dirty_folio and free_folio may block
@@ -287,6 +288,7 @@ is_partially_uptodate: yes
error_remove_page: yes
swap_activate: no
swap_deactivate: no
swap_rw: yes, unlocks
====================== ======================== ========= ===============
->write_begin(), ->write_end() and ->read_folio() may be called from
@@ -386,15 +388,19 @@ cleaned, or an error value if not. Note that in order to prevent the folio
getting mapped back in and redirtied, it needs to be kept locked
across the entire operation.
- ->swap_activate will be called with a non-zero argument on
- files backing (non block device backed) swapfiles. A return value
- of zero indicates success, in which case this file can be used for
- backing swapspace. The swapspace operations will be proxied to the
- address space operations.
+ ->swap_activate() will be called to prepare the given file for swap. It
+ should perform any validation and preparation necessary to ensure that
+ writes can be performed with minimal memory allocation. It should call
+ add_swap_extent(), or the helper iomap_swapfile_activate(), and return
+ the number of extents added. If IO should be submitted through
+ ->swap_rw(), it should set SWP_FS_OPS, otherwise IO will be submitted
+ directly to the block device ``sis->bdev``.
->swap_deactivate() will be called in the sys_swapoff()
path after ->swap_activate() returned success.
+ ->swap_rw will be called for swap IO if SWP_FS_OPS was set by ->swap_activate().
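As a sketch of this contract (the filesystem ``myfs`` and its
``myfs_iomap_ops`` are hypothetical, not part of this patch), a
->swap_activate() that delegates extent mapping to the iomap helper could
look like::

	static int myfs_swap_activate(struct swap_info_struct *sis,
				      struct file *swap_file, sector_t *span)
	{
		/*
		 * Walks the file's extents and registers them with
		 * add_swap_extent(), returning the number of extents added.
		 * SWP_FS_OPS is not set here, so swap IO goes directly to
		 * the block device sis->bdev.
		 */
		return iomap_swapfile_activate(sis, swap_file, span,
					       &myfs_iomap_ops);
	}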
file_lock_operations
====================


@@ -942,56 +942,73 @@ can be substantial. In many cases there are other means to find out
additional memory using subsystem specific interfaces, for instance
/proc/net/sockstat for TCP memory allocations.
- The following is from a 16GB PIII, which has highmem enabled.
- You may not have all of these fields.
+ Example output. You may not have all of these fields.
::
- > cat /proc/meminfo
- MemTotal: 16344972 kB
- MemFree: 13634064 kB
- MemAvailable: 14836172 kB
- Buffers: 3656 kB
- Cached: 1195708 kB
- SwapCached: 0 kB
- Active: 891636 kB
- Inactive: 1077224 kB
- HighTotal: 15597528 kB
- HighFree: 13629632 kB
- LowTotal: 747444 kB
- LowFree: 4432 kB
- SwapTotal: 0 kB
- SwapFree: 0 kB
- Dirty: 968 kB
- Writeback: 0 kB
- AnonPages: 861800 kB
- Mapped: 280372 kB
- Shmem: 644 kB
- KReclaimable: 168048 kB
- Slab: 284364 kB
- SReclaimable: 159856 kB
- SUnreclaim: 124508 kB
- PageTables: 24448 kB
- NFS_Unstable: 0 kB
- Bounce: 0 kB
- WritebackTmp: 0 kB
- CommitLimit: 7669796 kB
- Committed_AS: 100056 kB
- VmallocTotal: 112216 kB
- VmallocUsed: 428 kB
- VmallocChunk: 111088 kB
- Percpu: 62080 kB
- HardwareCorrupted: 0 kB
- AnonHugePages: 49152 kB
- ShmemHugePages: 0 kB
- ShmemPmdMapped: 0 kB
+ MemTotal: 32858820 kB
+ MemFree: 21001236 kB
+ MemAvailable: 27214312 kB
+ Buffers: 581092 kB
+ Cached: 5587612 kB
+ SwapCached: 0 kB
+ Active: 3237152 kB
+ Inactive: 7586256 kB
+ Active(anon): 94064 kB
+ Inactive(anon): 4570616 kB
+ Active(file): 3143088 kB
+ Inactive(file): 3015640 kB
+ Unevictable: 0 kB
+ Mlocked: 0 kB
+ SwapTotal: 0 kB
+ SwapFree: 0 kB
+ Zswap: 1904 kB
+ Zswapped: 7792 kB
+ Dirty: 12 kB
+ Writeback: 0 kB
+ AnonPages: 4654780 kB
+ Mapped: 266244 kB
+ Shmem: 9976 kB
+ KReclaimable: 517708 kB
+ Slab: 660044 kB
+ SReclaimable: 517708 kB
+ SUnreclaim: 142336 kB
+ KernelStack: 11168 kB
+ PageTables: 20540 kB
+ NFS_Unstable: 0 kB
+ Bounce: 0 kB
+ WritebackTmp: 0 kB
+ CommitLimit: 16429408 kB
+ Committed_AS: 7715148 kB
+ VmallocTotal: 34359738367 kB
+ VmallocUsed: 40444 kB
+ VmallocChunk: 0 kB
+ Percpu: 29312 kB
+ HardwareCorrupted: 0 kB
+ AnonHugePages: 4149248 kB
+ ShmemHugePages: 0 kB
+ ShmemPmdMapped: 0 kB
+ FileHugePages: 0 kB
+ FilePmdMapped: 0 kB
+ CmaTotal: 0 kB
+ CmaFree: 0 kB
+ HugePages_Total: 0
+ HugePages_Free: 0
+ HugePages_Rsvd: 0
+ HugePages_Surp: 0
+ Hugepagesize: 2048 kB
+ Hugetlb: 0 kB
+ DirectMap4k: 401152 kB
+ DirectMap2M: 10008576 kB
+ DirectMap1G: 24117248 kB
MemTotal
Total usable RAM (i.e. physical RAM minus a few reserved
bits and the kernel binary code)
MemFree
- The sum of LowFree+HighFree
+ Total free RAM. On highmem systems, the sum of LowFree+HighFree
MemAvailable
An estimate of how much memory is available for starting new
applications, without swapping. Calculated from MemFree,
@@ -1005,8 +1022,9 @@ Buffers
Relatively temporary storage for raw disk blocks
shouldn't get tremendously large (20MB or so)
Cached
- in-memory cache for files read from the disk (the
- pagecache). Doesn't include SwapCached
+ In-memory cache for files read from the disk (the
+ pagecache) as well as tmpfs & shmem.
+ Doesn't include SwapCached.
SwapCached
Memory that once was swapped out, is swapped back in but
still also is in the swapfile (if memory is needed it
@@ -1018,6 +1036,11 @@ Active
Inactive
Memory which has been less recently used. It is more
eligible to be reclaimed for other purposes
Unevictable
Memory allocated for userspace which cannot be reclaimed, such
as mlocked pages, ramfs backing pages, secret memfd pages etc.
Mlocked
Memory locked with mlock().
HighTotal, HighFree
Highmem is all memory above ~860MB of physical memory.
Highmem areas are for use by userspace programs, or
@@ -1034,26 +1057,20 @@ SwapTotal
SwapFree
Memory which has been evicted from RAM, and is temporarily
on the disk
Zswap
Memory consumed by the zswap backend (compressed size)
Zswapped
Amount of anonymous memory stored in zswap (original size)
Dirty
Memory which is waiting to get written back to the disk
Writeback
Memory which is actively being written back to the disk
AnonPages
Non-file backed pages mapped into userspace page tables
- HardwareCorrupted
- The amount of RAM/memory in KB, the kernel identifies as
- corrupted.
- AnonHugePages
- Non-file backed huge pages mapped into userspace page tables
Mapped
files which have been mmaped, such as libraries
Shmem
Total memory used by shared memory (shmem) and tmpfs
- ShmemHugePages
- Memory used by shared memory (shmem) and tmpfs allocated
- with huge pages
- ShmemPmdMapped
- Shared memory mapped into userspace with huge pages
KReclaimable
Kernel allocations that the kernel will attempt to reclaim
under memory pressure. Includes SReclaimable (below), and other
@@ -1064,9 +1081,10 @@ SReclaimable
Part of Slab, that might be reclaimed, such as caches
SUnreclaim
Part of Slab, that cannot be reclaimed on memory pressure
KernelStack
Memory consumed by the kernel stacks of all tasks
PageTables
- amount of memory dedicated to the lowest level of page
- tables.
+ Memory consumed by userspace page tables
NFS_Unstable
Always zero. Previous counted pages which had been written to
the server, but has not been committed to stable storage.
@@ -1098,7 +1116,7 @@ Committed_AS
has been allocated by processes, even if it has not been
"used" by them as of yet. A process which malloc()'s 1G
of memory, but only touches 300M of it will show up as
using 1G. This 1G is memory which has been "committed" to
by the VM and can be used at any time by the allocating
application. With strict overcommit enabled on the system
(mode 2 in 'vm.overcommit_memory'), allocations which would
@@ -1107,7 +1125,7 @@ Committed_AS
not fail due to lack of memory once that memory has been
successfully allocated.
VmallocTotal
- total size of vmalloc memory area
+ total size of vmalloc virtual address space
VmallocUsed
amount of vmalloc area which is used
VmallocChunk
@@ -1115,6 +1133,30 @@ VmallocChunk
Percpu
Memory allocated to the percpu allocator used to back percpu
allocations. This stat excludes the cost of metadata.
HardwareCorrupted
The amount of RAM/memory in KB, the kernel identifies as
corrupted.
AnonHugePages
Non-file backed huge pages mapped into userspace page tables
ShmemHugePages
Memory used by shared memory (shmem) and tmpfs allocated
with huge pages
ShmemPmdMapped
Shared memory mapped into userspace with huge pages
FileHugePages
Memory used for filesystem data (page cache) allocated
with huge pages
FilePmdMapped
Page cache mapped into userspace with huge pages
CmaTotal
Memory reserved for the Contiguous Memory Allocator (CMA)
CmaFree
Free remaining memory in the CMA reserves
HugePages_Total, HugePages_Free, HugePages_Rsvd, HugePages_Surp, Hugepagesize, Hugetlb
See Documentation/admin-guide/mm/hugetlbpage.rst.
DirectMap4k, DirectMap2M, DirectMap1G
Breakdown of page table sizes used in the kernel's
identity mapping of RAM
vmallocinfo
~~~~~~~~~~~


@@ -749,8 +749,9 @@ cache in your filesystem. The following members are defined:
size_t count);
void (*is_dirty_writeback)(struct folio *, bool *, bool *);
int (*error_remove_page) (struct mapping *mapping, struct page *page);
- int (*swap_activate)(struct file *);
+ int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
int (*swap_deactivate)(struct file *);
+ int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
};
``writepage``
@@ -948,15 +949,21 @@ cache in your filesystem. The following members are defined:
unless you have them locked or reference counts increased.
``swap_activate``
- Called when swapon is used on a file to allocate space if
- necessary and pin the block lookup information in memory. A
- return value of zero indicates success, in which case this file
- can be used to back swapspace.
+ Called to prepare the given file for swap. It should perform
+ any validation and preparation necessary to ensure that writes
+ can be performed with minimal memory allocation. It should call
+ add_swap_extent(), or the helper iomap_swapfile_activate(), and
+ return the number of extents added. If IO should be submitted
+ through ->swap_rw(), it should set SWP_FS_OPS, otherwise IO will
+ be submitted directly to the block device ``sis->bdev``.
``swap_deactivate``
Called during swapoff on files where swap_activate was
successful.
+ ``swap_rw``
+ Called to read or write swap pages when SWP_FS_OPS is set.
The File Object
===============


@@ -50,61 +50,74 @@ space when they use mm context tags.
Temporary Virtual Mappings
==========================
- The kernel contains several ways of creating temporary mappings:
+ The kernel contains several ways of creating temporary mappings. The following
+ list shows them in order of preference of use.
- * vmap(). This can be used to make a long duration mapping of multiple
- physical pages into a contiguous virtual space. It needs global
- synchronization to unmap.
* kmap_local_page(). This function is used to require short term mappings.
It can be invoked from any context (including interrupts) but the mappings
can only be used in the context which acquired them.
- * kmap(). This permits a short duration mapping of a single page. It needs
- global synchronization, but is amortized somewhat. It is also prone to
- deadlocks when using in a nested fashion, and so it is not recommended for
- new code.
+ This function should be preferred, where feasible, over all the others.
+ These mappings are thread-local and CPU-local, meaning that the mapping
+ can only be accessed from within this thread and the thread is bound to the
+ CPU while the mapping is active. Even if the thread is preempted (since
+ preemption is never disabled by the function) the CPU can not be
+ unplugged from the system via CPU-hotplug until the mapping is disposed.
+ It's valid to take pagefaults in a local kmap region, unless the context
+ in which the local mapping is acquired does not allow it for other reasons.
+ kmap_local_page() always returns a valid virtual address and it is assumed
+ that kunmap_local() will never fail.
+ Nesting kmap_local_page() and kmap_atomic() mappings is allowed to a certain
+ extent (up to KMAP_TYPE_NR) but their invocations have to be strictly ordered
+ because the map implementation is stack based. See kmap_local_page() kdocs
+ (included in the "Functions" section) for details on how to manage nested
+ mappings.
* kmap_atomic(). This permits a very short duration mapping of a single
page. Since the mapping is restricted to the CPU that issued it, it
performs well, but the issuing task is therefore required to stay on that
CPU until it has finished, lest some other task displace its mappings.
- kmap_atomic() may also be used by interrupt contexts, since it is does not
- sleep and the caller may not sleep until after kunmap_atomic() is called.
+ kmap_atomic() may also be used by interrupt contexts, since it does not
+ sleep and the callers too may not sleep until after kunmap_atomic() is
+ called.
- It may be assumed that k[un]map_atomic() won't fail.
+ Each call of kmap_atomic() in the kernel creates a non-preemptible section
+ and disables pagefaults. This could be a source of unwanted latency. Therefore
+ users should prefer kmap_local_page() instead of kmap_atomic().
+ It is assumed that k[un]map_atomic() won't fail.
- Using kmap_atomic
- =================
+ * kmap(). This should be used to make short duration mapping of a single
+ page with no restrictions on preemption or migration. It comes with an
+ overhead as mapping space is restricted and protected by a global lock
+ for synchronization. When mapping is no longer needed, the address that
+ the page was mapped to must be released with kunmap().
- When and where to use kmap_atomic() is straightforward. It is used when code
- wants to access the contents of a page that might be allocated from high memory
- (see __GFP_HIGHMEM), for example a page in the pagecache. The API has two
- functions, and they can be used in a manner similar to the following::
+ Mapping changes must be propagated across all the CPUs. kmap() also
+ requires global TLB invalidation when the kmap's pool wraps and it might
+ block when the mapping space is fully utilized until a slot becomes
+ available. Therefore, kmap() is only callable from preemptible context.
- /* Find the page of interest. */
- struct page *page = find_get_page(mapping, offset);
+ All the above work is necessary if a mapping must last for a relatively
+ long time but the bulk of high-memory mappings in the kernel are
+ short-lived and only used in one place. This means that the cost of
+ kmap() is mostly wasted in such cases. kmap() was not intended for long
+ term mappings but it has morphed in that direction and its use is
+ strongly discouraged in newer code and the set of the preceding functions
+ should be preferred.
- /* Gain access to the contents of that page. */
- void *vaddr = kmap_atomic(page);
+ On 64-bit systems, calls to kmap_local_page(), kmap_atomic() and kmap() have
+ no real work to do because a 64-bit address space is more than sufficient to
+ address all the physical memory whose pages are permanently mapped.
- /* Do something to the contents of that page. */
- memset(vaddr, 0, PAGE_SIZE);
- /* Unmap that page. */
- kunmap_atomic(vaddr);
- Note that the kunmap_atomic() call takes the result of the kmap_atomic() call
- not the argument.
- If you need to map two pages because you want to copy from one page to
- another you need to keep the kmap_atomic calls strictly nested, like::
- vaddr1 = kmap_atomic(page1);
- vaddr2 = kmap_atomic(page2);
- memcpy(vaddr1, vaddr2, PAGE_SIZE);
- kunmap_atomic(vaddr2);
- kunmap_atomic(vaddr1);
+ * vmap(). This can be used to make a long duration mapping of multiple
+ physical pages into a contiguous virtual space. It needs global
+ synchronization to unmap.
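A short usage sketch, mirroring the kmap_atomic() example removed above but
with the preferred kmap_local_page() API (it assumes a page obtained from the
pagecache)::

	/* Find the page of interest. */
	struct page *page = find_get_page(mapping, offset);

	/* Gain access to the contents of that page. */
	void *vaddr = kmap_local_page(page);

	/* Do something to the contents of that page. */
	memset(vaddr, 0, PAGE_SIZE);

	/* Unmap that page; kunmap_local() takes the returned address. */
	kunmap_local(vaddr);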
Cost of Temporary Mappings
@@ -145,3 +158,10 @@ The general recommendation is that you don't use more than 8GiB on a 32-bit
machine - although more might work for you and your workload, you're pretty
much on your own - don't expect kernel developers to really care much if things
come apart.
Functions
=========
.. kernel-doc:: include/linux/highmem.h
.. kernel-doc:: include/linux/highmem-internal.h


@@ -63,5 +63,6 @@ above structured documentation, or deleted if it has served its purpose.
transhuge
unevictable-lru
vmalloced-kernel-stacks
vmemmap_dedup
z3fold
zsmalloc


@@ -121,6 +121,14 @@ Usage
-r Sort by memory release time.
-s Sort by stack trace.
-t Sort by times (default).
--sort <order> Specify sorting order. Sorting syntax is [+|-]key[,[+|-]key[,...]].
Choose a key from the **STANDARD FORMAT SPECIFIERS** section. The "+" is
optional since default direction is increasing numerical or lexicographic
order. Mixed use of abbreviated and complete-form of keys is allowed.
Examples:
./page_owner_sort <input> <output> --sort=n,+pid,-tgid
./page_owner_sort <input> <output> --sort=at
additional function::
@@ -129,7 +137,6 @@ Usage
- Specify culling rules.Culling syntax is key[,key[,...]].Choose a
- multi-letter key from the **STANDARD FORMAT SPECIFIERS** section.
+ <rules> is a single argument in the form of a comma-separated list,
+ which offers a way to specify individual culling rules. The recognized
+ keywords are described in the **STANDARD FORMAT SPECIFIERS** section below.
@@ -137,7 +144,6 @@ Usage
the STANDARD SORT KEYS section below. Mixed use of abbreviated and
complete-form of keys is allowed.
Examples:
./page_owner_sort <input> <output> --cull=stacktrace
./page_owner_sort <input> <output> --cull=st,pid,name
@@ -147,17 +153,44 @@ Usage
-f Filter out the information of blocks whose memory has been released.
Select:
- --pid <PID> Select by pid.
- --tgid <TGID> Select by tgid.
- --name <command> Select by task command name.
+ --pid <pidlist> Select by pid. This selects the blocks whose process ID
+ numbers appear in <pidlist>.
+ --tgid <tgidlist> Select by tgid. This selects the blocks whose thread
+ group ID numbers appear in <tgidlist>.
+ --name <cmdlist> Select by task command name. This selects the blocks whose
+ task command names appear in <cmdlist>.
+ <pidlist>, <tgidlist>, <cmdlist> are single arguments in the form of a
+ comma-separated list, which offers a way to specify individual selecting
+ rules.
Examples:
./page_owner_sort <input> <output> --pid=1
./page_owner_sort <input> <output> --tgid=1,2,3
./page_owner_sort <input> <output> --name name1,name2
STANDARD FORMAT SPECIFIERS
==========================
::
For --sort option:
KEY LONG DESCRIPTION
p pid process ID
tg tgid thread group ID
n name task command name
st stacktrace stack trace of the page allocation
T txt full text of block
ft free_ts timestamp of the page when it was released
at alloc_ts timestamp of the page when it was allocated
ator allocator memory allocator for pages
For --cull option:
KEY LONG DESCRIPTION
p pid process ID
tg tgid thread group ID
n name task command name
f free whether the page has been released or not
- st stacktrace stace trace of the page allocation
+ st stacktrace stack trace of the page allocation
ator allocator memory allocator for pages


@@ -0,0 +1,223 @@
.. SPDX-License-Identifier: GPL-2.0
=========================================
A vmemmap diet for HugeTLB and Device DAX
=========================================
HugeTLB
=======
The struct page structures (page structs) are used to describe a physical
page frame. By default, there is a one-to-one mapping from a page frame to
its corresponding page struct.
HugeTLB pages consist of multiple base page size pages and are supported by many
architectures. See Documentation/admin-guide/mm/hugetlbpage.rst for more
details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB are
currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page
consists of 512 base pages and a 1GB HugeTLB page consists of 4096 base pages.
For each base page, there is a corresponding page struct.
Within the HugeTLB subsystem, only the first 4 page structs are used to
contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
this upper limit. The only 'useful' information in the remaining page structs
is the compound_head field, and this field is the same for all tail pages.
By removing redundant page structs for HugeTLB pages, memory can be returned
to the buddy allocator for other uses.
Different architectures support different HugeTLB pages. For example, the
following table lists the HugeTLB page sizes supported by the x86 and arm64
architectures. Because arm64 supports 4k, 16k, and 64k base pages and
supports contiguous entries, it supports many sizes of HugeTLB
page.
+--------------+-----------+-----------------------------------------------+
| Architecture | Page Size | HugeTLB Page Size |
+--------------+-----------+-----------+-----------+-----------+-----------+
| x86-64 | 4KB | 2MB | 1GB | | |
+--------------+-----------+-----------+-----------+-----------+-----------+
| | 4KB | 64KB | 2MB | 32MB | 1GB |
| +-----------+-----------+-----------+-----------+-----------+
| arm64 | 16KB | 2MB | 32MB | 1GB | |
| +-----------+-----------+-----------+-----------+-----------+
| | 64KB | 2MB | 512MB | 16GB | |
+--------------+-----------+-----------+-----------+-----------+-----------+
When the system boots up, every HugeTLB page has more than one struct page
struct, whose total size is (unit: pages)::
struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
relationship::
HugeTLB_Size = n * PAGE_SIZE
Then::
struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
= n * sizeof(struct page) / PAGE_SIZE
We can use huge mapping at the pud/pmd level for the HugeTLB page.
For the HugeTLB page of the pmd level mapping, then::
struct_size = n * sizeof(struct page) / PAGE_SIZE
= PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
= sizeof(struct page) / sizeof(pte_t)
= 64 / 8
= 8 (pages)
Where n is the number of pte entries that one page can contain. So the value
of n is (PAGE_SIZE / sizeof(pte_t)).
This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
is 8. And this optimization is applicable only when the size of struct page
is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
size of its struct page structs is 8 page frames, whose size depends on the
size of the base page.
For the HugeTLB page of the pud level mapping, then::
struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
= PAGE_SIZE / 8 * 8 (pages)
= PAGE_SIZE (pages)
Where the struct_size(pmd) is the size of the struct page structs of a
HugeTLB page of the pmd level mapping.
E.g.: the struct page structs of a 2MB HugeTLB page on x86_64 occupy 8 page
frames, while those of a 1GB HugeTLB page occupy 4096.
Next, we take the pmd level mapping of the HugeTLB page as an example to
show the internal implementation of this optimization. There are 8 pages
struct page structs associated with a HugeTLB page which is pmd mapped.
Here is how things look before optimization::
HugeTLB struct pages(8 pages) page frame(8 pages)
+-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
| | | 0 | -------------> | 0 |
| | +-----------+ +-----------+
| | | 1 | -------------> | 1 |
| | +-----------+ +-----------+
| | | 2 | -------------> | 2 |
| | +-----------+ +-----------+
| | | 3 | -------------> | 3 |
| | +-----------+ +-----------+
| | | 4 | -------------> | 4 |
| PMD | +-----------+ +-----------+
| level | | 5 | -------------> | 5 |
| mapping | +-----------+ +-----------+
| | | 6 | -------------> | 6 |
| | +-----------+ +-----------+
| | | 7 | -------------> | 7 |
| | +-----------+ +-----------+
| |
| |
| |
+-----------+
The value of page->compound_head is the same for all tail pages. The first
page of page structs (page 0) associated with the HugeTLB page contains the 4
page structs necessary to describe the HugeTLB. The only use of the remaining
pages of page structs (page 1 to page 7) is to point to page->compound_head.
Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of page structs
will be used for each HugeTLB page. This will allow us to free the remaining
7 pages to the buddy allocator.
Here is how things look after remapping::
HugeTLB struct pages(8 pages) page frame(8 pages)
+-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
| | | 0 | -------------> | 0 |
| | +-----------+ +-----------+
| | | 1 | ---------------^ ^ ^ ^ ^ ^ ^
| | +-----------+ | | | | | |
| | | 2 | -----------------+ | | | | |
| | +-----------+ | | | | |
| | | 3 | -------------------+ | | | |
| | +-----------+ | | | |
| | | 4 | ---------------------+ | | |
| PMD | +-----------+ | | |
| level | | 5 | -----------------------+ | |
| mapping | +-----------+ | |
| | | 6 | -------------------------+ |
| | +-----------+ |
| | | 7 | ---------------------------+
| | +-----------+
| |
| |
| |
+-----------+
When a HugeTLB is freed to the buddy system, we should allocate 7 pages for
vmemmap pages and restore the previous mapping relationship.
For the HugeTLB page of the pud level mapping, it is similar to the former.
We can also use this approach to free (PAGE_SIZE - 1) vmemmap pages.
Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
(e.g. aarch64) provide a contiguous bit in the translation table entries
that hints to the MMU to indicate that it is one of a contiguous set of
entries that can be cached in a single TLB entry.
The contiguous bit is used to increase the mapping size at the pmd and pte
(last) level. So this type of HugeTLB page can be optimized only when the
size of its struct page structs is greater than 1 page.
Notice: The head vmemmap page is not freed to the buddy allocator and all
tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
more than one struct page struct with PG_head (e.g. 8 per 2 MB HugeTLB page)
associated with each HugeTLB page. The compound_head() can handle this
correctly (more details refer to the comment above compound_head()).
Device DAX
==========
The device-dax interface uses the same tail deduplication technique explained
in the previous chapter, except when used with the vmemmap in
the device (altmap).
The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
The differences with HugeTLB are relatively minor.
It only uses 3 page structs for storing all information as opposed
to 4 on HugeTLB pages.
There's no remapping of vmemmap given that device-dax memory is not part of
System RAM ranges initialized at boot. Thus the tail page deduplication
happens at a later stage when we populate the sections. HugeTLB reuses the
head vmemmap page, whereas device-dax reuses the tail vmemmap page. This
results in only half of the savings compared to HugeTLB.
Deduplicated tail pages are not mapped read-only.
Here's how things look like on device-dax after the sections are populated::
+-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
| | | 0 | -------------> | 0 |
| | +-----------+ +-----------+
| | | 1 | -------------> | 1 |
| | +-----------+ +-----------+
| | | 2 | ----------------^ ^ ^ ^ ^ ^
| | +-----------+ | | | | |
| | | 3 | ------------------+ | | | |
| | +-----------+ | | | |
| | | 4 | --------------------+ | | |
| PMD | +-----------+ | | |
| level | | 5 | ----------------------+ | |
| mapping | +-----------+ | |
| | | 6 | ------------------------+ |
| | +-----------+ |
| | | 7 | --------------------------+
| | +-----------+
| |
| |
| |
+-----------+


@@ -5027,6 +5027,7 @@ F: Documentation/admin-guide/cgroup-v1/
F: Documentation/admin-guide/cgroup-v2.rst
F: include/linux/cgroup*
F: kernel/cgroup/
F: tools/testing/selftests/cgroup/
CONTROL GROUP - BLOCK IO CONTROLLER (BLKIO)
M: Tejun Heo <tj@kernel.org>
@@ -5060,6 +5061,8 @@ L: linux-mm@kvack.org
S: Maintained
F: mm/memcontrol.c
F: mm/swap_cgroup.c
F: tools/testing/selftests/cgroup/test_kmem.c
F: tools/testing/selftests/cgroup/test_memcontrol.c
CORETEMP HARDWARE MONITORING DRIVER
M: Fenghua Yu <fenghua.yu@intel.com>
@@ -9064,16 +9067,20 @@ S: Orphan
F: Documentation/networking/device_drivers/ethernet/huawei/hinic.rst
F: drivers/net/ethernet/huawei/hinic/
- HUGETLB FILESYSTEM
+ HUGETLB SUBSYSTEM
M: Mike Kravetz <mike.kravetz@oracle.com>
M: Muchun Song <songmuchun@bytedance.com>
L: linux-mm@kvack.org
S: Maintained
F: Documentation/ABI/testing/sysfs-kernel-mm-hugepages
F: Documentation/admin-guide/mm/hugetlbpage.rst
F: Documentation/vm/hugetlbfs_reserv.rst
F: Documentation/vm/vmemmap_dedup.rst
F: fs/hugetlbfs/
F: include/linux/hugetlb.h
F: mm/hugetlb.c
F: mm/hugetlb_vmemmap.c
F: mm/hugetlb_vmemmap.h
HVA ST MEDIA DRIVER
M: Jean-Christophe Trotin <jean-christophe.trotin@foss.st.com>


@@ -18,7 +18,7 @@ extern void clear_page(void *page);
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define alloc_zeroed_user_highpage_movable(vma, vaddr) \
- alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, vma, vmaddr)
+ alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
extern void copy_page(void * _to, void * _from);


@@ -45,6 +45,7 @@ config ARM64
select ARCH_HAS_SYSCALL_WRAPPER
select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_VM_GET_PAGE_PROT
select ARCH_HAS_ZONE_DMA_SET if EXPERT
select ARCH_HAVE_ELF_PROT
select ARCH_HAVE_NMI_SAFE_CMPXCHG
@@ -91,11 +92,13 @@ config ARM64
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
select ARCH_WANT_LD_ORPHAN_WARN
select ARCH_WANTS_NO_INSTR
select ARCH_HAS_UBSAN_SANITIZE_ALL

Some files were not shown because too many files have changed in this diff.