Merge tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Almost all of MM here. A few things are still getting finished off,
reviewed, etc.
- Yang Shi has improved the behaviour of khugepaged collapsing of
readonly file-backed transparent hugepages.
- Johannes Weiner has arranged for zswap memory use to be tracked and
managed on a per-cgroup basis.
- Muchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for
runtime enablement of the recent huge page vmemmap optimization
feature.
- Baolin Wang contributes a series to fix some issues around hugetlb
pagetable invalidation.
- Zhenwei Pi has fixed some interactions between hwpoisoned pages and
virtualization.
- Tong Tiangen has enabled the use of the presently x86-only
page_table_check debugging feature on arm64 and riscv.
- David Vernet has done some fixup work on the memcg selftests.
- Peter Xu has taught userfaultfd to handle write protection faults
against shmem- and hugetlbfs-backed files.
- More DAMON development from SeongJae Park - adding online tuning of
the feature and support for monitoring of fixed virtual address
ranges. Also easier discovery of which monitoring operations are
available.
- Nadav Amit has done some optimization of TLB flushing during
mprotect().
- Neil Brown continues to labor away at improving our swap-over-NFS
support.
- David Hildenbrand has some fixes to anon page COWing versus
get_user_pages().
- Peng Liu fixed some errors in the core hugetlb code.
- Joao Martins has reduced the amount of memory consumed by
device-dax's compound devmaps.
- Some cleanups of the arch-specific pagemap code from Anshuman
Khandual.
- Muchun Song has found and fixed some errors in the TLB flushing of
transparent hugepages.
- Roman Gushchin has done more work on the memcg selftests.
... and, of course, many smaller fixes and cleanups. Notably, the
customary million cleanup serieses from Miaohe Lin"
* tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits)
mm: kfence: use PAGE_ALIGNED helper
selftests: vm: add the "settings" file with timeout variable
selftests: vm: add "test_hmm.sh" to TEST_FILES
selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests
selftests: vm: add migration to the .gitignore
selftests/vm/pkeys: fix typo in comment
ksm: fix typo in comment
selftests: vm: add process_mrelease tests
Revert "mm/vmscan: never demote for memcg reclaim"
mm/kfence: print disabling or re-enabling message
include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace"
include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion"
mm: fix a potential infinite loop in start_isolate_page_range()
MAINTAINERS: add Muchun as co-maintainer for HugeTLB
zram: fix Kconfig dependency warning
mm/shmem: fix shmem folio swapoff hang
cgroup: fix an error handling path in alloc_pagecache_max_30M()
mm: damon: use HPAGE_PMD_SIZE
tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
nodemask.h: fix compilation error with GCC12
...
@@ -23,9 +23,10 @@ Date: Mar 2022
Contact: SeongJae Park <sj@kernel.org>
Description: Writing 'on' or 'off' to this file makes the kdamond starts or
stops, respectively. Reading the file returns the keywords
based on the current status. Writing 'update_schemes_stats' to
the file updates contents of schemes stats files of the
kdamond.
based on the current status. Writing 'commit' to this file
makes the kdamond reads the user inputs in the sysfs files
except 'state' again. Writing 'update_schemes_stats' to the
file updates contents of schemes stats files of the kdamond.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/pid
Date: Mar 2022
@@ -40,14 +41,24 @@ Description: Writing a number 'N' to this file creates the number of
directories for controlling each DAMON context named '0' to
'N-1' under the contexts/ directory.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/avail_operations
Date: Apr 2022
Contact: SeongJae Park <sj@kernel.org>
Description: Reading this file returns the available monitoring operations
sets on the currently running kernel.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/operations
Date: Mar 2022
Contact: SeongJae Park <sj@kernel.org>
Description: Writing a keyword for a monitoring operations set ('vaddr' for
virtual address spaces monitoring, and 'paddr' for the physical
address space monitoring) to this file makes the context to use
the operations set. Reading the file returns the keyword for
the operations set the context is set to use.
virtual address spaces monitoring, 'fvaddr' for fixed virtual
address ranges monitoring, and 'paddr' for the physical address
space monitoring) to this file makes the context to use the
operations set. Reading the file returns the keyword for the
operations set the context is set to use.

Note that only the operations sets that listed in
'avail_operations' file are valid inputs.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/intervals/sample_us
Date: Mar 2022
@@ -343,6 +343,11 @@ Admin can request writeback of those idle pages at right timing via::
With the command, zram will writeback idle pages from memory to the storage.

Additionally, if a user choose to writeback only huge and idle pages
this can be accomplished with::

    echo huge_idle > /sys/block/zramX/writeback

If an admin wants to write a specific page in zram device to the backing device,
they could write a page index into the interface.
@@ -1208,6 +1208,34 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.

memory.reclaim
A write-only nested-keyed file which exists for all cgroups.

This is a simple interface to trigger memory reclaim in the
target cgroup.

This file accepts a single key, the number of bytes to reclaim.
No nested keys are currently supported.

Example::

    echo "1G" > memory.reclaim

The interface can be later extended with nested keys to
configure the reclaim behavior. For example, specify the
type of memory to reclaim from (anon, file, ..).

Please note that the kernel can over or under reclaim from
the target cgroup. If less bytes are reclaimed than the
specified amount, -EAGAIN is returned.
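
The write-and-check pattern described for ``memory.reclaim`` could be driven from a script roughly as follows. This is a minimal sketch, not part of the kernel documentation: the helper name ``request_reclaim`` and the cgroup directory passed to it are assumptions made for illustration.

```python
import errno
import os


def request_reclaim(cgroup_dir, amount):
    """Ask the kernel to reclaim `amount` (for example "1G") from a cgroup.

    `cgroup_dir` is a hypothetical path to a cgroup v2 directory.
    Returns True if the write succeeded and False if the kernel answered
    with EAGAIN, i.e. fewer bytes were reclaimed than requested.
    Any other error is re-raised.
    """
    path = os.path.join(cgroup_dir, "memory.reclaim")
    try:
        # The file is write-only; the error (if any) surfaces on write/close.
        with open(path, "w") as f:
            f.write(amount)
    except OSError as e:
        if e.errno == errno.EAGAIN:
            return False
        raise
    return True
```

A caller would typically retry or lower the requested amount when the helper returns False, since the kernel may under-reclaim.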

memory.peak
A read-only single value file which exists on non-root
cgroups.

The max memory usage recorded for the cgroup and its
descendants since the creation of the cgroup.

memory.oom.group
A read-write single value file which exists on non-root
cgroups. The default value is "0".
@@ -1326,6 +1354,12 @@ PAGE_SIZE multiple when read back.
Amount of cached filesystem data that is swap-backed,
such as tmpfs, shm segments, shared anonymous mmap()s

zswap
Amount of memory consumed by the zswap compression backend.

zswapped
Amount of application memory swapped out to zswap.

file_mapped
Amount of cached filesystem data mapped with mmap()
@@ -1516,6 +1550,21 @@ PAGE_SIZE multiple when read back.
higher than the limit for an extended period of time. This
reduces the impact on the workload and memory management.

memory.zswap.current
A read-only single value file which exists on non-root
cgroups.

The total amount of memory consumed by the zswap compression
backend.

memory.zswap.max
A read-write single value file which exists on non-root
cgroups. The default is "max".

Zswap usage hard limit. If a cgroup's zswap pool reaches this
limit, it will refuse to take any more stores before existing
entries fault back in or are written out to disk.
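
To see how close a cgroup is to the ``memory.zswap.max`` hard limit, the two files above can be read together. The sketch below is only an illustration: the function names and the cgroup directory argument are assumptions, and the value ``max`` is treated as "no limit".

```python
import os


def read_limit(text):
    """Parse a limit value; the keyword "max" means unlimited (None)."""
    text = text.strip()
    return None if text == "max" else int(text)


def zswap_headroom(cgroup_dir):
    """Return the bytes left before memory.zswap.max is reached.

    Returns None when the limit is "max". `cgroup_dir` is a hypothetical
    path to a non-root cgroup v2 directory.
    """
    with open(os.path.join(cgroup_dir, "memory.zswap.current")) as f:
        current = int(f.read().strip())
    with open(os.path.join(cgroup_dir, "memory.zswap.max")) as f:
        limit = read_limit(f.read())
    if limit is None:
        return None
    # Clamp at zero: usage can sit at (or briefly above) the limit.
    return max(limit - current, 0)
```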

memory.pressure
A read-only nested-keyed file.
@@ -1705,16 +1705,16 @@
boot-time allocation of gigantic hugepages is skipped.

hugetlb_free_vmemmap=
[KNL] Requires CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
[KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
enabled.
Allows heavy hugetlb users to free up some more
memory (7 * PAGE_SIZE for each 2MB hugetlb page).
Format: { on | off (default) }
Format: { [oO][Nn]/Y/y/1 | [oO][Ff]/N/n/0 (default) }

on: enable the feature
off: disable the feature
[oO][Nn]/Y/y/1: enable the feature
[oO][Ff]/N/n/0: disable the feature

Built with CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON=y,
Built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y,
the default is on.

This is not compatible with memory_hotplug.memmap_on_memory.
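
The new ``Format`` line for ``hugetlb_free_vmemmap=`` can be read as two character patterns. The sketch below mirrors that documented format in user space; it is not the kernel's parser. It assumes the patterns are matched as prefixes (so that ``off`` matches ``[oO][Ff]``), and the function name is made up for illustration.

```python
import re

# Patterns taken from the documented Format line.
ENABLE = re.compile(r"[oO][nN]|[Yy1]")
DISABLE = re.compile(r"[oO][fF]|[Nn0]")


def parse_hugetlb_free_vmemmap(value, default=False):
    """Return True/False for a hugetlb_free_vmemmap= value.

    `default` models the build-time default: False normally, True when the
    kernel is built with the *_DEFAULT_ON option. Unrecognised input falls
    back to the default (an assumption of this sketch).
    """
    if ENABLE.match(value):
        return True
    if DISABLE.match(value):
        return False
    return default
```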
@@ -66,6 +66,17 @@ Setting it as ``N`` disables DAMON_RECLAIM. Note that DAMON_RECLAIM could do
no real monitoring and reclamation due to the watermarks-based activation
condition. Refer to below descriptions for the watermarks parameter for this.

commit_inputs
-------------

Make DAMON_RECLAIM reads the input parameters again, except ``enabled``.

Input parameters that updated while DAMON_RECLAIM is running are not applied
by default. Once this parameter is set as ``Y``, DAMON_RECLAIM reads values
of parameters except ``enabled`` again. Once the re-reading is done, this
parameter is set as ``N``. If invalid parameters are found while the
re-reading, DAMON_RECLAIM will be disabled.
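
The update-then-commit sequence for ``commit_inputs`` might be scripted as in the sketch below. The helper name and the parameters directory are assumptions for illustration; on a running system the module parameters are assumed to live under ``/sys/module/damon_reclaim/parameters``.

```python
import os


def update_damon_reclaim(params_dir, updates):
    """Write new parameter values, then ask DAMON_RECLAIM to re-read them.

    `params_dir` is a hypothetical path to the DAMON_RECLAIM parameter
    files. `enabled` is rejected because it is not re-read on commit, and
    `commit_inputs` is written last by this helper itself.
    """
    for name, value in updates.items():
        if name in ("enabled", "commit_inputs"):
            raise ValueError(f"{name} cannot be updated through commit_inputs")
        with open(os.path.join(params_dir, name), "w") as f:
            f.write(str(value))
    # Setting commit_inputs to Y triggers the re-read; the kernel sets it
    # back to N once the re-reading is done.
    with open(os.path.join(params_dir, "commit_inputs"), "w") as f:
        f.write("Y")
```

Because invalid values disable DAMON_RECLAIM, a caller may want to validate the inputs before committing them.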

min_age
-------
@@ -68,7 +68,7 @@ comma (","). ::
│ kdamonds/nr_kdamonds
│ │ 0/state,pid
│ │ │ contexts/nr_contexts
│ │ │ │ 0/operations
│ │ │ │ 0/avail_operations,operations
│ │ │ │ │ monitoring_attrs/
│ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
│ │ │ │ │ │ nr_regions/min,max
@@ -121,10 +121,11 @@ In each kdamond directory, two files (``state`` and ``pid``) and one directory
Reading ``state`` returns ``on`` if the kdamond is currently running, or
``off`` if it is not running. Writing ``on`` or ``off`` makes the kdamond be
in the state. Writing ``update_schemes_stats`` to ``state`` file updates the
contents of stats files for each DAMON-based operation scheme of the kdamond.
For details of the stats, please refer to :ref:`stats section
<sysfs_schemes_stats>`.
in the state. Writing ``commit`` to the ``state`` file makes kdamond reads the
user inputs in the sysfs files except ``state`` file again. Writing
``update_schemes_stats`` to ``state`` file updates the contents of stats files
for each DAMON-based operation scheme of the kdamond. For details of the
stats, please refer to :ref:`stats section <sysfs_schemes_stats>`.

If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
@@ -143,17 +144,28 @@ be written to the file.
contexts/<N>/
-------------

In each context directory, one file (``operations``) and three directories
(``monitoring_attrs``, ``targets``, and ``schemes``) exist.
In each context directory, two files (``avail_operations`` and ``operations``)
and three directories (``monitoring_attrs``, ``targets``, and ``schemes``)
exist.

DAMON supports multiple types of monitoring operations, including those for
virtual address space and the physical address space. You can set and get what
type of monitoring operations DAMON will use for the context by writing one of
below keywords to, and reading from the file.
virtual address space and the physical address space. You can get the list of
available monitoring operations set on the currently running kernel by reading
``avail_operations`` file. Based on the kernel configuration, the file will
list some or all of below keywords.

- vaddr: Monitor virtual address spaces of specific processes
- fvaddr: Monitor fixed virtual address ranges
- paddr: Monitor the physical address space of the system

Please refer to :ref:`regions sysfs directory <sysfs_regions>` for detailed
differences between the operations sets in terms of the monitoring target
regions.

You can set and get what type of monitoring operations DAMON will use for the
context by writing one of the keywords listed in ``avail_operations`` file and
reading from the ``operations`` file.
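
Since only keywords listed in ``avail_operations`` are valid inputs for ``operations``, a script can check the list before writing. The following is a minimal sketch under stated assumptions: the helper name and the context directory argument are hypothetical, and the keywords in ``avail_operations`` are assumed to be separated by whitespace.

```python
import os


def set_operations(context_dir, wanted):
    """Select a monitoring operations set for a DAMON context.

    `context_dir` is a hypothetical path such as
    .../kdamonds/0/contexts/0. Raises ValueError if `wanted` is not
    offered by the running kernel. Returns the list of available keywords.
    """
    with open(os.path.join(context_dir, "avail_operations")) as f:
        available = f.read().split()
    if wanted not in available:
        raise ValueError(f"{wanted!r} not in avail_operations: {available}")
    with open(os.path.join(context_dir, "operations"), "w") as f:
        f.write(wanted)
    return available
```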

contexts/<N>/monitoring_attrs/
------------------------------
@@ -192,6 +204,8 @@ If you wrote ``vaddr`` to the ``contexts/<N>/operations``, each target should
be a process. You can specify the process to DAMON by writing the pid of the
process to the ``pid_target`` file.

.. _sysfs_regions:

targets/<N>/regions
-------------------
@@ -202,9 +216,10 @@ can be covered. However, users could want to set the initial monitoring region
to specific address ranges.

In contrast, DAMON do not automatically sets and updates the monitoring target
regions when ``paddr`` monitoring operations set is being used (``paddr`` is
written to the ``contexts/<N>/operations``). Therefore, users should set the
monitoring target regions by themselves in the case.
regions when ``fvaddr`` or ``paddr`` monitoring operations sets are being used
(``fvaddr`` or ``paddr`` have written to the ``contexts/<N>/operations``).
Therefore, users should set the monitoring target regions by themselves in the
cases.

For such cases, users can explicitly set the initial monitoring target regions
as they want, by writing proper values to the files under this directory.
@@ -164,7 +164,7 @@ default_hugepagesz
will all result in 256 2M huge pages being allocated. Valid default
huge page size is architecture dependent.
hugetlb_free_vmemmap
When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set, this enables freeing
When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables optimizing
unused vmemmap pages associated with each HugeTLB page.

When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
@@ -184,6 +184,24 @@ The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
be increased accordingly.

Monitoring KSM events
=====================

There are some counters in /proc/vmstat that may be used to monitor KSM events.
KSM might help save memory, it's a tradeoff by may suffering delay on KSM COW
or on swapping in copy. Those events could help users evaluate whether or how
to use KSM. For example, if cow_ksm increases too fast, user may decrease the
range of madvise(, , MADV_MERGEABLE).

cow_ksm
is incremented every time a KSM page triggers copy on write (COW)
when users try to write to a KSM page, we have to make a copy.

ksm_swpin_copy
is incremented every time a KSM page is copied when swapping in
note that KSM page might be copied when swapping in because do_swap_page()
cannot do all the locking needed to reconstitute a cross-anon_vma KSM page.
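
To judge whether ``cow_ksm`` "increases too fast", the two counters can be sampled from ``/proc/vmstat`` and compared over time. The sketch below only parses text in the usual ``name value`` per-line layout; the function names and the idea of a per-second rate are assumptions made for illustration.

```python
def ksm_counters(vmstat_text):
    """Extract the KSM event counters from the text of /proc/vmstat."""
    wanted = ("cow_ksm", "ksm_swpin_copy")
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in wanted:
            counters[parts[0]] = int(parts[1])
    return counters


def cow_rate(before, after, seconds):
    """KSM copy-on-write events per second between two samples."""
    return (after["cow_ksm"] - before["cow_ksm"]) / seconds
```

On a live system the input would come from reading ``/proc/vmstat`` twice, a known number of seconds apart.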

--
Izik Eidus,
Hugh Dickins, 17 Nov 2009
@@ -62,6 +62,7 @@ Currently, these files are in /proc/sys/vm:
- overcommit_memory
- overcommit_ratio
- page-cluster
- page_lock_unfairness
- panic_on_oom
- percpu_pagelist_high_fraction
- stat_interval
@@ -561,6 +562,45 @@ Change the minimum size of the hugepage pool.
See Documentation/admin-guide/mm/hugetlbpage.rst


hugetlb_optimize_vmemmap
========================

This knob is not available when memory_hotplug.memmap_on_memory (kernel parameter)
is configured or the size of 'struct page' (a structure defined in
include/linux/mm_types.h) is not power of two (an unusual system config could
result in this).

Enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap pages
associated with each HugeTLB page.

Once enabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095 pages
per 1GB HugeTLB page), whereas already allocated HugeTLB pages will not be
optimized. When those optimized HugeTLB pages are freed from the HugeTLB pool
to the buddy allocator, the vmemmap pages representing that range needs to be
remapped again and the vmemmap pages discarded earlier need to be reallocated
again. If your use case is that HugeTLB pages are allocated 'on the fly' (e.g.
never explicitly allocating HugeTLB pages with 'nr_hugepages' but only set
'nr_overcommit_hugepages', those overcommitted HugeTLB pages are allocated 'on
the fly') instead of being pulled from the HugeTLB pool, you should weigh the
benefits of memory savings against the more overhead (~2x slower than before)
of allocation or freeing HugeTLB pages between the HugeTLB pool and the buddy
allocator. Another behavior to note is that if the system is under heavy memory
pressure, it could prevent the user from freeing HugeTLB pages from the HugeTLB
pool to the buddy allocator since the allocation of vmemmap pages could be
failed, you have to retry later if your system encounter this situation.

Once disabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
buddy allocator will not be optimized meaning the extra overhead at allocation
time from buddy allocator disappears, whereas already optimized HugeTLB pages
will not be affected. If you want to make sure there are no optimized HugeTLB
pages, you can set "nr_hugepages" to 0 first and then disable this. Note that
writing 0 to nr_hugepages will make any "in use" HugeTLB pages become surplus
pages. So, those surplus pages are still optimized until they are no longer
in use. You would need to wait for those surplus pages to be released before
there are no optimized pages in the system.
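
The figures quoted above (7 pages per 2MB HugeTLB page, 4095 pages per 1GB HugeTLB page) can be reproduced with simple arithmetic. The sketch below assumes 4 KiB base pages, a 64-byte ``struct page``, and that one vmemmap page is retained per HugeTLB page; these values are assumptions chosen because they match the quoted numbers, and the function name is hypothetical.

```python
def vmemmap_pages_saved(hugepage_size, page_size=4096, struct_page_size=64):
    """Number of vmemmap pages that can be freed for one HugeTLB page.

    Assumes one vmemmap page is kept and the rest are released.
    """
    nr_base_pages = hugepage_size // page_size          # struct pages needed
    vmemmap_bytes = nr_base_pages * struct_page_size    # size of their array
    vmemmap_pages = vmemmap_bytes // page_size          # pages backing it
    return vmemmap_pages - 1


two_mb = 2 * 1024 * 1024
one_gb = 1024 * 1024 * 1024
print(vmemmap_pages_saved(two_mb), vmemmap_pages_saved(one_gb))
```

With those assumptions a 2MB page needs 512 descriptors (8 vmemmap pages) and a 1GB page needs 262144 descriptors (4096 vmemmap pages), so keeping one page each leaves 7 and 4095 pages to free.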


nr_hugepages_mempolicy
======================
@@ -754,6 +794,14 @@ extra faults and I/O delays for following faults if they would have been part of
that consecutive pages readahead would have brought in.


page_lock_unfairness
====================

This value determines the number of times that the page lock can be
stolen from under a waiter. After the lock is stolen the number of times
specified in this file (default is 5), the "fair lock handoff" semantics
will apply, and the waiter will only be awakened if the lock can be taken.

panic_on_oom
============
@@ -4,39 +4,76 @@ The Kernel Address Sanitizer (KASAN)
Overview
--------

KernelAddressSANitizer (KASAN) is a dynamic memory safety error detector
designed to find out-of-bound and use-after-free bugs. KASAN has three modes:
Kernel Address Sanitizer (KASAN) is a dynamic memory safety error detector
designed to find out-of-bounds and use-after-free bugs.

1. generic KASAN (similar to userspace ASan),
2. software tag-based KASAN (similar to userspace HWASan),
3. hardware tag-based KASAN (based on hardware memory tagging).
KASAN has three modes:

Generic KASAN is mainly used for debugging due to a large memory overhead.
Software tag-based KASAN can be used for dogfood testing as it has a lower
memory overhead that allows using it with real workloads. Hardware tag-based
KASAN comes with low memory and performance overheads and, therefore, can be
used in production. Either as an in-field memory bug detector or as a security
mitigation.
1. Generic KASAN
2. Software Tag-Based KASAN
3. Hardware Tag-Based KASAN

Software KASAN modes (#1 and #2) use compile-time instrumentation to insert
validity checks before every memory access and, therefore, require a compiler
version that supports that.
Generic KASAN, enabled with CONFIG_KASAN_GENERIC, is the mode intended for
debugging, similar to userspace ASan. This mode is supported on many CPU
architectures, but it has significant performance and memory overheads.

Generic KASAN is supported in GCC and Clang. With GCC, it requires version
8.3.0 or later. Any supported Clang version is compatible, but detection of
out-of-bounds accesses for global variables is only supported since Clang 11.
Software Tag-Based KASAN or SW_TAGS KASAN, enabled with CONFIG_KASAN_SW_TAGS,
can be used for both debugging and dogfood testing, similar to userspace HWASan.
This mode is only supported for arm64, but its moderate memory overhead allows
using it for testing on memory-restricted devices with real workloads.

Software tag-based KASAN mode is only supported in Clang.
Hardware Tag-Based KASAN or HW_TAGS KASAN, enabled with CONFIG_KASAN_HW_TAGS,
is the mode intended to be used as an in-field memory bug detector or as a
security mitigation. This mode only works on arm64 CPUs that support MTE
(Memory Tagging Extension), but it has low memory and performance overheads and
thus can be used in production.

The hardware KASAN mode (#3) relies on hardware to perform the checks but
still requires a compiler version that supports memory tagging instructions.
This mode is supported in GCC 10+ and Clang 12+.
For details about the memory and performance impact of each KASAN mode, see the
descriptions of the corresponding Kconfig options.

Both software KASAN modes work with SLUB and SLAB memory allocators,
while the hardware tag-based KASAN currently only supports SLUB.
The Generic and the Software Tag-Based modes are commonly referred to as the
software modes. The Software Tag-Based and the Hardware Tag-Based modes are
referred to as the tag-based modes.

Currently, generic KASAN is supported for the x86_64, arm, arm64, xtensa, s390,
and riscv architectures, and tag-based KASAN modes are supported only for arm64.
Support
-------

Architectures
~~~~~~~~~~~~~

Generic KASAN is supported on x86_64, arm, arm64, powerpc, riscv, s390, and
xtensa, and the tag-based KASAN modes are supported only on arm64.

Compilers
~~~~~~~~~

Software KASAN modes use compile-time instrumentation to insert validity checks
before every memory access and thus require a compiler version that provides
support for that. The Hardware Tag-Based mode relies on hardware to perform
these checks but still requires a compiler version that supports the memory
tagging instructions.

Generic KASAN requires GCC version 8.3.0 or later
or any Clang version supported by the kernel.

Software Tag-Based KASAN requires GCC 11+
or any Clang version supported by the kernel.

Hardware Tag-Based KASAN requires GCC 10+ or Clang 12+.

Memory types
~~~~~~~~~~~~

Generic KASAN supports finding bugs in all of slab, page_alloc, vmap, vmalloc,
stack, and global memory.

Software Tag-Based KASAN supports slab, page_alloc, vmalloc, and stack memory.

Hardware Tag-Based KASAN supports slab, page_alloc, and non-executable vmalloc
memory.

For slab, both software KASAN modes support SLUB and SLAB allocators, while
Hardware Tag-Based KASAN only supports SLUB.

Usage
-----
@@ -45,18 +82,59 @@ To enable KASAN, configure the kernel with::
    CONFIG_KASAN=y

and choose between ``CONFIG_KASAN_GENERIC`` (to enable generic KASAN),
``CONFIG_KASAN_SW_TAGS`` (to enable software tag-based KASAN), and
``CONFIG_KASAN_HW_TAGS`` (to enable hardware tag-based KASAN).
and choose between ``CONFIG_KASAN_GENERIC`` (to enable Generic KASAN),
``CONFIG_KASAN_SW_TAGS`` (to enable Software Tag-Based KASAN), and
``CONFIG_KASAN_HW_TAGS`` (to enable Hardware Tag-Based KASAN).

For software modes, also choose between ``CONFIG_KASAN_OUTLINE`` and
For the software modes, also choose between ``CONFIG_KASAN_OUTLINE`` and
``CONFIG_KASAN_INLINE``. Outline and inline are compiler instrumentation types.
The former produces a smaller binary while the latter is 1.1-2 times faster.
The former produces a smaller binary while the latter is up to 2 times faster.

To include alloc and free stack traces of affected slab objects into reports,
enable ``CONFIG_STACKTRACE``. To include alloc and free stack traces of affected
physical pages, enable ``CONFIG_PAGE_OWNER`` and boot with ``page_owner=on``.

Boot parameters
~~~~~~~~~~~~~~~

KASAN is affected by the generic ``panic_on_warn`` command line parameter.
When it is enabled, KASAN panics the kernel after printing a bug report.

By default, KASAN prints a bug report only for the first invalid memory access.
With ``kasan_multi_shot``, KASAN prints a report on every invalid access. This
effectively disables ``panic_on_warn`` for KASAN reports.

Alternatively, independent of ``panic_on_warn``, the ``kasan.fault=`` boot
parameter can be used to control panic and reporting behaviour:

- ``kasan.fault=report`` or ``=panic`` controls whether to only print a KASAN
report or also panic the kernel (default: ``report``). The panic happens even
if ``kasan_multi_shot`` is enabled.

Hardware Tag-Based KASAN mode (see the section about various modes below) is
intended for use in production as a security mitigation. Therefore, it supports
additional boot parameters that allow disabling KASAN or controlling features:

- ``kasan=off`` or ``=on`` controls whether KASAN is enabled (default: ``on``).

- ``kasan.mode=sync``, ``=async`` or ``=asymm`` controls whether KASAN
is configured in synchronous, asynchronous or asymmetric mode of
execution (default: ``sync``).
Synchronous mode: a bad access is detected immediately when a tag
check fault occurs.
Asynchronous mode: a bad access detection is delayed. When a tag check
fault occurs, the information is stored in hardware (in the TFSR_EL1
register for arm64). The kernel periodically checks the hardware and
only reports tag faults during these checks.
Asymmetric mode: a bad access is detected synchronously on reads and
asynchronously on writes.

- ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc
allocations (default: ``on``).

- ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack
traces collection (default: ``on``).
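
The boot parameters listed above and their accepted values can be collected into a small table, which makes it easy to validate a command line fragment before booting with it. This is a minimal sketch: the table is copied from the list above, while the helper name ``kasan_cmdline`` is an assumption made for illustration.

```python
# Parameter names and values as listed in the documentation above.
# kasan.fault applies to all modes; the others are described for the
# Hardware Tag-Based mode.
ALLOWED = {
    "kasan": ("off", "on"),
    "kasan.fault": ("report", "panic"),
    "kasan.mode": ("sync", "async", "asymm"),
    "kasan.vmalloc": ("off", "on"),
    "kasan.stacktrace": ("off", "on"),
}


def kasan_cmdline(options):
    """Build a kernel command line fragment from a {name: value} dict.

    Raises ValueError for unknown parameters or unsupported values.
    """
    parts = []
    for name, value in options.items():
        if name not in ALLOWED:
            raise ValueError(f"unknown KASAN parameter: {name}")
        if value not in ALLOWED[name]:
            raise ValueError(f"{name} does not accept {value!r}")
        parts.append(f"{name}={value}")
    return " ".join(parts)
```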

Error reports
~~~~~~~~~~~~~
@@ -146,7 +224,7 @@ is either 8 or 16 aligned bytes depending on KASAN mode. Each number in the
memory state section of the report shows the state of one of the memory
granules that surround the accessed address.

For generic KASAN, the size of each memory granule is 8. The state of each
For Generic KASAN, the size of each memory granule is 8. The state of each
granule is encoded in one shadow byte. Those 8 bytes can be accessible,
partially accessible, freed, or be a part of a redzone. KASAN uses the following
encoding for each shadow byte: 00 means that all 8 bytes of the corresponding
@@ -171,47 +249,6 @@ traces point to places in code that interacted with the object but that are not
directly present in the bad access stack trace. Currently, this includes
call_rcu() and workqueue queuing.

Boot parameters
~~~~~~~~~~~~~~~

KASAN is affected by the generic ``panic_on_warn`` command line parameter.
When it is enabled, KASAN panics the kernel after printing a bug report.

By default, KASAN prints a bug report only for the first invalid memory access.
With ``kasan_multi_shot``, KASAN prints a report on every invalid access. This
effectively disables ``panic_on_warn`` for KASAN reports.

Alternatively, independent of ``panic_on_warn`` the ``kasan.fault=`` boot
parameter can be used to control panic and reporting behaviour:

- ``kasan.fault=report`` or ``=panic`` controls whether to only print a KASAN
report or also panic the kernel (default: ``report``). The panic happens even
if ``kasan_multi_shot`` is enabled.

Hardware tag-based KASAN mode (see the section about various modes below) is
intended for use in production as a security mitigation. Therefore, it supports
additional boot parameters that allow disabling KASAN or controlling features:

- ``kasan=off`` or ``=on`` controls whether KASAN is enabled (default: ``on``).

- ``kasan.mode=sync``, ``=async`` or ``=asymm`` controls whether KASAN
is configured in synchronous, asynchronous or asymmetric mode of
execution (default: ``sync``).
Synchronous mode: a bad access is detected immediately when a tag
check fault occurs.
Asynchronous mode: a bad access detection is delayed. When a tag check
fault occurs, the information is stored in hardware (in the TFSR_EL1
register for arm64). The kernel periodically checks the hardware and
only reports tag faults during these checks.
Asymmetric mode: a bad access is detected synchronously on reads and
asynchronously on writes.

- ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc
allocations (default: ``on``).

- ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack
traces collection (default: ``on``).

Implementation details
----------------------
@@ -250,49 +287,46 @@ outline-instrumented kernel.
Generic KASAN is the only mode that delays the reuse of freed objects via
quarantine (see mm/kasan/quarantine.c for implementation).

Software tag-based KASAN
Software Tag-Based KASAN
~~~~~~~~~~~~~~~~~~~~~~~~

Software tag-based KASAN uses a software memory tagging approach to checking
Software Tag-Based KASAN uses a software memory tagging approach to checking
access validity. It is currently only implemented for the arm64 architecture.

Software tag-based KASAN uses the Top Byte Ignore (TBI) feature of arm64 CPUs
Software Tag-Based KASAN uses the Top Byte Ignore (TBI) feature of arm64 CPUs
to store a pointer tag in the top byte of kernel pointers. It uses shadow memory
to store memory tags associated with each 16-byte memory cell (therefore, it
dedicates 1/16th of the kernel memory for shadow memory).

On each memory allocation, software tag-based KASAN generates a random tag, tags
On each memory allocation, Software Tag-Based KASAN generates a random tag, tags
the allocated memory with this tag, and embeds the same tag into the returned
pointer.

Software tag-based KASAN uses compile-time instrumentation to insert checks
Software Tag-Based KASAN uses compile-time instrumentation to insert checks
before each memory access. These checks make sure that the tag of the memory
that is being accessed is equal to the tag of the pointer that is used to access
this memory. In case of a tag mismatch, software tag-based KASAN prints a bug
this memory. In case of a tag mismatch, Software Tag-Based KASAN prints a bug
report.
|
||||
|
||||
Software tag-based KASAN also has two instrumentation modes (outline, which
|
||||
Software Tag-Based KASAN also has two instrumentation modes (outline, which
|
||||
emits callbacks to check memory accesses; and inline, which performs the shadow
|
||||
memory checks inline). With outline instrumentation mode, a bug report is
|
||||
printed from the function that performs the access check. With inline
|
||||
instrumentation, a ``brk`` instruction is emitted by the compiler, and a
|
||||
dedicated ``brk`` handler is used to print bug reports.
|
||||
|
||||
Software tag-based KASAN uses 0xFF as a match-all pointer tag (accesses through
|
||||
Software Tag-Based KASAN uses 0xFF as a match-all pointer tag (accesses through
|
||||
pointers with the 0xFF pointer tag are not checked). The value 0xFE is currently
|
||||
reserved to tag freed memory regions.
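
The tag comparison described above can be modelled in userspace. The following
is a minimal sketch under stated assumptions, not the kernel implementation:
``struct tag_model`` and the helpers ``untag()``, ``set_tag()``, ``get_tag()``,
``cell_of()``, ``tag_range()`` and ``access_ok()`` are hypothetical names, and
addresses are simplified so that an untagged address has a zero top byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TAG_SHIFT	56	/* the pointer tag lives in the top byte */
#define CELL_SHIFT	4	/* one shadow byte per 16-byte memory cell */
#define TAG_MATCH_ALL	0xFF	/* pointers with this tag are never checked */
#define TAG_FREED	0xFE	/* reserved for freed memory regions */

/* Hypothetical model of one tagged region; not a kernel structure. */
struct tag_model {
	uint64_t base;		/* untagged start address of the region */
	size_t cells;		/* number of 16-byte cells covered */
	uint8_t *shadow;	/* one memory tag per cell */
};

/* Clear the top byte (simplification: untagged addresses have tag 0). */
static uint64_t untag(uint64_t ptr)
{
	return ptr & ~(0xFFull << TAG_SHIFT);
}

/* Embed a tag into the top byte of a pointer value. */
static uint64_t set_tag(uint64_t ptr, uint8_t tag)
{
	return untag(ptr) | ((uint64_t)tag << TAG_SHIFT);
}

static uint8_t get_tag(uint64_t ptr)
{
	return (uint8_t)(ptr >> TAG_SHIFT);
}

/* Index of the shadow byte that covers the address held in ptr. */
static size_t cell_of(const struct tag_model *m, uint64_t ptr)
{
	size_t cell = (size_t)((untag(ptr) - m->base) >> CELL_SHIFT);

	assert(cell < m->cells);
	return cell;
}

/* Store tag in the shadow byte of every cell touched by [ptr, ptr + size). */
static void tag_range(struct tag_model *m, uint64_t ptr, size_t size,
		      uint8_t tag)
{
	size_t cell, last;

	assert(size > 0);
	last = cell_of(m, ptr + size - 1);
	for (cell = cell_of(m, ptr); cell <= last; cell++)
		m->shadow[cell] = tag;
}

/* The comparison that instrumentation would perform before an access. */
static bool access_ok(const struct tag_model *m, uint64_t ptr)
{
	uint8_t tag = get_tag(ptr);

	if (tag == TAG_MATCH_ALL)
		return true;
	return m->shadow[cell_of(m, ptr)] == tag;
}
```

In this model, "allocating" means calling ``tag_range()`` with a fresh tag and
handing out ``set_tag(addr, tag)``; "freeing" means retagging the range with
``TAG_FREED``, after which the stale pointer no longer passes ``access_ok()``.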
|
||||
|
||||
Software tag-based KASAN currently only supports tagging of slab, page_alloc,
|
||||
and vmalloc memory.
|
||||
|
||||
Hardware tag-based KASAN
|
||||
Hardware Tag-Based KASAN
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Hardware tag-based KASAN is similar to the software mode in concept but uses
|
||||
Hardware Tag-Based KASAN is similar to the software mode in concept but uses
|
||||
hardware memory tagging support instead of compiler instrumentation and
|
||||
shadow memory.
|
||||
|
||||
Hardware tag-based KASAN is currently only implemented for arm64 architecture
|
||||
Hardware Tag-Based KASAN is currently only implemented for the arm64
architecture and is based on both the arm64 Memory Tagging Extension (MTE),
introduced in the ARMv8.5 Instruction Set Architecture, and Top Byte Ignore (TBI).
|
||||
|
||||
@@ -302,21 +336,18 @@ access, hardware makes sure that the tag of the memory that is being accessed is
|
||||
equal to the tag of the pointer that is used to access this memory. In case of a
|
||||
tag mismatch, a fault is generated, and a report is printed.
|
||||
|
||||
Hardware tag-based KASAN uses 0xFF as a match-all pointer tag (accesses through
|
||||
Hardware Tag-Based KASAN uses 0xFF as a match-all pointer tag (accesses through
|
||||
pointers with the 0xFF pointer tag are not checked). The value 0xFE is currently
|
||||
reserved to tag freed memory regions.
|
||||
|
||||
Hardware tag-based KASAN currently only supports tagging of slab, page_alloc,
|
||||
and VM_ALLOC-based vmalloc memory.
|
||||
|
||||
If the hardware does not support MTE (pre ARMv8.5), hardware tag-based KASAN
|
||||
If the hardware does not support MTE (pre ARMv8.5), Hardware Tag-Based KASAN
|
||||
will not be enabled. In this case, all KASAN boot parameters are ignored.
|
||||
|
||||
Note that enabling CONFIG_KASAN_HW_TAGS always results in in-kernel TBI being
enabled, even when ``kasan=off`` is provided or when the hardware does not
support MTE (but supports TBI).
|
||||
|
||||
Hardware tag-based KASAN only reports the first found bug. After that, MTE tag
|
||||
Hardware Tag-Based KASAN only reports the first found bug. After that, MTE tag
|
||||
checking gets disabled.
|
||||
|
||||
Shadow memory
|
||||
@@ -414,19 +445,18 @@ generic ``noinstr`` one.
|
||||
Note that disabling compiler instrumentation (either on a per-file or a
|
||||
per-function basis) makes KASAN ignore the accesses that happen directly in
|
||||
that code for software KASAN modes. It does not help when the accesses happen
|
||||
indirectly (through calls to instrumented functions) or with the hardware
|
||||
tag-based mode that does not use compiler instrumentation.
|
||||
indirectly (through calls to instrumented functions) or with Hardware
|
||||
Tag-Based KASAN, which does not use compiler instrumentation.
|
||||
|
||||
For software KASAN modes, to disable KASAN reports in a part of the kernel code
|
||||
for the current task, annotate this part of the code with a
|
||||
``kasan_disable_current()``/``kasan_enable_current()`` section. This also
|
||||
disables the reports for indirect accesses that happen through function calls.
|
||||
|
||||
For tag-based KASAN modes (include the hardware one), to disable access
|
||||
checking, use ``kasan_reset_tag()`` or ``page_kasan_tag_reset()``. Note that
|
||||
temporarily disabling access checking via ``page_kasan_tag_reset()`` requires
|
||||
saving and restoring the per-page KASAN tag via
|
||||
``page_kasan_tag``/``page_kasan_tag_set``.
|
||||
For tag-based KASAN modes, to disable access checking, use
|
||||
``kasan_reset_tag()`` or ``page_kasan_tag_reset()``. Note that temporarily
|
||||
disabling access checking via ``page_kasan_tag_reset()`` requires saving and
|
||||
restoring the per-page KASAN tag via ``page_kasan_tag``/``page_kasan_tag_set``.
|
||||
|
||||
Tests
|
||||
~~~~~
|
||||
|
||||
@@ -258,8 +258,9 @@ prototypes::
|
||||
int (*launder_folio)(struct folio *);
|
||||
bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
|
||||
int (*error_remove_page)(struct address_space *, struct page *);
|
||||
int (*swap_activate)(struct file *);
|
||||
int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span);
|
||||
int (*swap_deactivate)(struct file *);
|
||||
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
|
||||
|
||||
locking rules:
|
||||
All except dirty_folio and free_folio may block
|
||||
@@ -287,6 +288,7 @@ is_partially_uptodate: yes
|
||||
error_remove_page: yes
|
||||
swap_activate: no
|
||||
swap_deactivate: no
|
||||
swap_rw: yes, unlocks
|
||||
====================== ======================== ========= ===============
|
||||
|
||||
->write_begin(), ->write_end() and ->read_folio() may be called from
|
||||
@@ -386,15 +388,19 @@ cleaned, or an error value if not. Note that in order to prevent the folio
|
||||
getting mapped back in and redirtied, it needs to be kept locked
|
||||
across the entire operation.
|
||||
|
||||
->swap_activate will be called with a non-zero argument on
|
||||
files backing (non block device backed) swapfiles. A return value
|
||||
of zero indicates success, in which case this file can be used for
|
||||
backing swapspace. The swapspace operations will be proxied to the
|
||||
address space operations.
|
||||
->swap_activate() will be called to prepare the given file for swap. It
|
||||
should perform any validation and preparation necessary to ensure that
|
||||
writes can be performed with minimal memory allocation. It should call
|
||||
add_swap_extent(), or the helper iomap_swapfile_activate(), and return
|
||||
the number of extents added. If IO should be submitted through
|
||||
->swap_rw(), it should set SWP_FS_OPS, otherwise IO will be submitted
|
||||
directly to the block device ``sis->bdev``.
|
||||
|
||||
->swap_deactivate() will be called in the sys_swapoff()
|
||||
path after ->swap_activate() returned success.
|
||||
|
||||
->swap_rw will be called for swap IO if SWP_FS_OPS was set by ->swap_activate().
|
||||
|
||||
file_lock_operations
|
||||
====================
|
||||
|
||||
|
||||
@@ -942,56 +942,73 @@ can be substantial. In many cases there are other means to find out
|
||||
additional memory using subsystem specific interfaces, for instance
|
||||
/proc/net/sockstat for TCP memory allocations.
|
||||
|
||||
The following is from a 16GB PIII, which has highmem enabled.
|
||||
You may not have all of these fields.
|
||||
Example output. You may not have all of these fields.
|
||||
|
||||
::
|
||||
|
||||
> cat /proc/meminfo
|
||||
|
||||
MemTotal: 16344972 kB
|
||||
MemFree: 13634064 kB
|
||||
MemAvailable: 14836172 kB
|
||||
Buffers: 3656 kB
|
||||
Cached: 1195708 kB
|
||||
SwapCached: 0 kB
|
||||
Active: 891636 kB
|
||||
Inactive: 1077224 kB
|
||||
HighTotal: 15597528 kB
|
||||
HighFree: 13629632 kB
|
||||
LowTotal: 747444 kB
|
||||
LowFree: 4432 kB
|
||||
SwapTotal: 0 kB
|
||||
SwapFree: 0 kB
|
||||
Dirty: 968 kB
|
||||
Writeback: 0 kB
|
||||
AnonPages: 861800 kB
|
||||
Mapped: 280372 kB
|
||||
Shmem: 644 kB
|
||||
KReclaimable: 168048 kB
|
||||
Slab: 284364 kB
|
||||
SReclaimable: 159856 kB
|
||||
SUnreclaim: 124508 kB
|
||||
PageTables: 24448 kB
|
||||
NFS_Unstable: 0 kB
|
||||
Bounce: 0 kB
|
||||
WritebackTmp: 0 kB
|
||||
CommitLimit: 7669796 kB
|
||||
Committed_AS: 100056 kB
|
||||
VmallocTotal: 112216 kB
|
||||
VmallocUsed: 428 kB
|
||||
VmallocChunk: 111088 kB
|
||||
Percpu: 62080 kB
|
||||
HardwareCorrupted: 0 kB
|
||||
AnonHugePages: 49152 kB
|
||||
ShmemHugePages: 0 kB
|
||||
ShmemPmdMapped: 0 kB
|
||||
MemTotal: 32858820 kB
|
||||
MemFree: 21001236 kB
|
||||
MemAvailable: 27214312 kB
|
||||
Buffers: 581092 kB
|
||||
Cached: 5587612 kB
|
||||
SwapCached: 0 kB
|
||||
Active: 3237152 kB
|
||||
Inactive: 7586256 kB
|
||||
Active(anon): 94064 kB
|
||||
Inactive(anon): 4570616 kB
|
||||
Active(file): 3143088 kB
|
||||
Inactive(file): 3015640 kB
|
||||
Unevictable: 0 kB
|
||||
Mlocked: 0 kB
|
||||
SwapTotal: 0 kB
|
||||
SwapFree: 0 kB
|
||||
Zswap: 1904 kB
|
||||
Zswapped: 7792 kB
|
||||
Dirty: 12 kB
|
||||
Writeback: 0 kB
|
||||
AnonPages: 4654780 kB
|
||||
Mapped: 266244 kB
|
||||
Shmem: 9976 kB
|
||||
KReclaimable: 517708 kB
|
||||
Slab: 660044 kB
|
||||
SReclaimable: 517708 kB
|
||||
SUnreclaim: 142336 kB
|
||||
KernelStack: 11168 kB
|
||||
PageTables: 20540 kB
|
||||
NFS_Unstable: 0 kB
|
||||
Bounce: 0 kB
|
||||
WritebackTmp: 0 kB
|
||||
CommitLimit: 16429408 kB
|
||||
Committed_AS: 7715148 kB
|
||||
VmallocTotal: 34359738367 kB
|
||||
VmallocUsed: 40444 kB
|
||||
VmallocChunk: 0 kB
|
||||
Percpu: 29312 kB
|
||||
HardwareCorrupted: 0 kB
|
||||
AnonHugePages: 4149248 kB
|
||||
ShmemHugePages: 0 kB
|
||||
ShmemPmdMapped: 0 kB
|
||||
FileHugePages: 0 kB
|
||||
FilePmdMapped: 0 kB
|
||||
CmaTotal: 0 kB
|
||||
CmaFree: 0 kB
|
||||
HugePages_Total: 0
|
||||
HugePages_Free: 0
|
||||
HugePages_Rsvd: 0
|
||||
HugePages_Surp: 0
|
||||
Hugepagesize: 2048 kB
|
||||
Hugetlb: 0 kB
|
||||
DirectMap4k: 401152 kB
|
||||
DirectMap2M: 10008576 kB
|
||||
DirectMap1G: 24117248 kB
|
||||
|
||||
MemTotal
|
||||
Total usable RAM (i.e. physical RAM minus a few reserved
|
||||
bits and the kernel binary code)
|
||||
MemFree
|
||||
The sum of LowFree+HighFree
|
||||
Total free RAM. On highmem systems, the sum of LowFree+HighFree
|
||||
MemAvailable
|
||||
An estimate of how much memory is available for starting new
|
||||
applications, without swapping. Calculated from MemFree,
|
||||
@@ -1005,8 +1022,9 @@ Buffers
|
||||
Relatively temporary storage for raw disk blocks
|
||||
shouldn't get tremendously large (20MB or so)
|
||||
Cached
|
||||
in-memory cache for files read from the disk (the
|
||||
pagecache). Doesn't include SwapCached
|
||||
In-memory cache for files read from the disk (the
|
||||
pagecache) as well as tmpfs & shmem.
|
||||
Doesn't include SwapCached.
|
||||
SwapCached
|
||||
Memory that once was swapped out, is swapped back in but
|
||||
still also is in the swapfile (if memory is needed it
|
||||
@@ -1018,6 +1036,11 @@ Active
|
||||
Inactive
|
||||
Memory which has been less recently used. It is more
|
||||
eligible to be reclaimed for other purposes
|
||||
Unevictable
|
||||
Memory allocated for userspace which cannot be reclaimed, such
|
||||
as mlocked pages, ramfs backing pages, secret memfd pages etc.
|
||||
Mlocked
|
||||
Memory locked with mlock().
|
||||
HighTotal, HighFree
|
||||
Highmem is all memory above ~860MB of physical memory.
|
||||
Highmem areas are for use by userspace programs, or
|
||||
@@ -1034,26 +1057,20 @@ SwapTotal
|
||||
SwapFree
|
||||
Memory which has been evicted from RAM, and is temporarily
|
||||
on the disk
|
||||
Zswap
|
||||
Memory consumed by the zswap backend (compressed size)
|
||||
Zswapped
|
||||
Amount of anonymous memory stored in zswap (original size)
|
||||
Dirty
|
||||
Memory which is waiting to get written back to the disk
|
||||
Writeback
|
||||
Memory which is actively being written back to the disk
|
||||
AnonPages
|
||||
Non-file backed pages mapped into userspace page tables
|
||||
HardwareCorrupted
|
||||
The amount of RAM/memory in KB, the kernel identifies as
|
||||
corrupted.
|
||||
AnonHugePages
|
||||
Non-file backed huge pages mapped into userspace page tables
|
||||
Mapped
|
||||
files which have been mmaped, such as libraries
|
||||
Shmem
|
||||
Total memory used by shared memory (shmem) and tmpfs
|
||||
ShmemHugePages
|
||||
Memory used by shared memory (shmem) and tmpfs allocated
|
||||
with huge pages
|
||||
ShmemPmdMapped
|
||||
Shared memory mapped into userspace with huge pages
|
||||
KReclaimable
|
||||
Kernel allocations that the kernel will attempt to reclaim
|
||||
under memory pressure. Includes SReclaimable (below), and other
|
||||
@@ -1064,9 +1081,10 @@ SReclaimable
|
||||
Part of Slab, that might be reclaimed, such as caches
|
||||
SUnreclaim
|
||||
Part of Slab, that cannot be reclaimed on memory pressure
|
||||
KernelStack
|
||||
Memory consumed by the kernel stacks of all tasks
|
||||
PageTables
|
||||
amount of memory dedicated to the lowest level of page
|
||||
tables.
|
||||
Memory consumed by userspace page tables
|
||||
NFS_Unstable
|
||||
Always zero. Previously counted pages which had been written to
the server, but had not been committed to stable storage.
|
||||
@@ -1098,7 +1116,7 @@ Committed_AS
|
||||
has been allocated by processes, even if it has not been
|
||||
"used" by them as of yet. A process which malloc()'s 1G
|
||||
of memory, but only touches 300M of it will show up as
|
||||
using 1G. This 1G is memory which has been "committed" to
|
||||
using 1G. This 1G is memory which has been "committed" to
|
||||
by the VM and can be used at any time by the allocating
|
||||
application. With strict overcommit enabled on the system
|
||||
(mode 2 in 'vm.overcommit_memory'), allocations which would
|
||||
@@ -1107,7 +1125,7 @@ Committed_AS
|
||||
not fail due to lack of memory once that memory has been
|
||||
successfully allocated.
|
||||
VmallocTotal
|
||||
total size of vmalloc memory area
|
||||
total size of vmalloc virtual address space
|
||||
VmallocUsed
|
||||
amount of vmalloc area which is used
|
||||
VmallocChunk
|
||||
@@ -1115,6 +1133,30 @@ VmallocChunk
|
||||
Percpu
|
||||
Memory allocated to the percpu allocator used to back percpu
|
||||
allocations. This stat excludes the cost of metadata.
|
||||
HardwareCorrupted
|
||||
The amount of RAM/memory in KB, the kernel identifies as
|
||||
corrupted.
|
||||
AnonHugePages
|
||||
Non-file backed huge pages mapped into userspace page tables
|
||||
ShmemHugePages
|
||||
Memory used by shared memory (shmem) and tmpfs allocated
|
||||
with huge pages
|
||||
ShmemPmdMapped
|
||||
Shared memory mapped into userspace with huge pages
|
||||
FileHugePages
|
||||
Memory used for filesystem data (page cache) allocated
|
||||
with huge pages
|
||||
FilePmdMapped
|
||||
Page cache mapped into userspace with huge pages
|
||||
CmaTotal
|
||||
Memory reserved for the Contiguous Memory Allocator (CMA)
|
||||
CmaFree
|
||||
Free remaining memory in the CMA reserves
|
||||
HugePages_Total, HugePages_Free, HugePages_Rsvd, HugePages_Surp, Hugepagesize, Hugetlb
|
||||
See Documentation/admin-guide/mm/hugetlbpage.rst.
|
||||
DirectMap4k, DirectMap2M, DirectMap1G
|
||||
Breakdown of page table sizes used in the kernel's
|
||||
identity mapping of RAM
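
Each field described above appears in /proc/meminfo as a ``Name: value kB``
line (the HugePages_* counters carry no unit). The following is a minimal
userspace sketch for extracting one field, assuming that layout;
``meminfo_field()`` is a hypothetical helper, not an existing interface.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one /proc/meminfo line such as "MemFree:  21001236 kB".
 *
 * Returns 1 and stores the number in *value if the field name of the
 * line is exactly `key`, 0 otherwise. The unit suffix, if any, is
 * ignored, so the caller has to know whether the field is in kB.
 */
static int meminfo_field(const char *line, const char *key,
			 unsigned long long *value)
{
	size_t klen = strlen(key);

	/* The name must be followed directly by ':' to avoid prefix matches. */
	if (strncmp(line, key, klen) != 0 || line[klen] != ':')
		return 0;

	/* %llu skips the padding between the colon and the number. */
	return sscanf(line + klen + 1, "%llu", value) == 1;
}
```

A caller would typically read /proc/meminfo line by line with ``fgets()`` and
pass each line to the helper until it returns 1. Requiring the colon right
after the name keeps ``Active`` from matching ``Active(anon)``.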
|
||||
|
||||
vmallocinfo
|
||||
~~~~~~~~~~~
|
||||
|
||||
@@ -749,8 +749,9 @@ cache in your filesystem. The following members are defined:
|
||||
size_t count);
|
||||
void (*is_dirty_writeback)(struct folio *, bool *, bool *);
|
||||
int (*error_remove_page) (struct mapping *mapping, struct page *page);
|
||||
int (*swap_activate)(struct file *);
|
||||
int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span);
|
||||
int (*swap_deactivate)(struct file *);
|
||||
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
|
||||
};
|
||||
|
||||
``writepage``
|
||||
@@ -948,15 +949,21 @@ cache in your filesystem. The following members are defined:
|
||||
unless you have them locked or reference counts increased.
|
||||
|
||||
``swap_activate``
|
||||
Called when swapon is used on a file to allocate space if
|
||||
necessary and pin the block lookup information in memory. A
|
||||
return value of zero indicates success, in which case this file
|
||||
can be used to back swapspace.
|
||||
|
||||
Called to prepare the given file for swap. It should perform
|
||||
any validation and preparation necessary to ensure that writes
|
||||
can be performed with minimal memory allocation. It should call
|
||||
add_swap_extent(), or the helper iomap_swapfile_activate(), and
|
||||
return the number of extents added. If IO should be submitted
|
||||
through ->swap_rw(), it should set SWP_FS_OPS, otherwise IO will
|
||||
be submitted directly to the block device ``sis->bdev``.
|
||||
|
||||
``swap_deactivate``
|
||||
Called during swapoff on files where swap_activate was
|
||||
successful.
|
||||
|
||||
``swap_rw``
|
||||
Called to read or write swap pages when SWP_FS_OPS is set.
|
||||
|
||||
The File Object
|
||||
===============
|
||||
|
||||
@@ -50,61 +50,74 @@ space when they use mm context tags.
|
||||
Temporary Virtual Mappings
|
||||
==========================
|
||||
|
||||
The kernel contains several ways of creating temporary mappings:
|
||||
The kernel contains several ways of creating temporary mappings. The following
|
||||
list shows them in order of preference of use.
|
||||
|
||||
* vmap(). This can be used to make a long duration mapping of multiple
|
||||
physical pages into a contiguous virtual space. It needs global
|
||||
synchronization to unmap.
|
||||
* kmap_local_page(). This function is used to acquire short term mappings.
|
||||
It can be invoked from any context (including interrupts) but the mappings
|
||||
can only be used in the context which acquired them.
|
||||
|
||||
* kmap(). This permits a short duration mapping of a single page. It needs
|
||||
global synchronization, but is amortized somewhat. It is also prone to
|
||||
deadlocks when using in a nested fashion, and so it is not recommended for
|
||||
new code.
|
||||
This function should be preferred, where feasible, over all the others.
|
||||
|
||||
These mappings are thread-local and CPU-local, meaning that the mapping
can only be accessed from within this thread and the thread is bound to the
CPU while the mapping is active. Even if the thread is preempted (since
preemption is never disabled by the function) the CPU cannot be
unplugged from the system via CPU-hotplug until the mapping is disposed.
|
||||
|
||||
It's valid to take pagefaults in a local kmap region, unless the context
|
||||
in which the local mapping is acquired does not allow it for other reasons.
|
||||
|
||||
kmap_local_page() always returns a valid virtual address and it is assumed
|
||||
that kunmap_local() will never fail.
|
||||
|
||||
Nesting kmap_local_page() and kmap_atomic() mappings is allowed to a certain
|
||||
extent (up to KMAP_TYPE_NR) but their invocations have to be strictly ordered
|
||||
because the map implementation is stack based. See kmap_local_page() kdocs
|
||||
(included in the "Functions" section) for details on how to manage nested
|
||||
mappings.
|
||||
|
||||
* kmap_atomic(). This permits a very short duration mapping of a single
|
||||
page. Since the mapping is restricted to the CPU that issued it, it
|
||||
performs well, but the issuing task is therefore required to stay on that
|
||||
CPU until it has finished, lest some other task displace its mappings.
|
||||
|
||||
kmap_atomic() may also be used by interrupt contexts, since it is does not
|
||||
sleep and the caller may not sleep until after kunmap_atomic() is called.
|
||||
kmap_atomic() may also be used by interrupt contexts, since it does not
|
||||
sleep and the callers too may not sleep until after kunmap_atomic() is
|
||||
called.
|
||||
|
||||
It may be assumed that k[un]map_atomic() won't fail.
|
||||
Each call of kmap_atomic() in the kernel creates a non-preemptible section
and disables pagefaults. This could be a source of unwanted latency. Therefore
users should prefer kmap_local_page() instead of kmap_atomic().
|
||||
|
||||
It is assumed that k[un]map_atomic() won't fail.
|
||||
|
||||
Using kmap_atomic
|
||||
=================
|
||||
* kmap(). This should be used to make a short duration mapping of a single
|
||||
page with no restrictions on preemption or migration. It comes with an
|
||||
overhead as mapping space is restricted and protected by a global lock
|
||||
for synchronization. When mapping is no longer needed, the address that
|
||||
the page was mapped to must be released with kunmap().
|
||||
|
||||
When and where to use kmap_atomic() is straightforward. It is used when code
|
||||
wants to access the contents of a page that might be allocated from high memory
|
||||
(see __GFP_HIGHMEM), for example a page in the pagecache. The API has two
|
||||
functions, and they can be used in a manner similar to the following::
|
||||
Mapping changes must be propagated across all the CPUs. kmap() also
|
||||
requires global TLB invalidation when the kmap's pool wraps and it might
|
||||
block when the mapping space is fully utilized until a slot becomes
|
||||
available. Therefore, kmap() is only callable from preemptible context.
|
||||
|
||||
/* Find the page of interest. */
|
||||
struct page *page = find_get_page(mapping, offset);
|
||||
All the above work is necessary if a mapping must last for a relatively
|
||||
long time but the bulk of high-memory mappings in the kernel are
|
||||
short-lived and only used in one place. This means that the cost of
|
||||
kmap() is mostly wasted in such cases. kmap() was not intended for long
|
||||
term mappings but it has morphed in that direction and its use is
|
||||
strongly discouraged in newer code and the set of the preceding functions
|
||||
should be preferred.
|
||||
|
||||
/* Gain access to the contents of that page. */
|
||||
void *vaddr = kmap_atomic(page);
|
||||
On 64-bit systems, calls to kmap_local_page(), kmap_atomic() and kmap() have
|
||||
no real work to do because a 64-bit address space is more than sufficient to
|
||||
address all the physical memory whose pages are permanently mapped.
|
||||
|
||||
/* Do something to the contents of that page. */
|
||||
memset(vaddr, 0, PAGE_SIZE);
|
||||
|
||||
/* Unmap that page. */
|
||||
kunmap_atomic(vaddr);
|
||||
|
||||
Note that the kunmap_atomic() call takes the result of the kmap_atomic() call
|
||||
not the argument.
|
||||
|
||||
If you need to map two pages because you want to copy from one page to
|
||||
another you need to keep the kmap_atomic calls strictly nested, like::
|
||||
|
||||
vaddr1 = kmap_atomic(page1);
|
||||
vaddr2 = kmap_atomic(page2);
|
||||
|
||||
memcpy(vaddr1, vaddr2, PAGE_SIZE);
|
||||
|
||||
kunmap_atomic(vaddr2);
|
||||
kunmap_atomic(vaddr1);
|
||||
* vmap(). This can be used to make a long duration mapping of multiple
|
||||
physical pages into a contiguous virtual space. It needs global
|
||||
synchronization to unmap.
|
||||
|
||||
|
||||
Cost of Temporary Mappings
|
||||
@@ -145,3 +158,10 @@ The general recommendation is that you don't use more than 8GiB on a 32-bit
|
||||
machine - although more might work for you and your workload, you're pretty
|
||||
much on your own - don't expect kernel developers to really care much if things
|
||||
come apart.
|
||||
|
||||
|
||||
Functions
|
||||
=========
|
||||
|
||||
.. kernel-doc:: include/linux/highmem.h
|
||||
.. kernel-doc:: include/linux/highmem-internal.h
|
||||
|
||||
@@ -63,5 +63,6 @@ above structured documentation, or deleted if it has served its purpose.
|
||||
transhuge
|
||||
unevictable-lru
|
||||
vmalloced-kernel-stacks
|
||||
vmemmap_dedup
|
||||
z3fold
|
||||
zsmalloc
|
||||
|
||||
@@ -121,6 +121,14 @@ Usage
|
||||
-r Sort by memory release time.
|
||||
-s Sort by stack trace.
|
||||
-t Sort by times (default).
|
||||
--sort <order> Specify sorting order. Sorting syntax is [+|-]key[,[+|-]key[,...]].
|
||||
Choose a key from the **STANDARD FORMAT SPECIFIERS** section. The "+" is
|
||||
optional since default direction is increasing numerical or lexicographic
|
||||
order. Mixed use of abbreviated and complete-form of keys is allowed.
|
||||
|
||||
Examples:
|
||||
./page_owner_sort <input> <output> --sort=n,+pid,-tgid
|
||||
./page_owner_sort <input> <output> --sort=at
|
||||
|
||||
additional function::
|
||||
|
||||
@@ -129,7 +137,6 @@ Usage
|
||||
Specify culling rules. Culling syntax is key[,key[,...]]. Choose a
|
||||
multi-letter key from the **STANDARD FORMAT SPECIFIERS** section.
|
||||
|
||||
|
||||
<rules> is a single argument in the form of a comma-separated list,
|
||||
which offers a way to specify individual culling rules. The recognized
|
||||
keywords are described in the **STANDARD FORMAT SPECIFIERS** section below.
|
||||
@@ -137,7 +144,6 @@ Usage
|
||||
the STANDARD SORT KEYS section below. Mixed use of abbreviated and
|
||||
complete-form of keys is allowed.
|
||||
|
||||
|
||||
Examples:
|
||||
./page_owner_sort <input> <output> --cull=stacktrace
|
||||
./page_owner_sort <input> <output> --cull=st,pid,name
|
||||
@@ -147,17 +153,44 @@ Usage
|
||||
-f Filter out the information of blocks whose memory has been released.
|
||||
|
||||
Select:
|
||||
--pid <PID> Select by pid.
|
||||
--tgid <TGID> Select by tgid.
|
||||
--name <command> Select by task command name.
|
||||
--pid <pidlist> Select by pid. This selects the blocks whose process ID
|
||||
numbers appear in <pidlist>.
|
||||
--tgid <tgidlist> Select by tgid. This selects the blocks whose thread
|
||||
group ID numbers appear in <tgidlist>.
|
||||
--name <cmdlist> Select by task command name. This selects the blocks whose
|
||||
task command names appear in <cmdlist>.
|
||||
|
||||
<pidlist>, <tgidlist>, <cmdlist> are single arguments in the form of a comma-separated list,
|
||||
which offers a way to specify individual selecting rules.
|
||||
|
||||
|
||||
Examples:
|
||||
./page_owner_sort <input> <output> --pid=1
|
||||
./page_owner_sort <input> <output> --tgid=1,2,3
|
||||
./page_owner_sort <input> <output> --name name1,name2
|
||||
|
||||
STANDARD FORMAT SPECIFIERS
|
||||
==========================
|
||||
::
|
||||
|
||||
For --sort option:
|
||||
|
||||
KEY LONG DESCRIPTION
|
||||
p pid process ID
|
||||
tg tgid thread group ID
|
||||
n name task command name
|
||||
st stacktrace stack trace of the page allocation
|
||||
T txt full text of block
|
||||
ft free_ts timestamp of the page when it was released
|
||||
at alloc_ts timestamp of the page when it was allocated
|
||||
ator allocator memory allocator for pages
|
||||
|
||||
For --cull option:
|
||||
|
||||
KEY LONG DESCRIPTION
|
||||
p pid process ID
|
||||
tg tgid thread group ID
|
||||
n name task command name
|
||||
f free whether the page has been released or not
|
||||
st stacktrace stace trace of the page allocation
|
||||
st stacktrace stack trace of the page allocation
|
||||
ator allocator memory allocator for pages
|
||||
|
||||
223
Documentation/vm/vmemmap_dedup.rst
Normal file
@@ -0,0 +1,223 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=========================================
|
||||
A vmemmap diet for HugeTLB and Device DAX
|
||||
=========================================
|
||||
|
||||
HugeTLB
|
||||
=======
|
||||
|
||||
The struct page structures (page structs) are used to describe a physical
|
||||
page frame. By default, there is a one-to-one mapping from a page frame to
|
||||
its corresponding page struct.
|
||||
|
||||
HugeTLB pages consist of multiple base page size pages and are supported by many
|
||||
architectures. See Documentation/admin-guide/mm/hugetlbpage.rst for more
|
||||
details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB are
|
||||
currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page
|
||||
consists of 512 base pages and a 1GB HugeTLB page consists of 262144 base pages.
|
||||
For each base page, there is a corresponding page struct.
|
||||
|
||||
Within the HugeTLB subsystem, only the first 4 page structs are used to
|
||||
contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
|
||||
this upper limit. The only 'useful' information in the remaining page structs
|
||||
is the compound_head field, and this field is the same for all tail pages.
|
||||
|
||||
By removing redundant page structs for HugeTLB pages, memory can be returned
|
||||
to the buddy allocator for other uses.
|
||||
|
||||
Different architectures support different HugeTLB page sizes. For example, the
following table lists the HugeTLB page sizes supported by the x86 and arm64
architectures. Because arm64 supports 4k, 16k, and 64k base pages as well as
contiguous entries, it supports HugeTLB pages of many different sizes.
|
||||
|
||||
+--------------+-----------+-----------------------------------------------+
|
||||
| Architecture | Page Size | HugeTLB Page Size |
|
||||
+--------------+-----------+-----------+-----------+-----------+-----------+
|
||||
| x86-64 | 4KB | 2MB | 1GB | | |
|
||||
+--------------+-----------+-----------+-----------+-----------+-----------+
|
||||
| | 4KB | 64KB | 2MB | 32MB | 1GB |
|
||||
| +-----------+-----------+-----------+-----------+-----------+
|
||||
| arm64 | 16KB | 2MB | 32MB | 1GB | |
|
||||
| +-----------+-----------+-----------+-----------+-----------+
|
||||
| | 64KB | 2MB | 512MB | 16GB | |
|
||||
+--------------+-----------+-----------+-----------+-----------+-----------+
|
||||
|
||||
When the system boots up, every HugeTLB page has more than one page struct,
and their total size is (unit: pages)::
|
||||
|
||||
struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
|
||||
|
||||
Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
|
||||
of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
|
||||
relationship::
|
||||
|
||||
HugeTLB_Size = n * PAGE_SIZE
|
||||
|
||||
Then::
|
||||
|
||||
struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
|
||||
= n * sizeof(struct page) / PAGE_SIZE
|
||||
|
||||
We can use huge mapping at the pud/pmd level for the HugeTLB page.

For a HugeTLB page mapped at the pmd level::

   struct_size = n * sizeof(struct page) / PAGE_SIZE
               = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
               = sizeof(struct page) / sizeof(pte_t)
               = 64 / 8
               = 8 (pages)

Where n is the number of pte entries that one page can contain. So the value
of n is (PAGE_SIZE / sizeof(pte_t)).

This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
is 8. This optimization is also applicable only when the size of struct page
is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, its
struct page structs occupy 8 page frames, whose size depends on the size of
the base page.

For a HugeTLB page mapped at the pud level::

   struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
               = PAGE_SIZE / 8 * 8 (pages)
               = PAGE_SIZE (pages)

Where struct_size(pmd) is the size of the struct page structs of a
HugeTLB page mapped at the pmd level.

E.g.: The struct page structs of a 2MB HugeTLB page on x86_64 occupy 8 page
frames, while those of a 1GB HugeTLB page occupy 4096.
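
The struct_size formula above can be checked with a small user-space sketch.
This is only an illustration, not kernel code: the function name
vmemmap_pages() is hypothetical, and the size of struct page is passed in as
a parameter (64 bytes being the common case named in the text) rather than
taken from kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of base pages occupied by the struct page structs that describe
 * one HugeTLB page:
 *
 *   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
 *
 * vmemmap_pages() is a hypothetical helper for illustration only.
 * struct_page_size is an assumption supplied by the caller.
 */
static size_t vmemmap_pages(size_t hugetlb_size, size_t page_size,
                            size_t struct_page_size)
{
    /* A HugeTLB page is always n times PAGE_SIZE. */
    assert(page_size != 0 && hugetlb_size % page_size == 0);

    size_t n = hugetlb_size / page_size;    /* struct pages needed */

    /* Integer division: a result of 0 means "less than one page". */
    return n * struct_page_size / page_size;
}
```

With a 4KB base page and a 64-byte struct page, this gives 8 pages for a 2MB
HugeTLB page and 4096 pages for a 1GB HugeTLB page, matching the derivation.
A 64KB contiguous-entry HugeTLB page on a 4KB base page yields 0, i.e. its
struct page structs do not fill even one page.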

Next, we take the pmd level mapping of the HugeTLB page as an example to
show the internal implementation of this optimization. There are 8 pages of
struct page structs associated with a HugeTLB page which is pmd mapped.

Here is how things look before optimization::

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | -------------> |     2     |
 |           |                     +-----------+                +-----------+
 |           |                     |     3     | -------------> |     3     |
 |           |                     +-----------+                +-----------+
 |           |                     |     4     | -------------> |     4     |
 |    PMD    |                     +-----------+                +-----------+
 |   level   |                     |     5     | -------------> |     5     |
 |  mapping  |                     +-----------+                +-----------+
 |           |                     |     6     | -------------> |     6     |
 |           |                     +-----------+                +-----------+
 |           |                     |     7     | -------------> |     7     |
 |           |                     +-----------+                +-----------+
 |           |
 |           |
 |           |
 +-----------+

The value of page->compound_head is the same for all tail pages. The first
page of page structs (page 0) associated with the HugeTLB page contains the 4
page structs necessary to describe the HugeTLB page. The only use of the
remaining pages of page structs (page 1 to page 7) is to point to
page->compound_head. Therefore, we can remap pages 1 to 7 to page 0. Only 1
page of page structs will be used for each HugeTLB page. This will allow us
to free the remaining 7 pages to the buddy allocator.

Here is how things look after remapping::

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
 |           |                     +-----------+                  | | | | | |
 |           |                     |     2     | -----------------+ | | | | |
 |           |                     +-----------+                    | | | | |
 |           |                     |     3     | -------------------+ | | | |
 |           |                     +-----------+                      | | | |
 |           |                     |     4     | ---------------------+ | | |
 |    PMD    |                     +-----------+                        | | |
 |   level   |                     |     5     | -----------------------+ | |
 |  mapping  |                     +-----------+                          | |
 |           |                     |     6     | -------------------------+ |
 |           |                     +-----------+                            |
 |           |                     |     7     | ---------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+

When a HugeTLB page is freed to the buddy system, we should allocate 7 pages
for vmemmap pages and restore the previous mapping relationship.

The HugeTLB page of the pud level mapping is handled similarly. We can also
use this approach to free (PAGE_SIZE - 1) vmemmap pages.
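
The savings described above can be estimated with the following sketch. It is
an illustration under stated assumptions, not the kernel implementation: the
names vmemmap_pages_freed() and vmemmap_bytes_saved() are hypothetical, and
the caller supplies the number of vmemmap pages per HugeTLB page (8 for a
pmd-mapped page, PAGE_SIZE for a pud-mapped page, per the derivation earlier).

```c
#include <assert.h>
#include <stddef.h>

/*
 * vmemmap pages that can go back to the buddy allocator for one HugeTLB
 * page: every page of struct page structs except the first, which the
 * other vmemmap virtual addresses are remapped to. Nothing can be freed
 * when the struct page structs fit in a single page.
 * Hypothetical helper, for illustration only.
 */
static size_t vmemmap_pages_freed(size_t vmemmap_pages)
{
    return vmemmap_pages > 1 ? vmemmap_pages - 1 : 0;
}

/* Bytes saved across nr_hugepages HugeTLB pages of the same size. */
static size_t vmemmap_bytes_saved(size_t nr_hugepages, size_t vmemmap_pages,
                                  size_t page_size)
{
    assert(page_size != 0);

    return nr_hugepages * vmemmap_pages_freed(vmemmap_pages) * page_size;
}
```

Under these assumptions, 512 pmd-mapped 2MB HugeTLB pages (1GB in total) on a
4KB base page free 512 * 7 * 4096 bytes, i.e. 14MB of vmemmap. The same
number of pages has to be allocated again before those HugeTLB pages can be
freed to the buddy system.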

Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
(e.g. aarch64) provide a contiguous bit in the translation table entries
that hints to the MMU that the entry is one of a contiguous set of entries
that can be cached in a single TLB entry.

The contiguous bit is used to increase the mapping size at the pmd and pte
(last) level. So this type of HugeTLB page can be optimized only when the
size of its struct page structs is greater than 1 page.

Notice: The head vmemmap page is not freed to the buddy allocator and all
tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
more than one struct page struct with PG_head (e.g. 8 per 2 MB HugeTLB page)
associated with each HugeTLB page. The compound_head() can handle this
correctly (for more details, refer to the comment above compound_head()).

Device DAX
==========

The device-dax interface uses the same tail deduplication technique explained
in the previous chapter, except when used with the vmemmap in
the device (altmap).

The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).

The differences with HugeTLB are relatively minor.

It only uses 3 page structs for storing all information, as opposed
to 4 on HugeTLB pages.

There's no remapping of vmemmap given that device-dax memory is not part of
System RAM ranges initialized at boot. Thus the tail page deduplication
happens at a later stage when we populate the sections. HugeTLB reuses the
head vmemmap page, whereas device-dax reuses the tail vmemmap page. This
results in only half of the savings compared to HugeTLB.

Deduplicated tail pages are not mapped read-only.

Here's how things look on device-dax after the sections are populated::

 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
 |           |                     +-----------+                   | | | | |
 |           |                     |     3     | ------------------+ | | | |
 |           |                     +-----------+                     | | | |
 |           |                     |     4     | --------------------+ | | |
 |    PMD    |                     +-----------+                       | | |
 |   level   |                     |     5     | ----------------------+ | |
 |  mapping  |                     +-----------+                         | |
 |           |                     |     6     | ------------------------+ |
 |           |                     +-----------+                           |
 |           |                     |     7     | --------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+

@@ -5027,6 +5027,7 @@ F: Documentation/admin-guide/cgroup-v1/
F: Documentation/admin-guide/cgroup-v2.rst
F: include/linux/cgroup*
F: kernel/cgroup/
F: tools/testing/selftests/cgroup/

CONTROL GROUP - BLOCK IO CONTROLLER (BLKIO)
M: Tejun Heo <tj@kernel.org>
@@ -5060,6 +5061,8 @@ L: linux-mm@kvack.org
S: Maintained
F: mm/memcontrol.c
F: mm/swap_cgroup.c
F: tools/testing/selftests/cgroup/test_kmem.c
F: tools/testing/selftests/cgroup/test_memcontrol.c

CORETEMP HARDWARE MONITORING DRIVER
M: Fenghua Yu <fenghua.yu@intel.com>
@@ -9064,16 +9067,20 @@ S: Orphan
F: Documentation/networking/device_drivers/ethernet/huawei/hinic.rst
F: drivers/net/ethernet/huawei/hinic/

HUGETLB FILESYSTEM
HUGETLB SUBSYSTEM
M: Mike Kravetz <mike.kravetz@oracle.com>
M: Muchun Song <songmuchun@bytedance.com>
L: linux-mm@kvack.org
S: Maintained
F: Documentation/ABI/testing/sysfs-kernel-mm-hugepages
F: Documentation/admin-guide/mm/hugetlbpage.rst
F: Documentation/vm/hugetlbfs_reserv.rst
F: Documentation/vm/vmemmap_dedup.rst
F: fs/hugetlbfs/
F: include/linux/hugetlb.h
F: mm/hugetlb.c
F: mm/hugetlb_vmemmap.c
F: mm/hugetlb_vmemmap.h

HVA ST MEDIA DRIVER
M: Jean-Christophe Trotin <jean-christophe.trotin@foss.st.com>

@@ -18,7 +18,7 @@ extern void clear_page(void *page);
#define clear_user_page(page, vaddr, pg) clear_page(page)

#define alloc_zeroed_user_highpage_movable(vma, vaddr) \
alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, vma, vmaddr)
alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE

extern void copy_page(void * _to, void * _from);

@@ -45,6 +45,7 @@ config ARM64
select ARCH_HAS_SYSCALL_WRAPPER
select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_VM_GET_PAGE_PROT
select ARCH_HAS_ZONE_DMA_SET if EXPERT
select ARCH_HAVE_ELF_PROT
select ARCH_HAVE_NMI_SAFE_CMPXCHG
@@ -91,11 +92,13 @@ config ARM64
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
select ARCH_WANT_LD_ORPHAN_WARN
select ARCH_WANTS_NO_INSTR
select ARCH_HAS_UBSAN_SANITIZE_ALL