Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Some swap cleanups from Ma Wupeng ("fix WARN_ON in
   add_to_avail_list")

 - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
   reduces the special-case code for handling hugetlb pages in GUP. It
   also speeds up GUP handling of transparent hugepages.

 - Peng Zhang provides some maple tree speedups ("Optimize the fast path
   of mas_store()").

 - Sergey Senozhatsky has improved te performance of zsmalloc during
   compaction (zsmalloc: small compaction improvements").

 - Domenico Cerasuolo has developed additional selftest code for zswap
   ("selftests: cgroup: add zswap test program").

 - xu xin has doe some work on KSM's handling of zero pages. These
   changes are mainly to enable the user to better understand the
   effectiveness of KSM's treatment of zero pages ("ksm: support
   tracking KSM-placed zero-pages").

 - Jeff Xu has fixes the behaviour of memfd's
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").

 - David Howells has fixed an fscache optimization ("mm, netfs, fscache:
   Stop read optimisation when folio removed from pagecache").

 - Axel Rasmussen has given userfaultfd the ability to simulate memory
   poisoning ("add UFFDIO_POISON to simulate memory poisoning with
   UFFD").

 - Miaohe Lin has contributed some routine maintenance work on the
   memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
   check").

 - Peng Zhang has contributed some maintenance work on the maple tree
   code ("Improve the validation for maple tree and some cleanup").

 - Hugh Dickins has optimized the collapsing of shmem or file pages into
   THPs ("mm: free retracted page table by RCU").

 - Jiaqi Yan has a patch series which permits us to use the healthy
   subpages within a hardware poisoned huge page for general purposes
   ("Improve hugetlbfs read on HWPOISON hugepages").

 - Kemeng Shi has done some maintenance work on the pagetable-check code
   ("Remove unused parameters in page_table_check").

 - More folioification work from Matthew Wilcox ("More filesystem folio
   conversions for 6.6"), ("Followup folio conversions for zswap"). And
   from ZhangPeng ("Convert several functions in page_io.c to use a
   folio").

 - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").

 - Baoquan He has converted some architectures to use the
   GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
   architectures to take GENERIC_IOREMAP way").

 - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
   batched/deferred tlb shootdown during page reclamation/migration").

 - Better maple tree lockdep checking from Liam Howlett ("More strict
   maple tree lockdep"). Liam also developed some efficiency
   improvements ("Reduce preallocations for maple tree").

 - Cleanup and optimization to the secondary IOMMU TLB invalidation,
   from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
   upgrade").

 - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
   for arm64").

 - Kemeng Shi provides some maintenance work on the compaction code
   ("Two minor cleanups for compaction").

 - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
   most file-backed faults under the VMA lock").

 - Aneesh Kumar contributes code to use the vmemmap optimization for DAX
   on ppc64, under some circumstances ("Add support for DAX vmemmap
   optimization for ppc64").

 - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
   data in page_ext"), ("minor cleanups to page_ext header").

 - Some zswap cleanups from Johannes Weiner ("mm: zswap: three
   cleanups").

 - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").

 - VMA handling cleanups from Kefeng Wang ("mm: convert to
   vma_is_initial_heap/stack()").

 - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
   implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
   address ranges and DAMON monitoring targets").

 - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").

 - Liam Howlett has improved the maple tree node replacement code
   ("maple_tree: Change replacement strategy").

 - ZhangPeng has a general code cleanup - use the K() macro more widely
   ("cleanup with helper macro K()").

 - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
   memmap on memory feature on ppc64").

 - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
   in page_alloc"), ("Two minor cleanups for get pageblock
   migratetype").

 - Vishal Moola introduces a memory descriptor for page table tracking,
   "struct ptdesc" ("Split ptdesc from struct page").

 - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
   for vm.memfd_noexec").

 - MM include file rationalization from Hugh Dickins ("arch: include
   asm/cacheflush.h in asm/hugetlb.h").

 - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
   output").

 - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
   object_cache instead of kmemleak_initialized").

 - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
   and _folio_order").

 - A VMA locking scalability improvement from Suren Baghdasaryan
   ("Per-VMA lock support for swap and userfaults").

 - pagetable handling cleanups from Matthew Wilcox ("New page table
   range API").

 - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
   using page->private on tail pages for THP_SWAP + cleanups").

 - Cleanups and speedups to the hugetlb fault handling from Matthew
   Wilcox ("Change calling convention for ->huge_fault").

 - Matthew Wilcox has also done some maintenance work on the MM
   subsystem documentation ("Improve mm documentation").

* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
  maple_tree: shrink struct maple_tree
  maple_tree: clean up mas_wr_append()
  secretmem: convert page_is_secretmem() to folio_is_secretmem()
  nios2: fix flush_dcache_page() for usage from irq context
  hugetlb: add documentation for vma_kernel_pagesize()
  mm: add orphaned kernel-doc to the rst files.
  mm: fix clean_record_shared_mapping_range kernel-doc
  mm: fix get_mctgt_type() kernel-doc
  mm: fix kernel-doc warning from tlb_flush_rmaps()
  mm: remove enum page_entry_size
  mm: allow ->huge_fault() to be called without the mmap_lock held
  mm: move PMD_ORDER to pgtable.h
  mm: remove checks for pte_index
  memcg: remove duplication detection for mem_cgroup_uncharge_swap
  mm/huge_memory: work on folio->swap instead of page->private when splitting folio
  mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
  mm/swap: use dedicated entry for swap in folio
  mm/swap: stop using page->private on tail pages for THP_SWAP
  selftests/mm: fix WARNING comparing pointer to 0
  selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
  ...
This commit is contained in:
Linus Torvalds
2023-08-29 14:25:26 -07:00
471 changed files with 9586 additions and 7102 deletions

View File

@@ -29,8 +29,10 @@ Description: Writing 'on' or 'off' to this file makes the kdamond starts or
file updates contents of schemes stats files of the kdamond.
Writing 'update_schemes_tried_regions' to the file updates
contents of 'tried_regions' directory of every scheme directory
of this kdamond. Writing 'clear_schemes_tried_regions' to the
file removes contents of the 'tried_regions' directory.
of this kdamond. Writing 'update_schemes_tried_bytes' to the
file updates only '.../tried_regions/total_bytes' files of this
kdamond. Writing 'clear_schemes_tried_regions' to the file
removes contents of the 'tried_regions' directory.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/pid
Date: Mar 2022
@@ -269,8 +271,10 @@ What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/
Date: Dec 2022
Contact: SeongJae Park <sj@kernel.org>
Description: Writing to and reading from this file sets and gets the type of
the memory of the interest. 'anon' for anonymous pages, or
'memcg' for specific memory cgroup can be written and read.
the memory of the interest. 'anon' for anonymous pages,
'memcg' for specific memory cgroup, 'addr' for address range
(an open-ended interval), or 'target' for DAMON monitoring
target can be written and read.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/memcg_path
Date: Dec 2022
@@ -279,6 +283,27 @@ Description: If 'memcg' is written to the 'type' file, writing to and
reading from this file sets and gets the path to the memory
cgroup of the interest.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/addr_start
Date: Jul 2023
Contact: SeongJae Park <sj@kernel.org>
Description: If 'addr' is written to the 'type' file, writing to or reading
from this file sets or gets the start address of the address
range for the filter.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/addr_end
Date: Jul 2023
Contact: SeongJae Park <sj@kernel.org>
Description: If 'addr' is written to the 'type' file, writing to or reading
from this file sets or gets the end address of the address
range for the filter.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/target_idx
Date: Dec 2022
Contact: SeongJae Park <sj@kernel.org>
Description: If 'target' is written to the 'type' file, writing to or
reading from this file sets or gets the index of the DAMON
monitoring target of the interest.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/matching
Date: Dec 2022
Contact: SeongJae Park <sj@kernel.org>
@@ -317,6 +342,13 @@ Contact: SeongJae Park <sj@kernel.org>
Description: Reading this file returns the number of the exceed events of
the scheme's quotas.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/total_bytes
Date: Jul 2023
Contact: SeongJae Park <sj@kernel.org>
Description: Reading this file returns the total amount of memory that
corresponding DAMON-based Operation Scheme's action has tried
to be applied.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/start
Date: Oct 2022
Contact: SeongJae Park <sj@kernel.org>

View File

@@ -10,7 +10,7 @@ Description:
dropping it if possible. The kernel will then be placed
on the bad page list and never be reused.
The offlining is done in kernel specific granuality.
The offlining is done in kernel specific granularity.
Normally it's the base page size of the kernel, but
this might change.
@@ -35,7 +35,7 @@ Description:
to access this page assuming it's poisoned by the
hardware.
The offlining is done in kernel specific granuality.
The offlining is done in kernel specific granularity.
Normally it's the base page size of the kernel, but
this might change.

View File

@@ -92,8 +92,6 @@ Brief summary of control files.
memory.oom_control set/show oom controls.
memory.numa_stat show the number of memory usage per numa
node
memory.kmem.limit_in_bytes This knob is deprecated and writing to
it will return -ENOTSUPP.
memory.kmem.usage_in_bytes show current kernel memory allocation
memory.kmem.failcnt show the number of kernel memory usage
hits limits

View File

@@ -141,8 +141,8 @@ nodemask_t
The size of a nodemask_t type. Used to compute the number of online
nodes.
(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|compound_order|compound_head)
-------------------------------------------------------------------------------------------------
(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_head)
----------------------------------------------------------------------------------
User-space tools compute their values based on the offset of these
variables. The variables are used when excluding unnecessary pages.
@@ -325,8 +325,8 @@ NR_FREE_PAGES
On linux-2.6.21 or later, the number of free pages is in
vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.
PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask
------------------------------------------------------------------------------
PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask|PG_hugetlb
-----------------------------------------------------------------------------------------
Page attributes. These flags are used to filter various unnecessary for
dumping pages.
@@ -338,12 +338,6 @@ More page attributes. These flags are used to filter various unnecessary for
dumping pages.
HUGETLB_PAGE_DTOR
-----------------
The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
excludes these pages.
x86_64
======

View File

@@ -87,7 +87,7 @@ comma (","). ::
│ │ │ │ │ │ │ filters/nr_filters
│ │ │ │ │ │ │ │ 0/type,matching,memcg_id
│ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
│ │ │ │ │ │ │ tried_regions/
│ │ │ │ │ │ │ tried_regions/total_bytes
│ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
│ │ │ │ │ │ │ │ ...
│ │ │ │ │ │ ...
@@ -127,14 +127,18 @@ in the state. Writing ``commit`` to the ``state`` file makes kdamond reads the
user inputs in the sysfs files except ``state`` file again. Writing
``update_schemes_stats`` to ``state`` file updates the contents of stats files
for each DAMON-based operation scheme of the kdamond. For details of the
stats, please refer to :ref:`stats section <sysfs_schemes_stats>`. Writing
``update_schemes_tried_regions`` to ``state`` file updates the DAMON-based
operation scheme action tried regions directory for each DAMON-based operation
scheme of the kdamond. Writing ``clear_schemes_tried_regions`` to ``state``
file clears the DAMON-based operating scheme action tried regions directory for
each DAMON-based operation scheme of the kdamond. For details of the
DAMON-based operation scheme action tried regions directory, please refer to
:ref:`tried_regions section <sysfs_schemes_tried_regions>`.
stats, please refer to :ref:`stats section <sysfs_schemes_stats>`.
Writing ``update_schemes_tried_regions`` to ``state`` file updates the
DAMON-based operation scheme action tried regions directory for each
DAMON-based operation scheme of the kdamond. Writing
``update_schemes_tried_bytes`` to ``state`` file updates only
``.../tried_regions/total_bytes`` files. Writing
``clear_schemes_tried_regions`` to ``state`` file clears the DAMON-based
operating scheme action tried regions directory for each DAMON-based operation
scheme of the kdamond. For details of the DAMON-based operation scheme action
tried regions directory, please refer to :ref:`tried_regions section
<sysfs_schemes_tried_regions>`.
If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
@@ -359,15 +363,21 @@ number (``N``) to the file creates the number of child directories named ``0``
to ``N-1``. Each directory represents each filter. The filters are evaluated
in the numeric order.
Each filter directory contains three files, namely ``type``, ``matcing``, and
``memcg_path``. You can write one of two special keywords, ``anon`` for
anonymous pages, or ``memcg`` for specific memory cgroup filtering. In case of
the memory cgroup filtering, you can specify the memory cgroup of the interest
by writing the path of the memory cgroup from the cgroups mount point to
``memcg_path`` file. You can write ``Y`` or ``N`` to ``matching`` file to
filter out pages that does or does not match to the type, respectively. Then,
the scheme's action will not be applied to the pages that specified to be
filtered out.
Each filter directory contains six files, namely ``type``, ``matcing``,
``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. To ``type``
file, you can write one of four special keywords: ``anon`` for anonymous pages,
``memcg`` for specific memory cgroup, ``addr`` for specific address range (an
open-ended interval), or ``target`` for specific DAMON monitoring target
filtering. In case of the memory cgroup filtering, you can specify the memory
cgroup of the interest by writing the path of the memory cgroup from the
cgroups mount point to ``memcg_path`` file. In case of the address range
filtering, you can specify the start and end address of the range to
``addr_start`` and ``addr_end`` files, respectively. For the DAMON monitoring
target filtering, you can specify the index of the target between the list of
the DAMON context's monitoring targets list to ``target_idx`` file. You can
write ``Y`` or ``N`` to ``matching`` file to filter out pages that does or does
not match to the type, respectively. Then, the scheme's action will not be
applied to the pages that specified to be filtered out.
For example, below restricts a DAMOS action to be applied to only non-anonymous
pages of all memory cgroups except ``/having_care_already``.::
@@ -381,8 +391,14 @@ pages of all memory cgroups except ``/having_care_already``.::
echo /having_care_already > 1/memcg_path
echo N > 1/matching
Note that filters are currently supported only when ``paddr``
`implementation <sysfs_contexts>` is being used.
Note that ``anon`` and ``memcg`` filters are currently supported only when
``paddr`` `implementation <sysfs_contexts>` is being used.
Also, memory regions that are filtered out by ``addr`` or ``target`` filters
are not counted as the scheme has tried to those, while regions that filtered
out by other type filters are counted as the scheme has tried to. The
difference is applied to :ref:`stats <damos_stats>` and
:ref:`tried regions <sysfs_schemes_tried_regions>`.
.. _sysfs_schemes_stats:
@@ -406,13 +422,21 @@ stats by writing a special keyword, ``update_schemes_stats`` to the relevant
schemes/<N>/tried_regions/
--------------------------
This directory initially has one file, ``total_bytes``.
When a special keyword, ``update_schemes_tried_regions``, is written to the
relevant ``kdamonds/<N>/state`` file, DAMON creates directories named integer
starting from ``0`` under this directory. Each directory contains files
exposing detailed information about each of the memory region that the
corresponding scheme's ``action`` has tried to be applied under this directory,
during next :ref:`aggregation interval <sysfs_monitoring_attrs>`. The
information includes address range, ``nr_accesses``, and ``age`` of the region.
relevant ``kdamonds/<N>/state`` file, DAMON updates the ``total_bytes`` file so
that reading it returns the total size of the scheme tried regions, and creates
directories named integer starting from ``0`` under this directory. Each
directory contains files exposing detailed information about each of the memory
region that the corresponding scheme's ``action`` has tried to be applied under
this directory, during next :ref:`aggregation interval
<sysfs_monitoring_attrs>`. The information includes address range,
``nr_accesses``, and ``age`` of the region.
Writing ``update_schemes_tried_bytes`` to the relevant ``kdamonds/<N>/state``
file will only update the ``total_bytes`` file, and will not create the
subdirectories.
The directories will be removed when another special keyword,
``clear_schemes_tried_regions``, is written to the relevant

View File

@@ -159,6 +159,8 @@ The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
general_profit
how effective is KSM. The calculation is explained below.
pages_scanned
how many pages are being scanned for ksm
pages_shared
how many shared pages are being used
pages_sharing
@@ -173,6 +175,13 @@ stable_node_chains
the number of KSM pages that hit the ``max_page_sharing`` limit
stable_node_dups
number of duplicated KSM pages
ksm_zero_pages
how many zero pages that are still mapped into processes were mapped by
KSM when deduplicating.
When ``use_zero_pages`` is/was enabled, the sum of ``pages_sharing`` +
``ksm_zero_pages`` represents the actual number of pages saved by KSM.
if ``use_zero_pages`` has never been enabled, ``ksm_zero_pages`` is 0.
A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
@@ -196,21 +205,25 @@ several times, which are unprofitable memory consumed.
1) How to determine whether KSM save memory or consume memory in system-wide
range? Here is a simple approximate calculation for reference::
general_profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
general_profit =~ ksm_saved_pages * sizeof(page) - (all_rmap_items) *
sizeof(rmap_item);
where all_rmap_items can be easily obtained by summing ``pages_sharing``,
``pages_shared``, ``pages_unshared`` and ``pages_volatile``.
where ksm_saved_pages equals to the sum of ``pages_sharing`` +
``ksm_zero_pages`` of the system, and all_rmap_items can be easily
obtained by summing ``pages_sharing``, ``pages_shared``, ``pages_unshared``
and ``pages_volatile``.
2) The KSM profit inner a single process can be similarly obtained by the
following approximate calculation::
process_profit =~ ksm_merging_pages * sizeof(page) -
process_profit =~ ksm_saved_pages * sizeof(page) -
ksm_rmap_items * sizeof(rmap_item).
where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``. The process profit
is also shown in ``/proc/<pid>/ksm_stat`` as ksm_process_profit.
where ksm_saved_pages equals to the sum of ``ksm_merging_pages`` and
``ksm_zero_pages``, both of which are shown under the directory
``/proc/<pid>/ksm_stat``, and ksm_rmap_items is also shown in
``/proc/<pid>/ksm_stat``. The process profit is also shown in
``/proc/<pid>/ksm_stat`` as ksm_process_profit.
From the perspective of application, a high ratio of ``ksm_rmap_items`` to
``ksm_merging_pages`` means a bad madvise-applied policy, so developers or

View File

@@ -433,6 +433,18 @@ The following module parameters are currently defined:
memory in a way that huge pages in bigger
granularity cannot be formed on hotplugged
memory.
With value "force" it could result in memory
wastage due to memmap size limitations. For
example, if the memmap for a memory block
requires 1 MiB, but the pageblock size is 2
MiB, 1 MiB of hotplugged memory will be wasted.
Note that there are still cases where the
feature cannot be enforced: for example, if the
memmap is smaller than a single page, or if the
architecture does not support the forced mode
in all configurations.
``online_policy`` read-write: Set the basic policy used for
automatic zone selection when onlining memory
blocks without specifying a target zone.
@@ -669,7 +681,7 @@ when still encountering permanently unmovable pages within ZONE_MOVABLE
(-> BUG), memory offlining will keep retrying until it eventually succeeds.
When offlining is triggered from user space, the offlining context can be
terminated by sending a fatal signal. A timeout based offlining can easily be
terminated by sending a signal. A timeout based offlining can easily be
implemented via::
% timeout $TIMEOUT offline_block | failure_handling

View File

@@ -244,6 +244,21 @@ write-protected (so future writes will also result in a WP fault). These ioctls
support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
respectively) to configure the mapping this way.
Memory Poisioning Emulation
---------------------------
In response to a fault (either missing or minor), an action userspace can
take to "resolve" it is to issue a ``UFFDIO_POISON``. This will cause any
future faulters to either get a SIGBUS, or in KVM's case the guest will
receive an MCE as if there were hardware memory poisoning.
This is used to emulate hardware memory poisoning. Imagine a VM running on a
machine which experiences a real hardware memory error. Later, we live migrate
the VM to another physical machine. Since we want the migration to be
transparent to the guest, we want that same address range to act as if it was
still poisoned, even though it's on a new physical host which ostensibly
doesn't have a memory error in the exact same spot.
QEMU/KVM
========

View File

@@ -49,7 +49,7 @@ compressed pool.
Design
======
Zswap receives pages for compression through the Frontswap API and is able to
Zswap receives pages for compression from the swap subsystem and is able to
evict pages from its own compressed pool on an LRU basis and write them back to
the backing swap device in the case that the compressed pool is full.
@@ -70,19 +70,19 @@ means the compression ratio will always be 2:1 or worse (because of half-full
zbud pages). The zsmalloc type zpool has a more complex compressed page
storage method, and it can achieve greater storage densities.
When a swap page is passed from frontswap to zswap, zswap maintains a mapping
When a swap page is passed from swapout to zswap, zswap maintains a mapping
of the swap entry, a combination of the swap type and swap offset, to the zpool
handle that references that compressed swap page. This mapping is achieved
with a red-black tree per swap type. The swap offset is the search key for the
tree nodes.
During a page fault on a PTE that is a swap entry, frontswap calls the zswap
load function to decompress the page into the page allocated by the page fault
handler.
During a page fault on a PTE that is a swap entry, the swapin code calls the
zswap load function to decompress the page into the page allocated by the page
fault handler.
Once there are no PTEs referencing a swap page stored in zswap (i.e. the count
in the swap_map goes to 0) the swap code calls the zswap invalidate function,
via frontswap, to free the compressed entry.
in the swap_map goes to 0) the swap code calls the zswap invalidate function
to free the compressed entry.
Zswap seeks to be simple in its policies. Sysfs attributes allow for one user
controlled policy:

View File

@@ -134,6 +134,7 @@ Usage of helpers:
bio_for_each_bvec_all()
bio_first_bvec_all()
bio_first_page_all()
bio_first_folio_all()
bio_last_bvec_all()
* The following helpers iterate over single-page segment. The passed 'struct

View File

@@ -88,13 +88,17 @@ changes occur:
This is used primarily during fault processing.
5) ``void update_mmu_cache(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)``
5) ``void update_mmu_cache_range(struct vm_fault *vmf,
struct vm_area_struct *vma, unsigned long address, pte_t *ptep,
unsigned int nr)``
At the end of every page fault, this routine is invoked to
tell the architecture specific code that a translation
now exists at virtual address "address" for address space
"vma->vm_mm", in the software page tables.
At the end of every page fault, this routine is invoked to tell
the architecture specific code that translations now exists
in the software page tables for address space "vma->vm_mm"
at virtual address "address" for "nr" consecutive pages.
This routine is also invoked in various other places which pass
a NULL "vmf".
A port may use this information in any way it so chooses.
For example, it could use this event to pre-load TLB
@@ -269,7 +273,7 @@ maps this page at its virtual address.
If D-cache aliasing is not an issue, these two routines may
simply call memcpy/memset directly and do nothing more.
``void flush_dcache_page(struct page *page)``
``void flush_dcache_folio(struct folio *folio)``
This routines must be called when:
@@ -277,7 +281,7 @@ maps this page at its virtual address.
and / or in high memory
b) the kernel is about to read from a page cache page and user space
shared/writable mappings of this page potentially exist. Note
that {get,pin}_user_pages{_fast} already call flush_dcache_page
that {get,pin}_user_pages{_fast} already call flush_dcache_folio
on any page found in the user address space and thus driver
code rarely needs to take this into account.
@@ -291,7 +295,7 @@ maps this page at its virtual address.
The phrase "kernel writes to a page cache page" means, specifically,
that the kernel executes store instructions that dirty data in that
page at the page->virtual mapping of that page. It is important to
page at the kernel virtual mapping of that page. It is important to
flush here to handle D-cache aliasing, to make sure these kernel stores
are visible to user space mappings of that page.
@@ -302,21 +306,22 @@ maps this page at its virtual address.
If D-cache aliasing is not an issue, this routine may simply be defined
as a nop on that architecture.
There is a bit set aside in page->flags (PG_arch_1) as "architecture
There is a bit set aside in folio->flags (PG_arch_1) as "architecture
private". The kernel guarantees that, for pagecache pages, it will
clear this bit when such a page first enters the pagecache.
This allows these interfaces to be implemented much more efficiently.
It allows one to "defer" (perhaps indefinitely) the actual flush if
there are currently no user processes mapping this page. See sparc64's
flush_dcache_page and update_mmu_cache implementations for an example
of how to go about doing this.
This allows these interfaces to be implemented much more
efficiently. It allows one to "defer" (perhaps indefinitely) the
actual flush if there are currently no user processes mapping this
page. See sparc64's flush_dcache_folio and update_mmu_cache_range
implementations for an example of how to go about doing this.
The idea is, first at flush_dcache_page() time, if page_file_mapping()
returns a mapping, and mapping_mapped on that mapping returns %false,
just mark the architecture private page flag bit. Later, in
update_mmu_cache(), a check is made of this flag bit, and if set the
flush is done and the flag bit is cleared.
The idea is, first at flush_dcache_folio() time, if
folio_flush_mapping() returns a mapping, and mapping_mapped() on that
mapping returns %false, just mark the architecture private page
flag bit. Later, in update_mmu_cache_range(), a check is made
of this flag bit, and if set the flush is done and the flag bit
is cleared.
.. important::
@@ -326,12 +331,6 @@ maps this page at its virtual address.
dirty. Again, see sparc64 for examples of how
to deal with this.
``void flush_dcache_folio(struct folio *folio)``
This function is called under the same circumstances as
flush_dcache_page(). It allows the architecture to
optimise for flushing the entire folio of pages instead
of flushing one page at a time.
``void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr, void *dst, void *src, int len)``
``void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
@@ -352,7 +351,7 @@ maps this page at its virtual address.
When the kernel needs to access the contents of an anonymous
page, it calls this function (currently only
get_user_pages()). Note: flush_dcache_page() deliberately
get_user_pages()). Note: flush_dcache_folio() deliberately
doesn't work for an anonymous page. The default
implementation is a nop (and should remain so for all coherent
architectures). For incoherent architectures, it should flush
@@ -369,7 +368,7 @@ maps this page at its virtual address.
``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``
All the functionality of flush_icache_page can be implemented in
flush_dcache_page and update_mmu_cache. In the future, the hope
flush_dcache_folio and update_mmu_cache_range. In the future, the hope
is to remove this interface completely.
The final category of APIs is for I/O to deliberately aliased address

View File

@@ -115,3 +115,28 @@ More Memory Management Functions
.. kernel-doc:: include/linux/mmzone.h
.. kernel-doc:: mm/util.c
:functions: folio_mapping
.. kernel-doc:: mm/rmap.c
.. kernel-doc:: mm/migrate.c
.. kernel-doc:: mm/mmap.c
.. kernel-doc:: mm/kmemleak.c
.. #kernel-doc:: mm/hmm.c (build warnings)
.. kernel-doc:: mm/memremap.c
.. kernel-doc:: mm/hugetlb.c
.. kernel-doc:: mm/swap.c
.. kernel-doc:: mm/zpool.c
.. kernel-doc:: mm/memcontrol.c
.. #kernel-doc:: mm/memory-tiers.c (build warnings)
.. kernel-doc:: mm/shmem.c
.. kernel-doc:: mm/migrate_device.c
.. #kernel-doc:: mm/nommu.c (duplicates kernel-doc from other files)
.. kernel-doc:: mm/mapping_dirty_helpers.c
.. #kernel-doc:: mm/memory-failure.c (build warnings)
.. kernel-doc:: mm/percpu.c
.. kernel-doc:: mm/maccess.c
.. kernel-doc:: mm/vmscan.c
.. kernel-doc:: mm/memory_hotplug.c
.. kernel-doc:: mm/mmu_notifier.c
.. kernel-doc:: mm/balloon_compaction.c
.. kernel-doc:: mm/huge_memory.c
.. kernel-doc:: mm/io-mapping.c

View File

@@ -9,7 +9,7 @@
| alpha: | TODO |
| arc: | TODO |
| arm: | TODO |
| arm64: | N/A |
| arm64: | ok |
| csky: | TODO |
| hexagon: | TODO |
| ia64: | TODO |

View File

@@ -636,26 +636,29 @@ vm_operations_struct
prototypes::
void (*open)(struct vm_area_struct*);
void (*close)(struct vm_area_struct*);
vm_fault_t (*fault)(struct vm_area_struct*, struct vm_fault *);
void (*open)(struct vm_area_struct *);
void (*close)(struct vm_area_struct *);
vm_fault_t (*fault)(struct vm_fault *);
vm_fault_t (*huge_fault)(struct vm_fault *, unsigned int order);
vm_fault_t (*map_pages)(struct vm_fault *, pgoff_t start, pgoff_t end);
vm_fault_t (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
vm_fault_t (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
locking rules:
============= ========= ===========================
============= ========== ===========================
ops mmap_lock PageLocked(page)
============= ========= ===========================
open: yes
close: yes
fault: yes can return with page locked
map_pages: read
page_mkwrite: yes can return with page locked
pfn_mkwrite: yes
access: yes
============= ========= ===========================
============= ========== ===========================
open: write
close: read/write
fault: read can return with page locked
huge_fault: maybe-read
map_pages: maybe-read
page_mkwrite: read can return with page locked
pfn_mkwrite: read
access: read
============= ========== ===========================
->fault() is called when a previously not present pte is about to be faulted
in. The filesystem must find and return the page associated with the passed in
@@ -665,11 +668,18 @@ then ensure the page is not already truncated (invalidate_lock will block
subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.
->huge_fault() is called when there is no PUD or PMD entry present. This
gives the filesystem the opportunity to install a PUD or PMD sized page.
Filesystems can also use the ->fault method to return a PMD sized page,
so implementing this function may not be necessary. In particular,
filesystems should not call filemap_fault() from ->huge_fault().
The mmap_lock may not be held when this method is called.
->map_pages() is called when VM asks to map easy accessible pages.
Filesystem should find and map pages associated with offsets from "start_pgoff"
till "end_pgoff". ->map_pages() is called with the RCU lock held and must
not block. If it's not possible to reach a page without blocking,
filesystem should skip it. Filesystem should use do_set_pte() to setup
filesystem should skip it. Filesystem should use set_pte_range() to setup
page table entry. Pointer to entry associated with the page is passed in
"pte" field in vm_fault structure. Pointers to entries for other offsets
should be calculated relative to "pte".

View File

@@ -938,3 +938,14 @@ file pointer instead of struct dentry pointer. d_tmpfile() is similarly
changed to simplify callers. The passed file is in a non-open state and on
success must be opened before returning (e.g. by calling
finish_open_simple()).
---
**mandatory**
Calling convention for ->huge_fault has changed. It now takes a page
order instead of an enum page_entry_size, and it may be called without the
mmap_lock held. All in-tree users have been audited and do not seem to
depend on the mmap_lock being held, but out of tree users should verify
for themselves. If they do need it, they can return VM_FAULT_RETRY to
be called with the mmap_lock held.

View File

@@ -380,12 +380,24 @@ number of filters for each scheme. Each filter specifies the type of target
memory, and whether it should exclude the memory of the type (filter-out), or
all except the memory of the type (filter-in).
As of this writing, anonymous page type and memory cgroup type are supported by
the feature. Some filter target types can require additional arguments. For
example, the memory cgroup filter type asks users to specify the file path of
the memory cgroup for the filter. Hence, users can apply specific schemes to
only anonymous pages, non-anonymous pages, pages of specific cgroups, all pages
excluding those of specific cgroups, and any combination of those.
Currently, anonymous page, memory cgroup, address range, and DAMON monitoring
target type filters are supported by the feature. Some filter target types
require additional arguments. The memory cgroup filter type asks users to
specify the file path of the memory cgroup for the filter. The address range
type asks the start and end addresses of the range. The DAMON monitoring
target type asks the index of the target from the context's monitoring targets
list. Hence, users can apply specific schemes to only anonymous pages,
non-anonymous pages, pages of specific cgroups, all pages excluding those of
specific cgroups, pages in specific address range, pages in specific DAMON
monitoring targets, and any combination of those.
To handle filters efficiently, the address range and DAMON monitoring target
type filters are handled by the core layer, while others are handled by
operations set. If a memory region is filtered by a core layer-handled filter,
it is not counted as the scheme has tried to the region. In contrast, if a
memory regions is filtered by an operations set layer-handled filter, it is
counted as the scheme has tried. The difference in accounting leads to changes
in the statistics.
Application Programming Interface

View File

@@ -1,264 +0,0 @@
=========
Frontswap
=========
Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance savings may be obtained because
swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.
.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/
Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device. The storage is assumed to be
a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
to the requirements of transcendent memory (such as Xen's "tmem", or
in-kernel compressed memory, aka "zcache", or future RAM-like devices);
this pseudo-RAM device is not directly accessible or addressable by the
kernel and is of unknown and possibly time-varying size. The driver
links itself to frontswap by calling frontswap_register_ops to set the
frontswap_ops funcs appropriately and the functions it provides must
conform to certain policies as follows:
An "init" prepares the device to receive frontswap pages associated
with the specified swap device number (aka "type"). A "store" will
copy the page to transcendent memory and associate it with the type and
offset associated with the page. A "load" will copy the page, if found,
from transcendent memory into kernel memory, but will NOT remove the page
from transcendent memory. An "invalidate_page" will remove the page
from transcendent memory and an "invalidate_area" will remove ALL pages
associated with the swap type (e.g., like swapoff) and notify the "device"
to refuse further stores with that swap type.
Once a page is successfully stored, a matching load on the page will normally
succeed. So when the kernel finds itself in a situation where it needs
to swap out a page, it first attempts to use frontswap. If the store returns
success, the data has been successfully saved to transcendent memory and
a disk write and, if the data is later read back, a disk read are avoided.
If a store returns failure, transcendent memory has rejected the data, and the
page can be written to swap as usual.
Note that if a page is stored and the page already exists in transcendent memory
(a "duplicate" store), either the store succeeds and the data is overwritten,
or the store fails AND the page is invalidated. This ensures stale data may
never be obtained from frontswap.
If properly configured, monitoring of frontswap is done via debugfs in
the `/sys/kernel/debug/frontswap` directory. The effectiveness of
frontswap can be measured (across all swap devices) with:
``failed_stores``
how many store attempts have failed
``loads``
how many loads were attempted (all should succeed)
``succ_stores``
how many store attempts have succeeded
``invalidates``
how many invalidates were attempted
A backend implementation may provide additional metrics.
FAQ
===
* Where's the value?
When a workload starts swapping, performance falls through the floor.
Frontswap significantly increases performance in many such workloads by
providing a clean, dynamic interface to read and write swap pages to
"transcendent memory" that is otherwise not directly addressable to the kernel.
This interface is ideal when data is transformed to a different form
and size (such as with compression) or secretly moved (as might be
useful for write-balancing for some RAM-like devices). Swap pages (and
evicted page-cache pages) are a great use for this kind of slower-than-RAM-
but-much-faster-than-disk "pseudo-RAM device".
Frontswap with a fairly small impact on the kernel,
provides a huge amount of flexibility for more dynamic, flexible RAM
utilization in various system configurations:
In the single kernel case, aka "zcache", pages are compressed and
stored in local memory, thus increasing the total anonymous pages
that can be safely kept in RAM. Zcache essentially trades off CPU
cycles used in compression/decompression for better memory utilization.
Benchmarks have shown little or no impact when memory pressure is
low while providing a significant performance improvement (25%+)
on some workloads under high memory pressure.
"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
support for clustered systems. Frontswap pages are locally compressed
as in zcache, but then "remotified" to another system's RAM. This
allows RAM to be dynamically load-balanced back-and-forth as needed,
i.e. when system A is overcommitted, it can swap to system B, and
vice versa. RAMster can also be configured as a memory server so
many servers in a cluster can swap, dynamically as needed, to a single
server configured with a large amount of RAM... without pre-configuring
how much of the RAM is available for each of the clients!
In the virtual case, the whole point of virtualization is to statistically
multiplex physical resources across the varying demands of multiple
virtual machines. This is really hard to do with RAM and efforts to do
it well with no kernel changes have essentially failed (except in some
well-publicized special-case workloads).
Specifically, the Xen Transcendent Memory backend allows otherwise
"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
virtual machines, but the pages can be compressed and deduplicated to
optimize RAM utilization. And when guest OS's are induced to surrender
underutilized RAM (e.g. with "selfballooning"), sudden unexpected
memory pressure may result in swapping; frontswap allows those pages
to be swapped to and from hypervisor RAM (if overall host system memory
conditions allow), thus mitigating the potentially awful performance impact
of unplanned swapping.
A KVM implementation is underway and has been RFC'ed to lkml. And,
using frontswap, investigation is also underway on the use of NVM as
a memory extension technology.
* Sure there may be performance advantages in some situations, but
what's the space/time overhead of frontswap?
If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
nothingness and the only overhead is a few extra bytes per swapon'ed
swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
registers, there is one extra global variable compared to zero for
every swap page read or written. If CONFIG_FRONTSWAP is enabled
AND a frontswap backend registers AND the backend fails every "store"
request (i.e. provides no memory despite claiming it might),
CPU overhead is still negligible -- and since every frontswap fail
precedes a swap page write-to-disk, the system is highly likely
to be I/O bound and using a small fraction of a percent of a CPU
will be irrelevant anyway.
As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
registers, one bit is allocated for every swap page for every swap
device that is swapon'd. This is added to the EIGHT bits (which
was sixteen until about 2.6.34) that the kernel already allocates
for every swap page for every swap device that is swapon'd. (Hugh
Dickins has observed that frontswap could probably steal one of
the existing eight bits, but let's worry about that minor optimization
later.) For very large swap disks (which are rare) on a standard
4K pagesize, this is 1MB per 32GB swap.
When swap pages are stored in transcendent memory instead of written
out to disk, there is a side effect that this may create more memory
pressure that can potentially outweigh the other advantages. A
backend, such as zcache, must implement policies to carefully (but
dynamically) manage memory limits to ensure this doesn't happen.
* OK, how about a quick overview of what this frontswap patch does
in terms that a kernel hacker can grok?
Let's assume that a frontswap "backend" has registered during
kernel initialization; this registration indicates that this
frontswap backend has access to some "memory" that is not directly
accessible by the kernel. Exactly how much memory it provides is
entirely dynamic and random.
Whenever a swap-device is swapon'd frontswap_init() is called,
passing the swap device number (aka "type") as a parameter.
This notifies frontswap to expect attempts to "store" swap pages
associated with that number.
Whenever the swap subsystem is readying a page to write to a swap
device (c.f swap_writepage()), frontswap_store is called. Frontswap
consults with the frontswap backend and if the backend says it does NOT
have room, frontswap_store returns -1 and the kernel swaps the page
to the swap device as normal. Note that the response from the frontswap
backend is unpredictable to the kernel; it may choose to never accept a
page, it could accept every ninth page, or it might accept every
page. But if the backend does accept a page, the data from the page
has already been copied and associated with the type and offset,
and the backend guarantees the persistence of the data. In this case,
frontswap sets a bit in the "frontswap_map" for the swap device
corresponding to the page offset on the swap device to which it would
otherwise have written the data.
When the swap subsystem needs to swap-in a page (swap_readpage()),
it first calls frontswap_load() which checks the frontswap_map to
see if the page was earlier accepted by the frontswap backend. If
it was, the page of data is filled from the frontswap backend and
the swap-in is complete. If not, the normal swap-in code is
executed to obtain the page of data from the real swap device.
So every time the frontswap backend accepts a page, a swap device read
and (potentially) a swap device write are replaced by a "frontswap backend
store" and (possibly) a "frontswap backend loads", which are presumably much
faster.
* Can't frontswap be configured as a "special" swap device that is
just higher priority than any real swap device (e.g. like zswap,
or maybe swap-over-nbd/NFS)?
No. First, the existing swap subsystem doesn't allow for any kind of
swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy,
but this would require fairly drastic changes. Even if it were
rewritten, the existing swap subsystem uses the block I/O layer which
assumes a swap device is fixed size and any page in it is linearly
addressable. Frontswap barely touches the existing swap subsystem,
and works around the constraints of the block I/O subsystem to provide
a great deal of flexibility and dynamicity.
For example, the acceptance of any swap page by the frontswap backend is
entirely unpredictable. This is critical to the definition of frontswap
backends because it grants completely dynamic discretion to the
backend. In zcache, one cannot know a priori how compressible a page is.
"Poorly" compressible pages can be rejected, and "poorly" can itself be
defined dynamically depending on current memory constraints.
Further, frontswap is entirely synchronous whereas a real swap
device is, by definition, asynchronous and uses block I/O. The
block I/O layer is not only unnecessary, but may perform "optimizations"
that are inappropriate for a RAM-oriented device including delaying
the write of some pages for a significant amount of time. Synchrony is
required to ensure the dynamicity of the backend and to avoid thorny race
conditions that would unnecessarily and greatly complicate frontswap
and/or the block I/O subsystem. That said, only the initial "store"
and "load" operations need be synchronous. A separate asynchronous thread
is free to manipulate the pages stored by frontswap. For example,
the "remotification" thread in RAMster uses standard asynchronous
kernel sockets to move compressed frontswap pages to a remote machine.
Similarly, a KVM guest-side implementation could do in-guest compression
and use "batched" hypercalls.
In a virtualized environment, the dynamicity allows the hypervisor
(or host OS) to do "intelligent overcommit". For example, it can
choose to accept pages only until host-swapping might be imminent,
then force guests to do their own swapping.
There is a downside to the transcendent memory specifications for
frontswap: Since any "store" might fail, there must always be a real
slot on a real swap device to swap the page. Thus frontswap must be
implemented as a "shadow" to every swapon'd device with the potential
capability of holding every page that the swap device might have held
and the possibility that it might hold no pages at all. This means
that frontswap cannot contain more pages than the total of swapon'd
swap devices. For example, if NO swap device is configured on some
installation, frontswap is useless. Swapless portable devices
can still use frontswap but a backend for such devices must configure
some kind of "ghost" swap device and ensure that it is never used.
* Why this weird definition about "duplicate stores"? If a page
has been previously successfully stored, can't it always be
successfully overwritten?
Nearly always it can, but no, sometimes it cannot. Consider an example
where data is compressed and the original 4K page has been compressed
to 1K. Now an attempt is made to overwrite the page with data that
is non-compressible and so would take the entire 4K. But the backend
has no more space. In this case, the store must be rejected. Whenever
frontswap rejects a store that would overwrite, it also must invalidate
the old data and ensure that it is no longer accessible. Since the
swap subsystem then writes the new data to the read swap device,
this is the correct course of action to ensure coherency.
* Why does the frontswap patch create the new include file swapfile.h?
The frontswap code depends on some swap-subsystem-internal data
structures that have, over the years, moved back and forth between
static and global. This seemed a reasonable compromise: Define
them as global but declare them in a new include file that isn't
included by the large number of source files that include swap.h.
Dan Magenheimer, last updated April 9, 2012

View File

@@ -206,4 +206,5 @@ Functions
=========
.. kernel-doc:: include/linux/highmem.h
.. kernel-doc:: mm/highmem.c
.. kernel-doc:: include/linux/highmem-internal.h

View File

@@ -271,12 +271,12 @@ to the global reservation count (resv_huge_pages).
Freeing Huge Pages
==================
Huge page freeing is performed by the routine free_huge_page(). This routine
is the destructor for hugetlbfs compound pages. As a result, it is only
passed a pointer to the page struct. When a huge page is freed, reservation
accounting may need to be performed. This would be the case if the page was
associated with a subpool that contained reserves, or the page is being freed
on an error path where a global reserve count must be restored.
Huge pages are freed by free_huge_folio(). It is only passed a pointer
to the folio as it is called from the generic MM code. When a huge page
is freed, reservation accounting may need to be performed. This would
be the case if the page was associated with a subpool that contained
reserves, or the page is being freed on an error path where a global
reserve count must be restored.
The page->private field points to any subpool associated with the page.
If the PagePrivate flag is set, it indicates the global reserve count should
@@ -525,7 +525,7 @@ However, there are several instances where errors are encountered after a huge
page is allocated but before it is instantiated. In this case, the page
allocation has consumed the reservation and made the appropriate subpool,
reservation map and global count adjustments. If the page is freed at this
time (before instantiation and clearing of PagePrivate), then free_huge_page
time (before instantiation and clearing of PagePrivate), then free_huge_folio
will increment the global reservation count. However, the reservation map
indicates the reservation was consumed. This resulting inconsistent state
will cause the 'leak' of a reserved huge page. The global reserve count will

View File

@@ -44,7 +44,6 @@ above structured documentation, or deleted if it has served its purpose.
balance
damon/index
free_page_reporting
frontswap
hmm
hwpoison
hugetlbfs_reserv

Some files were not shown because too many files have changed in this diff Show More