Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - In the series "mm: Avoid possible overflows in dirty throttling" Jan
   Kara addresses a couple of issues in the writeback throttling code.
   These fixes are also targeted at -stable kernels.

 - Ryusuke Konishi's series "nilfs2: fix potential issues related to
   reserved inodes" does that. This should actually be in the
   mm-nonmm-stable tree, along with the many other nilfs2 patches. My
   bad.

 - More folio conversions from Kefeng Wang in the series "mm: convert to
   folio_alloc_mpol()"

 - Kemeng Shi has sent some cleanups to the writeback code in the series
   "Add helper functions to remove repeated code and improve readability
   of cgroup writeback"

 - Kairui Song has made the swap code a little smaller and a little
   faster in the series "mm/swap: clean up and optimize swap cache
   index".

 - In the series "mm/memory: cleanly support zeropage in
   vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
   Hildenbrand has reworked the rather sketchy handling of the use of
   the zeropage in MAP_SHARED mappings. I don't see any runtime effects
   here - more a cleanup/understandability/maintainability thing.

 - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling
   of higher addresses, for aarch64. The (poorly named) series is
   "Restructure va_high_addr_switch".

 - The core TLB handling code gets some cleanups and possible slight
   optimizations in Bang Li's series "Add update_mmu_tlb_range() to
   simplify code".

 - Jane Chu has improved the handling of our
   fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in
   the series "Enhance soft hwpoison handling and injection".

 - Jeff Johnson has sent a billion patches everywhere to add
   MODULE_DESCRIPTION() to everything. Some landed in this pull.

 - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang
   has simplified migration's use of hardware-offload memory copying.

 - Yosry Ahmed performs more folio API conversions in his series "mm:
   zswap: trivial folio conversions".

 - In the series "large folios swap-in: handle refault cases first",
   Chuanhua Han inches us forward in the handling of large pages in the
   swap code. This is a cleanup and optimization, working toward the end
   objective of full support of large folio swapin/out.

 - In the series "mm,swap: cleanup VMA based swap readahead window
   calculation", Huang Ying has contributed some cleanups and a possible
   fixlet to his VMA based swap readahead code.

 - In the series "add mTHP support for anonymous shmem" Baolin Wang has
   taught anonymous shmem mappings to use multisize THP. By default this
   is a no-op - users must opt in via sysfs controls. Dramatic
   improvements in pagefault latency are realized.

 - David Hildenbrand has some cleanups to our remaining use of
   page_mapcount() in the series "fs/proc: move page_mapcount() to
   fs/proc/internal.h".

 - David also has some highmem accounting cleanups in the series
   "mm/highmem: don't track highmem pages manually".

 - Build-time fixes and cleanups from John Hubbard in the series
   "cleanups, fixes, and progress towards avoiding "make headers"".

 - Cleanups and consolidation of the core pagemap handling from Barry
   Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
   and utilize them".

 - Lance Yang's series "Reclaim lazyfree THP without splitting" has
   reduced the latency of the reclaim of pmd-mapped THPs under fairly
   common circumstances. A 10x speedup is seen in a microbenchmark.

   It does this by punting to another CPU but I guess that's a win unless
   all CPUs are pegged.

 - hugetlb_cgroup cleanups from Xiu Jianfeng in the series
   "mm/hugetlb_cgroup: rework on cftypes".

 - Miaohe Lin's series "Some cleanups for memory-failure" does just that
   thing.

 - Someone other than SeongJae has developed a DAMON feature in Honggyu
   Kim's series "DAMON based tiered memory management for CXL memory".
   This adds DAMON features which may be used to help determine the
   efficiency of our placement of CXL/PCIe attached DRAM.

 - DAMON user API centralization and simplification work in SeongJae
   Park's series "mm/damon: introduce DAMON parameters online commit
   function".

 - In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
   David Hildenbrand does some maintenance work on zsmalloc - partially
   modernizing its use of pageframe fields.

 - Kefeng Wang provides more folio conversions in the series "mm: remove
   page_maybe_dma_pinned() and page_mkclean()".

 - More cleanup from David Hildenbrand, this time in the series
   "mm/memory_hotplug: use PageOffline() instead of PageReserved() for
   !ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
   pages" and permits the removal of some virtio-mem hacks.

 - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
   __folio_add_anon_rmap()" is a cleanup to the anon folio handling in
   preparation for mTHP (multisize THP) swapin.

 - Kefeng Wang's series "mm: improve clear and copy user folio"
   implements more folio conversions, this time in the area of large
   folio userspace copying.

 - The series "Docs/mm/damon/maintainer-profile: document a mailing tool
   and community meetup series" tells people how to get better involved
   with other DAMON developers. From SeongJae Park.

 - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
   that.

 - David Hildenbrand sends along more cleanups, this time against the
   migration code. The series is "mm/migrate: move NUMA hinting fault
   folio isolation + checks under PTL".

 - Jan Kara has found quite a lot of strangenesses and minor errors in
   the readahead code. He addresses this in the series "mm: Fix various
   readahead quirks".

 - SeongJae Park's series "selftests/damon: test DAMOS tried regions and
   {min,max}_nr_regions" adds features and addresses errors in DAMON's
   self testing code.

 - Gavin Shan has found a userspace-triggerable WARN in the pagecache
   code. The series "mm/filemap: Limit page cache size to that supported
   by xarray" addresses this. The series is marked cc:stable.

 - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
   and cleanup" cleans up and slightly optimizes KSM.

 - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
   code motion. The series (which also makes the memcg-v1 code
   Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put
   under config option" and "mm: memcg: put cgroup v1-specific memcg
   data under CONFIG_MEMCG_V1"

 - Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
   adds an additional feature to this cgroup-v2 control file.

 - The series "Userspace controls soft-offline pages" from Jiaqi Yan
   permits userspace to stop the kernel's automatic treatment of
   excessive correctable memory errors, in order to permit userspace to
   monitor and handle this situation.

 - Kefeng Wang's series "mm: migrate: support poison recover from
   migrate folio" teaches the kernel to appropriately handle migration
   from poisoned source folios rather than simply panicking.

 - SeongJae Park's series "Docs/damon: minor fixups and improvements"
   does those things.

 - In the series "mm/zsmalloc: change back to per-size_class lock"
   Chengming Zhou improves zsmalloc's scalability and memory
   utilization.

 - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
   pinning memfd folios" makes the GUP code use FOLL_PIN rather than
   bare refcount increments. So these pages can first be moved aside if
   they reside in the movable zone or a CMA block.

 - Andrii Nakryiko has added a binary ioctl()-based API to
   /proc/pid/maps for much faster reading of vma information. The series
   is "query VMAs from /proc/<pid>/maps".

 - In the series "mm: introduce per-order mTHP split counters" Lance
   Yang improves the kernel's presentation of developer information
   related to multisize THP splitting.

 - Michael Ellerman has developed the series "Reimplement huge pages
   without hugepd on powerpc (8xx, e500, book3s/64)". This permits
   userspace to use all available huge page sizes.

 - In the series "revert unconditional slab and page allocator fault
   injection calls" Vlastimil Babka removes a performance-affecting and
   not very useful feature from slab fault injection.

* tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits)
  mm/mglru: fix ineffective protection calculation
  mm/zswap: fix a white space issue
  mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
  mm/hugetlb: fix possible recursive locking detected warning
  mm/gup: clear the LRU flag of a page before adding to LRU batch
  mm/numa_balancing: teach mpol_to_str about the balancing mode
  mm: memcg1: convert charge move flags to unsigned long long
  alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting
  lib: reuse page_ext_data() to obtain codetag_ref
  lib: add missing newline character in the warning message
  mm/mglru: fix overshooting shrinker memory
  mm/mglru: fix div-by-zero in vmpressure_calc_level()
  mm/kmemleak: replace strncpy() with strscpy()
  mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
  mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
  mm: ignore data-race in __swap_writepage
  hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr
  mm: shmem: rename mTHP shmem counters
  mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()
  mm/migrate: putback split folios when numa hint migration fails
  ...
Linus Torvalds
2024-07-21 17:15:46 -07:00
328 changed files with 12459 additions and 9219 deletions

@@ -155,6 +155,12 @@ Contact: SeongJae Park <sj@kernel.org>
Description: Writing to and reading from this file sets and gets the action
of the scheme.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/target_nid
Date: Jun 2024
Contact: SeongJae Park <sj@kernel.org>
Description: Action's target NUMA node id. Supported by only relevant
actions.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/apply_interval_us
Date: Sep 2023
Contact: SeongJae Park <sj@kernel.org>

@@ -1306,17 +1306,10 @@ PAGE_SIZE multiple when read back.
This is a simple interface to trigger memory reclaim in the
target cgroup.
This file accepts a single key, the number of bytes to reclaim.
No nested keys are currently supported.
Example::
echo "1G" > memory.reclaim
The interface can be later extended with nested keys to
configure the reclaim behavior. For example, specify the
type of memory to reclaim from (anon, file, ..).
Please note that the kernel can over or under reclaim from
the target cgroup. If less bytes are reclaimed than the
specified amount, -EAGAIN is returned.
@@ -1328,6 +1321,17 @@ PAGE_SIZE multiple when read back.
This means that the networking layer will not adapt based on
reclaim induced by memory.reclaim.
The following nested keys are defined.
========== ================================
swappiness Swappiness value to reclaim with
========== ================================
Specifying a swappiness value instructs the kernel to perform
the reclaim with that swappiness value. Note that this has the
same semantics as vm.swappiness applied to memcg reclaim with
all the existing limitations and potential future extensions.
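As a rough illustration of the ``swappiness`` nested key described above, the
sketch below builds the text that would be written to ``memory.reclaim``. The
helper names ``build_reclaim_request`` and ``reclaim`` are hypothetical, and the
0..200 bound is an assumption taken from the usual ``vm.swappiness`` range, not
something this hunk states.

```python
from typing import Optional

# Assumed bounds: the usual vm.swappiness range on current kernels.
MIN_SWAPPINESS = 0
MAX_SWAPPINESS = 200


def build_reclaim_request(amount: str, swappiness: Optional[int] = None) -> str:
    """Return the text to write into a cgroup's memory.reclaim file."""
    if not amount:
        raise ValueError("amount must be a non-empty size such as '1G'")
    if swappiness is None:
        # Plain form: only the number of bytes to reclaim.
        return amount
    if not MIN_SWAPPINESS <= swappiness <= MAX_SWAPPINESS:
        raise ValueError("swappiness out of the assumed 0..200 range")
    # Nested-key form: "<bytes> swappiness=<value>".
    return f"{amount} swappiness={swappiness}"


def reclaim(cgroup_dir: str, amount: str, swappiness: Optional[int] = None) -> None:
    """Write the request. The write fails with EAGAIN (an OSError) when
    fewer bytes than requested were reclaimed."""
    with open(f"{cgroup_dir}/memory.reclaim", "w") as f:
        f.write(build_reclaim_request(amount, swappiness))
```

For example, ``reclaim("/sys/fs/cgroup/mygroup", "1G", 0)`` would ask for 1G to
be reclaimed with swappiness 0; the cgroup path here is only a placeholder.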
memory.peak
A read-only single value file which exists on non-root
cgroups.

@@ -7239,9 +7239,12 @@
vmalloc=nn[KMG] [KNL,BOOT,EARLY] Forces the vmalloc area to have an
exact size of <nn>. This can be used to increase
the minimum size (128MB on x86). It can also be
used to decrease the size and leave more room
for directly mapped kernel RAM.
the minimum size (128MB on x86, arm32 platforms).
It can also be used to decrease the size and leave more room
for directly mapped kernel RAM. Note that this parameter does
not exist on many other platforms (including arm64, alpha,
loongarch, arc, csky, hexagon, microblaze, mips, nios2, openrisc,
parisc, m68k, powerpc, riscv, sh, um, xtensa, s390, sparc).
vmcp_cma=nn[MG] [KNL,S390,EARLY]
Sets the memory size reserved for contiguous memory

@@ -34,18 +34,56 @@ detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is
mounted.
Snapshot Data Access Patterns
=============================
The commands below show the memory access pattern of a program at the moment of
the execution. ::
$ git clone https://github.com/sjp38/masim; cd masim; make
$ sudo damo start "./masim ./configs/stairs.cfg --quiet"
$ sudo ./damo show
0 addr [85.541 TiB , 85.541 TiB ) (57.707 MiB ) access 0 % age 10.400 s
1 addr [85.541 TiB , 85.542 TiB ) (413.285 MiB) access 0 % age 11.400 s
2 addr [127.649 TiB , 127.649 TiB) (57.500 MiB ) access 0 % age 1.600 s
3 addr [127.649 TiB , 127.649 TiB) (32.500 MiB ) access 0 % age 500 ms
4 addr [127.649 TiB , 127.649 TiB) (9.535 MiB ) access 100 % age 300 ms
5 addr [127.649 TiB , 127.649 TiB) (8.000 KiB ) access 60 % age 0 ns
6 addr [127.649 TiB , 127.649 TiB) (6.926 MiB ) access 0 % age 1 s
7 addr [127.998 TiB , 127.998 TiB) (120.000 KiB) access 0 % age 11.100 s
8 addr [127.998 TiB , 127.998 TiB) (8.000 KiB ) access 40 % age 100 ms
9 addr [127.998 TiB , 127.998 TiB) (4.000 KiB ) access 0 % age 11 s
total size: 577.590 MiB
$ sudo ./damo stop
The first command of the above example downloads and builds an artificial
memory access generator program called ``masim``. The second command asks DAMO
to start the artificial generator process via the given command and
make DAMON monitor the generator process. The third command retrieves the
current snapshot of the monitored access pattern of the process from DAMON and
shows the pattern in a human readable format.
Each line of the output shows which virtual address range (``addr [XX, XX)``)
of the process is how frequently (``access XX %``) accessed for how long time
(``age XX``). For example, the fifth region of ~9 MiB size is being most
frequently accessed for last 300 milliseconds. Finally, the fourth command
stops DAMON.
Note that DAMON can monitor not only virtual address spaces but multiple types
of address spaces including the physical address space.
Recording Data Access Patterns
==============================
The commands below record the memory access patterns of a program and save the
monitoring results to a file. ::
$ git clone https://github.com/sjp38/masim
$ cd masim; make; ./masim ./configs/zigzag.cfg &
$ ./masim ./configs/zigzag.cfg &
$ sudo damo record -o damon.data $(pidof masim)
The first two lines of the commands download an artificial memory access
generator program and run it in the background. The generator will repeatedly
The line of the commands run the artificial memory access
generator program again. The generator will repeatedly
access two 100 MiB sized memory regions one by one. You can substitute this
with your real workload. The last line asks ``damo`` to record the access
pattern in the ``damon.data`` file.

@@ -78,7 +78,7 @@ comma (",").
│ │ │ │ │ │ │ │ ...
│ │ │ │ │ │ ...
│ │ │ │ │ :ref:`schemes <sysfs_schemes>`/nr_schemes
│ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,apply_interval_us
│ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,target_nid,apply_interval_us
│ │ │ │ │ │ │ :ref:`access_pattern <sysfs_access_pattern>`/
│ │ │ │ │ │ │ │ sz/min,max
│ │ │ │ │ │ │ │ nr_accesses/min,max
@@ -289,14 +289,18 @@ schemes/<N>/
------------
In each scheme directory, five directories (``access_pattern``, ``quotas``,
``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and two files
(``action`` and ``apply_interval``) exist.
``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and three files
(``action``, ``target_nid`` and ``apply_interval``) exist.
The ``action`` file is for setting and getting the scheme's :ref:`action
<damon_design_damos_action>`. The keywords that can be written to and read
from the file and their meaning are same to those of the list on
:ref:`design doc <damon_design_damos_action>`.
The ``target_nid`` file is for setting the migration target node, which is
only meaningful when the ``action`` is either ``migrate_hot`` or
``migrate_cold``.
The ``apply_interval_us`` file is for setting and getting the scheme's
:ref:`apply_interval <damon_design_damos>` in microseconds.
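A minimal sketch of how the ``action`` and ``target_nid`` files described above
might be driven from user space. The directory layout follows the ABI entry
earlier in this commit; the helper names (``scheme_dir``,
``migration_scheme_writes``, ``apply_writes``) are hypothetical and exist only
for illustration.

```python
import posixpath
from typing import List, Tuple

DAMON_SYSFS_ROOT = "/sys/kernel/mm/damon/admin/kdamonds"

# Per the text above, target_nid is only meaningful for these two actions.
MIGRATE_ACTIONS = ("migrate_hot", "migrate_cold")


def scheme_dir(kdamond: int, context: int, scheme: int,
               root: str = DAMON_SYSFS_ROOT) -> str:
    """Return .../kdamonds/<K>/contexts/<C>/schemes/<S>."""
    return posixpath.join(root, str(kdamond), "contexts", str(context),
                          "schemes", str(scheme))


def migration_scheme_writes(kdamond: int, context: int, scheme: int,
                            action: str, target_nid: int,
                            root: str = DAMON_SYSFS_ROOT) -> List[Tuple[str, str]]:
    """Return (path, value) pairs that select a migration action and its node."""
    if action not in MIGRATE_ACTIONS:
        raise ValueError("target_nid is only meaningful for migrate_hot/migrate_cold")
    if target_nid < 0:
        raise ValueError("target_nid must be a non-negative NUMA node id")
    base = scheme_dir(kdamond, context, scheme, root)
    return [
        (posixpath.join(base, "action"), action),
        (posixpath.join(base, "target_nid"), str(target_nid)),
    ]


def apply_writes(writes: List[Tuple[str, str]]) -> None:
    """Perform the writes; needs root and an existing scheme directory."""
    for path, value in writes:
        with open(path, "w") as f:
            f.write(value)
```

Keeping path construction separate from the actual writes lets the plan be
inspected or logged before anything touches sysfs.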

@@ -118,7 +118,7 @@ Short descriptions to the page flags
21 - KSM
Identical memory pages dynamically shared between one or more processes.
22 - THP
Contiguous pages which construct transparent hugepages.
Contiguous pages which construct THP of any size and mapped by any granularity.
23 - OFFLINE
The page is logically offline.
24 - ZERO_PAGE
@@ -173,27 +173,6 @@ LRU related page flags
The page-types tool in the tools/mm directory can be used to query the
above flags.
Using pagemap to do something useful
====================================
The general procedure for using pagemap to find out about a process' memory
usage goes like this:
1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
mapped to what.
2. Select the maps you are interested in -- all of them, or a particular
library, or the stack or the heap, etc.
3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
4. Read a u64 for each page from pagemap.
5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you
just read, seek to that entry in the file, and read the data you want.
For example, to find the "unique set size" (USS), which is the amount of
memory that a process is using that is not shared with any other process,
you can go through every map in the process, find the PFNs, look those up
in kpagecount, and tally up the number of pages that are only referenced
once.
Exceptions for Shared Memory
============================
@@ -252,7 +231,7 @@ Following flags about pages are currently supported:
- ``PAGE_IS_PRESENT`` - Page is present in the memory
- ``PAGE_IS_SWAPPED`` - Page is in swapped
- ``PAGE_IS_PFNZERO`` - Page has zero PFN
- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed
- ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed
- ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty
The ``struct pm_scan_arg`` is used as the argument of the IOCTL.

@@ -202,12 +202,11 @@ PMD-mappable transparent hugepage::
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
khugepaged will be automatically started when one or more hugepage
sizes are enabled (either by directly setting "always" or "madvise",
or by setting "inherit" while the top-level enabled is set to "always"
or "madvise"), and it'll be automatically shutdown when the last
hugepage size is disabled (either by directly setting "never", or by
setting "inherit" while the top-level enabled is set to "never").
khugepaged will be automatically started when PMD-sized THP is enabled
(either of the per-size anon control or the top-level control are set
to "always" or "madvise"), and it'll be automatically shutdown when
PMD-sized THP is disabled (when both the per-size anon control and the
top-level control are "never")
Khugepaged controls
-------------------
@@ -332,6 +331,31 @@ deny
force
Force the huge option on for all - very useful for testing;
Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to
control mTHP allocation:
'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
and its value for each mTHP is essentially consistent with the global
setting. An 'inherit' option is added to ensure compatibility with these
global settings. Conversely, the options 'force' and 'deny' are dropped,
which are rather testing artifacts from the old ages.
always
Attempt to allocate <size> huge pages every time we need a new page;
inherit
Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
have enabled="inherit" and all other hugepage sizes have enabled="never";
never
Do not allocate <size> huge pages;
within_size
Only allocate <size> huge page if it will be fully within i_size.
Also respect fadvise()/madvise() hints;
advise
Only allocate <size> huge pages if requested with fadvise()/madvise();
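The per-size knob and its five values listed above can be handled from user
space roughly as follows. This is a sketch only: the helper names are
hypothetical, and the assumption that reading the file returns all options with
the active one in square brackets is carried over from the existing top-level
THP controls rather than stated in this hunk.

```python
# Values accepted by the per-size shmem_enabled knob, per the list above.
SHMEM_MTHP_VALUES = ("always", "inherit", "never", "within_size", "advise")


def shmem_enabled_path(size_kb: int) -> str:
    """Return the sysfs path of the per-size shmem mTHP control."""
    return ("/sys/kernel/mm/transparent_hugepage/"
            f"hugepages-{size_kb}kB/shmem_enabled")


def check_value(value: str) -> str:
    """Reject values the per-size knob does not list (e.g. 'force', 'deny')."""
    if value not in SHMEM_MTHP_VALUES:
        raise ValueError(f"unsupported per-size shmem_enabled value: {value!r}")
    return value


def selected_value(file_text: str) -> str:
    """Return the bracketed entry, assuming output like 'always [never] advise'."""
    for token in file_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no bracketed selection found")


def set_shmem_enabled(size_kb: int, value: str) -> None:
    """Write the value; needs root and a kernel that has the knob."""
    with open(shmem_enabled_path(size_kb), "w") as f:
        f.write(check_value(value))
```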
Need of application restart
===========================
@@ -344,10 +368,6 @@ also applies to the regions registered in khugepaged.
Monitoring usage
================
.. note::
Currently the below counters only record events relating to
PMD-sized THP. Events relating to other THP sizes are not included.
The number of PMD-sized anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
To identify what applications are using PMD-sized anonymous transparent huge
@@ -392,20 +412,23 @@ thp_collapse_alloc_failed
the allocation.
thp_file_alloc
is incremented every time a file huge page is successfully
allocated.
is incremented every time a shmem huge page is successfully
allocated (Note that despite being named after "file", the counter
measures only shmem).
thp_file_fallback
is incremented if a file huge page is attempted to be allocated
but fails and instead falls back to using small pages.
is incremented if a shmem huge page is attempted to be allocated
but fails and instead falls back to using small pages. (Note that
despite being named after "file", the counter measures only shmem).
thp_file_fallback_charge
is incremented if a file huge page cannot be charged and instead
is incremented if a shmem huge page cannot be charged and instead
falls back to using small pages even though the allocation was
successful.
successful. (Note that despite being named after "file", the
counter measures only shmem).
thp_file_mapped
is incremented every time a file huge page is mapped into
is incremented every time a file or shmem huge page is mapped into
user address space.
thp_split_page
@@ -476,6 +499,34 @@ swpout_fallback
Usually because failed to allocate some continuous swap space
for the huge page.
shmem_alloc
is incremented every time a shmem huge page is successfully
allocated.
shmem_fallback
is incremented if a shmem huge page is attempted to be allocated
but fails and instead falls back to using small pages.
shmem_fallback_charge
is incremented if a shmem huge page cannot be charged and instead
falls back to using small pages even though the allocation was
successful.
split
is incremented every time a huge page is successfully split into
smaller orders. This can happen for a variety of reasons but a
common reason is that a huge page is old and is being reclaimed.
split_failed
is incremented if kernel fails to split huge
page. This can happen if the page was pinned by somebody.
split_deferred
is incremented when a huge page is put onto split queue.
This happens when a huge page is partially unmapped and splitting
it would free up some memory. Pages on split queue are going to
be split under memory pressure, if splitting is possible.
As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help

@@ -36,6 +36,7 @@ Currently, these files are in /proc/sys/vm:
- dirtytime_expire_seconds
- dirty_writeback_centisecs
- drop_caches
- enable_soft_offline
- extfrag_threshold
- highmem_is_dirtyable
- hugetlb_shm_group
@@ -267,6 +268,43 @@ used::
These are informational only. They do not mean that anything is wrong
with your system. To disable them, echo 4 (bit 2) into drop_caches.
enable_soft_offline
===================
Correctable memory errors are very common on servers. Soft-offline is kernel's
solution for memory pages having (excessive) corrected memory errors.
For different types of page, soft-offline has different behaviors / costs.
- For a raw error page, soft-offline migrates the in-use page's content to
a new raw page.
- For a page that is part of a transparent hugepage, soft-offline splits the
transparent hugepage into raw pages, then migrates only the raw error page.
As a result, user is transparently backed by 1 less hugepage, impacting
memory access performance.
- For a page that is part of a HugeTLB hugepage, soft-offline first migrates
the entire HugeTLB hugepage, during which a free hugepage will be consumed
as migration target. Then the original hugepage is dissolved into raw
pages without compensation, reducing the capacity of the HugeTLB pool by 1.
It is user's call to choose between reliability (staying away from fragile
physical memory) vs performance / capacity implications in transparent and
HugeTLB cases.
For all architectures, enable_soft_offline controls whether to soft offline
memory pages. When set to 1, kernel attempts to soft offline the pages
whenever it thinks needed. When set to 0, kernel returns EOPNOTSUPP to
the request to soft offline the pages. Its default value is 1.
It is worth mentioning that after setting enable_soft_offline to 0, the
following requests to soft offline pages will not be performed:
- Request to soft offline pages from RAS Correctable Errors Collector.
- On ARM, the request to soft offline pages from GHES driver.
- On PARISC, the request to soft offline pages from Page Deallocation Table.
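A small sketch of toggling the sysctl described above from user space. The path
follows from the ``/proc/sys/vm`` listing in this file; the function names are
hypothetical, and the ``path`` parameter exists only so the helpers can be tried
against an ordinary file.

```python
SOFT_OFFLINE_SYSCTL = "/proc/sys/vm/enable_soft_offline"


def soft_offline_enabled(path: str = SOFT_OFFLINE_SYSCTL) -> bool:
    """Return True when the knob reads 1 (the documented default)."""
    with open(path) as f:
        return f.read().strip() == "1"


def set_soft_offline(enabled: bool, path: str = SOFT_OFFLINE_SYSCTL) -> None:
    """Write 1 or 0. With 0, soft-offline requests are refused with
    EOPNOTSUPP as described above. Writing the real sysctl needs root."""
    with open(path, "w") as f:
        f.write("1" if enabled else "0")
```

A deployment that prefers HugeTLB pool capacity over avoiding pages with
corrected errors might call ``set_soft_offline(False)`` at boot and monitor
corrected errors itself.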
extfrag_threshold
=================

@@ -132,7 +132,7 @@ CASE 1: Direct IO (DIO)
-----------------------
There are GUP references to pages that are serving
as DIO buffers. These buffers are needed for a relatively short time (so they
are not "long term"). No special synchronization with page_mkclean() or
are not "long term"). No special synchronization with folio_mkclean() or
munmap() is provided. Therefore, flags to set at the call site are: ::
FOLL_PIN
@@ -144,7 +144,7 @@ CASE 2: RDMA
------------
There are GUP references to pages that are serving as DMA
buffers. These buffers are needed for a long time ("long term"). No special
synchronization with page_mkclean() or munmap() is provided. Therefore, flags
synchronization with folio_mkclean() or munmap() is provided. Therefore, flags
to set at the call site are: ::
FOLL_PIN | FOLL_LONGTERM
@@ -170,7 +170,7 @@ callback, simply remove the range from the device's page tables.
Either way, as long as the driver unpins the pages upon mmu notifier callback,
then there is proper synchronization with both filesystem and mm
(page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set.
(folio_mkclean(), munmap(), etc). Therefore, neither flag needs to be set.
CASE 4: Pinning for struct page manipulation only
-------------------------------------------------
@@ -196,20 +196,20 @@ INCORRECT (uses FOLL_GET calls):
write to the data within the pages
put_page()
page_maybe_dma_pinned(): the whole point of pinning
===================================================
folio_maybe_dma_pinned(): the whole point of pinning
====================================================
The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
The whole point of marking folios as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this folio DMA-pinned?" That allows code such as folio_mkclean()
(and file system writeback code in general) to make informed decisions about
what to do when a page cannot be unmapped due to such pins.
what to do when a folio cannot be unmapped due to such pins.
What to do in those cases is the subject of a years-long series of discussions
and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available: ::
static inline bool page_maybe_dma_pinned(struct page *page)
static inline bool folio_maybe_dma_pinned(struct folio *folio)
...is a prerequisite to solving the long-running gup+DMA problem.

@@ -110,6 +110,13 @@ in the Makefile. Think of this as applying ``__no_sanitize_memory`` to every
function in the file or directory. Most users won't need KMSAN_SANITIZE, unless
their code gets broken by KMSAN (e.g. runs at early boot time).
KMSAN checks can also be temporarily disabled for the current task using
``kmsan_disable_current()`` and ``kmsan_enable_current()`` calls. Each
``kmsan_enable_current()`` call must be preceded by a
``kmsan_disable_current()`` call; these call pairs may be nested. One needs to
be careful with these calls, keeping the regions short and preferring other
ways to disable instrumentation, where possible.
Support
=======
@@ -338,11 +345,11 @@ Per-task KMSAN state
~~~~~~~~~~~~~~~~~~~~
Every task_struct has an associated KMSAN task state that holds the KMSAN
context (see above) and a per-task flag disallowing KMSAN reports::
context (see above) and a per-task counter disallowing KMSAN reports::
struct kmsan_context {
...
bool allow_reporting;
unsigned int depth;
struct kmsan_context_state cstate;
...
}
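The switch from a boolean flag to a counter is what makes nested
disable/enable pairs work. The toy model below is plain Python, not kernel
code; it assumes that reports are allowed only while the counter is zero, and
the class and method names are made up for illustration.

```python
class KmsanTaskState:
    """Toy model of a per-task 'depth' counter that gates reporting."""

    def __init__(self) -> None:
        self.depth = 0  # 0 means reporting is allowed

    def disable_current(self) -> None:
        # Each disable pushes one nesting level.
        self.depth += 1

    def enable_current(self) -> None:
        # Each enable must be preceded by a matching disable.
        if self.depth == 0:
            raise RuntimeError("enable without a preceding disable")
        self.depth -= 1

    def reporting_allowed(self) -> bool:
        return self.depth == 0
```

With a single boolean, the inner ``enable`` of a nested pair would switch
reporting back on while the outer region is still open; a counter avoids that.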

@@ -443,6 +443,15 @@ is not associated with a file:
or if empty, the mapping is anonymous.
Starting with 6.11 kernel, /proc/PID/maps provides an alternative
ioctl()-based API that gives ability to flexibly and efficiently query and
filter individual VMAs. This interface is binary and is meant for more
efficient and easy programmatic use. `struct procmap_query`, defined in
linux/fs.h UAPI header, serves as an input/output argument to the
`PROCMAP_QUERY` ioctl() command. See comments in linux/fs.h UAPI header for
details on query semantics, supported flags, data returned, and general API
usage information.
The /proc/PID/smaps is an extension based on maps, showing the memory
consumption for each of the process's mappings. For each mapping (aka Virtual
Memory Area, or VMA) there is a series of lines such as the following::

@@ -90,8 +90,6 @@ PMD Page Table Helpers
+---------------------------+--------------------------------------------------+
| pmd_leaf | Tests a leaf mapped PMD |
+---------------------------+--------------------------------------------------+
| pmd_huge | Tests a HugeTLB mapped PMD |
+---------------------------+--------------------------------------------------+
| pmd_trans_huge | Tests a Transparent Huge Page (THP) at PMD |
+---------------------------+--------------------------------------------------+
| pmd_present | Tests whether pmd_page() points to valid memory |
@@ -169,8 +167,6 @@ PUD Page Table Helpers
+---------------------------+--------------------------------------------------+
| pud_leaf | Tests a leaf mapped PUD |
+---------------------------+--------------------------------------------------+
| pud_huge | Tests a HugeTLB mapped PUD |
+---------------------------+--------------------------------------------------+
| pud_trans_huge | Tests a Transparent Huge Page (THP) at PUD |
+---------------------------+--------------------------------------------------+
| pud_present | Tests a valid mapped PUD |

@@ -16,53 +16,24 @@ called DAMON ``context``. DAMON executes each context with a kernel thread
called ``kdamond``. Multiple kdamonds could run in parallel, for different
types of monitoring.
To know how user-space can do the configurations and start/stop DAMON, refer to
:ref:`DAMON sysfs interface <sysfs_interface>` documentation.
Overall Architecture
====================
DAMON subsystem is configured with three layers including
- Operations Set: Implements fundamental operations for DAMON that depends on
the given monitoring target address-space and available set of
software/hardware primitives,
- Core: Implements core logics including monitoring overhead/accurach control
and access-aware system operations on top of the operations set layer, and
- Modules: Implements kernel modules for various purposes that provides
interfaces for the user space, on top of the core layer.
.. _damon_design_configurable_operations_set:
Configurable Operations Set
---------------------------
For data access monitoring and additional low level work, DAMON needs a set of
implementations for specific operations that are dependent on and optimized for
the given target address space. On the other hand, the accuracy and overhead
tradeoff mechanism, which is the core logic of DAMON, is in the pure logic
space. DAMON separates the two parts in different layers, namely DAMON
Operations Set and DAMON Core Logics Layers, respectively. It further defines
the interface between the layers to allow various operations sets to be
configured with the core logic.
Due to this design, users can extend DAMON for any address space by configuring
the core logic to use the appropriate operations set. If any appropriate set
is unavailable, users can implement one on their own.
For example, physical memory, virtual memory, swap space, those for specific
processes, NUMA nodes, files, and backing memory devices would be supportable.
Also, if some architectures or devices supporting special optimized access
check primitives, those will be easily configurable.
Programmable Modules
--------------------
Core layer of DAMON is implemented as a framework, and exposes its application
programming interface to all kernel space components such as subsystems and
modules. For common use cases of DAMON, DAMON subsystem provides kernel
modules that built on top of the core layer using the API, which can be easily
used by the user space end users.
- :ref:`Operations Set <damon_operations_set>`: Implements fundamental
operations for DAMON that depends on the given monitoring target
address-space and available set of software/hardware primitives,
- :ref:`Core <damon_core_logic>`: Implements core logics including monitoring
overhead/accuracy control and access-aware system operations on top of the
operations set layer, and
- :ref:`Modules <damon_modules>`: Implements kernel modules for various
purposes that provides interfaces for the user space, on top of the core
layer.
.. _damon_operations_set:
@@ -70,11 +41,32 @@ used by the user space end users.
Operations Set Layer
====================
The monitoring operations are defined in two parts:
.. _damon_design_configurable_operations_set:
For data access monitoring and additional low level work, DAMON needs a set of
implementations for specific operations that are dependent on and optimized for
the given target address space. For example, below two operations for access
monitoring are address-space dependent.
1. Identification of the monitoring target address range for the address space.
2. Access check of specific address range in the target space.
DAMON consolidates these implementations in a layer called DAMON Operations
Set, and defines the interface between it and the upper layer. The upper layer
is dedicated for DAMON's core logics including the mechanism for control of the
monitoring accuracy and the overhead.
Hence, DAMON can easily be extended for any address space and/or available
hardware features by configuring the core logic to use the appropriate
operations set. If there is no available operations set for a given purpose, a
new operations set can be implemented following the interface between the
layers.
For example, physical memory, virtual memory, swap space, those for specific
processes, NUMA nodes, files, and backing memory devices would be supportable.
Also, if some architectures or devices support special optimized access check
features, those will be easily configurable.
DAMON currently provides below three operation sets. Below two subsections
describe how those work.
@@ -82,6 +74,10 @@ describe how those work.
- fvaddr: Monitor fixed virtual address ranges
- paddr: Monitor the physical address space of the system
To know how user-space can do the configuration via :ref:`DAMON sysfs interface
<sysfs_interface>`, refer to :ref:`operations <sysfs_context>` file part of the
documentation.
.. _damon_design_vaddr_target_regions_construction:
@@ -140,9 +136,12 @@ conflict with the reclaim logic using ``PG_idle`` and ``PG_young`` page flags,
as Idle page tracking does.
.. _damon_core_logic:
Core Logics
===========
.. _damon_design_monitoring:
Monitoring
----------
@@ -152,6 +151,10 @@ monitoring attributes, ``sampling interval``, ``aggregation interval``,
``update interval``, ``minimum number of regions``, and ``maximum number of
regions``.
To know how user-space can set the attributes via :ref:`DAMON sysfs interface
<sysfs_interface>`, refer to :ref:`monitoring_attrs <sysfs_monitoring_attrs>`
part of the documentation.
Access Frequency Monitoring
~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -192,7 +195,7 @@ one page in the region is required to be checked. Thus, for each ``sampling
interval``, DAMON randomly picks one page in each region, waits for one
``sampling interval``, checks whether the page is accessed meanwhile, and
increases the access frequency counter of the region if so. The counter is
called ``nr_regions`` of the region. Therefore, the monitoring overhead is
called ``nr_accesses`` of the region. Therefore, the monitoring overhead is
controllable by setting the number of regions. DAMON allows users to set the
minimum and the maximum number of regions for the trade-off.
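
The sampling step described in this hunk can be sketched in plain C as below. This is only an illustration under stated assumptions, not the kernel implementation: ``struct region``, ``PAGE_SZ``, ``access_check_fn``, ``next_rand()``, ``aggregate_accesses()`` and the two example callbacks are all hypothetical names, and a small deterministic generator stands in for the random page pick so the result is reproducible.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SZ 4096UL	/* hypothetical page size used by this sketch */

/* Hypothetical stand-in for a monitored region; not the kernel's struct. */
struct region {
	unsigned long start;		/* inclusive start address */
	unsigned long end;		/* exclusive end address */
	unsigned int nr_accesses;	/* access frequency counter */
};

/* Hypothetical access check: was the page at addr touched meanwhile? */
typedef bool (*access_check_fn)(unsigned long addr);

/* Example callbacks: every page was accessed, or none was. */
static bool always_accessed(unsigned long addr) { (void)addr; return true; }
static bool never_accessed(unsigned long addr) { (void)addr; return false; }

/* Tiny deterministic pseudo-random generator (assumption, for repeatability). */
static unsigned long next_rand(unsigned long *state)
{
	*state = *state * 1103515245UL + 12345UL;
	return (*state >> 16) & 0x7fff;
}

/*
 * Simulate one aggregation interval: once per sampling interval, pick one
 * page of the region, check it, and bump nr_accesses if it was accessed.
 * The counter can therefore never exceed aggr_us / sample_us.
 */
static unsigned int aggregate_accesses(struct region *r,
				       unsigned long sample_us,
				       unsigned long aggr_us,
				       access_check_fn accessed,
				       unsigned long seed)
{
	unsigned long nr_samples, nr_pages, i;

	assert(sample_us > 0 && r->end > r->start);
	nr_samples = aggr_us / sample_us;
	nr_pages = (r->end - r->start + PAGE_SZ - 1) / PAGE_SZ;

	r->nr_accesses = 0;
	for (i = 0; i < nr_samples; i++) {
		unsigned long page = next_rand(&seed) % nr_pages;

		if (accessed(r->start + page * PAGE_SZ))
			r->nr_accesses++;
	}
	return r->nr_accesses;
}
```

Because only one page per region is checked in each sampling interval, the cost of an aggregation interval in this sketch grows with the number of regions rather than with the size of the monitored memory, which is the trade-off the text refers to.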
@@ -209,11 +212,18 @@ the data access pattern can be dynamically changed. This will result in low
monitoring quality. To keep the assumption as much as possible, DAMON
adaptively merges and splits each region based on their access frequency.
For each ``aggregation interval``, it compares the access frequencies of
adjacent regions and merges those if the frequency difference is small. Then,
after it reports and clears the aggregated access frequency of each region, it
splits each region into two or three regions if the total number of regions
will not exceed the user-specified maximum number of regions after the split.
For each ``aggregation interval``, it compares the access frequencies
(``nr_accesses``) of adjacent regions. If the difference is small, and if the
sum of the two regions' sizes is smaller than the size of total regions divided
by the ``minimum number of regions``, DAMON merges the two regions. If the
resulting number of total regions is still higher than ``maximum number of
regions``, it repeats the merging with increasing access frequencies difference
threshold until the upper-limit of the number of regions is met, or the
threshold becomes higher than possible maximum value (``aggregation interval``
divided by ``sampling interval``). Then, after it reports and clears the
aggregated access frequency of each region, it splits each region into two or
three regions if the total number of regions will not exceed the user-specified
maximum number of regions after the split.
In this way, DAMON provides its best-effort quality and minimal overhead while
keeping the bounds users set for their trade-off.
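
The merge rule added by this hunk can be sketched in plain C as follows. This is a minimal illustration under stated assumptions, not the kernel code: ``struct region``, ``abs_diff()``, ``merge_pass()`` and ``merge_regions()`` are hypothetical names, both bounds are treated as inclusive, the merged counter is a size-weighted average, and the threshold is doubled on each retry. The text above only says that the threshold increases; the doubling, the averaging and the inclusive comparisons are choices made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a monitored region; not the kernel's struct. */
struct region {
	unsigned long start;		/* inclusive start address */
	unsigned long end;		/* exclusive end address */
	unsigned int nr_accesses;	/* access frequency counter */
};

static unsigned int abs_diff(unsigned int a, unsigned int b)
{
	return a > b ? a - b : b - a;
}

/*
 * One merge pass over an address-sorted array. Two adjacent regions are
 * merged when their nr_accesses differ by at most thres and their combined
 * size does not exceed sz_limit. Returns the new number of regions.
 */
static size_t merge_pass(struct region *r, size_t n, unsigned int thres,
			 unsigned long sz_limit)
{
	size_t out = 0, i;

	if (n == 0)
		return 0;
	for (i = 1; i < n; i++) {
		unsigned long sz_out = r[out].end - r[out].start;
		unsigned long sz_i = r[i].end - r[i].start;

		if (r[out].end == r[i].start &&
		    abs_diff(r[out].nr_accesses, r[i].nr_accesses) <= thres &&
		    sz_out + sz_i <= sz_limit) {
			/* Assumption: size-weighted average of the counters. */
			r[out].nr_accesses = (unsigned int)
				(((unsigned long long)r[out].nr_accesses * sz_out +
				  (unsigned long long)r[i].nr_accesses * sz_i) /
				 (sz_out + sz_i));
			r[out].end = r[i].end;
		} else {
			r[++out] = r[i];
		}
	}
	return out + 1;
}

/*
 * Repeat the pass with a growing threshold until the region count fits
 * max_nr, or the threshold exceeds max_thres (aggregation interval divided
 * by sampling interval). Returns the final number of regions.
 */
static size_t merge_regions(struct region *r, size_t n, unsigned int thres,
			    unsigned int max_thres, size_t min_nr,
			    size_t max_nr)
{
	unsigned long total = 0, sz_limit;
	size_t i;

	assert(min_nr > 0);
	for (i = 0; i < n; i++)
		total += r[i].end - r[i].start;
	sz_limit = total / min_nr;

	do {
		n = merge_pass(r, n, thres, sz_limit);
		thres = thres ? thres * 2 : 1;	/* assumption: double each time */
	} while (n > max_nr && thres <= max_thres);

	return n;
}
```

The size limit is what keeps the ``minimum number of regions`` bound: since no merged region may be larger than the total size divided by that minimum, merging alone cannot collapse the target into fewer regions than the user asked for.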
@@ -248,6 +258,11 @@ and applies it to monitoring operations-related data structures such as the
abstracted monitoring target memory area only for each of a user-specified time
interval (``update interval``).
User-space can get the monitoring results via DAMON sysfs interface and/or
tracepoints. For more details, please refer to the documentations for
:ref:`DAMOS tried regions <sysfs_schemes_tried_regions>` and :ref:`tracepoint`,
respectively.
.. _damon_design_damos:
@@ -288,6 +303,10 @@ the access pattern of interest, and applies the user-desired operation actions
to the regions, for every user-specified time interval called
``apply_interval``.
To know how user-space can set ``apply_interval`` via :ref:`DAMON sysfs
interface <sysfs_interface>`, refer to :ref:`apply_interval_us <sysfs_scheme>`
part of the documentation.
.. _damon_design_damos_action:
@@ -325,6 +344,10 @@ that supports each action are as below.
Supported by ``paddr`` operations set.
- ``lru_deprio``: Deprioritize the region on its LRU lists.
Supported by ``paddr`` operations set.
- ``migrate_hot``: Migrate the regions prioritizing warmer regions.
Supported by ``paddr`` operations set.
- ``migrate_cold``: Migrate the regions prioritizing colder regions.
Supported by ``paddr`` operations set.
- ``stat``: Do nothing but count the statistics.
Supported by all operations sets.
@@ -332,6 +355,10 @@ Applying the actions except ``stat`` to a region is considered as changing the
region's characteristics. Hence, DAMOS resets the age of regions when any such
actions are applied to those.
To know how user-space can set the action via :ref:`DAMON sysfs interface
<sysfs_interface>`, refer to :ref:`action <sysfs_scheme>` part of the
documentation.
.. _damon_design_damos_access_pattern:
@@ -345,6 +372,10 @@ interest by setting minimum and maximum values of the three properties. If a
region's three properties are in the ranges, DAMOS classifies it as one of the
regions that the scheme is having an interest in.
To know how user-space can set the access pattern via :ref:`DAMON sysfs
interface <sysfs_interface>`, refer to :ref:`access_pattern
<sysfs_access_pattern>` part of the documentation.
.. _damon_design_damos_quotas:
@@ -364,6 +395,10 @@ feature called quotas. It lets users specify an upper limit of time that DAMOS
can use for applying the action, and/or a maximum bytes of memory regions that
the action can be applied within a user-specified time duration.
To know how user-space can set the basic quotas via :ref:`DAMON sysfs interface
<sysfs_interface>`, refer to :ref:`quotas <sysfs_quotas>` part of the
documentation.
.. _damon_design_damos_quotas_prioritization:
@@ -391,6 +426,10 @@ information to the underlying mechanism. Nevertheless, how and even whether
the weight will be respected are up to the underlying prioritization mechanism
implementation.
To know how user-space can set the prioritization weights via :ref:`DAMON sysfs
interface <sysfs_interface>`, refer to :ref:`weights <sysfs_quotas>` part of
the documentation.
.. _damon_design_damos_quotas_auto_tuning:
@@ -420,6 +459,10 @@ Currently, two ``target_metric`` are provided.
DAMOS does the measurement on its own, so only ``target_value`` need to be
set by users at the initial time. In other words, DAMOS does self-feedback.
To know how user-space can set the tuning goal metric, the target value, and/or
the current value via :ref:`DAMON sysfs interface <sysfs_interface>`, refer to
:ref:`quota goals <sysfs_schemes_quota_goals>` part of the documentation.
.. _damon_design_damos_watermarks:
@@ -442,6 +485,10 @@ is activated. If all schemes are deactivated by the watermarks, the monitoring
is also deactivated. In this case, the DAMON worker thread only periodically
checks the watermarks and therefore incurs nearly zero overhead.
To know how user-space can set the watermarks via :ref:`DAMON sysfs interface
<sysfs_interface>`, refer to :ref:`watermarks <sysfs_watermarks>` part of the
documentation.
.. _damon_design_damos_filters:
@@ -488,6 +535,10 @@ Below types of filters are currently supported.
- Applied to pages that belonging to a given DAMON monitoring target.
- Handled by the core logic.
To know how user-space can set the filters via :ref:`DAMON sysfs interface
<sysfs_interface>`, refer to :ref:`filters <sysfs_filters>` part of the
documentation.
Application Programming Interface
---------------------------------
@@ -501,6 +552,8 @@ interface, namely ``include/linux/damon.h``. Please refer to the API
:doc:`document </mm/damon/api>` for details of the interface.
.. _damon_modules:
Modules
=======


@@ -6,7 +6,7 @@ DAMON: Data Access MONitor
DAMON is a Linux kernel subsystem that provides a framework for data access
monitoring and the monitoring results based system operations. The core
monitoring mechanisms of DAMON (refer to :doc:`design` for the detail) make it
monitoring :ref:`mechanisms <damon_design_monitoring>` of DAMON make it
- *accurate* (the monitoring output is useful enough for DRAM level memory
management; It might not appropriate for CPU Cache levels, though),
@@ -16,15 +16,16 @@ monitoring mechanisms of DAMON (refer to :doc:`design` for the detail) make it
of the size of target workloads).
Using this framework, therefore, the kernel can operate system in an
access-aware fashion. Because the features are also exposed to the user space,
users who have special information about their workloads can write personalized
applications for better understanding and optimizations of their workloads and
systems.
access-aware fashion. Because the features are also exposed to the :doc:`user
space </admin-guide/mm/damon/index>`, users who have special information about
their workloads can write personalized applications for better understanding
and optimizations of their workloads and systems.
For easier development of such systems, DAMON provides a feature called DAMOS
(DAMon-based Operation Schemes) in addition to the monitoring. Using the
feature, DAMON users in both kernel and user spaces can do access-aware system
operations with no code but simple configurations.
For easier development of such systems, DAMON provides a feature called
:ref:`DAMOS <damon_design_damos>` (DAMon-based Operation Schemes) in addition
to the monitoring. Using the feature, DAMON users in both kernel and :doc:`user
spaces </admin-guide/mm/damon/index>` can do access-aware system operations
with no code but simple configurations.
.. toctree::
:maxdepth: 2
@@ -33,3 +34,6 @@ operations with no code but simple configurations.
design
api
maintainer-profile
To utilize and control DAMON from the user-space, please refer to the
administration :doc:`guide </admin-guide/mm/damon/index>`.


@@ -53,6 +53,40 @@ Mon-Fri) in PT (Pacific Time). The response to patches will occasionally be
slow. Do not hesitate to send a ping if you have not heard back within a week
of sending a patch.
Mailing tool
------------
Like many other Linux kernel subsystems, DAMON uses the mailing lists
(damon@lists.linux.dev and linux-mm@kvack.org) as the major communication
channel. There is a simple tool called HacKerMaiL (``hkml``) [8]_ , which is
for people who are not very familiar with the mailing lists based
communication. The tool could be particularly helpful for DAMON community
members since it is developed and maintained by DAMON maintainer. The tool is
also officially announced to support DAMON and general Linux kernel development
workflow.
In other words, ``hkml`` [8]_ is a mailing tool for DAMON community, which
DAMON maintainer is committed to support. Please feel free to try and report
issues or feature requests for the tool to the maintainer.
Community meetup
----------------
DAMON community is maintaining two bi-weekly meetup series for community
members who prefer synchronous conversations over mails.
The first one is for any discussion between every community member. No
reservation is needed.
The second one is for discussions on specific topics between restricted
members including the maintainer. The maintainer shares the available time
slots, and attendees should reserve one of those at least 24 hours before the
time slot, by reaching out to the maintainer.
Schedules and available reservation time slots are available at the Google doc
[9]_ . DAMON maintainer will also provide periodic reminder to the mailing
list (damon@lists.linux.dev).
.. [1] https://git.kernel.org/akpm/mm/h/mm-unstable
.. [2] https://git.kernel.org/sj/h/damon/next
@@ -61,3 +95,5 @@ of sending a patch.
.. [5] https://github.com/awslabs/damon-tests/blob/master/corr/tests/kunit.sh
.. [6] https://github.com/awslabs/damon-tests/tree/master/corr
.. [7] https://github.com/awslabs/damon-tests/tree/master/perf
.. [8] https://github.com/damonitor/hackermail
.. [9] https://docs.google.com/document/d/1v43Kcj3ly4CYqmAkMaZzLiM2GEnWfgdGbZAH3mi2vpM/edit?usp=sharing


@@ -191,13 +191,13 @@ have become evictable again (via munlock() for example) and have been "rescued"
from the unevictable list. However, there may be situations where we decide,
for the sake of expediency, to leave an unevictable folio on one of the regular
active/inactive LRU lists for vmscan to deal with. vmscan checks for such
folios in all of the shrink_{active|inactive|page}_list() functions and will
folios in all of the shrink_{active|inactive|folio}_list() functions and will
"cull" such folios that it encounters: that is, it diverts those folios to the
unevictable list for the memory cgroup and node being scanned.
There may be situations where a folio is mapped into a VM_LOCKED VMA,
but the folio does not have the mlocked flag set. Such folios will make
it all the way to shrink_active_list() or shrink_page_list() where they
it all the way to shrink_active_list() or shrink_folio_list() where they
will be detected when vmscan walks the reverse map in folio_referenced()
or try_to_unmap(). The folio is culled to the unevictable list when it
is released by the shrinker.
@@ -269,7 +269,7 @@ the LRU. Such pages can be "noticed" by memory management in several places:
(4) in the fault path and when a VM_LOCKED stack segment is expanded; or
(5) as mentioned above, in vmscan:shrink_page_list() when attempting to
(5) as mentioned above, in vmscan:shrink_folio_list() when attempting to
reclaim a page in a VM_LOCKED VMA by folio_referenced() or try_to_unmap().
mlocked pages become unlocked and rescued from the unevictable list when:
@@ -548,12 +548,12 @@ Some examples of these unevictable pages on the LRU lists are:
(3) pages still mapped into VM_LOCKED VMAs, which should be marked mlocked,
but events left mlock_count too low, so they were munlocked too early.
vmscan's shrink_inactive_list() and shrink_page_list() also divert obviously
vmscan's shrink_inactive_list() and shrink_folio_list() also divert obviously
unevictable pages found on the inactive lists to the appropriate memory cgroup
and node unevictable list.
rmap's folio_referenced_one(), called via vmscan's shrink_active_list() or
shrink_page_list(), and rmap's try_to_unmap_one() called via shrink_page_list(),
shrink_folio_list(), and rmap's try_to_unmap_one() called via shrink_folio_list(),
check for (3) pages still mapped into VM_LOCKED VMAs, and call mlock_vma_folio()
to correct them. Such pages are culled to the unevictable list when released
by the shrinker.


@@ -5701,6 +5701,8 @@ L: linux-mm@kvack.org
S: Maintained
F: include/linux/memcontrol.h
F: mm/memcontrol.c
F: mm/memcontrol-v1.c
F: mm/memcontrol-v1.h
F: mm/swap_cgroup.c
F: samples/cgroup/*
F: tools/testing/selftests/cgroup/memcg_protection.m


@@ -283,7 +283,7 @@ void flush_cache_pages(struct vm_area_struct *vma, unsigned long user_addr,
* flush_dcache_page is used when the kernel has written to the page
* cache page at virtual address page->virtual.
*
* If this page isn't mapped (ie, page_mapping == NULL), or it might
* If this page isn't mapped (ie, folio_mapping == NULL), or it might
* have userspace mappings, then we _must_ always clean + invalidate
* the dcache entries associated with the kernel mapping.
*


@@ -13,12 +13,12 @@
/*
* If our huge pte is non-zero then mark the valid bit.
* This allows pte_present(huge_ptep_get(ptep)) to return true for non-zero
* This allows pte_present(huge_ptep_get(mm,addr,ptep)) to return true for non-zero
* ptes.
* (The valid bit is automatically cleared by set_pte_at for PROT_NONE ptes).
*/
#define __HAVE_ARCH_HUGE_PTEP_GET
static inline pte_t huge_ptep_get(pte_t *ptep)
static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
pte_t retval = *ptep;
if (pte_val(retval))


@@ -117,7 +117,7 @@ extern void copy_to_user_page(struct vm_area_struct *, struct page *,
* flush_dcache_folio is used when the kernel has written to the page
* cache page at virtual address page->virtual.
*
* If this page isn't mapped (ie, page_mapping == NULL), or it might
* If this page isn't mapped (ie, folio_mapping == NULL), or it might
* have userspace mappings, then we _must_ always clean + invalidate
* the dcache entries associated with the kernel mapping.
*

Some files were not shown because too many files have changed in this diff.