Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Some swap cleanups from Ma Wupeng ("fix WARN_ON in
add_to_avail_list")
- Peter Xu has a series ("mm/gup: Unify hugetlb, speed up thp") which
reduces the special-case code for handling hugetlb pages in GUP. It
also speeds up GUP handling of transparent hugepages.
- Peng Zhang provides some maple tree speedups ("Optimize the fast path
of mas_store()").
- Sergey Senozhatsky has improved the performance of zsmalloc during
compaction ("zsmalloc: small compaction improvements").
- Domenico Cerasuolo has developed additional selftest code for zswap
("selftests: cgroup: add zswap test program").
- xu xin has done some work on KSM's handling of zero pages. These
changes are mainly to enable the user to better understand the
effectiveness of KSM's treatment of zero pages ("ksm: support
tracking KSM-placed zero-pages").
- Jeff Xu has fixed the behaviour of memfd's
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
- David Howells has fixed an fscache optimization ("mm, netfs, fscache:
Stop read optimisation when folio removed from pagecache").
- Axel Rasmussen has given userfaultfd the ability to simulate memory
poisoning ("add UFFDIO_POISON to simulate memory poisoning with
UFFD").
- Miaohe Lin has contributed some routine maintenance work on the
memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
check").
- Peng Zhang has contributed some maintenance work on the maple tree
code ("Improve the validation for maple tree and some cleanup").
- Hugh Dickins has optimized the collapsing of shmem or file pages into
THPs ("mm: free retracted page table by RCU").
- Jiaqi Yan has a patch series which permits us to use the healthy
subpages within a hardware poisoned huge page for general purposes
("Improve hugetlbfs read on HWPOISON hugepages").
- Kemeng Shi has done some maintenance work on the pagetable-check code
("Remove unused parameters in page_table_check").
- More folioification work from Matthew Wilcox ("More filesystem folio
conversions for 6.6"), ("Followup folio conversions for zswap"). And
from ZhangPeng ("Convert several functions in page_io.c to use a
folio").
- page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
- Baoquan He has converted some architectures to use the
GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
architectures to take GENERIC_IOREMAP way").
- Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration").
- Better maple tree lockdep checking from Liam Howlett ("More strict
maple tree lockdep"). Liam also developed some efficiency
improvements ("Reduce preallocations for maple tree").
- Cleanup and optimization to the secondary IOMMU TLB invalidation,
from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
upgrade").
- Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
for arm64").
- Kemeng Shi provides some maintenance work on the compaction code
("Two minor cleanups for compaction").
- Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
most file-backed faults under the VMA lock").
- Aneesh Kumar contributes code to use the vmemmap optimization for DAX
on ppc64, under some circumstances ("Add support for DAX vmemmap
optimization for ppc64").
- page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
data in page_ext"), ("minor cleanups to page_ext header").
- Some zswap cleanups from Johannes Weiner ("mm: zswap: three
cleanups").
- kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
- VMA handling cleanups from Kefeng Wang ("mm: convert to
vma_is_initial_heap/stack()").
- DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
address ranges and DAMON monitoring targets").
- Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
- Liam Howlett has improved the maple tree node replacement code
("maple_tree: Change replacement strategy").
- ZhangPeng has a general code cleanup - use the K() macro more widely
("cleanup with helper macro K()").
- Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
memmap on memory feature on ppc64").
- pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
in page_alloc"), ("Two minor cleanups for get pageblock
migratetype").
- Vishal Moola introduces a memory descriptor for page table tracking,
"struct ptdesc" ("Split ptdesc from struct page").
- memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
for vm.memfd_noexec").
- MM include file rationalization from Hugh Dickins ("arch: include
asm/cacheflush.h in asm/hugetlb.h").
- THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
output").
- kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
object_cache instead of kmemleak_initialized").
- More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
and _folio_order").
- A VMA locking scalability improvement from Suren Baghdasaryan
("Per-VMA lock support for swap and userfaults").
- pagetable handling cleanups from Matthew Wilcox ("New page table
range API").
- A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
using page->private on tail pages for THP_SWAP + cleanups").
- Cleanups and speedups to the hugetlb fault handling from Matthew
Wilcox ("Change calling convention for ->huge_fault").
- Matthew Wilcox has also done some maintenance work on the MM
subsystem documentation ("Improve mm documentation").
* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
maple_tree: shrink struct maple_tree
maple_tree: clean up mas_wr_append()
secretmem: convert page_is_secretmem() to folio_is_secretmem()
nios2: fix flush_dcache_page() for usage from irq context
hugetlb: add documentation for vma_kernel_pagesize()
mm: add orphaned kernel-doc to the rst files.
mm: fix clean_record_shared_mapping_range kernel-doc
mm: fix get_mctgt_type() kernel-doc
mm: fix kernel-doc warning from tlb_flush_rmaps()
mm: remove enum page_entry_size
mm: allow ->huge_fault() to be called without the mmap_lock held
mm: move PMD_ORDER to pgtable.h
mm: remove checks for pte_index
memcg: remove duplication detection for mem_cgroup_uncharge_swap
mm/huge_memory: work on folio->swap instead of page->private when splitting folio
mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
mm/swap: use dedicated entry for swap in folio
mm/swap: stop using page->private on tail pages for THP_SWAP
selftests/mm: fix WARNING comparing pointer to 0
selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
...

@@ -29,8 +29,10 @@ Description: Writing 'on' or 'off' to this file makes the kdamond starts or
                file updates contents of schemes stats files of the kdamond.
                Writing 'update_schemes_tried_regions' to the file updates
                contents of 'tried_regions' directory of every scheme directory
                of this kdamond. Writing 'clear_schemes_tried_regions' to the
                file removes contents of the 'tried_regions' directory.
                of this kdamond. Writing 'update_schemes_tried_bytes' to the
                file updates only '.../tried_regions/total_bytes' files of this
                kdamond. Writing 'clear_schemes_tried_regions' to the file
                removes contents of the 'tried_regions' directory.

What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/pid
Date:           Mar 2022

@@ -269,8 +271,10 @@ What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/
Date:           Dec 2022
Contact:        SeongJae Park <sj@kernel.org>
Description:    Writing to and reading from this file sets and gets the type of
                the memory of the interest. 'anon' for anonymous pages, or
                'memcg' for specific memory cgroup can be written and read.
                the memory of the interest. 'anon' for anonymous pages,
                'memcg' for specific memory cgroup, 'addr' for address range
                (an open-ended interval), or 'target' for DAMON monitoring
                target can be written and read.

What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/memcg_path
Date:           Dec 2022

@@ -279,6 +283,27 @@ Description: If 'memcg' is written to the 'type' file, writing to and
                reading from this file sets and gets the path to the memory
                cgroup of the interest.

What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/addr_start
Date:           Jul 2023
Contact:        SeongJae Park <sj@kernel.org>
Description:    If 'addr' is written to the 'type' file, writing to or reading
                from this file sets or gets the start address of the address
                range for the filter.

What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/addr_end
Date:           Jul 2023
Contact:        SeongJae Park <sj@kernel.org>
Description:    If 'addr' is written to the 'type' file, writing to or reading
                from this file sets or gets the end address of the address
                range for the filter.

What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/target_idx
Date:           Dec 2022
Contact:        SeongJae Park <sj@kernel.org>
Description:    If 'target' is written to the 'type' file, writing to or
                reading from this file sets or gets the index of the DAMON
                monitoring target of the interest.

What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/matching
Date:           Dec 2022
Contact:        SeongJae Park <sj@kernel.org>

@@ -317,6 +342,13 @@ Contact: SeongJae Park <sj@kernel.org>
Description:    Reading this file returns the number of the exceed events of
                the scheme's quotas.

What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/total_bytes
Date:           Jul 2023
Contact:        SeongJae Park <sj@kernel.org>
Description:    Reading this file returns the total amount of memory that
                corresponding DAMON-based Operation Scheme's action has tried
                to be applied.

What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/start
Date:           Oct 2022
Contact:        SeongJae Park <sj@kernel.org>

@@ -10,7 +10,7 @@ Description:
                dropping it if possible. The kernel will then be placed
                on the bad page list and never be reused.

                The offlining is done in kernel specific granuality.
                The offlining is done in kernel specific granularity.
                Normally it's the base page size of the kernel, but
                this might change.

@@ -35,7 +35,7 @@ Description:
                to access this page assuming it's poisoned by the
                hardware.

                The offlining is done in kernel specific granuality.
                The offlining is done in kernel specific granularity.
                Normally it's the base page size of the kernel, but
                this might change.

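For illustration, these entries are driven by writing a physical address; a
minimal shell sketch, assuming this is the standard page-offline ABI file
``soft_offline_page`` under ``/sys/devices/system/memory/`` (the file name is
not shown in the hunk above) and using a placeholder address::

    # soft-offline the page containing this physical address
    # (0x200000000 is only an example value; requires root)
    echo 0x200000000 > /sys/devices/system/memory/soft_offline_page
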

@@ -92,8 +92,6 @@ Brief summary of control files.
 memory.oom_control              set/show oom controls.
 memory.numa_stat                show the number of memory usage per numa
                                 node
 memory.kmem.limit_in_bytes      This knob is deprecated and writing to
                                 it will return -ENOTSUPP.
 memory.kmem.usage_in_bytes      show current kernel memory allocation
 memory.kmem.failcnt             show the number of kernel memory usage
                                 hits limits

@@ -141,8 +141,8 @@ nodemask_t
The size of a nodemask_t type. Used to compute the number of online
nodes.

(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|compound_order|compound_head)
-------------------------------------------------------------------------------------------------
(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_head)
----------------------------------------------------------------------------------

User-space tools compute their values based on the offset of these
variables. The variables are used when excluding unnecessary pages.

@@ -325,8 +325,8 @@ NR_FREE_PAGES
On linux-2.6.21 or later, the number of free pages is in
vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.

PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask
------------------------------------------------------------------------------
PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask|PG_hugetlb
-----------------------------------------------------------------------------------------

Page attributes. These flags are used to filter various unnecessary for
dumping pages.

@@ -338,12 +338,6 @@ More page attributes. These flags are used to filter various unnecessary for
dumping pages.

HUGETLB_PAGE_DTOR
-----------------

The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
excludes these pages.

x86_64
======

@@ -87,7 +87,7 @@ comma (","). ::
    │ │ │ │ │ │ │ filters/nr_filters
    │ │ │ │ │ │ │ │ 0/type,matching,memcg_id
    │ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
    │ │ │ │ │ │ │ tried_regions/
    │ │ │ │ │ │ │ tried_regions/total_bytes
    │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
    │ │ │ │ │ │ │ │ ...
    │ │ │ │ │ │ ...

@@ -127,14 +127,18 @@ in the state. Writing ``commit`` to the ``state`` file makes kdamond reads the
user inputs in the sysfs files except ``state`` file again. Writing
``update_schemes_stats`` to ``state`` file updates the contents of stats files
for each DAMON-based operation scheme of the kdamond. For details of the
stats, please refer to :ref:`stats section <sysfs_schemes_stats>`. Writing
``update_schemes_tried_regions`` to ``state`` file updates the DAMON-based
operation scheme action tried regions directory for each DAMON-based operation
scheme of the kdamond. Writing ``clear_schemes_tried_regions`` to ``state``
file clears the DAMON-based operating scheme action tried regions directory for
each DAMON-based operation scheme of the kdamond. For details of the
DAMON-based operation scheme action tried regions directory, please refer to
:ref:`tried_regions section <sysfs_schemes_tried_regions>`.
stats, please refer to :ref:`stats section <sysfs_schemes_stats>`.

Writing ``update_schemes_tried_regions`` to ``state`` file updates the
DAMON-based operation scheme action tried regions directory for each
DAMON-based operation scheme of the kdamond. Writing
``update_schemes_tried_bytes`` to ``state`` file updates only
``.../tried_regions/total_bytes`` files. Writing
``clear_schemes_tried_regions`` to ``state`` file clears the DAMON-based
operating scheme action tried regions directory for each DAMON-based operation
scheme of the kdamond. For details of the DAMON-based operation scheme action
tried regions directory, please refer to :ref:`tried_regions section
<sysfs_schemes_tried_regions>`.

If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.

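For illustration, the keywords above can be exercised from a shell; a minimal
sketch, assuming kdamond ``0`` with context ``0`` and scheme ``0`` (all
indices are examples)::

    # update only the total_bytes counters, then read one back
    echo update_schemes_tried_bytes > /sys/kernel/mm/damon/admin/kdamonds/0/state
    cat /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/tried_regions/total_bytes

    # populate the full tried_regions directories, then clear them again
    echo update_schemes_tried_regions > /sys/kernel/mm/damon/admin/kdamonds/0/state
    echo clear_schemes_tried_regions > /sys/kernel/mm/damon/admin/kdamonds/0/state
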

@@ -359,15 +363,21 @@ number (``N``) to the file creates the number of child directories named ``0``
to ``N-1``. Each directory represents each filter. The filters are evaluated
in the numeric order.

Each filter directory contains three files, namely ``type``, ``matcing``, and
``memcg_path``. You can write one of two special keywords, ``anon`` for
anonymous pages, or ``memcg`` for specific memory cgroup filtering. In case of
the memory cgroup filtering, you can specify the memory cgroup of the interest
by writing the path of the memory cgroup from the cgroups mount point to
``memcg_path`` file. You can write ``Y`` or ``N`` to ``matching`` file to
filter out pages that does or does not match to the type, respectively. Then,
the scheme's action will not be applied to the pages that specified to be
filtered out.
Each filter directory contains six files, namely ``type``, ``matcing``,
``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. To ``type``
file, you can write one of four special keywords: ``anon`` for anonymous pages,
``memcg`` for specific memory cgroup, ``addr`` for specific address range (an
open-ended interval), or ``target`` for specific DAMON monitoring target
filtering. In case of the memory cgroup filtering, you can specify the memory
cgroup of the interest by writing the path of the memory cgroup from the
cgroups mount point to ``memcg_path`` file. In case of the address range
filtering, you can specify the start and end address of the range to
``addr_start`` and ``addr_end`` files, respectively. For the DAMON monitoring
target filtering, you can specify the index of the target between the list of
the DAMON context's monitoring targets list to ``target_idx`` file. You can
write ``Y`` or ``N`` to ``matching`` file to filter out pages that does or does
not match to the type, respectively. Then, the scheme's action will not be
applied to the pages that specified to be filtered out.

For example, below restricts a DAMOS action to be applied to only non-anonymous
pages of all memory cgroups except ``/having_care_already``.::

@@ -381,8 +391,14 @@ pages of all memory cgroups except ``/having_care_already``.::
    echo /having_care_already > 1/memcg_path
    echo N > 1/matching

Note that filters are currently supported only when ``paddr``
`implementation <sysfs_contexts>` is being used.
Note that ``anon`` and ``memcg`` filters are currently supported only when
``paddr`` `implementation <sysfs_contexts>` is being used.

Also, memory regions that are filtered out by ``addr`` or ``target`` filters
are not counted as the scheme has tried to those, while regions that filtered
out by other type filters are counted as the scheme has tried to. The
difference is applied to :ref:`stats <damos_stats>` and
:ref:`tried regions <sysfs_schemes_tried_regions>`.

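An address range filter can be configured in the same way as the example
above; a hedged sketch, assuming a second filter directory ``1/`` already
exists and using placeholder addresses::

    # filter out one address range (addresses are examples)
    echo addr > 1/type
    echo 0x100000000 > 1/addr_start
    echo 0x140000000 > 1/addr_end
    echo Y > 1/matching
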
.. _sysfs_schemes_stats:

@@ -406,13 +422,21 @@ stats by writing a special keyword, ``update_schemes_stats`` to the relevant
schemes/<N>/tried_regions/
--------------------------

This directory initially has one file, ``total_bytes``.

When a special keyword, ``update_schemes_tried_regions``, is written to the
relevant ``kdamonds/<N>/state`` file, DAMON creates directories named integer
starting from ``0`` under this directory. Each directory contains files
exposing detailed information about each of the memory region that the
corresponding scheme's ``action`` has tried to be applied under this directory,
during next :ref:`aggregation interval <sysfs_monitoring_attrs>`. The
information includes address range, ``nr_accesses``, and ``age`` of the region.
relevant ``kdamonds/<N>/state`` file, DAMON updates the ``total_bytes`` file so
that reading it returns the total size of the scheme tried regions, and creates
directories named integer starting from ``0`` under this directory. Each
directory contains files exposing detailed information about each of the memory
region that the corresponding scheme's ``action`` has tried to be applied under
this directory, during next :ref:`aggregation interval
<sysfs_monitoring_attrs>`. The information includes address range,
``nr_accesses``, and ``age`` of the region.

Writing ``update_schemes_tried_bytes`` to the relevant ``kdamonds/<N>/state``
file will only update the ``total_bytes`` file, and will not create the
subdirectories.

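Putting the keyword and the files together, a brief shell sketch (kdamond,
context, scheme, and region indices are all examples)::

    base=/sys/kernel/mm/damon/admin/kdamonds/0
    echo update_schemes_tried_regions > $base/state
    cat $base/contexts/0/schemes/0/tried_regions/total_bytes
    # details of the first tried region
    cat $base/contexts/0/schemes/0/tried_regions/0/{start,end,nr_accesses,age}
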
The directories will be removed when another special keyword,
``clear_schemes_tried_regions``, is written to the relevant

@@ -159,6 +159,8 @@ The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:

general_profit
        how effective is KSM. The calculation is explained below.
pages_scanned
        how many pages are being scanned for ksm
pages_shared
        how many shared pages are being used
pages_sharing

@@ -173,6 +175,13 @@ stable_node_chains
        the number of KSM pages that hit the ``max_page_sharing`` limit
stable_node_dups
        number of duplicated KSM pages
ksm_zero_pages
        how many zero pages that are still mapped into processes were mapped by
        KSM when deduplicating.

When ``use_zero_pages`` is/was enabled, the sum of ``pages_sharing`` +
``ksm_zero_pages`` represents the actual number of pages saved by KSM.
if ``use_zero_pages`` has never been enabled, ``ksm_zero_pages`` is 0.

A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``

@@ -196,21 +205,25 @@ several times, which are unprofitable memory consumed.
1) How to determine whether KSM save memory or consume memory in system-wide
   range? Here is a simple approximate calculation for reference::

        general_profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
        general_profit =~ ksm_saved_pages * sizeof(page) - (all_rmap_items) *
                          sizeof(rmap_item);

   where all_rmap_items can be easily obtained by summing ``pages_sharing``,
   ``pages_shared``, ``pages_unshared`` and ``pages_volatile``.
   where ksm_saved_pages equals to the sum of ``pages_sharing`` +
   ``ksm_zero_pages`` of the system, and all_rmap_items can be easily
   obtained by summing ``pages_sharing``, ``pages_shared``, ``pages_unshared``
   and ``pages_volatile``.

2) The KSM profit inner a single process can be similarly obtained by the
   following approximate calculation::

        process_profit =~ ksm_merging_pages * sizeof(page) -
        process_profit =~ ksm_saved_pages * sizeof(page) -
                          ksm_rmap_items * sizeof(rmap_item).

   where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
   and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``. The process profit
   is also shown in ``/proc/<pid>/ksm_stat`` as ksm_process_profit.
   where ksm_saved_pages equals to the sum of ``ksm_merging_pages`` and
   ``ksm_zero_pages``, both of which are shown under the directory
   ``/proc/<pid>/ksm_stat``, and ksm_rmap_items is also shown in
   ``/proc/<pid>/ksm_stat``. The process profit is also shown in
   ``/proc/<pid>/ksm_stat`` as ksm_process_profit.

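The system-wide calculation can be reproduced from the exported counters; a
rough shell sketch, assuming a 4096-byte page and a placeholder 64-byte
rmap_item (the real structure size depends on the kernel build; the kernel
also exposes the result directly as ``general_profit``)::

    cd /sys/kernel/mm/ksm
    saved=$(( $(cat pages_sharing) + $(cat ksm_zero_pages) ))
    rmap=$(( $(cat pages_sharing) + $(cat pages_shared) +
             $(cat pages_unshared) + $(cat pages_volatile) ))
    echo $(( saved * 4096 - rmap * 64 ))  # approximate general_profit, bytes
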
From the perspective of application, a high ratio of ``ksm_rmap_items`` to
``ksm_merging_pages`` means a bad madvise-applied policy, so developers or

@@ -433,6 +433,18 @@ The following module parameters are currently defined:
                             memory in a way that huge pages in bigger
                             granularity cannot be formed on hotplugged
                             memory.

                             With value "force" it could result in memory
                             wastage due to memmap size limitations. For
                             example, if the memmap for a memory block
                             requires 1 MiB, but the pageblock size is 2
                             MiB, 1 MiB of hotplugged memory will be wasted.
                             Note that there are still cases where the
                             feature cannot be enforced: for example, if the
                             memmap is smaller than a single page, or if the
                             architecture does not support the forced mode
                             in all configurations.

``online_policy``            read-write: Set the basic policy used for
                             automatic zone selection when onlining memory
                             blocks without specifying a target zone.

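For reference, the parameter is exposed through the usual module-parameter
path; a sketch (shown read-only here, since the value is typically selected at
boot time)::

    cat /sys/module/memory_hotplug/parameters/memmap_on_memory
    # to request the forced mode, boot with:
    #   memory_hotplug.memmap_on_memory=force
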

@@ -669,7 +681,7 @@ when still encountering permanently unmovable pages within ZONE_MOVABLE
(-> BUG), memory offlining will keep retrying until it eventually succeeds.

When offlining is triggered from user space, the offlining context can be
terminated by sending a fatal signal. A timeout based offlining can easily be
terminated by sending a signal. A timeout based offlining can easily be
implemented via::

    % timeout $TIMEOUT offline_block | failure_handling

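Here ``offline_block`` stands for any command that performs the offlining; a
minimal sketch, with the memory block number as a placeholder::

    #!/bin/sh
    # try to offline memory block 42; this blocks until it succeeds,
    # or until the surrounding timeout(1) terminates it
    echo offline > /sys/devices/system/memory/memory42/state
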

@@ -244,6 +244,21 @@ write-protected (so future writes will also result in a WP fault). These ioctls
support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
respectively) to configure the mapping this way.

Memory Poisioning Emulation
---------------------------

In response to a fault (either missing or minor), an action userspace can
take to "resolve" it is to issue a ``UFFDIO_POISON``. This will cause any
future faulters to either get a SIGBUS, or in KVM's case the guest will
receive an MCE as if there were hardware memory poisoning.

This is used to emulate hardware memory poisoning. Imagine a VM running on a
machine which experiences a real hardware memory error. Later, we live migrate
the VM to another physical machine. Since we want the migration to be
transparent to the guest, we want that same address range to act as if it was
still poisoned, even though it's on a new physical host which ostensibly
doesn't have a memory error in the exact same spot.

QEMU/KVM
========

@@ -49,7 +49,7 @@ compressed pool.
Design
======

Zswap receives pages for compression through the Frontswap API and is able to
Zswap receives pages for compression from the swap subsystem and is able to
evict pages from its own compressed pool on an LRU basis and write them back to
the backing swap device in the case that the compressed pool is full.

@@ -70,19 +70,19 @@ means the compression ratio will always be 2:1 or worse (because of half-full
zbud pages). The zsmalloc type zpool has a more complex compressed page
storage method, and it can achieve greater storage densities.

When a swap page is passed from frontswap to zswap, zswap maintains a mapping
When a swap page is passed from swapout to zswap, zswap maintains a mapping
of the swap entry, a combination of the swap type and swap offset, to the zpool
handle that references that compressed swap page. This mapping is achieved
with a red-black tree per swap type. The swap offset is the search key for the
tree nodes.

During a page fault on a PTE that is a swap entry, frontswap calls the zswap
load function to decompress the page into the page allocated by the page fault
handler.
During a page fault on a PTE that is a swap entry, the swapin code calls the
zswap load function to decompress the page into the page allocated by the page
fault handler.

Once there are no PTEs referencing a swap page stored in zswap (i.e. the count
in the swap_map goes to 0) the swap code calls the zswap invalidate function,
via frontswap, to free the compressed entry.
in the swap_map goes to 0) the swap code calls the zswap invalidate function
to free the compressed entry.

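As a usage sketch of zswap's runtime knobs (values are examples; the available
compressors and zpool implementations depend on the kernel configuration)::

    echo 1 > /sys/module/zswap/parameters/enabled
    echo lzo > /sys/module/zswap/parameters/compressor
    echo zbud > /sys/module/zswap/parameters/zpool
    # runtime statistics, if debugfs is mounted
    grep . /sys/kernel/debug/zswap/*
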
Zswap seeks to be simple in its policies. Sysfs attributes allow for one user
controlled policy:

@@ -134,6 +134,7 @@ Usage of helpers:

 bio_for_each_bvec_all()
 bio_first_bvec_all()
 bio_first_page_all()
 bio_first_folio_all()
 bio_last_bvec_all()

* The following helpers iterate over single-page segment. The passed 'struct

@@ -88,13 +88,17 @@ changes occur:

        This is used primarily during fault processing.

5) ``void update_mmu_cache(struct vm_area_struct *vma,
   unsigned long address, pte_t *ptep)``
5) ``void update_mmu_cache_range(struct vm_fault *vmf,
   struct vm_area_struct *vma, unsigned long address, pte_t *ptep,
   unsigned int nr)``

        At the end of every page fault, this routine is invoked to
        tell the architecture specific code that a translation
        now exists at virtual address "address" for address space
        "vma->vm_mm", in the software page tables.
        At the end of every page fault, this routine is invoked to tell
        the architecture specific code that translations now exists
        in the software page tables for address space "vma->vm_mm"
        at virtual address "address" for "nr" consecutive pages.

        This routine is also invoked in various other places which pass
        a NULL "vmf".

        A port may use this information in any way it so chooses.
        For example, it could use this event to pre-load TLB

@@ -269,7 +273,7 @@ maps this page at its virtual address.

        If D-cache aliasing is not an issue, these two routines may
        simply call memcpy/memset directly and do nothing more.

``void flush_dcache_page(struct page *page)``
``void flush_dcache_folio(struct folio *folio)``

        This routines must be called when:

@@ -277,7 +281,7 @@ maps this page at its virtual address.
           and / or in high memory
        b) the kernel is about to read from a page cache page and user space
           shared/writable mappings of this page potentially exist. Note
           that {get,pin}_user_pages{_fast} already call flush_dcache_page
           that {get,pin}_user_pages{_fast} already call flush_dcache_folio
           on any page found in the user address space and thus driver
           code rarely needs to take this into account.

@@ -291,7 +295,7 @@ maps this page at its virtual address.

        The phrase "kernel writes to a page cache page" means, specifically,
        that the kernel executes store instructions that dirty data in that
        page at the page->virtual mapping of that page. It is important to
        page at the kernel virtual mapping of that page. It is important to
        flush here to handle D-cache aliasing, to make sure these kernel stores
        are visible to user space mappings of that page.

@@ -302,21 +306,22 @@ maps this page at its virtual address.

        If D-cache aliasing is not an issue, this routine may simply be defined
        as a nop on that architecture.

        There is a bit set aside in page->flags (PG_arch_1) as "architecture
        There is a bit set aside in folio->flags (PG_arch_1) as "architecture
        private". The kernel guarantees that, for pagecache pages, it will
        clear this bit when such a page first enters the pagecache.

        This allows these interfaces to be implemented much more efficiently.
        It allows one to "defer" (perhaps indefinitely) the actual flush if
        there are currently no user processes mapping this page. See sparc64's
        flush_dcache_page and update_mmu_cache implementations for an example
        of how to go about doing this.
        This allows these interfaces to be implemented much more
        efficiently. It allows one to "defer" (perhaps indefinitely) the
        actual flush if there are currently no user processes mapping this
        page. See sparc64's flush_dcache_folio and update_mmu_cache_range
        implementations for an example of how to go about doing this.

        The idea is, first at flush_dcache_page() time, if page_file_mapping()
        returns a mapping, and mapping_mapped on that mapping returns %false,
        just mark the architecture private page flag bit. Later, in
        update_mmu_cache(), a check is made of this flag bit, and if set the
        flush is done and the flag bit is cleared.
        The idea is, first at flush_dcache_folio() time, if
        folio_flush_mapping() returns a mapping, and mapping_mapped() on that
        mapping returns %false, just mark the architecture private page
        flag bit. Later, in update_mmu_cache_range(), a check is made
        of this flag bit, and if set the flush is done and the flag bit
        is cleared.

        .. important::

@@ -326,12 +331,6 @@ maps this page at its virtual address.
                     dirty. Again, see sparc64 for examples of how
                     to deal with this.

``void flush_dcache_folio(struct folio *folio)``
        This function is called under the same circumstances as
        flush_dcache_page(). It allows the architecture to
        optimise for flushing the entire folio of pages instead
        of flushing one page at a time.

``void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
  unsigned long user_vaddr, void *dst, void *src, int len)``
``void copy_from_user_page(struct vm_area_struct *vma, struct page *page,

@@ -352,7 +351,7 @@ maps this page at its virtual address.

        When the kernel needs to access the contents of an anonymous
        page, it calls this function (currently only
        get_user_pages()). Note: flush_dcache_page() deliberately
        get_user_pages()). Note: flush_dcache_folio() deliberately
        doesn't work for an anonymous page. The default
        implementation is a nop (and should remain so for all coherent
        architectures). For incoherent architectures, it should flush

@@ -369,7 +368,7 @@ maps this page at its virtual address.
``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``

        All the functionality of flush_icache_page can be implemented in
        flush_dcache_page and update_mmu_cache. In the future, the hope
        flush_dcache_folio and update_mmu_cache_range. In the future, the hope
        is to remove this interface completely.

The final category of APIs is for I/O to deliberately aliased address

@@ -115,3 +115,28 @@ More Memory Management Functions
.. kernel-doc:: include/linux/mmzone.h
.. kernel-doc:: mm/util.c
   :functions: folio_mapping

.. kernel-doc:: mm/rmap.c
.. kernel-doc:: mm/migrate.c
.. kernel-doc:: mm/mmap.c
.. kernel-doc:: mm/kmemleak.c
.. #kernel-doc:: mm/hmm.c (build warnings)
.. kernel-doc:: mm/memremap.c
.. kernel-doc:: mm/hugetlb.c
.. kernel-doc:: mm/swap.c
.. kernel-doc:: mm/zpool.c
.. kernel-doc:: mm/memcontrol.c
.. #kernel-doc:: mm/memory-tiers.c (build warnings)
.. kernel-doc:: mm/shmem.c
.. kernel-doc:: mm/migrate_device.c
.. #kernel-doc:: mm/nommu.c (duplicates kernel-doc from other files)
.. kernel-doc:: mm/mapping_dirty_helpers.c
.. #kernel-doc:: mm/memory-failure.c (build warnings)
.. kernel-doc:: mm/percpu.c
.. kernel-doc:: mm/maccess.c
.. kernel-doc:: mm/vmscan.c
.. kernel-doc:: mm/memory_hotplug.c
.. kernel-doc:: mm/mmu_notifier.c
.. kernel-doc:: mm/balloon_compaction.c
.. kernel-doc:: mm/huge_memory.c
.. kernel-doc:: mm/io-mapping.c

@@ -9,7 +9,7 @@
    | alpha: | TODO |
    | arc: | TODO |
    | arm: | TODO |
    | arm64: | N/A |
    | arm64: | ok |
    | csky: | TODO |
    | hexagon: | TODO |
    | ia64: | TODO |

@@ -636,26 +636,29 @@ vm_operations_struct

prototypes::

        void (*open)(struct vm_area_struct*);
        void (*close)(struct vm_area_struct*);
        vm_fault_t (*fault)(struct vm_area_struct*, struct vm_fault *);
        void (*open)(struct vm_area_struct *);
        void (*close)(struct vm_area_struct *);
        vm_fault_t (*fault)(struct vm_fault *);
        vm_fault_t (*huge_fault)(struct vm_fault *, unsigned int order);
        vm_fault_t (*map_pages)(struct vm_fault *, pgoff_t start, pgoff_t end);
        vm_fault_t (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
        vm_fault_t (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
        int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);

locking rules:

=============  =========  ===========================
=============  ==========  ===========================
ops            mmap_lock   PageLocked(page)
=============  =========  ===========================
open:          yes
close:         yes
fault:         yes        can return with page locked
map_pages:     read
page_mkwrite:  yes        can return with page locked
pfn_mkwrite:   yes
access:        yes
=============  =========  ===========================
=============  ==========  ===========================
open:          write
close:         read/write
fault:         read        can return with page locked
huge_fault:    maybe-read
map_pages:     maybe-read
page_mkwrite:  read        can return with page locked
pfn_mkwrite:   read
access:        read
=============  ==========  ===========================

->fault() is called when a previously not present pte is about to be faulted
in. The filesystem must find and return the page associated with the passed in

@@ -665,11 +668,18 @@ then ensure the page is not already truncated (invalidate_lock will block
subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.

->huge_fault() is called when there is no PUD or PMD entry present. This
gives the filesystem the opportunity to install a PUD or PMD sized page.
Filesystems can also use the ->fault method to return a PMD sized page,
so implementing this function may not be necessary. In particular,
filesystems should not call filemap_fault() from ->huge_fault().
The mmap_lock may not be held when this method is called.

->map_pages() is called when VM asks to map easy accessible pages.
Filesystem should find and map pages associated with offsets from "start_pgoff"
till "end_pgoff". ->map_pages() is called with the RCU lock held and must
not block. If it's not possible to reach a page without blocking,
filesystem should skip it. Filesystem should use do_set_pte() to setup
filesystem should skip it. Filesystem should use set_pte_range() to setup
page table entry. Pointer to entry associated with the page is passed in
"pte" field in vm_fault structure. Pointers to entries for other offsets
should be calculated relative to "pte".

@@ -938,3 +938,14 @@ file pointer instead of struct dentry pointer. d_tmpfile() is similarly
changed to simplify callers. The passed file is in a non-open state and on
success must be opened before returning (e.g. by calling
finish_open_simple()).

---

**mandatory**

Calling convention for ->huge_fault has changed. It now takes a page
order instead of an enum page_entry_size, and it may be called without the
mmap_lock held. All in-tree users have been audited and do not seem to
depend on the mmap_lock being held, but out of tree users should verify
for themselves. If they do need it, they can return VM_FAULT_RETRY to
be called with the mmap_lock held.

@@ -380,12 +380,24 @@ number of filters for each scheme. Each filter specifies the type of target
memory, and whether it should exclude the memory of the type (filter-out), or
all except the memory of the type (filter-in).

As of this writing, anonymous page type and memory cgroup type are supported by
the feature. Some filter target types can require additional arguments. For
example, the memory cgroup filter type asks users to specify the file path of
the memory cgroup for the filter. Hence, users can apply specific schemes to
only anonymous pages, non-anonymous pages, pages of specific cgroups, all pages
excluding those of specific cgroups, and any combination of those.
Currently, anonymous page, memory cgroup, address range, and DAMON monitoring
target type filters are supported by the feature. Some filter target types
require additional arguments. The memory cgroup filter type asks users to
specify the file path of the memory cgroup for the filter. The address range
type asks the start and end addresses of the range. The DAMON monitoring
target type asks the index of the target from the context's monitoring targets
list. Hence, users can apply specific schemes to only anonymous pages,
non-anonymous pages, pages of specific cgroups, all pages excluding those of
specific cgroups, pages in specific address range, pages in specific DAMON
monitoring targets, and any combination of those.

To handle filters efficiently, the address range and DAMON monitoring target
type filters are handled by the core layer, while others are handled by
operations set. If a memory region is filtered by a core layer-handled filter,
it is not counted as the scheme has tried to the region. In contrast, if a
memory regions is filtered by an operations set layer-handled filter, it is
counted as the scheme has tried. The difference in accounting leads to changes
in the statistics.


Application Programming Interface

@@ -1,264 +0,0 @@
=========
Frontswap
=========

Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance savings may be obtained because
swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.

.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/

Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device. The storage is assumed to be
a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
to the requirements of transcendent memory (such as Xen's "tmem", or
in-kernel compressed memory, aka "zcache", or future RAM-like devices);
this pseudo-RAM device is not directly accessible or addressable by the
kernel and is of unknown and possibly time-varying size. The driver
links itself to frontswap by calling frontswap_register_ops to set the
frontswap_ops funcs appropriately and the functions it provides must
conform to certain policies as follows:

An "init" prepares the device to receive frontswap pages associated
with the specified swap device number (aka "type"). A "store" will
copy the page to transcendent memory and associate it with the type and
offset associated with the page. A "load" will copy the page, if found,
from transcendent memory into kernel memory, but will NOT remove the page
from transcendent memory. An "invalidate_page" will remove the page
from transcendent memory and an "invalidate_area" will remove ALL pages
associated with the swap type (e.g., like swapoff) and notify the "device"
to refuse further stores with that swap type.

Once a page is successfully stored, a matching load on the page will normally
succeed. So when the kernel finds itself in a situation where it needs
to swap out a page, it first attempts to use frontswap. If the store returns
success, the data has been successfully saved to transcendent memory and
a disk write and, if the data is later read back, a disk read are avoided.
If a store returns failure, transcendent memory has rejected the data, and the
page can be written to swap as usual.

Note that if a page is stored and the page already exists in transcendent memory
(a "duplicate" store), either the store succeeds and the data is overwritten,
or the store fails AND the page is invalidated. This ensures stale data may
never be obtained from frontswap.

If properly configured, monitoring of frontswap is done via debugfs in
the `/sys/kernel/debug/frontswap` directory. The effectiveness of
frontswap can be measured (across all swap devices) with:

``failed_stores``
        how many store attempts have failed

``loads``
        how many loads were attempted (all should succeed)

``succ_stores``
        how many store attempts have succeeded

``invalidates``
        how many invalidates were attempted

A backend implementation may provide additional metrics.

FAQ
===

* Where's the value?

When a workload starts swapping, performance falls through the floor.
Frontswap significantly increases performance in many such workloads by
providing a clean, dynamic interface to read and write swap pages to
"transcendent memory" that is otherwise not directly addressable to the kernel.
This interface is ideal when data is transformed to a different form
and size (such as with compression) or secretly moved (as might be
useful for write-balancing for some RAM-like devices). Swap pages (and
evicted page-cache pages) are a great use for this kind of slower-than-RAM-
but-much-faster-than-disk "pseudo-RAM device".

Frontswap with a fairly small impact on the kernel,
provides a huge amount of flexibility for more dynamic, flexible RAM
utilization in various system configurations:

In the single kernel case, aka "zcache", pages are compressed and
stored in local memory, thus increasing the total anonymous pages
that can be safely kept in RAM. Zcache essentially trades off CPU
cycles used in compression/decompression for better memory utilization.
Benchmarks have shown little or no impact when memory pressure is
low while providing a significant performance improvement (25%+)
on some workloads under high memory pressure.

"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
support for clustered systems. Frontswap pages are locally compressed
as in zcache, but then "remotified" to another system's RAM. This
allows RAM to be dynamically load-balanced back-and-forth as needed,
i.e. when system A is overcommitted, it can swap to system B, and
vice versa. RAMster can also be configured as a memory server so
many servers in a cluster can swap, dynamically as needed, to a single
server configured with a large amount of RAM... without pre-configuring
how much of the RAM is available for each of the clients!

In the virtual case, the whole point of virtualization is to statistically
multiplex physical resources across the varying demands of multiple
virtual machines. This is really hard to do with RAM and efforts to do
it well with no kernel changes have essentially failed (except in some
well-publicized special-case workloads).
Specifically, the Xen Transcendent Memory backend allows otherwise
"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
virtual machines, but the pages can be compressed and deduplicated to
optimize RAM utilization. And when guest OS's are induced to surrender
underutilized RAM (e.g. with "selfballooning"), sudden unexpected
memory pressure may result in swapping; frontswap allows those pages
to be swapped to and from hypervisor RAM (if overall host system memory
conditions allow), thus mitigating the potentially awful performance impact
of unplanned swapping.

A KVM implementation is underway and has been RFC'ed to lkml. And,
using frontswap, investigation is also underway on the use of NVM as
a memory extension technology.

* Sure there may be performance advantages in some situations, but
  what's the space/time overhead of frontswap?

If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
nothingness and the only overhead is a few extra bytes per swapon'ed
swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
registers, there is one extra global variable compared to zero for
every swap page read or written. If CONFIG_FRONTSWAP is enabled
AND a frontswap backend registers AND the backend fails every "store"
request (i.e. provides no memory despite claiming it might),
CPU overhead is still negligible -- and since every frontswap fail
precedes a swap page write-to-disk, the system is highly likely
to be I/O bound and using a small fraction of a percent of a CPU
will be irrelevant anyway.

As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
registers, one bit is allocated for every swap page for every swap
device that is swapon'd. This is added to the EIGHT bits (which
was sixteen until about 2.6.34) that the kernel already allocates
for every swap page for every swap device that is swapon'd. (Hugh
Dickins has observed that frontswap could probably steal one of
the existing eight bits, but let's worry about that minor optimization
later.) For very large swap disks (which are rare) on a standard
4K pagesize, this is 1MB per 32GB swap.

When swap pages are stored in transcendent memory instead of written
out to disk, there is a side effect that this may create more memory
pressure that can potentially outweigh the other advantages. A
backend, such as zcache, must implement policies to carefully (but
dynamically) manage memory limits to ensure this doesn't happen.

* OK, how about a quick overview of what this frontswap patch does
  in terms that a kernel hacker can grok?

Let's assume that a frontswap "backend" has registered during
kernel initialization; this registration indicates that this
frontswap backend has access to some "memory" that is not directly
accessible by the kernel. Exactly how much memory it provides is
entirely dynamic and random.

Whenever a swap-device is swapon'd frontswap_init() is called,
passing the swap device number (aka "type") as a parameter.
This notifies frontswap to expect attempts to "store" swap pages
associated with that number.

Whenever the swap subsystem is readying a page to write to a swap
device (c.f swap_writepage()), frontswap_store is called. Frontswap
consults with the frontswap backend and if the backend says it does NOT
have room, frontswap_store returns -1 and the kernel swaps the page
to the swap device as normal. Note that the response from the frontswap
backend is unpredictable to the kernel; it may choose to never accept a
page, it could accept every ninth page, or it might accept every
page. But if the backend does accept a page, the data from the page
has already been copied and associated with the type and offset,
and the backend guarantees the persistence of the data. In this case,
frontswap sets a bit in the "frontswap_map" for the swap device
corresponding to the page offset on the swap device to which it would
otherwise have written the data.

When the swap subsystem needs to swap-in a page (swap_readpage()),
it first calls frontswap_load() which checks the frontswap_map to
see if the page was earlier accepted by the frontswap backend. If
it was, the page of data is filled from the frontswap backend and
the swap-in is complete. If not, the normal swap-in code is
executed to obtain the page of data from the real swap device.

So every time the frontswap backend accepts a page, a swap device read
and (potentially) a swap device write are replaced by a "frontswap backend
store" and (possibly) a "frontswap backend loads", which are presumably much
faster.

* Can't frontswap be configured as a "special" swap device that is
  just higher priority than any real swap device (e.g. like zswap,
  or maybe swap-over-nbd/NFS)?

No. First, the existing swap subsystem doesn't allow for any kind of
swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy,
but this would require fairly drastic changes. Even if it were
rewritten, the existing swap subsystem uses the block I/O layer which
assumes a swap device is fixed size and any page in it is linearly
addressable. Frontswap barely touches the existing swap subsystem,
and works around the constraints of the block I/O subsystem to provide
a great deal of flexibility and dynamicity.

For example, the acceptance of any swap page by the frontswap backend is
entirely unpredictable. This is critical to the definition of frontswap
backends because it grants completely dynamic discretion to the
backend. In zcache, one cannot know a priori how compressible a page is.
"Poorly" compressible pages can be rejected, and "poorly" can itself be
defined dynamically depending on current memory constraints.

Further, frontswap is entirely synchronous whereas a real swap
device is, by definition, asynchronous and uses block I/O. The
block I/O layer is not only unnecessary, but may perform "optimizations"
that are inappropriate for a RAM-oriented device including delaying
the write of some pages for a significant amount of time. Synchrony is
required to ensure the dynamicity of the backend and to avoid thorny race
conditions that would unnecessarily and greatly complicate frontswap
and/or the block I/O subsystem. That said, only the initial "store"
and "load" operations need be synchronous. A separate asynchronous thread
is free to manipulate the pages stored by frontswap. For example,
the "remotification" thread in RAMster uses standard asynchronous
kernel sockets to move compressed frontswap pages to a remote machine.
Similarly, a KVM guest-side implementation could do in-guest compression
and use "batched" hypercalls.

In a virtualized environment, the dynamicity allows the hypervisor
(or host OS) to do "intelligent overcommit". For example, it can
choose to accept pages only until host-swapping might be imminent,
then force guests to do their own swapping.

There is a downside to the transcendent memory specifications for
frontswap: Since any "store" might fail, there must always be a real
slot on a real swap device to swap the page. Thus frontswap must be
implemented as a "shadow" to every swapon'd device with the potential
capability of holding every page that the swap device might have held
and the possibility that it might hold no pages at all. This means
that frontswap cannot contain more pages than the total of swapon'd
swap devices. For example, if NO swap device is configured on some
installation, frontswap is useless. Swapless portable devices
can still use frontswap but a backend for such devices must configure
some kind of "ghost" swap device and ensure that it is never used.

* Why this weird definition about "duplicate stores"? If a page
  has been previously successfully stored, can't it always be
  successfully overwritten?

Nearly always it can, but no, sometimes it cannot. Consider an example
where data is compressed and the original 4K page has been compressed
to 1K. Now an attempt is made to overwrite the page with data that
is non-compressible and so would take the entire 4K. But the backend
has no more space. In this case, the store must be rejected. Whenever
frontswap rejects a store that would overwrite, it also must invalidate
the old data and ensure that it is no longer accessible. Since the
swap subsystem then writes the new data to the read swap device,
this is the correct course of action to ensure coherency.

* Why does the frontswap patch create the new include file swapfile.h?

The frontswap code depends on some swap-subsystem-internal data
structures that have, over the years, moved back and forth between
static and global. This seemed a reasonable compromise: Define
them as global but declare them in a new include file that isn't
included by the large number of source files that include swap.h.

Dan Magenheimer, last updated April 9, 2012

@@ -206,4 +206,5 @@ Functions
=========

.. kernel-doc:: include/linux/highmem.h
.. kernel-doc:: mm/highmem.c
.. kernel-doc:: include/linux/highmem-internal.h

@@ -271,12 +271,12 @@ to the global reservation count (resv_huge_pages).
Freeing Huge Pages
==================

Huge page freeing is performed by the routine free_huge_page(). This routine
is the destructor for hugetlbfs compound pages. As a result, it is only
passed a pointer to the page struct. When a huge page is freed, reservation
accounting may need to be performed. This would be the case if the page was
associated with a subpool that contained reserves, or the page is being freed
on an error path where a global reserve count must be restored.
Huge pages are freed by free_huge_folio(). It is only passed a pointer
to the folio as it is called from the generic MM code. When a huge page
is freed, reservation accounting may need to be performed. This would
be the case if the page was associated with a subpool that contained
reserves, or the page is being freed on an error path where a global
reserve count must be restored.

The page->private field points to any subpool associated with the page.
If the PagePrivate flag is set, it indicates the global reserve count should

@@ -525,7 +525,7 @@ However, there are several instances where errors are encountered after a huge
page is allocated but before it is instantiated. In this case, the page
allocation has consumed the reservation and made the appropriate subpool,
reservation map and global count adjustments. If the page is freed at this
time (before instantiation and clearing of PagePrivate), then free_huge_page
time (before instantiation and clearing of PagePrivate), then free_huge_folio
will increment the global reservation count. However, the reservation map
indicates the reservation was consumed. This resulting inconsistent state
will cause the 'leak' of a reserved huge page. The global reserve count will

@@ -44,7 +44,6 @@ above structured documentation, or deleted if it has served its purpose.
   balance
   damon/index
   free_page_reporting
   frontswap
   hmm
   hwpoison
   hugetlbfs_reserv

Some files were not shown because too many files have changed in this diff.