We want to see more ZS_FULL pages and less ZS_ALMOST_{FULL, EMPTY}
pages. Put a page with higher ->inuse count first within its
->fullness_list, which will give us better chances to fill up this page
with new objects (find_get_zspage() return ->fullness_list head for new
object allocation), so some zspages will become ZS_ALMOST_FULL/ZS_FULL
quicker.
It performs a trivial and cheap ->inuse compare which does not slow down
zsmalloc and in the worst case keeps the list pages in no particular
order.
A more expensive solution could sort fullness_list by ->inuse count.
[minchan@kernel.org: code adjustments]
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Perform automatic pool compaction by a shrinker when system is getting
tight on memory.
User-space has a very little knowledge regarding zsmalloc fragmentation
and basically has no mechanism to tell whether compaction will result in
any memory gain. Another issue is that user space is not always aware
of the fact that system is getting tight on memory. Which leads to very
uncomfortable scenarios when user space may start issuing compaction
'randomly' or from crontab (for example). Fragmentation is not always
necessarily bad, allocated and unused objects, after all, may be filled
with the data later, w/o the need of allocating a new zspage. On the
other hand, we obviously don't want to waste memory when the system
needs it.
Compaction now has a relatively quick pool scan so we are able to
estimate the number of pages that will be freed easily, which makes it
possible to call this function from a shrinker->count_objects()
callback. We also abort compaction as soon as we detect that we can't
free any pages any more, preventing wasteful objects migrations.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compaction returns back to zram the number of migrated objects, which is
quite uninformative -- we have objects of different sizes so user space
cannot obtain any valuable data from that number. Change compaction to
operate in terms of pages and return back to compaction issuer the
number of pages that were freed during compaction. So from now on we
will export more meaningful value in zram<id>/mm_stat -- the number of
freed (compacted) pages.
This requires:
(a) a rename of `num_migrated' to 'pages_compacted'
(b) a internal API change -- return first_page's fullness_group from
putback_zspage(), so we know when putback_zspage() did
free_zspage(). It helps us to account compaction stats correctly.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
`zs_compact_control' accounts the number of migrated objects but it has
a limited lifespan -- we lose it as soon as zs_compaction() returns back
to zram. It worked fine, because (a) zram had it's own counter of
migrated objects and (b) only zram could trigger compaction. However,
this does not work for automatic pool compaction (not issued by zram).
To account objects migrated during auto-compaction (issued by the
shrinker) we need to store this number in zs_pool.
Define a new `struct zs_pool_stats' structure to keep zs_pool's stats
there. It provides only `num_migrated', as of this writing, but it
surely can be extended.
A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to
caller.
Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change zs_object_copy() argument order to be (DST, SRC) rather than
(SRC, DST). copy/move functions usually have (to, from) arguments
order.
Rename alloc_target_page() to isolate_target_page(). This function
doesn't allocate anything, it isolates target page, pretty much like
isolate_source_page().
Tweak __zs_compact() comment.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function checks if class compaction will free any pages.
Rephrasing -- do we have enough unused objects to form at least one
ZS_EMPTY page and free it. It aborts compaction if class compaction
will not result in any (further) savings.
EXAMPLE (this debug output is not part of this patch set):
- class size
- number of allocated objects
- number of used objects
- max objects per zspage
- pages per zspage
- estimated number of pages that will be freed
[..]
class-512 objs:544 inuse:540 maxobj-per-zspage:8 pages-per-zspage:1 zspages-to-free:0
... class-512 compaction is useless. break
class-496 objs:660 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:2
class-496 objs:627 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:1
class-496 objs:594 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:0
... class-496 compaction is useless. break
class-448 objs:657 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:4
class-448 objs:648 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:3
class-448 objs:639 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:2
class-448 objs:630 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:1
class-448 objs:621 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:0
... class-448 compaction is useless. break
class-432 objs:728 inuse:685 maxobj-per-zspage:28 pages-per-zspage:3 zspages-to-free:1
class-432 objs:700 inuse:685 maxobj-per-zspage:28 pages-per-zspage:3 zspages-to-free:0
... class-432 compaction is useless. break
class-416 objs:819 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:2
class-416 objs:780 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:1
class-416 objs:741 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:0
... class-416 compaction is useless. break
class-400 objs:690 inuse:674 maxobj-per-zspage:10 pages-per-zspage:1 zspages-to-free:1
class-400 objs:680 inuse:674 maxobj-per-zspage:10 pages-per-zspage:1 zspages-to-free:0
... class-400 compaction is useless. break
class-384 objs:736 inuse:709 maxobj-per-zspage:32 pages-per-zspage:3 zspages-to-free:0
... class-384 compaction is useless. break
[..]
Every "compaction is useless" indicates that we saved CPU cycles.
class-512 has
544 object allocated
540 objects used
8 objects per-page
Even if we have a ALMOST_EMPTY zspage, we still don't have enough room to
migrate all of its objects and free this zspage; so compaction will not
make a lot of sense, it's better to just leave it as is.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Always account per-class `zs_size_stat' stats. This data will help us
make better decisions during compaction. We are especially interested
in OBJ_ALLOCATED and OBJ_USED, which can tell us if class compaction
will result in any memory gain.
For instance, we know the number of allocated objects in the class, the
number of objects being used (so we also know how many objects are not
used) and the number of objects per-page. So we can ensure if we have
enough unused objects to form at least one ZS_EMPTY zspage during
compaction.
We calculate this value on per-class basis so we can calculate a total
number of zspages that can be released. Which is exactly what a
shrinker wants to know.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset tweaks compaction and makes it possible to trigger pool
compaction automatically when system is getting low on memory.
zsmalloc in some cases can suffer from a notable fragmentation and
compaction can release some considerable amount of memory. The problem
here is that currently we fully rely on user space to perform compaction
when needed. However, performing zsmalloc compaction is not always an
obvious thing to do. For example, suppose we have a `idle' fragmented
(compaction was never performed) zram device and system is getting low
on memory due to some 3rd party user processes (gcc LTO, or firefox,
etc.). It's quite unlikely that user space will issue zpool compaction
in this case. Besides, user space cannot tell for sure how badly pool
is fragmented; however, this info is known to zsmalloc and, hence, to a
shrinker.
This patch (of 7):
__zs_compact() does not use `nr_to_migrate', drop it.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We use sysctl_lowmem_reserve_ratio rather than
sysctl_lower_zone_reserve_ratio to determine how aggressive the kernel
is in defending lowmem from the possibility of being captured into
pinned user memory. To avoid misleading, correct it in some comments.
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 1dfb059b94 ("thp: reduce khugepaged freezing latency") fixed
khugepaged to do not block a system suspend. But the result is that it
could not get interrupted before the given timeout because the condition
for the wait event is "false".
This patch puts back the original approach but it uses
freezable_schedule_timeout_interruptible() instead of
schedule_timeout_interruptible(). It does the right thing. I am pretty
sure that the freezable variant was not used in the original fix only
because it was not available at that time.
The regression has been there for ages. It was not critical. It just
did the allocation throttling a little bit more aggressively.
I found this problem when converting the kthread to kthread worker API
and trying to understand the code.
This bug is thought to have minimal userspace-visible impact. Somebody
could set a high alloc_sleep value by mistake, and then try to fix it
back, but khugepaged would keep sleeping until the high value expires.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We cache isolate_start_pfn before entering isolate_migratepages(). If
pageblock is skipped in isolate_migratepages() due to whatever reason,
cc->migrate_pfn can be far from isolate_start_pfn hence we flush pages
that were freed. For example, the following scenario can be possible:
- assume order-9 compaction, pageblock order is 9
- start_isolate_pfn is 0x200
- isolate_migratepages()
- skip a number of pageblocks
- start to isolate from pfn 0x600
- cc->migrate_pfn = 0x620
- return
- last_migrated_pfn is set to 0x200
- check flushing condition
- current_block_start is set to 0x600
- last_migrated_pfn < current_block_start then do useless flush
This wrong flush would not help the performance and success rate so this
patch tries to fix it. One simple way to know the exact position where
we start to isolate migratable pages is that we cache it in
isolate_migratepages() before entering actual isolation. This patch
implements that and fixes the problem.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alloc_pages_exact_node() was introduced in commit 6484eb3e2a ("page
allocator: do not check NUMA node ID when the caller knows the node is
valid") as an optimized variant of alloc_pages_node(), that doesn't
fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
name of the function can easily suggest that the allocation is
restricted to the given node and fails otherwise. In truth, the node is
only preferred, unless __GFP_THISNODE is passed among the gfp flags.
The misleading name has lead to mistakes in the past, see for example
commits 5265047ac3 ("mm, thp: really limit transparent hugepage
allocation to local node") and b360edb43f ("mm, mempolicy:
migrate_to_node should only migrate to node").
Another issue with the name is that there's a family of
alloc_pages_exact*() functions where 'exact' means exact size (instead
of page order), which leads to more confusion.
To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that
it's an optimized variant of alloc_pages_node() not intended for general
usage. Both functions get described in comments.
It has been also considered to really provide a convenience function for
allocations restricted to a node, but the major opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't
duplicate the API needlessly. The number of users would be small
anyway.
Existing callers of alloc_pages_exact_node() are simply converted to
call __alloc_pages_node(), with the exception of sba_alloc_coherent()
which open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead. This means it no longer performs some
VM_BUG_ON checks, and since the current check for nid in
alloc_pages_node() uses a 'nid < 0' comparison (which includes
NUMA_NO_NODE), it may hide wrong values which would be previously
exposed.
Both differences will be rectified by the next patch.
To sum up, this patch makes no functional changes, except temporarily
hiding potentially buggy callers. Restricting the checks in
alloc_pages_node() is left for the next patch which can in turn expose
more existing buggy callers.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Robin Holt <robinmholt@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cliff Whickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is merely a politeness: I've not found that shrink_page_list()
leads to deadlock with the page it holds locked across
wait_on_page_writeback(); but nevertheless, why hold others off by
keeping the page locked there?
And while we're at it: remove the mistaken "not " from the commentary on
this Case 3 (and a distracting blank line from Case 2, if I may).
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In log_early function, crt_early_log should also count once when
'crt_early_log >= ARRAY_SIZE(early_log)'. Otherwise the reported count
from kmemleak_init is one less than 'actual number'.
Then, in kmemleak_init, if early_log buffer size equal actual number,
kmemleak will init sucessful, so change warning condition to
'crt_early_log > ARRAY_SIZE(early_log)'.
Signed-off-by: Wang Kai <morgan.wang@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shmem uses shmem_recalc_inode to update i_blocks when it allocates page,
undoes range or swaps. But mm can drop clean page without notifying
shmem. This makes fstat sometimes return out-of-date block size.
The problem can be partially solved when we add
inode_operations->getattr which calls shmem_recalc_inode to update
i_blocks for fstat.
shmem_recalc_inode also updates counter used by statfs and
vm_committed_as. For them the situation is not changed. They still
suffer from the discrepancy after dropping clean page and before the
function is called by aforementioned triggers.
Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit e3239ff92a ("memblock: Rename memblock_region to
memblock_type and memblock_property to memblock_region"), all local
variables of the membock_type type were renamed to 'type'. This commit
renames all remaining local variables with the memblock_type type to the
same view.
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memory_failure() can be called at any page at any time, which means that
we can't eliminate the possibility of containment failure. In such case
the best option is to leak the page intentionally (and never touch it
later.)
We have an unpoison function for testing, and it cannot handle such
containment-failed pages, which results in kernel panic (visible with
various calltraces.) So this patch suggests that we limit the
unpoisonable pages to properly contained pages and ignore any other
ones.
Testers are recommended to keep in mind that there're un-unpoisonable
pages when writing test programs.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>