mirror of
https://github.com/ukui/kernel.git
synced 2026-03-09 10:07:04 -07:00
94de07e5565be51d91dd91dcae6ede54c0f9be27
17976 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
cda4351252 |
Merge tag 'folio-5.18d' of git://git.infradead.org/users/willy/pagecache
Pull more filesystem folio updates from Matthew Wilcox: "A mixture of odd changes that didn't quite make it into the original pull and fixes for things that did. Also the readpages changes had to wait for the NFS tree to be pulled first. - Remove ->readpages infrastructure - Remove AOP_FLAG_CONT_EXPAND - Move read_descriptor_t to networking code - Pass the iocb to generic_perform_write - Minor updates to iomap, btrfs, ext4, f2fs, ntfs" * tag 'folio-5.18d' of git://git.infradead.org/users/willy/pagecache: btrfs: Remove a use of PAGE_SIZE in btrfs_invalidate_folio() ntfs: Correct mark_ntfs_record_dirty() folio conversion f2fs: Get the superblock from the mapping instead of the page f2fs: Correct f2fs_dirty_data_folio() conversion ext4: Correct ext4_journalled_dirty_folio() conversion filemap: Remove AOP_FLAG_CONT_EXPAND fs: Pass an iocb to generic_perform_write() fs, net: Move read_descriptor_t to net.h fs: Remove read_actor_t iomap: Simplify is_partially_uptodate a little readahead: Update comments mm: remove the skip_page argument to read_pages mm: remove the pages argument to read_pages fs: Remove ->readpages address space operation readahead: Remove read_cache_pages() |
||
|
|
78049e94a1 |
mm/damon: prevent activated scheme from sleeping by deactivated schemes
In the DAMON, the minimum wait time of the schemes decides whether the kernel wakes up 'kdamon_fn()'. But since the minimum wait time is initialized to zero, there are corner cases against the original objective. For example, if we have several schemes for one target, and if the wait time of the first scheme is zero, the minimum wait time will set zero, which means 'kdamond_fn()' should wake up to apply this scheme. However, in the following scheme, wait time can be set to non-zero. Thus, the mininum wait time will be set to non-zero, which can cause sleeping this interval for 'kdamon_fn()' due to one deactivated last scheme. This commit prevents making DAMON monitoring inactive state due to other deactivated schemes. Link: https://lkml.kernel.org/r/20220330105302.32114-1-tome01@ajou.ac.kr Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bfc8089f00 |
mm/kmemleak: reset tag when compare object pointer
When we use HW-tag based kasan and enable vmalloc support, we hit the following bug. It is due to comparison between tagged object and non-tagged pointer. We need to reset the kasan tag when we need to compare tagged object and non-tagged pointer. kmemleak: [name:kmemleak&]Scan area larger than object 0xffffffe77076f440 CPU: 4 PID: 1 Comm: init Tainted: G S W 5.15.25-android13-0-g5cacf919c2bc #1 Hardware name: MT6983(ENG) (DT) Call trace: add_scan_area+0xc4/0x244 kmemleak_scan_area+0x40/0x9c layout_and_allocate+0x1e8/0x288 load_module+0x2c8/0xf00 __se_sys_finit_module+0x190/0x1d0 __arm64_sys_finit_module+0x20/0x30 invoke_syscall+0x60/0x170 el0_svc_common+0xc8/0x114 do_el0_svc+0x28/0xa0 el0_svc+0x60/0xf8 el0t_64_sync_handler+0x88/0xec el0t_64_sync+0x1b4/0x1b8 kmemleak: [name:kmemleak&]Object 0xf5ffffe77076b000 (size 32768): kmemleak: [name:kmemleak&] comm "init", pid 1, jiffies 4294894197 kmemleak: [name:kmemleak&] min_count = 0 kmemleak: [name:kmemleak&] count = 0 kmemleak: [name:kmemleak&] flags = 0x1 kmemleak: [name:kmemleak&] checksum = 0 kmemleak: [name:kmemleak&] backtrace: module_alloc+0x9c/0x120 move_module+0x34/0x19c layout_and_allocate+0x1c4/0x288 load_module+0x2c8/0xf00 __se_sys_finit_module+0x190/0x1d0 __arm64_sys_finit_module+0x20/0x30 invoke_syscall+0x60/0x170 el0_svc_common+0xc8/0x114 do_el0_svc+0x28/0xa0 el0_svc+0x60/0xf8 el0t_64_sync_handler+0x88/0xec el0t_64_sync+0x1b4/0x1b8 Link: https://lkml.kernel.org/r/20220318034051.30687-1-Kuan-Ying.Lee@mediatek.com Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Cc: Yee Lee <yee.lee@mediatek.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
3149c79f3c |
mm,hwpoison: unmap poisoned page before invalidation
In some cases it appears the invalidation of a hwpoisoned page fails
because the page is still mapped in another process. This can cause a
program to be continuously restarted and die when it page faults on the
page that was not invalidated. Avoid that problem by unmapping the
hwpoisoned page when we find it.
Another issue is that sometimes we end up oopsing in finish_fault, if
the code tries to do something with the now-NULL vmf->page. I did not
hit this error when submitting the previous patch because there are
several opportunities for alloc_set_pte to bail out before accessing
vmf->page, and that apparently happened on those systems, and most of
the time on other systems, too.
However, across several million systems that error does occur a handful
of times a day. It can be avoided by returning VM_FAULT_NOPAGE which
will cause do_read_fault to return before calling finish_fault.
Link: https://lkml.kernel.org/r/20220325161428.5068d97e@imladris.surriel.com
Fixes:
|
||
|
|
8f0b364973 |
mm: kfence: fix objcgs vector allocation
If the kfence object is allocated to be used for objects vector, then
this slot of the pool eventually being occupied permanently since the
vector is never freed. The solutions could be (1) freeing vector when
the kfence object is freed or (2) allocating all vectors statically.
Since the memory consumption of object vectors is low, it is better to
chose (2) to fix the issue and it is also can reduce overhead of vectors
allocating in the future.
Link: https://lkml.kernel.org/r/20220328132843.16624-1-songmuchun@bytedance.com
Fixes:
|
||
|
|
adb11e78c5 |
mm/munlock: protect the per-CPU pagevec by a local_lock_t
The access to mlock_pvec is protected by disabling preemption via get_cpu_var() or implicit by having preemption disabled by the caller (in mlock_page_drain() case). This breaks on PREEMPT_RT since folio_lruvec_lock_irq() acquires a sleeping lock in this section. Create struct mlock_pvec which consits of the local_lock_t and the pagevec. Acquire the local_lock() before accessing the per-CPU pagevec. Replace mlock_page_drain() with a _local() version which is invoked on the local CPU and acquires the local_lock_t and a _remote() version which uses the pagevec from a remote CPU which offline. Link: https://lkml.kernel.org/r/YjizWi9IY0mpvIfb@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ece369c7e1 |
mm/munlock: add lru_add_drain() to fix memcg_stat_test
Mike reports that LTP memcg_stat_test usually leads to
memcg_stat_test 3 TINFO: Test unevictable with MAP_LOCKED
memcg_stat_test 3 TINFO: Running memcg_process --mmap-lock1 -s 135168
memcg_stat_test 3 TINFO: Warming up pid: 3460
memcg_stat_test 3 TINFO: Process is still here after warm up: 3460
memcg_stat_test 3 TFAIL: unevictable is 122880, 135168 expected
but may also lead to
memcg_stat_test 4 TINFO: Test unevictable with mlock
memcg_stat_test 4 TINFO: Running memcg_process --mmap-lock2 -s 135168
memcg_stat_test 4 TINFO: Warming up pid: 4271
memcg_stat_test 4 TINFO: Process is still here after warm up: 4271
memcg_stat_test 4 TFAIL: unevictable is 122880, 135168 expected
or both. A wee bit flaky.
follow_page_pte() used to have an lru_add_drain() per each page mlocked,
and the test came to rely on accurate stats. The pagevec to be drained
is different now, but still covered by lru_add_drain(); and, never mind
the test, I believe it's in everyone's interest that a bulk faulting
interface like populate_vma_page_range() or faultin_vma_page_range()
should drain its local pagevecs at the end, to save others sometimes
needing the much more expensive lru_add_drain_all().
This does not absolutely guarantee exact stats - the mlocking task can
be migrated between CPUs as it proceeds - but it's good enough and the
tests pass.
Link: https://lkml.kernel.org/r/47f6d39c-a075-50cb-1cfb-26dd957a48af@google.com
Fixes:
|
||
|
|
e6b0a7b357 |
Revert "mm: madvise: skip unmapped vma holes passed to process_madvise"
This reverts commit |
||
|
|
800ba29547 |
fs: Pass an iocb to generic_perform_write()
We can extract both the file pointer and the pos from the iocb. This simplifies each caller as well as allowing generic_perform_write() to see more of the iocb contents in the future. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
1e4702806f |
readahead: Update comments
- Refer to folios where appropriate, not pages (Matthew Wilcox) - Eliminate references to the internal PG_readhead - Use "readahead" consistently - not "read-ahead" or "read ahead" (mostly Neil Brown) - Clarify some sections that, on reflection, weren't very clear (Neil Brown) - Minor punctuation/spelling fixes (Neil Brown) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
||
|
|
b4e089d705 |
mm: remove the skip_page argument to read_pages
The skip_page argument to read_pages controls if rac->_index is incremented before returning from the function. Just open code that in the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
||
|
|
dfd8b4fc76 |
mm: remove the pages argument to read_pages
This is always an empty list or NULL with the removal of the ->readahead support, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
||
|
|
704528d895 |
fs: Remove ->readpages address space operation
All filesystems have now been converted to use ->readahead, so remove the ->readpages operation and fix all the comments that used to refer to it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
ebf921a9fa |
readahead: Remove read_cache_pages()
With no remaining users, remove this function and the related infrastructure. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
f4f5d7cfb2 |
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin: - vdpa generic device type support - more virtio hardening for broken devices (but on the same theme, revert some virtio hotplug hardening patches - they were misusing some interrupt flags and had to be reverted) - RSS support in virtio-net - max device MTU support in mlx5 vdpa - akcipher support in virtio-crypto - shared IRQ support in ifcvf vdpa - a minor performance improvement in vhost - enable virtio mem for ARM64 - beginnings of advance dma support - cleanups, fixes all over the place * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (33 commits) vdpa/mlx5: Avoid processing works if workqueue was destroyed vhost: handle error while adding split ranges to iotlb vdpa: support exposing the count of vqs to userspace vdpa: change the type of nvqs to u32 vdpa: support exposing the config size to userspace vdpa/mlx5: re-create forwarding rules after mac modified virtio: pci: check bar values read from virtio config space Revert "virtio_pci: harden MSI-X interrupts" Revert "virtio-pci: harden INTX interrupts" drivers/net/virtio_net: Added RSS hash report control. drivers/net/virtio_net: Added RSS hash report. drivers/net/virtio_net: Added basic RSS support. drivers/net/virtio_net: Fixed padded vheader to use v1 with hash. virtio: use virtio_device_ready() in virtio_device_restore() tools/virtio: compile with -pthread tools/virtio: fix after premapped buf support virtio_ring: remove flags check for unmap packed indirect desc virtio_ring: remove flags check for unmap split indirect desc virtio_ring: rename vring_unmap_state_packed() to vring_unmap_extra_packed() net/mlx5: Add support for configuring max device MTU ... |
||
|
|
787af64d05 |
mm: page_alloc: validate buddy before check its migratetype.
Whenever a buddy page is found, page_is_buddy() should be called to
check its validity. Add the missing check during pageblock merge check.
Fixes:
|
||
|
|
1930a6e739 |
Merge tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ptrace cleanups from Eric Biederman:
"This set of changes removes tracehook.h, moves modification of all of
the ptrace fields inside of siglock to remove races, adds a missing
permission check to ptrace.c
The removal of tracehook.h is quite significant as it has been a major
source of confusion in recent years. Much of that confusion was around
task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled making the
semantics clearer).
For people who don't know tracehook.h is a vestiage of an attempt to
implement uprobes like functionality that was never fully merged, and
was later superseeded by uprobes when uprobes was merged. For many
years now we have been removing what tracehook functionaly a little
bit at a time. To the point where anything left in tracehook.h was
some weird strange thing that was difficult to understand"
* tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ptrace: Remove duplicated include in ptrace.c
ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
ptrace: Return the signal to continue with from ptrace_stop
ptrace: Move setting/clearing ptrace_message into ptrace_stop
tracehook: Remove tracehook.h
resume_user_mode: Move to resume_user_mode.h
resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
task_work: Call tracehook_notify_signal from get_signal on all architectures
task_work: Introduce task_work_pending
task_work: Remove unnecessary include from posix_timers.h
ptrace: Remove tracehook_signal_handler
ptrace: Remove arch_syscall_{enter,exit}_tracehook
ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
ptrace/arm: Rename tracehook_report_syscall report_syscall
ptrace: Move ptrace_report_syscall into ptrace.h
|
||
|
|
0a815d0135 |
Merge tag 'ucount-rlimit-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull shm ucounts fix from Eric Biederman: "The introduction of a new failure mode when the code was converted to ucounts resulted in user_shm_lock misbehaving. The change simplifies the code to make the code easier to follow and removes the known misbehaviors" * tag 'ucount-rlimit-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: mm/mlock: fix two bugs in user_shm_lock() |
||
|
|
504c1cabe3 |
mm/balloon_compaction: make balloon page compaction callbacks static
Since commit
|
||
|
|
ae085d7f93 |
mm: kfence: fix missing objcg housekeeping for SLAB
The objcg is not cleared and put for kfence object when it is freed,
which could lead to memory leak for struct obj_cgroup and wrong
statistics of NR_SLAB_RECLAIMABLE_B or NR_SLAB_UNRECLAIMABLE_B.
Since the last freed object's objcg is not cleared,
mem_cgroup_from_obj() could return the wrong memcg when this kfence
object, which is not charged to any objcgs, is reallocated to other
users.
A real word issue [1] is caused by this bug.
Link: https://lore.kernel.org/all/000000000000cabcb505dae9e577@google.com/ [1]
Reported-by: syzbot+f8c45ccc7d5d45fc5965@syzkaller.appspotmail.com
Fixes:
|
||
|
|
02f9a04d76 |
Merge tag 'memblock-v5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock updates from Mike Rapoport: "Test suite and a small cleanup: - A small cleanup of unused variable in __next_mem_pfn_range_in_zone - Initial test suite to simulate memblock behaviour in userspace" * tag 'memblock-v5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: (27 commits) memblock tests: Add TODO and README files memblock tests: Add memblock_alloc_try_nid tests for bottom up memblock tests: Add memblock_alloc_try_nid tests for top down memblock tests: Add memblock_alloc_from tests for bottom up memblock tests: Add memblock_alloc_from tests for top down memblock tests: Add memblock_alloc tests for bottom up memblock tests: Add memblock_alloc tests for top down memblock tests: Add simulation of physical memory memblock tests: Split up reset_memblock function memblock tests: Fix testing with 32-bit physical addresses memblock: __next_mem_pfn_range_in_zone: remove unneeded local variable nid memblock tests: Add memblock_free tests memblock tests: Add memblock_add_node test memblock tests: Add memblock_remove tests memblock tests: Add memblock_reserve tests memblock tests: Add memblock_add tests memblock tests: Add memblock reset function memblock tests: Add skeleton of the memblock simulator tools/include: Add debugfs.h stub tools/include: Add pfn.h stub ... |
||
|
|
a060c9409e |
Merge tag 'write-page-prefaulting' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull iomap fixlet from Andreas Gruenbacher: "Fix buffered write page prefaulting. I forgot to send it the previous merge window. I've only improved the patch description since" * tag 'write-page-prefaulting' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: fs/iomap: Fix buffered write page prefaulting |
||
|
|
631f871f07 |
fs/iomap: Fix buffered write page prefaulting
When part of the user buffer passed to generic_perform_write() or
iomap_file_buffered_write() cannot be faulted in for reading, the entire
write currently fails. The correct behavior would be to write all the
data that can be written, up to the point of failure.
Commit
|
||
|
|
9457056ac4 |
mm: madvise: MADV_DONTNEED_LOCKED
MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT and MCL_ONFAULT allowing to mlock without populating, there are valid use cases for depopulating locked ranges as well. Users mlock memory to protect secrets. There are allocators for secure buffers that want in-use memory generally mlocked, but cleared and invalidated memory to give up the physical pages. This could be done with explicit munlock -> mlock calls on free -> alloc of course, but that adds two unnecessary syscalls, heavy mmap_sem write locks, vma splits and re-merges - only to get rid of the backing pages. Users also mlockall(MCL_ONFAULT) to suppress sustained paging, but are okay with on-demand initial population. It seems valid to selectively free some memory during the lifetime of such a process, without having to mess with its overall policy. Why add a separate flag? Isn't this a pretty niche usecase? - MADV_DONTNEED has been bailing on locked vmas forever. It's at least conceivable that someone, somewhere is relying on mlock to protect data from perhaps broader invalidation calls. Changing this behavior now could lead to quiet data corruption. - It also clarifies expectations around MADV_FREE and maybe MADV_REMOVE. It avoids the situation where one quietly behaves different than the others. MADV_FREE_LOCKED can be added later. - The combination of mlock() and madvise() in the first place is probably niche. But where it happens, I'd say that dropping pages from a locked region once they don't contain secrets or won't page anymore is much saner than relying on mlock to protect memory from speculative or errant invalidation calls. It's just that we can't change the default behavior because of the two previous points. Given that, an explicit new flag seems to make the most sense. [hannes@cmpxchg.org: fix mips build] Link: https://lkml.kernel.org/r/20220304171912.305060-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
6c8e2a2569 |
mm: fix race between MADV_FREE reclaim and blkdev direct IO read
Problem:
=======
Userspace might read the zero-page instead of actual data from a direct IO
read on a block device if the buffers have been called madvise(MADV_FREE)
on earlier (this is discussed below) due to a race between page reclaim on
MADV_FREE and blkdev direct IO read.
- Race condition:
==============
During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
if the page is not dirty, then discards its rmap PTE(s) (vs. remap back
if the page is dirty).
However, after try_to_unmap_one() returns to shrink_page_list(), it might
keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
_one_ page reference, from the isolation for page reclaim).
Well, blkdev_direct_IO() gets references for all pages, and on READ
operations it only sets them dirty _later_.
So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
IO read from block devices, and page reclaim happens during
__blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
returns, but BEFORE the pages are set dirty, the situation happens.
The direct IO read eventually completes. Now, when userspace reads the
buffers, the PTE is no longer there and the page fault handler
do_anonymous_page() services that with the zero-page, NOT the data!
A synthetic reproducer is provided.
- Page faults:
===========
If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
happen, because that faults-in all pages as writeable, so
do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
direct IO. The userspace reads don't fault as the PTE is there (thus
zero-page is not used/setup).
But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE
is no longer there; the subsequent page faults can't help:
The data-read from the block device probably won't generate faults due to
DMA (no MMU) but even in the case it wouldn't use DMA, that happens on
different virtual addresses (not user-mapped addresses) because `struct
bio_vec` stores `struct page` to figure addresses out (which are different
from user-mapped addresses) for the read.
Thus userspace reads (to user-mapped addresses) still fault, then
do_anonymous_page() gets another `struct page` that would address/ map to
other memory than the `struct page` used by `struct bio_vec` for the read.
(The original `struct page` is not available, since it wasn't freed, as
page_ref_freeze() failed due to more page refs. And even if it were
available, its data cannot be trusted anymore.)
Solution:
========
One solution is to check for the expected page reference count in
try_to_unmap_one().
There should be one reference from the isolation (that is also checked in
shrink_page_list() with page_ref_freeze()) plus one or more references
from page mapping(s) (put in discard: label). Further references mean
that rmap/PTE cannot be unmapped/nuked.
(Note: there might be more than one reference from mapping due to
fork()/clone() without CLONE_VM, which use the same `struct page` for
references, until the copy-on-write page gets copied.)
So, additional page references (e.g., from direct IO read) now prevent the
rmap/PTE from being unmapped/dropped; similarly to the page is not freed
per shrink_page_list()/page_ref_freeze()).
- Races and Barriers:
==================
The new check in try_to_unmap_one() should be safe in races with
bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
done under the PTE lock.
The fast path doesn't take the lock, but it checks if the PTE has changed
and if so, it drops the reference and leaves the page for the slow path
(which does take that lock).
The fast path requires synchronization w/ full memory barrier: it writes
the page reference count first then it reads the PTE later, while
try_to_unmap() writes PTE first then it reads page refcount.
And a second barrier is needed, as the page dirty flag should not be read
before the page reference count (as in __remove_mapping()). (This can be
a load memory barrier only; no writes are involved.)
Call stack/comments:
- try_to_unmap_one()
- page_vma_mapped_walk()
- map_pte() # see pte_offset_map_lock():
pte_offset_map()
spin_lock()
- ptep_get_and_clear() # write PTE
- smp_mb() # (new barrier) GUP fast path
- page_ref_count() # (new check) read refcount
- page_vma_mapped_walk_done() # see pte_unmap_unlock():
pte_unmap()
spin_unlock()
- bio_iov_iter_get_pages()
- __bio_iov_iter_get_pages()
- iov_iter_get_pages()
- get_user_pages_fast()
- internal_get_user_pages_fast()
# fast path
- lockless_pages_from_mm()
- gup_{pgd,p4d,pud,pmd,pte}_range()
ptep = pte_offset_map() # not _lock()
pte = ptep_get_lockless(ptep)
page = pte_page(pte)
try_grab_compound_head(page) # inc refcount
# (RMW/barrier
# on success)
if (pte_val(pte) != pte_val(*ptep)) # read PTE
put_compound_head(page) # dec refcount
# go slow path
# slow path
- __gup_longterm_unlocked()
- get_user_pages_unlocked()
- __get_user_pages_locked()
- __get_user_pages()
- follow_{page,p4d,pud,pmd}_mask()
- follow_page_pte()
ptep = pte_offset_map_lock()
pte = *ptep
page = vm_normal_page(pte)
try_grab_page(page) # inc refcount
pte_unmap_unlock()
- Huge Pages:
==========
Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
(aka lazyfree) pages are PageAnon() && !PageSwapBacked()
(madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
thus should reach shrink_page_list() -> split_huge_page_to_list() before
try_to_unmap[_one](), so it deals with normal pages only.
(And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens,
which should not or be rare, the page refcount should be greater than
mapcount: the head page is referenced by tail pages. That also prevents
checking the head `page` then incorrectly call page_remove_rmap(subpage)
for a tail page, that isn't even in the shrink_page_list()'s page_list (an
effect of split huge pmd/pmvw), as it might happen today in this unlikely
scenario.)
MADV_FREE'd buffers:
===================
So, back to the "if MADV_FREE pages are used as buffers" note. The case
is arguable, and subject to multiple interpretations.
The madvise(2) manual page on the MADV_FREE advice value says:
1) 'After a successful MADV_FREE ... data will be lost when
the kernel frees the pages.'
2) 'the free operation will be canceled if the caller writes
into the page' / 'subsequent writes ... will succeed and
then [the] kernel cannot free those dirtied pages'
3) 'If there is no subsequent write, the kernel can free the
pages at any time.'
Thoughts, questions, considerations... respectively:
1) Since the kernel didn't actually free the page (page_ref_freeze()
failed), should the data not have been lost? (on userspace read.)
2) Should writes performed by the direct IO read be able to cancel
the free operation?
- Should the direct IO read be considered as 'the caller' too,
as it's been requested by 'the caller'?
- Should the bio technique to dirty pages on return to userspace
(bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
be considered in another/special way here?
3) Should an upcoming write from a previously requested direct IO
read be considered as a subsequent write, so the kernel should
not free the pages? (as it's known at the time of page reclaim.)
And lastly:
Technically, the last point would seem a reasonable consideration and
balance, as the madvise(2) manual page apparently (and fairly) seem to
assume that 'writes' are memory access from the userspace process (not
explicitly considering writes from the kernel or its corner cases; again,
fairly).. plus the kernel fix implementation for the corner case of the
largely 'non-atomic write' encompassed by a direct IO read operation, is
relatively simple; and it helps.
Reproducer:
==========
@ test.c (simplified, but works)
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
int main() {
int fd, i;
char *buf;
fd = open(DEV, O_RDONLY | O_DIRECT);
buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
buf[i] = 1; // init to non-zero
madvise(buf, BUF_SIZE, MADV_FREE);
read(fd, buf, BUF_SIZE);
for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
printf("%p: 0x%x\n", &buf[i], buf[i]);
return 0;
}
@ block/fops.c (formerly fs/block_dev.c)
+#include <linux/swap.h>
...
... __blkdev_direct_IO[_simple](...)
{
...
+ if (!strcmp(current->comm, "good"))
+ shrink_all_memory(ULONG_MAX);
+
ret = bio_iov_iter_get_pages(...);
+
+ if (!strcmp(current->comm, "bad"))
+ shrink_all_memory(ULONG_MAX);
...
}
@ shell
# NUM_PAGES=4
# PAGE_SIZE=$(getconf PAGE_SIZE)
# yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
# DEV=$(losetup -f --show test.img)
# gcc -DDEV=\"$DEV\" \
-DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
-DPAGE_SIZE=${PAGE_SIZE} \
test.c -o test
# od -tx1 $DEV
0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
*
0040000
# mv test good
# ./good
0x7f7c10418000: 0x79
0x7f7c10419000: 0x79
0x7f7c1041a000: 0x79
0x7f7c1041b000: 0x79
# mv good bad
# ./bad
0x7fa1b8050000: 0x0
0x7fa1b8051000: 0x0
0x7fa1b8052000: 0x0
0x7fa1b8053000: 0x0
Note: the issue is consistent on v5.17-rc3, but it's intermittent with the
support of MADV_FREE on v4.5 (60%-70% error; needs swap). [wrap
do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c].
- v5.17-rc3:
# for i in {1..1000}; do ./good; done \
| cut -d: -f2 | sort | uniq -c
4000 0x79
# mv good bad
# for i in {1..1000}; do ./bad; done \
| cut -d: -f2 | sort | uniq -c
4000 0x0
# free | grep Swap
Swap: 0 0 0
- v4.5:
# for i in {1..1000}; do ./good; done \
| cut -d: -f2 | sort | uniq -c
4000 0x79
# mv good bad
# for i in {1..1000}; do ./bad; done \
| cut -d: -f2 | sort | uniq -c
2702 0x0
1298 0x79
# swapoff -av
swapoff /swap
# for i in {1..1000}; do ./bad; done \
| cut -d: -f2 | sort | uniq -c
4000 0x79
Ceph/TCMalloc:
=============
For documentation purposes, the use case driving the analysis/fix is Ceph
on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to
release unused memory to the system from the mmap'ed page heap (might be
committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan()
-> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() ->
TCMalloc_SystemCommit() -> do nothing.
Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
just 'disappeared' on Ceph on later Ubuntu releases but is still present
in the kernel, and can be hit by other use cases.
The observed issue seems to be the old Ceph bug #22464 [1], where checksum
mismatches are observed (and instrumentation with buffer dumps shows
zero-pages read from mmap'ed/MADV_FREE'd page ranges).
The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
mostly worked around with a retry mechanism, but other parts of Ceph could
still hit that (rocksdb). Anyway, it's less likely to be hit again as
TCMalloc switched out of MADV_FREE by default.
(Some kernel versions/reports from the Ceph bug, and relation with
the MADV_FREE introduction/changes; TCMalloc versions not checked.)
- 4.4 good
- 4.5 (madv_free: introduction)
- 4.9 bad
- 4.10 good? maybe a swapless system
- 4.12 (madv_free: no longer free instantly on swapless systems)
- 4.13 bad
[1] https://tracker.ceph.com/issues/22464
Thanks:
======
Several people contributed to analysis/discussions/tests/reproducers in
the first stages when drilling down on ceph/tcmalloc/linux kernel:
- Dan Hill
- Dan Streetman
- Dongdong Tao
- Gavin Guo
- Gerald Yang
- Heitor Alves de Siqueira
- Ioanna Alifieraki
- Jay Vosburgh
- Matthew Ruffell
- Ponnuvel Palaniyappan
Reviews, suggestions, corrections, comments:
- Minchan Kim
- Yu Zhao
- Huang, Ying
- John Hubbard
- Christoph Hellwig
[mfo@canonical.com: v4]
Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com
Fixes:
|