Patch series "Constify struct page arguments".
While working on various solutions to the 32-bit struct page size
regression, one of the problems I found was the networking stack expects
to be able to pass const struct page pointers around, and the mm doesn't
provide a lot of const-friendly functions to call. The root tangle of
problems is that a lot of functions call VM_BUG_ON_PAGE(), which calls
dump_page(), which calls a lot of functions which don't take a const
struct page (but could be const).
This patch (of 6):
The only caller of __dump_page() now opencodes dump_page(), so remove it
as an externally visible symbol.
Link: https://lkml.kernel.org/r/20210416231531.2521383-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20210416231531.2521383-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: allow mapping accounted kernel pages to userspace", v6.
Currently a non-slab kernel page which has been charged to a memory cgroup
can't be mapped to userspace. The underlying reason is simple: PageKmemcg
flag is defined as a page type (like buddy, offline, etc), so it takes a
bit from a page->mapped counter. Pages with a type set can't be mapped to
userspace.
But in general the kmemcg flag has nothing to do with mapping to
userspace. It only means that the page has been accounted by the page
allocator, so it has to be properly uncharged on release.
Some bpf maps are mapping the vmalloc-based memory to userspace, and their
memory can't be accounted because of this implementation detail.
This patchset removes this limitation by moving the PageKmemcg flag into
one of the free bits of the page->mem_cgroup pointer. Also it formalizes
accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
adds several checks and removes a couple of obsolete functions. As the
result the code became more robust with fewer open-coded bit tricks.
This patch (of 4):
Currently there are many open-coded reads of the page->mem_cgroup pointer,
as well as a couple of read helpers, which are barely used.
It creates an obstacle on a way to reuse some bits of the pointer for
storing additional bits of information. In fact, we already do this for
slab pages, where the last bit indicates that a pointer has an attached
vector of objcg pointers instead of a regular memcg pointer.
This commits uses 2 existing helpers and introduces a new helper to
converts all read sides to calls of these helpers:
struct mem_cgroup *page_memcg(struct page *page);
struct mem_cgroup *page_memcg_rcu(struct page *page);
struct mem_cgroup *page_memcg_check(struct page *page);
page_memcg_check() is intended to be used in cases when the page can be a
slab page and have a memcg pointer pointing at objcg vector. It does
check the lowest bit, and if set, returns NULL. page_memcg() contains a
VM_BUG_ON_PAGE() check for the page not being a slab page.
To make sure nobody uses a direct access, struct page's
mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
If a compound page is being split while dump_page() is being run on that
page, we can end up calling compound_mapcount() on a page that is no
longer compound. This leads to a crash (already seen at least once in the
field), due to the VM_BUG_ON_PAGE() assertion inside compound_mapcount().
(The above is from Matthew Wilcox's analysis of Qian Cai's bug report.)
A similar problem is possible, via compound_pincount() instead of
compound_mapcount().
In order to avoid this kind of crash, make dump_page() slightly more
robust, by providing a pair of simpler routines that don't contain
assertions: head_mapcount() and head_pincount().
For debug tools, we don't want to go *too* far in this direction, but this
is a simple small fix, and the crash has already been seen, so it's a good
trade-off.
Reported-by: Qian Cai <cai@lca.pw>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Link: http://lkml.kernel.org/r/20200804214807.169256-1-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Except for historical confusion in the kprobes/uprobes and bpf tracers,
which has been fixed now, there is no good reason to ever allow user
memory accesses from probe_kernel_read. Switch probe_kernel_read to only
read from kernel memory.
[akpm@linux-foundation.org: update it for "mm, dump_page(): do not crash with invalid mapping pointer"]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200521152301.2587579-17-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have seen a following problem on a RPi4 with 1G RAM:
BUG: Bad page state in process systemd-hwdb pfn:35601
page:ffff7e0000d58040 refcount:15 mapcount:131221 mapping:efd8fe765bc80080 index:0x1 compound_mapcount: -32767
Unable to handle kernel paging request at virtual address efd8fe765bc80080
Mem abort info:
ESR = 0x96000004
Exception class = DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
[efd8fe765bc80080] address between user and kernel address ranges
Internal error: Oops: 96000004 [#1] SMP
Modules linked in: btrfs libcrc32c xor xor_neon zlib_deflate raid6_pq mmc_block xhci_pci xhci_hcd usbcore sdhci_iproc sdhci_pltfm sdhci mmc_core clk_raspberrypi gpio_raspberrypi_exp pcie_brcmstb bcm2835_dma gpio_regulator phy_generic fixed sg scsi_mod efivarfs
Supported: No, Unreleased kernel
CPU: 3 PID: 408 Comm: systemd-hwdb Not tainted 5.3.18-8-default #1 SLE15-SP2 (unreleased)
Hardware name: raspberrypi rpi/rpi, BIOS 2020.01 02/21/2020
pstate: 40000085 (nZcv daIf -PAN -UAO)
pc : __dump_page+0x268/0x368
lr : __dump_page+0xc4/0x368
sp : ffff000012563860
x29: ffff000012563860 x28: ffff80003ddc4300
x27: 0000000000000010 x26: 000000000000003f
x25: ffff7e0000d58040 x24: 000000000000000f
x23: efd8fe765bc80080 x22: 0000000000020095
x21: efd8fe765bc80080 x20: ffff000010ede8b0
x19: ffff7e0000d58040 x18: ffffffffffffffff
x17: 0000000000000001 x16: 0000000000000007
x15: ffff000011689708 x14: 3030386362353637
x13: 6566386466653a67 x12: 6e697070616d2031
x11: 32323133313a746e x10: 756f6370616d2035
x9 : ffff00001168a840 x8 : ffff00001077a670
x7 : 000000000000013d x6 : ffff0000118a43b5
x5 : 0000000000000001 x4 : ffff80003dd9e2c8
x3 : ffff80003dd9e2c8 x2 : 911c8d7c2f483500
x1 : dead000000000100 x0 : efd8fe765bc80080
Call trace:
__dump_page+0x268/0x368
bad_page+0xd4/0x168
check_new_page_bad+0x80/0xb8
rmqueue_bulk.constprop.26+0x4d8/0x788
get_page_from_freelist+0x4d4/0x1228
__alloc_pages_nodemask+0x134/0xe48
alloc_pages_vma+0x198/0x1c0
do_anonymous_page+0x1a4/0x4d8
__handle_mm_fault+0x4e8/0x560
handle_mm_fault+0x104/0x1e0
do_page_fault+0x1e8/0x4c0
do_translation_fault+0xb0/0xc0
do_mem_abort+0x50/0xb0
el0_da+0x24/0x28
Code: f9401025 8b8018a0 9a851005 17ffffca (f94002a0)
Besides the underlying issue with page->mapping containing a bogus value
for some reason, we can see that __dump_page() crashed by trying to read
the pointer at mapping->host, turning a recoverable warning into full
Oops.
It can be expected that when page is reported as bad state for some
reason, the pointers there should not be trusted blindly.
So this patch treats all data in __dump_page() that depends on
page->mapping as lava, using probe_kernel_read_strict(). Ideally this
would include the dentry->d_parent recursively, but that would mean
changing printk handler for %pd. Chances of reaching the dentry
printing part with an initially bogus mapping pointer should be rather
low, though.
Also prefix printing mapping->a_ops with a description of what is being
printed. In case the value is bogus, %ps will print raw value instead
of the symbol name and then it's not obvious at all that it's printing
a_ops.
Reported-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Link: http://lkml.kernel.org/r/20200331165454.12263-1-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is not that hard to trigger lockdep splats by calling printk from
under zone->lock. Most of them are false positives caused by lock
chains introduced early in the boot process and they do not cause any
real problems (although most of the early boot lock dependencies could
happen after boot as well). There are some console drivers which do
allocate from the printk context as well and those should be fixed. In
any case, false positives are not that trivial to workaround and it is
far from optimal to lose lockdep functionality for something that is a
non-issue.
So change has_unmovable_pages() so that it no longer calls dump_page()
itself - instead it returns a "struct page *" of the unmovable page back
to the caller so that in the case of a has_unmovable_pages() failure,
the caller can call dump_page() after releasing zone->lock. Also, make
dump_page() is able to report a CMA page as well, so the reason string
from has_unmovable_pages() can be removed.
Even though has_unmovable_pages doesn't hold any reference to the
returned page this should be reasonably safe for the purpose of
reporting the page (dump_page) because it cannot be hotremoved in the
context of memory unplug. The state of the page might change but that
is the case even with the existing code as zone->lock only plays role
for free pages.
While at it, remove a similar but unnecessary debug-only printk() as
well. A sample of one of those lockdep splats is,
WARNING: possible circular locking dependency detected
------------------------------------------------------
test.sh/8653 is trying to acquire lock:
ffffffff865a4460 (console_owner){-.-.}, at:
console_unlock+0x207/0x750
but task is already holding lock:
ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
__offline_isolated_pages+0x179/0x3e0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&(&zone->lock)->rlock){-.-.}:
__lock_acquire+0x5b3/0xb40
lock_acquire+0x126/0x280
_raw_spin_lock+0x2f/0x40
rmqueue_bulk.constprop.21+0xb6/0x1160
get_page_from_freelist+0x898/0x22c0
__alloc_pages_nodemask+0x2f3/0x1cd0
alloc_pages_current+0x9c/0x110
allocate_slab+0x4c6/0x19c0
new_slab+0x46/0x70
___slab_alloc+0x58b/0x960
__slab_alloc+0x43/0x70
__kmalloc+0x3ad/0x4b0
__tty_buffer_request_room+0x100/0x250
tty_insert_flip_string_fixed_flag+0x67/0x110
pty_write+0xa2/0xf0
n_tty_write+0x36b/0x7b0
tty_write+0x284/0x4c0
__vfs_write+0x50/0xa0
vfs_write+0x105/0x290
redirected_tty_write+0x6a/0xc0
do_iter_write+0x248/0x2a0
vfs_writev+0x106/0x1e0
do_writev+0xd4/0x180
__x64_sys_writev+0x45/0x50
do_syscall_64+0xcc/0x76c
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #2 (&(&port->lock)->rlock){-.-.}:
__lock_acquire+0x5b3/0xb40
lock_acquire+0x126/0x280
_raw_spin_lock_irqsave+0x3a/0x50
tty_port_tty_get+0x20/0x60
tty_port_default_wakeup+0xf/0x30
tty_port_tty_wakeup+0x39/0x40
uart_write_wakeup+0x2a/0x40
serial8250_tx_chars+0x22e/0x440
serial8250_handle_irq.part.8+0x14a/0x170
serial8250_default_handle_irq+0x5c/0x90
serial8250_interrupt+0xa6/0x130
__handle_irq_event_percpu+0x78/0x4f0
handle_irq_event_percpu+0x70/0x100
handle_irq_event+0x5a/0x8b
handle_edge_irq+0x117/0x370
do_IRQ+0x9e/0x1e0
ret_from_intr+0x0/0x2a
cpuidle_enter_state+0x156/0x8e0
cpuidle_enter+0x41/0x70
call_cpuidle+0x5e/0x90
do_idle+0x333/0x370
cpu_startup_entry+0x1d/0x1f
start_secondary+0x290/0x330
secondary_startup_64+0xb6/0xc0
-> #1 (&port_lock_key){-.-.}:
__lock_acquire+0x5b3/0xb40
lock_acquire+0x126/0x280
_raw_spin_lock_irqsave+0x3a/0x50
serial8250_console_write+0x3e4/0x450
univ8250_console_write+0x4b/0x60
console_unlock+0x501/0x750
vprintk_emit+0x10d/0x340
vprintk_default+0x1f/0x30
vprintk_func+0x44/0xd4
printk+0x9f/0xc5
-> #0 (console_owner){-.-.}:
check_prev_add+0x107/0xea0
validate_chain+0x8fc/0x1200
__lock_acquire+0x5b3/0xb40
lock_acquire+0x126/0x280
console_unlock+0x269/0x750
vprintk_emit+0x10d/0x340
vprintk_default+0x1f/0x30
vprintk_func+0x44/0xd4
printk+0x9f/0xc5
__offline_isolated_pages.cold.52+0x2f/0x30a
offline_isolated_pages_cb+0x17/0x30
walk_system_ram_range+0xda/0x160
__offline_pages+0x79c/0xa10
offline_pages+0x11/0x20
memory_subsys_offline+0x7e/0xc0
device_offline+0xd5/0x110
state_store+0xc6/0xe0
dev_attr_store+0x3f/0x60
sysfs_kf_write+0x89/0xb0
kernfs_fop_write+0x188/0x240
__vfs_write+0x50/0xa0
vfs_write+0x105/0x290
ksys_write+0xc6/0x160
__x64_sys_write+0x43/0x50
do_syscall_64+0xcc/0x76c
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Chain exists of:
console_owner --> &(&port->lock)->rlock --> &(&zone->lock)->rlock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&(&zone->lock)->rlock);
lock(&(&port->lock)->rlock);
lock(&(&zone->lock)->rlock);
lock(console_owner);
*** DEADLOCK ***
9 locks held by test.sh/8653:
#0: ffff88839ba7d408 (sb_writers#4){.+.+}, at:
vfs_write+0x25f/0x290
#1: ffff888277618880 (&of->mutex){+.+.}, at:
kernfs_fop_write+0x128/0x240
#2: ffff8898131fc218 (kn->count#115){.+.+}, at:
kernfs_fop_write+0x138/0x240
#3: ffffffff86962a80 (device_hotplug_lock){+.+.}, at:
lock_device_hotplug_sysfs+0x16/0x50
#4: ffff8884374f4990 (&dev->mutex){....}, at:
device_offline+0x70/0x110
#5: ffffffff86515250 (cpu_hotplug_lock.rw_sem){++++}, at:
__offline_pages+0xbf/0xa10
#6: ffffffff867405f0 (mem_hotplug_lock.rw_sem){++++}, at:
percpu_down_write+0x87/0x2f0
#7: ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
__offline_isolated_pages+0x179/0x3e0
#8: ffffffff865a4920 (console_lock){+.+.}, at:
vprintk_emit+0x100/0x340
stack backtrace:
Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10,
BIOS U34 05/21/2019
Call Trace:
dump_stack+0x86/0xca
print_circular_bug.cold.31+0x243/0x26e
check_noncircular+0x29e/0x2e0
check_prev_add+0x107/0xea0
validate_chain+0x8fc/0x1200
__lock_acquire+0x5b3/0xb40
lock_acquire+0x126/0x280
console_unlock+0x269/0x750
vprintk_emit+0x10d/0x340
vprintk_default+0x1f/0x30
vprintk_func+0x44/0xd4
printk+0x9f/0xc5
__offline_isolated_pages.cold.52+0x2f/0x30a
offline_isolated_pages_cb+0x17/0x30
walk_system_ram_range+0xda/0x160
__offline_pages+0x79c/0xa10
offline_pages+0x11/0x20
memory_subsys_offline+0x7e/0xc0
device_offline+0xd5/0x110
state_store+0xc6/0xe0
dev_attr_store+0x3f/0x60
sysfs_kf_write+0x89/0xb0
kernfs_fop_write+0x188/0x240
__vfs_write+0x50/0xa0
vfs_write+0x105/0x290
ksys_write+0xc6/0x160
__x64_sys_write+0x43/0x50
do_syscall_64+0xcc/0x76c
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Link: http://lkml.kernel.org/r/20200117181200.20299-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The name mmu_notifier_mm implies that the thing is a mm_struct pointer,
and is difficult to abbreviate. The struct is actually holding the
interval tree and hlist containing the notifiers subscribed to a mm.
Use 'subscriptions' as the variable name for this struct instead of the
really terrible and misleading 'mmn_mm'.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
PageAnon() and PageKsm() use the low two bits of the page->mapping
pointer to indicate the page type. PageAnon() only checks the LSB while
PageKsm() checks the least significant 2 bits are equal to 3.
Therefore, PageAnon() is true for KSM pages. __dump_page() incorrectly
will never print "ksm" because it checks PageAnon() first. Fix this by
checking PageKsm() first.
Link: http://lkml.kernel.org/r/20191113000651.20677-1-rcampbell@nvidia.com
Fixes: 1c6fb1d89e ("mm: print more information about mapping in __dump_page")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When dumping struct page information, __dump_page() prints the page type
with a trailing blank followed by the page flags on a separate line:
anon
flags: 0x100000000090034(uptodate|lru|active|head|swapbacked)
It looks like the intent was to use pr_cont() for printing "flags:" but
pr_cont() usage is discouraged so fix this by extending the format to
include the flags into a single line:
anon flags: 0x100000000090034(uptodate|lru|active|head|swapbacked)
If the page is file backed, the name might be long so use two lines:
shmem_aops name:"dev/zero"
flags: 0x10000000008000c(uptodate|dirty|swapbacked)
Eliminate pr_conf() usage as well for appending compound_mapcount.
Link: http://lkml.kernel.org/r/20191112012608.16926-1-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While debugging something, I added a dump_page() into do_swap_page(),
and I got the splat from below. The issue happens when dereferencing
mapping->host in __dump_page():
...
else if (mapping) {
pr_warn("%ps ", mapping->a_ops);
if (mapping->host->i_dentry.first) {
struct dentry *dentry;
dentry = container_of(mapping->host->i_dentry.first, struct dentry, d_u.d_alias);
pr_warn("name:\"%pd\" ", dentry);
}
}
...
Swap address space does not contain an inode information, and so
mapping->host equals NULL.
Although the dump_page() call was added artificially into
do_swap_page(), I am not sure if we can hit this from any other path, so
it looks worth fixing it. We can easily do that by checking
mapping->host first.
Link: http://lkml.kernel.org/r/20190318072931.29094-1-osalvador@suse.de
Fixes: 1c6fb1d89e ("mm: print more information about mapping in __dump_page")
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>