kill_node() nullifies/checks Node->dentry to avoid double free. This
complicates the next changes and this is very confusing:
- we do not need to check dentry != NULL under entries_lock,
kill_node() is always called under inode_lock(d_inode(root)) and we
rely on this inode_lock() anyway, without this lock the
MISC_FMT_OPEN_FILE cleanup could race with itself.
- if kill_inode() was already called and ->dentry == NULL we should not
even try to close e->interp_file.
We can change bm_entry_write() to simply check !list_empty(list) before
kill_node. Again, we rely on inode_lock(), in particular it saves us
from the race with bm_status_write(), another caller of kill_node().
Link: http://lkml.kernel.org/r/20170922143641.GA17210@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "exec: binfmt_misc: fix use-after-free, kill
iname[BINPRM_BUF_SIZE]".
It looks like this code was always wrong, then commit 948b701a60
("binfmt_misc: add persistent opened binary handler for containers")
added more problems.
This patch (of 6):
load_script() can simply use i_name instead, it points into bprm->buf[]
and nobody can change this memory until we call prepare_binprm().
The only complication is that we need to also change the signature of
bprm_change_interp() but this change looks good too.
While at it, do whitespace/style cleanups.
NOTE: the real motivation for this change is that people want to
increase BINPRM_BUF_SIZE, we need to change load_misc_binary() too but
this looks more complicated because afaics it is very buggy.
Link: http://lkml.kernel.org/r/20170918163446.GA26793@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Travis Gummels <tgummels@redhat.com>
Cc: Ben Woodard <woodard@redhat.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When reading the event from the uffd, we put it on a temporary
fork_event list to detect if we can still access it after releasing and
retaking the event_wqh.lock.
If fork aborts and removes the event from the fork_event all is fine as
long as we're still in the userfault read context and fork_event head is
still alive.
We've to put the event allocated in the fork kernel stack, back from
fork_event list-head to the event_wqh head, before returning from
userfaultfd_ctx_read, because the fork_event head lifetime is limited to
the userfaultfd_ctx_read stack lifetime.
Forgetting to move the event back to its event_wqh place then results in
__remove_wait_queue(&ctx->event_wqh, &ewq->wq); in
userfaultfd_event_wait_completion to remove it from a head that has been
already freed from the reader stack.
This could only happen if resolve_userfault_fork failed (for example if
there are no file descriptors available to allocate the fork uffd). If
it succeeded it was put back correctly.
Furthermore, after find_userfault_evt receives a fork event, the forked
userfault context in fork_nctx and uwq->msg.arg.reserved.reserved1 can
be released by the fork thread as soon as the event_wqh.lock is
released. Taking a reference on the fork_nctx before dropping the lock
prevents an use after free in resolve_userfault_fork().
If the fork side aborted and it already released everything, we still
try to succeed resolve_userfault_fork(), if possible.
Fixes: 893e26e61d ("userfaultfd: non-cooperative: Add fork() event")
Link: http://lkml.kernel.org/r/20170920180413.26713-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MADV_FREE clears pte dirty bit and then marks the page lazyfree (clear
SwapBacked). There is no lock to prevent the page is added to swap
cache between these two steps by page reclaim. If page reclaim finds
such page, it will simply add the page to swap cache without pageout the
page to swap because the page is marked as clean. Next time, page fault
will read data from the swap slot which doesn't have the original data,
so we have a data corruption. To fix issue, we mark the page dirty and
pageout the page.
However, we shouldn't dirty all pages which is clean and in swap cache.
swapin page is swap cache and clean too. So we only dirty page which is
added into swap cache in page reclaim, which shouldn't be swapin page.
As Minchan suggested, simply dirty the page in add_to_swap can do the
job.
Fixes: 802a3a92ad ("mm: reclaim MADV_FREE pages")
Link: http://lkml.kernel.org/r/08c84256b007bf3f63c91d94383bd9eb6fee2daa.1506446061.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Reported-by: Artem Savkov <asavkov@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [4.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MADV_FREE clears pte dirty bit and then marks the page lazyfree (clear
SwapBacked). There is no lock to prevent the page is added to swap
cache between these two steps by page reclaim. Page reclaim could add
the page to swap cache and unmap the page. After page reclaim, the page
is added back to lru. At that time, we probably start draining per-cpu
pagevec and mark the page lazyfree. So the page could be in a state
with SwapBacked cleared and PG_swapcache set. Next time there is a
refault in the virtual address, do_swap_page can find the page from swap
cache but the page has PageSwapCache false because SwapBacked isn't set,
so do_swap_page will bail out and do nothing. The task will keep
running into fault handler.
Fixes: 802a3a92ad ("mm: reclaim MADV_FREE pages")
Link: http://lkml.kernel.org/r/6537ef3814398c0073630b03f176263bc81f0902.1506446061.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Reported-by: Artem Savkov <asavkov@redhat.com>
Tested-by: Artem Savkov <asavkov@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [4.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eryu noticed that he could sometimes get a leftover error reported when
it shouldn't be on fsync with ext2 and non-journalled ext4.
The problem is that writeback_single_inode still uses filemap_fdatawait.
That picks up a previously set AS_EIO flag, which would ordinarily have
been cleared before.
Since we're mostly using this function as a replacement for
filemap_check_errors, have filemap_check_and_advance_wb_err clear AS_EIO
and AS_ENOSPC when reporting an error. That should allow the new
function to better emulate the behavior of the old with respect to these
flags.
Link: http://lkml.kernel.org/r/20170922133331.28812-1-jlayton@kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reported-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Locking of config and doorbell operations should be done only if the
underlying hardware requires it.
This patch removes the global spinlocks from the rapidio subsystem and
moves them to the mport drivers (fsl_rio and tsi721), only to the
necessary places. For example, local config space read and write
operations (lcread/lcwrite) are atomic in all existing drivers, so there
should be no need for locking, while the cread/cwrite operations which
generate maintenance transactions need to be synchronized with a lock.
Later, each driver could chose to use a per-port lock instead of a
global one, or even more granular locking.
Link: http://lkml.kernel.org/r/20170824113023.GD50104@nokia.com
Signed-off-by: Ioan Nicu <ioan.nicu.ext@nokia.com>
Signed-off-by: Frank Kunz <frank.kunz@nokia.com>
Acked-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function is called from __meminit context and calls other __meminit
functions but isn't it self mark as such today:
WARNING: vmlinux.o(.text.unlikely+0x4516): Section mismatch in reference from the function init_reserved_page() to the function .meminit.text:early_pfn_to_nid()
The function init_reserved_page() references the function __meminit early_pfn_to_nid().
This is often because init_reserved_page lacks a __meminit annotation or the annotation of early_pfn_to_nid is wrong.
On most compilers, we don't notice this because the function gets
inlined all the time. Adding __meminit here fixes the harmless warning
for the old versions and is generally the correct annotation.
Link: http://lkml.kernel.org/r/20170915193149.901180-1-arnd@arndb.de
Fixes: 7e18adb4f8 ("mm: meminit: initialise remaining struct pages in parallel with kswapd")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea brought to my attention that the L->{L,S} guarantees are
completely bogus for this case. I was looking at the diagram, from the
offending commit, when that _is_ the race, we had the load reordered
already.
What we need is at least S->L semantics, thus simply use
wq_has_sleeper() to serialize the call for good.
Link: http://lkml.kernel.org/r/20170914175313.GB811@linux-80c1.suse
Fixes: 46acef048a (mm,compaction: serialize waitqueue_active() checks)
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Andrea Parri <parri.andrea@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following lockdep splat has been noticed during LTP testing
======================================================
WARNING: possible circular locking dependency detected
4.13.0-rc3-next-20170807 #12 Not tainted
------------------------------------------------------
a.out/4771 is trying to acquire lock:
(cpu_hotplug_lock.rw_sem){++++++}, at: [<ffffffff812b4668>] drain_all_stock.part.35+0x18/0x140
but task is already holding lock:
(&mm->mmap_sem){++++++}, at: [<ffffffff8106eb35>] __do_page_fault+0x175/0x530
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&mm->mmap_sem){++++++}:
lock_acquire+0xc9/0x230
__might_fault+0x70/0xa0
_copy_to_user+0x23/0x70
filldir+0xa7/0x110
xfs_dir2_sf_getdents.isra.10+0x20c/0x2c0 [xfs]
xfs_readdir+0x1fa/0x2c0 [xfs]
xfs_file_readdir+0x30/0x40 [xfs]
iterate_dir+0x17a/0x1a0
SyS_getdents+0xb0/0x160
entry_SYSCALL_64_fastpath+0x1f/0xbe
-> #2 (&type->i_mutex_dir_key#3){++++++}:
lock_acquire+0xc9/0x230
down_read+0x51/0xb0
lookup_slow+0xde/0x210
walk_component+0x160/0x250
link_path_walk+0x1a6/0x610
path_openat+0xe4/0xd50
do_filp_open+0x91/0x100
file_open_name+0xf5/0x130
filp_open+0x33/0x50
kernel_read_file_from_path+0x39/0x80
_request_firmware+0x39f/0x880
request_firmware_direct+0x37/0x50
request_microcode_fw+0x64/0xe0
reload_store+0xf7/0x180
dev_attr_store+0x18/0x30
sysfs_kf_write+0x44/0x60
kernfs_fop_write+0x113/0x1a0
__vfs_write+0x37/0x170
vfs_write+0xc7/0x1c0
SyS_write+0x58/0xc0
do_syscall_64+0x6c/0x1f0
return_from_SYSCALL_64+0x0/0x7a
-> #1 (microcode_mutex){+.+.+.}:
lock_acquire+0xc9/0x230
__mutex_lock+0x88/0x960
mutex_lock_nested+0x1b/0x20
microcode_init+0xbb/0x208
do_one_initcall+0x51/0x1a9
kernel_init_freeable+0x208/0x2a7
kernel_init+0xe/0x104
ret_from_fork+0x2a/0x40
-> #0 (cpu_hotplug_lock.rw_sem){++++++}:
__lock_acquire+0x153c/0x1550
lock_acquire+0xc9/0x230
cpus_read_lock+0x4b/0x90
drain_all_stock.part.35+0x18/0x140
try_charge+0x3ab/0x6e0
mem_cgroup_try_charge+0x7f/0x2c0
shmem_getpage_gfp+0x25f/0x1050
shmem_fault+0x96/0x200
__do_fault+0x1e/0xa0
__handle_mm_fault+0x9c3/0xe00
handle_mm_fault+0x16e/0x380
__do_page_fault+0x24a/0x530
do_page_fault+0x30/0x80
page_fault+0x28/0x30
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock.rw_sem --> &type->i_mutex_dir_key#3 --> &mm->mmap_sem
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&mm->mmap_sem);
lock(&type->i_mutex_dir_key#3);
lock(&mm->mmap_sem);
lock(cpu_hotplug_lock.rw_sem);
*** DEADLOCK ***
2 locks held by a.out/4771:
#0: (&mm->mmap_sem){++++++}, at: [<ffffffff8106eb35>] __do_page_fault+0x175/0x530
#1: (percpu_charge_mutex){+.+...}, at: [<ffffffff812b4c97>] try_charge+0x397/0x6e0
The problem is very similar to the one fixed by commit a459eeb7b8
("mm, page_alloc: do not depend on cpu hotplug locks inside the
allocator"). We are taking hotplug locks while we can be sitting on top
of basically arbitrary locks. This just calls for problems.
We can get rid of {get,put}_online_cpus, fortunately. We do not have to
be worried about races with memory hotplug because drain_local_stock,
which is called from both the WQ draining and the memory hotplug
contexts, is always operating on the local cpu stock with IRQs disabled.
The only thing to be careful about is that the target memcg doesn't
vanish while we are still in drain_all_stock so take a reference on it.
Link: http://lkml.kernel.org/r/20170913090023.28322-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Artem Savkov <asavkov@redhat.com>
Tested-by: Artem Savkov <asavkov@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea has noticed that the oom_reaper doesn't invalidate the range via
mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can
corrupt the memory of the kvm guest for example.
tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not
sufficient as per Andrea:
"mmu_notifier_invalidate_range cannot be used in replacement of
mmu_notifier_invalidate_range_start/end. For KVM
mmu_notifier_invalidate_range is a noop and rightfully so. A MMU
notifier implementation has to implement either ->invalidate_range
method or the invalidate_range_start/end methods, not both. And if you
implement invalidate_range_start/end like KVM is forced to do, calling
mmu_notifier_invalidate_range in common code is a noop for KVM.
For those MMU notifiers that can get away only implementing
->invalidate_range, the ->invalidate_range is implicitly called by
mmu_notifier_invalidate_range_end(). And only those secondary MMUs
that share the same pagetable with the primary MMU (like AMD iommuv2)
can get away only implementing ->invalidate_range"
As the callback is allowed to sleep and the implementation is out of
hand of the MM it is safer to simply bail out if there is an mmu
notifier registered. In order to not fail too early make the
mm_has_notifiers check under the oom_lock and have a little nap before
failing to give the current oom victim some more time to exit.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org
Fixes: aac4536355 ("mm, oom: introduce oom reaper")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>