XFS shutdown deadlocks have been reproduced by fstest generic/475.
The deadlock signature involves log I/O completion running error
handling to abort logged items and waiting for an inode cluster
buffer lock in the buffer item unpin handler. The buffer lock is
held by xfsaild attempting to flush an inode. The buffer happens to
be pinned and so xfs_iflush() triggers an async log force to begin
work required to get it unpinned. The log force is blocked waiting
on the commit completion, which never occurs and thus leaves the
filesystem deadlocked.
The root problem is that aborted log I/O completion pots commit
completion behind callback completion, which is unexpected for async
log forces. Under normal running conditions, an async log force
returns to the caller once the CIL ctx has been formatted/submitted
and the commit completion event triggered at the tail end of
xlog_cil_push(). If the filesystem has shutdown, however, we rely on
xlog_cil_committed() to trigger the completion event and it happens
to do so after running log item unpin callbacks. This makes it
unsafe to invoke an async log force from contexts that hold locks
that might also be required in log completion processing.
To address this problem, wake commit completion waiters before
aborting log items in the log I/O completion handler. This ensures
that an async log force will not deadlock on held locks if the
filesystem happens to shutdown. Note that it is still unsafe to
issue a sync log force while holding such locks because a sync log
force explicitly waits on the force completion, which occurs after
log I/O completion processing.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The xfs_buf_log_item ->iop_unlock() callback asserts that the buffer
is unlocked when either non-stale or aborted. This assert occurs
after the bli refcount has been dropped and the log item potentially
freed. The aborted check is thus a potential use after free. This
problem has been reproduced with KASAN enabled via generic/475.
Fix up xfs_buf_item_unlock() to query aborted state before the bli
reference is dropped to prevent a potential use after free.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Merge page ref overflow branch.
Jann Horn reported that he can overflow the page ref count with
sufficient memory (and a filesystem that is intentionally extremely
slow).
Admittedly it's not exactly easy. To have more than four billion
references to a page requires a minimum of 32GB of kernel memory just
for the pointers to the pages, much less any metadata to keep track of
those pointers. Jann needed a total of 140GB of memory and a specially
crafted filesystem that leaves all reads pending (in order to not ever
free the page references and just keep adding more).
Still, we have a fairly straightforward way to limit the two obvious
user-controllable sources of page references: direct-IO like page
references gotten through get_user_pages(), and the splice pipe page
duplication. So let's just do that.
* branch page-refs:
fs: prevent page refcount overflow in pipe_buf_get
mm: prevent get_user_pages() from overflowing page refcount
mm: add 'try_get_page()' helper function
mm: make page ref count overflow check tighter and more explicit
Change pipe_buf_get() to return a bool indicating whether it succeeded
in raising the refcount of the page (if the thing in the pipe is a page).
This removes another mechanism for overflowing the page refcount. All
callers converted to handle a failure.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the page refcount wraps around past zero, it will be freed while
there are still four billion references to it. One of the possible
avenues for an attacker to try to make this happen is by doing direct IO
on a page multiple times. This patch makes get_user_pages() refuse to
take a new page reference if there are already more than two billion
references to the page.
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the same as the traditional 'get_page()' function, but instead
of unconditionally incrementing the reference count of the page, it only
does so if the count was "safe". It returns whether the reference count
was incremented (and is marked __must_check, since the caller obviously
has to be aware of it).
Also like 'get_page()', you can't use this function unless you already
had a reference to the page. The intent is that you can use this
exactly like get_page(), but in situations where you want to limit the
maximum reference count.
The code currently does an unconditional WARN_ON_ONCE() if we ever hit
the reference count issues (either zero or negative), as a notification
that the conditional non-increment actually happened.
NOTE! The count access for the "safety" check is inherently racy, but
that doesn't matter since the buffer we use is basically half the range
of the reference count (ie we look at the sign of the count).
Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have a VM_BUG_ON() to check that the page reference count doesn't
underflow (or get close to overflow) by checking the sign of the count.
That's all fine, but we actually want to allow people to use a "get page
ref unless it's already very high" helper function, and we want that one
to use the sign of the page ref (without triggering this VM_BUG_ON).
Change the VM_BUG_ON to only check for small underflows (or _very_ close
to overflowing), and ignore overflows which have strayed into negative
territory.
Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block fixes from Jens Axboe:
"Set of fixes that should go into this round. This pull is larger than
I'd like at this time, but there's really no specific reason for that.
Some are fixes for issues that went into this merge window, others are
not. Anyway, this contains:
- Hardware queue limiting for virtio-blk/scsi (Dongli)
- Multi-page bvec fixes for lightnvm pblk
- Multi-bio dio error fix (Jason)
- Remove the cache hint from the io_uring tool side, since we didn't
move forward with that (me)
- Make io_uring SETUP_SQPOLL root restricted (me)
- Fix leak of page in error handling for pc requests (Jérôme)
- Fix BFQ regression introduced in this merge window (Paolo)
- Fix break logic for bio segment iteration (Ming)
- Fix NVMe cancel request error handling (Ming)
- NVMe pull request with two fixes (Christoph):
- fix the initial CSN for nvme-fc (James)
- handle log page offsets properly in the target (Keith)"
* tag 'for-linus-20190412' of git://git.kernel.dk/linux-block:
block: fix the return errno for direct IO
nvmet: fix discover log page when offsets are used
nvme-fc: correct csn initialization and increments on error
block: do not leak memory in bio_copy_user_iov()
lightnvm: pblk: fix crash in pblk_end_partial_read due to multipage bvecs
nvme: cancel request synchronously
blk-mq: introduce blk_mq_complete_request_sync()
scsi: virtio_scsi: limit number of hw queues by nr_cpu_ids
virtio-blk: limit number of hw queues by nr_cpu_ids
block, bfq: fix use after free in bfq_bfqq_expire
io_uring: restrict IORING_SETUP_SQPOLL to root
tools/io_uring: remove IOCQE_FLAG_CACHEHIT
block: don't use for-inside-for in bio_for_each_segment_all
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fix:
- Fix a deadlock in close() due to incorrect draining of RDMA queues
Bugfixes:
- Revert "SUNRPC: Micro-optimise when the task is known not to be
sleeping" as it is causing stack overflows
- Fix a regression where NFSv4 getacl and fs_locations stopped
working
- Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
- Fix xfstests failures due to incorrect copy_file_range() return
values"
* tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
NFSv4.1 fix incorrect return value in copy_file_range
xprtrdma: Fix helper that drains the transport
NFS: Fix handling of reply page vector
NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
Pull SCSI fix from James Bottomley:
"One obvious fix for a ciostor data corruption on error bug"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: csiostor: fix missing data copy in csio_scsi_err_handler()
Pull clk fixes from Stephen Boyd:
"Here's more than a handful of clk driver fixes for changes that came
in during the merge window:
- Fix the AT91 sama5d2 programmable clk prescaler formula
- A bunch of Amlogic meson clk driver fixes for the VPU clks
- A DMI quirk for Intel's Bay Trail SoC's driver to properly mark pmc
clks as critical only when really needed
- Stop overwriting CLK_SET_RATE_PARENT flag in mediatek's clk gate
implementation
- Use the right structure to test for a frequency table in i.MX's
PLL_1416x driver"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: imx: Fix PLL_1416X not rounding rates
clk: mediatek: fix clk-gate flag setting
platform/x86: pmc_atom: Drop __initconst on dmi table
clk: x86: Add system specific quirk to mark clocks as critical
clk: meson: vid-pll-div: remove warning and return 0 on invalid config
clk: meson: pll: fix rounding and setting a rate that matches precisely
clk: meson-g12a: fix VPU clock parents
clk: meson: g12a: fix VPU clock muxes mask
clk: meson-gxbb: round the vdec dividers to closest
clk: at91: fix programmable clock for sama5d2
Pull PCI fixes from Bjorn Helgaas:
- Add a DMA alias quirk for another Marvell SATA device (Andre
Przywara)
- Fix a pciehp regression that broke safe removal of devices (Sergey
Miroshnichenko)
* tag 'pci-v5.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: pciehp: Ignore Link State Changes after powering off a slot
PCI: Add function 1 DMA alias quirk for Marvell 9170 SATA controller
Pull powerpc fixes from Michael Ellerman:
"A minor build fix for 64-bit FLATMEM configs.
A fix for a boot failure on 32-bit powermacs.
My commit to fix CLOCK_MONOTONIC across Y2038 broke the 32-bit VDSO on
64-bit kernels, ie. compat mode, which is only used on big endian.
The rewrite of the SLB code we merged in 4.20 missed the fact that the
0x380 exception is also used with the Radix MMU to report out of range
accesses. This could lead to an oops if userspace tried to read from
addresses outside the user or kernel range.
Thanks to: Aneesh Kumar K.V, Christophe Leroy, Larry Finger, Nicholas
Piggin"
* tag 'powerpc-5.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm: Define MAX_PHYSMEM_BITS for all 64-bit configs
powerpc/64s/radix: Fix radix segment exception handling
powerpc/vdso32: fix CLOCK_MONOTONIC on PPC64
powerpc/32: Fix early boot failure with RTAS built-in
Pull arm64 fixes from Will Deacon:
"The main thing is a fix to our FUTEX_WAKE_OP implementation which was
unbelievably broken, but did actually work for the one scenario that
GLIBC used to use.
Summary:
- Fix stack unwinding so we ignore user stacks
- Fix ftrace module PLT trampoline initialisation checks
- Fix terminally broken implementation of FUTEX_WAKE_OP atomics"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value
arm64: backtrace: Don't bother trying to unwind the userspace stack
arm64/ftrace: fix inadvertent BUG() in trampoline check
Pull x86 fixes from Ingo Molnar:
"Fix typos in user-visible resctrl parameters, and also fix assembly
constraint bugs that might result in miscompilation"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Use stricter assembly constraints in bitops
x86/resctrl: Fix typos in the mba_sc mount option
Pull timer fix from Ingo Molnar:
"Fix the alarm_timer_remaining() return value"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
alarmtimer: Return correct remaining time
Pull scheduler fix from Ingo Molnar:
"Fix a NULL pointer dereference crash in certain environments"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Do not re-read ->h_load_next during hierarchical load calculation
Pull perf fixes from Ingo Molnar:
"Six kernel side fixes: three related to NMI handling on AMD systems, a
race fix, a kexec initialization fix and a PEBS sampling fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix perf_event_disable_inatomic() race
x86/perf/amd: Remove need to check "running" bit in NMI handler
x86/perf/amd: Resolve NMI latency issues for active PMCs
x86/perf/amd: Resolve race condition when disabling PMC
perf/x86/intel: Initialize TFA MSR
perf/x86/intel: Fix handling of wakeup_events for multi-entry PEBS
Pull locking fix from Ingo Molnar:
"Fixes a crash when accessing /proc/lockdep"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Zap lock classes even with lock debugging disabled
Pull core fixes from Ingo Molnar:
"Fix an objtool warning plus fix a u64_to_user_ptr() macro expansion
bug"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Add rewind_stack_do_exit() to the noreturn list
linux/kernel.h: Use parentheses around argument in u64_to_user_ptr()
Code which initializes the "clk_init_data.ops" checks pll->rate_table
before that field is ever assigned to so it always picks
"clk_pll1416x_min_ops".
This breaks dynamic rate rounding for features such as cpufreq.
Fix by checking pll_clk->rate_table instead, here pll_clk refers to
the constant initialization data coming from per-soc clk driver.
Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Fixes: 8646d4dcc7 ("clk: imx: Add PLLs driver for imx8mm soc")
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Pull dma-mapping fixes from Christoph Hellwig:
"Fix a sparc64 sun4v_pci regression introduced in this merged window,
and a dma-debug stracktrace regression from the big refactor last
merge window"
* tag 'dma-mapping-5.1-1' of git://git.infradead.org/users/hch/dma-mapping:
dma-debug: only skip one stackframe entry
sparc64/pci_sun4v: fix ATU checks for large DMA masks