Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- fix an NFSv4.1 state renewal regression
- fix open/lock state recovery error handling
- fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
- fix statd when reconnection fails
- don't wake tasks during connection abort
- don't start reboot recovery if lease check fails
- fix duplicate proc entries
Features:
- pNFS block driver fixes and clean ups from Christoph
- More code cleanups from Anna
- Improve mmap() writeback performance
- Replace use of PF_TRANS with a more generic mechanism for avoiding
deadlocks in nfs_release_page"
* tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (66 commits)
NFSv4.1: Fix an NFSv4.1 state renewal regression
NFSv4: fix open/lock state recovery error handling
NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
NFS: Fabricate fscache server index key correctly
SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT
NFSv3: Fix missing includes of nfs3_fs.h
NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
NFS: avoid waiting at all in nfs_release_page when congested.
NFS: avoid deadlocks with loop-back mounted NFS filesystems.
MM: export page_wakeup functions
SCHED: add some "wait..on_bit...timeout()" interfaces.
NFS: don't use STABLE writes during writeback.
NFSv4: use exponential retry on NFS4ERR_DELAY for async requests.
rpc: Add -EPERM processing for xs_udp_send_request()
rpc: return sent and err from xs_sendpages()
lockd: Try to reconnect if statd has moved
SUNRPC: Don't wake tasks during connection abort
Fixing lease renewal
nfs: fix duplicate proc entries
pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripe
...
Pull KVM updates from Paolo Bonzini:
"Fixes and features for 3.18.
Apart from the usual cleanups, here is the summary of new features:
- s390 moves closer towards host large page support
- PowerPC has improved support for debugging (both inside the guest
and via gdbstub) and support for e6500 processors
- ARM/ARM64 support read-only memory (which is necessary to put
firmware in emulated NOR flash)
- x86 has the usual emulator fixes and nested virtualization
improvements (including improved Windows support on Intel and
Jailhouse hypervisor support on AMD), adaptive PLE which helps
overcommitting of huge guests. Also included are some patches that
make KVM more friendly to memory hot-unplug, and fixes for rare
caching bugs.
Two patches have trivial mm/ parts that were acked by Rik and Andrew.
Note: I will soon switch to a subkey for signing purposes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (157 commits)
kvm: do not handle APIC access page if in-kernel irqchip is not in use
KVM: s390: count vcpu wakeups in stat.halt_wakeup
KVM: s390/facilities: allow TOD-CLOCK steering facility bit
KVM: PPC: BOOK3S: HV: CMA: Reserve cma region only in hypervisor mode
arm/arm64: KVM: Report correct FSC for unsupported fault types
arm/arm64: KVM: Fix VTTBR_BADDR_MASK and pgd alloc
kvm: Fix kvm_get_page_retry_io __gup retval check
arm/arm64: KVM: Fix set_clear_sgi_pend_reg offset
kvm: x86: Unpin and remove kvm_arch->apic_access_page
kvm: vmx: Implement set_apic_access_page_addr
kvm: x86: Add request bit to reload APIC access page address
kvm: Add arch specific mmu notifier for page invalidation
kvm: Rename make_all_cpus_request() to kvm_make_all_cpus_request() and make it non-static
kvm: Fix page ageing bugs
kvm/x86/mmu: Pass gfn and level to rmapp callback.
x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only
kvm: x86: use macros to compute bank MSRs
KVM: x86: Remove debug assertion of non-PAE reserved bits
kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
kvm: Faults which trigger IO release the mmap_sem
...
Pull "tinification" patches from Josh Triplett.
Work on making smaller kernels.
* tag 'tiny/for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/josh/linux:
bloat-o-meter: Ignore syscall aliases SyS_ and compat_SyS_
mm: Support compiling out madvise and fadvise
x86: Support compiling out human-friendly processor feature names
x86: Drop support for /proc files when !CONFIG_PROC_FS
x86, boot: Don't compile early_serial_console.c when !CONFIG_EARLY_PRINTK
x86, boot: Don't compile aslr.c when !CONFIG_RANDOMIZE_BASE
x86, boot: Use the usual -y -n mechanism for objects in vmlinux
x86: Add "make tinyconfig" to configure the tiniest possible kernel
x86, platform, kconfig: move kvmconfig functionality to a helper
Merge fixes from Andrew Morton:
"5 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: page_alloc: fix zone allocation fairness on UP
perf: fix perf bug in fork()
MAINTAINERS: change git URL for mpc5xxx tree
mm: memcontrol: do not iterate uninitialized memcgs
ocfs2/dlm: should put mle when goto kill in dlm_assert_master_handler
The zone allocation batches can easily underflow due to higher-order
allocations or spills to remote nodes. On SMP that's fine, because
underflows are expected from concurrency and dealt with by returning 0.
But on UP, zone_page_state will just return a wrapped unsigned long,
which will get past the <= 0 check and then consider the zone eligible
until its watermarks are hit.
Commit 3a025760fc ("mm: page_alloc: spill to remote nodes before
waking kswapd") already made the counter-resetting use
atomic_long_read() to accomodate underflows from remote spills, but it
didn't go all the way with it.
Make it clear that these batches are expected to go negative regardless
of concurrency, and use atomic_long_read() everywhere.
Fixes: 81c0a2bb51 ("mm: page_alloc: fair zone allocator policy")
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Leon Romanovsky <leon@leon.nu>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cgroup iterators yield css objects that have not yet gone through
css_online(), but they are not complete memcgs at this point and so the
memcg iterators should not return them. Commit d8ad305597 ("mm/memcg:
iteration skip memcgs not yet fully initialized") set out to implement
exactly this, but it uses CSS_ONLINE, a cgroup-internal flag that does
not meet the ordering requirements for memcg, and so the iterator may
skip over initialized groups, or return partially initialized memcgs.
The cgroup core can not reasonably provide a clear answer on whether the
object around the css has been fully initialized, as that depends on
controller-specific locking and lifetime rules. Thus, introduce a
memcg-specific flag that is set after the memcg has been initialized in
css_online(), and read before mem_cgroup_iter() callers access the memcg
members.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
NUMA type from the pmd to the pte"). If a huge page is being split due
a protection change and the tail will be in a PROT_NONE vma then NUMA
hinting PTEs are temporarily created in the protected VMA.
VM_RW|VM_PROTNONE
|-----------------|
^
split here
In the specific case above, it should get fixed up by change_pte_range()
but there is a window of opportunity for weirdness to happen. Similarly,
if a huge page is shrunk and split during a protection update but before
pmd_numa is cleared then a pte_numa can be left behind.
Instead of adding complexity trying to deal with the case, this patch
will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
will not be triggered which is marginal in comparison to the complexity
in dealing with the corner cases during THP split.
Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A migration entry is marked as write if pte_write was true at the time the
entry was created. The VMA protections are not double checked when migration
entries are being removed as mprotect marks write-migration-entries as
read. It means that potentially we take a spurious fault to mark PTEs write
again but it's straight-forward. However, there is a race between write
migrations being marked read and migrations finishing. This potentially
allows a PTE to be write that should have been read. Close this race by
double checking the VMA permissions using maybe_mkwrite when migration
completes.
[torvalds@linux-foundation.org: use maybe_mkwrite]
Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs fixes from Al Viro:
"Assorted fixes + unifying __d_move() and __d_materialise_dentry() +
minimal regression fix for d_path() of victims of overwriting rename()
ported on top of that"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: Don't exchange "short" filenames unconditionally.
fold swapping ->d_name.hash into switch_names()
fold unlocking the children into dentry_unlock_parents_for_move()
kill __d_materialise_dentry()
__d_materialise_dentry(): flip the order of arguments
__d_move(): fold manipulations with ->d_child/->d_subdirs
don't open-code d_rehash() in d_materialise_unique()
pull rehashing and unlocking the target dentry into __d_materialise_dentry()
ufs: deal with nfsd/iget races
fuse: honour max_read and max_write in direct_io mode
shmem: fix nlink for rename overwrite directory
Pull cgroup fixes from Tejun Heo:
"This is quite late but these need to be backported anyway.
This is the fix for a long-standing cpuset bug which existed from
2009. cpuset makes use of PF_SPREAD_{PAGE|SLAB} flags to modify the
task's memory allocation behavior according to the settings of the
cpuset it belongs to; unfortunately, when those flags have to be
changed, cpuset did so directly even whlie the target task is running,
which is obviously racy as task->flags may be modified by the task
itself at any time. This obscure bug manifested as corrupt
PF_USED_MATH flag leading to a weird crash.
The bug is fixed by moving the flag to task->atomic_flags. The first
two are prepatory ones to help defining atomic_flags accessors and the
third one is the actual fix"
* 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags
sched: add macros to define bitops for task atomic flags
sched: fix confusing PFA_NO_NEW_PRIVS constant
The third argument of fuse_get_user_pages() "nbytesp" refers to the number of
bytes a caller asked to pack into fuse request. This value may be lesser
than capacity of fuse request or iov_iter. So fuse_get_user_pages() must
ensure that *nbytesp won't grow.
Now, when helper iov_iter_get_pages() performs all hard work of extracting
pages from iov_iter, it can be done by passing properly calculated
"maxsize" to the helper.
The other caller of iov_iter_get_pages() (dio_refill_pages()) doesn't need
this capability, so pass LONG_MAX as the maxsize argument here.
Fixes: c9c37e2e63 ("fuse: switch to iov_iter_get_pages()")
Reported-by: Werner Baumann <werner.baumann@onlinehome.de>
Tested-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If overwriting an empty directory with rename, then need to drop the extra
nlink.
Test prog:
#include <stdio.h>
#include <fcntl.h>
#include <err.h>
#include <sys/stat.h>
int main(void)
{
const char *test_dir1 = "test-dir1";
const char *test_dir2 = "test-dir2";
int res;
int fd;
struct stat statbuf;
res = mkdir(test_dir1, 0777);
if (res == -1)
err(1, "mkdir(\"%s\")", test_dir1);
res = mkdir(test_dir2, 0777);
if (res == -1)
err(1, "mkdir(\"%s\")", test_dir2);
fd = open(test_dir2, O_RDONLY);
if (fd == -1)
err(1, "open(\"%s\")", test_dir2);
res = rename(test_dir1, test_dir2);
if (res == -1)
err(1, "rename(\"%s\", \"%s\")", test_dir1, test_dir2);
res = fstat(fd, &statbuf);
if (res == -1)
err(1, "fstat(%i)", fd);
if (statbuf.st_nlink != 0) {
fprintf(stderr, "nlink is %lu, should be 0\n", statbuf.st_nlink);
return 1;
}
return 0;
}
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This fixes the same bug as b43790eedd ("mm: softdirty: don't forget to
save file map softdiry bit on unmap") and 9aed8614af ("mm/memory.c:
don't forget to set softdirty on file mapped fault") where the return
value of pte_*mksoft_dirty was being ignored.
To be sure that no other pte/pmd "mk" function return values were being
ignored, I annotated the functions in arch/x86/include/asm/pgtable.h
with __must_check and rebuilt.
The userspace effect of this bug is that the softdirty mark might be
lost if a file mapped pte get zapped.
Signed-off-by: Peter Feiner <pfeiner@google.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit c1221321b7
sched: Allow wait_on_bit_action() functions to support a timeout
I suggested that a "wait_on_bit_timeout()" interface would not meet my
need. This isn't true - I was just over-engineering.
Including a 'private' field in wait_bit_key instead of a focused
"timeout" field was just premature generalization. If some other
use is ever found, it can be generalized or added later.
So this patch renames "private" to "timeout" with a meaning "stop
waiting when "jiffies" reaches or passes "timeout",
and adds two of the many possible wait..bit..timeout() interfaces:
wait_on_page_bit_killable_timeout(), which is the one I want to use,
and out_of_line_wait_on_bit_timeout() which is a reasonably general
example. Others can be added as needed.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When we change cpuset.memory_spread_{page,slab}, cpuset will flip
PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
This should be done using atomic bitops, but currently we don't,
which is broken.
Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
when one thread tried to clear PF_USED_MATH while at the same time another
thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
the same task.
Here's the full report:
https://lkml.org/lkml/2014/9/19/230
To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
v4:
- updated mm/slab.c. (Fengguang Wu)
- updated Documentation.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Kees Cook <keescook@chromium.org>
Fixes: 950592f7b9 ("cpusets: update tasks' page/slab spread flags in time")
Cc: <stable@vger.kernel.org> # 2.6.31+
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
1. We were calling clear_flush_young_notify in unmap_one, but we are
within an mmu notifier invalidate range scope. The spte exists no more
(due to range_start) and the accessed bit info has already been
propagated (due to kvm_pfn_set_accessed). Simply call
clear_flush_young.
2. We clear_flush_young on a primary MMU PMD, but this may be mapped
as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
This required expanding the interface of the clear_flush_young mmu
notifier, so a lot of code has been trivially touched.
3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
the access bit by blowing the spte. This requires proper synchronizing
with MMU notifier consumers, like every other removal of spte's does.
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
has been swapped out or is behind a filemap, this will trigger async
readahead and return immediately. The rationale is that KVM will kick
back the guest with an "async page fault" and allow for some other
guest process to take over.
If async PFs are enabled the fault is retried asap from an async
workqueue. If not, it's retried immediately in the same code path. In
either case the retry will not relinquish the mmap semaphore and will
block on the IO. This is a bad thing, as other mmap semaphore users
now stall as a function of swap or filemap latency.
This patch ensures both the regular and async PF path re-enter the
fault allowing for the mmap semaphore to be relinquished in the case
of IO wait.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull KVM fixes from Paolo Bonzini:
"Two very simple bugfixes, affecting all supported architectures"
[ Two? There's three commits in here. Oh well, I guess Paolo didn't
count the preparatory symbol export ]
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: correct null pid check in kvm_vcpu_yield_to()
KVM: check for !is_zero_pfn() in kvm_is_mmio_pfn()
mm: export symbol dependencies of is_zero_pfn()
dma_pool_create() needs to unlock the mutex in error case. The bug was
introduced in the 3.16 by commit cc6b664aa2 ("mm/dmapool.c: remove
redundant NULL check for dev in dma_pool_create()")/
Signed-off-by: Krzysztof Hałasa <khc@piap.pl>
Cc: stable@vger.kernel.org # v3.16
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to make the static inline function is_zero_pfn() callable by
modules, export its symbol dependencies 'zero_pfn' and (for s390 and
mips) 'zero_page_mask'.
We need this for KVM, as CONFIG_KVM is a tristate for all supported
architectures except ARM and arm64, and testing a pfn whether it refers
to the zero page is required to correctly distinguish the zero page
from other special RAM ranges that may also have the PG_reserved bit
set, but need to be treated as MMIO memory.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make sure we actually see the output of validate_mm() and browse_rb()
before triggering a BUG(). pr_info isn't shown by default so the reason
for the BUG() isn't obvious.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let memblock skip the hotpluggable memory regions in __next_mem_range(),
it is used to to prevent memblock from allocating hotpluggable memory
for the kernel at early time. The code is the same as __next_mem_range_rev().
Clear hotpluggable flag before releasing free pages to the buddy
allocator. If we don't clear hotpluggable flag in
free_low_memory_core_early(), the memory which marked hotpluggable flag
will not free to buddy allocator. Because __next_mem_range() will skip
them.
free_low_memory_core_early
for_each_free_mem_range
for_each_mem_range
__next_mem_range
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>