Pull vfs updates from Al Viro:
"The big thing in this pile is Eric's unmount-on-rmdir series; we
finally have everything we need for that. The final piece of prereqs
is delayed mntput() - now filesystem shutdown always happens on
shallow stack.
Other than that, we have several new primitives for iov_iter (Matt
Wilcox, culled from his XIP-related series) pushing the conversion to
->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
cleanups and fixes (including the external name refcounting, which
gives consistent behaviour of d_move() wrt procfs symlinks for long
and short names alike) and assorted cleanups and fixes all over the
place.
This is just the first pile; there's a lot of stuff from various
people that ought to go in this window. Starting with
unionmount/overlayfs mess... ;-/"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
fs/file_table.c: Update alloc_file() comment
vfs: Deduplicate code shared by xattr system calls operating on paths
reiserfs: remove pointless forward declaration of struct nameidata
don't need that forward declaration of struct nameidata in dcache.h anymore
take dname_external() into fs/dcache.c
let path_init() failures treated the same way as subsequent link_path_walk()
fix misuses of f_count() in ppp and netlink
ncpfs: use list_for_each_entry() for d_subdirs walk
vfs: move getname() from callers to do_mount()
gfs2_atomic_open(): skip lookups on hashed dentry
[infiniband] remove pointless assignments
gadgetfs: saner API for gadgetfs_create_file()
f_fs: saner API for ffs_sb_create_file()
jfs: don't hash direct inode
[s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
ecryptfs: ->f_op is never NULL
android: ->f_op is never NULL
nouveau: __iomem misannotations
missing annotation in fs/file.c
fs: namespace: suppress 'may be used uninitialized' warnings
...
total_objects could be 0 and is used as a denom.
While total_objects is a "long", total_objects == 0 unlikely happens for
3.12 and later kernels because 32-bit architectures would not be able to
hold (1 << 32) objects. However, total_objects == 0 may happen for kernels
between 3.1 and 3.11 because total_objects in prune_super() was an "int"
and (e.g.) x86_64 architecture might be able to hold (1 << 32) objects.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable <stable@kernel.org> # 3.1+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Percpu allocator now supports allocation mask. Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.
We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion. This is the one with
the most users. Other percpu data structures are a lot easier to
convert.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: x86@kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Pull quota, reiserfs, UDF updates from Jan Kara:
"Scalability improvements for quota, a few reiserfs fixes, and couple
of misc cleanups (udf, ext2)"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: Fix use after free in journal teardown
reiserfs: fix corruption introduced by balance_leaf refactor
udf: avoid redundant memcpy when writing data in ICB
fs/udf: re-use hex_asc_upper_{hi,lo} macros
fs/quota: kernel-doc warning fixes
udf: use linux/uaccess.h
fs/ext2/super.c: Drop memory allocation cast
quota: remove dqptr_sem
quota: simplify remove_inode_dquot_ref()
quota: avoid unnecessary dqget()/dqput() calls
quota: protect Q_GETFMT by dqonoff_mutex
These externs belong in fs/internal.h. Rename (they are not acct-specific
anymore) and move them over there.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
just repeat the frozen check after regaining it, and check that sb
is still alive. If several threads hit acct_auto_close() at the
same time, acct_auto_close() will survive that just fine. And we
really don't want to play with writes and closing the file with
->s_umount held exclusive - it's a deadlock country.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Put these suckers on per-vfsmount and per-superblock lists instead.
Note: right now it's still acct_lock for everything, but that's
going to change.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove dqptr_sem to make quota code scalable: Remove the dqptr_sem,
accessing inode->i_dquot now protected by dquot_srcu, and changing
inode->i_dquot is now serialized by dq_data_lock.
Signed-off-by: Lai Siyao <lai.siyao@intel.com>
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We remove the call to grab_super_passive in call to super_cache_count.
This becomes a scalability bottleneck as multiple threads are trying to do
memory reclamation, e.g. when we are doing large amount of file read and
page cache is under pressure. The cached objects quickly got reclaimed
down to 0 and we are aborting the cache_scan() reclaim. But counting
creates a log jam acquiring the sb_lock.
We are holding the shrinker_rwsem which ensures the safety of call to
list_lru_count_node() and s_op->nr_cached_objects. The shrinker is
unregistered now before ->kill_sb() so the operation is safe when we are
doing unmount.
The impact will depend heavily on the machine and the workload but for a
small machine using postmark tuned to use 4xRAM size the results were
3.15.0-rc5 3.15.0-rc5
vanilla shrinker-v1r1
Ops/sec Transactions 21.00 ( 0.00%) 24.00 ( 14.29%)
Ops/sec FilesCreate 39.00 ( 0.00%) 44.00 ( 12.82%)
Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%)
Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%)
Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%)
Ops/sec DataRead/MB 25.97 ( 0.00%) 29.10 ( 12.05%)
Ops/sec DataWrite/MB 49.99 ( 0.00%) 56.02 ( 12.06%)
ffsb running in a configuration that is meant to simulate a mail server showed
3.15.0-rc5 3.15.0-rc5
vanilla shrinker-v1r1
Ops/sec readall 9402.63 ( 0.00%) 9567.97 ( 1.76%)
Ops/sec create 4695.45 ( 0.00%) 4735.00 ( 0.84%)
Ops/sec delete 173.72 ( 0.00%) 179.83 ( 3.52%)
Ops/sec Transactions 14271.80 ( 0.00%) 14482.81 ( 1.48%)
Ops/sec Read 37.00 ( 0.00%) 37.60 ( 1.62%)
Ops/sec Write 18.20 ( 0.00%) 18.30 ( 0.55%)
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series is aimed at regressions noticed during reclaim activity. The
first two patches are shrinker patches that were posted ages ago but never
merged for reasons that are unclear to me. I'm posting them again to see
if there was a reason they were dropped or if they just got lost. Dave?
Time? The last patch adjusts proportional reclaim. Yuanhan Liu, can you
retest the vm scalability test cases on a larger machine? Hugh, does this
work for you on the memcg test cases?
Based on ext4, I get the following results but unfortunately my larger
test machines are all unavailable so this is based on a relatively small
machine.
postmark
3.15.0-rc5 3.15.0-rc5
vanilla proportion-v1r4
Ops/sec Transactions 21.00 ( 0.00%) 25.00 ( 19.05%)
Ops/sec FilesCreate 39.00 ( 0.00%) 45.00 ( 15.38%)
Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%)
Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%)
Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%)
Ops/sec DataRead/MB 25.97 ( 0.00%) 30.02 ( 15.59%)
Ops/sec DataWrite/MB 49.99 ( 0.00%) 57.78 ( 15.58%)
ffsb (mail server simulator)
3.15.0-rc5 3.15.0-rc5
vanilla proportion-v1r4
Ops/sec readall 9402.63 ( 0.00%) 9805.74 ( 4.29%)
Ops/sec create 4695.45 ( 0.00%) 4781.39 ( 1.83%)
Ops/sec delete 173.72 ( 0.00%) 177.23 ( 2.02%)
Ops/sec Transactions 14271.80 ( 0.00%) 14764.37 ( 3.45%)
Ops/sec Read 37.00 ( 0.00%) 38.50 ( 4.05%)
Ops/sec Write 18.20 ( 0.00%) 18.50 ( 1.65%)
dd of a large file
3.15.0-rc5 3.15.0-rc5
vanilla proportion-v1r4
WallTime DownloadTar 75.00 ( 0.00%) 61.00 ( 18.67%)
WallTime DD 423.00 ( 0.00%) 401.00 ( 5.20%)
WallTime Delete 2.00 ( 0.00%) 5.00 (-150.00%)
stutter (times mmap latency during large amounts of IO)
3.15.0-rc5 3.15.0-rc5
vanilla proportion-v1r4
Unit >5ms Delays 80252.0000 ( 0.00%) 81523.0000 ( -1.58%)
Unit Mmap min 8.2118 ( 0.00%) 8.3206 ( -1.33%)
Unit Mmap mean 17.4614 ( 0.00%) 17.2868 ( 1.00%)
Unit Mmap stddev 24.9059 ( 0.00%) 34.6771 (-39.23%)
Unit Mmap max 2811.6433 ( 0.00%) 2645.1398 ( 5.92%)
Unit Mmap 90% 20.5098 ( 0.00%) 18.3105 ( 10.72%)
Unit Mmap 93% 22.9180 ( 0.00%) 20.1751 ( 11.97%)
Unit Mmap 95% 25.2114 ( 0.00%) 22.4988 ( 10.76%)
Unit Mmap 99% 46.1430 ( 0.00%) 43.5952 ( 5.52%)
Unit Ideal Tput 85.2623 ( 0.00%) 78.8906 ( 7.47%)
Unit Tput min 44.0666 ( 0.00%) 43.9609 ( 0.24%)
Unit Tput mean 45.5646 ( 0.00%) 45.2009 ( 0.80%)
Unit Tput stddev 0.9318 ( 0.00%) 1.1084 (-18.95%)
Unit Tput max 46.7375 ( 0.00%) 46.7539 ( -0.04%)
This patch (of 3):
We will like to unregister the sb shrinker before ->kill_sb(). This will
allow cached objects to be counted without call to grab_super_passive() to
update ref count on sb. We want to avoid locking during memory
reclamation especially when we are skipping the memory reclaim when we are
out of cached objects.
This is safe because grab_super_passive does a try-lock on the
sb->s_umount now, and so if we are in the unmount process, it won't ever
block. That means what used to be a deadlock and races we were avoiding
by using grab_super_passive() is now:
shrinker umount
down_read(shrinker_rwsem)
down_write(sb->s_umount)
shrinker_unregister
down_write(shrinker_rwsem)
<blocks>
grab_super_passive(sb)
down_read_trylock(sb->s_umount)
<fails>
<shrinker aborts>
....
<shrinkers finish running>
up_read(shrinker_rwsem)
<unblocks>
<removes shrinker>
up_write(shrinker_rwsem)
->kill_sb()
....
So it is safe to deregister the shrinker before ->kill_sb().
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move sync_filesystem() after sb_prepare_remount_readonly(). If writers
sneak in anywhere from sync_filesystem() to sb_prepare_remount_readonly()
it can cause inodes to be dirtied and writeback to occur well after
sys_mount() has completely successfully.
This was spotted by corrupted ubifs filesystems on reboot, but appears
that it can cause issues with any filesystem using writeback.
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Richard Weinberger <richard@nod.at>
Co-authored-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Ruder <andrew.ruder@elecsyscorp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On fail path alloc_super() calls destroy_super(), which issues a warning
if the sb's s_mounts list is not empty, in particular if it has not been
initialized. That said s_mounts must be initialized in alloc_super()
before any possible failure, but currently it is initialized close to
the end of the function leading to a useless warning dumped to log if
either percpu_counter_init() or list_lru_init() fails. Let's fix this.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The only thing we need it for is alt-sysrq-r (emergency remount r/o)
and these days we can do just as well without going through the
list of files.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Freeing ->s_{inode,dentry}_lru in deactivate_locked_super() is wrong;
the right place is destroy_super(). As it is, we leak them if sget()
decides that new superblock it has allocated (and never shown to
anybody) isn't needed and should be freed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Convert superblock shrinker to use the new count/scan API, and propagate
the API changes through to the filesystem callouts. The filesystem
callouts already use a count/scan API, so it's just changing counters to
longs to match the VM API.
This requires the dentry and inode shrinker callouts to be converted to
the count/scan API. This is mainly a mechanical change.
[glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With the dentry LRUs being per-sb structures, there is no real need for
a global dentry_lru_lock. The locking can be made more fine-grained by
moving to a per-sb LRU lock, isolating the LRU operations of different
filesytsems completely from each other. The need for this is independent
of any performance consideration that may arise: in the interest of
abstracting the lru operations away, it is mandatory that each lru works
around its own lock instead of a global lock for all of them.
[glommer@openvz.org: updated changelog ]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>