Commit Graph

535 Commits

Author SHA1 Message Date
Nick Piggin
6416ccb789 fs: scale files_lock
fs: scale files_lock

Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with an lglock. The lglock provides fast access to the per-cpu lists
to add and remove files. It also provides a snapshot of all the per-cpu lists
(although this is very slow).

One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on with a new
variale in the file struct (packed into a hole on 64-bit archs). Scalability
could suffer if files are frequently removed from different cpu's list.

However loads with frequent removal of files imply short interval between
adding and removing the files, and the scheduler attempts to avoid moving
processes too far away. Also, even in the case of cross-CPU removal, the
hardware has much more opportunity to parallelise cacheline transfers with N
cachelines than with 1.

A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
degenerates to contending on a single lock, which is no worse than before. When
more than one CPU are allocating files, even if they are always freed by
different CPUs, there will be more parallelism than the single-lock case.

Testing results:

On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
to remove the file, the number of times it is removed by the same CPU that
added it, and the number of times it is removed by the same node that added it.

Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)

So a file is removed from the same CPU it was added by over 90% of the time.
It remains within the same node 95% of the time.

Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.

                throughput
2.6.34-rc2      24.5
+patch          24.9

                us      sys     idle    IO wait (in %)
2.6.34-rc2      51.25   28.25   17.25   3.25
+patch          53.75   18.5    19      8.75

So significantly less CPU time spent in kernel code, higher idle time and
slightly higher throughput.

Single threaded performance difference was within the noise of microbenchmarks.
That is not to say penalty does not exist, the code is larger and more memory
accesses required so it will be slightly slower.

Cc: linux-kernel@vger.kernel.org
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18 08:35:48 -04:00
Nick Piggin
d996b62a8d tty: fix fu_list abuse
tty: fix fu_list abuse

tty code abuses fu_list, which causes a bug in remount,ro handling.

If a tty device node is opened on a filesystem, then the last link to the inode
removed, the filesystem will be allowed to be remounted readonly. This is
because fs_may_remount_ro does not find the 0 link tty inode on the file sb
list (because the tty code incorrectly removed it to use for its own purpose).
This can result in a filesystem with errors after it is marked "clean".

Taking idea from Christoph's initial patch, allocate a tty private struct
at file->private_data and put our required list fields in there, linking
file and tty. This makes tty nodes behave the same way as other device nodes
and avoid meddling with the vfs, and avoids this bug.

The error handling is not trivial in the tty code, so for this bugfix, I take
the simple approach of using __GFP_NOFAIL and don't worry about memory errors.
This is not a problem because our allocator doesn't fail small allocs as a rule
anyway. So proper error handling is left as an exercise for tty hackers.

[ Arguably filesystem's device inode would ideally be divorced from the
driver's pseudo inode when it is opened, but in practice it's not clear whether
that will ever be worth implementing. ]

Cc: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18 08:35:47 -04:00
Nick Piggin
ee2ffa0dfd fs: cleanup files_lock locking
fs: cleanup files_lock locking

Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
manipulate the per-sb files list; unexport the files_lock spinlock.

Cc: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18 08:35:47 -04:00
Christoph Hellwig
9cb569d601 remove SWRITE* I/O types
These flags aren't real I/O types, but tell ll_rw_block to always
lock the buffer instead of giving up on a failed trylock.

Instead add a new write_dirty_buffer helper that implements this semantic
and use it from the existing SWRITE* callers.  Note that the ll_rw_block
code had a bug where it didn't promote WRITE_SYNC_PLUG properly, which
this patch fixes.

In the ufs code clean up the helper that used to call ll_rw_block
to mirror sync_dirty_buffer, which is the function it implements for
compound buffers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18 01:09:01 -04:00
Linus Torvalds
10041d2d14 Merge branch 'bkl/ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing
* 'bkl/ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
  bkl: Remove locked .ioctl file operation
  v4l: Remove reference to bkl ioctl in compat ioctl handling
  logfs: kill BKL
2010-08-13 17:52:35 -07:00
David Howells
c788732523 Mark arguments to certain syscalls as being const
Mark arguments to certain system calls as being const where they should be but
aren't.  The list includes:

 (*) The filename arguments of various stat syscalls, execve(), various utimes
     syscalls and some mount syscalls.

 (*) The filename arguments of some syscall helpers relating to the above.

 (*) The buffer argument of various write syscalls.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-13 16:53:13 -07:00
Arnd Bergmann
b19dd42faf bkl: Remove locked .ioctl file operation
The last user is gone, so we can safely remove this

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: John Kacur <jkacur@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-08-14 00:24:24 +02:00
Adrian Hunter
8d57a98ccd block: add secure discard
Secure discard is the same as discard except that all copies of the
discarded sectors (perhaps created by garbage collection) must also be
erased.

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Madhusudhan Chikkature <madhu.cr@ti.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Gardiner <bengardiner@nanometrics.ca>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12 08:43:30 -07:00
Andrew Morton
13bcbc0087 include/linux/fs.h: complete hexification of FMODE_* constants
One straggler which was missed due to merge ordering issues.

Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-11 08:58:59 -07:00
Linus Torvalds
2f9e825d3e Merge branch 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
  block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
  xen-blkfront: fix missing out label
  blkdev: fix blkdev_issue_zeroout return value
  block: update request stacking methods to support discards
  block: fix missing export of blk_types.h
  writeback: fix bad _bh spinlock nesting
  drbd: revert "delay probes", feature is being re-implemented differently
  drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
  drbd: Disable delay probes for the upcomming release
  writeback: cleanup bdi_register
  writeback: add new tracepoints
  writeback: remove unnecessary init_timer call
  writeback: optimize periodic bdi thread wakeups
  writeback: prevent unnecessary bdi threads wakeups
  writeback: move bdi threads exiting logic to the forker thread
  writeback: restructure bdi forker loop a little
  writeback: move last_active to bdi
  writeback: do not remove bdi from bdi_list
  writeback: simplify bdi code a little
  writeback: do not lose wake-ups in bdi threads
  ...

Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
drivers/scsi/scsi_error.c as per Jens.
2010-08-10 15:22:42 -07:00
Linus Torvalds
8c8946f509 Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify
* 'for-linus' of git://git.infradead.org/users/eparis/notify: (132 commits)
  fanotify: use both marks when possible
  fsnotify: pass both the vfsmount mark and inode mark
  fsnotify: walk the inode and vfsmount lists simultaneously
  fsnotify: rework ignored mark flushing
  fsnotify: remove global fsnotify groups lists
  fsnotify: remove group->mask
  fsnotify: remove the global masks
  fsnotify: cleanup should_send_event
  fanotify: use the mark in handler functions
  audit: use the mark in handler functions
  dnotify: use the mark in handler functions
  inotify: use the mark in handler functions
  fsnotify: send fsnotify_mark to groups in event handling functions
  fsnotify: Exchange list heads instead of moving elements
  fsnotify: srcu to protect read side of inode and vfsmount locks
  fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called
  fsnotify: use _rcu functions for mark list traversal
  fsnotify: place marks on object in order of group memory address
  vfs/fsnotify: fsnotify_close can delay the final work in fput
  fsnotify: store struct file not struct path
  ...

Fix up trivial delete/modify conflict in fs/notify/inotify/inotify.c.
2010-08-10 11:39:13 -07:00
Linus Torvalds
5f248c9c25 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
  no need for list_for_each_entry_safe()/resetting with superblock list
  Fix sget() race with failing mount
  vfs: don't hold s_umount over close_bdev_exclusive() call
  sysv: do not mark superblock dirty on remount
  sysv: do not mark superblock dirty on mount
  btrfs: remove junk sb_dirt change
  BFS: clean up the superblock usage
  AFFS: wait for sb synchronization when needed
  AFFS: clean up dirty flag usage
  cifs: truncate fallout
  mbcache: fix shrinker function return value
  mbcache: Remove unused features
  add f_flags to struct statfs(64)
  pass a struct path to vfs_statfs
  update VFS documentation for method changes.
  All filesystems that need invalidate_inode_buffers() are doing that explicitly
  convert remaining ->clear_inode() to ->evict_inode()
  Make ->drop_inode() just return whether inode needs to be dropped
  fs/inode.c:clear_inode() is gone
  fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
  ...

Fix up trivial conflicts in fs/nilfs2/super.c
2010-08-10 11:26:52 -07:00
Jan Kara
f446daaea9 mm: implement writeback livelock avoidance using page tagging
We try to avoid livelocks of writeback when some steadily creates dirty
pages in a mapping we are writing out.  For memory-cleaning writeback,
using nr_to_write works reasonably well but we cannot really use it for
data integrity writeback.  This patch tries to solve the problem.

The idea is simple: Tag all pages that should be written back with a
special tag (TOWRITE) in the radix tree.  This can be done rather quickly
and thus livelocks should not happen in practice.  Then we start doing the
hard work of locking pages and sending them to disk only for those pages
that have TOWRITE tag set.

Note: Adding new radix tree tag grows radix tree node from 288 to 296
bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
However, the number of slab/slub items per page remains the same (13 and 7
respectively).

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09 20:44:59 -07:00
Al Viro
7a4dec5389 Fix sget() race with failing mount
If sget() finds a matching superblock being set up, it'll
grab an active reference to it and grab s_umount.  That's
fine - we'll wait for completion of foofs_get_sb() that way.
However, if said foofs_get_sb() fails we'll end up holding
the halfway-created superblock.  deactivate_locked_super()
called by foofs_get_sb() will just unlock the sucker since
we are holding another active reference to it.

What we need is a way to tell if superblock has been successfully
set up.  Unfortunately, neither ->s_root nor the check for
MS_ACTIVE quite fit.  Cheap and easy way, suitable for backport:
new flag set by the (only) caller of ->get_sb().  If that flag
isn't present by the time sget() grabbed s_umount on preexisting
superblock it has found, it's seeing a stillborn and should
just bury it with deactivate_locked_super() (and repeat the search).

Longer term we want to set that flag in ->get_sb() instances (and
check for it to distinguish between "sget() found us a live sb"
and "sget() has allocated an sb, we need to set it up" in there,
instead of checking ->s_root as we do now).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
2010-08-09 16:49:01 -04:00
Christoph Hellwig
ebabe9a900 pass a struct path to vfs_statfs
We'll need the path to implement the flags field for statvfs support.
We do have it available in all callers except:

 - ecryptfs_statfs.  This one doesn't actually need vfs_statfs but just
   needs to do a caller to the lower filesystem statfs method.
 - sys_ustat.  Add a non-exported statfs_by_dentry helper for it which
   doesn't won't be able to fill out the flags field later on.

In addition rename the helpers for statfs vs fstatfs to do_*statfs instead
of the misleading vfs prefix.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:42 -04:00
Al Viro
b57922d97f convert remaining ->clear_inode() to ->evict_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:37 -04:00
Al Viro
45321ac543 Make ->drop_inode() just return whether inode needs to be dropped
... and let iput_final() do the actual eviction or retention

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:35 -04:00
Al Viro
30140837f2 fs/inode.c:clear_inode() is gone
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:34 -04:00
Al Viro
07958f9f5b ->delete_inode() is gone
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:31 -04:00
Al Viro
b0683aa638 new helper: end_writeback()
Essentially, the minimal variant of ->evict_inode().  It's
a trimmed-down clear_inode(), sans any fs callbacks.  Once
it returns we know that no async writeback will be happening;
every ->evict_inode() instance should do that once and do that
before doing anything ->write_inode() could interfere with
(e.g. freeing the on-disk inode).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:49 -04:00
Al Viro
c6287315cb generic_detach_inode() can be static now
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:48 -04:00
Al Viro
be7ce4161f New method - evict_inode()
Hybrid of ->clear_inode() and ->delete_inode(); if present, does
all fs work to be done when in-core inode is about to be gone,
for whatever reason.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:46 -04:00
Al Viro
a4ffdde6e5 simplify checks for I_CLEAR/I_FREEING
add I_CLEAR instead of replacing I_FREEING with it.  I_CLEAR is
equivalent to I_FREEING for almost all code looking at either;
it's there to keep track of having called clear_inode() exactly
once per inode lifetime, at some point after having set I_FREEING.
I_CLEAR and I_FREEING never get set at the same time with the
current code, so we can switch to setting i_flags to I_FREEING | I_CLEAR
instead of I_CLEAR without loss of information.  As the result of
such change, checks become simpler and the amount of code that needs
to know about I_CLEAR shrinks a lot.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:44 -04:00
Christoph Hellwig
2c27c65ed0 check ATTR_SIZE contraints in inode_change_ok
Make sure we check the truncate constraints early on in ->setattr by adding
those checks to inode_change_ok.  Also clean up and document inode_change_ok
to make this obvious.

As a fallout we don't have to call inode_newsize_ok from simple_setsize and
simplify it down to a truncate_setsize which doesn't return an error.  This
simplifies a lot of setattr implementations and means we use truncate_setsize
almost everywhere.  Get rid of fat_setsize now that it's trivial and mark
ext2_setsize static to make the calling convention obvious.

Keep the inode_newsize_ok in vmtruncate for now as all callers need an
audit for its removal anyway.

Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
needs a deeper audit, but that is left for later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:39 -04:00
Christoph Hellwig
1025774ce4 remove inode_setattr
Replace inode_setattr with opencoded variants of it in all callers.  This
moves the remaining call to vmtruncate into the filesystem methods where it
can be replaced with the proper truncate sequence.

In a few cases it was obvious that we would never end up calling vmtruncate
so it was left out in the opencoded variant:

 spufs: explicitly checks for ATTR_SIZE earlier
 btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
 ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above

In addition to that ncpfs called inode_setattr with handcrafted iattrs,
which allowed to trim down the opencoded variant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:37 -04:00