This area of the code has always been a bit delicate due to the
subtleties of lock ordering. The problem is that for "normal"
alloc/dealloc, we always grab the inode locks first and the rgrp lock
later.
In order to ensure no races in looking up the unlinked, but still
allocated inodes, we need to hold the rgrp lock when we do the lookup,
which means that we can't take the inode glock.
The solution is to borrow the technique already used by NFS to solve
what is essentially the same problem (given an inode number, look up
the inode carefully, checking that it really is in the expected
state).
We cannot do that directly from the allocation code (lock ordering
again) so we give the job to the pre-existing delete workqueue and
carry on with the allocation as normal.
If we find there is no space, we do a journal flush (required anyway
if space from a deallocation is to be released) which should block
against the pending deallocations, so we should always get the space
back.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Commit 83fd9c7 changes l_level, l_requested and l_blocking of
ocfs2_lock_res from int to unsigned char. But actually it is
initially as -1(ocfs2_lock_res_init_common) which
correspoding to 255 for unsigned char. So the whole dlm lock
mechanism doesn't work now which means a disaster to ocfs2.
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (27 commits)
block: remove unused copy_io_context()
Documentation: remove anticipatory scheduler info
block: remove REQ_HARDBARRIER
ioprio: rcu_read_lock/unlock protect find_task_by_vpid call (V2)
ioprio: fix RCU locking around task dereference
block: ioctl: fix information leak to userland
block: read i_size with i_size_read()
cciss: fix proc warning on attempt to remove non-existant directory
bio: take care not overflow page count when mapping/copying user data
block: limit vec count in bio_kmalloc() and bio_alloc_map_data()
block: take care not to overflow when calculating total iov length
block: check for proper length of iov entries in blk_rq_map_user_iov()
cciss: remove controllers supported by hpsa
cciss: use usleep_range not msleep for small sleeps
cciss: limit commands allocated on reset_devices
cciss: Use kernel provided PCI state save and restore functions
cciss: fix board status waiting code
drbd: Removed checks for REQ_HARDBARRIER on incomming BIOs
drbd: REQ_HARDBARRIER -> REQ_FUA transition for meta data accesses
drbd: Removed the BIO_RW_BARRIER support form the receiver/epoch code
...
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: remove incorrect assert in xfs_vm_writepage
xfs: use hlist_add_fake
xfs: fix a few compiler warnings with CONFIG_XFS_QUOTA=n
xfs: tell lockdep about parent iolock usage in filestreams
xfs: move delayed write buffer trace
xfs: fix per-ag reference counting in inode reclaim tree walking
xfs: xfs_ioctl: fix information leak to userland
xfs: remove experimental tag from the delaylog option
In commit 20cb52ebd1, titled
"xfs: simplify xfs_vm_writepage" I added an assert that any !mapped and
uptodate buffers are not dirty. That asserts turns out to trigger a lot
when running fsx on filesystems with small block sizes. The reason for
that is that the assert is simply incorrect. !mapped and uptodate
just mean this buffer covers a hole, and whenever we do a set_page_dirty
we mark all blocks in the page dirty, no matter if they have data or
not. So remove the assert, and update the comment above the condition
to match reality.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
A minor oversight from f7347ce4ee,
"fasync: re-organize fasync entry insertion to allow it under a
spinlock": this cleanup-on-error was only needed to handle -ENOMEM. Now
that we're preallocating it's unneeded.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We must also free the passed-in lease in the case it wasn't used because
an existing lease was upgrade/downgraded or already existed.
Note the nfsd caller doesn't care because it's fl_change callback
returns an error in those cases.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
XFS does not need it's inodes to actuall be hashed in the VFS inode
cache, but we require the inode to be marked hashed for the
writeback code to work.
Insted of using insert_inode_hash, which requires a second
inode_lock roundtrip after the partial merge of the inode
scalability patches in 2.6.37-rc simply use the new hlist_add_fake
helper to mark it hashed without requiring a lock or touching a
global cache line.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Andi Kleen reported that gcc-4.5 gives lots of warnings for him
inside the XFS code. It turned out most of them are due to the
quota stubs beeing macros, and gcc now complaining about macros
evaluating to 0 that are not assigned to variables.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The filestreams code may take the iolock on the parent inode while
holding it on a child. This is the only place in XFS where we take
both the child and parent iolock, so just telling lockdep about it
is enough. The lock flag required for that was already added as
part of the ilock lockdep annotations and unused so far.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The delayed write buffer split trace currently issues a trace for
every buffer it scans. These buffers are not necessarily queued for
delayed write. Indeed, when buffers are pinned, there can be
thousands of traces of buffers that aren't actually queued for
delayed write and the ones that are are lost in the noise. Move the
trace point to record only buffers that are split out for IO to be
issued on.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The walk fails to decrement the per-ag reference count when the
non-blocking walk fails to obtain the per-ag reclaim lock, leading
to an assert failure on debug kernels when unmounting a filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
al_hreq is copied from userland. If al_hreq.buflen is not properly aligned
then xfs_attr_list will ignore the last bytes of kbuf. These bytes are
unitialized. It leads to leaking of contents of kernel stack memory.
Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We promised to do this for 2.6.37, and the code looks stable enough to
keep that promise.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Commit 4221a9918e "Add RCU check for
find_task_by_vpid()" introduced rcu_lockdep_assert to find_task_by_pid_ns=
Assertion failed in sys_ioprio_get. The patch is fixing assertion
failure in ioprio_set as well.
kernel/pid.c:419 invoked rcu_dereference_check() without protection!
stack backtrace:
Pid: 4254, comm: iotop Not tainted
Call Trace:
[<ffffffff810656f2>] lockdep_rcu_dereference+0xaa/0xb2
[<ffffffff81053c67>] find_task_by_pid_ns+0x4f/0x68
[<ffffffff81053c9d>] find_task_by_vpid+0x1d/0x1f
[<ffffffff811104e2>] sys_ioprio_get+0x50/0x2da
[<ffffffff81002182>] system_call_fastpath+0x16/0x1b
V2: rcu critical section expanded according to comment by Paul E. McKenney
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
With 2.6.37-rc1, I observe sys_ioprio_set not taking the RCU lock [1]
across access to the task credentials.
Inspecting the code in fs/ioprio.c, the tasklist_lock is held for read
across the __task_cred call, which is presumably sufficient to prevent
the task credentials becoming stale.
===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
kernel/pid.c:419 invoked rcu_dereference_check() without protection!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 1
1 lock held by start-stop-daem/2246:
#0: (tasklist_lock){.?.?..}, at: [<ffffffff811a2dfa>]
sys_ioprio_set+0x8a/0x400
stack backtrace:
Pid: 2246, comm: start-stop-daem Not tainted 2.6.37-rc1-330cd+ #2
Call Trace:
[<ffffffff8109f5f4>] lockdep_rcu_dereference+0xa4/0xc0
[<ffffffff81085651>] find_task_by_pid_ns+0x81/0x90
[<ffffffff8108567d>] find_task_by_vpid+0x1d/0x20
[<ffffffff811a3160>] sys_ioprio_set+0x3f0/0x400
[<ffffffff816efa79>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff81003482>] system_call_fastpath+0x16/0x1b
Take the RCU lock for read across acquiring the pointer to the task
credentials and dereferencing it.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Fixed up by Jens to fix missing rcu_read_unlock() on mismatches.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If the iovec is being set up in a way that causes uaddr + PAGE_SIZE
to overflow, we could end up attempting to map a huge number of
pages. Check for this invalid input type.
Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: fix a memleak in cifs_setattr_nounix()
cifs: make cifs_ioctl handle NULL filp->private_data correctly
Fix openpromfs compilation by adding a missing semicolon in
fs/openpromfs/inode.c openprom_mount().
Signed-off-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>