journal_remove_journal_head() can oops when trying to access journal_head
returned by bh2jh(). This is caused for example by the following race:
TASK1 TASK2
journal_commit_transaction()
...
processing t_forget list
__journal_refile_buffer(jh);
if (!jh->b_transaction) {
jbd_unlock_bh_state(bh);
journal_try_to_free_buffers()
journal_grab_journal_head(bh)
jbd_lock_bh_state(bh)
__journal_try_to_free_buffer()
journal_put_journal_head(jh)
journal_remove_journal_head(bh);
journal_put_journal_head() in TASK2 sees that b_jcount == 0 and buffer is not
part of any transaction and thus frees journal_head before TASK1 gets to doing
so. Note that even buffer_head can be released by try_to_free_buffers() after
journal_put_journal_head() which adds even larger opportunity for oops (but I
didn't see this happen in reality).
Fix the problem by making transactions hold their own journal_head reference
(in b_jcount). That way we don't have to remove journal_head explicitely via
journal_remove_journal_head() and instead just remove journal_head when
b_jcount drops to zero. The result of this is that [__]journal_refile_buffer(),
[__]journal_unfile_buffer(), and __journal_remove_checkpoint() can free
journal_head which needs modification of a few callers. Also we have to be
careful because once journal_head is removed, buffer_head might be freed as
well. So we have to get our own buffer_head reference where it matters.
Signed-off-by: Jan Kara <jack@suse.cz>
journal_get_create_access should drop jh->b_jcount in error handling path
Signed-off-by: Ding Dinghua <dingdinghua@nrchpc.ac.cn>
Signed-off-by: Jan Kara <jack@suse.cz>
The callers of start_this_handle() (or better ext3_journal_start()) are not
really prepared to handle allocation failures. Such failures can for example
result in silent data loss when it happens in ext3_..._writepage(). OTOH
__GFP_NOFAIL is going away so we just retry allocation in start_this_handle().
This loop is potentially dangerous because the oom killer cannot be invoked
for GFP_NOFS allocation, so there is a potential for infinitely looping.
But still this is better than silent data loss.
Signed-off-by: Jan Kara <jack@suse.cz>
journal_start returns an ERR_PTR() value rather than NULL on failure.
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Remove goto statement which jumps to very next line. Also remove
target label because it is no longer used anywhere.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Delay discarding buffers in journal_unmap_buffer until
we know that "add to orphan" operation has definitely been
committed, otherwise the log space of committing transation
may be freed and reused before truncate get committed, updates
may get lost if crash happens.
This patch is a backport of JBD2 fix by dingdinghua <dingdinghua@nrchpc.ac.cn>.
Signed-off-by: Jan Kara <jack@suse.cz>
In particular, several occurances of funny versions of 'success',
'unknown', 'therefore', 'acknowledge', 'argument', 'achieve', 'address',
'beginning', 'desirable', 'separate' and 'necessary' are fixed.
Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Joe Perches <joe@perches.com>
Cc: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
lockdep annotation for a transaction start has been at the end of
journal_start(). But a transaction is also started from journal_restart(). Move
the lockdep annotation to start_this_handle() which covers both cases.
Signed-off-by: Jan Kara <jack@suse.cz>
Fix jiffie rounding in jbd commit timer setup code. Rounding down could cause
the timer to be fired before the corresponding transaction has expired. That
transaction can stay not committed forever if no new transaction is created or
explicit sync/umount happens.
Signed-off-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The following race can happen:
CPU1 CPU2
checkpointing code checks the buffer, adds
it to an array for writeback
do_get_write_access()
...
lock_buffer()
unlock_buffer()
flush_batch() submits the buffer for IO
__jbd_journal_file_buffer()
So a buffer under writeout is returned from do_get_write_access(). Since
the filesystem code relies on the fact that journaled buffers cannot be
written out, it does not take the buffer lock and so it can modify buffer
while it is under writeout. That can lead to a filesystem corruption
if we crash at the right moment. The similar problem can happen with
the journal_get_create_access() path.
We fix the problem by clearing the buffer dirty bit under buffer_lock
even if the buffer is on BJ_None list. Actually, we clear the dirty bit
regardless the list the buffer is in and warn about the fact if
the buffer is already journalled.
Thanks for spotting the problem goes to dingdinghua <dingdinghua85@gmail.com>.
Reported-by: dingdinghua <dingdinghua85@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
I delete the following patch
"commit 3f31fddfa2
Author: Mingming Cao <cmm@us.ibm.com>
Date: Fri Jul 25 01:46:22 2008 -0700
jbd: fix race between free buffer and commit transaction
This patch is no longer needed because if race between freeing buffer and
committing transaction functionality occurs and dio gets error, currently
dio falls back to buffered IO by the following patch.
commit 6ccfa806a9
Author: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Date: Tue Sep 2 14:35:40 2008 -0700
VFS: fix dio write returning EIO when try_to_release_page fails
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Mingming Cao <cmm@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a commit is triggered by fsync(), set a flag indicating the journal
blocks associated with the transaction should be flushed out using
WRITE_SYNC.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Remove excess kernel-doc from fs/jbd/transaction.c:
Warning(linux-2.6.28-git5//fs/jbd/transaction.c:764): Excess function parameter 'credits' description in 'journal_get_write_access'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a flaw with the way jbd handles fsync batching. If we fsync() a
file and we were not the last person to run fsync() on this fs then we
automatically sleep for 1 jiffie in order to wait for new writers to join
into the transaction before forcing the commit. The problem with this is
that with really fast storage (ie a Clariion) the time it takes to commit
a transaction to disk is way faster than 1 jiffie in most cases, so
sleeping means waiting longer with nothing to do than if we just committed
the transaction and kept going. Ric Wheeler noticed this when using
fs_mark with more than 1 thread, the throughput would plummet as he added
more threads.
This patch attempts to fix this problem by recording the average time in
nanoseconds that it takes to commit a transaction to disk, and what time
we started the transaction. If we run an fsync() and we have been running
for less time than it takes to commit the transaction to disk, we sleep
for the delta amount of time and then commit to disk. We acheive
sub-jiffie sleeping using schedule_hrtimeout. This means that the wait
time is auto-tuned to the speed of the underlying disk, instead of having
this static timeout. I weighted the average according to somebody's
comments (Andreas Dilger I think) in order to help normalize random
outliers where we take way longer or way less time to commit than the
average. I also have a min() check in there to make sure we don't sleep
longer than a jiffie in case our storage is super slow, this was requested
by Andrew.
I unfortunately do not have access to a Clariion, so I had to use a
ramdisk to represent a super fast array. I tested with a SATA drive with
barrier=1 to make sure there was no regression with local disks, I tested
with a 4 way multipathed Apple Xserve RAID array and of course the
ramdisk. I ran the following command
fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i
where $i was 2, 4, 8, 16 and 32. I mkfs'ed the fs each time. Here are my
results
type threads with patch without patch
sata 2 24.6 26.3
sata 4 49.2 48.1
sata 8 70.1 67.0
sata 16 104.0 94.1
sata 32 153.6 142.7
xserve 2 246.4 222.0
xserve 4 480.0 440.8
xserve 8 829.5 730.8
xserve 16 1172.7 1026.9
xserve 32 1816.3 1650.5
ramdisk 2 2538.3 1745.6
ramdisk 4 2942.3 661.9
ramdisk 8 2882.5 999.8
ramdisk 16 2738.7 1801.9
ramdisk 32 2541.9 2394.0
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ric Wheeler <rwheeler@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Delete excess kernel-doc notation in fs/ subdirectory:
Warning(linux-2.6.27-git10//fs/jbd/transaction.c:886): Excess function parameter or struct member 'credits' description in 'journal_get_undo_access'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ordered mode, if a file data buffer being dirtied exists in the
committing transaction, we write the buffer to the disk, move it from the
committing transaction to the running transaction, then dirty it. But we
don't have to remove the buffer from the committing transaction when the
buffer couldn't be written out, otherwise it would miss the error and the
committing transaction would not abort.
This patch adds an error check before removing the buffer from the
committing transaction.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
the names were too generic:
drivers/uio/uio.c:87: error: expected identifier or '(' before 'do'
drivers/uio/uio.c:87: error: expected identifier or '(' before 'while'
drivers/uio/uio.c:113: error: 'map_release' undeclared here (not in a function)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Most the free-standing lock_acquire() usages look remarkably similar, sweep
them into a new helper.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data
buffer to flush to disk. If the caller of journal_try_to_free_buffers()
request tries hard to release the buffers, it will treat the failure as
error and return back to the caller. We have seen the directo IO failed
due to this race. Some of the caller of releasepage() also expecting the
buffer to be dropped when passed with GFP_KERNEL mask to the
releasepage()->journal_try_to_free_buffers().
With this patch, if the caller is passing the __GFP_WAIT and __GFP_FS to
indicating this call could wait, in case of try_to_free_buffers() failed,
let's waiting for journal_commit_transaction() to finish commit the
current committing transaction, then try to free those buffers again.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>