Commit Graph

11246 Commits

Author SHA1 Message Date
Theodore Ts'o 87d8fe1ee6 add releasepage hooks to block devices which can be used by file systems
Implement blkdev_releasepage() to release the buffer_heads and pages
after we release private data belonging to a mounted filesystem.

Cc: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-03 09:47:09 -05:00
Aneesh Kumar K.V 0087d9fb3f ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
With nodelalloc option we need to update the dirty block counter on
block allocation failure. This is needed because we increment the
dirty block counter early in the block allocation phase. Without
the patch s_dirty_blocks_counter goes wrong so that filesystem's
free blocks decreases incorrectly.

Tested-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-05 21:49:12 -05:00
Aneesh Kumar K.V 29eaf02498 ext4: Init the complete page while building buddy cache
We need to init the complete page during buddy cache init
by setting the contents to '1'.  Otherwise we can see the
following errors after doing an online resize of the
filesystem:

EXT4-fs error (device sdb1): ext4_mb_mark_diskspace_used:
	Allocating block 1040385 in system zone of 127 group

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-05 21:48:56 -05:00
Aneesh Kumar K.V 8556e8f3b6 ext4: Don't allow new groups to be added during block allocation
After we mark the blocks in the buddy cache as allocated,
we need to ensure that we don't reinit the buddy cache until
the block bitmap is updated.  This commit achieves this by holding
the group_info alloc_semaphore till ext4_mb_release_context

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-05 21:46:55 -05:00
Aneesh Kumar K.V 648f5879f5 ext4: mark the blocks/inode bitmap beyond end of group as used
We need to mark the block/inode bitmap beyond the end of the group
with '1'.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-05 21:46:04 -05:00
Aneesh Kumar K.V 2ccb5fb9f1 ext4: Use new buffer_head flag to check uninit group bitmaps initialization
For uninit block group, the on-disk bitmap is not initialized. That
implies we cannot depend on the uptodate flag on the bitmap
buffer_head to find bitmap validity.  Use a new buffer_head flag which
would be set after we properly initialize the bitmap.  This also
prevents (re-)initializing the uninit group bitmap every time we call 
ext4_read_block_bitmap().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-05 21:49:55 -05:00
Aneesh Kumar K.V 393418676a ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
We need to make sure we update the inode bitmap and clear
EXT4_BG_INODE_UNINIT flag with sb_bgl_lock held, since
ext4_read_inode_bitmap() looks at EXT4_BG_INODE_UNINIT to decide
whether to initialize the inode bitmap each time it is called.
(introduced by commit c806e68f.)

ext4_read_inode_bitmap does:

spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
	ext4_init_inode_bitmap(sb, bh, block_group, desc);

and ext4_new_inode does
if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, group),
                   ino, inode_bitmap_bh->b_data))
		   ......
		   ...
spin_lock(sb_bgl_lock(sbi, group));

gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
i.e., on allocation we update the bitmap then we take the sb_bgl_lock
and clear the EXT4_BG_INODE_UNINIT flag. What can happen is a
parallel ext4_read_inode_bitmap can zero out the bitmap in between
the above ext4_set_bit_atomic and spin_lock(sb_bg_lock..)

The race results in below user visible errors
EXT4-fs error (device sdb1): ext4_free_inode: bit already cleared for inode 168449
EXT4-fs warning (device sdb1): ext4_unlink: Deleting nonexistent file ...
EXT4-fs warning (device sdb1): ext4_rmdir: empty directory has too many links ...
# ls -al /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71
ls: /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71: Stale NFS file handle

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-05 21:38:14 -05:00
Aneesh Kumar K.V 3300beda52 ext4: code cleanup
Rename some variables.  We also unlock locks in the reverse order we
acquired as a part of cleanup.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-03 22:33:39 -05:00
Aneesh Kumar K.V 560671a0d3 ext4: Use high 16 bits of the block group descriptor's free counts fields
Rename the lower bits with suffix _lo and add helper
to access the values. Also rename bg_itable_unused_hi
to bg_pad as in e2fsprogs.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05 22:20:24 -05:00
Aneesh Kumar K.V e8134b27e3 ext4: Fix race between read_block_bitmap() and mark_diskspace_used()
We need to make sure we update the block bitmap and clear
EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held, since
ext4_read_block_bitmap() looks at EXT4_BG_BLOCK_UNINIT to decide
whether to initialize the block bitmap each time it is called
(introduced by commit c806e68f), and this can race with block
allocations in ext4_mb_mark_diskspace_used().

ext4_read_block_bitmap does:

spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
	ext4_init_block_bitmap(sb, bh, block_group, desc);

Now on the block allocation side we do

mb_set_bits(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group), bitmap_bh->b_data,
			ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len);
....
spin_lock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group));
if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
	gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);

ie on allocation we update the bitmap then we take the sb_bgl_lock
and clear the EXT4_BG_BLOCK_UNINIT flag. What can happen is a
parallel ext4_read_block_bitmap can zero out the bitmap in between
the above mb_set_bits and spin_lock(sb_bg_lock..)

The race results in below user visible errors
EXT4-fs error (device sdb1): ext4_mb_release_inode_pa: free 100, pa_free 105
EXT4-fs error (device sdb1): mb_free_blocks: double-free of inode 0's block ..

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-05 21:38:26 -05:00
Aneesh Kumar K.V 5d1b1b3f49 ext4: fix BUG when calling ext4_error with locked block group
The mballoc code likes to call ext4_error while it is holding locked
block groups.  This can causes a scheduling in atomic context BUG.  We
can't just unlock the block group and relock it after/if ext4_error
returns since that might result in race conditions in the case where
the filesystem is set to continue after finding errors.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05 22:19:52 -05:00
Aneesh Kumar K.V b7be019e80 ext4: Fix lockdep recursive locking warning
In ext4_mb_init_group(), if the filesystem block size is less than
PAGE_SIZE/2, the code tries to grab alloc_sem for multiple block
groups in a loop.  We need to allow for this by using
down_write_nested() and passing in the loop index as a lock subclass
number.  This works because no other code path needs to take multiple
alloc_sem's.  Note that lockdep will fail for filesystem blocksize
smaller than to PAGE_SIZE/16k.  (e.g., a 1k filesystem blocksize with
a 32k page size, or a 2k filesystem blocksize with a 64k blocksize,
etc.)

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-23 23:51:53 -05:00
Aneesh Kumar K.V 7a2fcbf7f8 ext4: don't use blocks freed but not yet committed in buddy cache init
When we generate buddy cache (especially during resize) we need to
make sure we don't use the blocks freed but not yet comitted.  This
makes sure we have the right value of free blocks count in the group
info and also in the bitmap.  This also ensures the ordered mode
consistency

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-05 21:36:55 -05:00
Aneesh Kumar K.V fb68407b0d jbd2: Call journal commit callback without holding j_list_lock
Avoid freeing the transaction in __jbd2_journal_drop_transaction() so
the journal commit callback can run without holding j_list_lock, to
avoid lock contention on this spinlock.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-06 17:50:21 -05:00
Aneesh Kumar K.V c3a326a657 ext4: cleanup mballoc header files
Move some of the forward declaration of the static functions
to mballoc.c where they are used. This enables us to include
mballoc.h in other .c files. Also correct the buddy cache
documentation.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-25 15:11:52 -05:00
Aneesh Kumar K.V 920313a726 ext4: Use EXT4_GROUP_INFO_NEED_INIT_BIT during resize
The new groups added during resize are flagged as
need_init group. Make sure we properly initialize these
groups. When we have block size < page size and we are adding
new groups the page may still be marked uptodate even though
we haven't initialized the group. While forcing the init
of buddy cache we need to make sure other groups part of the
same page of buddy cache is not using the cache.
group_info->alloc_sem is added to ensure the same.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
cc: stable@kernel.org
2009-01-05 21:36:19 -05:00
Aneesh Kumar K.V e21675d4b6 ext4: Add blocks added during resize to bitmap
With this change new blocks added during resize
are marked as free in the block bitmap and the
group is flagged with EXT4_GROUP_INFO_NEED_INIT_BIT
flag.  This makes sure when mballoc tries to allocate
blocks from the new group we would reload the
buddy information using the bitmap present in the disk.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-05 21:36:02 -05:00
Aneesh Kumar K.V 3a06d778df ext4: sparse fixes
* Change EXT4_HAS_*_FEATURE to return a boolean
* Add a function prototype for ext4_fiemap() in ext4.h
* Make ext4_ext_fiemap_cb() and ext4_xattr_fiemap() be static functions
* Add lock annotations to mb_free_blocks()

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-22 15:04:59 -05:00
Theodore Ts'o 1a0d3786dd jbd2: Remove a large array of bh's from the stack of the checkpoint routine
jbd2_log_do_checkpoint()n is one of the kernel's largest stack users.
Move the array of buffer head's from the stack of jbd2_log_do_checkpoint()
to the in-core journal structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-05 00:09:22 -05:00
Theodore Ts'o 498e5f2415 ext4: Change unsigned long to unsigned int
Convert the unsigned longs that are most responsible for bloating the
stack usage on 64-bit systems.

Nearly all places in the ext3/4 code which uses "unsigned long" is
probably a bug, since on 32-bit systems a ulong a 32-bits, which means
we are wasting stack space on 64-bit systems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-05 00:14:04 -05:00
Theodore Ts'o a9df9a4910 ext4: Make ext4_group_t be an unsigned int
Nearly all places in the ext3/4 code which uses "unsigned long" is
probably a bug, since on 32-bit systems a ulong a 32-bits, which means
we are wasting stack space on 64-bit systems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05 22:18:16 -05:00
Theodore Ts'o cde6436004 ext4: Remove i_ext_generation from ext4_inode_info structure
The i_ext_generation was incremented, but never used.  Remove it to
slim down the ext4_inode_info structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-04 18:46:03 -05:00
Theodore Ts'o 30773840c1 ext4: add fsync batch tuning knobs
Add new mount options, min_batch_time and max_batch_time, which
controls how long the jbd2 layer should wait for additional filesystem
operations to get batched with a synchronous write transaction.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-03 20:27:38 -05:00
Theodore Ts'o d7cfa4684d ext4: display average commit time
Display the average commit time (which is used by the ext4 fsync
batching patch) in /proc/fs/jbd2/*/info for performance tuning
purposes.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-12-17 00:20:45 -05:00
Josef Bacik e07f7183a4 jbd2: improve jbd2 fsync batching
This patch removes the static sleep time in favor of a more self
optimizing approach where we measure the average amount of time it
takes to commit a transaction to disk and the ammount of time a
transaction has been running.  If somebody does a sync write or an
fsync() traditionally we would sleep for 1 jiffies, which depending on
the value of HZ could be a significant amount of time compared to how
long it takes to commit a transaction to the underlying storage.  With
this patch instead of sleeping for a jiffie, we check to see if the
amount of time this transaction has been running is less than the
average commit time, and if it is we sleep for the delta using
schedule_hrtimeout to give us a higher precision sleep time.  This
greatly benefits high end storage where you could end up sleeping for
longer than it takes to commit the transaction and therefore sitting
idle instead of allowing the transaction to be committed by keeping
the sleep time to a minimum so you are sure to always be doing
something.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-26 01:14:26 -05:00