Commit Graph

141 Commits

Author SHA1 Message Date
Jan Kara
5fe2fe895a ext4: provide wrappers for transaction reservation calls
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:37:50 -04:00
Theodore Ts'o
9f203507ed ext4: mark all metadata I/O with REQ_META
As Dave Chinner pointed out at the 2013 LSF/MM workshop, it's
important that metadata I/O requests are marked as such to avoid
priority inversions caused by I/O bandwidth throttling.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-20 15:46:17 -04:00
Jan Kara
eb9cc7e16b ext4: move quota initialization out of inode allocation transaction
Inode allocation transaction is pretty heavy (246 credits with quotas
and extents before previous patch, still around 200 after it).  This is
mostly due to credits required for allocation of quota structures
(credits there are heavily overestimated but it's difficult to make
better estimates if we don't want to wire non-trivial assumptions about
quota format into filesystem).

So move quota initialization out of allocation transaction. That way
transaction for quota structure allocation will be started only if we
need to look up quota structure on disk (rare) and furthermore it will
be started for each quota type separately, not for all of them at once.
This reduces maximum transaction size to 34 is most cases and to 73 in
the worst case.

[ Modified by tytso to clean up the cleanup paths for error handling.
  Also use a separate call to ext4_std_error() for each failure so it
  is easier for someone who is debugging a problem in this function to
  determine which function call failed. ]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-19 13:38:14 -04:00
Dmitri Monakho
8c8e0ca622 ext4: fix usless declarations
This patch should fix sparse complains about shadow declatations.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 22:48:36 -04:00
Theodore Ts'o
90ba983f68 ext4: use atomic64_t for the per-flexbg free_clusters count
A user who was using a 8TB+ file system and with a very large flexbg
size (> 65536) could cause the atomic_t used in the struct flex_groups
to overflow.  This was detected by PaX security patchset:

http://forums.grsecurity.net/viewtopic.php?f=3&t=3289&p=12551#p12551

This bug was introduced in commit 9f24e4208f, so it's been around
since 2.6.30.  :-(

Fix this by using an atomic64_t for struct orlav_stats's
free_clusters.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
2013-03-11 23:39:59 -04:00
Theodore Ts'o
8de5c325b4 ext4: use KERN_WARNING for warning messages
Some messages printed related to a WARN_ON(1) were printed using
KERN_NOTICE.  Use KERN_WARNING or ext4_warning() instead so that
context related to the WARN_ON() is printed at the same printk warning
level (and log files, etc.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-14 15:11:41 -05:00
Theodore Ts'o
1139575a92 ext4: start handle at the last possible moment when creating inodes
In ext4_{create,mknod,mkdir,symlink}(), don't start the journal handle
until the inode has been succesfully allocated.  In order to do this,
we need to start the handle in the ext4_new_inode().  So create a new
variant of this function, ext4_new_inode_start_handle(), so the handle
can be created at the last possible minute, before we need to modify
the inode allocation bitmap block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-09 16:27:09 -05:00
Theodore Ts'o
9924a92a8c ext4: pass context information to jbd2__journal_start()
So we can better understand what bits of ext4 are responsible for
long-running jbd2 handles, use jbd2__journal_start() so we can pass
context information for logging purposes.

The recommended way for finding the longer-running handles is:

   T=/sys/kernel/debug/tracing
   EVENT=$T/events/jbd2/jbd2_handle_stats
   echo "interval > 5" > $EVENT/filter
   echo 1 > $EVENT/enable

   ./run-my-fs-benchmark

   cat $T/trace > /tmp/problem-handles

This will list handles that were active for longer than 20ms.  Having
longer-running handles is bad, because a commit started at the wrong
time could stall for those 20+ milliseconds, which could delay an
fsync() or an O_SYNC operation.  Here is an example line from the
trace file describing a handle which lived on for 311 jiffies, or over
1.2 seconds:

postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32 
   tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
   dirtied_blocks 0

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-08 21:59:22 -05:00
Tao Ma
f08225d176 ext4: enable ext4 inline support
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10 14:06:03 -05:00
Theodore Ts'o
aeb1e5d69a ext4: fix possible use after free with metadata csum
Commit fa77dcfafe introduces block bitmap checksum calculation into
ext4_new_inode() in the case that block group was uninitialized.
However we brelse() the bitmap buffer before we attempt to checksum it
so we have no guarantee that the buffer is still there.

Fix this by releasing the buffer after the possible checksum
computation.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: stable@vger.kernel.org
2012-11-29 21:21:22 -05:00
Eric Sandeen
ffb5387e85 ext4: fix unjournaled inode bitmap modification
commit 119c0d4460 changed
ext4_new_inode() such that the inode bitmap was being modified
outside a transaction, which could lead to corruption, and was
discovered when journal_checksum found a bad checksum in the
journal during log replay.

Nix ran into this when using the journal_async_commit mount
option, which enables journal checksumming.  The ensuing
journal replay failures due to the bad checksums led to
filesystem corruption reported as the now infamous
"Apparent serious progressive ext4 data corruption bug"

[ Changed by tytso to only call ext4_journal_get_write_access() only
  when we're fairly certain that we're going to allocate the inode. ]

I've tested this by mounting with journal_checksum and
running fsstress then dropping power; I've also tested by
hacking DM to create snapshots w/o first quiescing, which
allows me to test journal replay repeatedly w/o actually
power-cycling the box.  Without the patch I hit a journal
checksum error every time.  With this fix it survives
many iterations.

Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-10-28 22:24:57 -04:00
Tao Ma
79f1ba4956 ext4: Checksum the block bitmap properly with bigalloc enabled
In mke2fs, we only checksum the whole bitmap block and it is right.
While in the kernel, we use EXT4_BLOCKS_PER_GROUP to indicate the
size of the checksumed bitmap which is wrong when we enable bigalloc.
The right size should be EXT4_CLUSTERS_PER_GROUP and this patch fixes
it.

Also as every caller of ext4_block_bitmap_csum_set and
ext4_block_bitmap_csum_verify pass in EXT4_BLOCKS_PER_GROUP(sb)/8,
we'd better removes this parameter and sets it in the function itself.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
2012-10-22 00:34:32 -04:00
Yongqiang Yang
f2a09af645 ext4: check free inode count before allocating an inode
Recently, I ecountered some corrupted filesystems in which some
groups' free inode counts were 65535, it seemed that free inode
count was overflow.  This patch teaches ext4 to check free inode
count before allocaing an inode.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-23 23:16:03 -04:00
Jan Kara
97a7406880 ext4: remove useless marking of superblock dirty
Commit a0375156 properly notes that superblock doesn't need to be marked
as dirty when only number of free inodes / blocks / number of directories
changes since that is recomputed on each mount anyway. However that comment
leaves some unnecessary markings as dirty in place. Remove these.

Artem: tested using xfstests for both journalled and non-journalled ext4.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-07-22 20:29:31 -04:00
Theodore Ts'o
f6fb99cadc ext4: pass a char * to ext4_count_free() instead of a buffer_head ptr
Make it possible for ext4_count_free to operate on buffers and not
just data in buffer_heads.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2012-06-30 19:14:57 -04:00
Linus Torvalds
4edebed866 Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull Ext4 updates from Theodore Ts'o:
 "The major new feature added in this update is Darrick J Wong's
  metadata checksum feature, which adds crc32 checksums to ext4's
  metadata fields.

  There is also the usual set of cleanups and bug fixes."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (44 commits)
  ext4: hole-punch use truncate_pagecache_range
  jbd2: use kmem_cache_zalloc wrapper instead of flag
  ext4: remove mb_groups before tearing down the buddy_cache
  ext4: add ext4_mb_unload_buddy in the error path
  ext4: don't trash state flags in EXT4_IOC_SETFLAGS
  ext4: let getattr report the right blocks in delalloc+bigalloc
  ext4: add missing save_error_info() to ext4_error()
  ext4: add debugging trigger for ext4_error()
  ext4: protect group inode free counting with group lock
  ext4: use consistent ssize_t type in ext4_file_write()
  ext4: fix format flag in ext4_ext_binsearch_idx()
  ext4: cleanup in ext4_discard_allocated_blocks()
  ext4: return ENOMEM when mounts fail due to lack of memory
  ext4: remove redundundant "(char *) bh->b_data" casts
  ext4: disallow hard-linked directory in ext4_lookup
  ext4: fix potential integer overflow in alloc_flex_gd()
  ext4: remove needs_recovery in ext4_mb_init()
  ext4: force ro mount if ext4_setup_super() fails
  ext4: fix potential NULL dereference in ext4_free_inodes_counts()
  ext4/jbd2: add metadata checksumming to the list of supported features
  ...
2012-06-01 10:12:15 -07:00
Tao Ma
6f2e9f0e7d ext4: protect group inode free counting with group lock
Now when we set the group inode free count, we don't have a proper
group lock so that multiple threads may decrease the inode free
count at the same time. And e2fsck will complain something like:

Free inodes count wrong for group #1 (1, counted=0).
Fix? no

Free inodes count wrong for group #2 (3, counted=0).
Fix? no

Directories count wrong for group #2 (780, counted=779).
Fix? no

Free inodes count wrong for group #3 (2272, counted=2273).
Fix? no

So this patch try to protect it with the ext4_lock_group.

btw, it is found by xfstests test case 269 and the volume is
mkfsed with the parameter
"-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr"
and I have run it 100 times and the error in e2fsck doesn't
show up again.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28 18:20:59 -04:00
Dan Carpenter
bb3d132a24 ext4: fix potential NULL dereference in ext4_free_inodes_counts()
The ext4_get_group_desc() function returns NULL on error, and
ext4_free_inodes_count() function dereferences it without checking.
There is a check on the next line, but it's too late.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2012-05-28 14:16:57 -04:00
Eric W. Biederman
08cefc7ab8 userns: Convert ext4 to user kuid/kgid where appropriate
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-15 14:59:27 -07:00
Darrick J. Wong
feb0ab32a5 ext4: make block group checksums use metadata_csum algorithm
metadata_csum supersedes uninit_bg.  Convert the ROCOMPAT uninit_bg
flag check to a helper function that covers both, and make the
checksum calculation algorithm use either crc16 or the metadata_csum
chosen algorithm depending on which flag is set.  Print a warning if
we try to mount a filesystem with both feature flags set.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:45:10 -04:00
Darrick J. Wong
fa77dcfafe ext4: calculate and verify block bitmap checksum
Compute and verify the checksum of the block bitmap; this checksum is
stored in the block group descriptor.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:35:10 -04:00
Darrick J. Wong
41a246d1ff ext4: calculate and verify checksums for inode bitmaps
Compute and verify the checksum of the inode bitmap; the checkum is
stored in the block group descriptor.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:33:10 -04:00
Darrick J. Wong
814525f4df ext4: calculate and verify inode checksums
This patch introduces to ext4 the ability to calculate and verify
inode checksums.  This requires the use of a new ro compatibility flag
and some accompanying e2fsprogs patches to provide the relevant
features in tune2fs and e2fsck.  The inode generation changes have
been integrated into this patch.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:31:10 -04:00
Theodore Ts'o
92b9781658 ext4: change some printk() calls to use ext4_msg() instead
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-19 23:41:49 -04:00
Theodore Ts'o
1084f252e3 ext4: remove trailing newlines from ext4_msg() and ext4_error() messages
The functions ext4_msg() and ext4_error() already tack on a trailing
newline, so remove the unnecessary extra newline.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-19 23:13:43 -04:00