Commit Graph

165 Commits

Author SHA1 Message Date
Eric Sandeen eee4adc709 ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static
These functions are only used within fs/ext4/mballoc.c, so move them
so they are used after they are defined, and then make them be static.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:15 -04:00
Theodore Ts'o 61d08673de ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end()
Fix a namespace leak from fs/ext4

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:15 -04:00
Theodore Ts'o 4a873a472b ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static
Fix a namespace leak by moving the function to the file where it is
used and making it static.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:14 -04:00
Theodore Ts'o 1f109d5a17 ext4: make various ext4 functions be static
These functions have no need to be exported beyond file context.

No functions needed to be moved for this commit; just some function
declarations changed to be static and removed from header files.

(A similar patch was submitted by Eric Sandeen, but I wanted to handle
code movement in separate patches to make sure code changes didn't
accidentally get dropped.)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:14 -04:00
Theodore Ts'o 5dabfc78dc ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()
This is a cleanup to avoid namespace leaks out of fs/ext4

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:14 -04:00
Lukas Czerner 7360d1731e ext4: Add batched discard support for ext4
Walk through allocation groups and trim all free extents. It can be
invoked through FITRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSD's
may suffer from performance loss after the whole device was filled (it
does not mean that fs is full!).

It search for free extents in allocation groups specified by Byte range
start -> start+len. When the free extent is within this range, blocks
are marked as used and then trimmed. Afterwards these blocks are marked
as free in per-group bitmap.

Since fstrim is a long operation it is good to have an ability to
interrupt it by a signal. This was added by Dmitry Monakhov.
Thanks Dimitry.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:12 -04:00
Theodore Ts'o bd2d0210cf ext4: use bio layer instead of buffer layer in mpage_da_submit_io
Call the block I/O layer directly instad of going through the buffer
layer.  This should give us much better performance and scalability,
as well as lowering our CPU utilization when doing buffered writeback.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:10 -04:00
Eric Sandeen 640e939656 ext4: remove unused ext4_sb_info members
Not that these take up a lot of room, but the structure is long enough
as it is, and there's no need to confuse people with these various
undocumented & unused structure members...

Signed-off-by: Eric Sandeen <sandeen@redaht.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:08 -04:00
Toshiyuki Okajima e0d10bfa91 ext4: improve llseek error handling for overly large seek offsets
The llseek system call should return EINVAL if passed a seek offset
which results in a write error.  What this maximum offset should be
depends on whether or not the huge_file file system feature is set,
and whether or not the file is extent based or not.


If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be 
written (write systemcall) is different from the maximum size which can be 
sought (lseek systemcall).

For example, the following 2 cases demonstrates the differences
between the maximum size which can be written, versus the seek offset
allowed by the llseek system call:

#1: mkfs.ext3 <dev>; mount -t ext4 <dev>
#2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev>

Table. the max file size which we can write or seek
       at each filesystem feature tuning and file flag setting
+============+===============================+===============================+
| \ File flag|                               |                               |
|      \     |     !EXT4_EXTENTS_FL          |        EXT4_EXTETNS_FL        |
|case       \|                               |                               |
+------------+-------------------------------+-------------------------------+
| #1         |   write:      2194719883264   | write:       --------------   |
|            |   seek:       2199023251456   | seek:        --------------   |
+------------+-------------------------------+-------------------------------+
| #2         |   write:      4402345721856   | write:       17592186044415   |
|            |   seek:      17592186044415   | seek:        17592186044415   |
+------------+-------------------------------+-------------------------------+

The differences exist because ext4 has 2 maxbytes which are sb->s_maxbytes
(= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped 
maxbytes).  Although generic_file_llseek uses only extent-mapped maxbytes.
(llseek of ext4_file_operations is generic_file_llseek which uses
sb->s_maxbytes.)

Therefore we create ext4 llseek function which uses 2 maxbytes.

The new own function originates from generic_file_llseek().
If the file flag, "EXT4_EXTENTS_FL" is not set, the function alters 
inode->i_sb->s_maxbytes into EXT4_SB(inode->i_sb)->s_bitmap_maxbytes.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
2010-10-27 21:30:06 -04:00
Lukas Czerner 857ac889cc ext4: add interface to advertise ext4 features in sysfs
User-space should have the opportunity to check what features doest ext4
support in each particular copy. This adds easy interface by creating new
"features" directory in sys/fs/ext4/. In that directory files
advertising feature names can be created.

Add lazy_itable_init to the feature list.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
Lukas Czerner bfff68738f ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out.  The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, has not been corrupted However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the the old, uninitialized data could potentially cause e2fsck to
report false problems.

Hence, it is important for the inode tables to be initialized as soon
as possble.  This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.

This is done via a new new kernel thread called ext4lazyinit, which is
created on demand and destroyed, when it is no longer needed.  There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with inititable mount option is mounted, ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.

This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). Next schedule
time for the request is computed by multiplying the time it took to
zero out last inode table with wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10).  We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later later by another
filesystem).

We do not disturb regular inode allocations in any way, it just do not
care whether the inode table is, or is not zeroed. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group, while zeroing is on the way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and allocator hits the
group which is currently being zeroed, it just has to wait.

This can be suppresed using the mount option no_init_itable.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
Curt Wohlgemuth fb1813f4a8 ext4: use dedicated slab caches for group_info structures
ext4_group_info structures are currently allocated with kmalloc().
With a typical 4K block size, these are 136 bytes each -- meaning
they'll each consume a 256-byte slab object.  On a system with many
ext4 large partitions, that's a lot of wasted kernel slab space.
(E.g., a single 1TB partition will have about 8000 block groups, using
about 2MB of slab, of which nearly 1MB is wasted.)

This patch creates an array of slab pointers created as needed --
depending on the superblock block size -- and uses these slabs to
allocate the group info objects.

Google-Bug-Id: 2980809

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:29:12 -04:00
Linus Torvalds 5f248c9c25 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
  no need for list_for_each_entry_safe()/resetting with superblock list
  Fix sget() race with failing mount
  vfs: don't hold s_umount over close_bdev_exclusive() call
  sysv: do not mark superblock dirty on remount
  sysv: do not mark superblock dirty on mount
  btrfs: remove junk sb_dirt change
  BFS: clean up the superblock usage
  AFFS: wait for sb synchronization when needed
  AFFS: clean up dirty flag usage
  cifs: truncate fallout
  mbcache: fix shrinker function return value
  mbcache: Remove unused features
  add f_flags to struct statfs(64)
  pass a struct path to vfs_statfs
  update VFS documentation for method changes.
  All filesystems that need invalidate_inode_buffers() are doing that explicitly
  convert remaining ->clear_inode() to ->evict_inode()
  Make ->drop_inode() just return whether inode needs to be dropped
  fs/inode.c:clear_inode() is gone
  fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
  ...

Fix up trivial conflicts in fs/nilfs2/super.c
2010-08-10 11:26:52 -07:00
Al Viro 0930fcc1ee convert ext4 to ->evict_inode()
pretty much brute-force...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:30 -04:00
Eric Sandeen 0cfc9255a1 ext4: re-inline ext4_rec_len_(to|from)_disk functions
commit 3d0518f4, "ext4: New rec_len encoding for very
large blocksizes" made several changes to this path, but from
a perf perspective, un-inlining ext4_rec_len_from_disk() seems
most significant.  This function is called from ext4_check_dir_entry(),
which on a file-creation workload is called extremely often.

I tested this with bonnie:

# bonnie++ -u root -s 0 -f -x 200 -d /mnt/test -n 32

(this does 200 iterations) and got this for the file creations:

ext4 stock:   Average =  21206.8 files/s
ext4 inlined: Average =  22346.7 files/s  (+5%)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-05 01:46:37 -04:00
Theodore Ts'o 8b67f04ab9 ext4: Add mount options in superblock
Allow mount options to be stored in the superblock.  Also add default
mount option bits for nobarrier, block_validity, discard, and nodelalloc.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-01 23:14:20 -04:00
Eric Sandeen 79e8303677 ext4: fix ext4_get_blocks references
ext4_get_blocks got renamed to ext4_map_blocks, but left stale
comments and a prototype littered around.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:07 -04:00
jiayingz@google.com (Jiaying Zhang) 5b3ff237be ext4: move aio completion after unwritten extent conversion
This patch is to be applied upon Christoph's "direct-io: move aio_complete
into ->end_io" patch. It adds iocb and result fields to struct ext4_io_end_t,
so that we can call aio_complete from  ext4_end_io_nolock() after the extent
conversion has finished.

I have verified with Christoph's aio-dio test that used to fail after a few
runs on an original kernel but now succeeds on the patched kernel.

See http://thread.gmane.org/gmane.comp.file-systems.ext4/19659 for details.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:06 -04:00
Theodore Ts'o 89eeddf033 ext4: Define s_jnl_backup_type in superblock
This has been in use by e2fsprogs for a while; define it to keep the
super block fields in sync.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:04 -04:00
Theodore Ts'o 66e61a9e95 ext4: Once a day, printk file system error information to dmesg
This allows us to grab any file system error messages by scraping
/var/log/messages.  This will make it easy for us to do error analysis
across the very large number of machines as we deploy ext4 across the
fleet.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:04 -04:00
Theodore Ts'o 1c13d5c087 ext4: Save error information to the superblock for analysis
Save number of file system errors, and the time function name, line
number, block number, and inode number of the first and most recent
errors reported on the file system in the superblock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:03 -04:00
Theodore Ts'o c398eda0e4 ext4: Pass line numbers to ext4_error() and friends
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:40 -04:00
Theodore Ts'o 60fd4da34d ext4: Cleanup ext4_check_dir_entry so __func__ is now implicit
Also start passing the line number to ext4_check_dir since we're going
to need it in upcoming patch.
    
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:54:40 -04:00
Theodore Ts'o e29136f80e ext4: Enhance ext4_grp_locked_error() to take block and function numbers
Also use a macro definition so that __func__ and __LINE__ is implicit.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 12:54:28 -04:00
Theodore Ts'o c67d859e39 ext4: clean up ext4_abort() so __func__ is now implicit
Use a macro definition for ext4_abort() to clean up the .c files a wee
bit.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 11:07:07 -04:00