Commit Graph

151 Commits

Author SHA1 Message Date
Linus Torvalds a66d2c8f7e Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull the big VFS changes from Al Viro:
 "This one is *big* and changes quite a few things around VFS.  What's in there:

   - the first of two really major architecture changes - death to open
     intents.

     The former is finally there; it was very long in making, but with
     Miklos getting through really hard and messy final push in
     fs/namei.c, we finally have it.  Unlike his variant, this one
     doesn't introduce struct opendata; what we have instead is
     ->atomic_open() taking preallocated struct file * and passing
     everything via its fields.

     Instead of returning struct file *, it returns -E...  on error, 0
     on success and 1 in "deal with it yourself" case (e.g.  symlink
     found on server, etc.).

     See comments before fs/namei.c:atomic_open().  That made a lot of
     goodies finally possible and quite a few are in that pile:
     ->lookup(), ->d_revalidate() and ->create() do not get struct
     nameidata * anymore; ->lookup() and ->d_revalidate() get lookup
     flags instead, ->create() gets "do we want it exclusive" flag.

     With the introduction of new helper (kern_path_locked()) we are rid
     of all struct nameidata instances outside of fs/namei.c; it's still
     visible in namei.h, but not for long.  Come the next cycle,
     declaration will move either to fs/internal.h or to fs/namei.c
     itself.  [me, miklos, hch]

   - The second major change: behaviour of final fput().  Now we have
     __fput() done without any locks held by caller *and* not from deep
     in call stack.

     That obviously lifts a lot of constraints on the locking in there.
     Moreover, it's legal now to call fput() from atomic contexts (which
     has immediately simplified life for aio.c).  We also don't need
     anti-recursion logics in __scm_destroy() anymore.

     There is a price, though - the damn thing has become partially
     asynchronous.  For fput() from normal process we are guaranteed
     that pending __fput() will be done before the caller returns to
     userland, exits or gets stopped for ptrace.

     For kernel threads and atomic contexts it's done via
     schedule_work(), so theoretically we might need a way to make sure
     it's finished; so far only one such place had been found, but there
     might be more.

     There's flush_delayed_fput() (do all pending __fput()) and there's
     __fput_sync() (fput() analog doing __fput() immediately).  I hope
     we won't need them often; see warnings in fs/file_table.c for
     details.  [me, based on task_work series from Oleg merged last
     cycle]

   - sync series from Jan

   - large part of "death to sync_supers()" work from Artem; the only
     bits missing here are exofs and ext4 ones.  As far as I understand,
     those are going via the exofs and ext4 trees resp.; once they are
     in, we can put ->write_super() to the rest, along with the thread
     calling it.

   - preparatory bits from unionmount series (from dhowells).

   - assorted cleanups and fixes all over the place, as usual.

  This is not the last pile for this cycle; there's at least jlayton's
  ESTALE work and fsfreeze series (the latter - in dire need of fixes,
  so I'm not sure it'll make the cut this cycle).  I'll probably throw
  symlink/hardlink restrictions stuff from Kees into the next pile, too.
  Plus there's a lot of misc patches I hadn't thrown into that one -
  it's large enough as it is..."

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits)
  ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
  btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
  switch dentry_open() to struct path, make it grab references itself
  spufs: shift dget/mntget towards dentry_open()
  zoran: don't bother with struct file * in zoran_map
  ecryptfs: don't reinvent the wheels, please - use struct completion
  don't expose I_NEW inodes via dentry->d_inode
  tidy up namei.c a bit
  unobfuscate follow_up() a bit
  ext3: pass custom EOF to generic_file_llseek_size()
  ext4: use core vfs llseek code for dir seeks
  vfs: allow custom EOF in generic_file_llseek code
  vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
  vfs: Remove unnecessary flushing of block devices
  vfs: Make sys_sync writeout also block device inodes
  vfs: Create function for iterating over block devices
  vfs: Reorder operations during sys_sync
  quota: Move quota syncing to ->sync_fs method
  quota: Split dquot_quota_sync() to writeback and cache flushing part
  vfs: Move noop_backing_dev_info check from sync into writeback
  ...
2012-07-23 12:27:27 -07:00
Boaz Harrosh 537632e0a5 ore: Unlock r4w pages in exact reverse order of locking
The read-4-write pages are locked in address ascending order.
But where unlocked in a way easiest for coding. Fix that,
locks should be released in opposite order of locking, .i.e
descending address order.

I have not hit this dead-lock. It was found by inspecting the
dbug print-outs. I suspect there is an higher lock at caller that
protects us, but fix it regardless.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-07-20 11:49:25 +03:00
Boaz Harrosh 62b62ad873 ore: Remove support of partial IO request (NFS crash)
Do to OOM situations the ore might fail to allocate all resources
needed for IO of the full request. If some progress was possible
it would proceed with a partial/short request, for the sake of
forward progress.

Since this crashes NFS-core and exofs is just fine without it just
remove this contraption, and fail.

TODO:
	Support real forward progress with some reserved allocations
	of resources, such as mem pools and/or bio_sets

[Bug since 3.2 Kernel]
CC: Stable Tree <stable@kernel.org>
CC: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-07-20 11:47:43 +03:00
Boaz Harrosh 9ff19309a9 ore: Fix NFS crash by supporting any unaligned RAID IO
In RAID_5/6 We used to not permit an IO that it's end
byte is not stripe_size aligned and spans more than one stripe.
.i.e the caller must check if after submission the actual
transferred bytes is shorter, and would need to resubmit
a new IO with the remainder.

Exofs supports this, and NFS was supposed to support this
as well with it's short write mechanism. But late testing has
exposed a CRASH when this is used with none-RPC layout-drivers.

The change at NFS is deep and risky, in it's place the fix
at ORE to lift the limitation is actually clean and simple.
So here it is below.

The principal here is that in the case of unaligned IO on
both ends, beginning and end, we will send two read requests
one like old code, before the calculation of the first stripe,
and also a new site, before the calculation of the last stripe.
If any "boundary" is aligned or the complete IO is within a single
stripe. we do a single read like before.

The code is clean and simple by splitting the old _read_4_write
into 3 even parts:
1._read_4_write_first_stripe
2. _read_4_write_last_stripe
3. _read_4_write_execute

And calling 1+3 at the same place as before. 2+3 before last
stripe, and in the case of all in a single stripe then 1+2+3
is preformed additively.

Why did I not think of it before. Well I had a strike of
genius because I have stared at this code for 2 years, and did
not find this simple solution, til today. Not that I did not try.

This solution is much better for NFS than the previous supposedly
solution because the short write was dealt  with out-of-band after
IO_done, which would cause for a seeky IO pattern where as in here
we execute in order. At both solutions we do 2 separate reads, only
here we do it within a single IO request. (And actually combine two
writes into a single submission)

NFS/exofs code need not change since the ORE API communicates the new
shorter length on return, what will happen is that this case would not
occur anymore.

hurray!!

[Stable this is an NFS bug since 3.2 Kernel should apply cleanly]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-07-20 11:45:28 +03:00
Al Viro ebfc3b49a7 don't pass nameidata to ->create()
boolean "does it have to be exclusive?" flag is passed instead;
Local filesystem should just ignore it - the object is guaranteed
not to be there yet.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:34:47 +04:00
Al Viro 00cd8dd3bf stop passing nameidata to ->lookup()
Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument.  And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:34:32 +04:00
Randy Dunlap b84297197c exofs: fix sparse non-ANSI function warning
Fix sparse non-ANSI function warning:

  fs/exofs/sys.c:112:28: warning: non-ANSI function declaration of function 'exofs_sysfs_dbg_print'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-12 06:33:22 +03:00
Linus Torvalds a01ee165a1 Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
Pull exofs updates from Boaz Harrosh:
 "Just a couple of patches.  The first is a BUG fix destined for stable
  which missed the 3.4-rc7 Kernel.  The second is just a fixture
  addition so exofs is able to be better exported as a cluster file
  system via pNFS."

* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  exofs: Add SYSFS info for autologin/pNFS export
  exofs: Fix CRASH on very early IO errors.
2012-05-28 13:10:41 -07:00
Sachin Bhamare 8b56a30caa exofs: Add SYSFS info for autologin/pNFS export
Introduce sysfs infrastructure for exofs cluster filesystem.

Each OSD target shows up as below in the sysfs hierarchy:
	/sys/fs/exofs/<osdname>_<partition_id>/devX

Where <osdname>_<partition_id> is the unique identification
of a Superblock.

Where devX: 0 <= X < device_table_size. They are ordered
in device-table order as specified to the mkfs.exofs command

Each OSD device  devX has following attributes :
	osdname - ReadOnly
	systemid - ReadOnly
	uri - Read/Write

It is up to user-mode to update devX/uri for support of
autologin.

These sysfs information are used both for autologin as well
as support for exporting exofs via a pNFSD server in user-mode.
(.eg NFS-Ganesha)

Signed-off-by: Sachin Bhamare <sbhamare@panasas.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-05-21 12:24:01 +03:00
Boaz Harrosh 6abe4a87f7 exofs: Fix CRASH on very early IO errors.
If at exofs_fill_super() we had an early termination
do to any error, like an IO error while reading the
super-block. We would crash inside exofs_free_sbi().

This is because sbi->oc.numdevs was set to 1, before
we actually have a device table at all.

Fix it by moving the sbi->oc.numdevs = 1 to after the
allocation of the device table.

Reported-by: Johannes Schild <JSchild@gmx.de>

Stable: This is a bug since v3.2.0
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-05-20 19:42:41 +03:00
Jan Kara dbd5768f87 vfs: Rename end_writeback() to clear_inode()
After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode()
which well says what the function really does - set I_CLEAR flag.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-05-06 13:43:41 +08:00
Linus Torvalds afb9bd704c Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
Pull trivial exofs changes from Boaz Harrosh:
 "Just nothingness really.  The big exofs changes are reserved for the
  next merge window."

* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  exofs: Cap on the memcpy() size
  exofs: (trivial) Fix typo in super.c
  exofs: fix endian conversion in exofs_sync_fs()
2012-03-28 20:04:27 -07:00
Linus Torvalds e2a0883e40 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 1 from Al Viro:
 "This is _not_ all; in particular, Miklos' and Jan's stuff is not there
  yet."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
  ext4: initialization of ext4_li_mtx needs to be done earlier
  debugfs-related mode_t whack-a-mole
  hfsplus: add an ioctl to bless files
  hfsplus: change finder_info to u32
  hfsplus: initialise userflags
  qnx4: new helper - try_extent()
  qnx4: get rid of qnx4_bread/qnx4_getblk
  take removal of PF_FORKNOEXEC to flush_old_exec()
  trim includes in inode.c
  um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
  um: embed ->stub_pages[] into mmu_context
  gadgetfs: list_for_each_safe() misuse
  ocfs2: fix leaks on failure exits in module_init
  ecryptfs: make register_filesystem() the last potential failure exit
  ntfs: forgets to unregister sysctls on register_filesystem() failure
  logfs: missing cleanup on register_filesystem() failure
  jfs: mising cleanup on register_filesystem() failure
  make configfs_pin_fs() return root dentry on success
  configfs: configfs_create_dir() has parent dentry in dentry->d_parent
  configfs: sanitize configfs_create()
  ...
2012-03-21 13:36:41 -07:00
Al Viro 48fde701af switch open-coded instances of d_make_root() to new helper
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20 21:29:35 -04:00
Al Viro 8de5277879 vfs: check i_nlink limits in vfs_{mkdir,rename_dir,link}
New field of struct super_block - ->s_max_links.  Maximal allowed
value of ->i_nlink or 0; in the latter case all checks still need
to be done in ->link/->mkdir/->rename instances.  Note that this
limit applies both to directoris and to non-directories.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20 21:29:32 -04:00
Cong Wang bf7014b67f exofs: remove the second argument of k[un]map_atomic()
Ack-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:22 +08:00
Dan Carpenter 72749a270b exofs: Cap on the memcpy() size
This data comes from the device, so probably it's fairly trustworthy but
it makes the static checkers happy if we check it.

[Boaz]
  the system_id_len is zero, if not present, or always OSD_SYSTEMID_LEN.
  So always copy OSD_SYSTEMID_LEN bytes.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-03-19 13:39:12 -07:00
Masanari Iida 3e57638bb1 exofs: (trivial) Fix typo in super.c
Correct spelling "faild" to "failed" in
fs/exofs/super.c

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-03-19 13:39:12 -07:00
Dan Carpenter b6d1f2dd61 exofs: fix endian conversion in exofs_sync_fs()
fscb->s_numfiles is an __le64 field so we need to use cpu_to_le64()
to get a little endian 64 bit on big endian systems.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-03-19 13:39:11 -07:00
Linus Torvalds 9e203936ea Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  ore: Must support none-PAGE-aligned IO
  ore: fix BUG_ON, too few sgs when reading
  ore: Fix crash in case of an IO error.
  ore: FIX breakage when MISC_FILESYSTEMS is not set
2012-01-09 12:51:01 -08:00
Al Viro da01636a65 exofs: oops after late failure in mount
We have already set ->s_root, so ->put_super() is going to be called.
Freeing ->s_fs_info is a bloody bad idea when it's going to be
dereferenced very shortly...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-08 20:19:12 -05:00
Boaz Harrosh 724577ca35 ore: Must support none-PAGE-aligned IO
NFS might send us offsets that are not PAGE aligned. So
we must read in the reminder of the first/last pages, in cases
we need it for Parity calculations.

We only add an sg segments to read the partial page. But
we don't mark it as read=true because it is a lock-for-write
page.

TODO: In some cases (IO spans a single unit) we can just
adjust the raid_unit offset/length, but this is left for
later Kernels.

[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-01-08 10:43:13 +02:00
Boaz Harrosh 361aba569f ore: fix BUG_ON, too few sgs when reading
When reading RAID5 files, in rare cases, we calculated too
few sg segments. There should be two extra for the beginning
and end partial units.

Also "too few sg segments" should not be a BUG_ON there is
all the mechanics in place to handle it, as a short read.
So just return -ENOMEM and the rest of the code will gracefully
split the IO.

[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-01-06 16:49:07 +02:00
Boaz Harrosh ffefb8eaa3 ore: Fix crash in case of an IO error.
The users of ore_check_io() expect the reported device
(In case of error) to be indexed relative to the passed-in
ore_components table, and not the logical dev index.

This causes a crash inside objlayoutdriver in case of
an IO error.

[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-01-06 16:49:06 +02:00
Boaz Harrosh 831c2dc5f4 ore: FIX breakage when MISC_FILESYSTEMS is not set
As Reported by Randy Dunlap

When MISC_FILESYSTEMS is not enabled and NFS4.1 is:

fs/built-in.o: In function `objio_alloc_io_state':
objio_osd.c:(.text+0xcb525): undefined reference to `ore_get_rw_state'
fs/built-in.o: In function `_write_done':
objio_osd.c:(.text+0xcb58d): undefined reference to `ore_check_io'
fs/built-in.o: In function `_read_done':
...

When MISC_FILESYSTEMS, which is more of a GUI thing then anything else,
is not selected. exofs/Kconfig is never examined during Kconfig,
and it can not do it's magic stuff to automatically select everything
needed.

We must split exofs/Kconfig in two. The ore one is always included.
And the exofs one is left in it's old place in the menu.

[Needed for the 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-01-06 16:48:14 +02:00