Commit Graph

335 Commits

Author SHA1 Message Date
Andrey Ryabinin
5bb53c0fb8 fs/block_dev: fix potential NULL ptr deref in freeze_bdev()
Calling freeze_bdev() twice on the same block device without mounted
filesystem get_super() will return NULL, which will lead to NULL-ptr
dereference later in drop_super().

Check get_super() result to fix that.

Note, that this is a purely theoretical issue. We have only 3
freeze_bdev() callers. 2 of them are in filesystem code and used on a
device with mounted fs. The third one in lock_fs() has protection in
upper-layer code against freezing block device the second time without
thawing it first.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-25 08:38:26 -06:00
Vegard Nossum
e9e5e3fae8 bdev: fix NULL pointer dereference
I got this:

    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    Dumping ftrace buffer:
       (ftrace buffer empty)
    CPU: 0 PID: 5505 Comm: syz-executor Not tainted 4.8.0-rc2+ #161
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
    task: ffff880113415940 task.stack: ffff880118350000
    RIP: 0010:[<ffffffff8172cb32>]  [<ffffffff8172cb32>] bd_mount+0x52/0xa0
    RSP: 0018:ffff880118357ca0  EFLAGS: 00010207
    RAX: dffffc0000000000 RBX: ffffffffffffffff RCX: ffffc90000bb6000
    RDX: 0000000000000018 RSI: ffffffff846d6b20 RDI: 00000000000000c7
    RBP: ffff880118357cb0 R08: ffff880115967c68 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801188211e8
    R13: ffffffff847baa20 R14: ffff8801139cb000 R15: 0000000000000080
    FS:  00007fa3ff6c0700(0000) GS:ffff88011aa00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc1d8cc7e78 CR3: 0000000109f20000 CR4: 00000000000006f0
    DR0: 000000000000001e DR1: 000000000000001e DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
    Stack:
     ffff880112cfd6c0 ffff8801188211e8 ffff880118357cf0 ffffffff8167f207
     ffffffff816d7a1e ffff880112a413c0 ffffffff847baa20 ffff8801188211e8
     0000000000000080 ffff880112cfd6c0 ffff880118357d38 ffffffff816dce0a
    Call Trace:
     [<ffffffff8167f207>] mount_fs+0x97/0x2e0
     [<ffffffff816d7a1e>] ? alloc_vfsmnt+0x55e/0x760
     [<ffffffff816dce0a>] vfs_kern_mount+0x7a/0x300
     [<ffffffff83c3247c>] ? _raw_read_unlock+0x2c/0x50
     [<ffffffff816dfc87>] do_mount+0x3d7/0x2730
     [<ffffffff81235fd4>] ? trace_do_page_fault+0x1f4/0x3a0
     [<ffffffff816df8b0>] ? copy_mount_string+0x40/0x40
     [<ffffffff8161ea81>] ? memset+0x31/0x40
     [<ffffffff816df73e>] ? copy_mount_options+0x1ee/0x320
     [<ffffffff816e2a02>] SyS_mount+0xb2/0x120
     [<ffffffff816e2950>] ? copy_mnt_ns+0x970/0x970
     [<ffffffff81005524>] do_syscall_64+0x1c4/0x4e0
     [<ffffffff83c3282a>] entry_SYSCALL64_slow_path+0x25/0x25
    Code: 83 e8 63 1b fc ff 48 85 c0 48 89 c3 74 4c e8 56 35 d1 ff 48 8d bb c8 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 36 4c 8b a3 c8 00 00 00 48 b8 00 00 00 00 00 fc
    RIP  [<ffffffff8172cb32>] bd_mount+0x52/0xa0
     RSP <ffff880118357ca0>
    ---[ end trace 13690ad962168b98 ]---

mount_pseudo() returns ERR_PTR(), not NULL, on error.

Fixes: 3684aa7099 ("block-dev: enable writeback cgroup support")
Cc: Shaohua Li <shli@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: stable@vger.kernel.org
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-22 08:06:15 -06:00
Jens Axboe
c11f0c0b5b block/mm: make bdev_ops->rw_page() take a bool for read/write
Commit abf545484d changed it from an 'rw' flags type to the
newer ops based interface, but now we're effectively leaking
some bdev internals to the rest of the kernel. Since we only
care about whether it's a read or a write at that level, just
pass in a bool 'is_write' parameter instead.

Then we can also move op_is_write() and friends back under
CONFIG_BLOCK protection.

Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Mike Christie
abf545484d mm/block: convert rw_page users to bio op use
The rw_page users were not converted to use bio/req ops. As a result
bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
be sent down as reads.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixes: 4e1b2d52a8 ("block, fs, drivers: remove REQ_OP compat defs and related code")

Modified by me to:

1) Drop op_flags passing into ->rw_page(), as we don't use it.
2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:25:33 -06:00
Ross Zwisler
99a01cdf9d block: remove BLK_DEV_DAX config option
The functionality for block device DAX was already removed with commit
acc93d30d7 ("Revert "block: enable dax for raw block devices"")

However, we still had a config option hanging around that was always
disabled because it depended on CONFIG_BROKEN.  This config option was
introduced in commit 03cdadb040 ("block: disable block device DAX by
default")

This change reverts that commit, removing the dead config option.

Link: http://lkml.kernel.org/r/20160729182314.6368-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-04 08:50:07 -04:00
Linus Torvalds
a867d7349e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull userns vfs updates from Eric Biederman:
 "This tree contains some very long awaited work on generalizing the
  user namespace support for mounting filesystems to include filesystems
  with a backing store.  The real world target is fuse but the goal is
  to update the vfs to allow any filesystem to be supported.  This
  patchset is based on a lot of code review and testing to approach that
  goal.

  While looking at what is needed to support the fuse filesystem it
  became clear that there were things like xattrs for security modules
  that needed special treatment.  That the resolution of those concerns
  would not be fuse specific.  That sorting out these general issues
  made most sense at the generic level, where the right people could be
  drawn into the conversation, and the issues could be solved for
  everyone.

  At a high level what this patchset does a couple of simple things:

   - Add a user namespace owner (s_user_ns) to struct super_block.

   - Teach the vfs to handle filesystem uids and gids not mapping into
     to kuids and kgids and being reported as INVALID_UID and
     INVALID_GID in vfs data structures.

  By assigning a user namespace owner filesystems that are mounted with
  only user namespace privilege can be detected.  This allows security
  modules and the like to know which mounts may not be trusted.  This
  also allows the set of uids and gids that are communicated to the
  filesystem to be capped at the set of kuids and kgids that are in the
  owning user namespace of the filesystem.

  One of the crazier corner casees this handles is the case of inodes
  whose i_uid or i_gid are not mapped into the vfs.  Most of the code
  simply doesn't care but it is easy to confuse the inode writeback path
  so no operation that could cause an inode write-back is permitted for
  such inodes (aka only reads are allowed).

  This set of changes starts out by cleaning up the code paths involved
  in user namespace permirted mounts.  Then when things are clean enough
  adds code that cleanly sets s_user_ns.  Then additional restrictions
  are added that are possible now that the filesystem superblock
  contains owner information.

  These changes should not affect anyone in practice, but there are some
  parts of these restrictions that are changes in behavior.

   - Andy's restriction on suid executables that does not honor the
     suid bit when the path is from another mount namespace (think
     /proc/[pid]/fd/) or when the filesystem was mounted by a less
     privileged user.

   - The replacement of the user namespace implicit setting of MNT_NODEV
     with implicitly setting SB_I_NODEV on the filesystem superblock
     instead.

     Using SB_I_NODEV is a stronger form that happens to make this state
     user invisible.  The user visibility can be managed but it caused
     problems when it was introduced from applications reasonably
     expecting mount flags to be what they were set to.

  There is a little bit of work remaining before it is safe to support
  mounting filesystems with backing store in user namespaces, beyond
  what is in this set of changes.

   - Verifying the mounter has permission to read/write the block device
     during mount.

   - Teaching the integrity modules IMA and EVM to handle filesystems
     mounted with only user namespace root and to reduce trust in their
     security xattrs accordingly.

   - Capturing the mounters credentials and using that for permission
     checks in d_automount and the like.  (Given that overlayfs already
     does this, and we need the work in d_automount it make sense to
     generalize this case).

  Furthermore there are a few changes that are on the wishlist:

   - Get all filesystems supporting posix acls using the generic posix
     acls so that posix_acl_fix_xattr_from_user and
     posix_acl_fix_xattr_to_user may be removed.  [Maintainability]

   - Reducing the permission checks in places such as remount to allow
     the superblock owner to perform them.

   - Allowing the superblock owner to chown files with unmapped uids and
     gids to something that is mapped so the files may be treated
     normally.

  I am not considering even obvious relaxations of permission checks
  until it is clear there are no more corner cases that need to be
  locked down and handled generically.

  Many thanks to Seth Forshee who kept this code alive, and putting up
  with me rewriting substantial portions of what he did to handle more
  corner cases, and for his diligent testing and reviewing of my
  changes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
  fs: Call d_automount with the filesystems creds
  fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
  evm: Translate user/group ids relative to s_user_ns when computing HMAC
  dquot: For now explicitly don't support filesystems outside of init_user_ns
  quota: Handle quota data stored in s_user_ns in quota_setxquota
  quota: Ensure qids map to the filesystem
  vfs: Don't create inodes with a uid or gid unknown to the vfs
  vfs: Don't modify inodes with a uid or gid unknown to the vfs
  cred: Reject inodes with invalid ids in set_create_file_as()
  fs: Check for invalid i_uid in may_follow_link()
  vfs: Verify acls are valid within superblock's s_user_ns.
  userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
  fs: Refuse uid/gid changes which don't map into s_user_ns
  selinux: Add support for unprivileged mounts from user namespaces
  Smack: Handle labels consistently in untrusted mounts
  Smack: Add support for unprivileged mounts from user namespaces
  fs: Treat foreign mounts as nosuid
  fs: Limit file caps to the user namespace of the super block
  userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
  userns: Remove implicit MNT_NODEV fragility.
  ...
2016-07-29 15:54:19 -07:00
Linus Torvalds
6784725ab0 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Assorted cleanups and fixes.

  Probably the most interesting part long-term is ->d_init() - that will
  have a bunch of followups in (at least) ceph and lustre, but we'll
  need to sort the barrier-related rules before it can get used for
  really non-trivial stuff.

  Another fun thing is the merge of ->d_iput() callers (dentry_iput()
  and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
  except the one in __d_lookup_lru())"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
  fs/dcache.c: avoid soft-lockup in dput()
  vfs: new d_init method
  vfs: Update lookup_dcache() comment
  bdev: get rid of ->bd_inodes
  Remove last traces of ->sync_page
  new helper: d_same_name()
  dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
  vfs: clean up documentation
  vfs: document ->d_real()
  vfs: merge .d_select_inode() into .d_real()
  unify dentry_iput() and dentry_unlink_inode()
  binfmt_misc: ->s_root is not going anywhere
  drop redundant ->owner initializations
  ufs: get rid of redundant checks
  orangefs: constify inode_operations
  missed comment updates from ->direct_IO() prototype change
  file_inode(f)->i_mapping is f->f_mapping
  trim fsnotify hooks a bit
  9p: new helper - v9fs_parent_fid()
  debugfs: ->d_parent is never NULL or negative
  ...
2016-07-28 12:59:05 -07:00
Toshi Kani
163d4baaeb block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
Currently, presence of direct_access() in block_device_operations
indicates support of DAX on its block device.  Because
block_device_operations is instantiated with 'const', this DAX
capablity may not be enabled conditinally.

In preparation for supporting DAX to device-mapper devices, add
QUEUE_FLAG_DAX to request_queue flags to advertise their DAX
support.  This will allow to set the DAX capability based on how
mapped device is composed.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <linux-s390@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 21:01:01 -06:00
Al Viro
a4a4f9439c bdev: get rid of ->bd_inodes
Since 2006 we have ->i_bdev pinning bdev in question, so there's no
way to get to bdev ->evict_inode() while there's an aliasing inode
anywhere.  In other words, the only place walking the list of aliases
is guaranteed to do it only when the list is empty...

Remove the detritus; it should've been done in "[PATCH] Fix a race
condition between ->i_mapping and iput()", but nobody had noticed it
back then.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-19 13:16:52 -04:00
Eric W. Biederman
a2982cc922 vfs: Generalize filesystem nodev handling.
Introduce a function may_open_dev that tests MNT_NODEV and a new
superblock flab SB_I_NODEV.  Use this new function in all of the
places where MNT_NODEV was previously tested.

Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
filesystems should never support device nodes, and a simple superblock
flags makes that very hard to get wrong.  With SB_I_NODEV set if any
device nodes somehow manage to show up on on a filesystem those
device nodes will be unopenable.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:41:57 -05:00
Linus Torvalds
315227f6da Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull misc DAX updates from Vishal Verma:
 "DAX error handling for 4.7

   - Until now, dax has been disabled if media errors were found on any
     device.  This enables the use of DAX in the presence of these
     errors by making all sector-aligned zeroing go through the driver.

   - The driver (already) has the ability to clear errors on writes that
     are sent through the block layer using 'DSMs' defined in ACPI 6.1.

  Other misc changes:

   - When mounting DAX filesystems, check to make sure the partition is
     page aligned.  This is a requirement for DAX, and previously, we
     allowed such unaligned mounts to succeed, but subsequent
     reads/writes would fail.

   - Misc/cleanup fixes from Jan that remove unused code from DAX
     related to zeroing, writeback, and some size checks"

* tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: fix a comment in dax_zero_page_range and dax_truncate_page
  dax: for truncate/hole-punch, do zeroing through the driver if possible
  dax: export a low-level __dax_zero_page_range helper
  dax: use sb_issue_zerout instead of calling dax_clear_sectors
  dax: enable dax in the presence of known media errors (badblocks)
  dax: fallback from pmd to pte on error
  block: Update blkdev_dax_capable() for consistency
  xfs: Add alignment check for DAX mount
  ext2: Add alignment check for DAX mount
  ext4: Add alignment check for DAX mount
  block: Add bdev_dax_supported() for dax mount checks
  block: Add vfs_msg() interface
  dax: Remove redundant inode size checks
  dax: Remove pointless writeback from dax_do_io()
  dax: Remove zeroing from dax_io()
  dax: Remove dead zeroing code from fault handlers
  ext2: Avoid DAX zeroing to corrupt data
  ext2: Fix block zeroing in ext2_get_blocks() for DAX
  dax: Remove complete_unwritten argument
  DAX: move RADIX_DAX_ definitions to dax.c
2016-05-26 19:34:26 -07:00
Linus Torvalds
1f40c49570 Merge tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
 "The bulk of this update was stabilized before the merge window and
  appeared in -next.  The "device dax" implementation was revised this
  week in response to review feedback, and to address failures detected
  by the recently expanded ndctl unit test suite.

  Not included in this pull request are two dax topic branches (dax
  error handling, and dax radix-tree locking).  These topics were
  deferred to get a few more days of -next integration testing, and to
  coordinate a branch baseline with Ted and the ext4 tree.  Vishal and
  Ross will send the error handling and locking topics respectively in
  the next few days.

  This branch has received a positive build result from the kbuild robot
  across 226 configs.

  Summary:

   - Device DAX for persistent memory: Device DAX is the device-centric
     analogue of Filesystem DAX (CONFIG_FS_DAX).  It allows memory
     ranges to be allocated and mapped without need of an intervening
     file system.  Device DAX is strict, precise and predictable.
     Specifically this interface:

      a) Guarantees fault granularity with respect to a given page size
         (pte, pmd, or pud) set at configuration time.

      b) Enforces deterministic behavior by being strict about what
         fault scenarios are supported.

     Persistent memory is the first target, but the mechanism is also
     targeted for exclusive allocations of performance/feature
     differentiated memory ranges.

   - Support for the HPE DSM (device specific method) command formats.
     This enables management of these first generation devices until a
     unified DSM specification materializes.

   - Further ACPI 6.1 compliance with support for the common dimm
     identifier format.

   - Various fixes and cleanups across the subsystem"

* tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (40 commits)
  libnvdimm, dax: fix deletion
  libnvdimm, dax: fix alignment validation
  libnvdimm, dax: autodetect support
  libnvdimm: release ida resources
  Revert "block: enable dax for raw block devices"
  /dev/dax, core: file operations and dax-mmap
  /dev/dax, pmem: direct access to persistent memory
  libnvdimm: stop requiring a driver ->remove() method
  libnvdimm, dax: record the specified alignment of a dax-device instance
  libnvdimm, dax: reserve space to store labels for device-dax
  libnvdimm, dax: introduce device-dax infrastructure
  nfit: add sysfs dimm 'family' and 'dsm_mask' attributes
  tools/testing/nvdimm: ND_CMD_CALL support
  nfit: disable vendor specific commands
  nfit: export subsystem ids as attributes
  nfit: fix format interface code byte order per ACPI6.1
  nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism
  nfit, libnvdimm: clarify "commands" vs "_DSMs"
  libnvdimm: increase max envelope size for ioctl
  acpi/nfit: Add sysfs "id" for NVDIMM ID
  ...
2016-05-23 11:18:01 -07:00
Dan Williams
acc93d30d7 Revert "block: enable dax for raw block devices"
This reverts commit 5a023cdba5.

The functionality is superseded by the new "Device DAX" facility.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jan Kara <jack@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-20 22:02:56 -07:00
Dan Williams
0a70bd4305 dax: enable dax in the presence of known media errors (badblocks)
1/ If a mapping overlaps a bad sector fail the request.

2/ Do not opportunistically report more dax-capable capacity than is
   requested when errors present.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[vishal: fix a conflict with system RAM collision patches]
[vishal: add a 'size' parameter to ->direct_access]
[vishal: fix a conflict with DAX alignment check patches]
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-18 12:16:56 -06:00
Toshi Kani
a8078b1fc6 block: Update blkdev_dax_capable() for consistency
blkdev_dax_capable() is similar to bdev_dax_supported(), but needs
to remain as a separate interface for checking dax capability of
a raw block device.

Rename and relocate blkdev_dax_capable() to keep them maintained
consistently, and call bdev_direct_access() for the dax capability
check.

There is no change in the behavior.

Link: https://lkml.org/lkml/2016/5/9/950
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@fb.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17 00:44:13 -06:00
Toshi Kani
2d96afc8f7 block: Add bdev_dax_supported() for dax mount checks
DAX imposes additional requirements to a device.  Add
bdev_dax_supported() which performs all the precondition checks
necessary for filesystem to mount the device with dax option.

Also add a new check to verify if a partition is aligned by 4KB.
When a partition is unaligned, any dax read/write access fails,
except for metadata update.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@fb.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17 00:44:11 -06:00
Toshi Kani
2af3a8159c block: Add vfs_msg() interface
In preparation of moving DAX capability checks to the block layer
from filesystem code, add a VFS message interface that aligns with
filesystem's message format.

For instance, a vfs_msg() message followed by XFS messages in case
of a dax mount error may look like:

  VFS (pmem0p1): error: unaligned partition for dax
  XFS (pmem0p1): DAX unsupported by block device. Turning off DAX.
  XFS (pmem0p1): Mounting V5 Filesystem
   :

vfs_msg() is largely based on ext4_msg().

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@fb.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17 00:44:10 -06:00
Jan Kara
02fbd13975 dax: Remove complete_unwritten argument
Fault handlers currently take complete_unwritten argument to convert
unwritten extents after PTEs are updated. However no filesystem uses
this anymore as the code is racy. Remove the unused argument.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-16 18:11:51 -06:00
Christoph Hellwig
e259221763 fs: simplify the generic_write_sync prototype
The kiocb already has the new position, so use that.  The only interesting
case is AIO, where we currently don't bother updating ki_pos.  We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.

While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
dde0c2e798 fs: add IOCB_SYNC and IOCB_DSYNC
This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.

XXX: Will need a few additional audits for O_DSYNC

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
c8b8e32d70 direct-io: eliminate the offset argument to ->direct_IO
Including blkdev_direct_IO and dax_do_io.  It has to be ki_pos to actually
work, so eliminate the superflous argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Linus Torvalds
35d88d97be Merge branch 'for-4.6/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "Here are the core block changes for this merge window.  Not a lot of
  exciting stuff going on in this round, most of the changes have been
  on the driver side of things.  That pull request is coming next.  This
  pull request contains:

   - A set of fixes for chained bio handling from Christoph.

   - A tag bounds check for blk-mq from Hannes, ensuring that we don't
     do something stupid if a device reports an invalid tag value.

   - A set of fixes/updates for the CFQ IO scheduler from Jan Kara.

   - A set of blk-mq fixes from Keith, adding support for dynamic
     hardware queues, and fixing init of max_dev_sectors for stacking
     devices.

   - A fix for the dynamic hw context from Ming.

   - Enabling of cgroup writeback support on a block device, from
     Shaohua"

* 'for-4.6/core' of git://git.kernel.dk/linux-block:
  blk-mq: add bounds check on tag-to-rq conversion
  block: bio_remaining_done() isn't unlikely
  block: cleanup bio_endio
  block: factor out chained bio completion
  block: don't unecessarily clobber bi_error for chained bios
  block-dev: enable writeback cgroup support
  blk-mq: Fix NULL pointer updating nr_requests
  blk-mq: mark request queue as mq asap
  block: Initialize max_dev_sectors to 0
  blk-mq: dynamic h/w context count
  cfq-iosched: Allow parent cgroup to preempt its child
  cfq-iosched: Allow sync noidle workloads to preempt each other
  cfq-iosched: Reorder checks in cfq_should_preempt()
  cfq-iosched: Don't group_idle if cfqq has big thinktime
2016-03-18 16:43:11 -07:00
Shaohua Li
3684aa7099 block-dev: enable writeback cgroup support
block_dev's .writepages/.writepage already handles
wbc_init_bio/wbc_account_io. We only set the SB_I_CGROUPWB bit to
suppport writeback cgroup support.

Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:50:53 -07:00
Ross Zwisler
7f6d5b529b dax: move writeback calls into the filesystems
Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

dax_writeback_mapping_range() needs a struct block_device, and it used
to get that from inode->i_sb->s_bdev.  This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.

Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device.  This also fixes DAX code to properly flush caches in response
to sync(2).

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00