Pull libnvdimm updates from Dan Williams:
"The bulk of this has appeared in -next and independently received a
build success notification from the kbuild robot. The 'for-4.5/block-
dax' topic branch was rebased over the weekend to drop the "block
device end-of-life" rework that Al would like to see re-implemented
with a notifier, and to address bug reports against the badblocks
integration.
There is pending feedback against "libnvdimm: Add a poison list and
export badblocks" received last week. Linda identified some localized
fixups that we will handle incrementally.
Summary:
- Media error handling: The 'badblocks' implementation that
originated in md-raid is up-levelled to a generic capability of a
block device. This initial implementation is limited to being
consulted in the pmem block-i/o path. Later, 'badblocks' will be
consulted when creating dax mappings.
- Raw block device dax: For virtualization and other cases that want
large contiguous mappings of persistent memory, add the capability
to dax-mmap a block device directly.
- Increased /dev/mem restrictions: Add an option to treat all
io-memory as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access
while a driver is actively using an address range. This behavior
is controlled via the new CONFIG_IO_STRICT_DEVMEM option and can be
overridden by the existing "iomem=relaxed" kernel command line
option.
- Miscellaneous fixes include a 'pfn'-device huge page alignment fix,
block device shutdown crash fix, and other small libnvdimm fixes"
* tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (32 commits)
block: kill disk_{check|set|clear|alloc}_badblocks
libnvdimm, pmem: nvdimm_read_bytes() badblocks support
pmem, dax: disable dax in the presence of bad blocks
pmem: fail io-requests to known bad blocks
libnvdimm: convert to statically allocated badblocks
libnvdimm: don't fail init for full badblocks list
block, badblocks: introduce devm_init_badblocks
block: clarify badblocks lifetime
badblocks: rename badblocks_free to badblocks_exit
libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h
libnvdimm: Add a poison list and export badblocks
nfit_test: Enable DSMs for all test NFITs
md: convert to use the generic badblocks code
block: Add badblock management for gendisks
badblocks: Add core badblock management code
block: fix del_gendisk() vs blkdev_ioctl crash
block: enable dax for raw block devices
block: introduce bdev_file_inode()
restrict /dev/mem to idle io memory ranges
arch: consolidate CONFIG_STRICT_DEVM in lib/Kconfig.debug
...
Pull misc vfs updates from Al Viro:
"All kinds of stuff. That probably should've been 5 or 6 separate
branches, but by the time I'd realized how large and mixed that bag
had become it had been too close to -final to play with rebasing.
Some fs/namei.c cleanups there, memdup_user_nul() introduction and
switching open-coded instances, burying long-dead code, whack-a-mole
of various kinds, several new helpers for ->llseek(), assorted
cleanups and fixes from various people, etc.
One piece probably deserves special mention - Neil's
lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
called without ->i_mutex and tries to avoid ever taking it. That, of
course, means that it's not useful for any directory modifications,
but things like getting inode attributes in nfds readdirplus are fine
with that. I really should've asked for moratorium on lookup-related
changes this cycle, but since I hadn't done that early enough... I
*am* asking for that for the coming cycle, though - I'm going to try
and get conversion of i_mutex to rwsem with ->lookup() done under lock
taken shared.
There will be a patch closer to the end of the window, along the lines
of the one Linus had posted last May - mechanical conversion of
->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
inode_is_locked()/inode_lock_nested(). To quote Linus back then:
-----
| This is an automated patch using
|
| sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
| sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
| sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
| sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
| sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
|
| with a very few manual fixups
-----
I'm going to send that once the ->i_mutex-affecting stuff in -next
gets mostly merged (or when Linus says he's about to stop taking
merges)"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
nfsd: don't hold i_mutex over userspace upcalls
fs:affs:Replace time_t with time64_t
fs/9p: use fscache mutex rather than spinlock
proc: add a reschedule point in proc_readfd_common()
logfs: constify logfs_block_ops structures
fcntl: allow to set O_DIRECT flag on pipe
fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
fs: xattr: Use kvfree()
[s390] page_to_phys() always returns a multiple of PAGE_SIZE
nbd: use ->compat_ioctl()
fs: use block_device name vsprintf helper
lib/vsprintf: add %*pg format specifier
fs: use gendisk->disk_name where possible
poll: plug an unused argument to do_poll
amdkfd: don't open-code memdup_user()
cdrom: don't open-code memdup_user()
rsxx: don't open-code memdup_user()
mtip32xx: don't open-code memdup_user()
[um] mconsole: don't open-code memdup_user_nul()
[um] hostaudio: don't open-code memdup_user()
...
Pull vfs copy_file_range updates from Al Viro:
"Several series around copy_file_range/CLONE"
* 'work.copy_file_range' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
btrfs: use new dedupe data function pointer
vfs: hoist the btrfs deduplication ioctl to the vfs
vfs: wire up compat ioctl for CLONE/CLONE_RANGE
cifs: avoid unused variable and label
nfsd: implement the NFSv4.2 CLONE operation
nfsd: Pass filehandle to nfs4_preprocess_stateid_op()
vfs: pull btrfs clone API to vfs layer
locks: new locks_mandatory_area calling convention
vfs: Add vfs_copy_file_range() support for pagecache copies
btrfs: add .copy_file_range file operation
x86: add sys_copy_file_range to syscall tables
vfs: add copy_file_range syscall and vfs helper
Pull file locking updates from Jeff Layton:
"File locking related changes for v4.5 (pile #1)
Highlights:
- new Kconfig option to allow disabling mandatory locking (which is
racy anyway)
- new tracepoints for setlk and close codepaths
- fix for a long-standing bug in code that handles races between
setting a POSIX lock and close()"
* tag 'locks-v4.5-1' of git://git.samba.org/jlayton/linux:
locks: rename __posix_lock_file to posix_lock_inode
locks: prink more detail when there are leaked locks
locks: pass inode pointer to locks_free_lock_context
locks: sprinkle some tracepoints around the file locking code
locks: don't check for race with close when setting OFD lock
locks: fix unlock when fcntl_setlk races with a close
fs: make locks.c explicitly non-modular
locks: use list_first_entry_or_null()
locks: Don't allow mounts in user namespaces to enable mandatory locking
locks: Allow disabling mandatory locking at compile time
Pull vfs RCU symlink updates from Al Viro:
"Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
even if the symlink is not an embedded one.
No changes since the mailbomb on Jan 1"
* 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch ->get_link() to delayed_call, kill ->put_link()
kill free_page_put_link()
teach nfs_get_link() to work in RCU mode
teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
teach shmem_get_link() to work in RCU mode
teach page_get_link() to work in RCU mode
replace ->follow_link() with new method that could stay in RCU mode
don't put symlink bodies in pagecache into highmem
namei: page_getlink() and page_follow_link_light() are the same thing
ufs: get rid of ->setattr() for symlinks
udf: don't duplicate page_symlink_inode_operations
logfs: don't duplicate page_symlink_inode_operations
switch befs long symlinks to page_symlink_operations
If an application wants exclusive access to all of the persistent memory
provided by an NVDIMM namespace it can use this raw-block-dax facility
to forgo establishing a filesystem. This capability is targeted
primarily to hypervisors wanting to provision persistent memory for
guests. It can be disabled / enabled dynamically via the new BLKDAXSET
ioctl.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reviewed-by: Jan Kara <jack@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
more systematic (FIDEDUPERANGE).
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new method: ->get_link(); replacement of ->follow_link(). The differences
are:
* inode and dentry are passed separately
* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.
It's a flagday change - the old method is gone, all in-tree instances
converted. Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode. That'll change
in the next commits.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
kmap() in page_follow_link_light() needed to go - allowing to hold
an arbitrary number of kmaps for long is a great way to deadlocking
the system.
new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlinks inodes; done for all in-tree cases. page_follow_link_light()
instrumented to yell about anything missed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The btrfs clone ioctls are now adopted by other file systems, with NFS
and CIFS already having support for them, and XFS being under active
development. To avoid growth of various slightly incompatible
implementations, add one to the VFS. Note that clones are different from
file copies in several ways:
- they are atomic vs other writers
- they support whole file clones
- they support 64-bit legth clones
- they do not allow partial success (aka short writes)
- clones are expected to be a fast metadata operation
Because of that it would be rather cumbersome to try to piggyback them on
top of the recent clone_file_range infrastructure. The converse isn't
true and the clone_file_range system call could try clone file range as
a first attempt to copy, something that further patches will enable.
Based on earlier work from Peng Tao.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pass a loff_t end for the last byte instead of the 32-bit count
parameter to allow full file clones even on 32-bit architectures.
While we're at it also simplify the read/write selection.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch makes is_sxid return bool to improve readability
due to this particular function only using either one or zero
as its return value.
No functional change.
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch makes is_bad_inode return bool to improve
readability due to this particular function only using either
one or zero as its return value.
No functional change.
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch makes is_subdir return bool to improve
readability due to this particular function only using either
one or zero as its return value.
No functional change.
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch makes path_is_under return bool to improve
readability due to this particular function only using either
one or zero as its return value.
No functional change.
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently when CONFIG_BLOCK is defined sb_is_blkdev_sb returns bool,
while when CONFIG_BLOCK is not defined it returns int. Let's keep
consistent to make sb_is_blkdev_sb return bool as well when CONFIG_BLOCK
isn't defined.
No functional change.
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a copy_file_range() system call for offloading copies between
regular files.
This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data. There are a few
candidates that should support copy offloading in the nearer term:
- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device
This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.
Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file). This can be
relaxed if we get implementations which can copy between file systems
safely.
Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Change -EINVAL to -EBADF during file verification,
Change flags parameter from int to unsigned int,
Add function to include/linux/syscalls.h,
Check copy len after file open mode,
Don't forbid ranges inside the same file,
Use rw_verify_area() to veriy ranges,
Use file_out rather than file_in,
Add COPY_FR_REFLINK flag]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Mandatory locking appears to be almost unused and buggy and there
appears no real interest in doing anything with it. Since effectively
no one uses the code and since the code is buggy let's allow it to be
disabled at compile time. I would just suggest removing the code but
undoubtedly that will break some piece of userspace code somewhere.
For the distributions that don't care about this piece of code
this gives a nice starting point to make mandatory locking go away.
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jeff Layton <jeff.layton@primarydata.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Pull vfs update from Al Viro:
- misc stable fixes
- trivial kernel-doc and comment fixups
- remove never-used block_page_mkwrite() wrapper function, and rename
the function that is _actually_ used to not have double underscores.
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: 9p: cache.h: Add #define of include guard
vfs: remove stale comment in inode_operations
vfs: remove unused wrapper block_page_mkwrite()
binfmt_elf: Correct `arch_check_elf's description
fs: fix writeback.c kernel-doc warnings
fs: fix inode.c kernel-doc warning
fs/pipe.c: return error code rather than 0 in pipe_write()
fs/pipe.c: preserve alloc_file() error code
binfmt_elf: Don't clobber passed executable's file header
FS-Cache: Handle a write to the page immediately beyond the EOF marker
cachefiles: perform test on s_blocksize when opening cache file.
FS-Cache: Don't override netfs's primary_index if registering failed
FS-Cache: Increase reference of parent after registering, netfs success
debugfs: fix refcount imbalance in start_creating