Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
"This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
As Deepa writes:
'The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64
timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual replacement
becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions'
Thomas Gleixner adds:
'I think there is no point to drag that out for the next merge
window. The whole thing needs to be done in one go for the core
changes which means that you're going to play that catchup game
forever. Let's get over with it towards the end of the merge window'"
* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
pstore: Remove bogus format string definition
vfs: change inode times to use struct timespec64
pstore: Convert internal records to timespec64
udf: Simplify calls to udf_disk_stamp_to_time
fs: nfs: get rid of memcpys for inode times
ceph: make inode time prints to be long long
lustre: Use long long type to print inode time
fs: add timespec64_truncate()
Pull overlayfs fixes from Miklos Szeredi:
"This contains a fix for the vfs_mkdir() issue discovered by Al, as
well as other fixes and cleanups"
* tag 'ovl-fixes-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: use inode_insert5() to hash a newly created inode
ovl: Pass argument to ovl_get_inode() in a structure
vfs: factor out inode_insert5()
ovl: clean up copy-up error paths
ovl: return EIO on internal error
ovl: make ovl_create_real() cope with vfs_mkdir() safely
ovl: create helper ovl_create_temp()
ovl: return dentry from ovl_create_real()
ovl: struct cattr cleanups
ovl: strip debug argument from ovl_do_ helpers
ovl: remove WARN_ON() real inode attributes mismatch
ovl: Kconfig documentation fixes
ovl: update documentation for unionmount-testsuite
Split out common helper for race free insertion of an already allocated
inode into the cache. Use this from iget5_locked() and
insert_inode_locked4(). Make iget5_locked() use new_inode()/iput() instead
of alloc_inode()/destroy_inode() directly.
Also export to modules for use by filesystems which want to preallocate an
inode before file/directory creation.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In inode_init_always(), we clear the inode mapping flags, which clears
any retained error (AS_EIO, AS_ENOSPC) bits. Unfortunately, we do not
also clear wb_err, which means that old mapping errors can leak through
to new inodes.
This is crucial for the XFS inode allocation path because we recycle old
in-core inodes and we do not want error state from an old file to leak
into the new file. This bug was discovered by running generic/036 and
generic/047 in a loop and noticing that the EIOs generated by the
collision of direct and buffered writes in generic/036 would survive the
remount between 036 and 047, and get reported to the fsyncs (on
different files!) in generic/047.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
As vfs moves to using struct timespec64 to represent times,
update the argument to timespec_truncate() to use
struct timespec64. Also change the name of the function.
The rest of the implementation logic is the same.
Move this to fs/inode.c instead of kernel/time/time.c as all the
users of this api are filesystems.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <viro@zeniv.linux.org.uk>
Noticed when looking at why cycling 600k inodes/s through the inode
cache was taking a total of 8% cpu in memset() during inode
initialisation. There is no need to zero the inode.i_data structure
twice.
This increases single threaded bulkstat throughput from ~200,000
inodes/s to ~220,000 inodes/s, so we save a substantial amount of
CPU time per inode init by doing this.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
__mark_inode_dirty already takes care of that, and for the XFS lazytime
implementation we need to know that ->dirty_inode was called because
I_DIRTY_TIME was set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We only really need to update i_version if someone has queried for it
since we last incremented it. By doing that, we can avoid having to
update the inode if the times haven't changed.
If the times have changed, then we go ahead and forcibly increment the
counter, under the assumption that we'll be going to the storage
anyway, and the increment itself is relatively cheap.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.
We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.
Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
This is a pure automated search-and-replace of the internal kernel
superblock flags.
The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.
Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.
The script to do this was:
# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants
SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
ACTIVE NOUSER"
SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
for f in $L; do sed -i $f $SED_PROG; done
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull overlayfs updates from Miklos Szeredi:
"This fixes d_ino correctness in readdir, which brings overlayfs on par
with normal filesystems regarding inode number semantics, as long as
all layers are on the same filesystem.
There are also some bug fixes, one in particular (random ioctl's
shouldn't be able to modify lower layers) that touches some vfs code,
but of course no-op for non-overlay fs"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix false positive ESTALE on lookup
ovl: don't allow writing ioctl on lower layer
ovl: fix relatime for directories
vfs: add flags to d_real()
ovl: cleanup d_real for negative
ovl: constant d_ino for non-merge dirs
ovl: constant d_ino across copy up
ovl: fix readdir error value
ovl: check snprintf return
Need to treat non-regular overlayfs files the same as regular files when
checking for an atime update.
Add a d_real() flag to make it return the upper dentry for all file types.
Reported-by: "zhangyi (F)" <yi.zhang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When we introduced the bmap redo log items, we set MS_ACTIVE on the
mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
from being truncated prematurely during log recovery. This also had the
effect of putting linked inodes on the lru instead of evicting them.
Unfortunately, we neglected to find all those unreferenced lru inodes
and evict them after finishing log recovery, which means that we leak
them if anything goes wrong in the rest of xfs_mountfs, because the lru
is only cleaned out on unmount.
Therefore, evict unreferenced inodes in the lru list immediately
after clearing MS_ACTIVE.
Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: viro@ZenIV.linux.org.uk
Reviewed-by: Brian Foster <bfoster@redhat.com>
Pull misc filesystem updates from Al Viro:
"Assorted normal VFS / filesystems stuff..."
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
dentry name snapshots
Make statfs properly return read-only state after emergency remount
fs/dcache: init in_lookup_hashtable
minix: Deinline get_block, save 2691 bytes
fs: Reorder inode_owner_or_capable() to avoid needless
fs: warn in case userspace lied about modprobe return
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Add the SYSTEM_SCHEDULING bootup state to move various scheduler
debug checks earlier into the bootup. This turns silent and
sporadically deadly bugs into nice, deterministic splats. Fix some
of the splats that triggered. (Thomas Gleixner)
- A round of restructuring and refactoring of the load-balancing and
topology code (Peter Zijlstra)
- Another round of consolidating ~20 of incremental scheduler code
history: this time in terms of wait-queue nomenclature. (I didn't
get much feedback on these renaming patches, and we can still
easily change any names I might have misplaced, so if anyone hates
a new name, please holler and I'll fix it.) (Ingo Molnar)
- sched/numa improvements, fixes and updates (Rik van Riel)
- Another round of x86/tsc scheduler clock code improvements, in hope
of making it more robust (Peter Zijlstra)
- Improve NOHZ behavior (Frederic Weisbecker)
- Deadline scheduler improvements and fixes (Luca Abeni, Daniel
Bristot de Oliveira)
- Simplify and optimize the topology setup code (Lauro Ramos
Venancio)
- Debloat and decouple scheduler code some more (Nicolas Pitre)
- Simplify code by making better use of llist primitives (Byungchul
Park)
- ... plus other fixes and improvements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
sched/cputime: Refactor the cputime_adjust() code
sched/debug: Expose the number of RT/DL tasks that can migrate
sched/numa: Hide numa_wake_affine() from UP build
sched/fair: Remove effective_load()
sched/numa: Implement NUMA node level wake_affine()
sched/fair: Simplify wake_affine() for the single socket case
sched/numa: Override part of migrate_degrades_locality() when idle balancing
sched/rt: Move RT related code from sched/core.c to sched/rt.c
sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
sched/fair: Spare idle load balancing on nohz_full CPUs
nohz: Move idle balancer registration to the idle path
sched/loadavg: Generalize "_idle" naming to "_nohz"
sched/core: Drop the unused try_get_task_struct() helper function
sched/fair: WARN() and refuse to set buddy when !se->on_rq
sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
...
Checking for capabilities should be the last operation when performing
access control tests so that PF_SUPERPRIV is set only when it was required
for success (implying that the capability was needed for the operation).
Reported-by: Solar Designer <solar@openwall.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Define a set of write life time hints:
RWH_WRITE_LIFE_NOT_SET No hint information set
RWH_WRITE_LIFE_NONE No hints about write life time
RWH_WRITE_LIFE_SHORT Data written has a short life time
RWH_WRITE_LIFE_MEDIUM Data written has a medium life time
RWH_WRITE_LIFE_LONG Data written has a long life time
RWH_WRITE_LIFE_EXTREME Data written has an extremely long life time
The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.
Add an fcntl interface for querying these flags, and also for
setting them as well:
F_GET_RW_HINT Returns the read/write hint set on the
underlying inode.
F_SET_RW_HINT Set one of the above write hints on the
underlying inode.
F_GET_FILE_RW_HINT Returns the read/write hint set on the
file descriptor.
F_SET_FILE_RW_HINT Set one of the above write hints on the
file descriptor.
The user passes in a 64-bit pointer to get/set these values, and
the interface returns 0/-1 on success/error.
Sample program testing/implementing basic setting/getting of write
hints is below.
Add support for storing the write life time hint in the inode flags
and in struct file as well, and pass them to the kiocb flags. If
both a file and its corresponding inode has a write hint, then we
use the one in the file, if available. The file hint can be used
for sync/direct IO, for buffered writeback only the inode hint
is available.
This is in preparation for utilizing these hints in the block layer,
to guide on-media data placement.
/*
* writehint.c: get or set an inode write hint
*/
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdbool.h>
#include <inttypes.h>
#ifndef F_GET_RW_HINT
#define F_LINUX_SPECIFIC_BASE 1024
#define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11)
#define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
#endif
static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
"RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
"RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };
int main(int argc, char *argv[])
{
uint64_t hint;
int fd, ret;
if (argc < 2) {
fprintf(stderr, "%s: file <hint>\n", argv[0]);
return 1;
}
fd = open(argv[1], O_RDONLY);
if (fd < 0) {
perror("open");
return 2;
}
if (argc > 2) {
hint = atoi(argv[2]);
ret = fcntl(fd, F_SET_RW_HINT, &hint);
if (ret < 0) {
perror("fcntl: F_SET_RW_HINT");
return 4;
}
}
ret = fcntl(fd, F_GET_RW_HINT, &hint);
if (ret < 0) {
perror("fcntl: F_GET_RW_HINT");
return 3;
}
printf("%s: hint %s\n", argv[1], str[hint]);
close(fd);
return 0;
}
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull misc vfs updates from Al Viro:
"Assorted bits and pieces from various people. No common topic in this
pile, sorry"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/affs: add rename exchange
fs/affs: add rename2 to prepare multiple methods
Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()
fs: don't set *REFERENCED on single use objects
fs: compat: Remove warning from COMPATIBLE_IOCTL
remove pointless extern of atime_need_update_rcu()
fs: completely ignore unknown open flags
fs: add a VALID_OPEN_FLAGS
fs: remove _submit_bh()
fs: constify tree_descr arrays passed to simple_fill_super()
fs: drop duplicate header percpu-rwsem.h
fs/affs: bugfix: Write files greater than page size on OFS
fs/affs: bugfix: enable writes on OFS disks
fs/affs: remove node generation check
fs/affs: import amigaffs.h
fs/affs: bugfix: make symbolic links work again