Pull ceph updates from Sage Weil:
"There are some updates and cleanups to the CRUSH placement code, a bug
fix with incremental maps, several cleanups and fixes from Josh Durgin
in the RBD block device code, a series of cleanups and bug fixes from
Alex Elder in the messenger code, and some miscellaneous bounds
checking and gfp cleanups/fixes."
Fix up trivial conflicts in net/ceph/{messenger.c,osdmap.c} due to the
networking people preferring "unsigned int" over just "unsigned".
* git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (45 commits)
libceph: fix pg_temp updates
libceph: avoid unregistering osd request when not registered
ceph: add auth buf in prepare_write_connect()
ceph: rename prepare_connect_authorizer()
ceph: return pointer from prepare_connect_authorizer()
ceph: use info returned by get_authorizer
ceph: have get_authorizer methods return pointers
ceph: ensure auth ops are defined before use
ceph: messenger: reduce args to create_authorizer
ceph: define ceph_auth_handshake type
ceph: messenger: check return from get_authorizer
ceph: messenger: rework prepare_connect_authorizer()
ceph: messenger: check prepare_write_connect() result
ceph: don't set WRITE_PENDING too early
ceph: drop msgr argument from prepare_write_connect()
ceph: messenger: send banner in process_connect()
ceph: messenger: reset connection kvec caller
libceph: don't reset kvec in prepare_write_banner()
ceph: ignore preferred_osd field
ceph: fully initialize new layout
...
Merge block/IO core bits from Jens Axboe:
"This is a bit bigger on the core side than usual, but that is purely
because we decided to hold off on parts of Tejun's submission on 3.4
to give it a bit more time to simmer. As a consequence, it's seen a
long cycle in for-next.
It contains:
- Bug fix from Dan, wrong locking type.
- Relax splice gifting restriction from Eric.
- A ton of updates from Tejun, primarily for blkcg. This improves
the code a lot, making the API nicer and cleaner, and also includes
fixes for how we handle and tie policies and re-activate on
switches. The changes also include generic bug fixes.
- A simple fix from Vivek, along with a fix for doing proper delayed
allocation of the blkcg stats."
Fix up annoying conflict just due to different merge resolution in
Documentation/feature-removal-schedule.txt
* 'for-3.5/core' of git://git.kernel.dk/linux-block: (92 commits)
blkcg: tg_stats_alloc_lock is an irq lock
vmsplice: relax alignement requirements for SPLICE_F_GIFT
blkcg: use radix tree to index blkgs from blkcg
blkcg: fix blkcg->css ref leak in __blkg_lookup_create()
block: fix elvpriv allocation failure handling
block: collapse blk_alloc_request() into get_request()
blkcg: collapse blkcg_policy_ops into blkcg_policy
blkcg: embed struct blkg_policy_data in policy specific data
blkcg: mass rename of blkcg API
blkcg: style cleanups for blk-cgroup.h
blkcg: remove blkio_group->path[]
blkcg: blkg_rwstat_read() was missing inline
blkcg: shoot down blkgs if all policies are deactivated
blkcg: drop stuff unused after per-queue policy activation update
blkcg: implement per-queue policy activation
blkcg: add request_queue->root_blkg
blkcg: make request_queue bypassing on allocation
blkcg: make sure blkg_lookup() returns %NULL if @q is bypassing
blkcg: make blkg_conf_prep() take @pol and return with queue lock held
blkcg: remove static policy ID enums
...
Merge patches through Andrew Morton:
"180 patches - err 181 - listed below:
- most of MM. I held back the (large) "memcg: add hugetlb extension"
series because a bunfight has recently broken out.
- leds. After this, Bryan Wu will be handling drivers/leds/
- backlight
- lib/
- rtc"
* emailed from Andrew Morton <akpm@linux-foundation.org>: (181 patches)
drivers/rtc/rtc-s3c.c: fix compiler warning
drivers/rtc/rtc-tegra.c: clean up probe/remove routines
drivers/rtc/rtc-pl031.c: remove RTC timer interrupt handling
drivers/rtc/rtc-lpc32xx.c: add device tree support
drivers/rtc/rtc-m41t93.c: don't let get_time() reset M41T93_FLAG_OF
rtc: ds1307: add trickle charger support
rtc: ds1307: remove superfluous initialization
rtc: rename CONFIG_RTC_MXC to CONFIG_RTC_DRV_MXC
drivers/rtc/Kconfig: place RTC_DRV_IMXDI and RTC_MXC under "on-CPU RTC drivers"
drivers/rtc/rtc-pcf8563.c: add RTC_VL_READ/RTC_VL_CLR ioctl feature
rtc: add ioctl to get/clear battery low voltage status
drivers/rtc/rtc-ep93xx.c: convert to use module_platform_driver()
rtc/spear: add Device Tree probing capability
lib/vsprintf.c: "%#o",0 becomes '0' instead of '00'
radix-tree: fix preload vector size
spinlock_debug: print kallsyms name for lock
vsprintf: fix %ps on non symbols when using kallsyms
lib/bitmap.c: fix documentation for scnprintf() functions
lib/string_helpers.c: make arrays static
lib/test-kstrtox.c: mark const init data with __initconst instead of __initdata
...
The oom_score_adj scale ranges from -1000 to 1000 and represents the
proportion of memory available to the process at allocation time. This
means an oom_score_adj value of 300, for example, will bias a process as
though it was using an extra 30.0% of available memory and a value of
-350 will discount 35.0% of available memory from its usage.
The oom killer badness heuristic also uses this scale to report the oom
score for each eligible process in determining the "best" process to
kill. Thus, it can only differentiate each process's memory usage by
0.1% of system RAM.
On large systems, this can end up being a large amount of memory: 256MB
on 256GB systems, for example.
This can be fixed by having the badness heuristic to use the actual
memory usage in scoring threads and then normalizing it to the
oom_score_adj scale for userspace. This results in better comparison
between eligible threads for kill and no change from the userspace
perspective.
Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove vmtruncate_range(), and remove the truncate_range method from
struct inode_operations: only tmpfs ever supported it, and tmpfs has now
converted over to using the fallocate method of file_operations.
Update Documentation accordingly, adding (setlease and) fallocate lines.
And while we're in mm.h, remove duplicate declarations of shmem_lock() and
shmem_file_setup(): everyone is now using the ones in shmem_fs.h.
Based-on-patch-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Cong Wang <amwang@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull CIFS updates from Steve French.
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (29 commits)
cifs: fix oops while traversing open file list (try #4)
cifs: Fix comment as d_alloc_root() is replaced by d_make_root()
CIFS: Introduce SMB2 mounts as vers=2.1
CIFS: Introduce SMB2 Kconfig option
CIFS: Move add/set_credits and get_credits_field to ops structure
CIFS: Move protocol specific demultiplex thread calls to ops struct
CIFS: Move protocol specific part from cifs_readv_receive to ops struct
CIFS: Move header_size/max_header_size to ops structure
CIFS: Move protocol specific part from SendReceive2 to ops struct
cifs: Include backup intent search flags during searches {try #2)
CIFS: Separate protocol specific part from setlk
CIFS: Separate protocol specific part from getlk
CIFS: Separate protocol specific lock type handling
CIFS: Convert lock type to 32 bit variable
CIFS: Move locks to cifsFileInfo structure
cifs: convert send_nt_cancel into a version specific op
cifs: add a smb_version_operations/values structures and a smb_version enum
cifs: remove the vers= and version= synonyms for ver=
cifs: add warning about change in default cache semantics in 3.7
cifs: display cache= option in /proc/mounts
...
Pull NFS client updates from Trond Myklebust:
"New features include:
- Rewrite the O_DIRECT code so that it can share the same coalescing
and pNFS functionality as the page cache code.
- Allow the server to provide hints as to when we should use pNFS,
and when it is more efficient to read and write through the
metadata server.
- NFS cache consistency updates:
* Use the ctime to emulate a change attribute for NFSv2/v3 so that
all NFS versions can share the same cache management code.
* New cache management code will only look at the change attribute
and size attribute when deciding whether or not our cached data
is still valid or not.
* Don't request NFSv4 post-op attributes on writes in cases such as
O_DIRECT, where we don't care about data cache consistency, or
when we have a write delegation, and know that our cache is still
consistent.
* Don't request NFSv4 post-op attributes on operations such as
COMMIT, where there are no expected metadata updates.
* Don't request NFSv4 directory post-op attributes in cases where
the operations themselves already return change attribute
updates: i.e. operations such as OPEN, CREATE, REMOVE, LINK and
RENAME.
- Speed up 'ls' and friends by using READDIR rather than READDIRPLUS
if we detect no attempts to lookup filenames.
- Improve the code sharing between NFSv2/v3 and v4 mounts
- NFSv4.1 state management efficiency improvements
- More patches in preparation for NFSv4/v4.1 migration functionality."
Fix trivial conflict in fs/nfs/nfs4proc.c that was due to the dcache
qstr name initialization changes (that made the length/hash a 64-bit
union)
* tag 'nfs-for-3.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (146 commits)
NFSv4: Add debugging printks to state manager
NFSv4: Map NFS4ERR_SHARE_DENIED into an EACCES error instead of EIO
NFSv4: update_changeattr does not need to set NFS_INO_REVAL_PAGECACHE
NFSv4.1: nfs4_reset_session should use nfs4_handle_reclaim_lease_error
NFSv4.1: Handle other occurrences of NFS4ERR_CONN_NOT_BOUND_TO_SESSION
NFSv4.1: Handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION in the state manager
NFSv4.1: Handle errors in nfs4_bind_conn_to_session
NFSv4.1: nfs4_bind_conn_to_session should drain the session
NFSv4.1: Don't clobber the seqid if exchange_id returns a confirmed clientid
NFSv4.1: Add DESTROY_CLIENTID
NFSv4.1: Ensure we use the correct credentials for bind_conn_to_session
NFSv4.1: Ensure we use the correct credentials for session create/destroy
NFSv4.1: Move NFSPROC4_CLNT_BIND_CONN_TO_SESSION to the end of the operations
NFSv4.1: Handle NFS4ERR_SEQ_MISORDERED when confirming the lease
NFSv4: When purging the lease, we must clear NFS4CLNT_LEASE_CONFIRM
NFSv4: Clean up the error handling for nfs4_reclaim_lease
NFSv4.1: Exchange ID must use GFP_NOFS allocation mode
nfs41: Use BIND_CONN_TO_SESSION for CB_PATH_DOWN*
nfs4.1: add BIND_CONN_TO_SESSION operation
NFSv4.1 test the mdsthreshold hint parameters
...
Pull exofs updates from Boaz Harrosh:
"Just a couple of patches. The first is a BUG fix destined for stable
which missed the 3.4-rc7 Kernel. The second is just a fixture
addition so exofs is able to be better exported as a cluster file
system via pNFS."
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
exofs: Add SYSFS info for autologin/pNFS export
exofs: Fix CRASH on very early IO errors.
Pull writeback tree from Wu Fengguang:
"Mainly from Jan Kara to avoid iput() in the flusher threads."
* tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Avoid iput() from flusher thread
vfs: Rename end_writeback() to clear_inode()
vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
writeback: Refactor writeback_single_inode()
writeback: Remove wb->list_lock from writeback_single_inode()
writeback: Separate inode requeueing after writeback
writeback: Move I_DIRTY_PAGES handling
writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
writeback: Move clearing of I_SYNC into inode_sync_complete()
writeback: initialize global_dirty_limit
fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
mm: page-writeback.c: local functions should not be exposed globally
We're already invalidating the data cache, and setting the new change
attribute. Since directories don't care about the i_size field, there
is no need to be forcing any extra revalidation of the page cache.
We do keep the NFS_INO_INVALID_ATTR flag, in order to force an
attribute cache revalidation on stat() calls since we do not
update the mtime and ctime fields.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The results from a call to nfs4_proc_create_session() should always
be fed into nfs4_handle_reclaim_lease_error, so that we can
handle errors such as NFS4ERR_SEQ_MISORDERED correctly.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Let nfs4_schedule_session_recovery() handle the details of choosing
between resetting the session, and other session related recovery.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we handle NFS4ERR_DELAY errors separately, and then
let nfs4_recovery_handle_error() handle all other cases.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In order to avoid races with other RPC calls that end up setting the
NFS4CLNT_BIND_CONN_TO_SESSION flag.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This changes the interfaces in <asm/word-at-a-time.h> to be a bit more
complicated, but a lot more generic.
In particular, it allows us to really do the operations efficiently on
both little-endian and big-endian machines, pretty much regardless of
machine details. For example, if you can rely on a fast population
count instruction on your architecture, this will allow you to make your
optimized <asm/word-at-a-time.h> file with that.
NOTE! The "generic" version in include/asm-generic/word-at-a-time.h is
not truly generic, it actually only works on big-endian. Why? Because
on little-endian the generic algorithms are wasteful, since you can
inevitably do better. The x86 implementation is an example of that.
(The only truly non-generic part of the asm-generic implementation is
the "find_zero()" function, and you could make a little-endian version
of it. And if the Kbuild infrastructure allowed us to pick a particular
header file, that would be lovely)
The <asm/word-at-a-time.h> functions are as follows:
- WORD_AT_A_TIME_CONSTANTS: specific constants that the algorithm
uses.
- has_zero(): take a word, and determine if it has a zero byte in it.
It gets the word, the pointer to the constant pool, and a pointer to
an intermediate "data" field it can set.
This is the "quick-and-dirty" zero tester: it's what is run inside
the hot loops.
- "prep_zero_mask()": take the word, the data that has_zero() produced,
and the constant pool, and generate an *exact* mask of which byte had
the first zero. This is run directly *outside* the loop, and allows
the "has_zero()" function to answer the "is there a zero byte"
question without necessarily getting exactly *which* byte is the
first one to contain a zero.
If you do multiple byte lookups concurrently (eg "hash_name()", which
looks for both NUL and '/' bytes), after you've done the prep_zero_mask()
phase, the result of those can be or'ed together to get the "either
or" case.
- The result from "prep_zero_mask()" can then be fed into "find_zero()"
(to find the byte offset of the first byte that was zero) or into
"zero_bytemask()" (to find the bytemask of the bytes preceding the
zero byte).
The existence of zero_bytemask() is optional, and is not necessary
for the normal string routines. But dentry name hashing needs it, so
if you enable DENTRY_WORD_AT_A_TIME you need to expose it.
This changes the generic strncpy_from_user() function and the dentry
hashing functions to use these modified word-at-a-time interfaces. This
gets us back to the optimized state of the x86 strncpy that we lost in
the previous commit when moving over to the generic version.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the EXCHGID4_FLAG_CONFIRMED_R flag is set, the client is in theory
supposed to already know the correct value of the seqid, in which case
RFC5661 states that it should ignore the value returned.
Also ensure that if the sanity check in nfs4_check_cl_exchange_flags
fails, then we must not change the nfs_client fields.
Finally, clean up the code: we don't need to retest the value of
'status' unless it can change.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Apparently the patch "NFS: Always use the same SETCLIENTID boot verifier"
is tickling a Linux nfs server bug, and causing a regression: the server
can get into a situation where it keeps replying NFS4ERR_SEQ_MISORDERED
to our CREATE_SESSION request even when we are sending the correct
sequence ID.
Fix this by purging the lease and then retrying.
Reported-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>