This reverts commit 8c01a529b8.
It turns out the d_unhashed() check isn't unnecessary after all: while
it's true that unhashing will increment the sequence numbers, that does
not necessarily invalidate the RCU lookup, because it might have seen
the dentry pointer (before it got unhashed), but by the time it loaded
the sequence number, it could have seen the *new* sequence number (after
it got unhashed).
End result: we might look up an unhashed dentry that is about to be
freed, with the sequence number never indicating anything bad about it.
So checking that the dentry is still hashed (*after* reading the sequence
number) is indeed the proper fix, and was never unnecessary.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miklos Szeredi points out that we need to also worry about memory
odering when doing the dentry name comparison asynchronously with RCU.
In particular, doing a rename can do a memcpy() of one dentry name over
another, and we want to make sure that any unlocked reader will always
see the proper terminating NUL character, so that it won't ever run off
the allocation.
Rather than having to be extra careful with the name copy or at lookup
time for each character, this resolves the issue by making sure that all
names that are inlined in the dentry always have a NUL character at the
end of the name allocation. If we do that at dentry allocation time, we
know that no future name copy will ever change that final NUL to
anything else, so there are no memory ordering issues.
So even if a concurrent rename ends up overwriting the NUL character
that terminates the original name, we always know that there is one
final NUL at the end, and there is no worry about the lockless RCU
lookup traversing the name too far.
The out-of-line allocations are never copied over, so we can just make
sure that we write the name (with terminating NULL) and do a write
barrier before we expose the name to anything else by setting it in the
dentry.
Reported-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We had for some reason overlooked the AIO interface, and it didn't use
the proper rw_verify_area() helper function that checks (for example)
mandatory locking on the file, and that the size of the access doesn't
cause us to overflow the provided offset limits etc.
Instead, AIO did just the security_file_permission() thing (that
rw_verify_area() also does) directly.
This fixes it to do all the proper helper functions, which not only
means that now mandatory file locking works with AIO too, we can
actually remove lines of code.
Reported-by: Manish Honap <manish_honap_vit@yahoo.co.in>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull c6x updates from Mark Salter:
"Clean up some c6x Kconfig items and add support for Elf FDPIC loader."
* tag 'for-linus' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming:
C6X: remove unused config items
C6X: add support to build with BINFMT_ELF_FDPIC
C6X: change main arch kbuild symbol
Pull networking changes from David Miller:
1) Get rid of the error prone NLA_PUT*() macros that used an embedded
goto.
2) Kill off the token-ring and MCA networking drivers, from Paul
Gortmaker.
3) Reduce high-order allocations made by datagram AF_UNIX sockets, from
Eric Dumazet.
4) Add PTP hardware clock support to IGB and IXGBE, from Richard
Cochran and Jacob Keller.
5) Allow users to query timestamping capabilities of a card via
ethtool, from Richard Cochran.
6) Add loadbalance mode to the teaming driver, from Jiri Pirko. Part
of this is that we can now have BPF filters not attached to sockets,
and the loadbalancing function is calculated using one.
7) Francois Romieu went through the network drivers removing gratuitous
uses of netdev->base_addr, perhaps some day we can remove it
completely but it's used for ISA probing still.
8) Add a BPF JIT for sparc. I know, who cares, right? :-)
9) Move networking sysctl registry away from using the compatability
mode interfaces in the sysctl code. From Eric W Biederman.
10) Pavel Emelyanov added a way to save and restore TCP socket state via
TCP_REPAIR, TCP_REPAIR_QUEUE, and TCP_QUEUE_SEQ socket options as
well as a way to forcefully bind a socket to a port via the
sk->sk_reuse value SK_FORCE_REUSE. There is also a
TCP_REPAIR_OPTIONS which allows to reinstante the TCP options
enabled on the connection.
11) Several enhancements from Eric Dumazet that, in particular, can
enhance splice performance on TCP sockets significantly.
a) Reset the offset of the per-socket sendmsg page when we know
we're the only use of the page in linear_to_page().
b) Add facilities such that skb->data can be backed a page rather
than SLAB kmalloc'd memory. In particular devices which were
receiving into linear RX buffers can now end up providing paged
data.
The big result is that code like splice and GRO do not have to copy
any more.
12) Allow a pure sender to more gracefully handle ACK backlogs in TCP.
What can happen at high rates is that the sender hasn't grown his
receive buffer limits at all (he's not receiving data so really
doesn't need to), but the non-data ACKs consume receive buffer
space.
sk_add_backlog() is too aggressive in dropping frames in this case,
so relax it's requirements by using the receive buffer plus the send
buffer limit as the backlog limit instead of just the former.
Also from Eric Dumazet.
13) Add ipv6 support to L2TP, from Benjamin LaHaise, James Chapman, and
Chris Elston.
14) Implement TCP early retransmit (RFC 5827), from Yuchung Cheng.
Basically, we can start fast retransmit before hiting the dupack
threshold under certain conditions.
15) New CODEL active queue management packet scheduler, from Eric
Dumazet based upon initial work by Dave Taht.
Basically, the big feature is that packets are dropped (or ECN bits
are set) based upon how long packets live in the queue, rather than
the queue length (which is what RED uses).
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1341 commits)
drivers/net/stmmac: seq_file fix memory leak
ipv6/exthdrs: strict Pad1 and PadN check
USB: qmi_wwan: Add ZTE (Vodafone) K3520-Z
USB: qmi_wwan: Add ZTE (Vodafone) K3765-Z
USB: qmi_wwan: Make forced int 4 whitelist generic
net/ipv4: replace simple_strtoul with kstrtoul
net/ipv4/ipconfig: neaten __setup placement
net: qmi_wwan: Add Vodafone/Huawei K5005 support
net: cdc_ether: Add ZTE WWAN matches before generic Ethernet
ipv6: use skb coalescing in reassembly
ipv4: use skb coalescing in defragmentation
net: introduce skb_try_coalesce()
net:ipv6:fixed space issues relating to operators.
net:ipv6:fixed a trailing white space issue.
ipv6: disable GSO on sockets hitting dst_allfrag
tg3: use netdev_alloc_frag() API
net: napi_frags_skb() is static
ppp: avoid false drop_monitor false positives
ipv6: bool/const conversions phase2
ipx: Remove spurious NULL checking in ipx_ioctl().
...
This branch simplifies and clarifies the dcache lookup, and allows us to
do certain nice optimizations when comparing dentries. It also cleans
up the interface to __d_lookup_rcu(), especially around passing the
inode information around.
* dentry-cleanups:
vfs: make it possible to access the dentry hash/len as one 64-bit entry
vfs: move dentry name length comparison from dentry_cmp() into callers
vfs: do the careful dentry name access for all dentry_cmp cases
vfs: remove unnecessary d_unhashed() check from __d_lookup_rcu
vfs: clean up __d_lookup_rcu() and dentry_cmp() interfaces
This teaches vfs_fstat() to use the appropriate f[get|put]_light
functions, allowing it to avoid some unnecessary locking for the common
case.
More noticeably, it also cleans up and simplifies the "getname_flags()"
function, which now relies on the architecture strncpy_from_user() doing
all the user access checks properly, instead of hacking around the fact
that on x86 it didn't use to do it right (see commit 92ae03f2ef: "x86:
merge 32/64-bit versions of 'strncpy_from_user()' and speed it up").
* vfs-cleanups:
VFS: make vfs_fstat() use f[get|put]_light()
VFS: clean up and simplify getname_flags()
x86: make word-at-a-time strncpy_from_user clear bytes at the end
This makes cp_new_stat() a bit more readable, and avoids having to
memset() the whole structure just to fill in a couple of padding fields.
This is another result of me looking at code generation of functions
that show up high on certain kernel profiles, and just going "Oh, let's
just clean that up".
Architectures that don't supply the #define to fill just the padding
fields will still fall back to memset().
* stat-cleanups:
vfs: don't force a big memset of stat data just to clear padding fields
vfs: de-crapify "cp_new_stat()" function
Pull block layer fixes from Jens Axboe:
"A few small, but important fixes. Most of them are marked for stable
as well
- Fix failure to release a semaphore on error path in mtip32xx.
- Fix crashable condition in bio_get_nr_vecs().
- Don't mark end-of-disk buffers as mapped, limit it to i_size.
- Fix for build problem with CONFIG_BLOCK=n on arm at least.
- Fix for a buffer overlow on UUID partition printing.
- Trivial removal of unused variables in dac960."
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix buffer overflow when printing partition UUIDs
Fix blkdev.h build errors when BLOCK=n
bio allocation failure due to bio_get_nr_vecs()
block: don't mark buffers beyond end of disk as mapped
mtip32xx: release the semaphore on an error path
dac960: Remove unused variables from DAC960_CreateProcEntries()
Merge misc fixes from Andrew Morton.
* emailed from Andrew Morton <akpm@linux-foundation.org>: (4 patches)
frv: delete incorrect task prototypes causing compile fail
slub: missing test for partial pages flush work in flush_all()
fs, proc: fix ABBA deadlock in case of execution attempt of map_files/ entries
drivers/rtc/rtc-pl031.c: configure correct wday for 2000-01-01
Instead of doing the i_mode calculations at proc_fd_instantiate() time,
move them into tid_fd_revalidate(), which is where the other inode state
(notably uid/gid information) is updated too.
Otherwise we'll end up with stale i_mode information if an fd is re-used
while the dentry still hangs around. Not that anything really *cares*
(symlink permissions don't really matter), but Tetsuo Handa noticed that
the owner read/write bits don't always match the state of the
readability of the file descriptor, and we _used_ to get this right a
long time ago in a galaxy far, far away.
Besides, aside from fixing an ugly detail (that has apparently been this
way since commit 61a2878402: "proc: Remove the hard coded inode
numbers" in 2006), this removes more lines of code than it adds. And it
just makes sense to update i_mode in the same place we update i_uid/gid.
Al Viro correctly points out that we could just do the inode fill in the
inode iops ->getattr() function instead. However, that does require
somewhat slightly more invasive changes, and adds yet *another* lookup
of the file descriptor. We need to do the revalidate() for other
reasons anyway, and have the file descriptor handy, so we might as well
fill in the information at this point.
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
map_files/ entries are never supposed to be executed, still curious
minds might try to run them, which leads to the following deadlock
======================================================
[ INFO: possible circular locking dependency detected ]
3.4.0-rc4-24406-g841e6a6 #121 Not tainted
-------------------------------------------------------
bash/1556 is trying to acquire lock:
(&sb->s_type->i_mutex_key#8){+.+.+.}, at: do_lookup+0x267/0x2b1
but task is already holding lock:
(&sig->cred_guard_mutex){+.+.+.}, at: prepare_bprm_creds+0x2d/0x69
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&sig->cred_guard_mutex){+.+.+.}:
validate_chain+0x444/0x4f4
__lock_acquire+0x387/0x3f8
lock_acquire+0x12b/0x158
__mutex_lock_common+0x56/0x3a9
mutex_lock_killable_nested+0x40/0x45
lock_trace+0x24/0x59
proc_map_files_lookup+0x5a/0x165
__lookup_hash+0x52/0x73
do_lookup+0x276/0x2b1
walk_component+0x3d/0x114
do_last+0xfc/0x540
path_openat+0xd3/0x306
do_filp_open+0x3d/0x89
do_sys_open+0x74/0x106
sys_open+0x21/0x23
tracesys+0xdd/0xe2
-> #0 (&sb->s_type->i_mutex_key#8){+.+.+.}:
check_prev_add+0x6a/0x1ef
validate_chain+0x444/0x4f4
__lock_acquire+0x387/0x3f8
lock_acquire+0x12b/0x158
__mutex_lock_common+0x56/0x3a9
mutex_lock_nested+0x40/0x45
do_lookup+0x267/0x2b1
walk_component+0x3d/0x114
link_path_walk+0x1f9/0x48f
path_openat+0xb6/0x306
do_filp_open+0x3d/0x89
open_exec+0x25/0xa0
do_execve_common+0xea/0x2f9
do_execve+0x43/0x45
sys_execve+0x43/0x5a
stub_execve+0x6c/0xc0
This is because prepare_bprm_creds grabs task->signal->cred_guard_mutex
and when do_lookup happens we try to grab task->signal->cred_guard_mutex
again in lock_trace.
Fix it using plain ptrace_may_access() helper in proc_map_files_lookup()
and in proc_map_files_readdir() instead of lock_trace(), the caller must
be CAP_SYS_ADMIN granted anyway.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
...and add a "directio" synonym since that's what the manpage has
always advertised.
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
C6x userspace supports a shared library mechanism called DSBT for systems with
no MMU. DSBT is similar to FDPIC in allowing shared text segments and private
copies of data segments without an MMU. Both methods access data using a base
register and offset. With FDPIC, the caller of an external function sets up the
base register for the callee. With DSBT, the called function sets up its own
base register. Other details differ but both userspaces need the same thing
from the kernel loader: a map of where each ELF segment was loaded. The FDPIC
loader already provides this, so DSBT just uses it.
This patch enables BINFMT_ELF_FDPIC by default for C6X and provides the
necessary architecture hooks for the generic loader.
Signed-off-by: Mark Salter <msalter@redhat.com>
Pull three MTD fixes from David Woodhouse:
- Fix a lock ordering deadlock in JFFS2
- Fix an oops in the dataflash driver, triggered by a dummy call to test
whether it has OTP functionality.
- Fix request_mem_region() failure on amsdelta NAND driver.
* tag 'for-linus-3.4-20120513' of git://git.infradead.org/linux-mtd:
mtd: ams-delta: fix request_mem_region() failure
jffs2: Fix lock acquisition order bug in gc path
mtd: fix oops in dataflash driver
The number of bio_get_nr_vecs() is passed down via bio_alloc() to
bvec_alloc_bs(), which fails the bio allocation if
nr_iovecs > BIO_MAX_PAGES. For the underlying caller this causes an
unexpected bio allocation failure.
Limiting to queue_max_segments() is not sufficient, as max_segments
also might be very large.
bvec_alloc_bs(gfp_mask, nr_iovecs, ) => NULL when nr_iovecs > BIO_MAX_PAGES
bio_alloc_bioset(gfp_mask, nr_iovecs, ...)
bio_alloc(GFP_NOIO, nvecs)
xfs_alloc_ioend_bio()
Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hi,
We have a bug report open where a squashfs image mounted on ppc64 would
exhibit errors due to trying to read beyond the end of the disk. It can
easily be reproduced by doing the following:
[root@ibm-p750e-02-lp3 ~]# ls -l install.img
-rw-r--r-- 1 root root 142032896 Apr 30 16:46 install.img
[root@ibm-p750e-02-lp3 ~]# mount -o loop ./install.img /mnt/test
[root@ibm-p750e-02-lp3 ~]# dd if=/dev/loop0 of=/dev/null
dd: reading `/dev/loop0': Input/output error
277376+0 records in
277376+0 records out
142016512 bytes (142 MB) copied, 0.9465 s, 150 MB/s
In dmesg, you'll find the following:
squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 43.106012] attempt to access beyond end of device
[ 43.106029] loop0: rw=0, want=277410, limit=277408
[ 43.106039] Buffer I/O error on device loop0, logical block 138704
[ 43.106053] attempt to access beyond end of device
[ 43.106057] loop0: rw=0, want=277412, limit=277408
[ 43.106061] Buffer I/O error on device loop0, logical block 138705
[ 43.106066] attempt to access beyond end of device
[ 43.106070] loop0: rw=0, want=277414, limit=277408
[ 43.106073] Buffer I/O error on device loop0, logical block 138706
[ 43.106078] attempt to access beyond end of device
[ 43.106081] loop0: rw=0, want=277416, limit=277408
[ 43.106085] Buffer I/O error on device loop0, logical block 138707
[ 43.106089] attempt to access beyond end of device
[ 43.106093] loop0: rw=0, want=277418, limit=277408
[ 43.106096] Buffer I/O error on device loop0, logical block 138708
[ 43.106101] attempt to access beyond end of device
[ 43.106104] loop0: rw=0, want=277420, limit=277408
[ 43.106108] Buffer I/O error on device loop0, logical block 138709
[ 43.106112] attempt to access beyond end of device
[ 43.106116] loop0: rw=0, want=277422, limit=277408
[ 43.106120] Buffer I/O error on device loop0, logical block 138710
[ 43.106124] attempt to access beyond end of device
[ 43.106128] loop0: rw=0, want=277424, limit=277408
[ 43.106131] Buffer I/O error on device loop0, logical block 138711
[ 43.106135] attempt to access beyond end of device
[ 43.106139] loop0: rw=0, want=277426, limit=277408
[ 43.106143] Buffer I/O error on device loop0, logical block 138712
[ 43.106147] attempt to access beyond end of device
[ 43.106151] loop0: rw=0, want=277428, limit=277408
[ 43.106154] Buffer I/O error on device loop0, logical block 138713
[ 43.106158] attempt to access beyond end of device
[ 43.106162] loop0: rw=0, want=277430, limit=277408
[ 43.106166] attempt to access beyond end of device
[ 43.106169] loop0: rw=0, want=277432, limit=277408
...
[ 43.106307] attempt to access beyond end of device
[ 43.106311] loop0: rw=0, want=277470, limit=2774
Squashfs manages to read in the end block(s) of the disk during the
mount operation. Then, when dd reads the block device, it leads to
block_read_full_page being called with buffers that are beyond end of
disk, but are marked as mapped. Thus, it would end up submitting read
I/O against them, resulting in the errors mentioned above. I fixed the
problem by modifying init_page_buffers to only set the buffer mapped if
it fell inside of i_size.
Cheers,
Jeff
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Nick Piggin <npiggin@kernel.dk>
--
Changes from v1->v2: re-used max_block, as suggested by Nick Piggin.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This allows comparing hash and len in one operation on 64-bit
architectures. Right now only __d_lookup_rcu() takes advantage of this,
since that is the case we care most about.
The use of anonymous struct/unions hides the alternate 64-bit approach
from most users, the exception being a few cases where we initialize a
'struct qstr' with a static initializer. This makes the problematic
cases use a new QSTR_INIT() helper function for that (but initializing
just the name pointer with a "{ .name = xyzzy }" initializer remains
valid, as does just copying another qstr structure).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All callers do want to check the dentry length, but some of them can
check the length and the hash together, so doing it in dentry_cmp() can
be counter-productive.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 12f8ad4b05 ("vfs: clean up __d_lookup_rcu() and dentry_cmp()
interfaces") did the careful ACCESS_ONCE() of the dentry name only for
the word-at-a-time case, even though the issue is generic.
Admittedly I don't really see gcc ever reloading the value in the middle
of the loop, so the ACCESS_ONCE() protects us from a fairly theoretical
issue. But better safe than sorry.
Also, this consolidates the common parts of the word-at-a-time and
bytewise logic, which includes checking the length. We'll be changing
that later.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The check for d_unhashed() is not strictly incorrect, but at the same
time it is also not sensible. The actual dentry removal from the dentry
hash chains is totally asynchronous to the __d_lookup_rcu() logic, and
we depend on __d_drop() updating the sequence number to invalidate any
lookup of an unhashed dentry.
So checking d_unhashed() is not incorrect, but it's not useful either:
the code has to work correctly even without it. So just remove it.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc fixes from Andrew Morton.
* emailed from Andrew Morton <akpm@linux-foundation.org>: (8 patches)
MAINTAINERS: add maintainer for LED subsystem
mm: nobootmem: fix sign extend problem in __free_pages_memory()
drivers/leds: correct __devexit annotations
memcg: free spare array to avoid memory leak
namespaces, pid_ns: fix leakage on fork() failure
hugetlb: prevent BUG_ON in hugetlb_fault() -> hugetlb_cow()
mm: fix division by 0 in percpu_pagelist_fraction()
proc/pid/pagemap: correctly report non-present ptes and holes between vmas