The /proc/pid/ "maps", "smaps", and "numa_maps" files contain sensitive
information about the memory location and usage of processes. Issues:
- maps should not be world-readable, especially if programs expect any
kind of ASLR protection from local attackers.
- maps cannot just be 0400 because "-D_FORTIFY_SOURCE=2 -O2" makes glibc
check the maps when %n is in a *printf call, and a setuid(getuid())
process wouldn't be able to read its own maps file. (For reference
see http://lkml.org/lkml/2006/1/22/150)
- a system-wide toggle is needed to allow prior behavior in the case of
non-root applications that depend on access to the maps contents.
This change implements a check using "ptrace_may_attach" before allowing
access to read the maps contents. To control this protection, the new knob
/proc/sys/kernel/maps_protect has been added, with corresponding updates to
the procfs documentation.
[akpm@linux-foundation.org: build fixes]
[akpm@linux-foundation.org: New sysctl numbers are old hat]
Signed-off-by: Kees Cook <kees@outflux.net>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
WARNING: vmlinux - Section mismatch: reference to
.init.text:eisa_root_register from .text between 'virtual_eisa_root_init' (at
offset 0xc026b80f) and 'cpufreq_debug_disable_ratelimit'
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't have a routine called namei() anymore since at least 2.3.x, and
the comment is just totally out of sync with the current lookup logic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
WARN_ON(de && de->deleted); is sooo unreliable. Why?
proc_lookup remove_proc_entry
=========== =================
lock_kernel();
spin_lock(&proc_subdir_lock);
[find proc entry]
spin_unlock(&proc_subdir_lock);
spin_lock(&proc_subdir_lock);
[find proc entry]
proc_get_inode
==============
WARN_ON(de && de->deleted); ...
if (!atomic_read(&de->count))
free_proc_entry(de);
else
de->deleted = 1;
So, if you have some strange oops [1], and doesn't see this WARN_ON it means
nothing.
[1] try_module_get() of module which doesn't exist, two lines below
should suffice, or not?
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a slight problem with filesystem type representation in fuse
based filesystems.
From the kernel's view, there are just two filesystem types: fuse and
fuseblk. From the user's view there are lots of different filesystem
types. The user is not even much concerned if the filesystem is fuse based
or not. So there's a conflict of interest in how this should be
represented in fstab, mtab and /proc/mounts.
The current scheme is to encode the real filesystem type in the mount
source. So an sshfs mount looks like this:
sshfs#user@server:/ /mnt/server fuse rw,nosuid,nodev,...
This url-ish syntax works OK for sshfs and similar filesystems. However
for block device based filesystems (ntfs-3g, zfs) it doesn't work, since
the kernel expects the mount source to be a real device name.
A possibly better scheme would be to encode the real type in the type
field as "type.subtype". So fuse mounts would look like this:
/dev/hda1 /mnt/windows fuseblk.ntfs-3g rw,...
user@server:/ /mnt/server fuse.sshfs rw,nosuid,nodev,...
This patch adds the necessary code to the kernel so that this can be
correctly displayed in /proc/mounts.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
strlcpy already accounts for the trailing zero in its length
computation, so there is no need to substract one to the buffer size.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Epoll is doing multiple passes over the ready set at the moment, because of
the constraints over the f_op->poll() call. Looking at the code again, I
noticed that we already hold the epoll semaphore in read, and this
(together with other locking conditions that hold while doing an
epoll_wait()) can lead to a smarter way [1] to "ship" events to userspace
(in a single pass).
This is a stress application that can be used to test the new code. It
spwans multiple thread and call epoll_wait() and epoll_ctl() from many
threads. Stress tested on my dual Opteron 254 w/out any problems.
http://www.xmailserver.org/totalmess.c
This is not a benchmark, just something that tries to stress and exploit
possible problems with the new code.
Also, I made a stupid micro-benchmark:
http://www.xmailserver.org/epwbench.c
[1] Considering that epoll must be thread-safe, there are five ways we can
be hit during an epoll_wait() transfer loop (ep_send_events()):
1) The epoll fd going away and calling ep_free
This just can't happen, since we did an fget() in sys_epoll_wait
2) An epoll_ctl(EPOLL_CTL_DEL)
This can't happen because epoll_ctl() gets ep->sem in write, and
we're holding it in read during ep_send_events()
3) An fd stored inside the epoll fd going away
This can't happen because in eventpoll_release_file() we get
ep->sem in write, and we're holding it in read during
ep_send_events()
4) Another epoll_wait() happening on another thread
They both can be inside ep_send_events() at the same time, we get
(splice) the ready-list under the spinlock, so each one will get
its own ready list. Note that an fd cannot be at the same time
inside more than one ready list, because ep_poll_callback() will
not re-queue it if it sees it already linked:
if (ep_is_linked(&epi->rdllink))
goto is_linked;
Another case that can happen, is two concurrent epoll_wait(),
coming in with a userspace event buffer of size, say, ten.
Suppose there are 50 event ready in the list. The first
epoll_wait() will "steal" the whole list, while the second, seeing
no events, will go to sleep. But at the end of ep_send_events() in
the first epoll_wait(), we will re-inject surplus ready fds, and we
will trigger the proper wake_up to the second epoll_wait().
5) ep_poll_callback() hitting us asyncronously
This is the tricky part. As I said above, the ep_is_linked() test
done inside ep_poll_callback(), will guarantee us that until the
item will result linked to a list, ep_poll_callback() will not try
to re-queue it again (read, write data on any of its members). When
we do a list_del() in ep_send_events(), the item will still satisfy
the ep_is_linked() test (whatever data is written in prev/next,
it'll never be its own pointer), so ep_poll_callback() will still
leave us alone. It's only after the eventual smp_mb()+INIT_LIST_HEAD(&epi->rdllink)
that it'll become visible to ep_poll_callback(), but at the point
we're already past it.
[akpm@osdl.org: 80 cols]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- ext3_dx_find_entry() exit with out setting proper error pointer
- do_split() exit with out setting proper error pointer
it is realy painful because many callers contain folowing code:
de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
if (!(de))
return retval;
<<< WOW retval wasn't changed by do_split(), so caller failed
<<< but return SUCCESS :)
- Rearrange do_split() error path. Current error path is realy ugly, all
this up and down jump stuff doesn't make code easy to understand.
[dmonakhov@sw.ru: fix annoying fake error messages]
Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org>
Cc: Andreas Dilger <adilger@clusterfs.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first thing done by timespec_trunc() is :
if (gran <= jiffies_to_usecs(1) * 1000)
This should really be a test against a constant known at compile time.
Alas, it isnt. jiffies_to_usec() was unilined so C compiler emits a function
call and a multiply to compute : a CONSTANT.
mov $0x1,%edi
mov %rbx,0xffffffffffffffe8(%rbp)
mov %r12,0xfffffffffffffff0(%rbp)
mov %edx,%ebx
mov %rsi,0xffffffffffffffc8(%rbp)
mov %rsi,%r12
callq ffffffff80232010 <jiffies_to_usecs>
imul $0x3e8,%eax,%eax
cmp %ebx,%eax
This patch reorders kernel/time.c a bit so that jiffies_to_usecs() is defined
before timespec_trunc() so that compiler now generates :
cmp $0x3d0900,%edx (HZ=250 on my machine)
This gives a better code (timespec_trunc() becoming a leaf function), and
shorter kernel size as well.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PNP now initializes device dma masks, which prevents oopses when generic
dma calls are made using pnp device nodes.
This assumes PNP only uses ISA DMA, with 24 bit addresses; and that it's
safe to init those masks for all devices (rather than finding out which
devices have been assigned DMA channels, and handling only those).
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: Adam Belay <abelay@novell.com>
Cc: Jaroslav Kysela <perex@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_clone() and sys_unshare() both makes copies of nsproxy and its associated
namespaces. But they have different code paths.
This patch merges all the nsproxy and its associated namespace copy/clone
handling (as much as possible). Posted on container list earlier for
feedback.
- Create a new nsproxy and its associated namespaces and pass it back to
caller to attach it to right process.
- Changed all copy_*_ns() routines to return a new copy of namespace
instead of attaching it to task->nsproxy.
- Moved the CAP_SYS_ADMIN checks out of copy_*_ns() routines.
- Removed unnessary !ns checks from copy_*_ns() and added BUG_ON()
just incase.
- Get rid of all individual unshare_*_ns() routines and make use of
copy_*_ns() instead.
[akpm@osdl.org: cleanups, warning fix]
[clg@fr.ibm.com: remove dup_namespaces() declaration]
[serue@us.ibm.com: fix CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) retval]
[akpm@linux-foundation.org: fix build with CONFIG_SYSVIPC=n]
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <containers@lists.osdl.org>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Petr Tesarik discovered a problem in remove_arg_zero(). He writes:
When a script is loaded, load_script() replaces argv[0] with the
name of the interpreter and the filename passed to the exec syscall.
However, there is no guarantee that the length of the interpreter
name plus the length of the filename is greater than the length of
the original argv[0]. If the difference happens to cross a page boundary,
setup_arg_pages() will call put_dirty_page() [aka install_arg_page()]
with an address outside the VMA.
Therefore, remove_arg_zero() must free all pages which would be unused
after the argument is removed.
So, rewrite the remove_arg_zero function without gotos, with a few comments,
and with the commonly used explicit index/offset. This fixes the problem
and makes it easier to understand as well.
[a.p.zijlstra@chello.nl: add comment]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The value of shmmax may be larger than will fit in the struct used by
the 32bit compat version of sys_shmctl. This change mirrors what the
normal sys_shmctl does when called with the old IPC_INFO command.
Signed-off-by: Guy Streeter <streeter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace call_smp_function with stop_machine_run in the Intel RNG driver.
CPU A has done read_lock(&lock)
CPU B has done write_lock_irq(&lock) and is waiting for A to release the lock.
A third CPU calls call_smp_function and issues the IPI. CPU A takes CPU
C's IPI. CPU B is waiting with interrupts disabled and does not see the
IPI. CPU C is stuck waiting for CPU B to respond to the IPI.
Deadlock.
The solution is to use stop_machine_run instead of call_smp_function
(call_smp_function should not be called in situations where the CPUs may be
suspended).
[haruo.tomita@toshiba.co.jp: fix a typo in mod_init()]
[haruo.tomita@toshiba.co.jp: fix memory leak]
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: "Tomita, Haruo" <haruo.tomita@toshiba.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This could help to find buggy drivers where request_irq return value wasn't
checked. There's just no reason to ignore errors which can and do occur.
Anyone who got warning during compilation have to realise what it is't
realy safe code.
Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>