This commit makes the stack guard page somewhat less visible to user
space. It does this by:
- not showing the guard page in /proc/<pid>/maps
It looks like lvm-tools will actually read /proc/self/maps to figure
out where all its mappings are, and effectively do a specialized
"mlockall()" in user space. By not showing the guard page as part of
the mapping (by just adding PAGE_SIZE to the start for grows-up
pages), lvm-tools ends up not being aware of it.
- by also teaching the _real_ mlock() functionality not to try to lock
the guard page.
That would just expand the mapping down to create a new guard page,
so there really is no point in trying to lock it in place.
It would perhaps be nice to show the guard page specially in
/proc/<pid>/maps (or at least mark grow-down segments some way), but
let's not open ourselves up to more breakage by user space from programs
that depends on the exact deails of the 'maps' file.
Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools
source code to see what was going on with the whole new warning.
Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be
Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
isofs: Fix lseek() to position beyond 4 GB
vfs: remove unused MNT_STRICTATIME
vfs: show unreachable paths in getcwd and proc
vfs: only add " (deleted)" where necessary
vfs: add prepend_path() helper
vfs: __d_path: dont prepend the name of the root dentry
ia64: perfmon: add d_dname method
vfs: add helpers to get root and pwd
cachefiles: use path_get instead of lone dget
fs/sysv/super.c: add support for non-PDP11 v7 filesystems
V7: Adjust sanity checks for some volumes
Add v7 alias
v9fs: fixup for inode_setattr being removed
Manual merge to take Al's version of the fs/sysv/super.c file: it merged
cleanly, but Al had removed an unnecessary header include, so his side
was better.
Since the entire fs/proc directory is conditionally included based on
CONFIG_PROC_FS, it's redundant to check that same variable within that
directory.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prepend "(unreachable)" to path strings if the path is not reachable
from the current root.
Two places updated are
- the return string from getcwd()
- and symlinks under /proc/$PID.
Other uses of d_path() are left unchanged (we know that some old
software crashes if /proc/mounts is changed).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add three helpers that retrieve a refcounted copy of the root and cwd
from the supplied fs_struct.
get_fs_root()
get_fs_pwd()
get_fs_root_and_pwd()
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
no need for list_for_each_entry_safe()/resetting with superblock list
Fix sget() race with failing mount
vfs: don't hold s_umount over close_bdev_exclusive() call
sysv: do not mark superblock dirty on remount
sysv: do not mark superblock dirty on mount
btrfs: remove junk sb_dirt change
BFS: clean up the superblock usage
AFFS: wait for sb synchronization when needed
AFFS: clean up dirty flag usage
cifs: truncate fallout
mbcache: fix shrinker function return value
mbcache: Remove unused features
add f_flags to struct statfs(64)
pass a struct path to vfs_statfs
update VFS documentation for method changes.
All filesystems that need invalidate_inode_buffers() are doing that explicitly
convert remaining ->clear_inode() to ->evict_inode()
Make ->drop_inode() just return whether inode needs to be dropped
fs/inode.c:clear_inode() is gone
fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
...
Fix up trivial conflicts in fs/nilfs2/super.c
This a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions. The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.
Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space is used instead. This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits. This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.
The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory. "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit. The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500 that the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.
The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.
Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs. In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.
Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it. It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatability. Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered. The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.
/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning. Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity. This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a kernel thread is using use_mm(), badness() returns a positive value.
This is not a big issue because caller take care of it correctly. But
there is one exception, /proc/<pid>/oom_score calls badness() directly and
doesn't care that the task is a regular process.
Another example, /proc/1/oom_score return !0 value. But it's unkillable.
This incorrectness makes administration a little confusing.
This patch fixes it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace inode_setattr with opencoded variants of it in all callers. This
moves the remaining call to vmtruncate into the filesystem methods where it
can be replaced with the proper truncate sequence.
In a few cases it was obvious that we would never end up calling vmtruncate
so it was left out in the opencoded variant:
spufs: explicitly checks for ATTR_SIZE earlier
btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above
In addition to that ncpfs called inode_setattr with handcrafted iattrs,
which allowed to trim down the opencoded variant.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's possible for get_task_cred() as it currently stands to 'corrupt' a set of
credentials by incrementing their usage count after their replacement by the
task being accessed.
What happens is that get_task_cred() can race with commit_creds():
TASK_1 TASK_2 RCU_CLEANER
-->get_task_cred(TASK_2)
rcu_read_lock()
__cred = __task_cred(TASK_2)
-->commit_creds()
old_cred = TASK_2->real_cred
TASK_2->real_cred = ...
put_cred(old_cred)
call_rcu(old_cred)
[__cred->usage == 0]
get_cred(__cred)
[__cred->usage == 1]
rcu_read_unlock()
-->put_cred_rcu()
[__cred->usage == 1]
panic()
However, since a tasks credentials are generally not changed very often, we can
reasonably make use of a loop involving reading the creds pointer and using
atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero.
If successful, we can safely return the credentials in the knowledge that, even
if the task we're accessing has released them, they haven't gone to the RCU
cleanup code.
We then change task_state() in procfs to use get_task_cred() rather than
calling get_cred() on the result of __task_cred(), as that suffers from the
same problem.
Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be
tripped when it is noticed that the usage count is not zero as it ought to be,
for example:
kernel BUG at kernel/cred.c:168!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 0
Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex
745
RIP: 0010:[<ffffffff81069881>] [<ffffffff81069881>] __put_cred+0xc/0x45
RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff
RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0
RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0
R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001
FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0)
Stack:
ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45
<0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000
<0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246
Call Trace:
[<ffffffff810698cd>] put_cred+0x13/0x15
[<ffffffff81069b45>] commit_creds+0x16b/0x175
[<ffffffff8106aace>] set_current_groups+0x47/0x4e
[<ffffffff8106ac89>] sys_setgroups+0xf6/0x105
[<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00
48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b
04 25 00 cc 00 00 48 3b b8 58 04 00 00 75
RIP [<ffffffff81069881>] __put_cred+0xc/0x45
RSP <ffff88019e7e9eb8>
---[ end trace df391256a100ebdd ]---
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support to the NOMMU /proc/pid/maps file to show which mapping is the stack
of the original thread after execve. This is largely based on the MMU code.
Subsidiary thread stacks are not indicated.
For FDPIC, we now get:
root:/> cat /proc/self/maps
02064000-02067ccc rw-p 0004d000 00:01 22 /bin/busybox
0206e000-0206f35c rw-p 00006000 00:01 295 /lib/ld-uClibc.so.0
025f0000-025f6f0c r-xs 00000000 00:01 295 /lib/ld-uClibc.so.0
02680000-026ba6b0 r-xs 00000000 00:01 297 /lib/libc.so.0
02700000-0274d384 r-xs 00000000 00:01 22 /bin/busybox
02816000-02817000 rw-p 00000000 00:00 0
02848000-0284c0d8 rw-p 00000000 00:00 0
02860000-02880000 rw-p 00000000 00:00 0 [stack]
The semi-downside here is that for FLAT, we get:
root:/> cat /proc/155/maps
029f0000-029f9000 rwxp 00000000 00:00 0 [stack]
The reason being that FLAT combines a whole lot of stuff into one map
(including the stack). But this isn't any worse than the current output
(which is nothing), so screw it.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some bogus firmwares include properties with "/" in their name. This
causes problems when creating the /proc/device-tree file system,
because the slash is taken to indicate a directory.
We don't care about those properties, and we don't want to encourage
them, so just throw them away when creating /proc/device-tree.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Tested-by: Christian Kujau <lists@nerdbynature.de>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Now that task->signal can't go away get_nr_threads() doesn't need
->siglock to read signal->count.
Also, make it inline, move into sched.h, and convert 2 other proc users of
signal->count to use this (now trivial) helper.
Henceforth get_nr_threads() is the only valid user of signal->count, we
are ready to turn it into "int nr_threads" or, perhaps, kill it.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'bkl/procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
sunrpc: Include missing smp_lock.h
procfs: Kill the bkl in ioctl
procfs: Push down the bkl from ioctl
procfs: Use generic_file_llseek in /proc/vmcore
procfs: Use generic_file_llseek in /proc/kmsg
procfs: Use generic_file_llseek in /proc/kcore
procfs: Kill BKL in llseek on proc base