__mem_open() which is called by both /proc/<pid>/environ and
/proc/<pid>/mem ->open() handlers will allow the use of negative offsets.
/proc/<pid>/mem has negative offsets but not /proc/<pid>/environ.
Clean this by moving the 'force FMODE_UNSIGNED_OFFSET flag' to mem_open()
to allow negative offsets only on /proc/<pid>/mem.
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the following offset and environment address range check in
environ_read() of /proc/<pid>/environ is buggy:
int this_len = mm->env_end - (mm->env_start + src);
if (this_len <= 0)
break;
Large or negative offsets on /proc/<pid>/environ converted to 'unsigned
long' may pass this check since '(mm->env_start + src)' can overflow and
'this_len' will be positive.
This can turn /proc/<pid>/environ to act like /proc/<pid>/mem since
(mm->env_start + src) will point and read from another VMA.
There are two fixes here plus some code cleaning:
1) Fix the overflow by checking if the offset that was converted to
unsigned long will always point to the [mm->env_start, mm->env_end]
address range.
2) Remove the truncation that was made to the result of the check,
storing the result in 'int this_len' will alter its value and we can
not depend on it.
For kernels that have commit b409e578d ("proc: clean up
/proc/<pid>/environ handling") which adds the appropriate ptrace check and
saves the 'mm' at ->open() time, this is not a security issue.
This patch is taken from the grsecurity patch since it was just made
available.
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull powerpc updates from Benjamin Herrenschmidt:
"Notable highlights:
- iommu improvements from Anton removing the per-iommu global lock in
favor of dividing the DMA space into pools, each with its own lock,
and hashed on the CPU number. Along with making the locking more
fine grained, this gives significant improvements in multiqueue
networking scalability.
- Still from Anton, we know provide a vdso based variant of getcpu
which makes sched_getcpu with the appropriate glibc patch something
like 18 times faster.
- More anton goodness (he's been busy !) in other areas such as a
faster __clear_user and copy_page on P7, various perf fixes to
improve sampling quality, etc...
- One more step toward removing legacy i2c interfaces by using new
device-tree based probing of platform devices for the AOA audio
drivers
- A nice series of patches from Michael Neuling that helps avoiding
confusion between register numbers and litterals in assembly code,
trying to enforce the use of "%rN" register names in gas rather
than plain numbers.
- A pile of FSL updates
- The usual bunch of small fixes, cleanups etc...
You may spot a change to drivers/char/mem. The patch got no comment
or ack from outside, it's a trivial patch to allow the architecture to
skip creating /dev/port, which we use to disable it on ppc64 that
don't have a legacy brige. On those, IO ports 0...64K are not mapped
in kernel space at all, so accesses to /dev/port cause oopses (and
yes, distros -still- ship userspace that bangs hard coded ports such
as kbdrate)."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (106 commits)
powerpc/mpic: Create a revmap with enough entries for IPIs and timers
Remove stale .rej file
powerpc/iommu: Fix iommu pool initialization
powerpc/eeh: Check handle_eeh_events() return value
powerpc/85xx: Add phy nodes in SGMII mode for MPC8536/44/72DS & P2020DS
powerpc/e500: add paravirt QEMU platform
powerpc/mpc85xx_ds: convert to unified PCI init
powerpc/fsl-pci: get PCI init out of board files
powerpc/85xx: Update corenet64_smp_defconfig
powerpc/85xx: Update corenet32_smp_defconfig
powerpc/85xx: Rename P1021RDB-PC device trees to be consistent
powerpc/watchdog: move booke watchdog param related code to setup-common.c
sound/aoa: Adapt to new i2c probing scheme
i2c/powermac: Improve detection of devices from device-tree
powerpc: Disable /dev/port interface on systems without an ISA bridge
of: Improve prom_update_property() function
powerpc: Add "memory" attribute for mfmsr()
powerpc/ftrace: Fix assembly trampoline register usage
powerpc/hw_breakpoints: Fix incorrect pointer access
powerpc: Put the gpr save/restore functions in their own section
...
Pass mount flags to sget() so that it can use them in initialising a new
superblock before the set function is called. They could also be passed to the
compare function.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a helper that abstracts out the jump to an already parsed struct path
from ->follow_link operation from procfs. Not only does this clean up
the code by moving the two sides of this game into a single helper, but
it also prepares for making struct nameidata private to namei.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently the non-nd_set_link based versions of ->follow_link are expected
to do a path_put(&nd->path) on failure. This calling convention is unexpected,
undocumented and doesn't match what the nd_set_link-based instances do.
Move the path_put out of the only non-nd_set_link based ->follow_link
instance into the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument. And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
prom_update_property() currently fails if the property doesn't
actually exist yet which isn't what we want. Change to add-or-update
instead of update-only, then we can remove a lot duplicated lines.
Suggested-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Dong Aisheng <dong.aisheng@linaro.org>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cyrill Gorcunov reports that I broke the fdinfo files with commit
30a08bf2d3 ("proc: move fd symlink i_mode calculations into
tid_fd_revalidate()"), and he's quite right.
The tid_fd_revalidate() function is not just used for the <tid>/fd
symlinks, it's also used for the <tid>/fdinfo/<fd> files, and the
permission model for those are different.
So do the dynamic symlink permission handling just for symlinks, making
the fdinfo files once more appear as the proper regular files they are.
Of course, Al Viro argued (probably correctly) that we shouldn't do the
symlink permission games at all, and make the symlinks always just be
the normal 'lrwxrwxrwx'. That would have avoided this issue too, but
since somebody noticed that the permissions had changed (which was the
reason for that original commit 30a08bf2d3 in the first place), people
do apparently use this feature.
[ Basically, you can use the symlink permission data as a cheap "fdinfo"
replacement, since you see whether the file is open for reading and/or
writing by just looking at st_mode of the symlink. So the feature
does make sense, even if the pain it has caused means we probably
shouldn't have done it to begin with. ]
Reported-and-tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we do checkpoint of a task we need to know the list of children the
task, has but there is no easy and fast way to generate reverse
parent->children chain from arbitrary <pid> (while a parent pid is
provided in "PPid" field of /proc/<pid>/status).
So instead of walking over all pids in the system (creating one big
process tree in memory, just to figure out which children a task has) --
we add explicit /proc/<pid>/task/<tid>/children entry, because the kernel
already has this kind of information but it is not yet exported.
This is a first level children, not the whole process tree.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, nonlinear mappings can not be distinguished from ordinary
mappings. This patch adds into /proc/pid/smaps line "Nonlinear: <size>
kB", where size is amount of nonlinear ptes in vma, this line appears only
if VM_NONLINEAR is set. This information may be useful not only for
checkpoint/restore project.
Requested by Pavel Emelyanov.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is an implementation of Andrew's proposal to extend the pagemap file
bits to report what is missing about tasks' working set.
The problem with the working set detection is multilateral. In the criu
(checkpoint/restore) project we dump the tasks' memory into image files
and to do it properly we need to detect which pages inside mappings are
really in use. The mincore syscall I though could help with this did not.
First, it doesn't report swapped pages, thus we cannot find out which
parts of anonymous mappings to dump. Next, it does report pages from page
cache as present even if they are not mapped, and it doesn't make that has
not been cow-ed.
Note, that issue with swap pages is critical -- we must dump swap pages to
image file. But the issues with file pages are optimization -- we can
take all file pages to image, this would be correct, but if we know that a
page is not mapped or not cow-ed, we can remove them from dump file. The
dump would still be self-consistent, though significantly smaller in size
(up to 10 times smaller on real apps).
Andrew noticed, that the proc pagemap file solved 2 of 3 above issues --
it reports whether a page is present or swapped and it doesn't report not
mapped page cache pages. But, it doesn't distinguish cow-ed file pages
from not cow-ed.
I would like to make the last unused bit in this file to report whether the
page mapped into respective pte is PageAnon or not.
[comment stolen from Pavel Emelyanov's v1 patch]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- use int fpr priority and nice, since task_nice()/task_prio() return that
- field 24: get_mm_rss() returns unsigned long
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rcu_read_lock()/rcu_read_unlock() is nop for TINY_RCU, but is not a nop
for, say, PREEMPT_RCU.
proc_fill_cache() is called without RCU lock, there is no need to
lock/unlock on error path, simply jump out of the loop.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similar to e268337dfe ("proc: clean up and fix /proc/<pid>/mem
handling"), move the check of permission to open(), this will simplify
read() code.
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The oom_score_adj scale ranges from -1000 to 1000 and represents the
proportion of memory available to the process at allocation time. This
means an oom_score_adj value of 300, for example, will bias a process as
though it was using an extra 30.0% of available memory and a value of
-350 will discount 35.0% of available memory from its usage.
The oom killer badness heuristic also uses this scale to report the oom
score for each eligible process in determining the "best" process to
kill. Thus, it can only differentiate each process's memory usage by
0.1% of system RAM.
On large systems, this can end up being a large amount of memory: 256MB
on 256GB systems, for example.
This can be fixed by having the badness heuristic to use the actual
memory usage in scoring threads and then normalizing it to the
oom_score_adj scale for userspace. This results in better comparison
between eligible threads for kill and no change from the userspace
perspective.
Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull writeback tree from Wu Fengguang:
"Mainly from Jan Kara to avoid iput() in the flusher threads."
* tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Avoid iput() from flusher thread
vfs: Rename end_writeback() to clear_inode()
vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
writeback: Refactor writeback_single_inode()
writeback: Remove wb->list_lock from writeback_single_inode()
writeback: Separate inode requeueing after writeback
writeback: Move I_DIRTY_PAGES handling
writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
writeback: Move clearing of I_SYNC into inode_sync_complete()
writeback: initialize global_dirty_limit
fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
mm: page-writeback.c: local functions should not be exposed globally
Pull user namespace enhancements from Eric Biederman:
"This is a course correction for the user namespace, so that we can
reach an inexpensive, maintainable, and reasonably complete
implementation.
Highlights:
- Config guards make it impossible to enable the user namespace and
code that has not been converted to be user namespace safe.
- Use of the new kuid_t type ensures the if you somehow get past the
config guards the kernel will encounter type errors if you enable
user namespaces and attempt to compile in code whose permission
checks have not been updated to be user namespace safe.
- All uids from child user namespaces are mapped into the initial
user namespace before they are processed. Removing the need to add
an additional check to see if the user namespace of the compared
uids remains the same.
- With the user namespaces compiled out the performance is as good or
better than it is today.
- For most operations absolutely nothing changes performance or
operationally with the user namespace enabled.
- The worst case performance I could come up with was timing 1
billion cache cold stat operations with the user namespace code
enabled. This went from 156s to 164s on my laptop (or 156ns to
164ns per stat operation).
- (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
Most uid/gid setting system calls treat these value specially
anyway so attempting to use -1 as a uid would likely cause
entertaining failures in userspace.
- If setuid is called with a uid that can not be mapped setuid fails.
I have looked at sendmail, login, ssh and every other program I
could think of that would call setuid and they all check for and
handle the case where setuid fails.
- If stat or a similar system call is called from a context in which
we can not map a uid we lie and return overflowuid. The LFS
experience suggests not lying and returning an error code might be
better, but the historical precedent with uids is different and I
can not think of anything that would break by lying about a uid we
can't map.
- Capabilities are localized to the current user namespace making it
safe to give the initial user in a user namespace all capabilities.
My git tree covers all of the modifications needed to convert the core
kernel and enough changes to make a system bootable to runlevel 1."
Fix up trivial conflicts due to nearby independent changes in fs/stat.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
userns: Silence silly gcc warning.
cred: use correct cred accessor with regards to rcu read lock
userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
userns: Convert cgroup permission checks to use uid_eq
userns: Convert tmpfs to use kuid and kgid where appropriate
userns: Convert sysfs to use kgid/kuid where appropriate
userns: Convert sysctl permission checks to use kuid and kgids.
userns: Convert proc to use kuid/kgid where appropriate
userns: Convert ext4 to user kuid/kgid where appropriate
userns: Convert ext3 to use kuid/kgid where appropriate
userns: Convert ext2 to use kuid/kgid where appropriate.
userns: Convert devpts to use kuid/kgid where appropriate
userns: Convert binary formats to use kuid/kgid where appropriate
userns: Add negative depends on entries to avoid building code that is userns unsafe
userns: signal remove unnecessary map_cred_ns
userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
userns: Convert stat to return values mapped from kuids and kgids
userns: Convert user specfied uids and gids in chown into kuids and kgid
userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
...