This patch renames mpol_copy() to mpol_dup() because, well, that's what it
does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
existing mempolicy, allocates a new one and copies the contents.
In a later patch, I want to use the name mpol_copy() to copy the contents from
one mempolicy to another like, e.g., strcpy() does for strings.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a change that was requested some time ago by Mel Gorman. Makes sense
to me, so here it is.
Note: I retain the name "mpol_free_shared_policy()" because it actually does
free the shared_policy, which is NOT a reference counted object. However, ...
The mempolicy object[s] referenced by the shared_policy are reference counted,
so mpol_put() is used to release the reference held by the shared_policy. The
mempolicy might not be freed at this time, because some task attached to the
shared object associated with the shared policy may be in the process of
allocating a page based on the mempolicy. In that case, the task performing
the allocation will hold a reference on the mempolicy, obtained via
mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
holding such a reference have called mpol_put() for the mempolicy.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The SIE instruction on s390 uses the 2nd half of the page table page to
virtualize the storage keys of a guest. This patch offers the s390_enable_sie
function, which reorganizes the page tables of a single-threaded process to
reserve space in the page table:
s390_enable_sie makes sure that the process is single threaded and then uses
dup_mm to create a new mm with reorganized page tables. The old mm is freed
and the process has now a page status extended field after every page table.
Code that wants to exploit pgstes should SELECT CONFIG_PGSTE.
This patch has a small common code hit, namely making dup_mm non-static.
Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's
review feedback. Now we do have the prototype for dup_mm in
include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() does now
call task_lock() to prevent race against ptrace modification of mm_users.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Arrgghhh...
Sorry about that, I'd been sure I'd folded that one, but it actually got
lost. Please apply - that breaks execve().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* let unshare_files() give caller the displaced files_struct
* don't bother with grabbing reference only to drop it in the
caller if it hadn't been shared in the first place
* in that form unshare_files() is trivially implemented via
unshare_fd(), so we eliminate the duplicate logics in fork.c
* reset_files_struct() is not just only called for current;
it will break the system if somebody ever calls it for anything
else (we can't modify ->files of somebody else). Lose the
task_struct * argument.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* unshare_files() can fail; doing it after irreversible actions is wrong
and de_thread() is certainly irreversible.
* since we do it unconditionally anyway, we might as well do it in do_execve()
and save ourselves the PITA in binfmt handlers, etc.
* while we are at it, binfmt_som actually leaked files_struct on failure.
As a side benefit, unshare_files(), put_files_struct() and reset_files_struct()
become unexported.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Split the FPU save area from the task struct. This allows easy migration
of FPU context, and it's generally cleaner. It also allows the following
two optimizations:
1) only allocate when the application actually uses FPU, so in the first
lazy FPU trap. This could save memory for non-fpu using apps. Next patch
does this lazy allocation.
2) allocate the right size for the actual cpu rather than 512 bytes always.
Patches enabling xsave/xrstor support (coming shortly) will take advantage
of this.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Some time ago the xxx_vnr() calls (e.g. pid_vnr or find_task_by_vpid) were
_all_ converted to operate on the current pid namespace. After this each call
like xxx_nr_ns(foo, current->nsproxy->pid_ns) is nothing but a xxx_vnr(foo)
one.
Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where
appropriate.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Reviewed-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
signal_struct->tsk points to the ->group_leader and thus we have the nasty
code in de_thread() which has to change it and restart ->real_timer if the
leader is changed.
Use "struct pid *leader_pid" instead. This also allows us to kill now
unneeded send_group_sig_info().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. It is much easier to grep for ->state change if __set_task_state() is used
instead of the direct assignment.
2. ptrace_stop() and handle_group_stop() use set_task_state() which adds the
unneeded mb() (btw even if we use mb() it is still possible that do_wait()
sees the new ->state but not ->exit_code, but this is ok).
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The capability bounding set is a set beyond which capabilities cannot grow.
Currently cap_bset is per-system. It can be manipulated through sysctl,
but only init can add capabilities. Root can remove capabilities. By
default it includes all caps except CAP_SETPCAP.
This patch makes the bounding set per-process when file capabilities are
enabled. It is inherited at fork from parent. Noone can add elements,
CAP_SETPCAP is required to remove them.
One example use of this is to start a safer container. For instance, until
device namespaces or per-container device whitelists are introduced, it is
best to take CAP_MKNOD away from a container.
The bounding set will not affect pP and pE immediately. It will only
affect pP' and pE' after subsequent exec()s. It also does not affect pI,
and exec() does not constrain pI'. So to really start a shell with no way
of regain CAP_MKNOD, you would do
prctl(PR_CAPBSET_DROP, CAP_MKNOD);
cap_t cap = cap_get_proc();
cap_value_t caparray[1];
caparray[0] = CAP_MKNOD;
cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
cap_set_proc(cap);
cap_free(cap);
The following test program will get and set the bounding
set (but not pI). For instance
./bset get
(lists capabilities in bset)
./bset drop cap_net_raw
(starts shell with new bset)
(use capset, setuid binary, or binary with
file capabilities to try to increase caps)
************************************************************
cap_bound.c
************************************************************
#include <sys/prctl.h>
#include <linux/capability.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifndef PR_CAPBSET_READ
#define PR_CAPBSET_READ 23
#endif
#ifndef PR_CAPBSET_DROP
#define PR_CAPBSET_DROP 24
#endif
int usage(char *me)
{
printf("Usage: %s get\n", me);
printf(" %s drop <capability>\n", me);
return 1;
}
#define numcaps 32
char *captable[numcaps] = {
"cap_chown",
"cap_dac_override",
"cap_dac_read_search",
"cap_fowner",
"cap_fsetid",
"cap_kill",
"cap_setgid",
"cap_setuid",
"cap_setpcap",
"cap_linux_immutable",
"cap_net_bind_service",
"cap_net_broadcast",
"cap_net_admin",
"cap_net_raw",
"cap_ipc_lock",
"cap_ipc_owner",
"cap_sys_module",
"cap_sys_rawio",
"cap_sys_chroot",
"cap_sys_ptrace",
"cap_sys_pacct",
"cap_sys_admin",
"cap_sys_boot",
"cap_sys_nice",
"cap_sys_resource",
"cap_sys_time",
"cap_sys_tty_config",
"cap_mknod",
"cap_lease",
"cap_audit_write",
"cap_audit_control",
"cap_setfcap"
};
int getbcap(void)
{
int comma=0;
unsigned long i;
int ret;
printf("i know of %d capabilities\n", numcaps);
printf("capability bounding set:");
for (i=0; i<numcaps; i++) {
ret = prctl(PR_CAPBSET_READ, i);
if (ret < 0)
perror("prctl");
else if (ret==1)
printf("%s%s", (comma++) ? ", " : " ", captable[i]);
}
printf("\n");
return 0;
}
int capdrop(char *str)
{
unsigned long i;
int found=0;
for (i=0; i<numcaps; i++) {
if (strcmp(captable[i], str) == 0) {
found=1;
break;
}
}
if (!found)
return 1;
if (prctl(PR_CAPBSET_DROP, i)) {
perror("prctl");
return 1;
}
return 0;
}
int main(int argc, char *argv[])
{
if (argc<2)
return usage(argv[0]);
if (strcmp(argv[1], "get")==0)
return getbcap();
if (strcmp(argv[1], "drop")!=0 || argc<3)
return usage(argv[0]);
if (capdrop(argv[2])) {
printf("unknown capability\n");
return 1;
}
return execl("/bin/bash", "/bin/bash", NULL);
}
************************************************************
[serue@us.ibm.com: fix typo]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>a
Signed-off-by: "Serge E. Hallyn" <serue@us.ibm.com>
Tested-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(with Martin Schwidefsky <schwidefsky@de.ibm.com>)
The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
first argument. The free functions do not get the mm_struct argument. This
is 1) asymmetrical and 2) to do mm related page table allocations the mm
argument is needed on the free function as well.
[kamalesh@linux.vnet.ibm.com: i386 fix]
[akpm@linux-foundation.org: coding-syle fixes]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ulrich says that we never used this clone flags and that nothing should be
using it.
As we're down to only a single bit left in clone's flags argument, let's add a
warning to check that no userspace is actually using it. Hopefully we will
be able to recycle it.
Roland said:
CLONE_STOPPED was previously used by some NTPL versions when under
thread_db (i.e. only when being actively debugged by gdb), but not for a
long time now, and it never worked reliably when it was used. Removing it
seems fine to me.
[akpm@linux-foundation.org: it looks like CLONE_DETACHED is being used]
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syslets (or other threads/processes that want io context sharing) can
set this to enforce sharing of io context.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
LatencyTOP kernel infrastructure; it measures latencies in the
scheduler and tracks it system wide and per process.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>