Commit Graph

49 Commits

Author SHA1 Message Date
Balbir Singh
49048622ea sched: fix process time monotonicity
Spencer reported a problem where utime and stime were going negative despite
the fixes in commit b27f03d4bd. The suspected
reason for the problem is that signal_struct maintains it's own utime and
stime (of exited tasks), these are not updated using the new task_utime()
routine, hence sig->utime can go backwards and cause the same problem
to occur (sig->utime, adds tsk->utime and not task_utime()). This patch
fixes the problem

TODO: using max(task->prev_utime, derived utime) works for now, but a more
generic solution is to implement cputime_max() and use the cputime_gt()
function for comparison.

Reported-by: spencer@bluehost.com
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-05 18:14:35 +02:00
Roland McGrath
0d094efeb1 tracehook: tracehook_tracer_task
This adds the tracehook_tracer_task() hook to consolidate all forms of
"Who is using ptrace on me?" logic.  This is used for "TracerPid:" in
/proc and for permission checks.  We also clean up the selinux code the
called an identical accessor.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 12:00:08 -07:00
Andrew G. Morgan
ca05a99a54 capabilities: remain source compatible with 32-bit raw legacy capability support.
Source code out there hard-codes a notion of what the
_LINUX_CAPABILITY_VERSION #define means in terms of the semantics of the
raw capability system calls capget() and capset().  Its unfortunate, but
true.

Since the confusing header file has been in a released kernel, there is
software that is erroneously using 64-bit capabilities with the semantics
of 32-bit compatibilities.  These recently compiled programs may suffer
corruption of their memory when sys_getcap() overwrites more memory than
they are coded to expect, and the raising of added capabilities when using
sys_capset().

As such, this patch does a number of things to clean up the situation
for all. It

  1. forces the _LINUX_CAPABILITY_VERSION define to always retain its
     legacy value.

  2. adopts a new #define strategy for the kernel's internal
     implementation of the preferred magic.

  3. deprecates v2 capability magic in favor of a new (v3) magic
     number. The functionality of v3 is entirely equivalent to v2,
     the only difference being that the v2 magic causes the kernel
     to log a "deprecated" warning so the admin can find applications
     that may be using v2 inappropriately.

[User space code continues to be encouraged to use the libcap API which
protects the application from details like this.  libcap-2.10 is the first
to support v3 capabilities.]

Fixes issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=447518.
Thanks to Bojan Smojver for the report.

[akpm@linux-foundation.org: s/depreciate/deprecate/g]
[akpm@linux-foundation.org: be robust about put_user size]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Bojan Smojver <bojan@rexursive.com>
Cc: stable@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2008-05-31 16:36:16 -07:00
Serge E. Hallyn
289f8e27ed capabilities: add bounding set to /proc/self/status
There is currently no way to query the bounding set of another task.  As there
appears to be no security reason not to, and as Michael Kerrisk points out the
following valid reasons to do so exist:

* consistency (I can see all of the other per-thread/process sets in
  /proc/.../status)

* debugging -- I could imagine that it would make the job of debugging an
  application that uses capabilities a little simpler.

this patch adds the bounding set to /proc/self/status right after the
effective set.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-13 08:02:24 -07:00
Al Viro
9f3acc3140 [PATCH] split linux/file.h
Initial splitoff of the low-level stuff; taken to fdtable.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-01 13:08:16 -04:00
Alan Cox
5d0fdf1e01 tty_io: fix remaining pid struct locking
This fixes the last couple of pid struct locking failures I know about.

[oleg@tv-sign.ru: clean up do_task_stat()]
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:40 -07:00
Oleg Nesterov
06fffb1267 do_task_stat: don't take rcu_read_lock()
lock_task_sighand() was changed, and do_task_stat() doesn't need
rcu_read_lock any longer.  sighand->siglock protects all "interesting"
fields.

Except: it doesn't protect ->tty->pgrp, but neither does rcu_read_lock(), this
should be fixed.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Pavel Emelyanov <xemul@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:34 -07:00
Eric W. Biederman
df5f8314ca proc: seqfile convert proc_pid_status to properly handle pid namespaces
Currently we possibly lookup the pid in the wrong pid namespace.  So
seq_file convert proc_pid_status which ensures the proper pid namespaces is
passed in.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: another build fix]
[akpm@linux-foundation.org: s390 build fix]
[akpm@linux-foundation.org: fix task_name() output]
[akpm@linux-foundation.org: fix nommu build]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andrew Morgan <morgan@kernel.org>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:24 -08:00
Eric W. Biederman
a56d3fc74c seqfile convert proc_pid_statm
This conversion is just for code cleanliness, uniformity, and general safety.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:24 -08:00
Eric W. Biederman
ee992744ea proc: rewrite do_task_stat to correctly handle pid namespaces.
Currently (as pointed out by Oleg) do_task_stat has a race when calling
task_pid_nr_ns with the task exiting.  In addition do_task_stat is not
currently displaying information in the context of the pid namespace that
mounted the /proc filesystem.  So "cut -d' ' -f 1 /proc/<pid>/stat" may not
equal <pid>.

This patch fixes the problem by converting to a single_open seq_file show
method.  Getting the pid namespace from the filesystem superblock instead of
current, and simply using the the struct pid from the inode instead of
attempting to get that same pid from the task.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:23 -08:00
Andrew Morgan
e338d263a7 Add 64-bit capability support to the kernel
The patch supports legacy (32-bit) capability userspace, and where possible
translates 32-bit capabilities to/from userspace and the VFS to 64-bit
kernel space capabilities.  If a capability set cannot be compressed into
32-bits for consumption by user space, the system call fails, with -ERANGE.

FWIW libcap-2.00 supports this change (and earlier capability formats)

 http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/

[akpm@linux-foundation.org: coding-syle fixes]
[akpm@linux-foundation.org: use get_task_comm()]
[ezk@cs.sunysb.edu: build fix]
[akpm@linux-foundation.org: do not initialise statics to 0 or NULL]
[akpm@linux-foundation.org: unused var]
[serue@us.ibm.com: export __cap_ symbols]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: James Morris <jmorris@namei.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:20 -08:00
Linus Torvalds
75659ca0c1 Merge branch 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc
* 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: (22 commits)
  Remove commented-out code copied from NFS
  NFS: Switch from intr mount option to TASK_KILLABLE
  Add wait_for_completion_killable
  Add wait_event_killable
  Add schedule_timeout_killable
  Use mutex_lock_killable in vfs_readdir
  Add mutex_lock_killable
  Use lock_page_killable
  Add lock_page_killable
  Add fatal_signal_pending
  Add TASK_WAKEKILL
  exit: Use task_is_*
  signal: Use task_is_*
  sched: Use task_contributes_to_load, TASK_ALL and TASK_NORMAL
  ptrace: Use task_is_*
  power: Use task_is_*
  wait: Use TASK_NORMAL
  proc/base.c: Use task_is_*
  proc/array.c: Use TASK_REPORT
  perfmon: Use task_is_*
  ...

Fixed up conflicts in NFS/sunrpc manually..
2008-02-01 11:45:47 +11:00
Oleg Nesterov
a98fdcef94 fix the "remove task_ppid_nr_ns" commit
Commit 84427eaef1 (remove task_ppid_nr_ns)
moved the task_tgid_nr_ns(task->real_parent) outside of lock_task_sighand().
This is wrong, ->real_parent could be freed/reused.

Both ->parent/real_parent point to nothing after __exit_signal() because
we remove the child from ->children list, and thus the child can't be
reparented when its parent exits.

rcu_read_lock() protects ->parent/real_parent, but _only_ if we know it was
valid before we take rcu lock.

Revert this part of the patch.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-14 13:23:00 -08:00
Roland McGrath
84427eaef1 remove task_ppid_nr_ns
task_ppid_nr_ns is called in three places.  One of these should never
have called it.  In the other two, using it broke the existing
semantics.  This was presumably accidental.  If the function had not
been there, it would have been much more obvious to the eye that those
patches were changing the behavior.  We don't need this function.

In task_state, the pid of the ptracer is not the ppid of the ptracer.

In do_task_stat, ppid is the tgid of the real_parent, not its pid.
I also moved the call outside of lock_task_sighand, since it doesn't
need it.

In sys_getppid, ppid is the tgid of the real_parent, not its pid.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-13 09:56:43 -08:00
Matthew Wilcox
1587e2b188 proc/array.c: Use TASK_REPORT
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2007-12-06 17:20:28 -05:00
Ingo Molnar
08e4570a4a sched: fix prev_stime calculation
Srivatsa Vaddagiri noticed occasionally incorrect CPU usage
values in top and tracked it down to stime going below 0 in
task_stime(). Negative values are possible there due to the
sampled nature of stime/utime.

Fix suggested by Balbir Singh.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
2007-11-26 21:21:49 +01:00
Balbir Singh
9301899be7 sched: fix /proc/<PID>/stat stime/utime monotonicity, part 2
Extend Peter's patch to fix accounting issues, by keeping stime
monotonic too.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Frans Pop <elendil@planet.nl>
2007-10-30 00:26:32 +01:00
Peter Zijlstra
73a2bcb0ed sched: keep utime/stime monotonic
keep utime/stime monotonic.

cpustats use utime/stime as a ratio against sum_exec_runtime, as a
consequence it can happen - when the ratio changes faster than time
accumulates - that either can be appear to go backwards.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29 21:18:11 +01:00
Linus Torvalds
ec2626815b Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
  sched: fix guest time accounting going faster than user time accounting
2007-10-19 12:07:03 -07:00
Eugene Teo
270f722d4d Fix tsk->exit_state usage
tsk->exit_state can only be 0, EXIT_ZOMBIE, or EXIT_DEAD.  A non-zero test
is the same as tsk->exit_state & (EXIT_ZOMBIE | EXIT_DEAD), so just testing
tsk->exit_state is sufficient.

Signed-off-by: Eugene Teo <eugeneteo@kernel.sg>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:42 -07:00
Pavel Emelyanov
b488893a39 pid namespaces: changes to show virtual ids to user
This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.

The idea is:
 - all in-kernel data structures must store either struct pid itself
   or the pid's global nr, obtained with pid_nr() call;
 - when seeking the task from kernel code with the stored id one
   should use find_task_by_pid() call that works with global pids;
 - when showing pid's numerical value to the user the virtual one
   should be used, but however when one shows task's pid outside this
   task's namespace the global one is to be used;
 - when getting the pid from userspace one need to consider this as
   the virtual one and use appropriate task/pid-searching functions.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:40 -07:00
Pavel Emelianov
a47afb0f9d pid namespaces: round up the API
The set of functions process_session, task_session, process_group and
task_pgrp is confusing, as the names can be mixed with each other when looking
at the code for a long time.

The proposals are to
* equip the functions that return the integer with _nr suffix to
  represent that fact,
* and to make all functions work with task (not process) by making
  the common prefix of the same name.

For monotony the routines signal_session() and set_signal_session() are
replaced with task_session_nr() and set_task_session(), especially since they
are only used with the explicit task->signal dereference.

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:37 -07:00
Christian Borntraeger
f9e26291be sched: fix guest time accounting going faster than user time accounting
cputime_add already adds, dont do it twice.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-19 20:52:40 +02:00
Laurent Vivier
9ac52315d4 sched: guest CPU accounting: add guest-CPU /proc/<pid>/stat fields
like for cpustat, introduce the "gtime" (guest time of the task) and
"cgtime" (guest time of the task children) fields for the
tasks. Modify signal_struct and task_struct.

Modify /proc/<pid>/stat to display these new fields.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Acked-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:19 +02:00
Christian Borntraeger
efe567fc82 sched: accounting regression since rc1
Fix the accounting regression for CONFIG_VIRT_CPU_ACCOUNTING.  It
reverts parts of commit b27f03d4bd by
converting fs/proc/array.c back to cputime_t.  The new functions
task_utime and task_stime now return cputime_t instead of clock_t.  If
CONFIG_VIRT_CPU_ACCOUTING is set, task->utime and task->stime are
returned directly instead of using sum_exec_runtime.

Patch is tested on s390x with and without VIRT_CPU_ACCOUTING as well as
on i386.

[ mingo@elte.hu: cleanups, comments. ]

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-23 15:18:02 +02:00