I noticed a bug on the process accounting facility. In multi-threading
process, some data would be recorded incorrectly when the group_leader dies
earlier than one or more threads. The attached patch fixes this problem.
See below. 'bugacct' is a test program that create a worker thread after 4
seconds sleeping, then the group_leader dies soon. The worker thread
consume CPU/Memory for 6 seconds, then exit. We can estimate 10 seconds as
etime and 6 seconds as stime + utime. This is a sample program which the
group_leader dies earlier than other threads.
The results of same binary execution on different kernel are below.
-- accounted records --------------------
| btime | utime | stime | etime | minflt | majflt | comm |
original | 13:16:40 | 0.00 | 0.00 | 6.10 | 171 | 0 | bugacct |
patched | 13:20:21 | 5.83 | 0.18 | 10.03 | 32776 | 0 | bugacct |
(*) bugacct allocates 128MB memory, thus 128MB / 4KB = 32768 of minflt is
appropriate.
-- Test results in original kernel ------
$ date; time -p ./bugacct
Tue Mar 28 13:16:36 JST 2006 <- But pacct said btime is 13:16:40
real 10.11 <- But pacct said etime is 6.10
user 5.96 <- But pacct said utime is 0.00
sys 0.14 <- But pacct said stime is 0.00
$
-- Test results in patched kernel -------
$ date; time -p ./bugacct
Tue Mar 28 13:20:21 JST 2006
real 10.04
user 5.83
sys 0.19
$
In the original 2.6.16 kernel, pacct records btime, utime, stime, etime and
minflt incorrectly. In my opinion, this problem is caused by an assumption
that group_leader dies last.
The following section calculates process running time for etime and btime.
But it means running time of the thread that dies last, not process. The
start_time of the first thread in the process (group_leader) should be
reduced from uptime to calculate etime and btime correctly.
---- do_acct_process() in kernel/acct.c:
/* calculate run_time in nsec*/
do_posix_clock_monotonic_gettime(&uptime);
run_time = (u64)uptime.tv_sec*NSEC_PER_SEC + uptime.tv_nsec;
run_time -= (u64)current->start_time.tv_sec*NSEC_PER_SEC
+ current->start_time.tv_nsec;
----
The following section calculates stime and utime of the process.
But it might count the utime and stime of the group_leader duplicatly
and ignore the utime and stime of the thread dies last, when one or
more threads remain after group_leader dead.
The ac_utime should be calculated as the sum of the signal->utime
and utime of the thread dies last. The ac_stime should be done also.
---- do_acct_process() in kernel/acct.c:
jiffies = cputime_to_jiffies(cputime_add(current->group_leader->utime,
current->signal->utime));
ac.ac_utime = encode_comp_t(jiffies_to_AHZ(jiffies));
jiffies = cputime_to_jiffies(cputime_add(current->group_leader->stime,
current->signal->stime));
ac.ac_stime = encode_comp_t(jiffies_to_AHZ(jiffies));
----
The part of the minflt/majflt calculation has same problem.
This patch solves those problems, I think.
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The boot cmdline is parsed in parse_early_param() and
parse_args(,unknown_bootoption).
And __setup() is used in obsolete_checksetup().
start_kernel()
-> parse_args()
-> unknown_bootoption()
-> obsolete_checksetup()
If __setup()'s callback (->setup_func()) returns 1 in
obsolete_checksetup(), obsolete_checksetup() thinks a parameter was
handled.
If ->setup_func() returns 0, obsolete_checksetup() tries other
->setup_func(). If all ->setup_func() that matched a parameter returns 0,
a parameter is seted to argv_init[].
Then, when runing /sbin/init or init=app, argv_init[] is passed to the app.
If the app doesn't ignore those arguments, it will warning and exit.
This patch fixes a wrong usage of it, however fixes obvious one only.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since base and new_base are of the same type now, we can save one 'if'
branch and simplify the code a bit.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Commit a4a6198b80:
[PATCH] tvec_bases too large for per-cpu data
introduced "struct tvec_t_base_s boot_tvec_bases" which is visible at
compile time. This means we can kill __init_timer_base and move
timer_base_s's content into tvec_t_base_s.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
strace /bin/bash misbehaves after resume; this fixes it.
(akpm: it's scary calling refrigerator() in state TASK_TRACED, but it seems to
do the right thing).
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
send_sigqueue() checks PF_EXITING, then locks p->sighand->siglock. This is
unsafe: 'p' can exit in between and set ->sighand = NULL. The race is
theoretical, the window is tiny and irqs are disabled by the caller, so I
don't think we need the fix for -stable tree.
Convert send_sigqueue() to use lock_task_sighand() helper.
Also, delete 'p->flags & PF_EXITING' re-check, it is unneeded and the
comment is wrong.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The previous patch has changed callsites of do_notify_parent_cldstop() so that
to_self == (->ptrace & PT_PTRACED) always (as it should be). We can remove
this parameter now.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove an obscure 'stop_count < 0' check in finish_stop(). The previous patch
made this case impossible.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
do_signal_stop() considers 'thread_group_empty()' as a special case.
This was needed to avoid taking tasklist_lock. Since this lock is
unneeded any longer, we can remove this special case and simplify
the code even more.
Also, before this patch, finish_stop() was called with stop_count == -1
for 'thread_group_empty()' case. This is not strictly wrong, but confusing
and unneeded.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move 'tsk->sighand = NULL' from cleanup_sighand() to __exit_signal(). This
makes the exit path more understandable and allows us to do
cleanup_sighand() outside of ->siglock protected section.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Eric W. Biederman wrote:
>
> Ok. SUSV3/Posix is clear, fork is atomic with respect
> to signals. Either a signal comes before or after a
> fork but not during. (See the rationale section).
> http://www.opengroup.org/onlinepubs/000095399/functions/fork.html
>
> The tasklist_lock does not stop forks from adding to a process
> group. The forks stall while the tasklist_lock is held, but a fork
> that began before we grabbed the tasklist_lock simply completes
> afterwards, and the child does not receive the signal.
This also means that SIGSTOP or sig_kernel_coredump() signal can't
be delivered to pgrp/session reliably.
With this patch copy_process() returns -ERESTARTNOINTR when it
detects a pending signal, fork() will be restarted transparently
after handling the signals.
This patch also deletes now unneeded "group_stop_count > 0" check,
copy_process() can no longer succeed while group stop in progress.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-By: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch kills PIDTYPE_TGID pid_type thus saving one hash table in
kernel/pid.c and speeding up subthreads create/destroy a bit. It is also a
preparation for the further tref/pids rework.
This patch adds 'struct list_head thread_group' to 'struct task_struct'
instead.
We don't detach group leader from PIDTYPE_PID namespace until another
thread inherits it's ->pid == ->tgid, so we are safe wrt premature
free_pidmap(->tgid) call.
Currently there are no users of find_task_by_pid_type(PIDTYPE_TGID).
Should the need arise, we can use find_task_by_pid()->group_leader.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-By: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
do_sigaction() does not need tasklist_lock anymore, we can simplify the code.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
do_signal_stop() does not need tasklist_lock anymore. So it does not need to
do misc re-checks, and we can simplify the code.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
handle_stop_signal() does not need tasklist_lock for SIG_KERNEL_STOP_MASK
signals anymore.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch moves __unhash_process() call from realease_task() to
__exit_signal(), so __detach_pid() is called with ->siglock held.
This means we don't need tasklist_lock to iterate over thread group anymore:
copy_process() was already changed to do attach_pid()
under ->siglock.
Eric's "pidhash-kill-switch_exec_pids.patch" from -mm
changed de_thread() so it doesn't touch PIDTYPE_TGID.
NOTE: de_thread() still needs some attention. It still changes task->pid
lockless. Taking ->sighand.siglock here allows to do more tasklist_lock
removals.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch reverts 'CONFIG_SMP && thread_group_empty()' optimization in
sys_times(). The reason is that the next patch breaks memory ordering which
is needed for that optimization.
tasklist_lock in sys_times() will be eliminated completely by further patch.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
__exit_signal() is private to release_task() now. I think it is better to
make it static in kernel/exit.c and export flush_sigqueue() instead - this
function is much more simple and straightforward.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Cosmetic, rename __exit_sighand to cleanup_sighand and move it close to
copy_sighand().
This matches copy_signal/cleanup_signal naming, and I think it is easier to
follow.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
__exit_signal() does important cleanups atomically under ->siglock. It is
also called from copy_process's error path. This is not good, for example we
can't move __unhash_process() under ->siglock for that reason.
We should not mix these 2 paths, just look at ugly 'if (p->sighand)' under
'bad_fork_cleanup_sighand:' label. For copy_process() case it is sufficient
to just backout copy_signal(), nothing more.
Again, nobody can see this task yet. For CLONE_THREAD case we just decrement
signal->count, otherwise nobody can see this ->signal and we can free it
lockless.
This patch assumes it is safe to do exit_thread_group_keys() without
tasklist_lock.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The only caller of exit_sighand(tsk) is copy_process's error path. We can
call __exit_sighand() directly and kill exit_sighand().
This 'tsk' was not yet registered in pid_hash[] or init_task.tasks, it has no
external references, nobody can see it, and
IF (clone_flags & CLONE_SIGHAND)
At least 'current' has a reference to ->sighand, this
means atomic_dec_and_test(sighand->count) can't be true.
ELSE
Nobody can see this ->sighand, this means we can free it
without any locking.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>