Commit Graph

126 Commits

Author SHA1 Message Date
Ingo Molnar
da84d96176 sched: reintroduce cache-hot affinity
reintroduce a simplified version of cache-hot/cold scheduling
affinity. This improves performance with certain SMP workloads,
such as sysbench.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:18 +02:00
Peter Zijlstra
5f6d858ecc sched: speed up and simplify vslice calculations
speed up and simplify vslice calculations.

[ From: Mike Galbraith <efault@gmx.de>: build fix ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:12 +02:00
Ingo Molnar
e22f5bbf86 sched: remove wait_runtime limit
remove the wait_runtime-limit fields and the code depending on it, now
that the math has been changed over to rely on the vruntime metric.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:06 +02:00
Ingo Molnar
8ebc91d936 sched: remove stat_gran
remove the stat_gran code - it was disabled by default and it causes
unnecessary overhead.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:03 +02:00
Al Viro
2b8232ce51 minimal build fixes for uml (fallout from x86 merge)
a) include/asm-um/arch can't just point to include/asm-$(SUBARCH) now
 b) arch/{i386,x86_64}/crypto are merged now
 c) subarch-obj needed changes
 d) cpufeature_64.h should pull "cpufeature_32.h", not <asm/cpufeature_32.h>
    since it can be included from asm-um/cpufeature.h
 e) in case of uml-i386 we need CONFIG_X86_32 for make and gcc, but not
    for Kconfig
 f) sysctl.c shouldn't do vdso_enabled for uml-i386 (actually, that one
    should be registered from corresponding arch/*/kernel/*, with ifdef
    going away; that's a separate patch, though).

With that and with Stephen's patch ("[PATCH net-2.6] uml: hard_header fix")
we have uml allmodconfig building both on i386 and amd64.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-13 09:57:15 -07:00
Olof Johansson
d0c3d534a4 [POWERPC] Implement logging of unhandled signals
Implement show_unhandled_signals sysctl + support to print when a process
is killed due to unhandled signals just as i386 and x86_64 does.

Default to having it off, unlike x86 that defaults on.

Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-10-12 14:05:18 +10:00
Ingo Molnar
1799e35d5b sched: add /proc/sys/kernel/sched_compat_yield
add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield()
more agressive, by moving the yielding task to the last position
in the rbtree.

with sched_compat_yield=0:

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  2539 mingo     20   0  1576  252  204 R   50  0.0   0:02.03 loop_yield
  2541 mingo     20   0  1576  244  196 R   50  0.0   0:02.05 loop

with sched_compat_yield=1:

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  2584 mingo     20   0  1576  248  196 R   99  0.0   0:52.45 loop
  2582 mingo     20   0  1576  256  204 R    0  0.0   0:00.00 loop_yield

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-09-19 23:34:46 +02:00
Ingo Molnar
172ac3dbb7 sched: cleanup, sched_granularity -> sched_min_granularity
due to adaptive granularity scheduling the role of sched_granularity
has changed to "minimum granularity", so rename the variable (and the
tunable) accordingly.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-08-25 18:41:53 +02:00
Peter Zijlstra
218050855e sched: adaptive scheduler granularity
Instead of specifying the preemption granularity, specify the wanted
latency. By fixing the granlarity to a constany the wakeup latency
it a function of the number of running tasks on the rq.

Invert this relation.

sysctl_sched_granularity becomes a minimum for the dynamic granularity
computed from the new sysctl_sched_latency.

Then use this latency to do more intelligent granularity decisions: if
there are fewer tasks running then we can schedule coarser. This helps
performance while still always keeping the latency target.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-25 18:41:53 +02:00
Peter Zijlstra
1fc84aaae3 sched: fix CONFIG_SCHED_DEBUG dependency of lockdep sysctls
Make the lockdep sysctls not depend on CONFIG_SCHED_DEBUG.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-25 18:41:52 +02:00
Christian Heim
e598fbaabd Remove double inclusion of linux/capability.h
Remove the second inclusion of linux/capability.h, which has been
introduced with "[PATCH] move capable() to capability.h" (commit
c59ede7b78)

Signed-off-by: Christian Heim <phreak@gentoo.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-19 10:12:32 -07:00
Lee Schermerhorn
8daec965e7 Fix missing numa_zonelist_order sysctl
Misplaced #endif is hiding the numa_zonelist_order sysctl when !SECURITY.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-11 15:47:40 -07:00
Len Brown
673d5b43da ACPI: restore CONFIG_ACPI_SLEEP
Restore the 2.6.22 CONFIG_ACPI_SLEEP build option, but now shadowing the
new CONFIG_PM_SLEEP option.

Signed-off-by: Len Brown <len.brown@intel.com>
[ Modified to work with the PM config setup changes. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-29 16:53:59 -07:00
Len Brown
e8b2fd0122 ACPI: Kconfig: remove CONFIG_ACPI_SLEEP from source
As it was a synonym for (CONFIG_ACPI && CONFIG_X86),
the ifdefs for it were more clutter than they were worth.

For ia64, just add a few stubs in anticipation of future
S3 or S4 support.

Signed-off-by: Len Brown <len.brown@intel.com>
2007-07-25 01:29:39 -04:00
Masoud Asgharifard Sharbiani
abd4f7505b x86: i386-show-unhandled-signals-v3
This patch makes the i386 behave the same way that x86_64 does when a
segfault happens.  A line gets printed to the kernel log so that tools
that need to check for failures can behave more uniformly between
debug.show_unhandled_signals sysctl variable to 0 (or by doing echo 0 >
/proc/sys/debug/exception-trace)

Also, all of the lines being printed are now using printk_ratelimit() to
deny the ability of DoS from a local user with a program like the
following:

main()
{
       while (1)
               if (!fork()) *(int *)0 = 0;
}

This new revision also includes the fix that Andrew did which got rid of
new sysctl that was added to the system in earlier versions of this.
Also, 'show-unhandled-signals' sysctl has been renamed back to the old
'exception-trace' to avoid breakage of people's scripts.

AK: Enabling by default for i386 will be likely controversal, but let's see what happens
AK: Really folks, before complaining just fix your segfaults
AK: I bet this will find a lot of silent issues

Signed-off-by: Masoud Sharbiani <masouds@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
[ Personally, I've found the complaints useful on x86-64, so I'm all for
  this. That said, I wonder if we could do it more prettily..   -Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-22 11:03:37 -07:00
Andrew Morton
ed2c12f323 kernel/sysctl.c: finish off the warning comments
I've been chasing these comments around this file all week.  Hopefully we're
straight now.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:57 -07:00
Peter Zijlstra
f20786ff4d lockstat: core infrastructure
Introduce the core lock statistics code.

Lock statistics provides lock wait-time and hold-time (as well as the count
of corresponding contention and acquisitions events). Also, the first few
call-sites that encounter contention are tracked.

Lock wait-time is the time spent waiting on the lock. This provides insight
into the locking scheme, that is, a heavily contended lock is indicative of
a too coarse locking scheme.

Lock hold-time is the duration the lock was held, this provides a reference for
the wait-time numbers, so they can be put into perspective.

  1)
    lock
  2)
    ... do stuff ..
    unlock
  3)

The time between 1 and 2 is the wait-time. The time between 2 and 3 is the
hold-time.

The lockdep held-lock tracking code is reused, because it already collects locks
into meaningful groups (classes), and because it is an existing infrastructure
for lock instrumentation.

Currently lockdep tracks lock acquisition with two hooks:

  lock()
    lock_acquire()
    _lock()

 ... code protected by lock ...

  unlock()
    lock_release()
    _unlock()

We need to extend this with two more hooks, in order to measure contention.

  lock_contended() - used to measure contention events
  lock_acquired()  - completion of the contention

These are then placed the following way:

  lock()
    lock_acquire()
    if (!_try_lock())
      lock_contended()
      _lock()
      lock_acquired()

 ... do locked stuff ...

  unlock()
    lock_release()
    _unlock()

(Note: the try_lock() 'trick' is used to avoid instrumenting all platform
       dependent lock primitive implementations.)

It is also possible to toggle the two lockdep features at runtime using:

  /proc/sys/kernel/prove_locking
  /proc/sys/kernel/lock_stat

(esp. turning off the O(n^2) prove_locking functionaliy can help)

[akpm@linux-foundation.org: build fixes]
[akpm@linux-foundation.org: nuke unneeded ifdefs]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:49 -07:00
Kawai, Hidehiro
76fdbb25f9 coredump masking: bound suid_dumpable sysctl
This patch series is version 5 of the core dump masking feature, which
controls which VMAs should be dumped based on their memory types and
per-process flags.

I adopted most of Andrew's suggestion at the previous version.  He also
suggested using system call instead of /proc/<pid>/ interface, I decided to
use the latter continuously because adding new system call with pid argument
will give a big impact on the kernel.

You can access the per-process flags via /proc/<pid>/coredump_filter
interface.  coredump_filter represents a bitmask of memory types, and if a bit
is set, VMAs of corresponding memory type are written into a core file when
the process is dumped.  The bitmask is inherited from the parent process when
a process is created.

The original purpose is to avoid longtime system slowdown when a number of
processes which share a huge shared memory are dumped at the same time.  To
achieve this purpose, this patch series adds an ability to suppress dumping
anonymous shared memory for specified processes.  In this version, three other
memory types are also supported.

Here are the coredump_filter bits:
  bit 0: anonymous private memory
  bit 1: anonymous shared memory
  bit 2: file-backed private memory
  bit 3: file-backed shared memory

The default value of coredump_filter is 0x3.  This means the new core dump
routine has the same behavior as conventional behavior by default.

In this version, coredump_filter bits and mm.dumpable are merged into
mm.flags, and it is accessed by atomic bitops.

The supported core file formats are ELF and ELF-FDPIC.  ELF has been tested,
but ELF-FDPIC has not been built and tested because I don't have the test
environment.

This patch limits a value of suid_dumpable sysctl to the range of 0 to 2.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:46 -07:00
Peter Zijlstra
bdf4c48af2 audit: rework execve audit
The purpose of audit_bprm() is to log the argv array to a userspace daemon at
the end of the execve system call.  Since user-space hasn't had time to run,
this array is still in pristine state on the process' stack; so no need to
copy it, we can just grab it from there.

In order to minimize the damage to audit_log_*() copy each string into a
temporary kernel buffer first.

Currently the audit code requires that the full argument vector fits in a
single packet.  So currently it does clip the argv size to a (sysctl) limit,
but only when execve auditing is enabled.

If the audit protocol gets extended to allow for multiple packets this check
can be removed.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ollie Wild <aaw@google.com>
Cc: <linux-audit@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:45 -07:00
Pavel Machek
77afcf78a2 PM: Integrate beeping flag with existing acpi_sleep flags
Move "debug during resume from s2ram" into the variable we already use
for real-mode flags to simplify code. It also closes nasty trap for
the user in acpi_sleep_setup; order of parameters actually mattered there,
acpi_sleep=s3_bios,s3_mode doing something different from
acpi_sleep=s3_mode,s3_bios.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:43 -07:00
Jeremy Fitzhardinge
10a0a8d4e3 Add common orderly_poweroff()
Various pieces of code around the kernel want to be able to trigger an
orderly poweroff.  This pulls them together into a single
implementation.

By default the poweroff command is /sbin/poweroff, but it can be set
via sysctl: kernel/poweroff_cmd.  This is split at whitespace, so it
can include command-line arguments.

This patch replaces four other instances of invoking either "poweroff"
or "shutdown -h now": two sbus drivers, and acpi thermal
management.

sparc64 has its own "powerd"; still need to determine whether it should
be replaced by orderly_poweroff().

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Len Brown <lenb@kernel.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David S. Miller <davem@davemloft.net>
2007-07-18 08:47:40 -07:00
Adrian Bunk
62239ac2b3 proper prototype for proc_nr_files()
Add a proper prototype for proc_nr_files() in include/linux/fs.h

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:03 -07:00
Mel Gorman
396faf0303 Allow huge page allocations to use GFP_HIGH_MOVABLE
Huge pages are not movable so are not allocated from ZONE_MOVABLE.  However,
as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it
can be used to satisfy hugepage allocations even when the system has been
running a long time.  This allows an administrator to resize the hugepage pool
at runtime depending on the size of ZONE_MOVABLE.

This patch adds a new sysctl called hugepages_treat_as_movable.  When a
non-zero value is written to it, future allocations for the huge page pool
will use ZONE_MOVABLE.  Despite huge pages being non-movable, we do not
introduce additional external fragmentation of note as huge pages are always
the largest contiguous block we care about.

[akpm@linux-foundation.org: various fixes]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:22:59 -07:00
Linus Torvalds
7144521f5a Remove duplicate comments from sysctl.c
Randy Dunlap noticed that the recent comment clarifications from Andrew
had somehow gotten duplicated.  Quoth Andrew: "hm, that could have been
some late-night reject-fixing."

Fix it up.

Cc: From: Andrew Morton <akpm@linux-foundation.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 11:50:38 -07:00
Andrew Morton
2be7fe075a sysctl.c: add text telling people to use CTL_UNNUMBERED
Hopefully this will help people to understand the new regime.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:49 -07:00