Commit Graph

1690 Commits

Author SHA1 Message Date
Paul Menage 8793d854ed Task Control Groups: make cpusets a client of cgroups
Remove the filesystem support logic from the cpusets system and makes cpusets
a cgroup subsystem

The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
passed through to the cgroup filesystem with the appropriate options to
emulate the old cpuset filesystem behaviour.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:36 -07:00
Randy Dunlap 8f731f7d83 kernel-api docbook: fix content problems
Fix kernel-api docbook contents problems.

docproc: linux-2.6.23-git13/include/asm-x86/unaligned_32.h: No such file or directory
Warning(linux-2.6.23-git13//include/linux/list.h:482): bad line: 			of list entry
Warning(linux-2.6.23-git13//mm/filemap.c:864): No description found for parameter 'ra'
Warning(linux-2.6.23-git13//block/ll_rw_blk.c:3760): No description found for parameter 'req'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'private'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'cdev'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:35 -07:00
Coly Li 3ed75eb8f1 setup vma->vm_page_prot by vm_get_page_prot()
This patch uses vm_get_page_prot() to setup vma->vm_page_prot.

Though inside vm_get_page_prot() the protection flags is AND with
(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED), it does not hurt correct code.

Signed-off-by: Coly Li <coyli@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:34 -07:00
Benjamin Herrenschmidt 1c7037db50 remove unused flush_tlb_pgtables
Nobody uses flush_tlb_pgtables anymore, this patch removes all remaining
traces of it from all archs.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:34 -07:00
Linus Torvalds 53253383fd Include <linux/backing-dev.h> in mm/filemap.c
It gets it indirectly from blkdev.h when CONFIG_BLOCK is enabled, but it
needs it unconditionally for the definition of mapping_cap_writeback_dirty.

Noticed and bisected down to 4af3c9cc4f
("Drop some headers from mm.h") by Avuton Olrich.

Cc: Avuton Olrich <avuton@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:47:32 -07:00
Stephen Hemminger c80544dc0b sparse pointer use of zero as null
Get rid of sparse related warnings from places that use integer as NULL
pointer.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:31 -07:00
Akinobu Mita 12d00f6a12 cpu hotplug: slab: fix memory leak in cpu hotplug error path
This patch fixes memory leak in error path.

In reality, we don't need to call cpuup_canceled(cpu) for now.  But upcoming
cpu hotplug error handling change needs this.

Cc: Christoph Lameter <clameter@sgi.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:21 -07:00
Akinobu Mita fbf1e473bd cpu hotplug: slab: cleanup cpuup_callback()
cpuup_callback() is too long.  This patch factors out CPU_UP_CANCELLED and
CPU_UP_PREPARE handlings from cpuup_callback().

Cc: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:21 -07:00
Linus Torvalds fb9fc39517 Merge branch 'xen-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen
* 'xen-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
  xfs: eagerly remove vmap mappings to avoid upsetting Xen
  xen: add some debug output for failed multicalls
  xen: fix incorrect vcpu_register_vcpu_info hypercall argument
  xen: ask the hypervisor how much space it needs reserved
  xen: lock pte pages while pinning/unpinning
  xen: deal with stale cr3 values when unpinning pagetables
  xen: add batch completion callbacks
  xen: yield to IPI target if necessary
  Clean up duplicate includes in arch/i386/xen/
  remove dead code in pgtable_cache_init
  paravirt: clean up lazy mode handling
  paravirt: refactor struct paravirt_ops into smaller pv_*_ops
2007-10-17 11:10:11 -07:00
Adrian Bunk cbfee34520 security/ cleanups
This patch contains the following cleanups that are now possible:
- remove the unused security_operations->inode_xattr_getsuffix
- remove the no longer used security_operations->unregister_security
- remove some no longer required exit code
- remove a bunch of no longer used exports

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:07 -07:00
Serge E. Hallyn b53767719b Implement file posix capabilities
Implement file posix capabilities.  This allows programs to be given a
subset of root's powers regardless of who runs them, without having to use
setuid and giving the binary all of root's powers.

This version works with Kaigai Kohei's userspace tools, found at
http://www.kaigai.gr.jp/index.php.  For more information on how to use this
patch, Chris Friedhoff has posted a nice page at
http://www.friedhoff.org/fscaps.html.

Changelog:
	Nov 27:
	Incorporate fixes from Andrew Morton
	(security-introduce-file-caps-tweaks and
	security-introduce-file-caps-warning-fix)
	Fix Kconfig dependency.
	Fix change signaling behavior when file caps are not compiled in.

	Nov 13:
	Integrate comments from Alexey: Remove CONFIG_ ifdef from
	capability.h, and use %zd for printing a size_t.

	Nov 13:
	Fix endianness warnings by sparse as suggested by Alexey
	Dobriyan.

	Nov 09:
	Address warnings of unused variables at cap_bprm_set_security
	when file capabilities are disabled, and simultaneously clean
	up the code a little, by pulling the new code into a helper
	function.

	Nov 08:
	For pointers to required userspace tools and how to use
	them, see http://www.friedhoff.org/fscaps.html.

	Nov 07:
	Fix the calculation of the highest bit checked in
	check_cap_sanity().

	Nov 07:
	Allow file caps to be enabled without CONFIG_SECURITY, since
	capabilities are the default.
	Hook cap_task_setscheduler when !CONFIG_SECURITY.
	Move capable(TASK_KILL) to end of cap_task_kill to reduce
	audit messages.

	Nov 05:
	Add secondary calls in selinux/hooks.c to task_setioprio and
	task_setscheduler so that selinux and capabilities with file
	cap support can be stacked.

	Sep 05:
	As Seth Arnold points out, uid checks are out of place
	for capability code.

	Sep 01:
	Define task_setscheduler, task_setioprio, cap_task_kill, and
	task_setnice to make sure a user cannot affect a process in which
	they called a program with some fscaps.

	One remaining question is the note under task_setscheduler: are we
	ok with CAP_SYS_NICE being sufficient to confine a process to a
	cpuset?

	It is a semantic change, as without fsccaps, attach_task doesn't
	allow CAP_SYS_NICE to override the uid equivalence check.  But since
	it uses security_task_setscheduler, which elsewhere is used where
	CAP_SYS_NICE can be used to override the uid equivalence check,
	fixing it might be tough.

	     task_setscheduler
		 note: this also controls cpuset:attach_task.  Are we ok with
		     CAP_SYS_NICE being used to confine to a cpuset?
	     task_setioprio
	     task_setnice
		 sys_setpriority uses this (through set_one_prio) for another
		 process.  Need same checks as setrlimit

	Aug 21:
	Updated secureexec implementation to reflect the fact that
	euid and uid might be the same and nonzero, but the process
	might still have elevated caps.

	Aug 15:
	Handle endianness of xattrs.
	Enforce capability version match between kernel and disk.
	Enforce that no bits beyond the known max capability are
	set, else return -EPERM.
	With this extra processing, it may be worth reconsidering
	doing all the work at bprm_set_security rather than
	d_instantiate.

	Aug 10:
	Always call getxattr at bprm_set_security, rather than
	caching it at d_instantiate.

[morgan@kernel.org: file-caps clean up for linux/capability.h]
[bunk@kernel.org: unexport cap_inode_killpriv]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:07 -07:00
Randy Dunlap 8d63494f78 remap_file_pages: kernel-doc corrections
Fix kernel-doc for sys_remap_file_pages() and add info to the 'prot' NOTE.
Rename __prot parameter to prot.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:07 -07:00
Dave Hansen ce8d2cdf3d r/o bind mounts: filesystem helpers for custom 'struct file's
Why do we need r/o bind mounts?

This feature allows a read-only view into a read-write filesystem.  In the
process of doing that, it also provides infrastructure for keeping track of
the number of writers to any given mount.

This has a number of uses.  It allows chroots to have parts of filesystems
writable.  It will be useful for containers in the future because users may
have root inside a container, but should not be allowed to write to
somefilesystems.  This also replaces patches that vserver has had out of the
tree for several years.

It allows security enhancement by making sure that parts of your filesystem
read-only (such as when you don't trust your FTP server), when you don't want
to have entire new filesystems mounted, or when you want atime selectively
updated.  I've been using the following script to test that the feature is
working as desired.  It takes a directory and makes a regular bind and a r/o
bind mount of it.  It then performs some normal filesystem operations on the
three directories, including ones that are expected to fail, like creating a
file on the r/o mount.

This patch:

Some filesystems forego the vfs and may_open() and create their own 'struct
file's.

This patch creates a couple of helper functions which can be used by these
filesystems, and will provide a unified place which the r/o bind mount code
may patch.

Also, rename an existing, static-scope init_file() to a less generic name.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:04 -07:00
Fengguang Wu 369f2389e7 writeback: remove unnecessary wait in throttle_vm_writeout()
We don't want to introduce pointless delays in throttle_vm_writeout() when
the writeback limits are not yet exceeded, do we?

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Greg KH <greg@kroah.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:02 -07:00
Joern Engel 1c0eeaf569 introduce I_SYNC
I_LOCK was used for several unrelated purposes, which caused deadlock
situations in certain filesystems as a side effect.  One of the purposes
now uses the new I_SYNC bit.

Also document the various bits and change their order from historical to
logical.

[bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
Signed-off-by: Joern Engel <joern@wohnheim.fh-wedel.de>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Anton Altaparmakov <aia21@cam.ac.uk>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:02 -07:00
Fengguang Wu 2e6883bdf4 writeback: introduce writeback_control.more_io to indicate more io
After making dirty a 100M file, the normal behavior is to start the writeback
for all data after 30s delays.  But sometimes the following happens instead:

	- after 30s:    ~4M
	- after 5s:     ~4M
	- after 5s:     all remaining 92M

Some analyze shows that the internal io dispatch queues goes like this:

		s_io            s_more_io
		-------------------------
	1)	100M,1K         0
	2)	1K              96M
	3)	0               96M

1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write <= 0, so write more
3) 1K written, nr_to_write > 0, no more writes(BUG)

nr_to_write > 0 in (3) fools the upper layer to think that data have all been
written out.  The big dirty file is actually still sitting in s_more_io.  We
cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and
let the loop in generic_sync_sb_inodes() continue: this may starve newly
expired inodes in s_dirty.  It is also not an option to draw inodes from both
s_more_io and s_dirty, an let the loop go on: this might lead to live locks,
and might also starve other superblocks in sync time(well kupdate may still
starve some superblocks, that's another bug).

We have to return when a full scan of s_io completes.  So nr_to_write > 0 does
not necessarily mean that "all data are written".  This patch introduces a
flag writeback_control.more_io to indicate this situation.  With it the big
dirty file no longer has to wait for the next kupdate invocation 5s later.

Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:02 -07:00
Robert P. J. Day bda5b655fe Delete gcc-2.95 compatible structure definition.
Since nothing earlier than gcc-3.2 is supported for kernel
compilation, that 2.95 hack can be removed.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:58 -07:00
Alexey Dobriyan 4af3c9cc4f Drop some headers from mm.h
mm.h doesn't use directly anything from mutex.h and backing-dev.h, so
remove them and add them back to files which need them.

Cross-compile tested on many configs and archs.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:55 -07:00
Alexey Dobriyan 040b5c6f95 SLAB_PANIC more (proc, posix-timers, shmem)
These aren't modular, so SLAB_PANIC is OK.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:47 -07:00
Andrew Morton e423003028 writeback: don't propagate AOP_WRITEPAGE_ACTIVATE
This is a writeback-internal marker but we're propagating it all the way back
to userspace!.

Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
Nick Piggin 7a4050791b mm: document tree_lock->zone.lock lockorder
zone->lock is quite an "inner" lock and mostly constrained to page alloc as
well, so like slab locks, it probably isn't something that is critically
important to document here.  However unlike slab locks, zone lock could be
used more widely in future, and page_alloc.c might possibly have more
business to do tricky things with pagecache than does slab.  So...  I don't
think it hurts to document it.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
David Rientjes d773ed6b85 mm: test and set zone reclaim lock before starting reclaim
Introduces new zone flag interface for testing and setting flags:

	int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag)

Instead of setting and clearing ZONE_RECLAIM_LOCKED each time shrink_zone() is
called, this flag is test and set before starting zone reclaim.  Zone reclaim
starts in __alloc_pages() when a zone's watermark fails and the system is in
zone_reclaim_mode.  If it's already in reclaim, there's no need to start again
so it is simply considered full for that allocation attempt.

There is a change of behavior with regard to concurrent zone shrinking.  It is
now possible for try_to_free_pages() or kswapd to already be shrinking a
particular zone when __alloc_pages() starts zone reclaim.  In this case, it is
possible for two concurrent threads to invoke shrink_zone() for a single zone.

This change forbids a zone to be in zone reclaim twice, which was always the
behavior, but allows for concurrent try_to_free_pages() or kswapd shrinking
when starting zone reclaim.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
David Rientjes ae74138da6 oom: convert zone_scan_lock from mutex to spinlock
There's no reason to sleep in try_set_zone_oom() or clear_zonelist_oom() if
the lock can't be acquired; it will be available soon enough once the zonelist
scanning is done.  All other threads waiting for the OOM killer are also
contingent on the exiting task being able to acquire the lock in
clear_zonelist_oom() so it doesn't make sense to put it to sleep.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
David Rientjes 3ff566963c oom: do not take callback_mutex
Since no task descriptor's 'cpuset' field is dereferenced in the execution of
the OOM killer anymore, it is no longer necessary to take callback_mutex.

[akpm@linux-foundation.org: restore cpuset_lock for other patches]
Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
David Rientjes bbe373f2c6 oom: compare cpuset mems_allowed instead of exclusive ancestors
Instead of testing for overlap in the memory nodes of the the nearest
exclusive ancestor of both current and the candidate task, it is better to
simply test for intersection between the task's mems_allowed in their task
descriptors.  This does not require taking callback_mutex since it is only
used as a hint in the badness scoring.

Tasks that do not have an intersection in their mems_allowed with the current
task are not explicitly restricted from being OOM killed because it is quite
possible that the candidate task has allocated memory there before and has
since changed its mems_allowed.

Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00