... and don't bother in callers. Don't bother with zeroing i_blocks,
while we are at it - it's already been zeroed.
i_mode is not worth the effort; it has no common default value.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Separate the task security context from task_struct. At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.
Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
entry.S via asm-offsets.
With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.
Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().
Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: William Irwin <wli@holomorphy.com>
Signed-off-by: James Morris <jmorris@namei.org>
This is a much better version of a previous patch to make the parser
tables constant. Rather than changing the typedef, we put the "const" in
all the various places where its required, allowing the __initconst
exception for nfsroot which was the cause of the previous trouble.
This was posted for review some time ago and I believe its been in -mm
since then.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Alexander Viro <aviro@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kmem cache passed to constructor is only needed for constructors that are
themselves multiplexeres. Nobody uses this "feature", nor does anybody uses
passed kmem cache in non-trivial way, so pass only pointer to object.
Non-trivial places are:
arch/powerpc/mm/init_64.c
arch/powerpc/mm/hugetlbpage.c
This is flag day, yes.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Matt Mackall <mpm@selenic.com>
[akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
[akpm@linux-foundation.org: fix mm/slab.c]
[akpm@linux-foundation.org: fix ubifs]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the ability to configure the hugetlb hstate used on a per mount basis.
- Add a new pagesize= option to the hugetlbfs mount that allows setting
the page size
- This option causes the mount code to find the hstate corresponding to the
specified size, and sets up a pointer to the hstate in the mount's
superblock.
- Change the hstate accessors to use this information rather than the
global_hstate they were using (requires a slight change in mm/memory.c
so we don't NULL deref in the error-unmap path -- see comments).
[np: take hstate out of hugetlbfs inode and vma->vm_private_data]
Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The goal of this patchset is to support multiple hugetlb page sizes. This
is achieved by introducing a new struct hstate structure, which
encapsulates the important hugetlb state and constants (eg. huge page
size, number of huge pages currently allocated, etc).
The hstate structure is then passed around the code which requires these
fields, they will do the right thing regardless of the exact hstate they
are operating on.
This patch adds the hstate structure, with a single global instance of it
(default_hstate), and does the basic work of converting hugetlb to use the
hstate.
Future patches will add more hstate structures to allow for different
hugetlbfs mounts to have different page sizes.
[akpm@linux-foundation.org: coding-style fixes]
Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After patch 2 in this series, a process that successfully calls mmap() for
a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
process calls fork(). At that point, the next write fault from the parent
could fail due to COW if the child still has a reference.
We only reserve pages for the parent but a copy must be made to avoid
leaking data from the parent to the child after fork(). Reserves could be
taken for both parent and child at fork time to guarantee faults but if
the mapping is large it is highly likely we will not have sufficient pages
for the reservation, and it is common to fork only to exec() immediatly
after. A failure here would be very undesirable.
Note that the current behaviour of mainline with MAP_PRIVATE pages is
pretty bad. The following situation is allowed to occur today.
1. Process calls mmap(MAP_PRIVATE)
2. Process calls mlock() to fault all pages and makes sure it succeeds
3. Process forks()
4. Process writes to MAP_PRIVATE mapping while child still exists
5. If the COW fails at this point, the process gets SIGKILLed even though it
had taken care to ensure the pages existed
This patch improves the situation by guaranteeing the reliability of the
process that successfully calls mmap(). When the parent performs COW, it
will try to satisfy the allocation without using reserves. If that fails
the parent will steal the page leaving any children without a page.
Faults from the child after that point will result in failure. If the
child COW happens first, an attempt will be made to allocate the page
without reserves and the child will get SIGKILLed on failure.
To summarise the new behaviour:
1. If the original mapper performs COW on a private mapping with multiple
references, it will attempt to allocate a hugepage from the pool or
the buddy allocator without using the existing reserves. On fail, VMAs
mapping the same area are traversed and the page being COW'd is unmapped
where found. It will then steal the original page as the last mapper in
the normal way.
2. The VMAs the pages were unmapped from are flagged to note that pages
with data no longer exist. Future no-page faults on those VMAs will
terminate the process as otherwise it would appear that data was corrupted.
A warning is printed to the console that this situation occured.
2. If the child performs COW first, it will attempt to satisfy the COW
from the pool if there are enough pages or via the buddy allocator if
overcommit is allowed and the buddy allocator can satisfy the request. If
it fails, the child will be killed.
If the pool is large enough, existing applications will not notice that
the reserves were a factor. Existing applications depending on the
no-reserves been set are unlikely to exist as for much of the history of
hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
point or failing the mmap().
[npiggin@suse.de: fix CONFIG_HUGETLB=n build]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in
a similar manner to the reservations taken for MAP_SHARED mappings. The
reserve count is accounted both globally and on a per-VMA basis for
private mappings. This guarantees that a process that successfully calls
mmap() will successfully fault all pages in the future unless fork() is
called.
The characteristics of private mappings of hugetlbfs files behaviour after
this patch are;
1. The process calling mmap() is guaranteed to succeed all future faults until
it forks().
2. On fork(), the parent may die due to SIGKILL on writes to the private
mapping if enough pages are not available for the COW. For reasonably
reliable behaviour in the face of a small huge page pool, children of
hugepage-aware processes should not reference the mappings; such as
might occur when fork()ing to exec().
3. On fork(), the child VMAs inherit no reserves. Reads on pages already
faulted by the parent will succeed. Successful writes will depend on enough
huge pages being free in the pool.
4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper
and at fault time otherwise.
Before this patch, all reads or writes in the child potentially needs page
allocations that can later lead to the death of the parent. This applies
to reads and writes of uninstantiated pages as well as COW. After the
patch it is only a write to an instantiated page that causes problems.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
set, then don't update the per-bdi writeback stats from
test_set_page_writeback() and test_clear_page_writeback().
Misc cleanups:
- convert bdi_cap_writeback_dirty() and friends to static inline functions
- create a flag that includes all three dirty/writeback related flags,
since almst all users will want to have them toghether
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch replaces the mempolicy mode, mode_flags, and nodemask in the
shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL.
This removes dependency on the details of mempolicy from shmem.c and hugetlbfs
inode.c and simplifies the interfaces.
mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a
pointer arg, a struct mempolicy pointer on success. For MPOL_DEFAULT, the
returned pointer is NULL. Further, mpol_parse_str() now takes a 'no_context'
argument that causes the input nodemask to be stored in the w.user_nodemask of
the created mempolicy for use when the mempolicy is installed in a tmpfs inode
shared policy tree. At that time, any cpuset contextualization is applied to
the original input nodemask. This preserves the previous behavior where the
input nodemask was stored in the superblock. We can think of the returned
mempolicy as "context free".
Because mpol_parse_str() is now calling mpol_new(), we can remove from
mpol_to_str() the semantic checks that mpol_new() already performs.
Add 'no_context' parameter to mpol_to_str() to specify that it should format
the nodemask in w.user_nodemask for 'bind' and 'interleave' policies.
Change mpol_shared_policy_init() to take a pointer to a "context free" struct
mempolicy and to create a new, "contextualized" mempolicy using the mode,
mode_flags and user_nodemask from the input mempolicy.
Note: we know that the mempolicy passed to mpol_to_str() or
mpol_shared_policy_init() from a tmpfs superblock is "context free". This
is currently the only instance thereof. However, if we found more uses for
this concept, and introduced any ambiguity as to whether a mempolicy was
context free or not, we could add another internal mode flag to identify
context free mempolicies. Then, we could remove the 'no_context' argument
from mpol_to_str().
Added shmem_get_sbmpol() to return a reference counted superblock mempolicy,
if one exists, to pass to mpol_shared_policy_init(). We must add the
reference under the sb stat_lock to prevent races with replacement of the mpol
by remount. This reference is removed in mpol_shared_policy_init().
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: another build fix]
[akpm@linux-foundation.org: yet another build fix]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the evolution of mempolicies, it is necessary to support mempolicy mode
flags that specify how the policy shall behave in certain circumstances. The
most immediate need for mode flag support is to suppress remapping the
nodemask of a policy at the time of rebind.
Both the mempolicy mode and flags are passed by the user in the 'int policy'
formal of either the set_mempolicy() or mbind() syscall. A new constant,
MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
passed as part of this int. Mempolicies that include illegal flags as part of
their policy are rejected as invalid.
An additional member to struct mempolicy is added to support the mode flags:
struct mempolicy {
...
unsigned short policy;
unsigned short flags;
}
The splitting of the 'int' actual passed by the user is done in
sys_set_mempolicy() and sys_mbind() for their respective syscalls. This is
done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall of
there are additional flags, and storing it in the new 'flags' member of struct
mempolicy. The intersection of the actual with ~MPOL_MODE_FLAGS is stored in
the 'policy' member of the struct and all current users of pol->policy remain
unchanged.
The union of the policy mode and optional mode flags is passed back to the
user in get_mempolicy().
This combination of mode and flags within the same actual does not break
userspace code that relies on get_mempolicy(&policy, ...) and either
switch (policy) {
case MPOL_BIND:
...
case MPOL_INTERLEAVE:
...
};
statements or
if (policy == MPOL_INTERLEAVE) {
...
}
statements. Such applications would need to use optional mode flags when
calling set_mempolicy() or mbind() for these previously implemented statements
to stop working. If an application does start using optional mode flags, it
will need to mask the optional flags off the policy in switch and conditional
statements that only test mode.
An additional member is also added to struct shmem_sb_info to store the
optional mode flags.
[hugh@veritas.com: shmem mpol: fix build warning]
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow sticky directory mount option for hugetlbfs. This allows admin
to create a shared hugetlbfs mount point for multiple users, while
prevent accidental file deletion that users may step on each other.
It is similiar to default tmpfs mount option, or typical option used
on /tmp.
Signed-off-by: Ken Chen <kenchen@google.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <hermes@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The hugetlbfs quota management system was never taught to handle MAP_PRIVATE
mappings when that support was added. Currently, quota is debited at page
instantiation and credited at file truncation. This approach works correctly
for shared pages but is incomplete for private pages. In addition to
hugetlb_no_page(), private pages can be instantiated by hugetlb_cow(); but
this function does not respect quotas.
Private huge pages are treated very much like normal, anonymous pages. They
are not "backed" by the hugetlbfs file and are not stored in the mapping's
radix tree. This means that private pages are invisible to
truncate_hugepages() so that function will not credit the quota.
This patch (based on a prototype provided by Ken Chen) moves quota crediting
for all pages into free_huge_page(). page->private is used to store a pointer
to the mapping to which this page belongs. This is used to credit quota on
the appropriate hugetlbfs instance.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <hermes@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Why do we need r/o bind mounts?
This feature allows a read-only view into a read-write filesystem. In the
process of doing that, it also provides infrastructure for keeping track of
the number of writers to any given mount.
This has a number of uses. It allows chroots to have parts of filesystems
writable. It will be useful for containers in the future because users may
have root inside a container, but should not be allowed to write to
somefilesystems. This also replaces patches that vserver has had out of the
tree for several years.
It allows security enhancement by making sure that parts of your filesystem
read-only (such as when you don't trust your FTP server), when you don't want
to have entire new filesystems mounted, or when you want atime selectively
updated. I've been using the following script to test that the feature is
working as desired. It takes a directory and makes a regular bind and a r/o
bind mount of it. It then performs some normal filesystem operations on the
three directories, including ones that are expected to fail, like creating a
file on the r/o mount.
This patch:
Some filesystems forego the vfs and may_open() and create their own 'struct
file's.
This patch creates a couple of helper functions which can be used by these
filesystems, and will provide a unified place which the r/o bind mount code
may patch.
Also, rename an existing, static-scope init_file() to a less generic name.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I_LOCK was used for several unrelated purposes, which caused deadlock
situations in certain filesystems as a side effect. One of the purposes
now uses the new I_SYNC bit.
Also document the various bits and change their order from historical to
logical.
[bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
Signed-off-by: Joern Engel <joern@wohnheim.fh-wedel.de>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Anton Altaparmakov <aia21@cam.ac.uk>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Slab constructors currently have a flags parameter that is never used. And
the order of the arguments is opposite to other slab functions. The object
pointer is placed before the kmem_cache pointer.
Convert
ctor(void *object, struct kmem_cache *s, unsigned long flags)
to
ctor(struct kmem_cache *s, void *object)
throughout the kernel
[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Support for reading from hugetlbfs files. libhugetlbfs lets application
text/data to be placed in large pages. When we do that, oprofile doesn't
work - since libbfd tries to read from it.
This code is very similar to what do_generic_mapping_read() does, but I
can't use it since it has PAGE_CACHE_SIZE assumptions.
[akpm@linux-foundation.org: cleanups, fix leak]
[bunk@stusta.de: make hugetlbfs_read() static]
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: William Irwin <bill.irwin@oracle.com>
Tested-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For historical reason, expanding ftruncate that increases file size on
hugetlbfs is not allowed due to pages were pre-faulted and lack of fault
handler. Now that we have demand faulting on hugetlb since 2.6.15, there
is no reason to hold back that limitation.
This will make hugetlbfs behave more like a normal fs. I'm writing a user
level code that uses hugetlbfs but will fall back to tmpfs if there are no
hugetlb page available in the system. Having hugetlbfs specific ftruncate
behavior is a bit quirky and I would like to remove that artificial
limitation.
Signed-off-by: <kenchen@google.com>
Acked-by: Wiliam Irwin <wli@holomorphy.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>