A very small cleanup for mq_open.
We do not have to call set_close_on_exit if we create the file
descriptor right away with the flag set. We have a function for this
now. The resulting code is smaller and a tiny bit faster.
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly, this can
cause a kernel memory corruption. CLONE_NEWIPC must detach from the existing
undo lists.
Fix, part 1: add support for sys_unshare(CLONE_SYSVSEM)
The original reason to not support it was the potential (inevitable?)
confusion due to the fact that sys_unshare(CLONE_SYSVSEM) has the
inverse meaning of clone(CLONE_SYSVSEM).
Our two most reasonable options then appear to be (1) fully support
CLONE_SYSVSEM, or (2) continue to refuse explicit CLONE_SYSVSEM,
but always do it anyway on unshare(CLONE_SYSVSEM). This patch does
(1).
Changelog:
Apr 16: SEH: switch to Manfred's alternative patch which
removes the unshare_semundo() function which
always refused CLONE_SYSVSEM.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add definitions of USHORT_MAX and others into kernel. ipc uses it and slub
implementation might also use it.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: "Pierre Peiffer" <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
semctl_down(), msgctl_down() and shmctl_down() are used to handle the same set
of commands for each kind of IPC. They all start to do the same job (they
retrieve the ipc and do some permission checks) before handling the commands
on their own.
This patch proposes to consolidate this by moving these same pieces of code
into one common function called ipcctl_pre_down().
It simplifies a little these xxxctl_down() functions and increases a little
the maintainability.
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All IPCs make use of an intermetiate *_setbuf structure to handle the IPC_SET
command. This is not really needed and, moreover, it complicates a little bit
the code.
This patch gets rid of the use of it and uses directly the semid64_ds/
msgid64_ds/shmid64_ds structure.
In addition of removing one struture declaration, it also simplifies and
improves a little bit the common 64-bits path.
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
semctl_down is called with the rwmutex (the one which protects the list of
ipcs) taken in write mode.
This patch moves this rwmutex taken in write-mode inside semctl_down.
This has the advantages of reducing a little bit the window during which this
rwmutex is taken, clarifying sys_semctl, and finally of having a coherent
behaviour with [shm|msg]ctl_down
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, sys_msgctl is not easy to read.
This patch tries to improve that by introducing the msgctl_down function to
handle all commands requiring the rwmutex to be taken in write mode (ie
IPC_SET and IPC_RMID for now). It is the equivalent function of semctl_down
for message queues.
This greatly changes the readability of sys_msgctl and also harmonizes the way
these commands are handled among all IPCs.
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the way the different commands are handled in sys_shmctl introduces
some duplicated code.
This patch introduces the shmctl_down function to handle all the commands
requiring the rwmutex to be taken in write mode (ie IPC_SET and IPC_RMID for
now). It is the equivalent function of semctl_down for shared memory.
This removes some duplicated code for handling these both commands and
harmonizes the way they are handled among all IPCs.
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The enhancement as asked for by Yasunori: if msgmni is set to a negative
value, register it back into the ipcns notifier chain.
A new interface has been added to the notification mechanism:
notifier_chain_cond_register() registers a notifier block only if not already
registered. With that new interface we avoid taking care of the states
changes in procfs.
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make msgmni not recomputed anymore upon ipc namespace creation / removal or
memory add/remove, as soon as it has been set from userland.
As soon as msgmni is explicitly set via procfs or sysctl(), the associated
callback routine is unregistered from the ipc namespace notifier chain.
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a notification mechanism that aims at recomputing msgmni each time
an ipc namespace is created or removed.
The ipc namespace notifier chain already defined for memory hotplug management
is used for that purpose too.
Each time a new ipc namespace is allocated or an existing ipc namespace is
removed, the ipcns notifier chain is notified. The callback routine for each
registered ipc namespace is then activated in order to recompute msgmni for
that namespace.
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce the registration of a callback routine that recomputes msg_ctlmni
upon memory add / remove.
A single notifier block is registered in the hotplug memory chain for all the
ipc namespaces.
Since the ipc namespaces are not linked together, they have their own
notification chain: one notifier_block is defined per ipc namespace.
Each time an ipc namespace is created (removed) it registers (unregisters) its
notifier block in (from) the ipcns chain. The callback routine registered in
the memory chain invokes the ipcns notifier chain with the IPCNS_LOWMEM event.
Each callback routine registered in the ipcns namespace, in turn, recomputes
msgmni for the owning namespace.
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On large systems we'd like to allow a larger number of message queues. In
some cases up to 32K. However simply setting MSGMNI to a larger value may
cause problems for smaller systems.
The first patch of this series introduces a default maximum number of message
queue ids that scales with the amount of lowmem.
Since msgmni is per namespace and there is no amount of memory dedicated to
each namespace so far, the second patch of this series scales msgmni to the
number of ipc namespaces too.
Since msgmni depends on the amount of memory, it becomes necessary to
recompute it upon memory add/remove. In the 4th patch, memory hotplug
management is added: a notifier block is registered into the memory hotplug
notifier chain for the ipc subsystem. Since the ipc namespaces are not linked
together, they have their own notification chain: one notifier_block is
defined per ipc namespace. Each time an ipc namespace is created (removed) it
registers (unregisters) its notifier block in (from) the ipcns chain. The
callback routine registered in the memory chain invokes the ipcns notifier
chain with the IPCNS_MEMCHANGE event. Each callback routine registered in the
ipcns namespace, in turn, recomputes msgmni for the owning namespace.
The 5th patch makes it possible to keep the memory hotplug notifier chain's
lock for a lesser amount of time: instead of directly notifying the ipcns
notifier chain upon memory add/remove, a work item is added to the global
workqueue. When activated, this work item is the one who notifies the ipcns
notifier chain.
Since msgmni depends on the number of ipc namespaces, it becomes necessary to
recompute it upon ipc namespace creation / removal. The 6th patch uses the
ipc namespace notifier chain for that purpose: that chain is notified each
time an ipc namespace is created or removed. This makes it possible to
recompute msgmni for all the namespaces each time one of them is created or
removed.
When msgmni is explicitely set from userspace, we should avoid recomputing it
upon memory add/remove or ipcns creation/removal. This is what the 7th patch
does: it simply unregisters the ipcns callback routine as soon as msgmni has
been changed from procfs or sysctl().
Even if msgmni is set by hand, it should be possible to make it back
automatically recomputed upon memory add/remove or ipcns creation/removal.
This what is achieved in patch 8: if set to a negative value, msgmni is added
back to the ipcns notifier chain, making it automatically recomputed again.
This patch:
Compute msg_ctlmni to make it scale with the amount of lowmem. msg_ctlmni is
now set to make the message queues occupy 1/32 of the available lowmem.
Some cleaning has also been done for the MSGPOOL constant: the msgctl man page
says it's not used, but it also defines it as a size in bytes (the code
expresses it in Kbytes).
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By continuing to consolidate a little the IPC code, each id can be built
directly in ipc_addid() instead of having it built from each callers of
ipc_addid()
And I also remove shm_addid() in order to have, as much as possible, the
same code for shm/sem/msg.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After further discussion with Christoph Lameter, it has become clear that my
earlier attempts to clean up the mempolicy reference counting were a bit of
overkill in some areas, resulting in superflous ref/unref in what are usually
fast paths. In other areas, further inspection reveals that I botched the
unref for interleave policies.
A separate patch, suitable for upstream/stable trees, fixes up the known
errors in the previous attempt to fix reference counting.
This patch reworks the memory policy referencing counting and, one hopes,
simplifies the code. Maybe I'll get it right this time.
See the update to the numa_memory_policy.txt document for a discussion of
memory policy reference counting that motivates this patch.
Summary:
Lookup of mempolicy, based on (vma, address) need only add a reference for
shared policy, and we need only unref the policy when finished for shared
policies. So, this patch backs out all of the unneeded extra reference
counting added by my previous attempt. It then unrefs only shared policies
when we're finished with them, using the mpol_cond_put() [conditional put]
helper function introduced by this patch.
Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
containing just the policy. read_swap_cache_async() can call alloc_page_vma()
multiple times, so we can't let alloc_page_vma() unref the shared policy in
this case. To avoid this, we make a copy of any non-null shared policy and
remove the MPOL_F_SHARED flag from the copy. This copy occurs before reading
a page [or multiple pages] from swap, so the overhead should not be an issue
here.
I introduced a new static inline function "mpol_cond_copy()" to copy the
shared policy to an on-stack policy and remove the flags that would require a
conditional free. The current implementation of mpol_cond_copy() assumes that
the struct mempolicy contains no pointers to dynamically allocated structures
that must be duplicated or reference counted during copy.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_vma_policy() is not handling fallback to task policy correctly when the
get_policy() vm_op returns NULL. The NULL overwrites the 'pol' variable that
was holding the fallback task mempolicy. So, it was falling back directly to
system default policy.
Fix get_vma_policy() to use only non-NULL policy returned from the vma
get_policy op.
shm_get_policy() was falling back to current task's mempolicy if the "backing
file system" [tmpfs vs hugetlbfs] does not support the get_policy vm_op and
the vma policy is null. This is incorrect for show_numa_maps() which is
likely querying the numa_maps of some task other than current. Remove this
fallback.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the first really tricky patch in the series. It elevates the writer
count on a mount each time a non-special file is opened for write.
We used to do this in may_open(), but Miklos pointed out that __dentry_open()
is used as well to create filps. This will cover even those cases, while a
call in may_open() would not have.
There is also an elevated count around the vfs_create() call in open_namei().
See the comments for more details, but we need this to fix a 'create, remount,
fail r/w open()' race.
Some filesystems forego the use of normal vfs calls to create
struct files. Make sure that these users elevate the mnt
writer count because they will get __fput(), and we need
to make sure they're balanced.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Address 3 known bugs in the current memory policy reference counting method.
I have a series of patches to rework the reference counting to reduce overhead
in the allocation path. However, that series will require testing in -mm once
I repost it.
1) alloc_page_vma() does not release the extra reference taken for
vma/shared mempolicy when the mode == MPOL_INTERLEAVE. This can result in
leaking mempolicy structures. This is probably occurring, but not being
noticed.
Fix: add the conditional release of the reference.
2) hugezonelist unconditionally releases a reference on the mempolicy when
mode == MPOL_INTERLEAVE. This can result in decrementing the reference
count for system default policy [should have no ill effect] or premature
freeing of task policy. If this occurred, the next allocation using task
mempolicy would use the freed structure and probably BUG out.
Fix: add the necessary check to the release.
3) The current reference counting method assumes that vma 'get_policy()'
methods automatically add an extra reference a non-NULL returned mempolicy.
This is true for shmem_get_policy() used by tmpfs mappings, including
regular page shm segments. However, SHM_HUGETLB shm's, backed by
hugetlbfs, just use the vma policy without the extra reference. This
results in freeing of the vma policy on the first allocation, with reuse of
the freed mempolicy structure on subsequent allocations.
Fix: Rather than add another condition to the conditional reference
release, which occur in the allocation path, just add a reference when
returning the vma policy in shm_get_policy() to match the assumptions.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Greg KH <greg@kroah.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Cc: <eric.whitney@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>