The current implementation of tmpfs is not scalable. We found that
stat_lock is contended by multiple threads when we need to get a new page,
leading to useless spinning inside this spin lock.
This patch makes use of the percpu_counter library to maintain local count
of used blocks to speed up getting and returning of pages. So the
acquisition of stat_lock is unnecessary for getting and returning blocks,
improving the performance of tmpfs on system with large number of cpus.
On a 4 socket 32 core NHM-EX system, we saw improvement of 270%.
The implementation below has a slight chance of race between threads
causing a slight overshoot of the maximum configured blocks. However, any
overshoot is small, and is bounded by the number of cpus. This happens
when the number of used blocks is slightly below the maximum configured
blocks when a thread checks the used block count, and another thread
allocates the last block before the current thread does. This should not
be a problem for tmpfs, as the overshoot is most likely to be a few blocks
and bounded. If a strict limit is really desired, then configured the max
blocks to be the limit less the number of cpus in system.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add percpu_counter_compare that allows for a quick but accurate comparison
of percpu_counter with a given value.
A rough count is provided by the count field in percpu_counter structure,
without accounting for the other values stored in individual cpu counters.
The actual count is a sum of count and the cpu counters. However, count
field is never different from the actual value by a factor of
batch*num_online_cpu. We do not need to get actual count for comparison
if count is different from the given value by this factor and allows for
quick comparison without summing up all the per cpu counters.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Define stubs for the numa_*_id() generic percpu related functions for
non-NUMA configurations in <asm-generic/topology.h> where the other
non-numa stubs live.
Fixes ia64 !NUMA build breakage -- e.g., tiger_defconfig
Back out now unneeded '#ifndef CONFIG_NUMA' guards from ia64 smpboot.c
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have been used naming try_set_zone_oom and clear_zonelist_oom.
The role of functions is to lock of zonelist for preventing parallel
OOM. So clear_zonelist_oom makes sense but try_set_zone_oome is rather
awkward and unmatched with clear_zonelist_oom.
Let's change it with try_set_zonelist_oom.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are various points in the oom killer where the kernel must determine
whether to panic or not. It's better to extract this to a helper function
to remove all the confusion as to its semantics.
Also fix a call to dump_header() where tasklist_lock is not read- locked,
as required.
There's no functional change with this patch.
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The oom killer presently kills current whenever there is no more memory
free or reclaimable on its mempolicy's nodes. There is no guarantee that
current is a memory-hogging task or that killing it will free any
substantial amount of memory, however.
In such situations, it is better to scan the tasklist for nodes that are
allowed to allocate on current's set of nodes and kill the task with the
highest badness() score. This ensures that the most memory-hogging task,
or the one configured by the user with /proc/pid/oom_adj, is always
selected in such scenarios.
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The comment suggests that when b_count equals zero it is calling
__wait_no_buffer to trigger some debug, but as there is no debug in
__wait_on_buffer the whole thing is redundant.
AFAICT from the git log this has been the case for at least 5 years, so
it seems safe just to remove this.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KSM reference counts can cause an anon_vma to exist after the processe it
belongs to have already exited. Because the anon_vma lock now lives in
the root anon_vma, we need to ensure that the root anon_vma stays around
until after all the "child" anon_vmas have been freed.
The obvious way to do this is to have a "child" anon_vma take a reference
to the root in anon_vma_fork. When the anon_vma is freed at munmap or
process exit, we drop the refcount in anon_vma_unlink and possibly free
the root anon_vma.
The KSM anon_vma reference count function also needs to be modified to
deal with the possibility of freeing 2 levels of anon_vma. The easiest
way to do this is to break out the KSM magic and make it generic.
When compiling without CONFIG_KSM, this code is compiled out.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Always (and only) lock the root (oldest) anon_vma whenever we do something
in an anon_vma. The recently introduced anon_vma scalability is due to
the rmap code scanning only the VMAs that need to be scanned. Many common
operations still took the anon_vma lock on the root anon_vma, so always
taking that lock is not expected to introduce any scalability issues.
However, always taking the same lock does mean we only need to take one
lock, which means rmap_walk on pages from any anon_vma in the vma is
excluded from occurring during an munmap, expand_stack or other operation
that needs to exclude rmap_walk and similar functions.
Also add the proper locking to vma_adjust.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Track the root (oldest) anon_vma in each anon_vma tree. Because we only
take the lock on the root anon_vma, we cannot use the lock on higher-up
anon_vmas to lock anything. This makes it impossible to do an indirect
lookup of the root anon_vma, since the data structures could go away from
under us.
However, a direct pointer is safe because the root anon_vma is always the
last one that gets freed on munmap or exit, by virtue of the same_vma list
order and unlink_anon_vmas walking the list forward.
[akpm@linux-foundation.org: fix typo]
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kunmap_atomic() is currently at level -4 on Rusty's "Hard To Misuse"
list[1] ("Follow common convention and you'll get it wrong"), except in
some architectures when CONFIG_DEBUG_HIGHMEM is set[2][3].
kunmap() takes a pointer to a struct page; kunmap_atomic(), however, takes
takes a pointer to within the page itself. This seems to once in a while
trip people up (the convention they are following is the one from
kunmap()).
Make it much harder to misuse, by moving it to level 9 on Rusty's list[4]
("The compiler/linker won't let you get it wrong"). This is done by
refusing to build if the type of its first argument is a pointer to a
struct page.
The real kunmap_atomic() is renamed to kunmap_atomic_notypecheck()
(which is what you would call in case for some strange reason calling it
with a pointer to a struct page is not incorrect in your code).
The previous version of this patch was compile tested on x86-64.
[1] http://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html
[2] In these cases, it is at level 5, "Do it right or it will always
break at runtime."
[3] At least mips and powerpc look very similar, and sparc also seems to
share a common ancestor with both; there seems to be quite some
degree of copy-and-paste coding here. The include/asm/highmem.h file
for these three archs mention x86 CPUs at its top.
[4] http://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html
[5] As an aside, could someone tell me why mn10300 uses unsigned long as
the first parameter of kunmap_atomic() instead of void *?
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Cc: Russell King <linux@arm.linux.org.uk> (arch/arm)
Cc: Ralf Baechle <ralf@linux-mips.org> (arch/mips)
Cc: David Howells <dhowells@redhat.com> (arch/frv, arch/mn10300)
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> (arch/mn10300)
Cc: Kyle McMartin <kyle@mcmartin.ca> (arch/parisc)
Cc: Helge Deller <deller@gmx.de> (arch/parisc)
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> (arch/parisc)
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> (arch/powerpc)
Cc: Paul Mackerras <paulus@samba.org> (arch/powerpc)
Cc: "David S. Miller" <davem@davemloft.net> (arch/sparc)
Cc: Thomas Gleixner <tglx@linutronix.de> (arch/x86)
Cc: Ingo Molnar <mingo@redhat.com> (arch/x86)
Cc: "H. Peter Anvin" <hpa@zytor.com> (arch/x86)
Cc: Arnd Bergmann <arnd@arndb.de> (include/asm-generic)
Cc: Rusty Russell <rusty@rustcorp.com.au> ("Hard To Misuse" list)
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The start/stop_critical_timing functions for preemptirqsoff, preemptoff
and irqsoff tracers contain atomic_inc() and atomic_dec() operations.
Atomic operations use local_irq_save/restore macros to ensure atomic
access but they are traced by the same function which is causing recursion
problem.
The reason is when these tracers are turn ON then the
local_irq_save/restore macros are changed in include/linux/irqflags.h to
call trace_hardirqs_on/off which call start/stop_critical_timing.
Microblaze was affected because it uses generic atomic implementation.
Signed-off-by: Michal Simek <monstr@monstr.eu>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
arch/tile: check kmalloc() result
arch/tile: catch up on various minor cleanups.
arch/tile: avoid erroneous error return for PTRACE_POKEUSR.
tile: set ARCH_KMALLOC_MINALIGN
tile: remove homegrown L1_CACHE_ALIGN macro
arch/tile: Miscellaneous cleanup changes.
arch/tile: Split the icache flush code off to a generic <arch> header.
arch/tile: Fix bug in support for atomic64_xx() ops.
arch/tile: Shrink the tile-opcode files considerably.
arch/tile: Add driver to enable access to the user dynamic network.
arch/tile: Enable more sophisticated IRQ model for 32-bit chips.
Move list types from <linux/list.h> to <linux/types.h>.
Add wait4() back to the set of <asm-generic/unistd.h> syscalls.
Revert adding some arch-specific signal syscalls to <linux/syscalls.h>.
arch/tile: Do not use GFP_KERNEL for dma_alloc_coherent(). Feedback from fujita.tomonori@lab.ntt.co.jp.
arch/tile: core support for Tilera 32-bit chips.
Fix up the "generic" unistd.h ABI to be more useful.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6: (82 commits)
firewire: core: add forgotten dummy driver methods, remove unused ones
firewire: add isochronous multichannel reception
firewire: core: small clarifications in core-cdev
firewire: core: remove unused code
firewire: ohci: release channel in error path
firewire: ohci: use memory barriers to order descriptor updates
tools/firewire: nosy-dump: increment program version
tools/firewire: nosy-dump: remove unused code
tools/firewire: nosy-dump: use linux/firewire-constants.h
tools/firewire: nosy-dump: break up a deeply nested function
tools/firewire: nosy-dump: make some symbols static or const
tools/firewire: nosy-dump: change to kernel coding style
tools/firewire: nosy-dump: work around segfault in decode_fcp
tools/firewire: nosy-dump: fix it on x86-64
tools/firewire: add userspace front-end of nosy
firewire: nosy: note ioctls in ioctl-number.txt
firewire: nosy: use generic printk macros
firewire: nosy: endianess fixes and annotations
firewire: nosy: annotate __user pointers and __iomem pointers
firewire: nosy: fix device shutdown with active client
...
* 'acpica' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (27 commits)
ACPI / ACPICA: Simplify acpi_ev_initialize_gpe_block()
ACPI / ACPICA: Fail acpi_gpe_wakeup() if ACPI_GPE_CAN_WAKE is unset
ACPI / ACPICA: Do not execute _PRW methods during initialization
ACPI: Fix bogus GPE test in acpi_bus_set_run_wake_flags()
ACPICA: Update version to 20100702
ACPICA: Fix for Alias references within Package objects
ACPICA: Fix lint warning for 64-bit constant
ACPICA: Remove obsolete GPE function
ACPICA: Update debug output components
ACPICA: Add support for WDDT - Watchdog Descriptor Table
ACPICA: Drop acpi_set_gpe
ACPICA: Use low-level GPE enable during GPE block initialization
ACPI / EC: Do not use acpi_set_gpe
ACPI / EC: Drop suspend and resume routines
ACPICA: Remove wakeup GPE reference counting which is not used
ACPICA: Introduce acpi_gpe_wakeup()
ACPICA: Rename acpi_hw_gpe_register_bit
ACPICA: Update version to 20100528
ACPICA: Add signatures for undefined tables: ATKG, GSCI, IEIT
ACPICA: Optimization: Reduce the number of namespace walks
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (42 commits)
IB/qib: Add missing <linux/slab.h> include
IB/ehca: Drop unnecessary NULL test
RDMA/nes: Fix confusing if statement indentation
IB/ehca: Init irq tasklet before irq can happen
RDMA/nes: Fix misindented code
RDMA/nes: Fix showing wqm_quanta
RDMA/nes: Get rid of "set but not used" variables
RDMA/nes: Read firmware version from correct place
IB/srp: Export req_lim via sysfs
IB/srp: Make receive buffer handling more robust
IB/srp: Use print_hex_dump()
IB: Rename RAW_ETY to RAW_ETHERTYPE
RDMA/nes: Fix two sparse warnings
RDMA/cxgb3: Make needlessly global iwch_l2t_send() static
IB/iser: Make needlessly global iser_alloc_rx_descriptors() static
RDMA/cxgb4: Add timeouts when waiting for FW responses
IB/qib: Fix race between qib_error_qp() and receive packet processing
IB/qib: Limit the number of packets processed per interrupt
IB/qib: Allow writes to the diag_counters to be able to clear them
IB/qib: Set cfgctxts to number of CPUs by default
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6: (214 commits)
ALSA: hda - Add pin-fix for HP dc5750
ALSA: als4000: Fix potentially invalid DMA mode setup
ALSA: als4000: enable burst mode
ALSA: hda - Fix initial capsrc selection in patch_alc269()
ASoC: TWL4030: Capture route runtime DAPM ordering fix
ALSA: hda - Add PC-beep whitelist for an Intel board
ALSA: hda - More relax for pending period handling
ALSA: hda - Define AC_FMT_* constants
ALSA: hda - Fix beep frequency on IDT 92HD73xx and 92HD71Bxx codecs
ALSA: hda - Add support for HDMI HBR passthrough
ALSA: hda - Set Stream Type in Stream Format according to AES0
ALSA: hda - Fix Thinkpad X300 so SPDIF is not exposed
ALSA: hda - FIX to not expose SPDIF on Thinkpad X301, since it does not have the ability to use SPDIF
ASoC: wm9081: fix resource reclaim in wm9081_register error path
ASoC: wm8978: fix a memory leak if a wm8978_register fail
ASoC: wm8974: fix a memory leak if another WM8974 is registered
ASoC: wm8961: fix resource reclaim in wm8961_register error path
ASoC: wm8955: fix resource reclaim in wm8955_register error path
ASoC: wm8940: fix a memory leak if wm8940_register return error
ASoC: wm8904: fix resource reclaim in wm8904_register error path
...
* 'for-2.6.36' of git://linux-nfs.org/~bfields/linux: (34 commits)
nfsd4: fix file open accounting for RDWR opens
nfsd: don't allow setting maxblksize after svc created
nfsd: initialize nfsd versions before creating svc
net: sunrpc: removed duplicated #include
nfsd41: Fix a crash when a callback is retried
nfsd: fix startup/shutdown order bug
nfsd: minor nfsd read api cleanup
gcc-4.6: nfsd: fix initialized but not read warnings
nfsd4: share file descriptors between stateid's
nfsd4: fix openmode checking on IO using lock stateid
nfsd4: miscellaneous process_open2 cleanup
nfsd4: don't pretend to support write delegations
nfsd: bypass readahead cache when have struct file
nfsd: minor nfsd_svc() cleanup
nfsd: move more into nfsd_startup()
nfsd: just keep single lockd reference for nfsd
nfsd: clean up nfsd_create_serv error handling
nfsd: fix error handling in __write_ports_addxprt
nfsd: fix error handling when starting nfsd with rpcbind down
nfsd4: fix v4 state shutdown error paths
...