* akpm-incoming-2: (139 commits)
epoll: make epoll_wait() use the hrtimer range feature
select: rename estimate_accuracy() to select_estimate_accuracy()
Remove duplicate includes from many files
ramoops: use the platform data structure instead of module params
kernel/resource.c: handle reinsertion of an already-inserted resource
kfifo: fix kfifo_alloc() to return a signed int value
w1: don't allow arbitrary users to remove w1 devices
alpha: remove dma64_addr_t usage
mips: remove dma64_addr_t usage
sparc: remove dma64_addr_t usage
fuse: use release_pages()
taskstats: use real microsecond granularity for CPU times
taskstats: split fill_pid function
taskstats: separate taskstats commands
delayacct: align to 8 byte boundary on 64-bit systems
delay-accounting: reimplement -c for getdelays.c to report information on a target command
namespaces Kconfig: move namespace menu location after the cgroup
namespaces Kconfig: remove the cgroup device whitelist experimental tag
namespaces Kconfig: remove pointless cgroup dependency
namespaces Kconfig: make namespace a submenu
...
In commit f7347ce4ee ("fasync: re-organize fasync entry insertion to
allow it under a spinlock") Arnd took an earlier patch of mine that had
the comment about the FASYNC flag above the wrong function.
When the fasync_add_entry() function was split to introduce the new
fasync_insert_entry() helper function, the code that actually cares
about the FASYNC bit moved to that new helper.
So just move the comment to the right point.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'flock' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
locks: turn lock_flocks into a spinlock
fasync: re-organize fasync entry insertion to allow it under a spinlock
locks/nfsd: allocate file lock outside of spinlock
lockd: fix nlmsvc_notify_blocked locking
lockd: push lock_flocks down
Presently do_execve() turns PF_KTHREAD off before search_binary_handler().
THis has a theorical risk of PF_KTHREAD getting lost. We don't have to
turn PF_KTHREAD off in the ENOEXEC case.
This patch moves this flag modification to after the finding of the
executable file.
This is only a theorical issue because kthreads do not call do_execve()
directly. But fixing would be better.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In /proc/stat, the number of per-IRQ event is shown by making a sum each
irq's events on all cpus. But we can make use of kstat_irqs().
kstat_irqs() do the same calculation, If !CONFIG_GENERIC_HARDIRQ,
it's not a big cost. (Both of the number of cpus and irqs are small.)
If a system is very big and CONFIG_GENERIC_HARDIRQ, it does
for_each_irq()
for_each_cpu()
- look up a radix tree
- read desc->irq_stat[cpu]
This seems not efficient. This patch adds kstat_irqs() for
CONFIG_GENRIC_HARDIRQ and change the calculation as
for_each_irq()
look up radix tree
for_each_cpu()
- read desc->irq_stat[cpu]
This reduces cost.
A test on (4096cpusp, 256 nodes, 4592 irqs) host (by Jack Steiner)
%time cat /proc/stat > /dev/null
Before Patch: 2.459 sec
After Patch : .561 sec
[akpm@linux-foundation.org: unexport kstat_irqs, coding-style tweaks]
[akpm@linux-foundation.org: fix unused variable 'per_irq_sum']
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Jack Steiner <steiner@sgi.com>
Acked-by: Jack Steiner <steiner@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/stat shows the total number of all interrupts to each cpu. But when
the number of IRQs are very large, it take very long time and 'cat
/proc/stat' takes more than 10 secs. This is because sum of all irq
events are counted when /proc/stat is read. This patch adds "sum of all
irq" counter percpu and reduce read costs.
The cost of reading /proc/stat is important because it's used by major
applications as 'top', 'ps', 'w', etc....
A test on a mechin (4096cpu, 256 nodes, 4592 irqs) shows
%time cat /proc/stat > /dev/null
Before Patch: 12.627 sec
After Patch: 2.459 sec
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Jack Steiner <steiner@sgi.com>
Acked-by: Jack Steiner <steiner@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The length of the BLOCK_IPOLL string is making i's value be printed too
far to the right. This patch fixes this and makes the output a bit
neater.
Currently:
CPU0
HI: 0
TIMER: 599792
NET_TX: 2
NET_RX: 6
BLOCK: 80807
BLOCK_IOPOLL: 0
TASKLET: 20012
SCHED: 0
HRTIMER: 63
RCU: 619279
With patch:
CPU0
HI: 0
TIMER: 585582
NET_TX: 2
NET_RX: 6
BLOCK: 80320
BLOCK_IOPOLL: 0
TASKLET: 19287
SCHED: 0
HRTIMER: 62
RCU: 604441
Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Acked-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Export the number of anonymous pages in a mapping via smaps.
Even the private pages in a mapping backed by a file, would be marked as
anonymous, when they are modified. Export this information to user-space via
smaps.
Exporting this count will help gdb to make a better decision on which
areas need to be dumped in its coredump; and should be useful to others
studying the memory usage of a process.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The userland ELF tools have been coping with partial-segments core files
for a few years now. Multiple distro builds are now setting this option.
It behooves everyone who ever deals with core files to have more info
dumped in there, especially as more and more people's compilers are
producing build IDs. Make it the default.
Anyone using older tools confused by these core files can configure this
option off, or just change /proc/PID/coredump_filter after boot.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We met a parameter truncated issue, consider following:
> echo "|/root/core_pattern_pipe_test %p /usr/libexec/blah-blah-blah \
%s %c %p %u %g 11 12345678901234567890123456789012345678 %t" > \
/proc/sys/kernel/core_pattern
This is okay because the strings is less than CORENAME_MAX_SIZE. "cat
/proc/sys/kernel/core_pattern" shows the whole string. but after we run
core_pattern_pipe_test in man page, we found last parameter was truncated
like below:
argc[10]=<12807486>
The root cause is core_pattern allows % specifiers, which need to be
replaced during parse time, but the replace may expand the strings to
larger than CORENAME_MAX_SIZE. So if the last parameter is % specifiers,
the replace code is using snprintf(out_ptr, out_end - out_ptr, ...), this
will write out of corename array.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent
execve() has no worth.
Let's move ->cred_guard_mutex from task_struct to signal_struct. It
naturally prevent multiple-threads-inside-exec.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nothing depends on lock_flocks using the BKL
any more, so we can do the switch over to
a private spinlock.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
You currently cannot use "fasync_helper()" in an atomic environment to
insert a new fasync entry, because it will need to allocate the new
"struct fasync_struct".
Yet fcntl_setlease() wants to call this under lock_flocks(), which is in
the process of being converted from the BKL to a spinlock.
In order to fix this, this abstracts out the actual fasync list
insertion and the fasync allocations into functions of their own, and
teaches fs/locks.c to pre-allocate the fasync_struct entry. That way
the actual list insertion can happen while holding the required
spinlock.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bfields@redhat.com: rebase on top of my changes to Arnd's patch]
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
As suggested by Christoph Hellwig, this moves allocation
of new file locks out of generic_setlease into the
callers, nfs4_open_delegation and fcntl_setlease in order
to allow GFP_KERNEL allocations when lock_flocks has
become a spinlock.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
nlmsvc_notify_blocked walks the nlm_blocked list,
which requires nlm_blocked_lock.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
lockd should use lock_flocks() instead of lock_kernel()
to lock against posix locks accessing the i_flock list.
This is a prerequisite to turning lock_flocks into a
spinlock.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
split invalidate_inodes()
fs: skip I_FREEING inodes in writeback_sb_inodes
fs: fold invalidate_list into invalidate_inodes
fs: do not drop inode_lock in dispose_list
fs: inode split IO and LRU lists
fs: switch bdev inode bdi's correctly
fs: fix buffer invalidation in invalidate_list
fsnotify: use dget_parent
smbfs: use dget_parent
exportfs: use dget_parent
fs: use RCU read side protection in d_validate
fs: clean up dentry lru modification
fs: split __shrink_dcache_sb
fs: improve DCACHE_REFERENCED usage
fs: use percpu counter for nr_dentry and nr_dentry_unused
fs: simplify __d_free
fs: take dcache_lock inside __d_path
fs: do not assign default i_ino in new_inode
fs: introduce a per-cpu last_ino allocator
new helper: ihold()
...
* akpm-incoming-1: (176 commits)
scripts/checkpatch.pl: add check for declaration of pci_device_id
scripts/checkpatch.pl: add warnings for static char that could be static const char
checkpatch: version 0.31
checkpatch: statement/block context analyser should look at sanitised lines
checkpatch: handle EXPORT_SYMBOL for DEVICE_ATTR and similar
checkpatch: clean up structure definition macro handline
checkpatch: update copyright dates
checkpatch: Add additional attribute #defines
checkpatch: check for incorrect permissions
checkpatch: ensure kconfig help checks only apply when we are adding help
checkpatch: simplify and consolidate "missing space after" checks
checkpatch: add check for space after struct, union, and enum
checkpatch: returning errno typically should be negative
checkpatch: handle casts better fixing false categorisation of : as binary
checkpatch: ensure we do not collapse bracketed sections into constants
checkpatch: suggest cleanpatch and cleanfile when appropriate
checkpatch: types may sit on a line on their own
checkpatch: fix regressions in "fix handling of leading spaces"
div64_u64(): improve precision on 32bit platforms
lib/parser: cleanup match_number()
...
Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :
<quote>
We were seeing a failure which prevented boot. The kernel was incapable
of creating either a named pipe or unix domain socket. This comes down
to a common kernel function called unix_create1() which does:
atomic_inc(&unix_nr_socks);
if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
goto out;
The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().
n = (mempages * (PAGE_SIZE / 1024)) / 10;
files_stat.max_files = n;
In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000). That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.
</quote>
Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of atomic_t.
get_max_files() is changed to return an unsigned long. get_nr_files() is
changed to return a long.
unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.
Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968
After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704 0 2147483648
Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>