Now that "struct proc_ops" exist we can start putting there stuff which
could not fly with VFS "struct file_operations"...
Most of fs/proc/inode.c file is dedicated to make open/read/.../close
reliable in the event of disappearing /proc entries which usually happens
if module is getting removed. Files like /proc/cpuinfo which never
disappear simply do not need such protection.
Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
"permanent" files.
Enable "permanent" flag for
/proc/cpuinfo
/proc/kmsg
/proc/modules
/proc/slabinfo
/proc/stat
/proc/sysvipc/*
/proc/swaps
More will come once I figure out foolproof way to prevent out module
authors from marking their stuff "permanent" for performance reasons
when it is not.
This should help with scalability: benchmark is "read /proc/cpuinfo R times
by N threads scattered over the system".
N R t, s (before) t, s (after)
-----------------------------------------------------
64 4096 1.582458 1.530502 -3.2%
256 4096 6.371926 6.125168 -3.9%
1024 4096 25.64888 24.47528 -4.6%
Benchmark source:
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
int N;
const char *filename;
int R;
int xxx = 0;
int glue(int n)
{
cpu_set_t m;
CPU_ZERO(&m);
CPU_SET(n, &m);
return sched_setaffinity(0, sizeof(cpu_set_t), &m);
}
void f(int n)
{
glue(n % NR_CPUS);
while (*(volatile int *)&xxx == 0) {
}
for (int i = 0; i < R; i++) {
int fd = open(filename, O_RDONLY);
char buf[4096];
ssize_t rv = read(fd, buf, sizeof(buf));
asm volatile ("" :: "g" (rv));
close(fd);
}
}
int main(int argc, char *argv[])
{
if (argc < 4) {
std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R
";
return 1;
}
N = atoi(argv[1]);
filename = argv[2];
R = atoi(argv[3]);
for (int i = 0; i < NR_CPUS; i++) {
if (glue(i) == 0)
break;
}
std::vector<std::thread> T;
T.reserve(N);
for (int i = 0; i < N; i++) {
T.emplace_back(f, i);
}
auto t0 = std::chrono::system_clock::now();
{
*(volatile int *)&xxx = 1;
for (auto& t: T) {
t.join();
}
}
auto t1 = std::chrono::system_clock::now();
std::chrono::duration<double> dt = t1 - t0;
std::cout << dt.count() << '
';
return 0;
}
P.S.:
Explicit randomization marker is added because adding non-function pointer
will silently disable structure layout randomization.
[akpm@linux-foundation.org: coding style fixes]
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (159 commits)
tools/testing/selftests/proc/proc-self-syscall.c: remove duplicate include
proc: more robust bulk read test
proc: test /proc/*/maps, smaps, smaps_rollup, statm
proc: use seq_puts() everywhere
proc: read kernel cpu stat pointer once
proc: remove unused argument in proc_pid_lookup()
fs/proc/thread_self.c: code cleanup for proc_setup_thread_self()
fs/proc/self.c: code cleanup for proc_setup_self()
proc: return exit code 4 for skipped tests
mm,mremap: bail out earlier in mremap_to under map pressure
mm/sparse: fix a bad comparison
mm/memory.c: do_fault: avoid usage of stale vm_area_struct
writeback: fix inode cgroup switching comment
mm/huge_memory.c: fix "orig_pud" set but not used
mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC
mm/memcontrol.c: fix bad line in comment
mm/cma.c: cma_declare_contiguous: correct err handling
mm/page_ext.c: fix an imbalance with kmemleak
mm/compaction: pass pgdat to too_many_isolated() instead of zone
mm: remove zone_lru_lock() function, access ->lru_lock directly
...
Waiman reported that on large systems with a large amount of interrupts the
readout of /proc/stat takes a long time to sum up the interrupt
statistics. In principle this is not a problem. but for unknown reasons
some enterprise quality software reads /proc/stat with a high frequency.
The reason for this is that interrupt statistics are accounted per cpu. So
the /proc/stat logic has to sum up the interrupt stats for each interrupt.
The interrupt core provides now a per interrupt summary counter which can
be used to avoid the summation loops completely except for interrupts
marked PER_CPU which are only a small fraction of the interrupt space if at
all.
Another simplification is to iterate only over the active interrupts and
skip the potentially large gaps in the interrupt number space and just
print zeros for the gaps without going into the interrupt core in the first
place.
Waiman provided test results from a 4-socket IvyBridge-EX system (60-core
120-thread, 3016 irqs) excuting a test program which reads /proc/stat
50,000 times:
Before: 18.436s (sys 18.380s)
After: 3.769s (sys 3.742s)
Reported-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: https://lkml.kernel.org/r/20190208135021.013828701@linutronix.de
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We are going to split <linux/sched/stat.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/stat.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
/proc/stat shows (among lots of other things) the current boottime (i.e.
number of seconds since boot). While a 32-bit number is sufficient for
this particular case, we want to get rid of the 'struct timespec'
suffers from a 32-bit overflow in 2038.
This changes the code to use a struct timespec64, which is known to be
safe in all cases.
Link: http://lkml.kernel.org/r/20160617201247.2292101-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the rework of the sparse interrupt code to actually free the
unused interrupt descriptors there exists a race between the /proc
interfaces to the irq subsystem and the code which frees the interrupt
descriptor.
CPU0 CPU1
show_interrupts()
desc = irq_to_desc(X);
free_desc(desc)
remove_from_radix_tree();
kfree(desc);
raw_spinlock_irq(&desc->lock);
/proc/interrupts is the only interface which can actively corrupt
kernel memory via the lock access. /proc/stat can only read from freed
memory. Extremly hard to trigger, but possible.
The interfaces in /proc/irq/N/ are not affected by this because the
removal of the proc file is serialized in procfs against concurrent
readers/writers. The removal happens before the descriptor is freed.
For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue
as the descriptor is never freed. It's merely cleared out with the irq
descriptor lock held. So any concurrent proc access will either see
the old correct value or the cleared out ones.
Protect the lookup and access to the irq descriptor in
show_interrupts() with the sparse_irq_lock.
Provide kstat_irqs_usr() which is protecting the lookup and access
with sparse_irq_lock and switch /proc/stat to use it.
Document the existing kstat_irqs interfaces so it's clear that the
caller needs to take care about protection. The users of these
interfaces are either not affected due to SPARSE_IRQ=n or already
protected against removal.
Fixes: 1f5a5b87f7 "genirq: Implement a sane sparse_irq allocator"
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
These two patches are supposed to "fix" failed order-4 memory
allocations which have been observed when reading /proc/stat. The
problem has been observed on s390 as well as on x86.
To address the problem change the seq_file memory allocations to
fallback to use vmalloc, so that allocations also work if memory is
fragmented.
This approach seems to be simpler and less intrusive than changing
/proc/stat to use an interator. Also it "fixes" other users as well,
which use seq_file's single_open() interface.
This patch (of 2):
Use seq_file's single_open_size() to preallocate a buffer that is large
enough to hold the whole output, instead of open coding it. Also
calculate the requested size using the number of online cpus instead of
possible cpus, since the size of the output only depends on the number
of online cpus.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The architectures that override cputime_t (s390, ppc) don't provide
any version of nsecs_to_cputime(). Indeed this cputime_t implementation
by backend only happens when CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y under
which the core code doesn't make any use of nsecs_to_cputime().
At least for now.
We are going to make a broader use of it so lets provide a default
version with a per usecs granularity. It should be good enough for most
usecases.
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
PROC_FS is a bool, so this code is either present or absent. It will
never be modular, so using module_init as an alias for __initcall is
rather misleading.
Fix this up now, so that we can relocate module_init from init.h into
module.h in the future. If we don't do this, we'd have to add module.h to
obviously non-modular code, and that would be ugly at best.
Note that direct use of __initcall is discouraged, vs. one of the
priority categorized subgroups. As __initcall gets mapped onto
device_initcall, our use of fs_initcall (which makes sense for fs code)
will thus change these registrations from level 6-device to level 5-fs
(i.e. slightly earlier). However no observable impact of that small
difference has been observed during testing, or is expected.
Also note that this change uncovers a missing semicolon bug in the
registration of vmcore_init as an initcall.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On some platforms (such as IA64) the large page size may results in
slab allocations to be allowed of numbers that do not fit in 32 bit.
Acked-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Git commit 09a1d34f85 "nohz: Make idle/iowait counter update
conditional" introduced a bug in regard to cpu hotplug. The effect is
that the number of idle ticks in the cpu summary line in /proc/stat is
still counting ticks for offline cpus.
Reproduction is easy, just start a workload that keeps all cpus busy,
switch off one or more cpus and then watch the idle field in top.
On a dual-core with one cpu 100% busy and one offline cpu you will get
something like this:
%Cpu(s): 48.7 us, 1.3 sy, 0.0 ni, 50.0 id, 0.0 wa, 0.0 hi, 0.0 si,
%0.0 st
The problem is that an offline cpu still has ts->idle_active == 1.
To fix this we should make sure that the cpu is online when calling
get_cpu_idle_time_us and get_cpu_iowait_time_us.
[Srivatsa: Rebased to current mainline]
Reported-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20121010061820.8999.57245.stgit@srivatsabhat.in.ibm.com
Cc: deepthi@linux.vnet.ibm.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Git commit a25cac5198 "proc: Consider NO_HZ when printing idle and
iowait times" changes the code for /proc/stat to use get_cpu_idle_time_us
and get_cpu_iowait_time_us if the system is running with nohz enabled.
For architectures which define arch_idle_time (currently s390 only)
this is a change for the worse. The result of arch_idle_time is supposed
to be the exact sleep time of the target cpu and should be used instead
of the value kept by the scheduler.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120330122308.18720283@de.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
== stat_check.py
num = 0
with open("/proc/stat") as f:
while num < 1000 :
data = f.read()
f.seek(0, 0)
num = num + 1
==
perf shows
20.39% stat_check.py [kernel.kallsyms] [k] format_decode
13.41% stat_check.py [kernel.kallsyms] [k] number
12.61% stat_check.py [kernel.kallsyms] [k] vsnprintf
10.85% stat_check.py [kernel.kallsyms] [k] memcpy
4.85% stat_check.py [kernel.kallsyms] [k] radix_tree_lookup
4.43% stat_check.py [kernel.kallsyms] [k] seq_printf
This patch removes most of calls to vsnprintf() by adding num_to_str()
and seq_print_decimal_ull(), which prints decimal numbers without rich
functions provided by printf().
On my 8cpu box.
== Before patch ==
[root@bluextal test]# time ./stat_check.py
real 0m0.150s
user 0m0.026s
sys 0m0.121s
== After patch ==
[root@bluextal test]# time ./stat_check.py
real 0m0.055s
user 0m0.022s
sys 0m0.030s
[akpm@linux-foundation.org: remove incorrect comment, use less statck in num_to_str(), move comment from .h to .c, simplify seq_put_decimal_ull()]
[andrea@betterlinux.com: avoid breaking the ABI in /proc/stat]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Righi <andrea@betterlinux.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Turner <pjt@google.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On a typical 16 cpus machine, "cat /proc/stat" gives more than 4096 bytes,
and is slow :
# strace -T -o /tmp/STRACE cat /proc/stat | wc -c
5826
# grep "cpu " /tmp/STRACE
read(0, "cpu 1949310 19 2144714 12117253"..., 32768) = 5826 <0.001504>
Thats partly because show_stat() must be called twice since initial
buffer size is too small (4096 bytes for less than 32 possible cpus)
Fix this by :
1) Taking into account nr_irqs in the initial buffer sizing.
2) Using ksize() to allow better filling of initial buffer.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>