Pull sysctl updates from Eric Biederman:
- Rewrite of sysctl for speed and clarity.
Insert/remove/Lookup in sysctl are all now O(NlogN) operations, and
are no longer bottlenecks in the process of adding and removing
network devices.
sysctl is now focused on being a filesystem instead of system call
and the code can all be found in fs/proc/proc_sysctl.c. Hopefully
this means the code is now approachable.
Much thanks is owed to Lucian Grinjincu for keeping at this until
something was found that was usable.
- The recent proc_sys_poll oops found by the fuzzer during hibernation
is fixed.
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl: (36 commits)
sysctl: protect poll() in entries that may go away
sysctl: Don't call sysctl_follow_link unless we are a link.
sysctl: Comments to make the code clearer.
sysctl: Correct error return from get_subdir
sysctl: An easier to read version of find_subdir
sysctl: fix memset parameters in setup_sysctl_set()
sysctl: remove an unused variable
sysctl: Add register_sysctl for normal sysctl users
sysctl: Index sysctl directories with rbtrees.
sysctl: Make the header lists per directory.
sysctl: Move sysctl_check_dups into insert_header
sysctl: Modify __register_sysctl_paths to take a set instead of a root and an nsproxy
sysctl: Replace root_list with links between sysctl_table_sets.
sysctl: Add sysctl_print_dir and use it in get_subdir
sysctl: Stop requiring explicit management of sysctl directories
sysctl: Add a root pointer to ctl_table_set
sysctl: Rewrite proc_sys_readdir in terms of first_entry and next_entry
sysctl: Rewrite proc_sys_lookup introducing find_entry and lookup_entry.
sysctl: Normalize the root_table data structure.
sysctl: Factor out insert_header and erase_header
...
Protect code accessing ctl_table by grabbing the header with grab_header()
and after releasing with sysctl_head_finish(). This is needed if poll()
is called in entries created by modules: currently only hostname and
domainname support poll(), but this bug may be triggered when/if modules
use it and if user called poll() in a file that doesn't support it.
Dave Jones reported the following when using a syscall fuzzer while
hibernating/resuming:
RIP: 0010:[<ffffffff81233e3e>] [<ffffffff81233e3e>] proc_sys_poll+0x4e/0x90
RAX: 0000000000000145 RBX: ffff88020cab6940 RCX: 0000000000000000
RDX: ffffffff81233df0 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88020cab6940
[ ... ]
Code: 00 48 89 fb 48 89 f1 48 8b 40 30 4c 8b 60 e8 b8 45 01 00 00 49 83
7c 24 28 00 74 2e 49 8b 74 24 30 48 85 f6 74 24 48 85 c9 75 32 <8b> 16
b8 45 01 00 00 48 63 d2 49 39 d5 74 10 8b 06 48 98 48 89
If an entry goes away while we are polling() it, ctl_table may not exist
anymore.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
There are no functional changes. Just code motion to make it
clear that we don't follow a link between sysctl roots unless the
directory entry actually is a link.
Suggested-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Document get_subdir and that find_subdir alwasy takes a reference.
Suggested-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
When insert_header fails ensure we return the proper error value
from get_subdir. In practice nothing cares, but there is no
need to be sloppy.
Reported-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The plan is to convert all callers of register_sysctl_table
and register_sysctl_paths to register_sysctl. The interface
to register_sysctl is enough nicer this should make the callers
a bit more readable. Additionally after the conversion the
230 lines of backwards compatibility can be removed.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
One of the most important jobs of sysctl is to export network stack
tunables. Several of those tunables are per network device. In
several instances people are running with 1000+ network devices in
there network stacks, which makes the simple per directory linked list
in sysctl a scaling bottleneck. Replace O(N^2) sysctl insertion and
lookup times with O(NlogN) by using an rbtree to index the sysctl
directories.
Benchmark before:
make-dummies 0 999 -> 0.32s
rmmod dummy -> 0.12s
make-dummies 0 9999 -> 1m17s
rmmod dummy -> 17s
Benchmark after:
make-dummies 0 999 -> 0.074s
rmmod dummy -> 0.070s
make-dummies 0 9999 -> 3.4s
rmmod dummy -> 0.44s
Benchmark after (without dev_snmp6):
make-dummies 0 9999 -> 0.75s
rmmod dummy -> 0.44s
make-dummies 0 99999 -> 11s
rmmod dummy -> 4.3s
At 10,000 dummy devices the bottleneck becomes the time to add and
remove the files under /proc/sys/net/dev_snmp6. I have commented
out the code that adds and removes files under /proc/sys/net/dev_snmp6
and taken measurments of creating and destroying 100,000 dummies to
verify the sysctl continues to scale.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Simplify the callers of insert_header by removing explicit calls to check
for duplicates and instead have insert_header do the work.
This makes the code slightly more maintainable by enabling changes to
data structures where the insertion of new entries without duplicate
suppression is not possible.
There is not always a convenient path string where insert_header
is called so modify sysctl_check_dups to use sysctl_print_dir
when printing the full path when a duplicate is discovered.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
An nsproxy argument here has always been awkard and now the nsproxy argument
is completely unnecessary so remove it, replacing it with the set we want
the registered tables to show up in.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Piecing together directories by looking first in one directory
tree, than in another directory tree and finally in a third
directory tree makes it hard to verify that some directory
entries are not multiply defined and makes it hard to create
efficient implementations the sysctl filesystem.
Replace the sysctl wide list of roots with autogenerated
links from the core sysctl directory tree to the other
sysctl directory trees.
This simplifies sysctl directory reading and lookups as now
only entries in a single sysctl directory tree need to be
considered.
Benchmark before:
make-dummies 0 999 -> 0.44s
rmmod dummy -> 0.065s
make-dummies 0 9999 -> 1m36s
rmmod dummy -> 0.4s
Benchmark after:
make-dummies 0 999 -> 0.63s
rmmod dummy -> 0.12s
make-dummies 0 9999 -> 2m35s
rmmod dummy -> 18s
The slowdown is caused by the lookups used in insert_headers
and put_links to see if we need to add links or remove links.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
When there are errors it is very nice to know the full sysctl path.
Add a simple function that computes the sysctl path and prints it
out.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Simplify the code and the sysctl semantics by autogenerating
sysctl directories when a sysctl table is registered that needs
the directories and autodeleting the directories when there are
no more sysctl tables registered that need them.
Autogenerating directories keeps sysctl tables from depending
on each other, removing all of the arcane register/unregister
ordering constraints and makes it impossible to get the order
wrong when reigsering and unregistering sysctl tables.
Autogenerating directories yields one unique entity that dentries
can point to, retaining the current effective use of the dcache.
Add struct ctl_dir as the type of these new autogenerated
directories.
The attached_by and attached_to fields in ctl_table_header are
removed as they are no longer needed.
The child field in ctl_table is no longer needed by the core of
the sysctl code. ctl_table.child can be removed once all of the
existing users have been updated.
Benchmark before:
make-dummies 0 999 -> 0.7s
rmmod dummy -> 0.07s
make-dummies 0 9999 -> 1m10s
rmmod dummy -> 0.4s
Benchmark after:
make-dummies 0 999 -> 0.44s
rmmod dummy -> 0.065s
make-dummies 0 9999 -> 1m36s
rmmod dummy -> 0.4s
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Add a ctl_table_root pointer to ctl_table set so it is easy to
go from a ctl_table_set to a ctl_table_root.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Replace sysctl_head_next with first_entry and next_entry. These new
iterators operate at the level of sysctl table entries and filter
out any sysctl tables that should not be shown.
Utilizing two specialized functions instead of a single function removes
conditionals for handling awkward special cases that only come up
at the beginning of iteration, making the iterators easier to read
and understand.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Replace the helpers that proc_sys_lookup uses with helpers that work
in terms of an entire sysctl directory. This is worse for sysctl_lock
hold times but it is much better for code clarity and the code cleanups
to come.
find_in_table is no longer needed so it is removed.
find_entry a general helper to find entries in a directory is added.
lookup_entry is a simple wrapper around find_entry that takes the
sysctl_lock increases the use count if an entry is found and drops
the sysctl_lock.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Every other directory has a .child member and we look at the .child
for our entries. Do the same for the root_table.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Add nreg to ctl_table_header. When nreg drops to 0 the ctl_table_header
will be unregistered.
Factor out drop_sysctl_table from unregister_sysctl_table, and add
the logic for decrementing nreg.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>