mirror of
https://github.com/ukui/kernel.git
synced 2026-03-09 10:07:04 -07:00
fd04899208d2057b2de808e8447cfd806fd0a607
134 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
126e8bee94 |
ipc: WARN if trying to remove ipc object which is absent
Patch series "shm: shm_rmid_forced feature fixes".
Some time ago I met kernel crash after CRIU restore procedure,
fortunately, it was CRIU restore, so, I had dump files and could do
restore many times and crash reproduced easily. After some
investigation I've constructed the minimal reproducer. It was found
that it's use-after-free and it happens only if sysctl
kernel.shm_rmid_forced = 1.
The key of the problem is that the exit_shm() function not handles shp's
object destroy when task->sysvshm.shm_clist contains items from
different IPC namespaces. In most cases this list will contain only
items from one IPC namespace.
How can this list contain object from different namespaces? The
exit_shm() function is designed to clean up this list always when
process leaves IPC namespace. But we made a mistake a long time ago and
did not add a exit_shm() call into the setns() syscall procedures.
The first idea was just to add this call to setns() syscall but it
obviously changes semantics of setns() syscall and that's
userspace-visible change. So, I gave up on this idea.
The first real attempt to address the issue was just to omit forced
destroy if we meet shp object not from current task IPC namespace [1].
But that was not the best idea because task->sysvshm.shm_clist was
protected by rwsem which belongs to current task IPC namespace. It
means that list corruption may occur.
Second approach is just extend exit_shm() to properly handle shp's from
different IPC namespaces [2]. This is really non-trivial thing, I've
put a lot of effort into that but not believed that it's possible to
make it fully safe, clean and clear.
Thanks to the efforts of Manfred Spraul working an elegant solution was
designed. Thanks a lot, Manfred!
Eric also suggested the way to address the issue in ("[RFC][PATCH] shm:
In shm_exit destroy all created and never attached segments") Eric's
idea was to maintain a list of shm_clists one per IPC namespace, use
lock-less lists. But there is some extra memory consumption-related
concerns.
An alternative solution which was suggested by me was implemented in
("shm: reset shm_clist on setns but omit forced shm destroy"). The idea
is pretty simple, we add exit_shm() syscall to setns() but DO NOT
destroy shm segments even if sysctl kernel.shm_rmid_forced = 1, we just
clean up the task->sysvshm.shm_clist list.
This chages semantics of setns() syscall a little bit but in comparision
to the "naive" solution when we just add exit_shm() without any special
exclusions this looks like a safer option.
[1] https://lkml.org/lkml/2021/7/6/1108
[2] https://lkml.org/lkml/2021/7/14/736
This patch (of 2):
Let's produce a warning if we trying to remove non-existing IPC object
from IPC namespace kht/idr structures.
This allows us to catch possible bugs when the ipc_rmid() function was
called with inconsistent struct ipc_ids*, struct kern_ipc_perm*
arguments.
Link: https://lkml.kernel.org/r/20211027224348.611025-1-alexander.mikhalitsyn@virtuozzo.com
Link: https://lkml.kernel.org/r/20211027224348.611025-2-alexander.mikhalitsyn@virtuozzo.com
Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
20401d1058 |
ipc: replace costly bailout check in sysvipc_find_ipc()
sysvipc_find_ipc() was left with a costly way to check if the offset
position fed to it is bigger than the total number of IPC IDs in use. So
much so that the time it takes to iterate over /proc/sysvipc/* files grows
exponentially for a custom benchmark that creates "N" SYSV shm segments
and then times the read of /proc/sysvipc/shm (milliseconds):
12 msecs to read 1024 segs from /proc/sysvipc/shm
18 msecs to read 2048 segs from /proc/sysvipc/shm
65 msecs to read 4096 segs from /proc/sysvipc/shm
325 msecs to read 8192 segs from /proc/sysvipc/shm
1303 msecs to read 16384 segs from /proc/sysvipc/shm
5182 msecs to read 32768 segs from /proc/sysvipc/shm
The root problem lies with the loop that computes the total amount of ids
in use to check if the "pos" feeded to sysvipc_find_ipc() grew bigger than
"ids->in_use". That is a quite inneficient way to get to the maximum
index in the id lookup table, specially when that value is already
provided by struct ipc_ids.max_idx.
This patch follows up on the optimization introduced via commit
|
||
|
|
b869d5be0a |
ipc/util.c: use binary search for max_idx
If semctl(), msgctl() and shmctl() are called with IPC_INFO, SEM_INFO, MSG_INFO or SHM_INFO, then the return value is the index of the highest used index in the kernel's internal array recording information about all SysV objects of the requested type for the current namespace. (This information can be used with repeated ..._STAT or ..._STAT_ANY operations to obtain information about all SysV objects on the system.) There is a cache for this value. But when the cache needs up be updated, then the highest used index is determined by looping over all possible values. With the introduction of IPCMNI_EXTEND_SHIFT, this could be a loop over 16 million entries. And due to /proc/sys/kernel/*next_id, the index values do not need to be consecutive. With <write 16000000 to msg_next_id>, msgget(), msgctl(,IPC_RMID) in a loop, I have observed a performance increase of around factor 13000. As there is no get_last() function for idr structures: Implement a "get_last()" using a binary search. As far as I see, ipc is the only user that needs get_last(), thus implement it in ipc/util.c and not in a central location. [akpm@linux-foundation.org: tweak comment, fix typo] Link: https://lkml.kernel.org/r/20210425075208.11777-2-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: <1vier1@web.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
5e698222c7 |
ipc/util.c: sysvipc_find_ipc() incorrectly updates position index
Commit |
||
|
|
89163f93c6 |
ipc/util.c: sysvipc_find_ipc() should increase position index
If seq_file .next function does not change position index, read after some lseek can generate unexpected output. https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Waiman Long <longman@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: NeilBrown <neilb@suse.com> Cc: Peter Oberparleiter <oberpar@linux.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/b7a20945-e315-8bb0-21e6-3875c14a8494@virtuozzo.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d919b33daf |
proc: faster open/read/close with "permanent" files
Now that "struct proc_ops" exist we can start putting there stuff which
could not fly with VFS "struct file_operations"...
Most of fs/proc/inode.c file is dedicated to make open/read/.../close
reliable in the event of disappearing /proc entries which usually happens
if module is getting removed. Files like /proc/cpuinfo which never
disappear simply do not need such protection.
Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
"permanent" files.
Enable "permanent" flag for
/proc/cpuinfo
/proc/kmsg
/proc/modules
/proc/slabinfo
/proc/stat
/proc/sysvipc/*
/proc/swaps
More will come once I figure out foolproof way to prevent out module
authors from marking their stuff "permanent" for performance reasons
when it is not.
This should help with scalability: benchmark is "read /proc/cpuinfo R times
by N threads scattered over the system".
N R t, s (before) t, s (after)
-----------------------------------------------------
64 4096 1.582458 1.530502 -3.2%
256 4096 6.371926 6.125168 -3.9%
1024 4096 25.64888 24.47528 -4.6%
Benchmark source:
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
int N;
const char *filename;
int R;
int xxx = 0;
int glue(int n)
{
cpu_set_t m;
CPU_ZERO(&m);
CPU_SET(n, &m);
return sched_setaffinity(0, sizeof(cpu_set_t), &m);
}
void f(int n)
{
glue(n % NR_CPUS);
while (*(volatile int *)&xxx == 0) {
}
for (int i = 0; i < R; i++) {
int fd = open(filename, O_RDONLY);
char buf[4096];
ssize_t rv = read(fd, buf, sizeof(buf));
asm volatile ("" :: "g" (rv));
close(fd);
}
}
int main(int argc, char *argv[])
{
if (argc < 4) {
std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R
";
return 1;
}
N = atoi(argv[1]);
filename = argv[2];
R = atoi(argv[3]);
for (int i = 0; i < NR_CPUS; i++) {
if (glue(i) == 0)
break;
}
std::vector<std::thread> T;
T.reserve(N);
for (int i = 0; i < N; i++) {
T.emplace_back(f, i);
}
auto t0 = std::chrono::system_clock::now();
{
*(volatile int *)&xxx = 1;
for (auto& t: T) {
t.join();
}
}
auto t1 = std::chrono::system_clock::now();
std::chrono::duration<double> dt = t1 - t0;
std::cout << dt.count() << '
';
return 0;
}
P.S.:
Explicit randomization marker is added because adding non-function pointer
will silently disable structure layout randomization.
[akpm@linux-foundation.org: coding style fixes]
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
97a32539b9 |
proc: convert everything to "struct proc_ops"
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in seq_file.h. Conversion rule is: llseek => proc_lseek unlocked_ioctl => proc_ioctl xxx => proc_xxx delete ".owner = THIS_MODULE" line [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c] [sfr@canb.auug.org.au: fix kernel/sched/psi.c] Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
c593642c8b |
treewide: Use sizeof_field() macro
Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except at places where these are defined. Later patches will remove the unused definition of FIELD_SIZEOF(). This patch is generated using following script: EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h" git grep -l -e "\bFIELD_SIZEOF\b" | while read file; do if [[ "$file" =~ $EXCLUDE_FILES ]]; then continue fi sed -i -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file; done Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com> Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: David Miller <davem@davemloft.net> # for net |
||
|
|
99db46ea29 |
ipc: do cyclic id allocation for the ipc object.
For ipcmni_extend mode, the sequence number space is only 7 bits. So
the chance of id reuse is relatively high compared with the non-extended
mode.
To alleviate this id reuse problem, this patch enables cyclic allocation
for the index to the radix tree (idx). The disadvantage is that this
can cause a slight slow-down of the fast path, as the radix tree could
be higher than necessary.
To limit the radix tree height, I have chosen the following limits:
1) The cycling is done over in_use*1.5.
2) At least, the cycling is done over
"normal" ipcnmi mode: RADIX_TREE_MAP_SIZE elements
"ipcmni_extended": 4096 elements
Result:
- for normal mode:
No change for <= 42 active ipc elements. With more than 42
active ipc elements, a 2nd level would be added to the radix
tree.
Without cyclic allocation, a 2nd level would be added only with
more than 63 active elements.
- for extended mode:
Cycling creates always at least a 2-level radix tree.
With more than 2730 active objects, a 3rd level would be
added, instead of > 4095 active objects until the 3rd level
is added without cyclic allocation.
For a 2-level radix tree compared to a 1-level radix tree, I have
observed < 1% performance impact.
Notes:
1) Normal "x=semget();y=semget();" is unaffected: Then the idx
is e.g. a and a+1, regardless if idr_alloc() or idr_alloc_cyclic()
is used.
2) The -1% happens in a microbenchmark after this situation:
x=semget();
for(i=0;i<4000;i++) {t=semget();semctl(t,0,IPC_RMID);}
y=semget();
Now perform semget calls on x and y that do not sleep.
3) The worst-case reuse cycle time is unfortunately unaffected:
If you have 2^24-1 ipc objects allocated, and get/remove the last
possible element in a loop, then the id is reused after 128
get/remove pairs.
Performance check:
A microbenchmark that performes no-op semop() randomly on two IDs,
with only these two IDs allocated.
The IDs were set using /proc/sys/kernel/sem_next_id.
The test was run 5 times, averages are shown.
1 & 2: Base (6.22 seconds for 10.000.000 semops)
1 & 40: -0.2%
1 & 3348: - 0.8%
1 & 27348: - 1.6%
1 & 15777204: - 3.2%
Or: ~12.6 cpu cycles per additional radix tree level.
The cpu is an Intel I3-5010U. ~1300 cpu cycles/syscall is slower
than what I remember (spectre impact?).
V2 of the patch:
- use "min" and "max"
- use RADIX_TREE_MAP_SIZE * RADIX_TREE_MAP_SIZE instead of
(2<<12).
[akpm@linux-foundation.org: fix max() warning]
Link: http://lkml.kernel.org/r/20190329204930.21620-3-longman@redhat.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
3278a2c20c |
ipc: conserve sequence numbers in ipcmni_extend mode
Rewrite, based on the patch from Waiman Long: The mixing in of a sequence number into the IPC IDs is probably to avoid ID reuse in userspace as much as possible. With ipcmni_extend mode, the number of usable sequence numbers is greatly reduced leading to higher chance of ID reuse. To address this issue, we need to conserve the sequence number space as much as possible. Right now, the sequence number is incremented for every new ID created. In reality, we only need to increment the sequence number when new allocated ID is not greater than the last one allocated. It is in such case that the new ID may collide with an existing one. This is being done irrespective of the ipcmni mode. In order to avoid any races, the index is first allocated and then the pointer is replaced. Changes compared to the initial patch: - Handle failures from idr_alloc(). - Avoid that concurrent operations can see the wrong sequence number. (This is achieved by using idr_replace()). - IPCMNI_SEQ_SHIFT is not a constant, thus renamed to ipcmni_seq_shift(). - IPCMNI_SEQ_MAX is not a constant, thus renamed to ipcmni_seq_max(). Link: http://lkml.kernel.org/r/20190329204930.21620-2-longman@redhat.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Waiman Long <longman@redhat.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
5ac893b8cb |
ipc: allow boot time extension of IPCMNI from 32k to 16M
The maximum number of unique System V IPC identifiers was limited to 32k. That limit should be big enough for most use cases. However, there are some users out there requesting for more, especially those that are migrating from Solaris which uses 24 bits for unique identifiers. To satisfy the need of those users, a new boot time kernel option "ipcmni_extend" is added to extend the IPCMNI value to 16M. This is a 512X increase which should be big enough for users out there that need a large number of unique IPC identifier. The use of this new option will change the pattern of the IPC identifiers returned by functions like shmget(2). An application that depends on such pattern may not work properly. So it should only be used if the users really need more than 32k of unique IPC numbers. This new option does have the side effect of reducing the maximum number of unique sequence numbers from 64k down to 128. So it is a trade-off. The computation of a new IPC id is not done in the performance critical path. So a little bit of additional overhead shouldn't have any real performance impact. Link: http://lkml.kernel.org/r/20190329204930.21620-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
8f0db01800 |
rhashtable: use bit_spin_locks to protect hash bucket.
This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the bucket pointer to lock the hash chain for that bucket. The benefits of a bit spin_lock are: - no need to allocate a separate array of locks. - no need to have a configuration option to guide the choice of the size of this array - locking cost is often a single test-and-set in a cache line that will have to be loaded anyway. When inserting at, or removing from, the head of the chain, the unlock is free - writing the new address in the bucket head implicitly clears the lock bit. For __rhashtable_insert_fast() we ensure this always happens when adding a new key. - even when lockings costs 2 updates (lock and unlock), they are in a cacheline that needs to be read anyway. The cost of using a bit spin_lock is a little bit of code complexity, which I think is quite manageable. Bit spin_locks are sometimes inappropriate because they are not fair - if multiple CPUs repeatedly contend of the same lock, one CPU can easily be starved. This is not a credible situation with rhashtable. Multiple CPUs may want to repeatedly add or remove objects, but they will typically do so at different buckets, so they will attempt to acquire different locks. As we have more bit-locks than we previously had spinlocks (by at least a factor of two) we can expect slightly less contention to go with the slightly better cache behavior and reduced memory consumption. To enhance type checking, a new struct is introduced to represent the pointer plus lock-bit that is stored in the bucket-table. This is "struct rhash_lock_head" and is empty. A pointer to this needs to be cast to either an unsigned lock, or a "struct rhash_head *" to be useful. Variables of this type are most often called "bkt". Previously "pprev" would sometimes point to a bucket, and sometimes a ->next pointer in an rhash_head. As these are now different types, pprev is NULL when it would have pointed to the bucket. In that case, 'blk' is used, together with correct locking protocol. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
2a9d648100 |
ipc/util.c: update return value of ipc_getref from int to bool
ipc_getref has still a return value of type "int", matching the atomic_t interface of atomic_inc_not_zero()/atomic_add_unless(). ipc_getref now uses refcount_inc_not_zero, which has a return value of type "bool". Therefore, update the return code to avoid implicit conversions. Link: http://lkml.kernel.org/r/20180712185241.4017-13-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
27c331a174 |
ipc/util.c: further variable name cleanups
The varable names got a mess, thus standardize them again:
id: user space id. Called semid, shmid, msgid if the type is known.
Most functions use "id" already.
idx: "index" for the idr lookup
Right now, some functions use lid, ipc_addid() already uses idx as
the variable name.
seq: sequence number, to avoid quick collisions of the user space id
key: user space key, used for the rhash tree
Link: http://lkml.kernel.org/r/20180712185241.4017-12-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
eae04d25a7 |
ipc: simplify ipc initialization
Now that we know that rhashtable_init() will not fail, we can get rid of a lot of the unnecessary cleanup paths when the call errored out. [manfred@colorfullife.com: variable name added to util.h to resolve checkpatch warning] Link: http://lkml.kernel.org/r/20180712185241.4017-11-manfred@colorfullife.com Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
dc2c8c84de |
ipc: get rid of ids->tables_initialized hack
In sysvipc we have an ids->tables_initialized regarding the rhashtable,
introduced in
|
||
|
|
82061c57ce |
ipc: drop ipc_lock()
ipc/util.c contains multiple functions to get the ipc object pointer given an id number. There are two sets of function: One set verifies the sequence counter part of the id number, other functions do not check the sequence counter. The standard for function names in ipc/util.c is - ..._check() functions verify the sequence counter - ..._idr() functions do not verify the sequence counter ipc_lock() is an exception: It does not verify the sequence counter value, but this is not obvious from the function name. Furthermore, shm.c is the only user of this helper. Thus, we can simply move the logic into shm_lock() and get rid of the function altogether. [manfred@colorfullife.com: most of changelog] Link: http://lkml.kernel.org/r/20180712185241.4017-7-manfred@colorfullife.com Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
2e5ceb452c |
ipc/util.c: correct comment in ipc_obtain_object_check
The comment that explains ipc_obtain_object_check is wrong: The function checks the sequence number, not the reference counter. Note that checking the reference counter would be meaningless: The reference counter is decreased without holding any locks, thus an object with kern_ipc_perm.deleted=true may disappear at the end of the next rcu grace period. Link: http://lkml.kernel.org/r/20180712185241.4017-6-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
4241c1a304 |
ipc: rename ipcctl_pre_down_nolock()
Both the comment and the name of ipcctl_pre_down_nolock() are misleading: The function must be called while holdling the rw semaphore. Therefore the patch renames the function to ipcctl_obtain_check(): This name matches the other names used in util.c: - "obtain" function look up a pointer in the idr, without acquiring the object lock. - The caller is responsible for locking. - _check means that the sequence number is checked. Link: http://lkml.kernel.org/r/20180712185241.4017-5-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
39cfffd774 |
ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
ipc_addid() is impossible to use: - for certain failures, the caller must not use ipc_rcu_putref(), because the reference counter is not yet initialized. - for other failures, the caller must use ipc_rcu_putref(), because parallel operations could be ongoing already. The patch cleans that up, by initializing the refcount early, and by modifying all callers. The issues is related to the finding of syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com: syzbot found an issue with reading kern_ipc_perm.seq, here both read and write to already released memory could happen. Link: http://lkml.kernel.org/r/20180712185241.4017-4-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e2652ae6bd |
ipc: reorganize initialization of kern_ipc_perm.seq
ipc_addid() initializes kern_ipc_perm.seq after having called idr_alloc() (within ipc_idr_alloc()). Thus a parallel semop() or msgrcv() that uses ipc_obtain_object_check() may see an uninitialized value. The patch moves the initialization of kern_ipc_perm.seq before the calls of idr_alloc(). Notes: 1) This patch has a user space visible side effect: If /proc/sys/kernel/*_next_id is used (i.e.: checkpoint/restore) and if semget()/msgget()/shmget() fails in the final step of adding the id to the rhash tree, then .._next_id is cleared. Before the patch, is remained unmodified. There is no change of the behavior after a successful ..get() call: It always clears .._next_id, there is no impact to non checkpoint/restore code as that code does not use .._next_id. 2) The patch correctly documents that after a call to ipc_idr_alloc(), the full tear-down sequence must be used. The callers of ipc_addid() do not fullfill that, i.e. more bugfixes are required. The patch is a squash of a patch from Dmitry and my own changes. Link: http://lkml.kernel.org/r/20180712185241.4017-3-manfred@colorfullife.com Reported-by: syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
0eb71a9da5 |
rhashtable: split rhashtable.h
Due to the use of rhashtables in net namespaces, rhashtable.h is included in lots of the kernel, so a small changes can required a large recompilation. This makes development painful. This patch splits out rhashtable-types.h which just includes the major type declarations, and does not include (non-trivial) inline code. rhashtable.h is no longer included by anything in the include/ directory. Common include files only include rhashtable-types.h so a large recompilation is only triggered when that changes. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
e74a0effff |
proc: move /proc/sysvipc creation to where it belongs
Move the proc_mkdir() call within the sysvipc subsystem such that we avoid polluting proc_root_init() with petty cpp. [dave@stgolabs.net: contributed changelog] Link: http://lkml.kernel.org/r/20180216161732.GA10297@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
03f1fc0918 |
ipc/util: Helpers for making the sysvipc operations pid namespace aware
Capture the pid namespace when /proc/sysvipc/msg /proc/sysvipc/shm
and /proc/sysvipc/sem are opened, and make it available through
the new helper ipc_seq_pid_ns.
This makes it possible to report the pids in these files in the
pid namespace of the opener of the files.
Implement ipc_update_pid. A simple impline helper that will only update
a struct pid pointer if the new value does not equal the old value. This
removes the need for wordy code sequences like:
old = object->pid;
object->pid = new;
put_pid(old);
and
old = object->pid;
if (old != new) {
object->pid = new;
put_pid(old);
}
Allowing the following to be written instead:
ipc_update_pid(&object->pid, new);
Which is easier to read and ensures that the pid reference count is
not touched the old and the new values are the same. Not touching
the reference count in this case is important to help avoid issues
like af_unix experienced, where multiple threads of the same
process managed to bounce the struct pid between cpu cache lines,
but updating the pids reference count.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
||
|
|
87ad4b0d85 |
ipc: fix ipc data structures inconsistency
As described in the title, this patch fixes <ipc>id_ds inconsistency when
<ipc>ctl_stat executes concurrently with some ds-changing function, e.g.
shmat, msgsnd or whatever.
For instance, if shmctl(IPC_STAT) is running concurrently
with shmat, following data structure can be returned:
{... shm_lpid = 0, shm_nattch = 1, ...}
Link: http://lkml.kernel.org/r/20171202153456.6514-1-philippe.mikoyan@skat.systems
Signed-off-by: Philippe Mikoyan <philippe.mikoyan@skat.systems>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|