Commit Graph

184 Commits

Author SHA1 Message Date
Davidlohr Bueso
a5091fda4e ipc/mqueue: optimize msg_get()
Our msg priorities became an rbtree as of d6629859b3 ("ipc/mqueue:
improve performance of send/recv").  However, consuming a msg in
msg_get() remains logarithmic (still being better than the case before
of course).  By applying well known techniques to cache pointers we can
have the node with the highest priority in O(1), which is specially nice
for the rt cases.  Furthermore, some callers can call msg_get() in a
loop.

A new msg_tree_erase() helper is also added to encapsulate the tree
removal and node_cache game.  Passes ltp mq testcases.

Link: http://lkml.kernel.org/r/20190321190216.1719-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:52 -07:00
Davidlohr Bueso
0ecb58210b ipc/mqueue: remove redundant wq task assignment
We already store the current task fo the new waiter before calling
wq_sleep() in both send and recv paths.  Trivially remove the redundant
assignment.

Link: http://lkml.kernel.org/r/20190321190216.1719-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:52 -07:00
Li Rongqing
d6a2946a88 ipc: prevent lockup on alloc_msg and free_msg
msgctl10 of ltp triggers the following lockup When CONFIG_KASAN is
enabled on large memory SMP systems, the pages initialization can take a
long time, if msgctl10 requests a huge block memory, and it will block
rcu scheduler, so release cpu actively.

After adding schedule() in free_msg, free_msg can not be called when
holding spinlock, so adding msg to a tmp list, and free it out of
spinlock

  rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  rcu:     Tasks blocked on level-1 rcu_node (CPUs 16-31): P32505
  rcu:     Tasks blocked on level-1 rcu_node (CPUs 48-63): P34978
  rcu:     (detected by 11, t=35024 jiffies, g=44237529, q=16542267)
  msgctl10        R  running task    21608 32505   2794 0x00000082
  Call Trace:
   preempt_schedule_irq+0x4c/0xb0
   retint_kernel+0x1b/0x2d
  RIP: 0010:__is_insn_slot_addr+0xfb/0x250
  Code: 82 1d 00 48 8b 9b 90 00 00 00 4c 89 f7 49 c1 ee 03 e8 59 83 1d 00 48 b8 00 00 00 00 00 fc ff df 4c 39 eb 48 89 9d 58 ff ff ff <41> c6 04 06 f8 74 66 4c 8d 75 98 4c 89 f1 48 c1 e9 03 48 01 c8 48
  RSP: 0018:ffff88bce041f758 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
  RAX: dffffc0000000000 RBX: ffffffff8471bc50 RCX: ffffffff828a2a57
  RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88bce041f780
  RBP: ffff88bce041f828 R08: ffffed15f3f4c5b3 R09: ffffed15f3f4c5b3
  R10: 0000000000000001 R11: ffffed15f3f4c5b2 R12: 000000318aee9b73
  R13: ffffffff8471bc50 R14: 1ffff1179c083ef0 R15: 1ffff1179c083eec
   kernel_text_address+0xc1/0x100
   __kernel_text_address+0xe/0x30
   unwind_get_return_address+0x2f/0x50
   __save_stack_trace+0x92/0x100
   create_object+0x380/0x650
   __kmalloc+0x14c/0x2b0
   load_msg+0x38/0x1a0
   do_msgsnd+0x19e/0xcf0
   do_syscall_64+0x117/0x400
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

  rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-15): P32170
  rcu:     (detected by 14, t=35016 jiffies, g=44237525, q=12423063)
  msgctl10        R  running task    21608 32170  32155 0x00000082
  Call Trace:
   preempt_schedule_irq+0x4c/0xb0
   retint_kernel+0x1b/0x2d
  RIP: 0010:lock_acquire+0x4d/0x340
  Code: 48 81 ec c0 00 00 00 45 89 c6 4d 89 cf 48 8d 6c 24 20 48 89 3c 24 48 8d bb e4 0c 00 00 89 74 24 0c 48 c7 44 24 20 b3 8a b5 41 <48> c1 ed 03 48 c7 44 24 28 b4 25 18 84 48 c7 44 24 30 d0 54 7a 82
  RSP: 0018:ffff88af83417738 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
  RAX: dffffc0000000000 RBX: ffff88bd335f3080 RCX: 0000000000000002
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88bd335f3d64
  RBP: ffff88af83417758 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000001 R11: ffffed13f3f745b2 R12: 0000000000000000
  R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
   is_bpf_text_address+0x32/0xe0
   kernel_text_address+0xec/0x100
   __kernel_text_address+0xe/0x30
   unwind_get_return_address+0x2f/0x50
   __save_stack_trace+0x92/0x100
   save_stack+0x32/0xb0
   __kasan_slab_free+0x130/0x180
   kfree+0xfa/0x2d0
   free_msg+0x24/0x50
   do_msgrcv+0x508/0xe60
   do_syscall_64+0x117/0x400
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Davidlohr said:
 "So after releasing the lock, the msg rbtree/list is empty and new
  calls will not see those in the newly populated tmp_msg list, and
  therefore they cannot access the delayed msg freeing pointers, which
  is good. Also the fact that the node_cache is now freed before the
  actual messages seems to be harmless as this is wanted for
  msg_insert() avoiding GFP_ATOMIC allocations, and after releasing the
  info->lock the thing is freed anyway so it should not change things"

Link: http://lkml.kernel.org/r/1552029161-4957-1-git-send-email-lirongqing@baidu.com
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:52 -07:00
Al Viro
015d795618 mqueue: switch to ->free_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-01 22:43:26 -04:00
Linus Torvalds
7b47a9e7c8 Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs mount infrastructure updates from Al Viro:
 "The rest of core infrastructure; no new syscalls in that pile, but the
  old parts are switched to new infrastructure. At that point
  conversions of individual filesystems can happen independently; some
  are done here (afs, cgroup, procfs, etc.), there's also a large series
  outside of that pile dealing with NFS (quite a bit of option-parsing
  stuff is getting used there - it's one of the most convoluted
  filesystems in terms of mount-related logics), but NFS bits are the
  next cycle fodder.

  It got seriously simplified since the last cycle; documentation is
  probably the weakest bit at the moment - I considered dropping the
  commit introducing Documentation/filesystems/mount_api.txt (cutting
  the size increase by quarter ;-), but decided that it would be better
  to fix it up after -rc1 instead.

  That pile allows to do followup work in independent branches, which
  should make life much easier for the next cycle. fs/super.c size
  increase is unpleasant; there's a followup series that allows to
  shrink it considerably, but I decided to leave that until the next
  cycle"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
  afs: Use fs_context to pass parameters over automount
  afs: Add fs_context support
  vfs: Add some logging to the core users of the fs_context log
  vfs: Implement logging through fs_context
  vfs: Provide documentation for new mount API
  vfs: Remove kern_mount_data()
  hugetlbfs: Convert to fs_context
  cpuset: Use fs_context
  kernfs, sysfs, cgroup, intel_rdt: Support fs_context
  cgroup: store a reference to cgroup_ns into cgroup_fs_context
  cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper
  cgroup_do_mount(): massage calling conventions
  cgroup: stash cgroup_root reference into cgroup_fs_context
  cgroup2: switch to option-by-option parsing
  cgroup1: switch to option-by-option parsing
  cgroup: take options parsing into ->parse_monolithic()
  cgroup: fold cgroup1_mount() into cgroup1_get_tree()
  cgroup: start switching to fs_context
  ipc: Convert mqueue fs to fs_context
  proc: Add fs_context support to procfs
  ...
2019-03-12 14:08:19 -07:00
David Howells
935c6912b1 ipc: Convert mqueue fs to fs_context
Convert the mqueue filesystem to use the filesystem context stuff.

Notes:

 (1) The relevant ipc namespace is selected in when the context is
     initialised (and it defaults to the current task's ipc namespace).
     The caller can override this before calling vfs_get_tree().

 (2) Rather than simply calling kern_mount_data(), mq_init_ns() and
     mq_internal_mount() create a context, adjust it and then do the rest
     of the mount procedure.

 (3) The lazy mqueue mounting on creation of a new namespace is retained
     from a previous patch, but the avoidance of sget() if no superblock
     yet exists is reverted and the superblock is again keyed on the
     namespace pointer.

     Yes, there was a performance gain in not searching the superblock
     hash, but it's only paid once per ipc namespace - and only if someone
     uses mqueue within that namespace, so I'm not sure it's worth it,
     especially as calling sget() allows avoidance of recursion.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28 03:29:29 -05:00
Arnd Bergmann
8dabe7245b y2038: syscalls: rename y2038 compat syscalls
A lot of system calls that pass a time_t somewhere have an implementation
using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
been reworked so that this implementation can now be used on 32-bit
architectures as well.

The missing step is to redefine them using the regular SYSCALL_DEFINEx()
to get them out of the compat namespace and make it possible to build them
on 32-bit architectures.

Any system call that ends in 'time' gets a '32' suffix on its name for
that version, while the others get a '_time32' suffix, to distinguish
them from the normal version, which takes a 64-bit time argument in the
future.

In this step, only 64-bit architectures are changed, doing this rename
first lets us avoid touching the 32-bit architectures twice.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-02-07 00:13:27 +01:00
Linus Torvalds
4dcb9239da Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping updates from Thomas Gleixner:
 "The timers and timekeeping departement provides:

   - Another large y2038 update with further preparations for providing
     the y2038 safe timespecs closer to the syscalls.

   - An overhaul of the SHCMT clocksource driver

   - SPDX license identifier updates

   - Small cleanups and fixes all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  tick/sched : Remove redundant cpu_online() check
  clocksource/drivers/dw_apb: Add reset control
  clocksource: Remove obsolete CLOCKSOURCE_OF_DECLARE
  clocksource/drivers: Unify the names to timer-* format
  clocksource/drivers/sh_cmt: Add R-Car gen3 support
  dt-bindings: timer: renesas: cmt: document R-Car gen3 support
  clocksource/drivers/sh_cmt: Properly line-wrap sh_cmt_of_table[] initializer
  clocksource/drivers/sh_cmt: Fix clocksource width for 32-bit machines
  clocksource/drivers/sh_cmt: Fixup for 64-bit machines
  clocksource/drivers/sh_tmu: Convert to SPDX identifiers
  clocksource/drivers/sh_mtu2: Convert to SPDX identifiers
  clocksource/drivers/sh_cmt: Convert to SPDX identifiers
  clocksource/drivers/renesas-ostm: Convert to SPDX identifiers
  clocksource: Convert to using %pOFn instead of device_node.name
  tick/broadcast: Remove redundant check
  RISC-V: Request newstat syscalls
  y2038: signal: Change rt_sigtimedwait to use __kernel_timespec
  y2038: socket: Change recvmmsg to use __kernel_timespec
  y2038: sched: Change sched_rr_get_interval to use __kernel_timespec
  y2038: utimes: Rework #ifdef guards for compat syscalls
  ...
2018-10-25 11:14:36 -07:00
Eric W. Biederman
ae7795bc61 signal: Distinguish between kernel_siginfo and siginfo
Linus recently observed that if we did not worry about the padding
member in struct siginfo it is only about 48 bytes, and 48 bytes is
much nicer than 128 bytes for allocating on the stack and copying
around in the kernel.

The obvious thing of only adding the padding when userspace is
including siginfo.h won't work as there are sigframe definitions in
the kernel that embed struct siginfo.

So split siginfo in two; kernel_siginfo and siginfo.  Keeping the
traditional name for the userspace definition.  While the version that
is used internally to the kernel and ultimately will not be padded to
128 bytes is called kernel_siginfo.

The definition of struct kernel_siginfo I have put in include/signal_types.h

A set of buildtime checks has been added to verify the two structures have
the same field offsets.

To make it easy to verify the change kernel_siginfo retains the same
size as siginfo.  The reduction in size comes in a following change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-10-03 16:47:43 +02:00
Arnd Bergmann
9afc5eee65 y2038: globally rename compat_time to old_time32
Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:

Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.

The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:

old				new
---				---
compat_time_t			old_time32_t
struct compat_timeval		struct old_timeval32
struct compat_timespec		struct old_timespec32
struct compat_itimerspec	struct old_itimerspec32
ns_to_compat_timeval()		ns_to_old_timeval32()
get_compat_itimerspec64()	get_old_itimerspec32()
put_compat_itimerspec64()	put_old_itimerspec32()
compat_get_timespec64()		get_old_timespec32()
compat_put_timespec64()		put_old_timespec32()

As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.

I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.

This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-27 14:48:48 +02:00
Arnd Bergmann
b0d175781a y2038: ipc: Enable COMPAT_32BIT_TIME
Three ipc syscalls (mq_timedsend, mq_timedreceive and and semtimedop)
take a timespec argument. After we move 32-bit architectures over to
useing 64-bit time_t based syscalls, we need seperate entry points for
the old 32-bit based interfaces.

This changes the #ifdef guards for the existing 32-bit compat syscalls
to check for CONFIG_COMPAT_32BIT_TIME instead, which will then be
enabled on all existing 32-bit architectures.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-20 16:20:27 +02:00
Arnd Bergmann
21fc538d81 y2038: ipc: Use __kernel_timespec
This is a preparatation for changing over __kernel_timespec to 64-bit
times, which involves assigning new system call numbers for mq_timedsend(),
mq_timedreceive() and semtimedop() for compatibility with future y2038
proof user space.

The existing ABIs will remain available through compat code.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-20 16:20:24 +02:00
Eric W. Biederman
cfb2f6f6e0 Revert "mqueue: switch to on-demand creation of internal mount"
This reverts commit 36735a6a2b.

Aleksa Sarai <asarai@suse.de> writes:
> [REGRESSION v4.16-rc6] [PATCH] mqueue: forbid unprivileged user access to internal mount
>
> Felix reported weird behaviour on 4.16.0-rc6 with regards to mqueue[1],
> which was introduced by 36735a6a2b ("mqueue: switch to on-demand
> creation of internal mount").
>
> Basically, the reproducer boils down to being able to mount mqueue if
> you create a new user namespace, even if you don't unshare the IPC
> namespace.
>
> Previously this was not possible, and you would get an -EPERM. The mount
> is the *host* mqueue mount, which is being cached and just returned from
> mqueue_mount(). To be honest, I'm not sure if this is safe or not (or if
> it was intentional -- since I'm not familiar with mqueue).
>
> To me it looks like there is a missing permission check. I've included a
> patch below that I've compile-tested, and should block the above case.
> Can someone please tell me if I'm missing something? Is this actually
> safe?
>
> [1]: https://github.com/docker/docker/issues/36674

The issue is a lot deeper than a missing permission check.  sb->s_user_ns
was is improperly set as well.  So in addition to the filesystem being
mounted when it should not be mounted, so things are not allow that should
be.

We are practically to the release of 4.16 and there is no agreement between
Al Viro and myself on what the code should looks like to fix things properly.
So revert the code to what it was before so that we can take our time
and discuss this properly.

Fixes: 36735a6a2b ("mqueue: switch to on-demand creation of internal mount")
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Reported-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-24 19:34:23 -05:00
Linus Torvalds
a9a08845e9 vfs: do bulk POLL* -> EPOLL* replacement
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do.  But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-11 14:34:03 -08:00
Jonathan Haws
68e34f4e89 ipc/mqueue.c: have RT tasks queue in by priority in wq_add()
Previous behavior added tasks to the work queue using the static_prio
value instead of the dynamic priority value in prio.  This caused RT tasks
to be added to the work queue in a FIFO manner rather than by priority.
Normal tasks were handled by priority.

This fix utilizes the dynamic priority of the task to ensure that both RT
and normal tasks are added to the work queue in priority order.  Utilizing
the dynamic priority (prio) rather than the base priority (normal_prio)
was chosen to ensure that if a task had a boosted priority when it was
added to the work queue, it would be woken sooner to to ensure that it
releases any other locks it may be holding in a more timely manner.  It is
understood that the task could have a lower priority when it wakes than
when it was added to the queue in this (unlikely) case.

Link: http://lkml.kernel.org/r/1513006652-7014-1-git-send-email-jhaws@sdl.usu.edu
Signed-off-by: Jonathan Haws <jhaws@sdl.usu.edu>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-06 18:32:46 -08:00
Linus Torvalds
8b0fdf631c Merge branch 'work.mqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mqueue/bpf vfs cleanups from Al Viro:
 "mqueue and bpf go through rather painful and similar contortions to
  create objects in their dentry trees. Provide a primitive for doing
  that without abusing ->mknod(), switch bpf and mqueue to it.

  Another mqueue-related thing that has ended up in that branch is
  on-demand creation of internal mount (based upon the work of Giuseppe
  Scrivano)"

* 'work.mqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  mqueue: switch to on-demand creation of internal mount
  tidy do_mq_open() up a bit
  mqueue: clean prepare_open() up
  do_mq_open(): move all work prior to dentry_open() into a helper
  mqueue: fold mq_attr_ok() into mqueue_get_inode()
  move dentry_open() calls up into do_mq_open()
  mqueue: switch to vfs_mkobj(), quit abusing ->d_fsdata
  bpf_obj_do_pin(): switch to vfs_mkobj(), quit abusing ->mknod()
  new primitive: vfs_mkobj()
2018-01-30 18:32:21 -08:00
Linus Torvalds
168fe32a07 Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull poll annotations from Al Viro:
 "This introduces a __bitwise type for POLL### bitmap, and propagates
  the annotations through the tree. Most of that stuff is as simple as
  'make ->poll() instances return __poll_t and do the same to local
  variables used to hold the future return value'.

  Some of the obvious brainos found in process are fixed (e.g. POLLIN
  misspelled as POLL_IN). At that point the amount of sparse warnings is
  low and most of them are for genuine bugs - e.g. ->poll() instance
  deciding to return -EINVAL instead of a bitmap. I hadn't touched those
  in this series - it's large enough as it is.

  Another problem it has caught was eventpoll() ABI mess; select.c and
  eventpoll.c assumed that corresponding POLL### and EPOLL### were
  equal. That's true for some, but not all of them - EPOLL### are
  arch-independent, but POLL### are not.

  The last commit in this series separates userland POLL### values from
  the (now arch-independent) kernel-side ones, converting between them
  in the few places where they are copied to/from userland. AFAICS, this
  is the least disruptive fix preserving poll(2) ABI and making epoll()
  work on all architectures.

  As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
  it will trigger only on what would've triggered EPOLLWRBAND on other
  architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
  at all on sparc. With this patch they should work consistently on all
  architectures"

* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  make kernel-side POLL... arch-independent
  eventpoll: no need to mask the result of epi_item_poll() again
  eventpoll: constify struct epoll_event pointers
  debugging printk in sg_poll() uses %x to print POLL... bitmap
  annotate poll(2) guts
  9p: untangle ->poll() mess
  ->si_band gets POLL... bitmap stored into a user-visible long field
  ring_buffer_poll_wait() return value used as return value of ->poll()
  the rest of drivers/*: annotate ->poll() instances
  media: annotate ->poll() instances
  fs: annotate ->poll() instances
  ipc, kernel, mm: annotate ->poll() instances
  net: annotate ->poll() instances
  apparmor: annotate ->poll() instances
  tomoyo: annotate ->poll() instances
  sound: annotate ->poll() instances
  acpi: annotate ->poll() instances
  crypto: annotate ->poll() instances
  block: annotate ->poll() instances
  x86: annotate ->poll() instances
  ...
2018-01-30 17:58:07 -08:00
Eric W. Biederman
faf1f22b61 signal: Ensure generic siginfos the kernel sends have all bits initialized
Call clear_siginfo to ensure stack allocated siginfos are fully
initialized before being passed to the signal sending functions.

This ensures that if there is the kind of confusion documented by
TRAP_FIXME, FPE_FIXME, or BUS_FIXME the kernel won't send unitialized
data to userspace when the kernel generates a signal with SI_USER but
the copy to userspace assumes it is a different kind of signal, and
different fields are initialized.

This also prepares the way for turning copy_siginfo_to_user
into a copy_to_user, by removing the need in many cases to perform
a field by field copy simply to skip the uninitialized fields.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-12 14:21:07 -06:00
Al Viro
36735a6a2b mqueue: switch to on-demand creation of internal mount
Instead of doing that upon each ipcns creation, we do that the first
time mq_open(2) or mqueue mount is done in an ipcns.  What's more,
doing that allows to get rid of mount_ns() use - we can go with
considerably cheaper mount_nodev(), avoiding the loop over all
mqueue superblock instances; ipcns->mq_mnt is used to locate preexisting
instance in O(1) time instead of O(instances) mount_ns() would've
cost us.

Based upon the version by Giuseppe Scrivano <gscrivan@redhat.com>; I've
added handling of userland mqueue mounts (original had been broken in
that area) and added a switch to mount_nodev().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05 11:54:37 -05:00
Al Viro
a713fd7f52 tidy do_mq_open() up a bit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05 11:54:36 -05:00
Al Viro
9b20d7fc52 mqueue: clean prepare_open() up
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05 11:54:36 -05:00
Al Viro
066cc813e9 do_mq_open(): move all work prior to dentry_open() into a helper
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05 11:54:35 -05:00
Al Viro
05c1b29038 mqueue: fold mq_attr_ok() into mqueue_get_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05 11:54:35 -05:00
Al Viro
af4a5372e4 move dentry_open() calls up into do_mq_open()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05 11:54:34 -05:00
Al Viro
eecec19d9e mqueue: switch to vfs_mkobj(), quit abusing ->d_fsdata
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05 11:54:33 -05:00