Commit Graph

5933 Commits

Author SHA1 Message Date
Jason Gunthorpe
0f45e69d62 Merge tag 'verbs_flow_counters' of git://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git into for-next
Pull verbs counters series from Leon Romanovsky:

====================
Verbs flow counters support

This series comes to allow user space applications to monitor real time
traffic activity and events of the verbs objects it manages, e.g.: ibv_qp,
ibv_wq, ibv_flow.

The API enables generic counters creation and define mapping to
association with a verbs object, the current mlx5 driver is using this API
for flow counters.

With this API, an application can monitor the entire life cycle of object
activity, defined here as a static counters attachment.  This API also
allows dynamic counters monitoring of measurement points for a partial
period in the verbs object life cycle.

In addition it presents the implementation of the generic counters
interface.

This will be achieved by extending flow creation by adding a new flow
count specification type which allows the user to associate a previously
created flow counters using the generic verbs counters interface to the
created flow, once associated the user could read statistics by using the
read function of the generic counters interface.

The API includes:
1. create and destroyed API of a new counters objects
2. read the counters values from HW

Note:
Attaching API to allow application to define the measurement points per
objects is a user space only API and this data is passed to kernel when
the counted object (e.g. flow) is created with the counters object.
===================

* tag 'verbs_flow_counters':
  IB/mlx5: Add counters read support
  IB/mlx5: Add flow counters read support
  IB/mlx5: Add flow counters binding support
  IB/mlx5: Add counters create and destroy support
  IB/uverbs: Add support for flow counters
  IB/core: Add support for flow counters
  IB/core: Support passing uhw for create_flow
  IB/uverbs: Add read counters support
  IB/core: Introduce counters read verb
  IB/uverbs: Add create/destroy counters support
  IB/core: Introduce counters object and its create/destroy
  IB/uverbs: Add an ib_uobject getter to ioctl() infrastructure
  net/mlx5: Export flow counter related API
  net/mlx5: Use flow counter pointer as input to the query function
2018-06-04 08:48:11 -06:00
Raed Salem
3b3233fbf0 IB/mlx5: Add flow counters binding support
Associates a counters with a flow when IB_FLOW_SPEC_ACTION_COUNT is part
of the flow specifications.

The counters user space placements of location and description (index,
description) pairs are passed as private data of the counters flow
specification.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Raed Salem <raeds@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-02 07:35:32 +03:00
Raed Salem
b6ba4a9aa5 IB/uverbs: Add support for flow counters
The struct ib_uverbs_flow_spec_action_count associates a counters object
with the flow.

Post this association the flow counters can be read via the counters
object.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Raed Salem <raeds@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-02 07:33:56 +03:00
Raed Salem
ebb6796bd3 IB/uverbs: Add read counters support
This patch exposes the read counters verb to user space applications.  By
that verb the user can read the hardware counters which are associated
with the counters object.

The application needs to provide a sufficient memory to hold the
statistics.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Raed Salem <raeds@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-02 07:33:55 +03:00
Raed Salem
d9a5a6441e IB/uverbs: Add create/destroy counters support
User space application which uses counters functionality, is expected to
allocate/release the counters resources by calling create/destroy verbs
and in turn get a unique handle that can be used to attach the counters to
its counted type.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Raed Salem <raeds@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-02 07:33:54 +03:00
Jason Gunthorpe
f3ca0ab114 Merge branch 'mini_cqe' into git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma for-next
Leon Romanovsky says:

====================
Introduce new internal to mlx5 CQE format - mini-CQE. It is a CQE in
compressed form that holds data needed to extra a single full CQE.

It is a stride index, byte count and packet checksum.
====================

* mini_cqe:
  IB/mlx5: Introduce a new mini-CQE format
  IB/mlx5: Refactor CQE compression response
  net/mlx5: Exposing a new mini-CQE format

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-29 15:23:18 -06:00
Yonatan Cohen
6f1006a438 IB/mlx5: Introduce a new mini-CQE format
The new mini-CQE format includes the stride index, byte count and
packet checksum.
Stride index is needed for striding WQ feature.
This patch exposes this capability and enables its setting
via mlx5 UHW data as part of query device and cq creation.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-29 15:18:38 -06:00
Ariel Levkovich
e818e255a5 IB/mlx5: Expose MPLS related tunneling offloads
This patch reports the device's capbilities to offload
encapsulated MPLS tunnel protocols to user-space:
- Capability to offload MPLS over GRE.
- Capability to offload MPLS over UDP.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-16 21:32:55 -06:00
Ariel Levkovich
0d86bbec71 IB/uverbs: Expose MPLS flow spec to the user-kernel ABI header
Add ib_uverbs_flow_spec_mpls to define a rule to match the MPLS
protocol.

The spec includes the generic specs header, type, size and reserved
fields while the filter itself is defined as ib_uverbs_flow_mpls_filter
and includes a single 32bit field named 'label' which consists of:
Bits 0:19  - The MPLS label.
Bits 20:22 - Traffic class field.
Bit  23    - Bottom of stack bit.
Bits 24:31 - Time to live (TTL) field.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-16 21:32:54 -06:00
Ariel Levkovich
20b6563ba1 IB/uverbs: Expose GRE flow spec to the user-kernel ABI header
Add ib_uverbs_flow_spec_gre to define a rule to match the GRE
encapsulation protocol.

The spec includes the generic specs header, type, size and reserved
fields while the filter itself is defined as ib_uverbs_flow_gre_filter
and includes:
1. Checksum present bit, key present bit and version bits in a single
   16bit field.
2. Protocol type field - Indicates the ether protocol type of the
   encapsulated payload.
3. Key field - present if key bit is set and contains an application
   specific key value.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-16 21:32:54 -06:00
Doug Ledford
0d52d80376 RDMA/uapi: Fix uapi breakage
During this merge window, we added support for addition RDMA netlink
operations.  Unfortunately, we added the items in the middle of our uapi
enum.  Fix that before final release.

Fixes: da5c850782 ("RDMA/nldev: add driver-specific resource
tracking")
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-15 15:54:46 -04:00
Doug Ledford
f5e27a203f Merge branch 'k.o/for-rc' into k.o/wip/dl-for-next
Several items of conflict have arisen between the RDMA stack's for-rc
branch and upcoming for-next work:

9fd4350ba8 ("IB/rxe: avoid double kfree_skb") directly conflicts with
2e47350789 ("IB/rxe: optimize the function duplicate_request")

Patches already submitted by Intel for the hfi1 driver will fail to
apply cleanly without this merge

Other people on the mailing list have notified that their upcoming
patches also fail to apply cleanly without this merge

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-09 15:48:48 -04:00
Steve Wise
da5c850782 RDMA/nldev: add driver-specific resource tracking
Each driver can register a "fill entry" function with the restrack core.
This function will be called when filling out a resource, allowing the
driver to add driver-specific details.  The details consist of a
nltable of nested attributes, that are in the form of <key, [print-type],
value> tuples.  Both key and value attributes are mandatory.  The key
nlattr must be a string, and the value nlattr can be one of the driver
attributes that are generic, but typed, allowing the attributes to be
validated.  Currently the driver nlattr types include string, s32,
u32, s64, and u64.  The print-type nlattr allows a driver to specify
an alternative display format for user tools displaying the attribute.
For example, a u32 attribute will default to "%u", but a print-type
attribute can be included for it to be displayed in hex.  This allows
the user tool to print the number in the format desired by the driver
driver.

More attrs can be defined as they become needed by drivers.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-03 15:51:27 -04:00
Steve Wise
25a0ad8515 RDMA/nldev: Add explicit pad attribute
Add a specific RDMA_NLDEV_ATTR_PAD attribute to be used for 64b
attribute padding.  To preserve the ABI, make this attribute equal to
RDMA_NLDEV_ATTR_UNSPEC, which has a value of 0, because that has been
used up until now as the pad attribute.

Change all the previous use of 0 as the pad with this
new enum.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-03 15:51:27 -04:00
Jason Gunthorpe
d50e14abe2 uapi: Fix SPDX tags for files referring to the 'OpenIB.org' license
Based on discussion with Kate Stewart this license is not a
BSD-2-Clause, but is now formally identified as Linux-OpenIB
by SPDX.

The key difference between the licenses is in the 'warranty'
paragraph.

if_infiniband.h refers to the 'OpenIB.org' license, but
does not include the text, instead it links to an obsolete
web site that contains a license that matches the BSD-2-Clause
SPX. There is no 'three clause' version of the OpenIB.org
license.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-23 11:10:33 -04:00
Linus Torvalds
681857ef0d Merge branch 'parisc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:

 - fix panic when halting system via "shutdown -h now"

 - drop own coding in favour of generic CONFIG_COMPAT_BINFMT_ELF
   implementation

 - add FPE_CONDTRAP constant: last outstanding parisc-specific cleanup
   for Eric Biedermans siginfo patches

 - move some functions to .init and some to .text.hot linker sections

* 'parisc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Prevent panic at system halt
  parisc: Switch to generic COMPAT_BINFMT_ELF
  parisc: Move cache flush functions into .text.hot section
  parisc/signal: Add FPE_CONDTRAP for conditional trap handling
2018-04-12 17:07:04 -07:00
Linus Torvalds
e241e3f2bf Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio update from Michael Tsirkin:
 "This adds reporting hugepage stats to virtio-balloon"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio_balloon: export hugetlb page allocation counts
2018-04-11 18:58:27 -07:00
Masahiro Yamada
21e7bc600e linux/const.h: refactor _BITUL and _BITULL a bit
Minor cleanups available by _UL and _ULL.

Link: http://lkml.kernel.org/r/1519301715-31798-5-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Masahiro Yamada
2dd8a62c64 linux/const.h: move UL() macro to include/linux/const.h
ARM, ARM64 and UniCore32 duplicate the definition of UL():

  #define UL(x) _AC(x, UL)

This is not actually arch-specific, so it will be useful to move it to a
common header.  Currently, we only have the uapi variant for
linux/const.h, so I am creating include/linux/const.h.

I also added _UL(), _ULL() and ULL() because _AC() is mostly used in
the form either _AC(..., UL) or _AC(..., ULL).  I expect they will be
replaced in follow-up cleanups.  The underscore-prefixed ones should
be used for exported headers.

Link: http://lkml.kernel.org/r/1519301715-31798-4-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Masahiro Yamada
2a6cc8a6c0 linux/const.h: prefix include guard of uapi/linux/const.h with _UAPI
Patch series "linux/const.h: cleanups of macros such as UL(), _BITUL(),
BIT() etc", v3.

ARM, ARM64, UniCore32 define UL() as a shorthand of _AC(..., UL).  More
architectures may introduce it in the future.

UL() is arch-agnostic, and useful. So let's move it to
include/linux/const.h

Currently, <asm/memory.h> must be included to use UL().  It pulls in more
bloats just for defining some bit macros.

I posted V2 one year ago.

The previous posts are:
https://patchwork.kernel.org/patch/9498273/
https://patchwork.kernel.org/patch/9498275/
https://patchwork.kernel.org/patch/9498269/
https://patchwork.kernel.org/patch/9498271/

At that time, what blocked this series was a comment from
David Howells:
  You need to be very careful doing this.  Some userspace stuff
  depends on the guard macro names on the kernel header files.

(https://patchwork.kernel.org/patch/9498275/)

Looking at the code closer, I noticed this is not a problem.

See the following line.
https://github.com/torvalds/linux/blob/v4.16-rc2/scripts/headers_install.sh#L40

scripts/headers_install.sh rips off _UAPI prefix from guard macro names.

I ran "make headers_install" and confirmed the result is what I expect.

So, we can prefix the include guard of include/uapi/linux/const.h,
and add a new include/linux/const.h.

This patch (of 4):

I am going to add include/linux/const.h for the kernel space.

Add _UAPI to the include guard of include/uapi/linux/const.h to
prepare for that.

Please notice the guard name of the exported one will be kept as-is.
So, this commit has no impact to the userspace even if some userspace
stuff depends on the guard macro names.

scripts/headers_install.sh processes exported headers by SED, and
rips off "_UAPI" from guard macro names.

  #ifndef _UAPI_LINUX_CONST_H
  #define _UAPI_LINUX_CONST_H

will be turned into

  #ifndef _LINUX_CONST_H
  #define _LINUX_CONST_H

Link: http://lkml.kernel.org/r/1519301715-31798-2-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Michal Hocko
4ed2863951 fs, elf: drop MAP_FIXED usage from elf_map
Both load_elf_interp and load_elf_binary rely on elf_map to map segments
on a controlled address and they use MAP_FIXED to enforce that.  This is
however dangerous thing prone to silent data corruption which can be
even exploitable.

Let's take CVE-2017-1000253 as an example.  At the time (before commit
eab09532d4: "binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
the stack top on 32b (legacy) memory layout (only 1GB away).  Therefore
we could end up mapping over the existing stack with some luck.

The issue has been fixed since then (a87938b2e2: "fs/binfmt_elf.c: fix
bug in loading of PIE binaries"), ELF_ET_DYN_BASE moved moved much
further from the stack (eab09532d4 and later by c715b72c1b: "mm:
revert x86_64 and arm64 ELF_ET_DYN_BASE base changes") and excessive
stack consumption early during execve fully stopped by da029c11e6
("exec: Limit arg stack to at most 75% of _STK_LIM").  So we should be
safe and any attack should be impractical.  On the other hand this is
just too subtle assumption so it can break quite easily and hard to
spot.

I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
fundamentally dangerous.  Moreover it shouldn't be even needed.  We are
at the early process stage and so there shouldn't be unrelated mappings
(except for stack and loader) existing so mmap for a given address should
succeed even without MAP_FIXED.  Something is terribly wrong if this is
not the case and we should rather fail than silently corrupt the
underlying mapping.

Address this issue by changing MAP_FIXED to the newly added
MAP_FIXED_NOREPLACE.  This will mean that mmap will fail if there is an
existing mapping clashing with the requested one without clobbering it.

[mhocko@suse.com: fix build]
[akpm@linux-foundation.org: coding-style fixes]
[avagin@openvz.org: don't use the same value for MAP_FIXED_NOREPLACE and MAP_SYNC]
  Link: http://lkml.kernel.org/r/20171218184916.24445-1-avagin@openvz.org
Link: http://lkml.kernel.org/r/20171213092550.2774-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Michal Hocko
a4ff8e8620 mm: introduce MAP_FIXED_NOREPLACE
Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.

This has started as a follow up discussion [3][4] resulting in the
runtime failure caused by hardening patch [5] which removes MAP_FIXED
from the elf loader because MAP_FIXED is inherently dangerous as it
might silently clobber an existing underlying mapping (e.g.  stack).
The reason for the failure is that some architectures enforce an
alignment for the given address hint without MAP_FIXED used (e.g.  for
shared or file backed mappings).

One way around this would be excluding those archs which do alignment
tricks from the hardening [6].  The patch is really trivial but it has
been objected, rightfully so, that this screams for a more generic
solution.  We basically want a non-destructive MAP_FIXED.

The first patch introduced MAP_FIXED_NOREPLACE which enforces the given
address but unlike MAP_FIXED it fails with EEXIST if the given range
conflicts with an existing one.  The flag is introduced as a completely
new one rather than a MAP_FIXED extension because of the backward
compatibility.  We really want a never-clobber semantic even on older
kernels which do not recognize the flag.  Unfortunately mmap sucks
wrt flags evaluation because we do not EINVAL on unknown flags.  On
those kernels we would simply use the traditional hint based semantic so
the caller can still get a different address (which sucks) but at least
not silently corrupt an existing mapping.  I do not see a good way
around that.  Except we won't export expose the new semantic to the
userspace at all.

It seems there are users who would like to have something like that.
Jemalloc has been mentioned by Michael Ellerman [7]

Florian Weimer has mentioned the following:
: glibc ld.so currently maps DSOs without hints.  This means that the kernel
: will map right next to each other, and the offsets between them a completely
: predictable.  We would like to change that and supply a random address in a
: window of the address space.  If there is a conflict, we do not want the
: kernel to pick a non-random address. Instead, we would try again with a
: random address.

John Hubbard has mentioned CUDA example
: a) Searches /proc/<pid>/maps for a "suitable" region of available
: VA space.  "Suitable" generally means it has to have a base address
: within a certain limited range (a particular device model might
: have odd limitations, for example), it has to be large enough, and
: alignment has to be large enough (again, various devices may have
: constraints that lead us to do this).
:
: This is of course subject to races with other threads in the process.
:
: Let's say it finds a region starting at va.
:
: b) Next it does:
:     p = mmap(va, ...)
:
: *without* setting MAP_FIXED, of course (so va is just a hint), to
: attempt to safely reserve that region. If p != va, then in most cases,
: this is a failure (almost certainly due to another thread getting a
: mapping from that region before we did), and so this layer now has to
: call munmap(), before returning a "failure: retry" to upper layers.
:
:     IMPROVEMENT: --> if instead, we could call this:
:
:             p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
:
:         , then we could skip the munmap() call upon failure. This
:         is a small thing, but it is useful here. (Thanks to Piotr
:         Jaroszynski and Mark Hairgrove for helping me get that detail
:         exactly right, btw.)
:
: c) After that, CUDA suballocates from p, via:
:
:      q = mmap(sub_region_start, ... MAP_FIXED ...)
:
: Interestingly enough, "freeing" is also done via MAP_FIXED, and
: setting PROT_NONE to the subregion. Anyway, I just included (c) for
: general interest.

Atomic address range probing in the multithreaded programs in general
sounds like an interesting thing to me.

The second patch simply replaces MAP_FIXED use in elf loader by
MAP_FIXED_NOREPLACE.  I believe other places which rely on MAP_FIXED
should follow.  Actually real MAP_FIXED usages should be docummented
properly and they should be more of an exception.

[1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
[3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
[4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
[5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
[6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
[7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au

This patch (of 2):

MAP_FIXED is used quite often to enforce mapping at the particular range.
The main problem of this flag is, however, that it is inherently dangerous
because it unmaps existing mappings covered by the requested range.  This
can cause silent memory corruptions.  Some of them even with serious
security implications.  While the current semantic might be really
desiderable in many cases there are others which would want to enforce the
given range but rather see a failure than a silent memory corruption on a
clashing range.  Please note that there is no guarantee that a given range
is obeyed by the mmap even when it is free - e.g.  arch specific code is
allowed to apply an alignment.

Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this
behavior.  It has the same semantic as MAP_FIXED wrt.  the given address
request with a single exception that it fails with EEXIST if the requested
address is already covered by an existing mapping.  We still do rely on
get_unmaped_area to handle all the arch specific MAP_FIXED treatment and
check for a conflicting vma after it returns.

The flag is introduced as a completely new one rather than a MAP_FIXED
extension because of the backward compatibility.  We really want a
never-clobber semantic even on older kernels which do not recognize the
flag.  Unfortunately mmap sucks wrt.  flags evaluation because we do not
EINVAL on unknown flags.  On those kernels we would simply use the
traditional hint based semantic so the caller can still get a different
address (which sucks) but at least not silently corrupt an existing
mapping.  I do not see a good way around that.

[mpe@ellerman.id.au: fix whitespace]
[fail on clashing range with EEXIST as per Florian Weimer]
[set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jason Evans <jasone@google.com>
Cc: David Goldblatt <davidtgoldblatt@gmail.com>
Cc: Edward Tomasz NapieraƂa <trasz@FreeBSD.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Davidlohr Bueso
23c8cec8cf ipc/msg: introduce msgctl(MSG_STAT_ANY)
There is a permission discrepancy when consulting msq ipc object
metadata between /proc/sysvipc/msg (0444) and the MSG_STAT shmctl
command.  The later does permission checks for the object vs S_IRUGO.
As such there can be cases where EACCESS is returned via syscall but the
info is displayed anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the msq metadata), this behavior goes way back and showing
all the objects regardless of the permissions was most likely an
overlook - so we are stuck with it.  Furthermore, modifying either the
syscall or the procfs file can cause userspace programs to break (ie
ipcs).  Some applications require getting the procfs info (without root
privileges) and can be rather slow in comparison with a syscall -- up to
500x in some reported cases for shm.

This patch introduces a new MSG_STAT_ANY command such that the msq ipc
object permissions are ignored, and only audited instead.  In addition,
I've left the lsm security hook checks in place, as if some policy can
block the call, then the user has no other choice than just parsing the
procfs file.

Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Robert Kettler <robert.kettler@outlook.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:37 -07:00
Davidlohr Bueso
a280d6dc77 ipc/sem: introduce semctl(SEM_STAT_ANY)
There is a permission discrepancy when consulting shm ipc object
metadata between /proc/sysvipc/sem (0444) and the SEM_STAT semctl
command.  The later does permission checks for the object vs S_IRUGO.
As such there can be cases where EACCESS is returned via syscall but the
info is displayed anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the sma metadata), this behavior goes way back and showing
all the objects regardless of the permissions was most likely an
overlook - so we are stuck with it.  Furthermore, modifying either the
syscall or the procfs file can cause userspace programs to break (ie
ipcs).  Some applications require getting the procfs info (without root
privileges) and can be rather slow in comparison with a syscall -- up to
500x in some reported cases for shm.

This patch introduces a new SEM_STAT_ANY command such that the sem ipc
object permissions are ignored, and only audited instead.  In addition,
I've left the lsm security hook checks in place, as if some policy can
block the call, then the user has no other choice than just parsing the
procfs file.

Link: http://lkml.kernel.org/r/20180215162458.10059-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Robert Kettler <robert.kettler@outlook.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:37 -07:00
Davidlohr Bueso
c21a6970ae ipc/shm: introduce shmctl(SHM_STAT_ANY)
Patch series "sysvipc: introduce STAT_ANY commands", v2.

The following patches adds the discussed (see [1]) new command for shm
as well as for sems and msq as they are subject to the same
discrepancies for ipc object permission checks between the syscall and
via procfs.  These new commands are justified in that (1) we are stuck
with this semantics as changing syscall and procfs can break userland;
and (2) some users can benefit from performance (for large amounts of
shm segments, for example) from not having to parse the procfs
interface.

Once merged, I will submit the necesary manpage updates.  But I'm thinking
something like:

: diff --git a/man2/shmctl.2 b/man2/shmctl.2
: index 7bb503999941..bb00bbe21a57 100644
: --- a/man2/shmctl.2
: +++ b/man2/shmctl.2
: @@ -41,6 +41,7 @@
:  .\" 2005-04-25, mtk -- noted aberrant Linux behavior w.r.t. new
:  .\"	attaches to a segment that has already been marked for deletion.
:  .\" 2005-08-02, mtk: Added IPC_INFO, SHM_INFO, SHM_STAT descriptions.
: +.\" 2018-02-13, dbueso: Added SHM_STAT_ANY description.
:  .\"
:  .TH SHMCTL 2 2017-09-15 "Linux" "Linux Programmer's Manual"
:  .SH NAME
: @@ -242,6 +243,18 @@ However, the
:  argument is not a segment identifier, but instead an index into
:  the kernel's internal array that maintains information about
:  all shared memory segments on the system.
: +.TP
: +.BR SHM_STAT_ANY " (Linux-specific)"
: +Return a
: +.I shmid_ds
: +structure as for
: +.BR SHM_STAT .
: +However, the
: +.I shm_perm.mode
: +is not checked for read access for
: +.IR shmid ,
: +resembing the behaviour of
: +/proc/sysvipc/shm.
:  .PP
:  The caller can prevent or allow swapping of a shared
:  memory segment with the following \fIcmd\fP values:
: @@ -287,7 +300,7 @@ operation returns the index of the highest used entry in the
:  kernel's internal array recording information about all
:  shared memory segments.
:  (This information can be used with repeated
: -.B SHM_STAT
: +.B SHM_STAT/SHM_STAT_ANY
:  operations to obtain information about all shared memory segments
:  on the system.)
:  A successful
: @@ -328,7 +341,7 @@ isn't accessible.
:  \fIshmid\fP is not a valid identifier, or \fIcmd\fP
:  is not a valid command.
:  Or: for a
: -.B SHM_STAT
: +.B SHM_STAT/SHM_STAT_ANY
:  operation, the index value specified in
:  .I shmid
:  referred to an array slot that is currently unused.

This patch (of 3):

There is a permission discrepancy when consulting shm ipc object metadata
between /proc/sysvipc/shm (0444) and the SHM_STAT shmctl command.  The
later does permission checks for the object vs S_IRUGO.  As such there can
be cases where EACCESS is returned via syscall but the info is displayed
anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the shm metadata), this behavior goes way back and showing all
the objects regardless of the permissions was most likely an overlook - so
we are stuck with it.  Furthermore, modifying either the syscall or the
procfs file can cause userspace programs to break (ie ipcs).  Some
applications require getting the procfs info (without root privileges) and
can be rather slow in comparison with a syscall -- up to 500x in some
reported cases.

This patch introduces a new SHM_STAT_ANY command such that the shm ipc
object permissions are ignored, and only audited instead.  In addition,
I've left the lsm security hook checks in place, as if some policy can
block the call, then the user has no other choice than just parsing the
procfs file.

[1] https://lkml.org/lkml/2017/12/19/220

Link: http://lkml.kernel.org/r/20180215162458.10059-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Robert Kettler <robert.kettler@outlook.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:37 -07:00