Commit Graph

839 Commits

Author SHA1 Message Date
Kees Cook
1c542f38ab mm: Introduce kvcalloc()
The kv*alloc()-family was missing kvcalloc(). Adding this allows for
2-argument multiplication conversions of kvzalloc(a * b, ...) into
kvcalloc(a, b, ...).

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Linus Torvalds
7d3bf613e9 Merge tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
 "This adds a user for the new 'bytes-remaining' updates to
  memcpy_mcsafe() that you already received through Ingo via the
  x86-dax- for-linus pull.

  Not included here, but still targeting this cycle, is support for
  handling memory media errors (poison) consumed via userspace dax
  mappings.

  Summary:

   - DAX broke a fundamental assumption of truncate of file mapped
     pages. The truncate path assumed that it is safe to disconnect a
     pinned page from a file and let the filesystem reclaim the physical
     block. With DAX the page is equivalent to the filesystem block.
     Introduce dax_layout_busy_page() to enable filesystems to wait for
     pinned DAX pages to be released. Without this wait a filesystem
     could allocate blocks under active device-DMA to a new file.

   - DAX arranges for the block layer to be bypassed and uses
     dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
     However, the memcpy_mcsafe() facility is available through the pmem
     block driver. In order to safely handle media errors, via the DAX
     block-layer bypass, introduce copy_to_iter_mcsafe().

   - Fix cache management policy relative to the ACPI NFIT Platform
     Capabilities Structure to properly elide cache flushes when they
     are not necessary. The table indicates whether CPU caches are
     power-fail protected. Clarify that a deep flush is always performed
     on REQ_{FUA,PREFLUSH} requests"

* tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
  dax: Use dax_write_cache* helpers
  libnvdimm, pmem: Do not flush power-fail protected CPU caches
  libnvdimm, pmem: Unconditionally deep flush on *sync
  libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
  acpi, nfit: Remove ecc_unit_size
  dax: dax_insert_mapping_entry always succeeds
  libnvdimm, e820: Register all pmem resources
  libnvdimm: Debug probe times
  linvdimm, pmem: Preserve read-only setting for pmem devices
  x86, nfit_test: Add unit test for memcpy_mcsafe()
  pmem: Switch to copy_to_iter_mcsafe()
  dax: Report bytes remaining in dax_iomap_actor()
  dax: Introduce a ->copy_to_iter dax operation
  uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
  xfs, dax: introduce xfs_break_dax_layouts()
  xfs: prepare xfs_break_layouts() for another layout type
  xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
  mm, fs, dax: handle layout changes to pinned dax mappings
  mm: fix __gup_device_huge vs unmap
  mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
  ...
2018-06-08 17:21:52 -07:00
Dan Williams
b56845794e Merge branch 'for-4.18/dax' into libnvdimm-for-next 2018-06-08 15:16:40 -07:00
Linus Torvalds
68abbe7295 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc things

 - ocfs2 updates

 - v9fs updates

 - MM

 - procfs updates

 - lib/ updates

 - autofs updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
  autofs: small cleanup in autofs_getpath()
  autofs: clean up includes
  autofs: comment on selinux changes needed for module autoload
  autofs: update MAINTAINERS entry for autofs
  autofs: use autofs instead of autofs4 in documentation
  autofs: rename autofs documentation files
  autofs: create autofs Kconfig and Makefile
  autofs: delete fs/autofs4 source files
  autofs: update fs/autofs4/Makefile
  autofs: update fs/autofs4/Kconfig
  autofs: copy autofs4 to autofs
  autofs4: use autofs instead of autofs4 everywhere
  autofs4: merge auto_fs.h and auto_fs4.h
  fs/binfmt_misc.c: do not allow offset overflow
  checkpatch: improve patch recognition
  lib/ucs2_string.c: add MODULE_LICENSE()
  lib/mpi: headers cleanup
  lib/percpu_ida.c: use _irqsave() instead of local_irq_save() + spin_lock
  lib/idr.c: remove simple_ida_lock
  lib/bitmap.c: micro-optimization for __bitmap_complement()
  ...
2018-06-07 18:39:37 -07:00
Sahara
72eb7de9c1 mm: remove page_is_poisoned() from linux/mm.h
When commit bd33ef3681 ("mm: enable page poisoning early at boot") got
rid of the PAGE_EXT_DEBUG_POISON, page_is_poisoned in the header left
behind.  This patch cleans up the leftovers under the table.

Link: http://lkml.kernel.org/r/1528101069-21637-1-git-send-email-kpark3469@gmail.com
Signed-off-by: Sahara <keun-o.park@darkmatter.ae>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:38 -07:00
Matthew Wilcox
1d40a5ea01 mm: mark pages in use for page tables
Define a new PageTable bit in the page_type and use it to mark pages in
use as page tables.  This can be helpful when debugging crashdumps or
analysing memory fragmentation.  Add a KPF flag to report these pages to
userspace and update page-types.c to interpret that flag.

Note that only pages currently accounted as NR_PAGETABLES are tracked as
PageTable; this does not include pgd/p4d/pud/pmd pages.  Those will be the
subject of a later patch.

Link: http://lkml.kernel.org/r/20180518194519.3820-4-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:37 -07:00
Souptick Joarder
2bcd6454ba mm: use new return type vm_fault_t
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct.  For now, this is just documenting that the
function returns a VM_FAULT value rather than an errno.  Once all
instances are converted, vm_fault_t will become a distinct type.

Link: http://lkml.kernel.org/r/20180511190542.GA2412@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:36 -07:00
Souptick Joarder
ab77dab462 fs/dax.c: use new return type vm_fault_t
Use new return type vm_fault_t for fault handler.  For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno.  Once all instances are converted, vm_fault_t will become a
distinct type.

commit 1c8f422059 ("mm: change return type to vm_fault_t")

There was an existing bug inside dax_load_hole() if vm_insert_mixed had
failed to allocate a page table, we'd return VM_FAULT_NOPAGE instead of
VM_FAULT_OOM.  With new vmf_insert_mixed() this issue is addressed.

vm_insert_mixed_mkwrite has inefficiency when it returns an error value,
driver has to convert it to vm_fault_t type.  With new
vmf_insert_mixed_mkwrite() this limitation will be addressed.

Link: http://lkml.kernel.org/r/20180510181121.GA15239@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:33 -07:00
Linus Torvalds
a1cdde8c41 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
 "This has been a quiet cycle for RDMA, the big bulk is the usual
  smallish driver updates and bug fixes. About four new uAPI related
  things. Not as much Szykaller patches this time, the bugs it finds are
  getting harder to fix.

  Summary:

   - More work cleaning up the RDMA CM code

   - Usual driver bug fixes and cleanups for qedr, qib, hfi1, hns,
     i40iw, iw_cxgb4, mlx5, rxe

   - Driver specific resource tracking and reporting via netlink

   - Continued work for name space support from Parav

   - MPLS support for the verbs flow steering uAPI

   - A few tricky IPoIB fixes improving robustness

   - HFI1 driver support for the '16B' management packet format

   - Some auditing to not print kernel pointers via %llx or similar

   - Mark the entire 'UCM' user-space interface as BROKEN with the
     intent to remove it entirely. The user space side of this was long
     ago replaced with RDMA-CM and syzkaller is finding bugs in the
     residual UCM interface nobody wishes to fix because nobody uses it.

   - Purge more bogus BUG_ON's from Leon

   - 'flow counters' verbs uAPI

   - T10 fixups for iser/isert, these are Acked by Martin but going
     through the RDMA tree due to dependencies"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (138 commits)
  RDMA/mlx5: Update SPDX tags to show proper license
  RDMA/restrack: Change SPDX tag to properly reflect license
  IB/hfi1: Fix comment on default hdr entry size
  IB/hfi1: Rename exp_lock to exp_mutex
  IB/hfi1: Add bypass register defines and replace blind constants
  IB/hfi1: Remove unused variable
  IB/hfi1: Ensure VL index is within bounds
  IB/hfi1: Fix user context tail allocation for DMA_RTAIL
  IB/hns: Use zeroing memory allocator instead of allocator/memset
  infiniband: fix a possible use-after-free bug
  iw_cxgb4: add INFINIBAND_ADDR_TRANS dependency
  IB/isert: use T10-PI check mask definitions from core layer
  IB/iser: use T10-PI check mask definitions from core layer
  RDMA/core: introduce check masks for T10-PI offload
  IB/isert: fix T10-pi check mask setting
  IB/mlx5: Add counters read support
  IB/mlx5: Add flow counters read support
  IB/mlx5: Add flow counters binding support
  IB/mlx5: Add counters create and destroy support
  IB/uverbs: Add support for flow counters
  ...
2018-06-07 13:04:07 -07:00
Linus Torvalds
c90fca951e Merge tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
 "Notable changes:

   - Support for split PMD page table lock on 64-bit Book3S (Power8/9).

   - Add support for HAVE_RELIABLE_STACKTRACE, so we properly support
     live patching again.

   - Add support for patching barrier_nospec in copy_from_user() and
     syscall entry.

   - A couple of fixes for our data breakpoints on Book3S.

   - A series from Nick optimising TLB/mm handling with the Radix MMU.

   - Numerous small cleanups to squash sparse/gcc warnings from Mathieu
     Malaterre.

   - Several series optimising various parts of the 32-bit code from
     Christophe Leroy.

   - Removal of support for two old machines, "SBC834xE" and "C2K"
     ("GEFanuc,C2K"), which is why the diffstat has so many deletions.

  And many other small improvements & fixes.

  There's a few out-of-area changes. Some minor ftrace changes OK'ed by
  Steve, and a fix to our powernv cpuidle driver. Then there's a series
  touching mm, x86 and fs/proc/task_mmu.c, which cleans up some details
  around pkey support. It was ack'ed/reviewed by Ingo & Dave and has
  been in next for several weeks.

  Thanks to: Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al
  Viro, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd
  Bergmann, Balbir Singh, Cédric Le Goater, Christophe Leroy, Christophe
  Lombard, Colin Ian King, Dave Hansen, Fabio Estevam, Finn Thain,
  Frederic Barrat, Gautham R. Shenoy, Haren Myneni, Hari Bathini, Ingo
  Molnar, Jonathan Neuschäfer, Josh Poimboeuf, Kamalesh Babulal,
  Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu Malaterre,
  Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
  Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
  Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica
  Gupta, Ravi Bangoria, Russell Currey, Sam Bobroff, Samuel
  Mendoza-Jonas, Segher Boessenkool, Shilpasri G Bhat, Simon Guo,
  Souptick Joarder, Stewart Smith, Thiago Jung Bauermann, Torsten Duwe,
  Vaibhav Jain, Wei Yongjun, Wolfram Sang, Yisheng Xie, YueHaibing"

* tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (251 commits)
  powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap
  cpuidle: powernv: Fix promotion from snooze if next state disabled
  powerpc: fix build failure by disabling attribute-alias warning in pci_32
  ocxl: Fix missing unlock on error in afu_ioctl_enable_p9_wait()
  powerpc-opal: fix spelling mistake "Uniterrupted" -> "Uninterrupted"
  powerpc: fix spelling mistake: "Usupported" -> "Unsupported"
  powerpc/pkeys: Detach execute_only key on !PROT_EXEC
  powerpc/powernv: copy/paste - Mask SO bit in CR
  powerpc: Remove core support for Marvell mv64x60 hostbridges
  powerpc/boot: Remove core support for Marvell mv64x60 hostbridges
  powerpc/boot: Remove support for Marvell mv64x60 i2c controller
  powerpc/boot: Remove support for Marvell MPSC serial controller
  powerpc/embedded6xx: Remove C2K board support
  powerpc/lib: optimise PPC32 memcmp
  powerpc/lib: optimise 32 bits __clear_user()
  powerpc/time: inline arch_vtime_task_switch()
  powerpc/Makefile: set -mcpu=860 flag for the 8xx
  powerpc: Implement csum_ipv6_magic in assembly
  powerpc/32: Optimise __csum_partial()
  powerpc/lib: Adjust .balign inside string functions for PPC32
  ...
2018-06-07 10:23:33 -07:00
Linus Torvalds
2857676045 Merge tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull overflow updates from Kees Cook:
 "This adds the new overflow checking helpers and adds them to the
  2-factor argument allocators. And this adds the saturating size
  helpers and does a treewide replacement for the struct_size() usage.
  Additionally this adds the overflow testing modules to make sure
  everything works.

  I'm still working on the treewide replacements for allocators with
  "simple" multiplied arguments:

     *alloc(a * b, ...) -> *alloc_array(a, b, ...)

  and

     *zalloc(a * b, ...) -> *calloc(a, b, ...)

  as well as the more complex cases, but that's separable from this
  portion of the series. I expect to have the rest sent before -rc1
  closes; there are a lot of messy cases to clean up.

  Summary:

   - Introduce arithmetic overflow test helper functions (Rasmus)

   - Use overflow helpers in 2-factor allocators (Kees, Rasmus)

   - Introduce overflow test module (Rasmus, Kees)

   - Introduce saturating size helper functions (Matthew, Kees)

   - Treewide use of struct_size() for allocators (Kees)"

* tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  treewide: Use struct_size() for devm_kmalloc() and friends
  treewide: Use struct_size() for vmalloc()-family
  treewide: Use struct_size() for kmalloc()-family
  device: Use overflow helpers for devm_kmalloc()
  mm: Use overflow helpers in kvmalloc()
  mm: Use overflow helpers in kmalloc_array*()
  test_overflow: Add memory allocation overflow tests
  overflow.h: Add allocation size calculation helpers
  test_overflow: Report test failures
  test_overflow: macrofy some more, do more tests for free
  lib: add runtime test of check_*_overflow functions
  compiler.h: enable builtin overflow checkers and add fallback code
2018-06-06 17:27:14 -07:00
Kees Cook
3b3b1a29eb mm: Use overflow helpers in kvmalloc()
Instead of open-coded multiplication and bounds checking, use the new
overflow helper. Additionally prepare for vmalloc() users to add
array_size()-family helpers in the future.

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-05 12:16:51 -07:00
Leon Romanovsky
27d036e332 mm: Remove return value of zap_vma_ptes()
All callers of zap_vma_ptes() are not interested in the return value of
that function, so let's simplify its interface and drop the return
value.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-06-01 11:20:43 -04:00
Joonsoo Kim
d883c6cf3b Revert "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE"
This reverts the following commits that change CMA design in MM.

 3d2054ad8c ("ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=y")

 1d47a3ec09 ("mm/cma: remove ALLOC_CMA")

 bad8c6c0b1 ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")

Ville reported a following error on i386.

  Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
  microcode: microcode updated early to revision 0x4, date = 2013-06-28
  Initializing CPU#0
  Initializing HighMem for node 0 (000377fe:00118000)
  Initializing Movable for node 0 (00000001:00118000)
  BUG: Bad page state in process swapper  pfn:377fe
  page:f53effc0 count:0 mapcount:-127 mapping:00000000 index:0x0
  flags: 0x80000000()
  raw: 80000000 00000000 00000000 ffffff80 00000000 00000100 00000200 00000001
  page dumped because: nonzero mapcount
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0-rc5-elk+ #145
  Hardware name: Dell Inc. Latitude E5410/03VXMC, BIOS A15 07/11/2013
  Call Trace:
   dump_stack+0x60/0x96
   bad_page+0x9a/0x100
   free_pages_check_bad+0x3f/0x60
   free_pcppages_bulk+0x29d/0x5b0
   free_unref_page_commit+0x84/0xb0
   free_unref_page+0x3e/0x70
   __free_pages+0x1d/0x20
   free_highmem_page+0x19/0x40
   add_highpages_with_active_regions+0xab/0xeb
   set_highmem_pages_init+0x66/0x73
   mem_init+0x1b/0x1d7
   start_kernel+0x17a/0x363
   i386_start_kernel+0x95/0x99
   startup_32_smp+0x164/0x168

The reason for this error is that the span of MOVABLE_ZONE is extended
to whole node span for future CMA initialization, and, normal memory is
wrongly freed here.  I submitted the fix and it seems to work, but,
another problem happened.

It's so late time to fix the later problem so I decide to reverting the
series.

Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Acked-by: Laura Abbott <labbott@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-24 10:07:50 -07:00
Dan Williams
e763848843 mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
be able to rely on the fact that they will get wakeups on dev_pagemap
page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
generic_dax_page_free() as common indicator / infrastructure for dax
filesytems to require. With this change there are no users of the
MEMORY_DEVICE_HOST designation, so remove it.

The HMM sub-system extended dev_pagemap to arrange a callback when a
dev_pagemap managed page is freed. Since a dev_pagemap page is free /
idle when its reference count is 1 it requires an additional branch to
check the page-type at put_page() time. Given put_page() is a hot-path
we do not want to incur that check if HMM is not in use, so a static
branch is used to avoid that overhead when not necessary.

Now, the FS_DAX implementation wants to reuse this mechanism for
receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
static-key into a generic mechanism that either HMM or FS_DAX code paths
can enable.

For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
However, we still need to support FS_DAX in the FS_DAX_LIMITED case
implemented by the s390/dcssblk driver.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Thomas Meyer <thomas@m3y3r.de>
Reported-by: Dave Jiang <dave.jiang@intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-22 06:59:39 -07:00
Souptick Joarder
d97baf9470 include/linux/mm.h: add new inline function vmf_error()
Many places in drivers/ file systems, error was handled in a common way
like below:

	ret = (ret == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;

vmf_error() will replace this and return vm_fault_t type err.

A lot of drivers and filesystems currently have a rather complex mapping
of errno-to-VM_FAULT code.  We have been able to eliminate a lot of it
by just returning VM_FAULT codes directly from functions which are
called exclusively from the fault handling path.

Some functions can be called both from the fault handler and other
context which are expecting an errno, so they have to continue to return
an errno.  Some users still need to choose different behaviour for
different errnos, but vmf_error() captures the essential error
translation that's common to all users, and those that need to handle
additional errors can handle them first.

Link: http://lkml.kernel.org/r/20180510174826.GA14268@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-18 17:17:12 -07:00
Willy Tarreau
7f7ccc2ccc proc: do not access cmdline nor environ from file-backed areas
proc_pid_cmdline_read() and environ_read() directly access the target
process' VM to retrieve the command line and environment. If this
process remaps these areas onto a file via mmap(), the requesting
process may experience various issues such as extra delays if the
underlying device is slow to respond.

Let's simply refuse to access file-backed areas in these functions.
For this we add a new FOLL_ANON gup flag that is passed to all calls
to access_remote_vm(). The code already takes care of such failures
(including unmapped areas). Accesses via /proc/pid/mem were not
changed though.

This was assigned CVE-2018-1120.

Note for stable backports: the patch may apply to kernels prior to 4.11
but silently miss one location; it must be checked that no call to
access_remote_vm() keeps zero as the last argument.

Reported-by: Qualys Security Advisory <qsa@qualys.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-17 09:27:47 -07:00
Ram Pai
2c9e0a6fa2 mm, powerpc, x86: introduce an additional vma bit for powerpc pkey
Currently only 4bits are allocated in the vma flags to hold 16
keys. This is sufficient for x86. PowerPC  supports  32  keys,
which needs 5bits. This patch allocates an  additional bit.

Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[mpe: Fold in #if VM_PKEY_BIT4 as noticed by Dave Hansen]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-09 11:50:27 +10:00
Ram Pai
5212213aa5 mm, powerpc, x86: define VM_PKEY_BITx bits if CONFIG_ARCH_HAS_PKEYS is enabled
VM_PKEY_BITx are defined only if CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
is enabled. Powerpc also needs these bits. Hence lets define the
VM_PKEY_BITx bits for any architecture that enables
CONFIG_ARCH_HAS_PKEYS.

Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-09 00:35:33 +10:00
Matthew Wilcox
b93b016313 page cache: use xa_lock
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
  Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:39 -07:00
Matthew Wilcox
f82b376413 export __set_page_dirty
XFS currently contains a copy-and-paste of __set_page_dirty().  Export
it from buffer.c instead.

Link: http://lkml.kernel.org/r/20180313132639.17387-6-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:39 -07:00
Joonsoo Kim
bad8c6c0b1 mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE
Patch series "mm/cma: manage the memory of the CMA area by using the
ZONE_MOVABLE", v2.

0. History

This patchset is the follow-up of the discussion about the "Introduce
ZONE_CMA (v7)" [1].  Please reference it if more information is needed.

1. What does this patch do?

This patch changes the management way for the memory of the CMA area in
the MM subsystem.  Currently the memory of the CMA area is managed by
the zone where their pfn is belong to.  However, this approach has some
problems since MM subsystem doesn't have enough logic to handle the
situation that different characteristic memories are in a single zone.
To solve this issue, this patch try to manage all the memory of the CMA
area by using the MOVABLE zone.  In MM subsystem's point of view,
characteristic of the memory on the MOVABLE zone and the memory of the
CMA area are the same.  So, managing the memory of the CMA area by using
the MOVABLE zone will not have any problem.

2. Motivation

There are some problems with current approach.  See following.  Although
these problem would not be inherent and it could be fixed without this
conception change, it requires many hooks addition in various code path
and it would be intrusive to core MM and would be really error-prone.
Therefore, I try to solve them with this new approach.  Anyway,
following is the problems of the current implementation.

o CMA memory utilization

First, following is the freepage calculation logic in MM.

 - For movable allocation: freepage = total freepage
 - For unmovable allocation: freepage = total freepage - CMA freepage

Freepages on the CMA area is used after the normal freepages in the zone
where the memory of the CMA area is belong to are exhausted.  At that
moment that the number of the normal freepages is zero, so

 - For movable allocation: freepage = total freepage = CMA freepage
 - For unmovable allocation: freepage = 0

If unmovable allocation comes at this moment, allocation request would
fail to pass the watermark check and reclaim is started.  After reclaim,
there would exist the normal freepages so freepages on the CMA areas
would not be used.

FYI, there is another attempt [2] trying to solve this problem in lkml.
And, as far as I know, Qualcomm also has out-of-tree solution for this
problem.

Useless reclaim:

There is no logic to distinguish CMA pages in the reclaim path.  Hence,
CMA page is reclaimed even if the system just needs the page that can be
usable for the kernel allocation.

Atomic allocation failure:

This is also related to the fallback allocation policy for the memory of
the CMA area.  Consider the situation that the number of the normal
freepages is *zero* since the bunch of the movable allocation requests
come.  Kswapd would not be woken up due to following freepage
calculation logic.

- For movable allocation: freepage = total freepage = CMA freepage

If atomic unmovable allocation request comes at this moment, it would
fails due to following logic.

- For unmovable allocation: freepage = total freepage - CMA freepage = 0

It was reported by Aneesh [3].

Useless compaction:

Usual high-order allocation request is unmovable allocation request and
it cannot be served from the memory of the CMA area.  In compaction,
migration scanner try to migrate the page in the CMA area and make
high-order page there.  As mentioned above, it cannot be usable for the
unmovable allocation request so it's just waste.

3. Current approach and new approach

Current approach is that the memory of the CMA area is managed by the
zone where their pfn is belong to.  However, these memory should be
distinguishable since they have a strong limitation.  So, they are
marked as MIGRATE_CMA in pageblock flag and handled specially.  However,
as mentioned in section 2, the MM subsystem doesn't have enough logic to
deal with this special pageblock so many problems raised.

New approach is that the memory of the CMA area is managed by the
MOVABLE zone.  MM already have enough logic to deal with special zone
like as HIGHMEM and MOVABLE zone.  So, managing the memory of the CMA
area by the MOVABLE zone just naturally work well because constraints
for the memory of the CMA area that the memory should always be
migratable is the same with the constraint for the MOVABLE zone.

There is one side-effect for the usability of the memory of the CMA
area.  The use of MOVABLE zone is only allowed for a request with
GFP_HIGHMEM && GFP_MOVABLE so now the memory of the CMA area is also
only allowed for this gfp flag.  Before this patchset, a request with
GFP_MOVABLE can use them.  IMO, It would not be a big issue since most
of GFP_MOVABLE request also has GFP_HIGHMEM flag.  For example, file
cache page and anonymous page.  However, file cache page for blockdev
file is an exception.  Request for it has no GFP_HIGHMEM flag.  There is
pros and cons on this exception.  In my experience, blockdev file cache
pages are one of the top reason that causes cma_alloc() to fail
temporarily.  So, we can get more guarantee of cma_alloc() success by
discarding this case.

Note that there is no change in admin POV since this patchset is just
for internal implementation change in MM subsystem.  Just one minor
difference for admin is that the memory stat for CMA area will be
printed in the MOVABLE zone.  That's all.

4. Result

Following is the experimental result related to utilization problem.

8 CPUs, 1024 MB, VIRTUAL MACHINE
make -j16

<Before>
  CMA area:               0 MB            512 MB
  Elapsed-time:           92.4		186.5
  pswpin:                 82		18647
  pswpout:                160		69839

<After>
  CMA        :            0 MB            512 MB
  Elapsed-time:           93.1		93.4
  pswpin:                 84		46
  pswpout:                183		92

akpm: "kernel test robot" reported a 26% improvement in
vm-scalability.throughput:
http://lkml.kernel.org/r/20180330012721.GA3845@yexl-desktop

[1]: lkml.kernel.org/r/1491880640-9944-1-git-send-email-iamjoonsoo.kim@lge.com
[2]: https://lkml.org/lkml/2014/10/15/623
[3]: http://www.spinics.net/lists/linux-mm/msg100562.html

Link: http://lkml.kernel.org/r/1512114786-5085-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Tested-by: Tony Lindgren <tony@atomide.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:32 -07:00
Souptick Joarder
1c8f422059 mm: change return type to vm_fault_t
The plan for these patches is to introduce the typedef, initially just
as documentation ("These functions should return a VM_FAULT_ status").
We'll trickle the patches to individual drivers/filesystems in through
the maintainers, as far as possible.  Then we'll change the typedef to
an unsigned int and break the compilation of any unconverted
drivers/filesystems.

vmf_insert_page(), vmf_insert_mixed() and vmf_insert_pfn() are three
newly added functions.  The various drivers/filesystems where return
value of fault(), huge_fault(), page_mkwrite() and pfn_mkwrite() get
converted, will need them.  These functions will return correct
VM_FAULT_ code based on err value.

We've had bugs before where drivers returned -EFOO.  And we have this
silly inefficiency where vm_insert_xxx() return an errno which (afaict)
every driver then converts into a VM_FAULT code.  In many cases drivers
failed to return correct VM_FAULT code value despite of vm_insert_xxx()
fails.  We have indentified and clean up all those existing bugs and
silly inefficiencies in driver/filesystems by adding these three new
inline wrappers.  As mentioned above, we will trickle those patches to
individual drivers/filesystems in through maintainers after these three
wrapper functions are merged.

Eventually we can convert vm_insert_xxx() into vmf_insert_xxx() and
remove these inline wrappers, but these are a good intermediate step.

Link: http://lkml.kernel.org/r/20180310162351.GA7422@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Huang Ying
cb9f753a37 mm: fix races between swapoff and flush dcache
Thanks to commit 4b3ef9daa4 ("mm/swap: split swap cache into 64MB
trunks"), after swapoff the address_space associated with the swap
device will be freed.  So page_mapping() users which may touch the
address_space need some kind of mechanism to prevent the address_space
from being freed during accessing.

The dcache flushing functions (flush_dcache_page(), etc) in architecture
specific code may access the address_space of swap device for anonymous
pages in swap cache via page_mapping() function.  But in some cases
there are no mechanisms to prevent the swap device from being swapoff,
for example,

  CPU1					CPU2
  __get_user_pages()			swapoff()
    flush_dcache_page()
      mapping = page_mapping()
        ...				  exit_swap_address_space()
        ...				    kvfree(spaces)
        mapping_mapped(mapping)

The address space may be accessed after being freed.

But from cachetlb.txt and Russell King, flush_dcache_page() only care
about file cache pages, for anonymous pages, flush_anon_page() should be
used.  The implementation of flush_dcache_page() in all architectures
follows this too.  They will check whether page_mapping() is NULL and
whether mapping_mapped() is true to determine whether to flush the
dcache immediately.  And they will use interval tree (mapping->i_mmap)
to find all user space mappings.  While mapping_mapped() and
mapping->i_mmap isn't used by anonymous pages in swap cache at all.

So, to fix the race between swapoff and flush dcache, __page_mapping()
is add to return the address_space for file cache pages and NULL
otherwise.  All page_mapping() invoking in flush dcache functions are
replaced with page_mapping_file().

[akpm@linux-foundation.org: simplify page_mapping_file(), per Mike]
Link: http://lkml.kernel.org/r/20180305083634.15174-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Guenter Roeck
5844a486da include/linux/mm.h: provide consistent declaration for num_poisoned_pages
clang reports the following compile warning.

  In file included from mm/vmscan.c:56:
  ./include/linux/swapops.h:327:22: warning:
	section attribute is specified on redeclared variable [-Wsection]
  extern atomic_long_t num_poisoned_pages __read_mostly;
                       ^
  ./include/linux/mm.h:2585:22: note: previous declaration is here
  extern atomic_long_t num_poisoned_pages;
                     ^

Let's use __read_mostly everywhere.

Link: http://lkml.kernel.org/r/1519686565-8224-1-git-send-email-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00