Commit Graph

859 Commits

Author SHA1 Message Date
Linus Torvalds
2d6bb6adb7 Merge tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull stackleak gcc plugin from Kees Cook:
 "Please pull this new GCC plugin, stackleak, for v4.20-rc1. This plugin
  was ported from grsecurity by Alexander Popov. It provides efficient
  stack content poisoning at syscall exit. This creates a defense
  against at least two classes of flaws:

   - Uninitialized stack usage. (We continue to work on improving the
     compiler to do this in other ways: e.g. unconditional zero init was
     proposed to GCC and Clang, and more plugin work has started too).

   - Stack content exposure. By greatly reducing the lifetime of valid
     stack contents, exposures via either direct read bugs or unknown
     cache side-channels become much more difficult to exploit. This
     complements the existing buddy and heap poisoning options, but
     provides the coverage for stacks.

  The x86 hooks are included in this series (which have been reviewed by
  Ingo, Dave Hansen, and Thomas Gleixner). The arm64 hooks have already
  been merged through the arm64 tree (written by Laura Abbott and
  reviewed by Mark Rutland and Will Deacon).

  With VLAs having been removed this release, there is no need for
  alloca() protection, so it has been removed from the plugin"

* tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  arm64: Drop unneeded stackleak_check_alloca()
  stackleak: Allow runtime disabling of kernel stack erasing
  doc: self-protection: Add information about STACKLEAK feature
  fs/proc: Show STACKLEAK metrics in the /proc file system
  lkdtm: Add a test for STACKLEAK
  gcc-plugins: Add STACKLEAK plugin for tracking the kernel stack
  x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls
2018-11-01 11:46:27 -07:00
Johannes Weiner
eb414681d5 psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills.  In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively.  Stall states are aggregate versions of the per-task delay
accounting delays:

       cpu: some tasks are runnable but not executing on a CPU
       memory: tasks are reclaiming, or waiting for swapin or thrashing cache
       io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit.  They can also indicate when the system
is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states.  Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime.  A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).

[hannes@cmpxchg.org: doc fixlet, per Randy]
  Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
  Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
  Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
  Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Roman Gushchin
9b6f7e163c mm: rework memcg kernel stack accounting
If CONFIG_VMAP_STACK is set, kernel stacks are allocated using
__vmalloc_node_range() with __GFP_ACCOUNT.  So kernel stack pages are
charged against corresponding memory cgroups on allocation and uncharged
on releasing them.

The problem is that we do cache kernel stacks in small per-cpu caches and
do reuse them for new tasks, which can belong to different memory cgroups.

Each stack page still holds a reference to the original cgroup, so the
cgroup can't be released until the vmap area is released.

To make this happen we need more than two subsequent exits without forks
in between on the current cpu, which makes it very unlikely to happen.  As
a result, I saw a significant number of dying cgroups (in theory, up to 2
* number_of_cpu + number_of_tasks), which can't be released even by
significant memory pressure.

As a cgroup structure can take a significant amount of memory (first of
all, per-cpu data like memcg statistics), it leads to a noticeable waste
of memory.

Link: http://lkml.kernel.org/r/20180827162621.30187-1-guro@fb.com
Fixes: ac496bf48d ("fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Nadav Amit
1ed0cc5a01 mm: respect arch_dup_mmap() return value
Commit d70f2a14b7 ("include/linux/sched/mm.h: uninline mmdrop_async(),
etc") ignored the return value of arch_dup_mmap(). As a result, on x86,
a failure to duplicate the LDT (e.g. due to memory allocation error)
would leave the duplicated memory mapping in an inconsistent state.

Fix by using the return value, as it was before the change.

Link: http://lkml.kernel.org/r/20180823051229.211856-1-namit@vmware.com
Fixes: d70f2a14b7 ("include/linux/sched/mm.h: uninline mmdrop_async(), etc")
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-04 16:45:02 -07:00
Alexander Popov
afaef01c00 x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls
The STACKLEAK feature (initially developed by PaX Team) has the following
benefits:

1. Reduces the information that can be revealed through kernel stack leak
   bugs. The idea of erasing the thread stack at the end of syscalls is
   similar to CONFIG_PAGE_POISONING and memzero_explicit() in kernel
   crypto, which all comply with FDP_RIP.2 (Full Residual Information
   Protection) of the Common Criteria standard.

2. Blocks some uninitialized stack variable attacks (e.g. CVE-2017-17712,
   CVE-2010-2963). That kind of bugs should be killed by improving C
   compilers in future, which might take a long time.

This commit introduces the code filling the used part of the kernel
stack with a poison value before returning to userspace. Full
STACKLEAK feature also contains the gcc plugin which comes in a
separate commit.

The STACKLEAK feature is ported from grsecurity/PaX. More information at:
  https://grsecurity.net/
  https://pax.grsecurity.net/

This code is modified from Brad Spengler/PaX Team's code in the last
public patch of grsecurity/PaX based on our understanding of the code.
Changes or omissions from the original code are ours and don't reflect
the original grsecurity/PaX code.

Performance impact:

Hardware: Intel Core i7-4770, 16 GB RAM

Test #1: building the Linux kernel on a single core
        0.91% slowdown

Test #2: hackbench -s 4096 -l 2000 -g 15 -f 25 -P
        4.2% slowdown

So the STACKLEAK description in Kconfig includes: "The tradeoff is the
performance impact: on a single CPU system kernel compilation sees a 1%
slowdown, other systems and workloads may vary and you are advised to
test this feature on your expected workload before deploying it".

Signed-off-by: Alexander Popov <alex.popov@linux.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-09-04 10:35:47 -07:00
Linus Torvalds
cd9b44f907 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - the rest of MM

 - procfs updates

 - various misc things

 - more y2038 fixes

 - get_maintainer updates

 - lib/ updates

 - checkpatch updates

 - various epoll updates

 - autofs updates

 - hfsplus

 - some reiserfs work

 - fatfs updates

 - signal.c cleanups

 - ipc/ updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
  ipc/util.c: update return value of ipc_getref from int to bool
  ipc/util.c: further variable name cleanups
  ipc: simplify ipc initialization
  ipc: get rid of ids->tables_initialized hack
  lib/rhashtable: guarantee initial hashtable allocation
  lib/rhashtable: simplify bucket_table_alloc()
  ipc: drop ipc_lock()
  ipc/util.c: correct comment in ipc_obtain_object_check
  ipc: rename ipcctl_pre_down_nolock()
  ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
  ipc: reorganize initialization of kern_ipc_perm.seq
  ipc: compute kern_ipc_perm.id under the ipc lock
  init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
  fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
  adfs: use timespec64 for time conversion
  kernel/sysctl.c: fix typos in comments
  drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
  fork: don't copy inconsistent signal handler state to child
  signal: make get_signal() return bool
  signal: make sigkill_pending() return bool
  ...
2018-08-22 12:34:08 -07:00
Jann Horn
06e62a46bb fork: don't copy inconsistent signal handler state to child
Before this change, if a multithreaded process forks while one of its
threads is changing a signal handler using sigaction(), the memcpy() in
copy_sighand() can race with the struct assignment in do_sigaction().  It
isn't clear whether this can cause corruption of the userspace signal
handler pointer, but it definitely can cause inconsistency between
different fields of struct sigaction.

Take the appropriate spinlock to avoid this.

I have tested that this patch prevents inconsistency between sa_sigaction
and sa_flags, which is possible before this patch.

Link: http://lkml.kernel.org/r/20180702145108.73189-1-jannh@google.com
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:51 -07:00
Dmitry Vyukov
a2e5144538 kernel/hung_task.c: allow to set checking interval separately from timeout
Currently task hung checking interval is equal to timeout, as the result
hung is detected anywhere between timeout and 2*timeout.  This is fine for
most interactive environments, but this hurts automated testing setups
(syzbot).  In an automated setup we need to strictly order CPU lockup <
RCU stall < workqueue lockup < task hung < silent loss, so that RCU stall
is not detected as task hung and task hung is not detected as silent
machine loss.  The large variance in task hung detection timeout requires
setting silent machine loss timeout to a very large value (e.g.  if task
hung is 3 mins, then silent loss need to be set to ~7 mins).  The
additional 3 minutes significantly reduce testing efficiency because
usually we crash kernel within a minute, and this can add hours to bug
localization process as it needs to do dozens of tests.

Allow setting checking interval separately from timeout.  This allows to
set timeout to, say, 3 minutes, but checking interval to 10 secs.

The interval is controlled via a new hung_task_check_interval_secs sysctl,
similar to the existing hung_task_timeout_secs sysctl.  The default value
of 0 results in the current behavior: checking interval is equal to
timeout.

[akpm@linux-foundation.org: update hung_task_timeout_max's comment]
Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:47 -07:00
Andrew Morton
a670468f5e mm: zero out the vma in vma_init()
Rather than in vm_area_alloc().  To ensure that the various oddball
stack-based vmas are in a good state.  Some of the callers were zeroing
them out, others were not.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Linus Torvalds
0214f46b3a Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull core signal handling updates from Eric Biederman:
 "It was observed that a periodic timer in combination with a
  sufficiently expensive fork could prevent fork from every completing.
  This contains the changes to remove the need for that restart.

  This set of changes is split into several parts:

   - The first part makes PIDTYPE_TGID a proper pid type instead
     something only for very special cases. The part starts using
     PIDTYPE_TGID enough so that in __send_signal where signals are
     actually delivered we know if the signal is being sent to a a group
     of processes or just a single process.

   - With that prep work out of the way the logic in fork is modified so
     that fork logically makes signals received while it is running
     appear to be received after the fork completes"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
  signal: Don't send signals to tasks that don't exist
  signal: Don't restart fork when signals come in.
  fork: Have new threads join on-going signal group stops
  fork: Skip setting TIF_SIGPENDING in ptrace_init_task
  signal: Add calculate_sigpending()
  fork: Unconditionally exit if a fatal signal is pending
  fork: Move and describe why the code examines PIDNS_ADDING
  signal: Push pid type down into complete_signal.
  signal: Push pid type down into __send_signal
  signal: Push pid type down into send_signal
  signal: Pass pid type into do_send_sig_info
  signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
  signal: Pass pid type into group_send_sig_info
  signal: Pass pid and pid type into send_sigqueue
  posix-timers: Noralize good_sigevent
  signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
  pid: Implement PIDTYPE_TGID
  pids: Move the pgrp and session pid pointers from task_struct to signal_struct
  kvm: Don't open code task_pid in kvm_vcpu_ioctl
  pids: Compute task_tgid using signal->leader_pid
  ...
2018-08-21 13:47:29 -07:00
Shakeel Butt
d46eb14b73 fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8.

The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs.  All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.

The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated.  However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current processes's memcg.  This patch series
contains two such concrete use-cases i.e.  fsnotify and buffer_head.

The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener.  The events
are allocated in the context of the event producer.  However they should
be charged to the event consumer.  Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.

To solve this issue, this patch series introduces mechanism to charge
kernel memory to a given memcg.  In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged.  For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.

This patch (of 2):

A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener.  This can cause
system level memory pressure or OOMs.  So, it's better to account the
fsnotify kmem caches to the memcg of the listener.

However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer.  This patch introduces remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.

There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener.  So, SLAB_ACCOUNT is enough for these caches.

The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.

The allocations from the event caches happen in the context of the event
producer.  For such caches we will need to remote charge the allocations
to the listener's memcg.  Thus we save the memcg reference in the
fsnotify_group structure of the listener.

This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.

[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
  Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Linus Torvalds
73ba2fb33c Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
 "First pull request for this merge window, there will also be a
  followup request with some stragglers.

  This pull request contains:

   - Fix for a thundering heard issue in the wbt block code (Anchal
     Agarwal)

   - A few NVMe pull requests:
      * Improved tracepoints (Keith)
      * Larger inline data support for RDMA (Steve Wise)
      * RDMA setup/teardown fixes (Sagi)
      * Effects log suppor for NVMe target (Chaitanya Kulkarni)
      * Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
      * TP4004 (ANA) support (Christoph)
      * Various NVMe fixes

   - Block io-latency controller support. Much needed support for
     properly containing block devices. (Josef)

   - Series improving how we handle sense information on the stack
     (Kees)

   - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

   - Zoned device support for null_blk (Matias)

   - AIX partition fixes (Mauricio Faria de Oliveira)

   - DIF checksum code made generic (Max Gurtovoy)

   - Add support for discard in iostats (Michael Callahan / Tejun)

   - Set of updates for BFQ (Paolo)

   - Removal of async write support for bsg (Christoph)

   - Bio page dirtying and clone fixups (Christoph)

   - Set of bcache fix/changes (via Coly)

   - Series improving blk-mq queue setup/teardown speed (Ming)

   - Series improving merging performance on blk-mq (Ming)

   - Lots of other fixes and cleanups from a slew of folks"

* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
  blkcg: Make blkg_root_lookup() work for queues in bypass mode
  bcache: fix error setting writeback_rate through sysfs interface
  null_blk: add lock drop/acquire annotation
  Blk-throttle: reduce tail io latency when iops limit is enforced
  block: paride: pd: mark expected switch fall-throughs
  block: Ensure that a request queue is dissociated from the cgroup controller
  block: Introduce blk_exit_queue()
  blkcg: Introduce blkg_root_lookup()
  block: Remove two superfluous #include directives
  blk-mq: count the hctx as active before allocating tag
  block: bvec_nr_vecs() returns value for wrong slab
  bcache: trivial - remove tailing backslash in macro BTREE_FLAG
  bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
  bcache: set max writeback rate when I/O request is idle
  bcache: add code comments for bset.c
  bcache: fix mistaken comments in request.c
  bcache: fix mistaken code comments in bcache.h
  bcache: add a comment in super.c
  bcache: avoid unncessary cache prefetch bch_btree_node_get()
  bcache: display rate debug parameters to 0 when writeback is not running
  ...
2018-08-14 10:23:25 -07:00
Linus Torvalds
203b4fc903 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Thomas Gleixner:

 - Make lazy TLB mode even lazier to avoid pointless switch_mm()
   operations, which reduces CPU load by 1-2% for memcache workloads

 - Small cleanups and improvements all over the place

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Remove redundant check for kmem_cache_create()
  arm/asm/tlb.h: Fix build error implicit func declaration
  x86/mm/tlb: Make clear_asid_other() static
  x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off()
  x86/mm/tlb: Always use lazy TLB mode
  x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
  x86/mm/tlb: Make lazy TLB mode lazier
  x86/mm/tlb: Restructure switch_mm_irqs_off()
  x86/mm/tlb: Leave lazy TLB mode at page table free time
  mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids
  x86/mm: Add TLB purge to free pmd/pte page interfaces
  ioremap: Update pgtable free interfaces with addr
  x86/mm: Disable ioremap free page handling on x86-PAE
2018-08-13 16:29:35 -07:00
Eric W. Biederman
c3ad2c3b02 signal: Don't restart fork when signals come in.
Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
report that a periodic signal received during fork can cause fork to
continually restart preventing an application from making progress.

The code was being overly pessimistic.  Fork needs to guarantee that a
signal sent to multiple processes is logically delivered before the
fork and just to the forking process or logically delivered after the
fork to both the forking process and it's newly spawned child.  For
signals like periodic timers that are always delivered to a single
process fork can safely complete and let them appear to logically
delivered after the fork().

While examining this issue I also discovered that fork today will miss
signals delivered to multiple processes during the fork and handled by
another thread.  Similarly the current code will also miss blocked
signals that are delivered to multiple process, as those signals will
not appear pending during fork.

Add a list of each thread that is currently forking, and keep on that
list a signal set that records all of the signals sent to multiple
processes.  When fork completes initialize the new processes
shared_pending signal set with it.  The calculate_sigpending function
will see those signals and set TIF_SIGPENDING causing the new task to
take the slow path to userspace to handle those signals.  Making it
appear as if those signals were received immediately after the fork.

It is not possible to send real time signals to multiple processes and
exceptions don't go to multiple processes, which means that that are
no signals sent to multiple processes that require siginfo.  This
means it is safe to not bother collecting siginfo on signals sent
during fork.

The sigaction of a child of fork is initially the same as the
sigaction of the parent process.  So a signal the parent ignores the
child will also initially ignore.  Therefore it is safe to ignore
signals sent to multiple processes and ignored by the forking process.

Signals sent to only a single process or only a single thread and delivered
during fork are treated as if they are received after the fork, and generally
not dealt with.  They won't cause any problems.

V2: Added removal from the multiprocess list on failure.
V3: Use -ERESTARTNOINTR directly
V4: - Don't queue both SIGCONT and SIGSTOP
    - Initialize signal_struct.multiprocess in init_task
    - Move setting of shared_pending to before the new task
      is visible to signals.  This prevents signals from comming
      in before shared_pending.signal is set to delayed.signal
      and being lost.
V5: - rework list add and delete to account for idle threads
v6: - Use sigdelsetmask when removing stop signals

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
Reported-by: Wen Yang <wen.yang99@zte.com.cn> and
Reported-by: majiang <ma.jiang@zte.com.cn>
Fixes: 4a2c7a7837 ("[PATCH] make fork() atomic wrt pgrp/session signals")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-08-09 13:07:01 -05:00
Jens Axboe
05b9ba4b55 Merge tag 'v4.18-rc6' into for-4.19/block2
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.

Signed-of-by: Jens Axboe <axboe@kernel.dk>
2018-08-05 19:32:09 -06:00
Eric W. Biederman
924de3b8c9 fork: Have new threads join on-going signal group stops
There are only two signals that are delivered to every member of a
signal group: SIGSTOP and SIGKILL.  Signal delivery requires every
signal appear to be delivered either before or after a clone syscall.
SIGKILL terminates the clone so does not need to be considered.  Which
leaves only SIGSTOP that needs to be considered when creating new
threads.

Today in the event of a group stop TIF_SIGPENDING will get set and the
fork will restart ensuring the fork syscall participates in the group
stop.

A fork (especially of a process with a lot of memory) is one of the
most expensive system so we really only want to restart a fork when
necessary.

It is easy so check to see if a SIGSTOP is ongoing and have the new
thread join it immediate after the clone completes.  Making it appear
the clone completed happened just before the SIGSTOP.

The calculate_sigpending function will see the bits set in jobctl and
set TIF_SIGPENDING to ensure the new task takes the slow path to userspace.

V2: The call to task_join_group_stop was moved before the new task is
    added to the thread group list.  This should not matter as
    sighand->siglock is held over both the addition of the threads,
    the call to task_join_group_stop and do_signal_stop.  But the change
    is trivial and it is one less thing to worry about when reading
    the code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-08-03 20:20:14 -05:00
Josef Bacik
2c323017e3 blk-cgroup: clear the throttle queue on fork
We were hitting a panic in production where we put too many times on the
request queue.  This is because we'd get the throttle_queue of the
parent if we fork()'ed while we needed to be throttled, but we didn't
have a reference on it.  Instead just clear these flags on fork so the
child doesn't pay for the sins of its father.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-01 09:16:04 -06:00
Kirill A. Shutemov
027232da7c mm: introduce vma_init()
Not all VMAs allocated with vm_area_alloc().  Some of them allocated on
stack or in data segment.

The new helper can be use to initialize VMA properly regardless where it
was allocated.

Link: http://lkml.kernel.org/r/20180724121139.62570-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-26 19:38:03 -07:00
Eric W. Biederman
7673bf553b fork: Unconditionally exit if a fatal signal is pending
In practice this does not change anything as testing for fatal_signal_pending
and exiting for with an error code duplicates the work of the next clause
which recalculates pending signals and then exits fork if any are pending.
In both cases the pending signal will trigger the slow path when existing
to userspace, and the fatal signal will cause do_exit to be called.

The advantage of making this a separate test is that it makes it clear
processing the fatal signal will terminate the fork, and it allows the
rest of the signal logic to be updated without fear that this important
case will be lost.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-23 08:01:10 -05:00
Eric W. Biederman
4ca1d3ee46 fork: Move and describe why the code examines PIDNS_ADDING
Normally this would be something that would be handled by handling
signals that are sent to a group of processes but in this case the
forking process is not a member of the group being signaled.  Thus
special code is needed to prevent a race with pid namespaces exiting,
and fork adding new processes within them.

Move this test up before the signal restart just in case signals are
also pending.  Fatal conditions should take presedence over restarts.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-23 07:57:12 -05:00
Linus Torvalds
490fc05386 mm: make vm_area_alloc() initialize core fields
Like vm_area_dup(), it initializes the anon_vma_chain head, and the
basic mm pointer.

The rest of the fields end up being different for different users,
although the plan is to also initialize the 'vm_ops' field to a dummy
entry.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-21 15:24:03 -07:00
Linus Torvalds
95faf6992d mm: make vm_area_dup() actually copy the old vma data
.. and re-initialize th eanon_vma_chain head.

This removes some boiler-plate from the users, and also makes it clear
why it didn't need use the 'zalloc()' version.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-21 14:48:45 -07:00
Linus Torvalds
3928d4f5ee mm: use helper functions for allocating and freeing vm_area structs
The vm_area_struct is one of the most fundamental memory management
objects, but the management of it is entirely open-coded evertwhere,
ranging from allocation and freeing (using kmem_cache_[z]alloc and
kmem_cache_free) to initializing all the fields.

We want to unify this in order to end up having some unified
initialization of the vmas, and the first step to this is to at least
have basic allocation functions.

Right now those functions are literally just wrappers around the
kmem_cache_*() calls.  This is a purely mechanical conversion:

    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()

    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)

    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)

to the point where the old vma passed in to the vm_area_dup() function
isn't even used yet (because I've left all the old manual initialization
alone).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-21 13:48:51 -07:00
Eric W. Biederman
6883f81aac pid: Implement PIDTYPE_TGID
Everywhere except in the pid array we distinguish between a tasks pid and
a tasks tgid (thread group id).  Even in the enumeration we want that
distinction sometimes so we have added __PIDTYPE_TGID.  With leader_pid
we almost have an implementation of PIDTYPE_TGID in struct signal_struct.

Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
into the pids array.  Then remove the __PIDTYPE_TGID special case and the
leader_pid in signal_struct.

The net size increase is just an extra pointer added to struct pid and
an extra pair of pointers of an hlist_node added to task_struct.

The effect on code maintenance is the removal of a number of special
cases today and the potential to remove many more special cases as
PIDTYPE_TGID gets used to it's fullest.  The long term potential
is allowing zombie thread group leaders to exit, which will remove
a lot more special cases in the code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21 10:43:12 -05:00
Eric W. Biederman
2c4704756c pids: Move the pgrp and session pid pointers from task_struct to signal_struct
To access these fields the code always has to go to group leader so
going to signal struct is no loss and is actually a fundamental simplification.

This saves a little bit of memory by only allocating the pid pointer array
once instead of once for every thread, and even better this removes a
few potential races caused by the fact that group_leader can be changed
by de_thread, while signal_struct can not.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21 10:43:12 -05:00