Commit Graph

1409 Commits

Author SHA1 Message Date
Linus Torvalds
aee3bfa330 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from Davic Miller:

 1) Support busy polling generically, for all NAPI drivers.  From Eric
    Dumazet.

 2) Add byte/packet counter support to nft_ct, from Floriani Westphal.

 3) Add RSS/XPS support to mvneta driver, from Gregory Clement.

 4) Implement IPV6_HDRINCL socket option for raw sockets, from Hannes
    Frederic Sowa.

 5) Add support for T6 adapter to cxgb4 driver, from Hariprasad Shenai.

 6) Add support for VLAN device bridging to mlxsw switch driver, from
    Ido Schimmel.

 7) Add driver for Netronome NFP4000/NFP6000, from Jakub Kicinski.

 8) Provide hwmon interface to mlxsw switch driver, from Jiri Pirko.

 9) Reorganize wireless drivers into per-vendor directories just like we
    do for ethernet drivers.  From Kalle Valo.

10) Provide a way for administrators "destroy" connected sockets via the
    SOCK_DESTROY socket netlink diag operation.  From Lorenzo Colitti.

11) Add support to add/remove multicast routes via netlink, from Nikolay
    Aleksandrov.

12) Make TCP keepalive settings per-namespace, from Nikolay Borisov.

13) Add forwarding and packet duplication facilities to nf_tables, from
    Pablo Neira Ayuso.

14) Dead route support in MPLS, from Roopa Prabhu.

15) TSO support for thunderx chips, from Sunil Goutham.

16) Add driver for IBM's System i/p VNIC protocol, from Thomas Falcon.

17) Rationalize, consolidate, and more completely document the checksum
    offloading facilities in the networking stack.  From Tom Herbert.

18) Support aborting an ongoing scan in mac80211/cfg80211, from
    Vidyullatha Kanchanapally.

19) Use per-bucket spinlock for bpf hash facility, from Tom Leiming.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1375 commits)
  net: bnxt: always return values from _bnxt_get_max_rings
  net: bpf: reject invalid shifts
  phonet: properly unshare skbs in phonet_rcv()
  dwc_eth_qos: Fix dma address for multi-fragment skbs
  phy: remove an unneeded condition
  mdio: remove an unneed condition
  mdio_bus: NULL dereference on allocation error
  net: Fix typo in netdev_intersect_features
  net: freescale: mac-fec: Fix build error from phy_device API change
  net: freescale: ucc_geth: Fix build error from phy_device API change
  bonding: Prevent IPv6 link local address on enslaved devices
  IB/mlx5: Add flow steering support
  net/mlx5_core: Export flow steering API
  net/mlx5_core: Make ipv4/ipv6 location more clear
  net/mlx5_core: Enable flow steering support for the IB driver
  net/mlx5_core: Initialize namespaces only when supported by device
  net/mlx5_core: Set priority attributes
  net/mlx5_core: Connect flow tables
  net/mlx5_core: Introduce modify flow table command
  net/mlx5_core: Managing root flow table
  ...
2016-01-12 18:57:02 -08:00
Linus Torvalds
0f8c790103 Merge branch 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue update from Tejun Heo:
 "Workqueue changes for v4.5.  One cleanup patch and three to improve
  the debuggability.

  Workqueue now has a stall detector which dumps workqueue state if any
  worker pool hasn't made forward progress over a certain amount of time
  (30s by default) and also triggers a warning if a workqueue which can
  be used in memory reclaim path tries to wait on something which can't
  be.

  These should make workqueue hangs a lot easier to debug."

* 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: simplify the apply_workqueue_attrs_locked()
  workqueue: implement lockup detector
  watchdog: introduce touch_softlockup_watchdog_sched()
  workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue
2016-01-11 18:53:13 -08:00
willy tarreau
712f4aad40 unix: properly account for FDs passed over unix sockets
It is possible for a process to allocate and accumulate far more FDs than
the process' limit by sending them over a unix socket then closing them
to keep the process' fd count low.

This change addresses this problem by keeping track of the number of FDs
in flight per user and preventing non-privileged processes from having
more FDs in flight than their configured FD limit.

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 00:05:30 -05:00
Jiri Olsa
5a1078043f sched/core: Move sched_entity::avg into separate cache line
The sched_entity::avg collides with read-mostly sched_entity data.

The perf c2c tool showed many read HITM accesses across
many CPUs for sched_entity's cfs_rq and my_q, while having
at the same time tons of stores for avg.

After placing sched_entity::avg into separate cache line,
the perf bench sched pipe showed around 20 seconds speedup.

NOTE I cut out all perf events except for cycles and
instructions from following output.

Before:
  $ perf stat -r 5 perf bench sched pipe -l 10000000
  # Running 'sched/pipe' benchmark:
  # Executed 10000000 pipe operations between two processes

       Total time: 270.348 [sec]

        27.034805 usecs/op
            36989 ops/sec
   ...

     245,537,074,035      cycles                    #    1.433 GHz
     187,264,548,519      instructions              #    0.77  insns per cycle

       272.653840535 seconds time elapsed           ( +-  1.31% )

After:
  $ perf stat -r 5 perf bench sched pipe -l 10000000
  # Running 'sched/pipe' benchmark:
  # Executed 10000000 pipe operations between two processes

       Total time: 251.076 [sec]

        25.107678 usecs/op
            39828 ops/sec
  ...

     244,573,513,928      cycles                    #    1.572 GHz
     187,409,641,157      instructions              #    0.76  insns per cycle

       251.679315188 seconds time elapsed           ( +-  0.31% )

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Joe Mario <jmario@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1449606239-28602-1-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06 11:06:14 +01:00
Ingo Molnar
567bee2803 Merge branch 'sched/urgent' into sched/core, to pick up fixes before merging new patches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06 11:02:29 +01:00
Peter Zijlstra
be958bdc96 sched/core: Fix unserialized r-m-w scribbling stuff
Some of the sched bitfieds (notably sched_reset_on_fork) can be set
on other than current, this can cause the r-m-w to race with other
updates.

Since all the sched bits are serialized by scheduler locks, pull them
in a separate word.

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org
Cc: mhocko@kernel.org
Cc: vdavydov@parallels.com
Link: http://lkml.kernel.org/r/20151125150207.GM11639@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06 11:01:07 +01:00
Sergey Senozhatsky
570f52412a sched/core: Check tgid in is_global_init()
Our global init task can have sub-threads, so ->pid check is not reliable
enough for is_global_init(), we need to check tgid instead. This has been
spotted by Oleg and a fix was proposed by Richard a long time ago (see the
link below).

Oleg wrote:

  : Because is_global_init() is only true for the main thread of /sbin/init.
  :
  : Just look at oom_unkillable_task(). It tries to not kill init. But, say,
  : select_bad_process() can happily find a sub-thread of is_global_init()
  : and still kill it.

I recently hit the problem in question; re-sending the patch (to the
best of my knowledge it has never been submitted) with updated function
comment. Credit goes to Oleg and Richard.

Suggested-by: Richard Guy Briggs <rgb@redhat.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W . Biederman <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge E . Hallyn <serge.hallyn@ubuntu.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://www.redhat.com/archives/linux-audit/2013-December/msg00086.html
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06 11:01:06 +01:00
Tejun Heo
03e0d4610b watchdog: introduce touch_softlockup_watchdog_sched()
touch_softlockup_watchdog() is used to tell watchdog that scheduler
stall is expected.  One group of usage is from paths where the task
may not be able to yield for a long time such as performing slow PIO
to finicky device and coming out of suspend.  The other is to account
for scheduler and timer going idle.

For scheduler softlockup detection, there's no reason to distinguish
the two cases; however, workqueue lockup detector is planned and it
can use the same signals from the former group while the latter would
spuriously prevent detection.  This patch introduces a new function
touch_softlockup_watchdog_sched() and convert the latter group to call
it instead.  For now, it just calls touch_softlockup_watchdog() and
there's no functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
2015-12-08 11:29:42 -05:00
Frederic Weisbecker
b7ce2277f0 sched/cputime: Convert vtime_seqlock to seqcount
The cputime can only be updated by the current task itself, even in
vtime case. So we can safely use seqcount instead of seqlock as there
is no writer concurrency involved.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:46 +01:00
Frederic Weisbecker
7098c1eac7 sched/cputime: Clarify vtime symbols and document them
VTIME_SLEEPING state happens either when:

1) The task is sleeping and no tickless delta is to be added on the task
   cputime stats.
2) The CPU isn't running vtime at all, so the same properties of 1) applies.

Lets rename the vtime symbol to reflect both states.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:44 +01:00
Byungchul Park
525705d15e sched/fair: Consider missed ticks in NOHZ_FULL in update_cpu_load_nohz()
Usually the tick can be stopped for an idle CPU in NOHZ. However in NOHZ_FULL
mode, a non-idle CPU's tick can also be stopped. However, update_cpu_load_nohz()
does not consider the case a non-idle CPU's tick has been stopped at all.

This patch makes the update_cpu_load_nohz() know if the calling path comes
from NOHZ_FULL or idle NOHZ.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447115762-19734-3-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:53 +01:00
Linus Torvalds
264015f8a8 Merge tag 'libnvdimm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
 "Outside of the new ACPI-NFIT hot-add support this pull request is more
  notable for what it does not contain, than what it does.  There were a
  handful of development topics this cycle, dax get_user_pages, dax
  fsync, and raw block dax, that need more more iteration and will wait
  for 4.5.

  The patches to make devm and the pmem driver NUMA aware have been in
  -next for several weeks.  The hot-add support has not, but is
  contained to the NFIT driver and is passing unit tests.  The coredump
  support is straightforward and was looked over by Jeff.  All of it has
  received a 0day build success notification across 107 configs.

  Summary:

   - Add support for the ACPI 6.0 NFIT hot add mechanism to process
     updates of the NFIT at runtime.

   - Teach the coredump implementation how to filter out DAX mappings.

   - Introduce NUMA hints for allocations made by the pmem driver, and
     as a side effect all devm allocations now hint their NUMA node by
     default"

* tag 'libnvdimm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  coredump: add DAX filtering for FDPIC ELF coredumps
  coredump: add DAX filtering for ELF coredumps
  acpi: nfit: Add support for hot-add
  nfit: in acpi_nfit_init, break on a 0-length table
  pmem, memremap: convert to numa aware allocations
  devm_memremap_pages: use numa_mem_id
  devm: make allocations numa aware by default
  devm_memremap: convert to return ERR_PTR
  devm_memunmap: use devres_release()
  pmem: kill memremap_pmem()
  x86, mm: quiet arch_add_memory()
2015-11-10 12:07:22 -08:00
Ross Zwisler
5037835c1f coredump: add DAX filtering for ELF coredumps
Add two new flags to the existing coredump mechanism for ELF files to
allow us to explicitly filter DAX mappings.  This is desirable because
DAX mappings, like hugetlb mappings, have the potential to be very
large.

Update the coredump_filter documentation in
Documentation/filesystems/proc.txt so that it addresses the new DAX
coredump flags.  Also update the documented default value of
coredump_filter to be consistent with the core(5) man page.  The
documentation being updated talks about bit 4, Dump ELF headers, which
is enabled if CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is turned on in the
kernel config.  This kernel config option defaults to "y" if both ELF
binaries and coredump are enabled.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-11-09 13:29:54 -05:00
Oleg Nesterov
9a13049e83 signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
jffs2_garbage_collect_thread() can race with SIGCONT and sleep in
TASK_STOPPED state after it was already sent. Add the new helper,
kernel_signal_stop(), which does this correctly.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Oleg Nesterov
be0e6f290f signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
1. Rename dequeue_signal_lock() to kernel_dequeue_signal(). This
   matches another "for kthreads only" kernel_sigaction() helper.

2. Remove the "tsk" and "mask" arguments, they are always current
   and current->blocked. And it is simply wrong if tsk != current.

3. We could also remove the 3rd "siginfo_t *info" arg but it looks
   potentially useful. However we can simplify the callers if we
   change kernel_dequeue_signal() to accept info => NULL.

4. Remove _irqsave, it is never called from atomic context.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Oleg Nesterov
2e01fabe67 signals: kill block_all_signals() and unblock_all_signals()
It is hardly possible to enumerate all problems with block_all_signals()
and unblock_all_signals().  Just for example,

1. block_all_signals(SIGSTOP/etc) simply can't help if the caller is
   multithreaded. Another thread can dequeue the signal and force the
   group stop.

2. Even is the caller is single-threaded, it will "stop" anyway. It
   will not sleep, but it will spin in kernel space until SIGCONT or
   SIGKILL.

And a lot more. In short, this interface doesn't work at all, at least
the last 10+ years.

Daniel said:

  Yeah the only times I played around with the DRM_LOCK stuff was when
  old drivers accidentally deadlocked - my impression is that the entire
  DRM_LOCK thing was never really tested properly ;-) Hence I'm all for
  purging where this leaks out of the drm subsystem.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Dave Airlie <airlied@redhat.com>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Linus Torvalds
2e3078af2c Merge branch 'akpm' (patches from Andrew)
Merge patch-bomb from Andrew Morton:

 - inotify tweaks

 - some ocfs2 updates (many more are awaiting review)

 - various misc bits

 - kernel/watchdog.c updates

 - Some of mm.  I have a huge number of MM patches this time and quite a
   lot of it is quite difficult and much will be held over to next time.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
  selftests: vm: add tests for lock on fault
  mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
  mm: introduce VM_LOCKONFAULT
  mm: mlock: add new mlock system call
  mm: mlock: refactor mlock, munlock, and munlockall code
  kasan: always taint kernel on report
  mm, slub, kasan: enable user tracking by default with KASAN=y
  kasan: use IS_ALIGNED in memory_is_poisoned_8()
  kasan: Fix a type conversion error
  lib: test_kasan: add some testcases
  kasan: update reference to kasan prototype repo
  kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
  kasan: various fixes in documentation
  kasan: update log messages
  kasan: accurately determine the type of the bad access
  kasan: update reported bug types for kernel memory accesses
  kasan: update reported bug types for not user nor kernel memory accesses
  mm/kasan: prevent deadlock in kasan reporting
  mm/kasan: don't use kasan shadow pointer in generic functions
  mm/kasan: MODULE_VADDR is not available on all archs
  ...
2015-11-05 23:10:54 -08:00
Tejun Heo
b23afb93d3 memcg: punt high overage reclaim to return-to-userland path
Currently, try_charge() tries to reclaim memory synchronously when the
high limit is breached; however, if the allocation doesn't have
__GFP_WAIT, synchronous reclaim is skipped.  If a process performs only
speculative allocations, it can blow way past the high limit.  This is
actually easily reproducible by simply doing "find /".  slab/slub
allocator tries speculative allocations first, so as long as there's
memory which can be consumed without blocking, it can keep allocating
memory regardless of the high limit.

This patch makes try_charge() always punt the over-high reclaim to the
return-to-userland path.  If try_charge() detects that high limit is
breached, it adds the overage to current->memcg_nr_pages_over_high and
schedules execution of mem_cgroup_handle_over_high() which performs
synchronous reclaim from the return-to-userland path.

As long as kernel doesn't have a run-away allocation spree, this should
provide enough protection while making kmemcg behave more consistently.
It also has the following benefits.

- All over-high reclaims can use GFP_KERNEL regardless of the specific
  gfp mask in use, e.g. GFP_NOFS, when the limit was breached.

- It copes with prio inversion.  Previously, a low-prio task with
  small memory.high might perform over-high reclaim with a bunch of
  locks held.  If a higher prio task needed any of these locks, it
  would have to wait until the low prio task finished reclaim and
  released the locks.  By handing over-high reclaim to the task exit
  path this issue can be avoided.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Tejun Heo
626ebc4100 memcg: flatten task_struct->memcg_oom
task_struct->memcg_oom is a sub-struct containing fields which are used
for async memcg oom handling.  Most task_struct fields aren't packaged
this way and it can lead to unnecessary alignment paddings.  This patch
flattens it.

* task.memcg_oom.memcg          -> task.memcg_in_oom
* task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
* task.memcg_oom.order          -> task.memcg_oom_order
* task.memcg_oom.may_oom        -> task.memcg_may_oom

In addition, task.memcg_may_oom is relocated to where other bitfields are
which reduces the size of task_struct.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Don Zickus
ac1f591249 kernel/watchdog.c: add sysctl knob hardlockup_panic
The only way to enable a hardlockup to panic the machine is to set
'nmi_watchdog=panic' on the kernel command line.

This makes it awkward for end users and folks who want to run automate
tests (like myself).

Mimic the softlockup_panic knob and create a /proc/sys/kernel/hardlockup_panic
knob.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Linus Torvalds
69234acee5 Merge branch 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "The cgroup core saw several significant updates this cycle:

   - percpu_rwsem for threadgroup locking is reinstated.  This was
     temporarily dropped due to down_write latency issues.  Oleg's
     rework of percpu_rwsem which is scheduled to be merged in this
     merge window resolves the issue.

   - On the v2 hierarchy, when controllers are enabled and disabled, all
     operations are atomic and can fail and revert cleanly.  This allows
     ->can_attach() failure which is necessary for cpu RT slices.

   - Tasks now stay associated with the original cgroups after exit
     until released.  This allows tracking resources held by zombies
     (e.g.  pids) and makes it easy to find out where zombies came from
     on the v2 hierarchy.  The pids controller was broken before these
     changes as zombies escaped the limits; unfortunately, updating this
     behavior required too many invasive changes and I don't think it's
     a good idea to backport them, so the pids controller on 4.3, the
     first version which included the pids controller, will stay broken
     at least until I'm sure about the cgroup core changes.

   - Optimization of a couple common tests using static_key"

* 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
  cgroup: fix race condition around termination check in css_task_iter_next()
  blkcg: don't create "io.stat" on the root cgroup
  cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
  cgroup: replace error handling in cgroup_init() with WARN_ON()s
  cgroup: add cgroup_subsys->free() method and use it to fix pids controller
  cgroup: keep zombies associated with their original cgroups
  cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
  cgroup: don't hold css_set_rwsem across css task iteration
  cgroup: reorganize css_task_iter functions
  cgroup: factor out css_set_move_task()
  cgroup: keep css_set and task lists in chronological order
  cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
  cgroup: make css_sets pin the associated cgroups
  cgroup: relocate cgroup_[try]get/put()
  cgroup: move check_for_release() invocation
  cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
  cgroup: make cgroup->nr_populated count the number of populated css_sets
  cgroup: remove an unused parameter from cgroup_task_migrate()
  cgroup: fix too early usage of static_branch_disable()
  cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
  ...
2015-11-05 14:51:32 -08:00
Linus Torvalds
b0f85fa11a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

Changes of note:

 1) Allow to schedule ICMP packets in IPVS, from Alex Gartrell.

 2) Provide FIB table ID in ipv4 route dumps just as ipv6 does, from
    David Ahern.

 3) Allow the user to ask for the statistics to be filtered out of
    ipv4/ipv6 address netlink dumps.  From Sowmini Varadhan.

 4) More work to pass the network namespace context around deep into
    various packet path APIs, starting with the netfilter hooks.  From
    Eric W Biederman.

 5) Add layer 2 TX/RX checksum offloading to qeth driver, from Thomas
    Richter.

 6) Use usec resolution for SYN/ACK RTTs in TCP, from Yuchung Cheng.

 7) Support Very High Throughput in wireless MESH code, from Bob
    Copeland.

 8) Allow setting the ageing_time in switchdev/rocker.  From Scott
    Feldman.

 9) Properly autoload L2TP type modules, from Stephen Hemminger.

10) Fix and enable offload features by default in 8139cp driver, from
    David Woodhouse.

11) Support both ipv4 and ipv6 sockets in a single vxlan device, from
    Jiri Benc.

12) Fix CWND limiting of thin streams in TCP, from Bendik Rønning
    Opstad.

13) Fix IPSEC flowcache overflows on large systems, from Steffen
    Klassert.

14) Convert bridging to track VLANs using rhashtable entries rather than
    a bitmap.  From Nikolay Aleksandrov.

15) Make TCP listener handling completely lockless, this is a major
    accomplishment.  Incoming request sockets now live in the
    established hash table just like any other socket too.

    From Eric Dumazet.

15) Provide more bridging attributes to netlink, from Nikolay
    Aleksandrov.

16) Use hash based algorithm for ipv4 multipath routing, this was very
    long overdue.  From Peter Nørlund.

17) Several y2038 cures, mostly avoiding timespec.  From Arnd Bergmann.

18) Allow non-root execution of EBPF programs, from Alexei Starovoitov.

19) Support SO_INCOMING_CPU as setsockopt, from Eric Dumazet.  This
    influences the port binding selection logic used by SO_REUSEPORT.

20) Add ipv6 support to VRF, from David Ahern.

21) Add support for Mellanox Spectrum switch ASIC, from Jiri Pirko.

22) Add rtl8xxxu Realtek wireless driver, from Jes Sorensen.

23) Implement RACK loss recovery in TCP, from Yuchung Cheng.

24) Support multipath routes in MPLS, from Roopa Prabhu.

25) Fix POLLOUT notification for listening sockets in AF_UNIX, from Eric
    Dumazet.

26) Add new QED Qlogic river, from Yuval Mintz, Manish Chopra, and
    Sudarsana Kalluru.

27) Don't fetch timestamps on AF_UNIX sockets, from Hannes Frederic
    Sowa.

28) Support ipv6 geneve tunnels, from John W Linville.

29) Add flood control support to switchdev layer, from Ido Schimmel.

30) Fix CHECKSUM_PARTIAL handling of potentially fragmented frames, from
    Hannes Frederic Sowa.

31) Support persistent maps and progs in bpf, from Daniel Borkmann.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1790 commits)
  sh_eth: use DMA barriers
  switchdev: respect SKIP_EOPNOTSUPP flag in case there is no recursion
  net: sched: kill dead code in sch_choke.c
  irda: Delete an unnecessary check before the function call "irlmp_unregister_service"
  net: dsa: mv88e6xxx: include DSA ports in VLANs
  net: dsa: mv88e6xxx: disable SA learning for DSA and CPU ports
  net/core: fix for_each_netdev_feature
  vlan: Invoke driver vlan hooks only if device is present
  arcnet/com20020: add LEDS_CLASS dependency
  bpf, verifier: annotate verbose printer with __printf
  dp83640: Only wait for timestamps for packets with timestamping enabled.
  ptp: Change ptp_class to a proper bitmask
  dp83640: Prune rx timestamp list before reading from it
  dp83640: Delay scheduled work.
  dp83640: Include hash in timestamp/packet matching
  ipv6: fix tunnel error handling
  net/mlx5e: Fix LSO vlan insertion
  net/mlx5e: Re-eanble client vlan TX acceleration
  net/mlx5e: Return error in case mlx5e_set_features() fails
  net/mlx5e: Don't allow more than max supported channels
  ...
2015-11-04 09:41:05 -08:00
Linus Torvalds
53528695ff Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The main changes in this cycle were:

   - sched/fair load tracking fixes and cleanups (Byungchul Park)

   - Make load tracking frequency scale invariant (Dietmar Eggemann)

   - sched/deadline updates (Juri Lelli)

   - stop machine fixes, cleanups and enhancements for bugs triggered by
     CPU hotplug stress testing (Oleg Nesterov)

   - scheduler preemption code rework: remove PREEMPT_ACTIVE and related
     cleanups (Peter Zijlstra)

   - Rework the sched_info::run_delay code to fix races (Peter Zijlstra)

   - Optimize per entity utilization tracking (Peter Zijlstra)

   - ... misc other fixes, cleanups and smaller updates"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
  sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
  sched: Start stopper early
  stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
  stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
  stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
  stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
  stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
  sched/x86: Fix typo in __switch_to() comments
  sched/core: Remove a parameter in the migrate_task_rq() function
  sched/core: Drop unlikely behind BUG_ON()
  sched/core: Fix task and run queue sched_info::run_delay inconsistencies
  sched/numa: Fix task_tick_fair() from disabling numa_balancing
  sched/core: Add preempt_count invariant check
  sched/core: More notrace annotations
  sched/core: Kill PREEMPT_ACTIVE
  sched/core, sched/x86: Kill thread_info::saved_preempt_count
  sched/core: Simplify preempt_count tests
  sched/core: Robustify preemption leak checks
  sched/core: Stop setting PREEMPT_ACTIVE
  ...
2015-11-03 18:03:50 -08:00
Linus Torvalds
2814228699 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU changes from Ingo Molnar:
 "The main changes in this cycle were:

   - Improvements to expedited grace periods (Paul E McKenney)

   - Performance improvements to and locktorture tests for percpu-rwsem
     (Oleg Nesterov, Paul E McKenney)

   - Torture-test changes (Paul E McKenney, Davidlohr Bueso)

   - Documentation updates (Paul E McKenney)

   - Miscellaneous fixes (Paul E McKenney, Boqun Feng, Oleg Nesterov,
     Patrick Marlier)"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs()
  rcu: Better hotplug handling for synchronize_sched_expedited()
  rcu: Enable stall warnings for synchronize_rcu_expedited()
  rcu: Add tasks to expedited stall-warning messages
  rcu: Add online/offline info to expedited stall warning message
  rcu: Consolidate expedited CPU selection
  rcu: Prepare for consolidating expedited CPU selection
  cpu: Remove try_get_online_cpus()
  rcu: Stop excluding CPU hotplug in synchronize_sched_expedited()
  rcu: Stop silencing lockdep false positive for expedited grace periods
  rcu: Switch synchronize_sched_expedited() to IPI
  locktorture: Fix module unwind when bad torture_type specified
  torture: Forgive non-plural arguments
  rcutorture: Fix unused-function warning for torturing_tasks()
  rcutorture: Fix module unwind when bad torture_type specified
  rcu_sync: Cleanup the CONFIG_PROVE_RCU checks
  locking/percpu-rwsem: Clean up the lockdep annotations in percpu_down_read()
  locking/percpu-rwsem: Fix the comments outdated by rcu_sync
  locking/percpu-rwsem: Make use of the rcu_sync infrastructure
  locking/percpu-rwsem: Make percpu_free_rwsem() after kzalloc() safe
  ...
2015-11-03 15:40:38 -08:00
Ingo Molnar
c13dc31adb Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

  - Miscellaneous fixes. (Paul E. McKenney, Boqun Feng, Oleg Nesterov, Patrick Marlier)

  - Improvements to expedited grace periods. (Paul E. McKenney)

  - Performance improvements to and locktorture tests for percpu-rwsem.
    (Oleg Nesterov, Paul E. McKenney)

  - Torture-test changes. (Paul E. McKenney, Davidlohr Bueso)

  - Documentation updates. (Paul E. McKenney)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-19 10:09:54 +02:00