Commit Graph

2001 Commits

Author SHA1 Message Date
Mathieu Desnoyers a961e40917 membarrier: Provide register expedited private command
This introduces a "register private expedited" membarrier command which
allows eventual removal of important memory barrier constraints on the
scheduler fast-paths. It changes how the "private expedited" membarrier
command (new to 4.14) is used from user-space.

This new command allows processes to register their intent to use the
private expedited command.  This affects how the expedited private
command introduced in 4.14-rc is meant to be used, and should be merged
before 4.14 final.

Processes are now required to register before using
MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

This fixes a problem that arose when designing requested extensions to
sys_membarrier() to allow JITs to efficiently flush old code from
instruction caches.  Several potential algorithms are much less painful
if the user register intent to use this functionality early on, for
example, before the process spawns the second thread.  Registering at
this time removes the need to interrupt each and every thread in that
process at the first expedited sys_membarrier() system call.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-19 22:13:40 -04:00
Peter Zijlstra 024c9d2fae sched/core: Ensure load_balance() respects the active_mask
While load_balance() masks the source CPUs against active_mask, it had
a hole against the destination CPU. Ensure the destination CPU is also
part of the 'domain-mask & active-mask' set.

Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 77d1dfda0e ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 10:14:03 +02:00
Peter Zijlstra f2cdd9cc6c sched/core: Address more wake_affine() regressions
The trivial wake_affine_idle() implementation is very good for a
number of workloads, but it comes apart at the moment there are no
idle CPUs left, IOW. the overloaded case.

hackbench:

		NO_WA_WEIGHT		WA_WEIGHT

hackbench-20  : 7.362717561 seconds	6.450509391 seconds

(win)

netperf:

		  NO_WA_WEIGHT		WA_WEIGHT

TCP_SENDFILE-1	: Avg: 54524.6		Avg: 52224.3
TCP_SENDFILE-10	: Avg: 48185.2          Avg: 46504.3
TCP_SENDFILE-20	: Avg: 29031.2          Avg: 28610.3
TCP_SENDFILE-40	: Avg: 9819.72          Avg: 9253.12
TCP_SENDFILE-80	: Avg: 5355.3           Avg: 4687.4

TCP_STREAM-1	: Avg: 41448.3          Avg: 42254
TCP_STREAM-10	: Avg: 24123.2          Avg: 25847.9
TCP_STREAM-20	: Avg: 15834.5          Avg: 18374.4
TCP_STREAM-40	: Avg: 5583.91          Avg: 5599.57
TCP_STREAM-80	: Avg: 2329.66          Avg: 2726.41

TCP_RR-1	: Avg: 80473.5          Avg: 82638.8
TCP_RR-10	: Avg: 72660.5          Avg: 73265.1
TCP_RR-20	: Avg: 52607.1          Avg: 52634.5
TCP_RR-40	: Avg: 57199.2          Avg: 56302.3
TCP_RR-80	: Avg: 25330.3          Avg: 26867.9

UDP_RR-1	: Avg: 108266           Avg: 107844
UDP_RR-10	: Avg: 95480            Avg: 95245.2
UDP_RR-20	: Avg: 68770.8          Avg: 68673.7
UDP_RR-40	: Avg: 76231            Avg: 75419.1
UDP_RR-80	: Avg: 34578.3          Avg: 35639.1

UDP_STREAM-1	: Avg: 64684.3          Avg: 66606
UDP_STREAM-10	: Avg: 52701.2          Avg: 52959.5
UDP_STREAM-20	: Avg: 30376.4          Avg: 29704
UDP_STREAM-40	: Avg: 15685.8          Avg: 15266.5
UDP_STREAM-80	: Avg: 8415.13          Avg: 7388.97

(wins and losses)

sysbench:

		    NO_WA_WEIGHT		WA_WEIGHT

sysbench-mysql-2  :  2135.17 per sec.		 2142.51 per sec.
sysbench-mysql-5  :  4809.68 per sec.            4800.19 per sec.
sysbench-mysql-10 :  9158.59 per sec.            9157.05 per sec.
sysbench-mysql-20 : 14570.70 per sec.           14543.55 per sec.
sysbench-mysql-40 : 22130.56 per sec.           22184.82 per sec.
sysbench-mysql-80 : 20995.56 per sec.           21904.18 per sec.

sysbench-psql-2   :  1679.58 per sec.            1705.06 per sec.
sysbench-psql-5   :  3797.69 per sec.            3879.93 per sec.
sysbench-psql-10  :  7253.22 per sec.            7258.06 per sec.
sysbench-psql-20  : 11166.75 per sec.           11220.00 per sec.
sysbench-psql-40  : 17277.28 per sec.           17359.78 per sec.
sysbench-psql-80  : 17112.44 per sec.           17221.16 per sec.

(increase on the top end)

tbench:

NO_WA_WEIGHT

Throughput 685.211 MB/sec   2 clients   2 procs  max_latency=0.123 ms
Throughput 1596.64 MB/sec   5 clients   5 procs  max_latency=0.119 ms
Throughput 2985.47 MB/sec  10 clients  10 procs  max_latency=0.262 ms
Throughput 4521.15 MB/sec  20 clients  20 procs  max_latency=0.506 ms
Throughput 9438.1  MB/sec  40 clients  40 procs  max_latency=2.052 ms
Throughput 8210.5  MB/sec  80 clients  80 procs  max_latency=8.310 ms

WA_WEIGHT

Throughput 697.292 MB/sec   2 clients   2 procs  max_latency=0.127 ms
Throughput 1596.48 MB/sec   5 clients   5 procs  max_latency=0.080 ms
Throughput 2975.22 MB/sec  10 clients  10 procs  max_latency=0.254 ms
Throughput 4575.14 MB/sec  20 clients  20 procs  max_latency=0.502 ms
Throughput 9468.65 MB/sec  40 clients  40 procs  max_latency=2.069 ms
Throughput 8631.73 MB/sec  80 clients  80 procs  max_latency=8.605 ms

(increase on the top end)

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 10:14:03 +02:00
Peter Zijlstra d153b15344 sched/core: Fix wake_affine() performance regression
Eric reported a sysbench regression against commit:

  3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")

Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
against his v3.10 enterprise kernel.

PRE (current tip/master):

 ivb-ep sysbench:

   2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
   5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
  10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
  20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
  40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
  80: [30 secs]     transactions:                        355096 (11834.28 per sec.)

 hsw-ex NAS:

 OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
 OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
 OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
 lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
 lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
 lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83

POST (+patch):

 ivb-ep sysbench:

   2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
   5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
  10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
  20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
  40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
  80: [30 secs]     transactions:                        631739 (21055.88 per sec.)

 hsw-ex NAS:

 lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
 lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
 lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52

This patch takes out all the shiny wake_affine() stuff and goes back to
utter basics. Between the two CPUs involved with the wakeup (the CPU
doing the wakeup and the CPU we ran on previously) pick the CPU we can
run on _now_.

This restores much of the regressions against the older kernels,
but leaves some ground in the overloaded case. The default-enabled
WA_WEIGHT (which will be introduced in the next patch) is an attempt
to address the overloaded situation.

Reported-by: Eric Farman <farman@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jinpuwang@gmail.com
Cc: vcaputo@pengaru.com
Fixes: 3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 10:14:02 +02:00
Peter Zijlstra 5d68cc95fb sched/debug: Ignore TASK_IDLE for SysRq-W
Markus reported that tasks in TASK_IDLE state are reported by SysRq-W,
which results in undesirable clutter.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 11:02:57 +02:00
Peter Zijlstra 65d5dc47fe sched/debug: Remove unused variable
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 10:09:09 +02:00
Tim Chen 11a19c7b09 sched/wait: Introduce wakeup boomark in wake_up_page_bit
Now that we have added breaks in the wait queue scan and allow bookmark
on scan position, we put this logic in the wake_up_page_bit function.

We can have very long page wait list in large system where multiple
pages share the same wait list. We break the wake up walk here to allow
other cpus a chance to access the list, and not to disable the interrupts
when traversing the list for too long.  This reduces the interrupt and
rescheduling latency, and excessive page wait queue lock hold time.

[ v2: Remove bookmark_wake_function ]

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-14 09:56:18 -07:00
Tim Chen 2554db9165 sched/wait: Break up long wake list walk
We encountered workloads that have very long wake up list on large
systems. A waker takes a long time to traverse the entire wake list and
execute all the wake functions.

We saw page wait list that are up to 3700+ entries long in tests of
large 4 and 8 socket systems. It took 0.8 sec to traverse such list
during wake up. Any other CPU that contends for the list spin lock will
spin for a long time. It is a result of the numa balancing migration of
hot pages that are shared by many threads.

Multiple CPUs waking are queued up behind the lock, and the last one
queued has to wait until all CPUs did all the wakeups.

The page wait list is traversed with interrupt disabled, which caused
various problems. This was the original cause that triggered the NMI
watch dog timer in: https://patchwork.kernel.org/patch/9800303/ . Only
extending the NMI watch dog timer there helped.

This patch bookmarks the waker's scan position in wake list and break
the wake up walk, to allow access to the list before the waker resume
its walk down the rest of the wait list. It lowers the interrupt and
rescheduling latency.

This patch also provides a performance boost when combined with the next
patch to break up page wakeup list walk. We saw 22% improvement in the
will-it-scale file pread2 test on a Xeon Phi system running 256 threads.

[ v2: Merged in Linus' changes to remove the bookmark_wake_function, and
  simply access to flags. ]

Reported-by: Kan Liang <kan.liang@intel.com>
Tested-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-14 09:56:17 -07:00
Linus Torvalds ec846ecd63 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Three CPU hotplug related fixes and a debugging improvement"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Add debugfs knob for "sched_debug"
  sched/core: WARN() when migrating to an offline CPU
  sched/fair: Plug hole between hotplug and active_load_balance()
  sched/fair: Avoid newidle balance for !active CPUs
2017-09-13 12:22:32 -07:00
Linus Torvalds 040b9d7ccf Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Three fixes:

   - fix a suspend/resume cpusets bug

   - fix a !CONFIG_NUMA_BALANCING bug

   - fix a kerneldoc warning"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix nuisance kernel-doc warning
  sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs
  sched/fair: Fix wake_affine_llc() balancing rules
2017-09-12 11:30:56 -07:00
Peter Zijlstra 9469eb01db sched/debug: Add debugfs knob for "sched_debug"
I'm forever late for editing my kernel cmdline, add a runtime knob to
disable the "sched_debug" thing.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170907150614.142924283@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-12 17:41:04 +02:00
Peter Zijlstra 4ff9083b8a sched/core: WARN() when migrating to an offline CPU
Migrating tasks to offline CPUs is a pretty big fail, warn about it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170907150614.094206976@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-12 17:41:04 +02:00
Peter Zijlstra edd8e41d2e sched/fair: Plug hole between hotplug and active_load_balance()
The load balancer applies cpu_active_mask to whatever sched_domains it
finds, however in the case of active_balance there is a hole between
setting rq->{active_balance,push_cpu} and running the stop_machine
work doing the actual migration.

The @push_cpu can go offline in this window, which would result in us
moving a task onto a dead cpu, which is a fairly bad thing.

Double check the active mask before the stop work does the migration.

  CPU0					CPU1

  <SoftIRQ>
					stop_machine(takedown_cpu)
    load_balance()			cpu_stopper_thread()
      ...				  work = multi_cpu_stop
      stop_one_cpu_nowait(		    /* wait for CPU0 */
	.func = active_load_balance_cpu_stop
      );
  </SoftIRQ>

  cpu_stopper_thread()
    work = multi_cpu_stop
      /* sync with CPU1 */
					    take_cpu_down()
					<idle>
					  play_dead();

    work = active_load_balance_cpu_stop
      set_task_cpu(p, CPU1); /* oops!! */

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170907150614.044460912@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-12 17:41:04 +02:00
Peter Zijlstra 2800486ee3 sched/fair: Avoid newidle balance for !active CPUs
On CPU hot unplug, when parking the last kthread we'll try and
schedule into idle to kill the CPU. This last schedule can (and does)
trigger newidle balance because at this point the sched domains are
still up because of commit:

  77d1dfda0e ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")

Obviously pulling tasks to an already offline CPU is a bad idea, and
all balancing operations _should_ be subject to cpu_active_mask, make
it so.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: 77d1dfda0e ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
Link: http://lkml.kernel.org/r/20170907150613.994135806@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-12 17:41:03 +02:00
Randy Dunlap 46123355af sched/fair: Fix nuisance kernel-doc warning
Work around kernel-doc warning ('*' in Sphinx doc means "emphasis"):

  ../kernel/sched/fair.c:7584: WARNING: Inline emphasis start-string without end-string.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f18b30f9-6251-6d86-9d44-16501e386891@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-11 08:13:22 +02:00
Davidlohr Bueso 2161573ecd sched/deadline: replace earliest dl and rq leftmost caching
... with the generic rbtree flavor instead. No changes
in semantics whatsoever.

Link: http://lkml.kernel.org/r/20170719014603.19029-9-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:49 -07:00
Davidlohr Bueso bfb068892d sched/fair: replace cfs_rq->rb_leftmost
... with the generic rbtree flavor instead. No changes
in semantics whatsoever.

Link: http://lkml.kernel.org/r/20170719014603.19029-8-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:48 -07:00
Alexey Dobriyan 9b130ad5bb treewide: make "nr_cpu_ids" unsigned
First, number of CPUs can't be negative number.

Second, different signnnedness leads to suboptimal code in the following
cases:

1)
	kmalloc(nr_cpu_ids * sizeof(X));

"int" has to be sign extended to size_t.

2)
	while (loff_t *pos < nr_cpu_ids)

MOVSXD is 1 byte longed than the same MOV.

Other cases exist as well. Basically compiler is told that nr_cpu_ids
can't be negative which can't be deduced if it is "int".

Code savings on allyesconfig kernel: -3KB

	add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370)
	function                                     old     new   delta
	coretemp_cpu_online                          450     512     +62
	rcu_init_one                                1234    1272     +38
	pci_device_probe                             374     399     +25

				...

	pgdat_reclaimable_pages                      628     556     -72
	select_fallback_rq                           446     369     -77
	task_numa_find_cpu                          1923    1807    -116

Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:48 -07:00
Peter Zijlstra 50e7663233 sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs
Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
because it now resulted in non-cpuset usage breaking too.

On suspend cpuset_cpu_inactive() doesn't call into
cpuset_update_active_cpus() because it doesn't want to move tasks about,
there is no need, all tasks are frozen and won't run again until after
we've resumed everything.

But this means that when we finally do call into
cpuset_update_active_cpus() after resuming the last frozen cpu in
cpuset_cpu_active(), the top_cpuset will not have any difference with
the cpu_active_mask and this it will not in fact do _anything_.

So the cpuset configuration will not be restored. This was largely
hidden because we would unconditionally create identity domains and
mobile users would not in fact use cpusets much. And servers what do use
cpusets tend to not suspend-resume much.

An addition problem is that we'd not in fact wait for the cpuset work to
finish before resuming the tasks, allowing spurious migrations outside
of the specified domains.

Fix the rebuild by introducing cpuset_force_rebuild() and fix the
ordering with cpuset_wait_for_hotplug().

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: deb7aa308e ("cpuset: reorganize CPU / memory hotplug handling")
Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-07 11:45:21 +02:00
Peter Zijlstra a731ebe6f1 sched/fair: Fix wake_affine_llc() balancing rules
Chris Wilson reported that the SMT balance rules got the +1 on the
wrong side, resulting in a bias towards the current LLC; which the
load-balancer would then try and undo.

Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 90001d67be ("sched/fair: Fix wake_affine() for !NUMA_BALANCING")
Link: http://lkml.kernel.org/r/20170906105131.gqjmaextmn3u6tj2@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-07 09:29:31 +02:00
Linus Torvalds 439644096c Merge tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "This time (again) cpufreq gets the majority of changes which mostly
  are driver updates (including a major consolidation of intel_pstate),
  some schedutil governor modifications and core cleanups.

  There also are some changes in the system suspend area, mostly related
  to diagnostics and debug messages plus some renames of things related
  to suspend-to-idle. One major change here is that suspend-to-idle is
  now going to be preferred over S3 on systems where the ACPI tables
  indicate to do so and provide requsite support (the Low Power Idle S0
  _DSM in particular). The system sleep documentation and the tools
  related to it are updated too.

  The rest is a few cpuidle changes (nothing major), devfreq updates,
  generic power domains (genpd) framework updates and a few assorted
  modifications elsewhere.

  Specifics:

   - Drop the P-state selection algorithm based on a PID controller from
     intel_pstate and make it use the same P-state selection method
     (based on the CPU load) for all types of systems in the active mode
     (Rafael Wysocki, Srinivas Pandruvada).

   - Rework the cpufreq core and governors to make it possible to take
     cross-CPU utilization updates into account and modify the schedutil
     governor to actually do so (Viresh Kumar).

   - Clean up the handling of transition latency information in the
     cpufreq core and untangle it from the information on which drivers
     cannot do dynamic frequency switching (Viresh Kumar).

   - Add support for new SoCs (MT2701/MT7623 and MT7622) to the mediatek
     cpufreq driver and update its DT bindings (Sean Wang).

   - Modify the cpufreq dt-platdev driver to autimatically create
     cpufreq devices for the new (v2) Operating Performance Points (OPP)
     DT bindings and update its whitelist of supported systems (Viresh
     Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen, Finley
     Xiao).

   - Add support for Ux500 to the cpufreq-dt driver and drop the
     obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann).

   - Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem
     Nguyen).

   - Fix and clean up assorted issues in the cpufreq drivers and core
     (Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva,
     Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla).

   - Update the IO-wait boost handling in the schedutil governor to make
     it less aggressive (Joel Fernandes).

   - Rework system suspend diagnostics to make it print fewer messages
     to the kernel log by default, add a sysfs knob to allow more
     suspend-related messages to be printed and add Low Power S0 Idle
     constraints checks to the ACPI suspend-to-idle code (Rafael
     Wysocki, Srinivas Pandruvada).

   - Prefer suspend-to-idle over S3 on ACPI-based systems with the
     ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM
     interface present in the ACPI tables (Rafael Wysocki).

   - Update documentation related to system sleep and rename a number of
     items in the code to make it cleare that they are related to
     suspend-to-idle (Rafael Wysocki).

   - Export a variable allowing device drivers to check the target
     system sleep state from the core system suspend code (Florian
     Fainelli).

   - Clean up the cpuidle subsystem to handle the polling state on x86
     in a more straightforward way and to use %pOF instead of full_name
     (Rafael Wysocki, Rob Herring).

   - Update the devfreq framework to fix and clean up a few minor issues
     (Chanwoo Choi, Rob Herring).

   - Extend diagnostics in the generic power domains (genpd) framework
     and clean it up slightly (Thara Gopinath, Rob Herring).

   - Fix and clean up a couple of issues in the operating performance
     points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz).

   - Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling
     (AVS) driver (David Wu).

   - Fix the usage of notifiers in CPU power management on some
     platforms (Alex Shi).

   - Update the pm-graph system suspend/hibernation and boot profiling
     utility (Todd Brandt).

   - Make it possible to run the cpupower utility without CPU0 (Prarit
     Bhargava)"

* tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (87 commits)
  cpuidle: Make drivers initialize polling state
  cpuidle: Move polling state initialization code to separate file
  cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbol
  cpufreq: imx6q: Fix imx6sx low frequency support
  cpufreq: speedstep-lib: make several arrays static, makes code smaller
  PM: docs: Delete the obsolete states.txt document
  PM: docs: Describe high-level PM strategies and sleep states
  PM / devfreq: Fix memory leak when fail to register device
  PM / devfreq: Add dependency on PM_OPP
  PM / devfreq: Move private devfreq_update_stats() into devfreq
  PM / devfreq: Convert to using %pOF instead of full_name
  PM / AVS: rockchip-io: add io selectors and supplies for RV1108
  cpufreq: ti: Fix 'of_node_put' being called twice in error handling path
  cpufreq: dt-platdev: Drop few entries from whitelist
  cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2
  ARM: ux500: don't select CPUFREQ_DT
  cpuidle: Convert to using %pOF instead of full_name
  cpufreq: Convert to using %pOF instead of full_name
  PM / Domains: Convert to using %pOF instead of full_name
  cpufreq: Cap the default transition delay value to 10 ms
  ...
2017-09-05 12:19:08 -07:00
Linus Torvalds 5f82e71a00 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:

 - Add 'cross-release' support to lockdep, which allows APIs like
   completions, where it's not the 'owner' who releases the lock, to be
   tracked. It's all activated automatically under
   CONFIG_PROVE_LOCKING=y.

 - Clean up (restructure) the x86 atomics op implementation to be more
   readable, in preparation of KASAN annotations. (Dmitry Vyukov)

 - Fix static keys (Paolo Bonzini)

 - Add killable versions of down_read() et al (Kirill Tkhai)

 - Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini)

 - Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra)

 - Remove smp_mb__before_spinlock() and convert its usages, introduce
   smp_mb__after_spinlock() (Peter Zijlstra)

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
  locking/lockdep/selftests: Fix mixed read-write ABBA tests
  sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK()
  acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse
  locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures
  smp: Avoid using two cache lines for struct call_single_data
  locking/lockdep: Untangle xhlock history save/restore from task independence
  locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being
  futex: Remove duplicated code and fix undefined behaviour
  Documentation/locking/atomic: Finish the document...
  locking/lockdep: Fix workqueue crossrelease annotation
  workqueue/lockdep: 'Fix' flush_work() annotation
  locking/lockdep/selftests: Add mixed read-write ABBA tests
  mm, locking/barriers: Clarify tlb_flush_pending() barriers
  locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive
  locking/lockdep: Explicitly initialize wq_barrier::done::map
  locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
  locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config
  locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING
  locking/refcounts, x86/asm: Implement fast refcount overflow protection
  locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease
  ...
2017-09-04 11:52:29 -07:00
Linus Torvalds f213a6c84c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - fix affine wakeups (Peter Zijlstra)

   - improve CPU onlining (and general bootup) scalability on systems
     with ridiculous number (thousands) of CPUs (Peter Zijlstra)

   - sched/numa updates (Rik van Riel)

   - sched/deadline updates (Byungchul Park)

   - sched/cpufreq enhancements and related cleanups (Viresh Kumar)

   - sched/debug enhancements (Xie XiuQi)

   - various fixes"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  sched/debug: Optimize sched_domain sysctl generation
  sched/topology: Avoid pointless rebuild
  sched/topology, cpuset: Avoid spurious/wrong domain rebuilds
  sched/topology: Improve comments
  sched/topology: Fix memory leak in __sdt_alloc()
  sched/completion: Document that reinit_completion() must be called after complete_all()
  sched/autogroup: Fix error reporting printk text in autogroup_create()
  sched/fair: Fix wake_affine() for !NUMA_BALANCING
  sched/debug: Intruduce task_state_to_char() helper function
  sched/debug: Show task state in /proc/sched_debug
  sched/debug: Use task_pid_nr_ns in /proc/$pid/sched
  sched/core: Remove unnecessary initialization init_idle_bootup_task()
  sched/deadline: Change return value of cpudl_find()
  sched/deadline: Make find_later_rq() choose a closer CPU in topology
  sched/numa: Scale scan period with tasks in group and shared/private
  sched/numa: Slow down scan rate if shared faults dominate
  sched/pelt: Fix false running accounting
  sched: Mark pick_next_task_dl() and build_sched_domain() as static
  sched/cpupri: Don't re-initialize 'struct cpupri'
  sched/deadline: Don't re-initialize 'struct cpudl'
  ...
2017-09-04 09:10:24 -07:00
Linus Torvalds 0081a0ce80 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnad:
 "The main RCU related changes in this cycle were:

   - Removal of spin_unlock_wait()
   - SRCU updates
   - RCU torture-test updates
   - RCU Documentation updates
   - Extend the sys_membarrier() ABI with the MEMBARRIER_CMD_PRIVATE_EXPEDITED variant
   - Miscellaneous RCU fixes
   - CPU-hotplug fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits)
  arch: Remove spin_unlock_wait() arch-specific definitions
  locking: Remove spin_unlock_wait() generic definitions
  drivers/ata: Replace spin_unlock_wait() with lock/unlock pair
  ipc: Replace spin_unlock_wait() with lock/unlock pair
  exit: Replace spin_unlock_wait() with lock/unlock pair
  completion: Replace spin_unlock_wait() with lock/unlock pair
  doc: Set down RCU's scheduling-clock-interrupt needs
  doc: No longer allowed to use rcu_dereference on non-pointers
  doc: Add RCU files to docbook-generation files
  doc: Update memory-barriers.txt for read-to-write dependencies
  doc: Update RCU documentation
  membarrier: Provide expedited private command
  rcu: Remove exports from rcu_idle_exit() and rcu_idle_enter()
  rcu: Add warning to rcu_idle_enter() for irqs enabled
  rcu: Make rcu_idle_enter() rely on callers disabling irqs
  rcu: Add assertions verifying blocked-tasks list
  rcu/tracing: Set disable_rcu_irq_enter on rcu_eqs_exit()
  rcu: Add TPS() protection for _rcu_barrier_trace strings
  rcu: Use idle versions of swait to make idle-hack clear
  swait: Add idle variants which don't contribute to load average
  ...
2017-09-04 08:13:52 -07:00
Ingo Molnar edc2988c54 Merge branch 'linus' into locking/core, to fix up conflicts
Conflicts:
	mm/page_alloc.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-04 11:01:18 +02:00