Commit Graph

2364 Commits

Author SHA1 Message Date
Linus Torvalds
71e5602817 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "A memory (under-)allocation fix and a comment fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/topology: Fix off by one bug
  sched/rt: Update comment in pick_next_task_rt()
2018-11-03 18:37:09 -07:00
Peter Zijlstra
993f0b0510 sched/topology: Fix off by one bug
With the addition of the NUMA identity level, we increased @level by
one and will run off the end of the array in the distance sort loop.

Fixed: 051f3ca02e ("sched/topology: Introduce NUMA identity node sched domain")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-04 00:40:03 +01:00
Linus Torvalds
6ef746769e Merge tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
 "These remove a questionable heuristic from the menu cpuidle governor,
  fix a recent build regression in the intel_pstate driver, clean up ARM
  big-Little support in cpufreq and fix up hung task watchdog's
  interaction with system-wide power management transitions.

  Specifics:

   - Fix build regression in the intel_pstate driver that doesn't build
     without CONFIG_ACPI after recent changes (Dominik Brodowski).

   - One of the heuristics in the menu cpuidle governor is based on a
     function returning 0 most of the time, so drop it and clean up the
     scheduler code related to it (Daniel Lezcano).

   - Prevent the arm_big_little cpufreq driver from being used on ARM64
     which is not suitable for it and drop the arm_big_little_dt driver
     that is not used any more (Sudeep Holla).

   - Prevent the hung task watchdog from triggering during resume from
     system-wide sleep states by disabling it before freezing tasks and
     enabling it again after they have been thawed (Vitaly Kuznetsov)"

* tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  kernel: hung_task.c: disable on suspend
  cpufreq: remove unused arm_big_little_dt driver
  cpufreq: drop ARM_BIG_LITTLE_CPUFREQ support for ARM64
  cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI
  cpuidle: menu: Remove get_loadavg() from the performance multiplier
  sched: Factor out nr_iowait and nr_iowait_cpu
2018-10-30 09:08:07 -07:00
Muchun Song
a68d75081a sched/rt: Update comment in pick_next_task_rt()
Commit:

  f4ebcbc0d7 ("sched/rt: Substract number of tasks of throttled queues from rq->nr_running")

added a new rt_rq->rt_queued field, which is used to indicate the status of
rq->rt enqueue or dequeue. So, the ->rt_nr_running check was removed and we
now check ->rt_queued instead.

Fix the comment in pick_next_task_rt() as well, which was still referencing
the old logic.

Signed-off-by: Muchun Song <smuchun@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181027030517.23292-1-smuchun@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-29 07:18:04 +01:00
Johannes Weiner
2ce7135adc psi: cgroup support
On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system, but
also that of individual jobs, so that we can ensure individual job health,
fairness between jobs, or prioritize some jobs over others.

This patch implements pressure stall tracking for cgroups.  In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for only
the tasks inside the cgroup.

Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Johannes Weiner
eb414681d5 psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills.  In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively.  Stall states are aggregate versions of the per-task delay
accounting delays:

       cpu: some tasks are runnable but not executing on a CPU
       memory: tasks are reclaiming, or waiting for swapin or thrashing cache
       io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit.  They can also indicate when the system
is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states.  Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime.  A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).

[hannes@cmpxchg.org: doc fixlet, per Randy]
  Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
  Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
  Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
  Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Johannes Weiner
246b3b3342 sched: introduce this_rq_lock_irq()
do_sched_yield() disables IRQs, looks up this_rq() and locks it.  The next
patch is adding another site with the same pattern, so provide a
convenience function for it.

Link: http://lkml.kernel.org/r/20180828172258.3185-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Johannes Weiner
1f351d7f75 sched: sched.h: make rq locking and clock functions available in stats.h
kernel/sched/sched.h includes "stats.h" half-way through the file.  The
next patch introduces users of sched.h's rq locking functions and
update_rq_clock() in kernel/sched/stats.h.  Move those definitions up in
the file so they are available in stats.h.

Link: http://lkml.kernel.org/r/20180828172258.3185-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Johannes Weiner
5c54f5b9ed sched: loadavg: make calc_load_n() public
It's going to be used in a later patch. Keep the churn separate.

Link: http://lkml.kernel.org/r/20180828172258.3185-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Johannes Weiner
8508cf3ffa sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD
There are several definitions of those functions/macros in places that
mess with fixed-point load averages.  Provide an official version.

[akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Linus Torvalds
4dcb9239da Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping updates from Thomas Gleixner:
 "The timers and timekeeping departement provides:

   - Another large y2038 update with further preparations for providing
     the y2038 safe timespecs closer to the syscalls.

   - An overhaul of the SHCMT clocksource driver

   - SPDX license identifier updates

   - Small cleanups and fixes all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  tick/sched : Remove redundant cpu_online() check
  clocksource/drivers/dw_apb: Add reset control
  clocksource: Remove obsolete CLOCKSOURCE_OF_DECLARE
  clocksource/drivers: Unify the names to timer-* format
  clocksource/drivers/sh_cmt: Add R-Car gen3 support
  dt-bindings: timer: renesas: cmt: document R-Car gen3 support
  clocksource/drivers/sh_cmt: Properly line-wrap sh_cmt_of_table[] initializer
  clocksource/drivers/sh_cmt: Fix clocksource width for 32-bit machines
  clocksource/drivers/sh_cmt: Fixup for 64-bit machines
  clocksource/drivers/sh_tmu: Convert to SPDX identifiers
  clocksource/drivers/sh_mtu2: Convert to SPDX identifiers
  clocksource/drivers/sh_cmt: Convert to SPDX identifiers
  clocksource/drivers/renesas-ostm: Convert to SPDX identifiers
  clocksource: Convert to using %pOFn instead of device_node.name
  tick/broadcast: Remove redundant check
  RISC-V: Request newstat syscalls
  y2038: signal: Change rt_sigtimedwait to use __kernel_timespec
  y2038: socket: Change recvmmsg to use __kernel_timespec
  y2038: sched: Change sched_rr_get_interval to use __kernel_timespec
  y2038: utimes: Rework #ifdef guards for compat syscalls
  ...
2018-10-25 11:14:36 -07:00
Daniel Lezcano
a7fe5190c0 cpuidle: menu: Remove get_loadavg() from the performance multiplier
The function get_loadavg() returns almost always zero. To be more
precise, statistically speaking for a total of 1023379 times passing
in the function, the load is equal to zero 1020728 times, greater than
100, 610 times, the remaining is between 0 and 5.

In 2011, the get_loadavg() was removed from the Android tree because
of the above [1]. At this time, the load was:

unsigned long this_cpu_load(void)
{
        struct rq *this = this_rq();
        return this->cpu_load[0];
}

In 2014, the code was changed by commit 372ba8cb46 (cpuidle: menu: Lookup CPU
runqueues less) and the load is:

void get_iowait_load(unsigned long *nr_waiters, unsigned long *load)
{
        struct rq *rq = this_rq();
        *nr_waiters = atomic_read(&rq->nr_iowait);
        *load = rq->load.weight;
}

with the same result.

Both measurements show using the load in this code path does no matter
anymore. Removing it.

[1] 4dedd9f124

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-10-25 16:49:27 +02:00
Daniel Lezcano
145d952a29 sched: Factor out nr_iowait and nr_iowait_cpu
The function nr_iowait_cpu() can be used directly by nr_iowait() instead
of duplicating code.

Call nr_iowait_cpu() from nr_iowait()

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-10-25 16:49:26 +02:00
Linus Torvalds
99792e0cea Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "Lots of changes in this cycle:

   - Lots of CPA (change page attribute) optimizations and related
     cleanups (Thomas Gleixner, Peter Zijstra)

   - Make lazy TLB mode even lazier (Rik van Riel)

   - Fault handler cleanups and improvements (Dave Hansen)

   - kdump, vmcore: Enable kdumping encrypted memory with AMD SME
     enabled (Lianbo Jiang)

   - Clean up VM layout documentation (Baoquan He, Ingo Molnar)

   - ... plus misc other fixes and enhancements"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry()
  x86/mm: Kill stray kernel fault handling comment
  x86/mm: Do not warn about PCI BIOS W+X mappings
  resource: Clean it up a bit
  resource: Fix find_next_iomem_res() iteration issue
  resource: Include resource end in walk_*() interfaces
  x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error
  x86/mm: Remove spurious fault pkey check
  x86/mm/vsyscall: Consider vsyscall page part of user address space
  x86/mm: Add vsyscall address helper
  x86/mm: Fix exception table comments
  x86/mm: Add clarifying comments for user addr space
  x86/mm: Break out user address space handling
  x86/mm: Break out kernel address space handling
  x86/mm: Clarify hardware vs. software "error_code"
  x86/mm/tlb: Make lazy TLB mode lazier
  x86/mm/tlb: Add freed_tables element to flush_tlb_info
  x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range
  smp,cpumask: introduce on_each_cpu_cond_mask
  smp: use __cpumask_set_cpu in on_each_cpu_cond
  ...
2018-10-23 17:05:28 +01:00
Linus Torvalds
42f52e1c59 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes are:

   - Migrate CPU-intense 'misfit' tasks on asymmetric capacity systems,
     to better utilize (much) faster 'big core' CPUs. (Morten Rasmussen,
     Valentin Schneider)

   - Topology handling improvements, in particular when CPU capacity
     changes and related load-balancing fixes/improvements (Morten
     Rasmussen)

   - ... plus misc other improvements, fixes and updates"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  sched/completions/Documentation: Add recommendation for dynamic and ONSTACK completions
  sched/completions/Documentation: Clean up the document some more
  sched/completions/Documentation: Fix a couple of punctuation nits
  cpu/SMT: State SMT is disabled even with nosmt and without "=force"
  sched/core: Fix comment regarding nr_iowait_cpu() and get_iowait_load()
  sched/fair: Remove setting task's se->runnable_weight during PELT update
  sched/fair: Disable LB_BIAS by default
  sched/pelt: Fix warning and clean up IRQ PELT config
  sched/topology: Make local variables static
  sched/debug: Use symbolic names for task state constants
  sched/numa: Remove unused numa_stats::nr_running field
  sched/numa: Remove unused code from update_numa_stats()
  sched/debug: Explicitly cast sched_feat() to bool
  sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains
  sched/fair: Don't move tasks to lower capacity CPUs unless necessary
  sched/fair: Set rq->rd->overload when misfit
  sched/fair: Wrap rq->rd->overload accesses with READ/WRITE_ONCE()
  sched/core: Change root_domain->overload type to int
  sched/fair: Change 'prefer_sibling' type to bool
  sched/fair: Kick nohz balance if rq->misfit_task_load
  ...
2018-10-23 15:00:03 +01:00
Christophe Leroy
977e4be5eb x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry()
The following commit:

  d7880812b3 ("idle: Add the stack canary init to cpu_startup_entry()")

... added an x86 specific boot_init_stack_canary() call to the generic
cpu_startup_entry() as a temporary hack, with the intention to remove
the #ifdef CONFIG_X86 later.

More than 5 years later let's finally realize that plan! :-)

While implementing stack protector support for PowerPC, we found
that calling boot_init_stack_canary() is also needed for PowerPC
which uses per task (TLS) stack canary like the X86.

However, calling boot_init_stack_canary() would break architectures
using a global stack canary (ARM, SH, MIPS and XTENSA).

Instead of modifying the #ifdef CONFIG_X86 to an even messier:

   #if defined(CONFIG_X86) || defined(CONFIG_PPC)

PowerPC implemented the call to boot_init_stack_canary() in the function
calling cpu_startup_entry().

Let's try the same cleanup on the x86 side as well.

On x86 we have two functions calling cpu_startup_entry():

 - start_secondary()
 - cpu_bringup_and_idle()

start_secondary() already calls boot_init_stack_canary(), so
it's good, and this patch adds the call to boot_init_stack_canary()
in cpu_bringup_and_idle().

I.e. now x86 catches up to the rest of the world and the ugly init
sequence in init/main.c can be removed from cpu_startup_entry().

As a final benefit we can also remove the <linux/stackprotector.h>
dependency from <linux/sched.h>.

[ mingo: Improved the changelog a bit, added language explaining x86 borkage and sched.h change. ]
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/20181020072649.5B59310483E@pc16082vm.idsi0.si.c-s.fr
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-22 04:07:24 +02:00
Song Muchun
9845c49cc9 sched/fair: Fix the min_vruntime update logic in dequeue_entity()
The comment and the code around the update_min_vruntime() call in
dequeue_entity() are not in agreement.

From commit:

  b60205c7c5 ("sched/fair: Fix min_vruntime tracking")

I think that we want to update min_vruntime when a task is sleeping/migrating.
So, the check is inverted there - fix it.

Signed-off-by: Song Muchun <smuchun@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b60205c7c5 ("sched/fair: Fix min_vruntime tracking")
Link: http://lkml.kernel.org/r/20181014112612.2614-1-smuchun@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-16 09:36:01 +02:00
Phil Auld
baa9be4ffb sched/fair: Fix throttle_list starvation with low CFS quota
With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
distribute_cfs_runtime may not empty the throttled_list before it runs
out of runtime to distribute. In that case, due to the change from
c06f04c704 to put throttled entries at the head of the list, later entries
on the list will starve.  Essentially, the same X processes will get pulled
off the list, given CPU time and then, when expired, get put back on the
head of the list where distribute_cfs_runtime will give runtime to the same
set of processes leaving the rest.

Fix the issue by setting a bit in struct cfs_bandwidth when
distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
decide to put the throttled entry on the tail or the head of the list.  The
bit is set/cleared by the callers of distribute_cfs_runtime while they hold
cfs_bandwidth->lock.

This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
the live system. In some cases you can simply look at the throttled list and
see the later entries are not changing:

  crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
    1     ffff90b56cb2d200  -976050
    2     ffff90b56cb2cc00  -484925
    3     ffff90b56cb2bc00  -658814
    4     ffff90b56cb2ba00  -275365
    5     ffff90b166a45600  -135138
    6     ffff90b56cb2da00  -282505
    7     ffff90b56cb2e000  -148065
    8     ffff90b56cb2fa00  -872591
    9     ffff90b56cb2c000  -84687
   10     ffff90b56cb2f000  -87237
   11     ffff90b166a40a00  -164582

  crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
    1     ffff90b56cb2d200  -994147
    2     ffff90b56cb2cc00  -306051
    3     ffff90b56cb2bc00  -961321
    4     ffff90b56cb2ba00  -24490
    5     ffff90b166a45600  -135138
    6     ffff90b56cb2da00  -282505
    7     ffff90b56cb2e000  -148065
    8     ffff90b56cb2fa00  -872591
    9     ffff90b56cb2c000  -84687
   10     ffff90b56cb2f000  -87237
   11     ffff90b166a40a00  -164582

Sometimes it is easier to see by finding a process getting starved and looking
at the sched_info:

  crash> task ffff8eb765994500 sched_info
  PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
    sched_info = {
      pcount = 8,
      run_delay = 697094208,
      last_arrival = 240260125039,
      last_queued = 240260327513
    },
  crash> task ffff8eb765994500 sched_info
  PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
    sched_info = {
      pcount = 8,
      run_delay = 697094208,
      last_arrival = 240260125039,
      last_queued = 240260327513
    },

Signed-off-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: c06f04c704 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-11 13:10:18 +02:00
Rafael J. Wysocki
9c2298aad3 sched/core: Fix comment regarding nr_iowait_cpu() and get_iowait_load()
The comment related to nr_iowait_cpu() and get_iowait_load() confuses
cpufreq with cpuidle and is not very useful for this reason, so fix it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux PM <linux-pm@vger.kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: e33a9bba85 "sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler"
Link: http://lkml.kernel.org/r/3803514.xkx7zY50tF@aspire.rjw.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-04 11:25:56 +02:00
Mel Gorman
37355bdc5a sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task
Automatic NUMA Balancing uses a multi-stage pass to decide whether a page
should migrate to a local node. This filter avoids excessive ping-ponging
if a page is shared or used by threads that migrate cross-node frequently.

Threads inherit both page tables and the preferred node ID from the
parent. This means that threads can trigger hinting faults earlier than
a new task which delays scanning for a number of seconds. As it can be
load balanced very early in its lifetime there can be an unnecessary delay
before it starts migrating thread-local data. This patch migrates private
pages faster early in the lifetime of a thread using the sequence counter
as an identifier of new tasks.

With this patch applied, STREAM performance is the same as 4.17 even though
processes are not spread cross-node prematurely. Other workloads showed
a mix of minor gains and losses. This is somewhat expected most workloads
are not very sensitive to the starting conditions of a process.

                         4.19.0-rc5             4.19.0-rc5                 4.17.0
                         numab-v1r1       fastmigrate-v1r1                vanilla
MB/sec copy     43298.52 (   0.00%)    47335.46 (   9.32%)    47219.24 (   9.06%)
MB/sec scale    30115.06 (   0.00%)    32568.12 (   8.15%)    32527.56 (   8.01%)
MB/sec add      32825.12 (   0.00%)    36078.94 (   9.91%)    35928.02 (   9.45%)
MB/sec triad    32549.52 (   0.00%)    35935.94 (  10.40%)    35969.88 (  10.51%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181001100525.29789-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02 11:31:33 +02:00
Dietmar Eggemann
4a465e3ebb sched/fair: Remove setting task's se->runnable_weight during PELT update
A CFS (SCHED_OTHER, SCHED_BATCH or SCHED_IDLE policy) task's
se->runnable_weight must always be in sync with its se->load.weight.

se->runnable_weight is set to se->load.weight when the task is
forked (init_entity_runnable_average()) or reniced (reweight_entity()).

There are two cases in set_load_weight() which since they currently only
set se->load.weight could lead to a situation in which se->load.weight
is different to se->runnable_weight for a CFS task:

(1) A task switches to SCHED_IDLE.

(2) A SCHED_FIFO, SCHED_RR or SCHED_DEADLINE task which has been reniced
    (during which only its static priority gets set) switches to
    SCHED_OTHER or SCHED_BATCH.

Set se->runnable_weight to se->load.weight in these two cases to prevent
this. This eliminates the need to explicitly set it to se->load.weight
during PELT updates in the CFS scheduler fastpath.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20180803140538.1178-1-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02 09:45:03 +02:00
Dietmar Eggemann
fdf5f315d5 sched/fair: Disable LB_BIAS by default
LB_BIAS allows the adjustment on how conservative load should be
balanced.

The rq->cpu_load[idx] array is used for this functionality. It contains
weighted CPU load decayed average values over different intervals
(idx = 1..4). Idx = 0 is the weighted CPU load itself.

The values are updated during scheduler_tick, before idle balance and at
nohz exit.

There are 5 different types of idx's per sched domain (sd). Each of them
is used to index into the rq->cpu_load[idx] array in a specific scenario
(busy, idle and newidle for load balancing, forkexec for wake-up
slow-path load balancing and wake for affine wakeup based on weight).
Only the sd idx's for busy and idle load balancing are set to 2,3 or 1,2
respectively. All the other sd idx's are set to 0.

Conservative load balancing is achieved for sd idx's >= 1 by using the
min/max (source_load()/target_load()) value between the current weighted
CPU load and the rq->cpu_load[sd idx -1] for the busiest(idlest)/local
CPU load in load balancing or vice versa in the wake-up slow-path load
balancing.
There is no conservative balancing for sd idx = 0 since only current
weighted CPU load is used in this case.

It is very likely that LB_BIAS' influence on load balancing can be
neglected (see test results below). This is further supported by:

(1) Weighted CPU load today is by itself a decayed average value (PELT)
    (cfs_rq->avg->runnable_load_avg) and not the instantaneous load
    (rq->load.weight) it was when LB_BIAS was introduced.

(2) Sd imbalance_pct is used for CPU_NEWLY_IDLE and CPU_NOT_IDLE (relate
    to sd's newidle and busy idx) in find_busiest_group() when comparing
    busiest and local avg load to make load balancing even more
    conservative.

(3) The sd forkexec and newidle idx are always set to 0 so there is no
    adjustment on how conservatively load balancing is done here.

(4) Affine wakeup based on weight (wake_affine_weight()) will not be
    impacted since the sd wake idx is always set to 0.

Let's disable LB_BIAS by default for a few kernel releases to make sure
that no workload and no scheduler topology is affected. The benefit of
being able to remove the LB_BIAS dependency from source_load() and
target_load() is that the entire rq->cpu_load[idx] code could be removed
in this case.

It is really hard to say if there is no regression w/o testing this with
a lot of different workloads on a lot of different platforms, especially
NUMA machines.
The following 104 LKP (Linux Kernel Performance) tests were run by the
0-Day guys mostly on multi-socket hosts with a larger number of logical
cpus (88, 192).
The base for the test was commit b3dae109fa ("sched/swait: Rename to
exclusive") (tip/sched/core v4.18-rc1).
Only 2 out of the 104 tests had a significant change in one of the
metrics (fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance +7%
files_per_sec, unixbench/300s-100%-syscall-performance -11% score).
Tests which showed a change in one of the metrics are marked with a '*'
and this change is listed as well.

(a) lkp-bdw-ep3:
      88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 64G

    dd-write/10m-1HDD-cfq-btrfs-100dd-performance
    fsmark/1x-1t-1HDD-xfs-nfsv4-4M-60G-NoSync-performance
  * fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance
      7.50  7%  8.00  ±  6%  fsmark.files_per_sec
    fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-fsyncBeforeClose-performance
    fsmark/1x-1t-1HDD-btrfs-4M-60G-NoSync-performance
    fsmark/1x-1t-1HDD-btrfs-4M-60G-fsyncBeforeClose-performance
    kbuild/300s-50%-vmlinux_prereq-performance
    kbuild/300s-200%-vmlinux_prereq-performance
    kbuild/300s-50%-vmlinux_prereq-performance-1HDD-ext4
    kbuild/300s-200%-vmlinux_prereq-performance-1HDD-ext4

(b) lkp-skl-4sp1:
      192 threads Intel(R) Xeon(R) Platinum 8160 768G

    dbench/100%-performance
    ebizzy/200%-100x-10s-performance
    hackbench/1600%-process-pipe-performance
    iperf/300s-cs-localhost-tcp-performance
    iperf/300s-cs-localhost-udp-performance
    perf-bench-numa-mem/2t-300M-performance
    perf-bench-sched-pipe/10000000ops-process-performance
    perf-bench-sched-pipe/10000000ops-threads-performance
    schbench/2-16-300-30000-30000-performance
    tbench/100%-cs-localhost-performance

(c) lkp-bdw-ep6:
      88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 128G

    stress-ng/100%-60s-pipe-performance
    unixbench/300s-1-whetstone-double-performance
    unixbench/300s-1-shell1-performance
    unixbench/300s-1-shell8-performance
    unixbench/300s-1-pipe-performance
  * unixbench/300s-1-context1-performance
      312  315  unixbench.score
    unixbench/300s-1-spawn-performance
    unixbench/300s-1-syscall-performance
    unixbench/300s-1-dhry2reg-performance
    unixbench/300s-1-fstime-performance
    unixbench/300s-1-fsbuffer-performance
    unixbench/300s-1-fsdisk-performance
    unixbench/300s-100%-whetstone-double-performance
    unixbench/300s-100%-shell1-performance
    unixbench/300s-100%-shell8-performance
    unixbench/300s-100%-pipe-performance
    unixbench/300s-100%-context1-performance
    unixbench/300s-100%-spawn-performance
  * unixbench/300s-100%-syscall-performance
      3571  ±  3%  -11%  3183  ±  4%  unixbench.score
    unixbench/300s-100%-dhry2reg-performance
    unixbench/300s-100%-fstime-performance
    unixbench/300s-100%-fsbuffer-performance
    unixbench/300s-100%-fsdisk-performance
    unixbench/300s-1-execl-performance
    unixbench/300s-100%-execl-performance
  * will-it-scale/brk1-performance
      365004  360387  will-it-scale.per_thread_ops
  * will-it-scale/dup1-performance
      432401  437596  will-it-scale.per_thread_ops
    will-it-scale/eventfd1-performance
    will-it-scale/futex1-performance
    will-it-scale/futex2-performance
    will-it-scale/futex3-performance
    will-it-scale/futex4-performance
    will-it-scale/getppid1-performance
    will-it-scale/lock1-performance
    will-it-scale/lseek1-performance
    will-it-scale/lseek2-performance
  * will-it-scale/malloc1-performance
      47025  45817  will-it-scale.per_thread_ops
      77499  76529  will-it-scale.per_process_ops
    will-it-scale/malloc2-performance
  * will-it-scale/mmap1-performance
      123399  120815  will-it-scale.per_thread_ops
      152219  149833  will-it-scale.per_process_ops
  * will-it-scale/mmap2-performance
      107327  104714  will-it-scale.per_thread_ops
      136405  133765  will-it-scale.per_process_ops
    will-it-scale/open1-performance
  * will-it-scale/open2-performance
      171570  168805  will-it-scale.per_thread_ops
      532644  526202  will-it-scale.per_process_ops
    will-it-scale/page_fault1-performance
    will-it-scale/page_fault2-performance
    will-it-scale/page_fault3-performance
    will-it-scale/pipe1-performance
    will-it-scale/poll1-performance
  * will-it-scale/poll2-performance
      176134  172848  will-it-scale.per_thread_ops
      281361  275053  will-it-scale.per_process_ops
    will-it-scale/posix_semaphore1-performance
    will-it-scale/pread1-performance
    will-it-scale/pread2-performance
    will-it-scale/pread3-performance
    will-it-scale/pthread_mutex1-performance
    will-it-scale/pthread_mutex2-performance
    will-it-scale/pwrite1-performance
    will-it-scale/pwrite2-performance
    will-it-scale/pwrite3-performance
  * will-it-scale/read1-performance
      1190563  1174833  will-it-scale.per_thread_ops
  * will-it-scale/read2-performance
      1105369  1080427  will-it-scale.per_thread_ops
    will-it-scale/readseek1-performance
  * will-it-scale/readseek2-performance
      261818  259040  will-it-scale.per_thread_ops
    will-it-scale/readseek3-performance
  * will-it-scale/sched_yield-performance
      2408059  2382034  will-it-scale.per_thread_ops
    will-it-scale/signal1-performance
    will-it-scale/unix1-performance
    will-it-scale/unlink1-performance
    will-it-scale/unlink2-performance
  * will-it-scale/write1-performance
      976701  961588  will-it-scale.per_thread_ops
  * will-it-scale/writeseek1-performance
      831898  822448  will-it-scale.per_thread_ops
  * will-it-scale/writeseek2-performance
      228248  225065  will-it-scale.per_thread_ops
  * will-it-scale/writeseek3-performance
      226670  224058  will-it-scale.per_thread_ops
    will-it-scale/context_switch1-performance
    aim7/performance-fork_test-2000
  * aim7/performance-brk_test-3000
      74869  76676  aim7.jobs-per-min
    aim7/performance-disk_cp-3000
    aim7/performance-disk_rd-3000
    aim7/performance-sieve-3000
    aim7/performance-page_test-3000
    aim7/performance-creat-clo-3000
    aim7/performance-mem_rtns_1-8000
    aim7/performance-disk_wrt-8000
    aim7/performance-pipe_cpy-8000
    aim7/performance-ram_copy-8000

(d) lkp-avoton3:
      8 threads Intel(R) Atom(TM) CPU C2750 @ 2.40GHz 16G

    netperf/ipv4-900s-200%-cs-localhost-TCP_STREAM-performance

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Li Zhijian <zhijianx.li@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180809135753.21077-1-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02 09:45:01 +02:00
Vincent Guittot
11d4afd4ff sched/pelt: Fix warning and clean up IRQ PELT config
Create a config for enabling irq load tracking in the scheduler.
irq load tracking is useful only when irq or paravirtual time is
accounted but it's only possible with SMP for now.

Also use __maybe_unused to remove the compilation warning in
update_rq_clock_task() that has been introduced by:

  2e62c4743a ("sched/fair: Remove #ifdefs from scale_rt_capacity()")

Suggested-by: Ingo Molnar <mingo@redhat.com>
Reported-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reported-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: dou_liyang@163.com
Fixes: 2e62c4743a ("sched/fair: Remove #ifdefs from scale_rt_capacity()")
Link: http://lkml.kernel.org/r/1537867062-27285-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02 09:45:00 +02:00
Ingo Molnar
b429f71bca Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02 09:43:39 +02:00
Srikar Dronamraju
6fd98e775f sched/numa: Avoid task migration for small NUMA improvement
If NUMA improvement from the task migration is going to be very
minimal, then avoid task migration.

Specjbb2005 results (8 warehouses)
Higher bops are better

2 Socket - 2  Node Haswell - X86
JVMS  Prev    Current  %Change
4     198512  205910   3.72673
1     313559  318491   1.57291

2 Socket - 4 Node Power8 - PowerNV
JVMS  Prev     Current  %Change
8     74761.9  74935.9  0.232739
1     214874   226796   5.54837

2 Socket - 2  Node Power9 - PowerNV
JVMS  Prev    Current  %Change
4     180536  189780   5.12031
1     210281  205695   -2.18089

4 Socket - 4  Node Power7 - PowerVM
JVMS  Prev     Current  %Change
8     56511.4  60370    6.828
1     104899   108100   3.05151

1/7 cases is regressing, if we look at events migrate_pages seem
to vary the most especially in the regressing case. Also some
amount of variance is expected between different runs of
Specjbb2005.

Some events stats before and after applying the patch.

perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
Event                     Before          After
cs                        13,818,546      13,801,554
migrations                1,149,960       1,151,541
faults                    385,583         433,246
cache-misses              55,259,546,768  55,168,691,835
sched:sched_move_numa     2,257           2,551
sched:sched_stick_numa    9               24
sched:sched_swap_numa     512             904
migrate:mm_migrate_pages  2,225           1,571

vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
Event                   Before  After
numa_hint_faults        72692   113682
numa_hint_faults_local  62270   102163
numa_hit                238762  240181
numa_huge_pte_updates   48      36
numa_interleave         75      64
numa_local              238676  240103
numa_other              86      78
numa_pages_migrated     2225    1564
numa_pte_updates        98557   134080

perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
Event                     Before          After
cs                        3,173,490       3,079,150
migrations                36,966          31,455
faults                    108,776         99,081
cache-misses              12,200,075,320  11,588,126,740
sched:sched_move_numa     1,264           1
sched:sched_stick_numa    0               0
sched:sched_swap_numa     0               0
migrate:mm_migrate_pages  899             36

vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
Event                   Before  After
numa_hint_faults        21109   430
numa_hint_faults_local  17120   77
numa_hit                72934   71277
numa_huge_pte_updates   42      0
numa_interleave         33      22
numa_local              72866   71218
numa_other              68      59
numa_pages_migrated     915     23
numa_pte_updates        42326   0

perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
Event                     Before       After
cs                        8,312,022    8,707,565
migrations                231,705      171,342
faults                    310,242      310,820
cache-misses              402,324,573  136,115,400
sched:sched_move_numa     193          215
sched:sched_stick_numa    0            6
sched:sched_swap_numa     3            24
migrate:mm_migrate_pages  93           162

vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
Event                   Before  After
numa_hint_faults        11838   8985
numa_hint_faults_local  11216   8154
numa_hit                90689   93819
numa_huge_pte_updates   0       0
numa_interleave         1579    882
numa_local              89634   93496
numa_other              1055    323
numa_pages_migrated     92      169
numa_pte_updates        12109   9217

perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
Event                     Before      After
cs                        2,170,481   2,152,072
migrations                10,126      10,704
faults                    160,962     164,376
cache-misses              10,834,845  3,818,437
sched:sched_move_numa     10          16
sched:sched_stick_numa    0           0
sched:sched_swap_numa     0           7
migrate:mm_migrate_pages  2           199

vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
Event                   Before  After
numa_hint_faults        403     2248
numa_hint_faults_local  358     1666
numa_hit                25898   25704
numa_huge_pte_updates   0       0
numa_interleave         207     200
numa_local              25860   25679
numa_other              38      25
numa_pages_migrated     2       197
numa_pte_updates        400     2234

perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
Event                     Before           After
cs                        110,339,633      93,330,595
migrations                4,139,812        4,122,061
faults                    863,622          865,979
cache-misses              231,838,045,660  225,395,083,479
sched:sched_move_numa     2,196            2,372
sched:sched_stick_numa    33               24
sched:sched_swap_numa     544              769
migrate:mm_migrate_pages  2,469            1,677

vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
Event                   Before  After
numa_hint_faults        85748   91638
numa_hint_faults_local  66831   78096
numa_hit                242213  242225
numa_huge_pte_updates   0       0
numa_interleave         0       2
numa_local              242211  242219
numa_other              2       6
numa_pages_migrated     2376    1515
numa_pte_updates        86233   92274

perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
Event                     Before          After
cs                        59,331,057      51,487,271
migrations                552,019         537,170
faults                    266,586         256,921
cache-misses              73,796,312,990  70,073,831,187
sched:sched_move_numa     981             576
sched:sched_stick_numa    54              24
sched:sched_swap_numa     286             327
migrate:mm_migrate_pages  713             726

vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
Event                   Before  After
numa_hint_faults        14807   12000
numa_hint_faults_local  5738    5024
numa_hit                36230   36470
numa_huge_pte_updates   0       0
numa_interleave         0       0
numa_local              36228   36465
numa_other              2       5
numa_pages_migrated     703     726
numa_pte_updates        14742   11930

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-7-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02 09:42:28 +02:00