Commit Graph

2210 Commits

Author SHA1 Message Date
Mathieu Malaterre
3febfc8a21 sched/deadline: Make the grub_reclaim() function static
Since the grub_reclaim() function can be made static, make it so.

Silences the following GCC warning (W=1):

  kernel/sched/deadline.c:1120:5: warning: no previous prototype for ‘grub_reclaim’ [-Wmissing-prototypes]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180516200902.959-1-malat@debian.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-18 09:05:22 +02:00
Mathieu Malaterre
f6a3463063 sched/debug: Move the print_rt_rq() and print_dl_rq() declarations to kernel/sched/sched.h
In the following commit:

  6b55c9654f ("sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h")

the print_cfs_rq() prototype was added to <kernel/sched/sched.h>,
right next to the prototypes for print_cfs_stats(), print_rt_stats()
and print_dl_stats().

Finish this previous commit and also move related prototypes for
print_rt_rq() and print_dl_rq().

Remove existing extern declarations now that they not needed anymore.

Silences the following GCC warning, triggered by W=1:

  kernel/sched/debug.c:573:6: warning: no previous prototype for ‘print_rt_rq’ [-Wmissing-prototypes]
  kernel/sched/debug.c:603:6: warning: no previous prototype for ‘print_dl_rq’ [-Wmissing-prototypes]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180516195348.30426-1-malat@debian.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-18 09:05:14 +02:00
Linus Torvalds
66e1c94db3 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/pti updates from Thomas Gleixner:
 "A mixed bag of fixes and updates for the ghosts which are hunting us.

  The scheduler fixes have been pulled into that branch to avoid
  conflicts.

   - A set of fixes to address a khread_parkme() race which caused lost
     wakeups and loss of state.

   - A deadlock fix for stop_machine() solved by moving the wakeups
     outside of the stopper_lock held region.

   - A set of Spectre V1 array access restrictions. The possible
     problematic spots were discuvered by Dan Carpenters new checks in
     smatch.

   - Removal of an unused file which was forgotten when the rest of that
     functionality was removed"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Remove unused file
  perf/x86/cstate: Fix possible Spectre-v1 indexing for pkg_msr
  perf/x86/msr: Fix possible Spectre-v1 indexing in the MSR driver
  perf/x86: Fix possible Spectre-v1 indexing for x86_pmu::event_map()
  perf/x86: Fix possible Spectre-v1 indexing for hw_perf_event cache_*
  perf/core: Fix possible Spectre-v1 indexing for ->aux_pages[]
  sched/autogroup: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
  sched/core: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
  sched/core: Introduce set_special_state()
  kthread, sched/wait: Fix kthread_parkme() completion issue
  kthread, sched/wait: Fix kthread_parkme() wait-loop
  sched/fair: Fix the update of blocked load when newly idle
  stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock
2018-05-13 10:53:08 -07:00
Linus Torvalds
86a4ac433b Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "Revert the new NUMA aware placement approach which turned out to
  create more problems than it solved"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()"
2018-05-13 10:46:53 -07:00
Mel Gorman
789ba28013 Revert "sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()"
This reverts commit 7347fc87df.

Srikar Dronamra pointed out that while the commit in question did show
a performance improvement on ppc64, it did so at the cost of disabling
active CPU migration by automatic NUMA balancing which was not the intent.
The issue was that a serious flaw in the logic failed to ever active balance
if SD_WAKE_AFFINE was disabled on scheduler domains. Even when it's enabled,
the logic is still bizarre and against the original intent.

Investigation showed that fixing the patch in either the way he suggested,
using the correct comparison for jiffies values or introducing a new
numa_migrate_deferred variable in task_struct all perform similarly to a
revert with a mix of gains and losses depending on the workload, machine
and socket count.

The original intent of the commit was to handle a problem whereby
wake_affine, idle balancing and automatic NUMA balancing disagree on the
appropriate placement for a task. This was particularly true for cases where
a single task was a massive waker of tasks but where wake_wide logic did
not apply.  This was particularly noticeable when a futex (a barrier) woke
all worker threads and tried pulling the wakees to the waker nodes. In that
specific case, it could be handled by tuning MPI or openMP appropriately,
but the behavior is not illogical and was worth attempting to fix. However,
the approach was wrong. Given that we're at rc4 and a fix is not obvious,
it's better to play safe, revert this commit and retry later.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: ggherdovich@suse.cz
Cc: hpa@zytor.com
Cc: matt@codeblueprint.co.uk
Cc: mpe@ellerman.id.au
Link: http://lkml.kernel.org/r/20180509163115.6fnnyeg4vdm2ct4v@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-12 08:37:56 +02:00
Rafael J. Wysocki
97739501f2 cpufreq: schedutil: Avoid using invalid next_freq
If the next_freq field of struct sugov_policy is set to UINT_MAX,
it shouldn't be used for updating the CPU frequency (this is a
special "invalid" value), but after commit b7eaf1aab9 (cpufreq:
schedutil: Avoid reducing frequency of busy CPUs prematurely) it
may be passed as the new frequency to sugov_update_commit() in
sugov_update_single().

Fix that by adding an extra check for the special UINT_MAX value
of next_freq to sugov_update_single().

Fixes: b7eaf1aab9 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.12+ <stable@vger.kernel.org> # 4.12+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-09 12:21:17 +02:00
Juri Lelli
a744490f12 cpufreq: schedutil: remove stale comment
After commit 794a56ebd9 (sched/cpufreq: Change the worker kthread to
SCHED_DEADLINE) schedutil kthreads are "ignored" for a clock frequency
selection point of view, so the potential corner case for RT tasks is not
possible at all now.

Remove the stale comment mentioning it.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-09 12:20:24 +02:00
Peter Zijlstra
354d779307 sched/autogroup: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
> kernel/sched/autogroup.c:230 proc_sched_autogroup_set_nice() warn: potential spectre issue 'sched_prio_to_weight'

Userspace controls @nice, sanitize the array index.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-05 08:34:42 +02:00
Peter Zijlstra
7281c8dec8 sched/core: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
> kernel/sched/core.c:6921 cpu_weight_nice_write_s64() warn: potential spectre issue 'sched_prio_to_weight'

Userspace controls @nice, so sanitize the value before using it to
index an array.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-05 08:32:36 +02:00
Peter Zijlstra
b5bf9a90bb sched/core: Introduce set_special_state()
Gaurav reported a perceived problem with TASK_PARKED, which turned out
to be a broken wait-loop pattern in __kthread_parkme(), but the
reported issue can (and does) in fact happen for states that do not do
condition based sleeps.

When the 'current->state = TASK_RUNNING' store of a previous
(concurrent) try_to_wake_up() collides with the setting of a 'special'
sleep state, we can loose the sleep state.

Normal condition based wait-loops are immune to this problem, but for
sleep states that are not condition based are subject to this problem.

There already is a fix for TASK_DEAD. Abstract that and also apply it
to TASK_STOPPED and TASK_TRACED, both of which are also without
condition based wait-loop.

Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-04 07:54:54 +02:00
Peter Zijlstra
85f1abe001 kthread, sched/wait: Fix kthread_parkme() completion issue
Even with the wait-loop fixed, there is a further issue with
kthread_parkme(). Upon hotplug, when we do takedown_cpu(),
smpboot_park_threads() can return before all those threads are in fact
blocked, due to the placement of the complete() in __kthread_parkme().

When that happens, sched_cpu_dying() -> migrate_tasks() can end up
migrating such a still runnable task onto another CPU.

Normally the task will have hit schedule() and gone to sleep by the
time we do kthread_unpark(), which will then do __kthread_bind() to
re-bind the task to the correct CPU.

However, when we loose the initial TASK_PARKED store to the concurrent
wakeup issue described previously, do the complete(), get migrated, it
is possible to either:

 - observe kthread_unpark()'s clearing of SHOULD_PARK and terminate
   the park and set TASK_RUNNING, or

 - __kthread_bind()'s wait_task_inactive() to observe the competing
   TASK_RUNNING store.

Either way the WARN() in __kthread_bind() will trigger and fail to
correctly set the CPU affinity.

Fix this by only issuing the complete() when the kthread has scheduled
out. This does away with all the icky 'still running' nonsense.

The alternative is to promote TASK_PARKED to a special state, this
guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING
and we'll end up doing the right thing, but this preserves the whole
icky business of potentially migating the still runnable thing.

Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-03 07:38:05 +02:00
Vincent Guittot
457be908c8 sched/fair: Fix the update of blocked load when newly idle
With commit:

  31e77c93e4 ("sched/fair: Update blocked load when newly idle")

... we release the rq->lock when updating blocked load of idle CPUs.

This opens a time window during which another CPU can add a task to this
CPU's cfs_rq.

The check for newly added task of idle_balance() is not in the common path.
Move the out label to include this check.

Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 31e77c93e4 ("sched/fair: Update blocked load when newly idle")
Link: http://lkml.kernel.org/r/20180426103133.GA6953@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-03 07:38:03 +02:00
Linus Torvalds
71b8ebbf3d Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "A few scheduler fixes:

   - Prevent a bogus warning vs. runqueue clock update flags in
     do_sched_rt_period_timer()

   - Simplify the helper functions which handle requests for skipping
     the runqueue clock updat.

   - Do not unlock the tunables mutex in the error path of the cpu
     frequency scheduler utils. Its not held.

   - Enforce proper alignement for 'struct util_est' in sched_avg to
     prevent a misalignment fault on IA64"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Force proper alignment of 'struct util_est'
  sched/core: Simplify helpers for rq clock update skip requests
  sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning
  sched/cpufreq/schedutil: Fix error path mutex unlock
2018-04-15 12:43:30 -07:00
Linus Torvalds
1fe43114ea Merge tag 'pm-4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
 "These include one big-ticket item which is the rework of the idle loop
  in order to prevent CPUs from spending too much time in shallow idle
  states. It reduces idle power on some systems by 10% or more and may
  improve performance of workloads in which the idle loop overhead
  matters. This has been in the works for several weeks and it has been
  tested and reviewed quite thoroughly.

  Also included are changes that finalize the cpufreq cleanup moving
  frequency table validation from drivers to the core, a few fixes and
  cleanups of cpufreq drivers, a cpuidle documentation update and a PM
  QoS core update to mark the expected switch fall-throughs in it.

  Specifics:

   - Rework the idle loop in order to prevent CPUs from spending too
     much time in shallow idle states by making it stop the scheduler
     tick before putting the CPU into an idle state only if the idle
     duration predicted by the idle governor is long enough.

     That required the code to be reordered to invoke the idle governor
     before stopping the tick, among other things (Rafael Wysocki,
     Frederic Weisbecker, Arnd Bergmann).

   - Add the missing description of the residency sysfs attribute to the
     cpuidle documentation (Prashanth Prakash).

   - Finalize the cpufreq cleanup moving frequency table validation from
     drivers to the core (Viresh Kumar).

   - Fix a clock leak regression in the armada-37xx cpufreq driver
     (Gregory Clement).

   - Fix the initialization of the CPU performance data structures for
     shared policies in the CPPC cpufreq driver (Shunyong Yang).

   - Clean up the ti-cpufreq, intel_pstate and CPPC cpufreq drivers a
     bit (Viresh Kumar, Rafael Wysocki).

   - Mark the expected switch fall-throughs in the PM QoS core (Gustavo
     Silva)"

* tag 'pm-4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
  tick-sched: avoid a maybe-uninitialized warning
  cpufreq: Drop cpufreq_table_validate_and_show()
  cpufreq: SCMI: Don't validate the frequency table twice
  cpufreq: CPPC: Initialize shared perf capabilities of CPUs
  cpufreq: armada-37xx: Fix clock leak
  cpufreq: CPPC: Don't set transition_latency
  cpufreq: ti-cpufreq: Use builtin_platform_driver()
  cpufreq: intel_pstate: Do not include debugfs.h
  PM / QoS: mark expected switch fall-throughs
  cpuidle: Add definition of residency to sysfs documentation
  time: hrtimer: Use timerqueue_iterate_next() to get to the next timer
  nohz: Avoid duplication of code related to got_idle_tick
  nohz: Gather tick_sched booleans under a common flag field
  cpuidle: menu: Avoid selecting shallow states with stopped tick
  cpuidle: menu: Refine idle state selection for running tick
  sched: idle: Select idle state before stopping the tick
  time: hrtimer: Introduce hrtimer_next_event_without()
  time: tick-sched: Split tick_nohz_stop_sched_tick()
  cpuidle: Return nohz hint from cpuidle_select()
  jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC
  ...
2018-04-11 17:03:20 -07:00
Rafael J. Wysocki
554c8aa8ec sched: idle: Select idle state before stopping the tick
In order to address the issue with short idle duration predictions
by the idle governor after the scheduler tick has been stopped,
reorder the code in cpuidle_idle_call() so that the governor idle
state selection runs before tick_nohz_idle_go_idle() and use the
"nohz" hint returned by cpuidle_select() to decide whether or not
to stop the tick.

This isn't straightforward, because menu_select() invokes
tick_nohz_get_sleep_length() to get the time to the next timer
event and the number returned by the latter comes from
__tick_nohz_idle_stop_tick().  Fortunately, however, it is possible
to compute that number without actually stopping the tick and with
the help of the existing code.

Namely, tick_nohz_get_sleep_length() can be made call
tick_nohz_next_event(), introduced earlier, to get the time to the
next non-highres timer event.  If that happens, tick_nohz_next_event()
need not be called by __tick_nohz_idle_stop_tick() again.

If it turns out that the scheduler tick cannot be stopped going
forward or the next timer event is too close for the tick to be
stopped, tick_nohz_get_sleep_length() can simply return the time to
the next event currently programmed into the corresponding clock
event device.

In addition to knowing the return value of tick_nohz_next_event(),
however, tick_nohz_get_sleep_length() needs to know the time to the
next highres timer event, but with the scheduler tick timer excluded,
which can be computed with the help of hrtimer_get_next_event().

That minimum of that number and the tick_nohz_next_event() return
value is the total time to the next timer event with the assumption
that the tick will be stopped.  It can be returned to the idle
governor which can use it for predicting idle duration (under the
assumption that the tick will be stopped) and deciding whether or
not it makes sense to stop the tick before putting the CPU into the
selected idle state.

With the above, the sleep_length field in struct tick_sched is not
necessary any more, so drop it.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199227
Reported-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Thomas Ilsche <thomas.ilsche@tu-dresden.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2018-04-09 11:54:07 +02:00
Rafael J. Wysocki
45f1ff59e2 cpuidle: Return nohz hint from cpuidle_select()
Add a new pointer argument to cpuidle_select() and to the ->select
cpuidle governor callback to allow a boolean value indicating
whether or not the tick should be stopped before entering the
selected state to be returned from there.

Make the ladder governor ignore that pointer (to preserve its
current behavior) and make the menu governor return 'false" through
it if:
 (1) the idle exit latency is constrained at 0, or
 (2) the selected state is a polling one, or
 (3) the expected idle period duration is within the tick period
     range.

In addition to that, the correction factor computations in the menu
governor need to take the possibility that the tick may not be
stopped into account to avoid artificially small correction factor
values.  To that end, add a mechanism to record tick wakeups, as
suggested by Peter Zijlstra, and use it to modify the menu_update()
behavior when tick wakeup occurs.  Namely, if the CPU is woken up by
the tick and the return value of tick_nohz_get_sleep_length() is not
within the tick boundary, the predicted idle duration is likely too
short, so make menu_update() try to compensate for that by updating
the governor statistics as though the CPU was idle for a long time.

Since the value returned through the new argument pointer of
cpuidle_select() is not used by its caller yet, this change by
itself is not expected to alter the functionality of the code.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-06 09:29:34 +02:00
Mark Rutland
3eda69c92d kernel/fork.c: detect early free of a live mm
KASAN splats indicate that in some cases we free a live mm, then
continue to access it, with potentially disastrous results.  This is
likely due to a mismatched mmdrop() somewhere in the kernel, but so far
the culprit remains elusive.

Let's have __mmdrop() verify that the mm isn't live for the current
task, similar to the existing check for init_mm.  This way, we can catch
this class of issue earlier, and without requiring KASAN.

Currently, idle_task_exit() leaves active_mm stale after it switches to
init_mm.  This isn't harmful, but will trigger the new assertions, so we
must adjust idle_task_exit() to update active_mm.

Link: http://lkml.kernel.org/r/20180312140103.19235-1-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Rafael J. Wysocki
ed98c34919 sched: idle: Do not stop the tick before cpuidle_idle_call()
Make cpuidle_idle_call() decide whether or not to stop the tick.

First, the cpuidle_enter_s2idle() path deals with the tick (and with
the entire timekeeping for that matter) by itself and it doesn't need
the tick to be stopped beforehand.

Second, to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, it will be
necessary to change the ordering of cpuidle_select() with respect
to tick_nohz_idle_stop_tick().  To prepare for that, put a
tick_nohz_idle_stop_tick() call in the same branch in which
cpuidle_select() is called.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-05 19:01:29 +02:00
Rafael J. Wysocki
2aaf709a51 sched: idle: Do not stop the tick upfront in the idle loop
Push the decision whether or not to stop the tick somewhat deeper
into the idle loop.

Stopping the tick upfront leads to unpleasant outcomes in case the
idle governor doesn't agree with the nohz code on the duration of the
upcoming idle period.  Specifically, if the tick has been stopped and
the idle governor predicts short idle, the situation is bad regardless
of whether or not the prediction is accurate.  If it is accurate, the
tick has been stopped unnecessarily which means excessive overhead.
If it is not accurate, the CPU is likely to spend too much time in
the (shallow, because short idle has been predicted) idle state
selected by the governor [1].

As the first step towards addressing this problem, change the code
to make the tick stopping decision inside of the loop in do_idle().
In particular, do not stop the tick in the cpu_idle_poll() code path.
Also don't do that in tick_nohz_irq_exit() which doesn't really have
enough information on whether or not to stop the tick.

Link: https://marc.info/?l=linux-pm&m=150116085925208&w=2 # [1]
Link: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-05 19:01:14 +02:00
Rafael J. Wysocki
0e7767687f time: tick-sched: Reorganize idle tick management code
Prepare the scheduler tick code for reworking the idle loop to
avoid stopping the tick in some cases.

The idea is to split the nohz idle entry call to decouple the idle
time stats accounting and preparatory work from the actual tick stop
code, in order to later be able to delay the tick stop once we reach
more power-knowledgeable callers.

Move away the tick_nohz_start_idle() invocation from
__tick_nohz_idle_enter(), rename the latter to
__tick_nohz_idle_stop_tick() and define tick_nohz_idle_stop_tick()
as a wrapper around it for calling it from the outside.

Make tick_nohz_idle_enter() only call tick_nohz_start_idle() instead
of calling the entire __tick_nohz_idle_enter(), add another wrapper
disabling and enabling interrupts around tick_nohz_idle_stop_tick()
and make the current callers of tick_nohz_idle_enter() call it too
to retain their current functionality.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-05 18:58:47 +02:00
Davidlohr Bueso
adcc8da885 sched/core: Simplify helpers for rq clock update skip requests
By renaming the functions we can get rid of the skip parameter
and have better code redability. It makes zero sense to have
things such as:

  rq_clock_skip_update(rq, false)

When the skip request is in fact not going to happen. Ever. Rename
things such that we end up with:

  rq_clock_skip_update(rq)
  rq_clock_cancel_skipupdate(rq)

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: matt@codeblueprint.co.uk
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20180404161539.nhadkff2aats74jh@linux-n805
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-05 09:20:46 +02:00
Davidlohr Bueso
d29a20645d sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning
While running rt-tests' pi_stress program I got the following splat:

  rq->clock_update_flags < RQCF_ACT_SKIP
  WARNING: CPU: 27 PID: 0 at kernel/sched/sched.h:960 assert_clock_updated.isra.38.part.39+0x13/0x20

  [...]

  <IRQ>
  enqueue_top_rt_rq+0xf4/0x150
  ? cpufreq_dbs_governor_start+0x170/0x170
  sched_rt_rq_enqueue+0x65/0x80
  sched_rt_period_timer+0x156/0x360
  ? sched_rt_rq_enqueue+0x80/0x80
  __hrtimer_run_queues+0xfa/0x260
  hrtimer_interrupt+0xcb/0x220
  smp_apic_timer_interrupt+0x62/0x120
  apic_timer_interrupt+0xf/0x20
  </IRQ>

  [...]

  do_idle+0x183/0x1e0
  cpu_startup_entry+0x5f/0x70
  start_secondary+0x192/0x1d0
  secondary_startup_64+0xa5/0xb0

We can get rid of it be the "traditional" means of adding an
update_rq_clock() call after acquiring the rq->lock in
do_sched_rt_period_timer().

The case for the RT task throttling (which this workload also hits)
can be ignored in that the skip_update call is actually bogus and
quite the contrary (the request bits are removed/reverted).

By setting RQCF_UPDATED we really don't care if the skip is happening
or not and will therefore make the assert_clock_updated() check happy.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: linux-kernel@vger.kernel.org
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20180402164954.16255-1-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-05 09:20:46 +02:00
Ingo Molnar
ea2a6af517 Merge branch 'linus' into sched/urgent, to pick up fixes and updates
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-05 09:20:34 +02:00
Linus Torvalds
642e7fd233 Merge branch 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux
Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
 "System calls are interaction points between userspace and the kernel.
  Therefore, system call functions such as sys_xyzzy() or
  compat_sys_xyzzy() should only be called from userspace via the
  syscall table, but not from elsewhere in the kernel.

  At least on 64-bit x86, it will likely be a hard requirement from
  v4.17 onwards to not call system call functions in the kernel: It is
  better to use use a different calling convention for system calls
  there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
  which then hands processing over to the actual syscall function. This
  means that only those parameters which are actually needed for a
  specific syscall are passed on during syscall entry, instead of
  filling in six CPU registers with random user space content all the
  time (which may cause serious trouble down the call chain). Those
  x86-specific patches will be pushed through the x86 tree in the near
  future.

  Moreover, rules on how data may be accessed may differ between kernel
  data and user data. This is another reason why calling sys_xyzzy() is
  generally a bad idea, and -- at most -- acceptable in arch-specific
  code.

  This patchset removes all in-kernel calls to syscall functions in the
  kernel with the exception of arch/. On top of this, it cleans up the
  three places where many syscalls are referenced or prototyped, namely
  kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"

* 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
  bpf: whitelist all syscalls for error injection
  kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
  kernel/sys_ni: sort cond_syscall() entries
  syscalls/x86: auto-create compat_sys_*() prototypes
  syscalls: sort syscall prototypes in include/linux/compat.h
  net: remove compat_sys_*() prototypes from net/compat.h
  syscalls: sort syscall prototypes in include/linux/syscalls.h
  kexec: move sys_kexec_load() prototype to syscalls.h
  x86/sigreturn: use SYSCALL_DEFINE0
  x86: fix sys_sigreturn() return type to be long, not unsigned long
  x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
  mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
  mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
  mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
  fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
  fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
  fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
  fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
  kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
  kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
  ...
2018-04-02 21:22:12 -07:00
Linus Torvalds
ce6eba3dba Merge branch 'sched-wait-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull wait_var_event updates from Ingo Molnar:
 "This introduces the new wait_var_event() API, which is a more flexible
  waiting primitive than wait_on_atomic_t().

  All wait_on_atomic_t() users are migrated over to the new API and
  wait_on_atomic_t() is removed. The migration fixes one bug and should
  result in no functional changes for the other usecases"

* 'sched-wait-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/wait: Improve __var_waitqueue() code generation
  sched/wait: Remove the wait_on_atomic_t() API
  sched/wait, arch/mips: Fix and convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/ocfs2: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/fscache: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/btrfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/afs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, drivers/media: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, drivers/drm: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait: Introduce wait_var_event()
2018-04-02 16:50:39 -07:00