Commit Graph

2282 Commits

Author SHA1 Message Date
Linus Torvalds
7d8d20191c Merge tag 'timers-core-2023-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more timer updates from Thomas Gleixner:
 "Timekeeping and clocksource/event driver updates the second batch:

   - A trivial documentation fix in the timekeeping core

   - A really boring set of small fixes, enhancements and cleanups in
     the drivers code. No new clocksource/clockevent drivers for a
     change"

* tag 'timers-core-2023-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Fix references to nonexistent ktime_get_fast_ns()
  dt-bindings: timer: rockchip: Add rk3588 compatible
  dt-bindings: timer: rockchip: Drop superfluous rk3288 compatible
  clocksource/drivers/ti: Use of_property_read_bool() for boolean properties
  clocksource/drivers/timer-ti-dm: Fix finding alwon timer
  clocksource/drivers/davinci: Fix memory leak in davinci_timer_register when init fails
  clocksource/drivers/stm32-lp: Drop of_match_ptr for ID table
  clocksource/drivers/timer-ti-dm: Convert to platform remove callback returning void
  clocksource/drivers/timer-tegra186: Convert to platform remove callback returning void
  clocksource/drivers/timer-ti-dm: Improve error message in .remove
  clocksource/drivers/timer-stm32-lp: Mark driver as non-removable
  clocksource/drivers/sh_mtu2: Mark driver as non-removable
  clocksource/drivers/timer-ti-dm: Use of_address_to_resource()
  clocksource/drivers/timer-imx-gpt: Remove non-DT function
  clocksource/drivers/timer-mediatek: Split out CPUXGPT timers
  clocksource/drivers/exynos_mct: Explicitly return 0 for shared timer
2023-04-29 10:24:30 -07:00
Linus Torvalds
556eb8b791 Merge tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
 "Here is the large set of driver core changes for 6.4-rc1.

  Once again, a busy development cycle, with lots of changes happening
  in the driver core in the quest to be able to move "struct bus" and
  "struct class" into read-only memory, a task now complete with these
  changes.

  This will make the future rust interactions with the driver core more
  "provably correct" as well as providing more obvious lifetime rules
  for all busses and classes in the kernel.

  The changes required for this did touch many individual classes and
  busses as many callbacks were changed to take const * parameters
  instead. All of these changes have been submitted to the various
  subsystem maintainers, giving them plenty of time to review, and most
  of them actually did so.

  Other than those changes, included in here are a small set of other
  things:

   - kobject logging improvements

   - cacheinfo improvements and updates

   - obligatory fw_devlink updates and fixes

   - documentation updates

   - device property cleanups and const * changes

   - firwmare loader dependency fixes.

  All of these have been in linux-next for a while with no reported
  problems"

* tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits)
  device property: make device_property functions take const device *
  driver core: update comments in device_rename()
  driver core: Don't require dynamic_debug for initcall_debug probe timing
  firmware_loader: rework crypto dependencies
  firmware_loader: Strip off \n from customized path
  zram: fix up permission for the hot_add sysfs file
  cacheinfo: Add use_arch[|_cache]_info field/function
  arch_topology: Remove early cacheinfo error message if -ENOENT
  cacheinfo: Check cache properties are present in DT
  cacheinfo: Check sib_leaf in cache_leaves_are_shared()
  cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
  cacheinfo: Add arm64 early level initializer implementation
  cacheinfo: Add arch specific early level initializer
  tty: make tty_class a static const structure
  driver core: class: remove struct class_interface * from callbacks
  driver core: class: mark the struct class in struct class_interface constant
  driver core: class: make class_register() take a const *
  driver core: class: mark class_release() as taking a const *
  driver core: remove incorrect comment for device_create*
  MIPS: vpe-cmp: remove module owner pointer from struct class usage.
  ...
2023-04-27 11:53:57 -07:00
Geert Uytterhoeven
158009f1b4 timekeeping: Fix references to nonexistent ktime_get_fast_ns()
There was never a function named ktime_get_fast_ns().
Presumably these should refer to ktime_get_mono_fast_ns() instead.

Fixes: c1ce406e80 ("timekeeping: Fix up function documentation for the NMI safe accessors")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/06df7b3cbd94f016403bbf6cd2b38e4368e7468f.1682516546.git.geert+renesas@glider.be
2023-04-26 23:43:16 +02:00
Linus Torvalds
e7989789c6 Merge tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timers and timekeeping updates from Thomas Gleixner:

 - Improve the VDSO build time checks to cover all dynamic relocations

   VDSO does not allow dynamic relocations, but the build time check is
   incomplete and fragile.

   It's based on architectures specifying the relocation types to search
   for and does not handle R_*_NONE relocation entries correctly.
   R_*_NONE relocations are injected by some GNU ld variants if they
   fail to determine the exact .rel[a]/dyn_size to cover trailing zeros.
   R_*_NONE relocations must be ignored by dynamic loaders, so they
   should be ignored in the build time check too.

   Remove the architecture specific relocation types to check for and
   validate strictly that no other relocations than R_*_NONE end up in
   the VSDO .so file.

 - Prefer signal delivery to the current thread for
   CLOCK_PROCESS_CPUTIME_ID based posix-timers

   Such timers prefer to deliver the signal to the main thread of a
   process even if the context in which the timer expires is the current
   task. This has the downside that it might wake up an idle thread.

   As there is no requirement or guarantee that the signal has to be
   delivered to the main thread, avoid this by preferring the current
   task if it is part of the thread group which shares sighand.

   This not only avoids waking idle threads, it also distributes the
   signal delivery in case of multiple timers firing in the context of
   different threads close to each other better.

 - Align the tick period properly (again)

   For a long time the tick was starting at CLOCK_MONOTONIC zero, which
   allowed users space applications to either align with the tick or to
   place a periodic computation so that it does not interfere with the
   tick. The alignement of the tick period was more by chance than by
   intention as the tick is set up before a high resolution clocksource
   is installed, i.e. timekeeping is still tick based and the tick
   period advances from there.

   The early enablement of sched_clock() broke this alignement as the
   time accumulated by sched_clock() is taken into account when
   timekeeping is initialized. So the base value now(CLOCK_MONOTONIC) is
   not longer a multiple of tick periods, which breaks applications
   which relied on that behaviour.

   Cure this by aligning the tick starting point to the next multiple of
   tick periods, i.e 1000ms/CONFIG_HZ.

 - A set of NOHZ fixes and enhancements:

     * Cure the concurrent writer race for idle and IO sleeptime
       statistics

       The statitic values which are exposed via /proc/stat are updated
       from the CPU local idle exit and remotely by cpufreq, but that
       happens without any form of serialization. As a consequence
       sleeptimes can be accounted twice or worse.

       Prevent this by restricting the accumulation writeback to the CPU
       local idle exit and let the remote access compute the accumulated
       value.

     * Protect idle/iowait sleep time with a sequence count

       Reading idle/iowait sleep time, e.g. from /proc/stat, can race
       with idle exit updates. As a consequence the readout may result
       in random and potentially going backwards values.

       Protect this by a sequence count, which fixes the idle time
       statistics issue, but cannot fix the iowait time problem because
       iowait time accounting races with remote wake ups decrementing
       the remote runqueues nr_iowait counter. The latter is impossible
       to fix, so the only way to deal with that is to document it
       properly and to remove the assertion in the selftest which
       triggers occasionally due to that.

     * Restructure struct tick_sched for better cache layout

     * Some small cleanups and a better cache layout for struct
       tick_sched

 - Implement the missing timer_wait_running() callback for POSIX CPU
   timers

   For unknown reason the introduction of the timer_wait_running()
   callback missed to fixup posix CPU timers, which went unnoticed for
   almost four years.

   While initially only targeted to prevent livelocks between a timer
   deletion and the timer expiry function on PREEMPT_RT enabled kernels,
   it turned out that fixing this for mainline is not as trivial as just
   implementing a stub similar to the hrtimer/timer callbacks.

   The reason is that for CONFIG_POSIX_CPU_TIMERS_TASK_WORK enabled
   systems there is a livelock issue independent of RT.

   CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y moves the expiry of POSIX CPU
   timers out from hard interrupt context to task work, which is handled
   before returning to user space or to a VM. The expiry mechanism moves
   the expired timers to a stack local list head with sighand lock held.
   Once sighand is dropped the task can be preempted and a task which
   wants to delete a timer will spin-wait until the expiry task is
   scheduled back in. In the worst case this will end up in a livelock
   when the preempting task and the expiry task are pinned on the same
   CPU.

   The timer wheel has a timer_wait_running() mechanism for RT, which
   uses a per CPU timer-base expiry lock which is held by the expiry
   code and the task waiting for the timer function to complete blocks
   on that lock.

   This does not work in the same way for posix CPU timers as there is
   no timer base and expiry for process wide timers can run on any task
   belonging to that process, but the concept of waiting on an expiry
   lock can be used too in a slightly different way.

   Add a per task mutex to struct posix_cputimers_work, let the expiry
   task hold it accross the expiry function and let the deleting task
   which waits for the expiry to complete block on the mutex.

   In the non-contended case this results in an extra
   mutex_lock()/unlock() pair on both sides.

   This avoids spin-waiting on a task which is scheduled out, prevents
   the livelock and cures the problem for RT and !RT systems

* tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-cpu-timers: Implement the missing timer_wait_running callback
  selftests/proc: Assert clock_gettime(CLOCK_BOOTTIME) VS /proc/uptime monotonicity
  selftests/proc: Remove idle time monotonicity assertions
  MAINTAINERS: Remove stale email address
  timers/nohz: Remove middle-function __tick_nohz_idle_stop_tick()
  timers/nohz: Add a comment about broken iowait counter update race
  timers/nohz: Protect idle/iowait sleep time under seqcount
  timers/nohz: Only ever update sleeptime from idle exit
  timers/nohz: Restructure and reshuffle struct tick_sched
  tick/common: Align tick period with the HZ tick.
  selftests/timers/posix_timers: Test delivery of signals across threads
  posix-timers: Prefer delivery of signals to the current thread
  vdso: Improve cmd_vdso_check to check all dynamic relocations
2023-04-25 11:22:46 -07:00
Thomas Gleixner
f7abf14f00 posix-cpu-timers: Implement the missing timer_wait_running callback
For some unknown reason the introduction of the timer_wait_running callback
missed to fixup posix CPU timers, which went unnoticed for almost four years.
Marco reported recently that the WARN_ON() in timer_wait_running()
triggers with a posix CPU timer test case.

Posix CPU timers have two execution models for expiring timers depending on
CONFIG_POSIX_CPU_TIMERS_TASK_WORK:

1) If not enabled, the expiry happens in hard interrupt context so
   spin waiting on the remote CPU is reasonably time bound.

   Implement an empty stub function for that case.

2) If enabled, the expiry happens in task work before returning to user
   space or guest mode. The expired timers are marked as firing and moved
   from the timer queue to a local list head with sighand lock held. Once
   the timers are moved, sighand lock is dropped and the expiry happens in
   fully preemptible context. That means the expiring task can be scheduled
   out, migrated, interrupted etc. So spin waiting on it is more than
   suboptimal.

   The timer wheel has a timer_wait_running() mechanism for RT, which uses
   a per CPU timer-base expiry lock which is held by the expiry code and the
   task waiting for the timer function to complete blocks on that lock.

   This does not work in the same way for posix CPU timers as there is no
   timer base and expiry for process wide timers can run on any task
   belonging to that process, but the concept of waiting on an expiry lock
   can be used too in a slightly different way:

    - Add a mutex to struct posix_cputimers_work. This struct is per task
      and used to schedule the expiry task work from the timer interrupt.

    - Add a task_struct pointer to struct cpu_timer which is used to store
      a the task which runs the expiry. That's filled in when the task
      moves the expired timers to the local expiry list. That's not
      affecting the size of the k_itimer union as there are bigger union
      members already

    - Let the task take the expiry mutex around the expiry function

    - Let the waiter acquire a task reference with rcu_read_lock() held and
      block on the expiry mutex

   This avoids spin-waiting on a task which might not even be on a CPU and
   works nicely for RT too.

Fixes: ec8f954a40 ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT")
Reported-by: Marco Elver <elver@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marco Elver <elver@google.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87zg764ojw.ffs@tglx
2023-04-21 15:34:33 +02:00
Frederic Weisbecker
289dafed38 timers/nohz: Remove middle-function __tick_nohz_idle_stop_tick()
There is no need for the __tick_nohz_idle_stop_tick() function between
tick_nohz_idle_stop_tick() and its implementation. Remove that
unnecessary step.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230222144649.624380-6-frederic@kernel.org
2023-04-18 16:35:12 +02:00
Frederic Weisbecker
ead70b7523 timers/nohz: Add a comment about broken iowait counter update race
The per-cpu iowait task counter is incremented locally upon sleeping.
But since the task can be woken to (and by) another CPU, the counter may
then be decremented remotely. This is the source of a race involving
readers VS writer of idle/iowait sleeptime.

The following scenario shows an example where a /proc/stat reader
observes a pending sleep time as IO whereas that pending sleep time
later eventually gets accounted as non-IO.

    CPU 0                       CPU  1                    CPU 2
    -----                       -----                     ------
    //io_schedule() TASK A
    current->in_iowait = 1
    rq(0)->nr_iowait++
    //switch to idle
                        // READ /proc/stat
                        // See nr_iowait_cpu(0) == 1
                        return ts->iowait_sleeptime +
                               ktime_sub(ktime_get(), ts->idle_entrytime)

                                                          //try_to_wake_up(TASK A)
                                                          rq(0)->nr_iowait--
    //idle exit
    // See nr_iowait_cpu(0) == 0
    ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)

As a result subsequent reads on /proc/stat may expose backward progress.

This is unfortunately hardly fixable. Just add a comment about that
condition.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230222144649.624380-5-frederic@kernel.org
2023-04-18 16:35:12 +02:00
Frederic Weisbecker
620a30fa0b timers/nohz: Protect idle/iowait sleep time under seqcount
Reading idle/IO sleep time (eg: from /proc/stat) can race with idle exit
updates because the state machine handling the stats is not atomic and
requires a coherent read batch.

As a result reading the sleep time may report irrelevant or backward
values.

Fix this with protecting the simple state machine within a seqcount.
This is expected to be cheap enough not to add measurable performance
impact on the idle path.

Note this only fixes reader VS writer condition partitially. A race
remains that involves remote updates of the CPU iowait task counter. It
can hardly be fixed.

Reported-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230222144649.624380-4-frederic@kernel.org
2023-04-18 16:35:12 +02:00
Frederic Weisbecker
07b65a800b timers/nohz: Only ever update sleeptime from idle exit
The idle and IO sleeptime statistics appearing in /proc/stat can be
currently updated from two sites: locally on idle exit and remotely
by cpufreq. However there is no synchronization mechanism protecting
concurrent updates. It is therefore possible to account the sleeptime
twice, among all the other possible broken scenarios.

To prevent from breaking the sleeptime accounting source, restrict the
sleeptime updates to the local idle exit site. If there is a delta to
add since the last update, IO/Idle sleep time readers will now only
compute the delta without actually writing it back to the internal idle
statistic fields.

This fixes a writer VS writer race. Note there are still two known
reader VS writer races to handle. A subsequent patch will fix one.

Reported-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230222144649.624380-3-frederic@kernel.org
2023-04-18 16:35:12 +02:00
Frederic Weisbecker
605da849d5 timers/nohz: Restructure and reshuffle struct tick_sched
Restructure and group fields by access in order to optimize cache
layout. While at it, also add missing kernel doc for two fields:
@last_jiffies and @idle_expires.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230222144649.624380-2-frederic@kernel.org
2023-04-18 16:35:12 +02:00
Sebastian Andrzej Siewior
e9523a0d81 tick/common: Align tick period with the HZ tick.
With HIGHRES enabled tick_sched_timer() is programmed every jiffy to
expire the timer_list timers. This timer is programmed accurate in
respect to CLOCK_MONOTONIC so that 0 seconds and nanoseconds is the
first tick and the next one is 1000/CONFIG_HZ ms later. For HZ=250 it is
every 4 ms and so based on the current time the next tick can be
computed.

This accuracy broke since the commit mentioned below because the jiffy
based clocksource is initialized with higher accuracy in
read_persistent_wall_and_boot_offset(). This higher accuracy is
inherited during the setup in tick_setup_device(). The timer still fires
every 4ms with HZ=250 but timer is no longer aligned with
CLOCK_MONOTONIC with 0 as it origin but has an offset in the us/ns part
of the timestamp. The offset differs with every boot and makes it
impossible for user land to align with the tick.

Align the tick period with CLOCK_MONOTONIC ensuring that it is always a
multiple of 1000/CONFIG_HZ ms.

Fixes: 857baa87b6 ("sched/clock: Enable sched clock early")
Reported-by: Gusenleitner Klaus <gus@keba.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/20230406095735.0_14edn3@linutronix.de
Link: https://lore.kernel.org/r/20230418122639.ikgfvu3f@linutronix.de
2023-04-18 15:06:50 +02:00
Zqiang
db7b464df9 rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
This commit adds checks for the TICK_DEP_MASK_RCU_EXP bit, thus enabling
RCU expedited grace periods to actually force-enable scheduling-clock
interrupts on holdout CPUs.

Fixes: df1e849ae4 ("rcu: Enable tick for nohz_full CPUs slow to provide expedited QS")
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:43 +00:00
Joel Fernandes (Google)
58d7668242 tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined.
However, cpu_is_hotpluggable() still returns true for those CPUs. This causes
torture tests that do offlining to end up trying to offline this CPU causing
test failures. Such failure happens on all architectures.

Fix the repeated error messages thrown by this (even if the hotplug errors are
harmless) by asking the opinion of the nohz subsystem on whether the CPU can be
hotplugged.

[ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ]

For drivers/base/ portion:
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Zhouyi Zhou <zhouzhouyi@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: rcu <rcu@vger.kernel.org>
Cc: stable@vger.kernel.org
Fixes: 2987557f52 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:43 +00:00
Greg Kroah-Hartman
2243acd50a driver core: class: remove struct class_interface * from callbacks
The add_dev and remove_dev callbacks in struct class_interface currently
pass in a pointer back to the class_interface structure that is calling
them, but none of the callback implementations actually use this pointer
as it is pointless (the structure is known, the driver passed it in in
the first place if it is really needed again.)

So clean this up and just remove the pointer from the callbacks and fix
up all callback functions.

Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Kurt Schwemmer <kurt.schwemmer@microsemi.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Alexandre Bounine <alex.bou9@gmail.com>
Cc: "James E.J. Bottomley" <jejb@linux.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Wang Weiyang <wangweiyang2@huawei.com>
Cc: Yang Yingliang <yangyingliang@huawei.com>
Cc: Jakob Koschel <jakobkoschel@gmail.com>
Cc: Cai Xinchen <caixinchen1@huawei.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Logan Gunthorpe <logang@deltatee.com>
Link: https://lore.kernel.org/r/2023040250-pushover-platter-509c@gregkh
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-03 21:42:52 +02:00
Linus Torvalds
560b803067 Merge tag 'timers-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "Updates for timekeeping, timers and clockevent/source drivers:

  Core:

   - Yet another round of improvements to make the clocksource watchdog
     more robust:

       - Relax the clocksource-watchdog skew criteria to match the NTP
         criteria.

       - Temporarily skip the watchdog when high memory latencies are
         detected which can lead to false-positives.

       - Provide an option to enable TSC skew detection even on systems
         where TSC is marked as reliable.

     Sigh!

   - Initialize the restart block in the nanosleep syscalls to be
     directed to the no restart function instead of doing a partial
     setup on entry.

     This prevents an erroneous restart_syscall() invocation from
     corrupting user space data. While such a situation is clearly a
     user space bug, preventing this is a correctness issue and caters
     to the least suprise principle.

   - Ignore the hrtimer slack for realtime tasks in schedule_hrtimeout()
     to align it with the nanosleep semantics.

  Drivers:

   - The obligatory new driver bindings for Mediatek, Rockchip and
     RISC-V variants.

   - Add support for the C3STOP misfeature to the RISC-V timer to handle
     the case where the timer stops in deeper idle state.

   - Set up a static key in the RISC-V timer correctly before first use.

   - The usual small improvements and fixes all over the place"

* tag 'timers-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  clocksource/drivers/timer-sun4i: Add CLOCK_EVT_FEAT_DYNIRQ
  clocksource/drivers/em_sti: Mark driver as non-removable
  clocksource/drivers/sh_tmu: Mark driver as non-removable
  clocksource/drivers/riscv: Patch riscv_clock_next_event() jump before first use
  clocksource/drivers/timer-microchip-pit64b: Add delay timer
  clocksource/drivers/timer-microchip-pit64b: Select driver only on ARM
  dt-bindings: timer: sifive,clint: add comaptibles for T-Head's C9xx
  dt-bindings: timer: mediatek,mtk-timer: add MT8365
  clocksource/drivers/riscv: Get rid of clocksource_arch_init() callback
  clocksource/drivers/sh_cmt: Mark driver as non-removable
  clocksource/drivers/timer-microchip-pit64b: Drop obsolete dependency on COMPILE_TEST
  clocksource/drivers/riscv: Increase the clock source rating
  clocksource/drivers/timer-riscv: Set CLOCK_EVT_FEAT_C3STOP based on DT
  dt-bindings: timer: Add bindings for the RISC-V timer device
  RISC-V: time: initialize hrtimer based broadcast clock event device
  dt-bindings: timer: rk-timer: Add rktimer for rv1126
  time/debug: Fix memory leak with using debugfs_lookup()
  clocksource: Enable TSC watchdog checking of HPET and PMTMR only when requested
  posix-timers: Use atomic64_try_cmpxchg() in __update_gt_cputime()
  clocksource: Verify HPET and PMTMR when TSC unverified
  ...
2023-02-21 09:45:13 -08:00
Linus Torvalds
1f2d9ffc7a Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Improve the scalability of the CFS bandwidth unthrottling logic with
   large number of CPUs.

 - Fix & rework various cpuidle routines, simplify interaction with the
   generic scheduler code. Add __cpuidle methods as noinstr to objtool's
   noinstr detection and fix boatloads of cpuidle bugs & quirks.

 - Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query
   previously issued registrations.

 - Limit scheduler slice duration to the sysctl_sched_latency period, to
   improve scheduling granularity with a large number of SCHED_IDLE
   tasks.

 - Debuggability enhancement on sys_exit(): warn about disabled IRQs,
   but also enable them to prevent a cascade of followup problems and
   repeat warnings.

 - Fix the rescheduling logic in prio_changed_dl().

 - Micro-optimize cpufreq and sched-util methods.

 - Micro-optimize ttwu_runnable()

 - Micro-optimize the idle-scanning in update_numa_stats(),
   select_idle_capacity() and steal_cookie_task().

 - Update the RSEQ code & self-tests

 - Constify various scheduler methods

 - Remove unused methods

 - Refine __init tags

 - Documentation updates

 - Misc other cleanups, fixes

* tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
  sched/rt: pick_next_rt_entity(): check list_entry
  sched/deadline: Add more reschedule cases to prio_changed_dl()
  sched/fair: sanitize vruntime of entity being placed
  sched/fair: Remove capacity inversion detection
  sched/fair: unlink misfit task from cpu overutilized
  objtool: mem*() are not uaccess safe
  cpuidle: Fix poll_idle() noinstr annotation
  sched/clock: Make local_clock() noinstr
  sched/clock/x86: Mark sched_clock() noinstr
  x86/pvclock: Improve atomic update of last_value in pvclock_clocksource_read()
  x86/atomics: Always inline arch_atomic64*()
  cpuidle: tracing, preempt: Squash _rcuidle tracing
  cpuidle: tracing: Warn about !rcu_is_watching()
  cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG
  cpuidle: drivers: firmware: psci: Dont instrument suspend code
  KVM: selftests: Fix build of rseq test
  exit: Detect and fix irq disabled state in oops
  cpuidle, arm64: Fix the ARM64 cpuidle logic
  cpuidle: mvebu: Fix duplicate flags assignment
  sched/fair: Limit sched slice duration
  ...
2023-02-20 17:41:08 -08:00
Thomas Gleixner
d125d1349a alarmtimer: Prevent starvation by small intervals and SIG_IGN
syzbot reported a RCU stall which is caused by setting up an alarmtimer
with a very small interval and ignoring the signal. The reproducer arms the
alarm timer with a relative expiry of 8ns and an interval of 9ns. Not a
problem per se, but that's an issue when the signal is ignored because then
the timer is immediately rearmed because there is no way to delay that
rearming to the signal delivery path.  See posix_timer_fn() and commit
58229a1899 ("posix-timers: Prevent softirq starvation by small intervals
and SIG_IGN") for details.

The reproducer does not set SIG_IGN explicitely, but it sets up the timers
signal with SIGCONT. That has the same effect as explicitely setting
SIG_IGN for a signal as SIGCONT is ignored if there is no handler set and
the task is not ptraced.

The log clearly shows that:

   [pid  5102] --- SIGCONT {si_signo=SIGCONT, si_code=SI_TIMER, si_timerid=0, si_overrun=316014, si_int=0, si_ptr=NULL} ---

It works because the tasks are traced and therefore the signal is queued so
the tracer can see it, which delays the restart of the timer to the signal
delivery path. But then the tracer is killed:

   [pid  5087] kill(-5102, SIGKILL <unfinished ...>
   ...
   ./strace-static-x86_64: Process 5107 detached

and after it's gone the stall can be observed:

   syzkaller login: [   79.439102][    C0] hrtimer: interrupt took 68471 ns
   [  184.460538][    C1] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
   ...
   [  184.658237][    C1] rcu: Stack dump where RCU GP kthread last ran:
   [  184.664574][    C1] Sending NMI from CPU 1 to CPUs 0:
   [  184.669821][    C0] NMI backtrace for cpu 0
   [  184.669831][    C0] CPU: 0 PID: 5108 Comm: syz-executor192 Not tainted 6.2.0-rc6-next-20230203-syzkaller #0
   ...
   [  184.670036][    C0] Call Trace:
   [  184.670041][    C0]  <IRQ>
   [  184.670045][    C0]  alarmtimer_fired+0x327/0x670

posix_timer_fn() prevents that by checking whether the interval for
timers which have the signal ignored is smaller than a jiffie and
artifically delay it by shifting the next expiry out by a jiffie. That's
accurate vs. the overrun accounting, but slightly inaccurate
vs. timer_gettimer(2).

The comment in that function says what needs to be done and there was a fix
available for the regular userspace induced SIG_IGN mechanism, but that did
not work due to the implicit ignore for SIGCONT and similar signals. This
needs to be worked on, but for now the only available workaround is to do
exactly what posix_timer_fn() does:

Increase the interval of self-rearming timers, which have their signal
ignored, to at least a jiffie.

Interestingly this has been fixed before via commit ff86bf0c65
("alarmtimer: Rate limit periodic intervals") already, but that fix got
lost in a later rework.

Reported-by: syzbot+b9564ba6e8e00694511b@syzkaller.appspotmail.com
Fixes: f2c45807d3 ("alarmtimer: Switch over to generic set/get/rearm routine")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87k00q1no2.ffs@tglx
2023-02-14 11:18:35 +01:00
Thomas Gleixner
ab407a1919 Merge tag 'clocksource.2023.02.06b' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into timers/core
Pull clocksource watchdog changes from Paul McKenney:

     o	Improvements to clocksource-watchdog console messages.

     o	Loosening of the clocksource-watchdog skew criteria to match
     	those of NTP (500 parts per million, relaxed from 400 parts
     	per million).  If it is good enough for NTP, it is good enough
     	for the clocksource watchdog.

     o	Suspend clocksource-watchdog checking temporarily when high
     	memory latencies are detected.	This avoids the false-positive
     	clock-skew events that have been seen on production systems
     	running memory-intensive workloads.

     o	On systems where the TSC is deemed trustworthy, use it as the
     	watchdog timesource, but only when specifically requested using
     	the tsc=watchdog kernel boot parameter.  This permits clock-skew
     	events to be detected, but avoids forcing workloads to use the
     	slow HPET and ACPI PM timers.  These last two timers are slow
     	enough to cause systems to be needlessly marked bad on the one
     	hand, and real skew does sometimes happen on production systems
     	running production workloads on the other.  And sometimes it is
     	the fault of the TSC, or at least of the firmware that told the
     	kernel to program the TSC with the wrong frequency.

     o	Add a tsc=revalidate kernel boot parameter to allow the kernel
     	to diagnose cases where the TSC hardware works fine, but was told
     	by firmware to tick at the wrong frequency.  Such cases are rare,
     	but they really have happened on production systems.

Link: https://lore.kernel.org/r/20230210193640.GA3325193@paulmck-ThinkPad-P17-Gen-1
2023-02-13 19:28:48 +01:00
Greg Kroah-Hartman
5b268d8aba time/debug: Fix memory leak with using debugfs_lookup()
When calling debugfs_lookup() the result must have dput() called on it,
otherwise the memory will leak over time.  To make things simpler, just
call debugfs_lookup_and_remove() instead which handles all of the logic at
once.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230202151214.2306822-1-gregkh@linuxfoundation.org
2023-02-09 20:12:27 +01:00
Uros Bizjak
915d4ad383 posix-timers: Use atomic64_try_cmpxchg() in __update_gt_cputime()
Use atomic64_try_cmpxchg() instead of atomic64_cmpxchg() in
__update_gt_cputime(). The x86 CMPXCHG instruction returns success in ZF
flag, so this change saves a compare after cmpxchg() (and related move
instruction in front of cmpxchg()).

Also, atomic64_try_cmpxchg() implicitly assigns old *ptr value to "old"
when cmpxchg() fails.  There is no need to re-read the value in the loop.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230116165337.5810-1-ubizjak@gmail.com
2023-02-06 14:22:09 +01:00
Ingo Molnar
57a30218fa Merge tag 'v6.2-rc6' into sched/core, to pick up fixes
Pick up fixes before merging another batch of cpuidle updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-01-31 15:01:20 +01:00
Davidlohr Bueso
0c52310f26 hrtimer: Ignore slack time for RT tasks in schedule_hrtimeout_range()
While in theory the timer can be triggered before expires + delta, for the
cases of RT tasks they really have no business giving any lenience for
extra slack time, so override any passed value by the user and always use
zero for schedule_hrtimeout_range() calls. Furthermore, this is similar to
what the nanosleep(2) family already does with current->timer_slack_ns.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230123173206.6764-3-dave@stgolabs.net
2023-01-31 11:23:07 +01:00
Davidlohr Bueso
c14fd3dcac hrtimer: Rely on rt_task() for DL tasks too
Checking dl_task() is redundant as rt_task() returns true for deadline
tasks too.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230123173206.6764-2-dave@stgolabs.net
2023-01-31 11:23:07 +01:00
Feng Tang
b7082cdfc4 clocksource: Suspend the watchdog temporarily when high read latency detected
Bugs have been reported on 8 sockets x86 machines in which the TSC was
wrongly disabled when the system is under heavy workload.

 [ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
 [ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
 [ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
 [ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
 [ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
 [ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
 [ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
 [ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
 [ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
 [ 821.067990] clocksource: Switched to clocksource hpet

This can be reproduced by running memory intensive 'stream' tests,
or some of the stress-ng subcases such as 'ioport'.

The reason for these issues is the when system is under heavy load, the
read latency of the clocksources can be very high.  Even lightweight TSC
reads can show high latencies, and latencies are much worse for external
clocksources such as HPET or the APIC PM timer.  These latencies can
result in false-positive clocksource-unstable determinations.

These issues were initially reported by a customer running on a production
system, and this problem was reproduced on several generations of Xeon
servers, especially when running the stress-ng test.  These Xeon servers
were not production systems, but they did have the latest steppings
and firmware.

Given that the clocksource watchdog is a continual diagnostic check with
frequency of twice a second, there is no need to rush it when the system
is under heavy load.  Therefore, when high clocksource read latencies
are detected, suspend the watchdog timer for 5 minutes.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-24 15:12:48 -08:00
Peter Zijlstra
e3ee5e66f7 time/tick-broadcast: Remove RCU_NONIDLE() usage
No callers left that have already disabled RCU.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195540.927904612@infradead.org
2023-01-13 11:48:16 +01:00