Commit Graph

75 Commits

Author SHA1 Message Date
Namhyung Kim
ee042be16c locking: Apply contention tracepoints in the slow path
Adding the lock contention tracepoints in various lock function slow
paths.  Note that each arch can define spinlock differently, I only
added it only to the generic qspinlock for now.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Link: https://lkml.kernel.org/r/20220322185709.141236-3-namhyung@kernel.org
2022-04-05 10:24:35 +02:00
Waiman Long
1ee326196c locking/rwsem: Always try to wake waiters in out_nolock path
For writers, the out_nolock path will always attempt to wake up waiters.
This may not be really necessary if the waiter to be removed is not the
first one.

For readers, no attempt to wake up waiter is being made. However, if
the HANDOFF bit is set and the reader to be removed is the first waiter,
the waiter behind it will inherit the HANDOFF bit and for a write lock
waiter waking it up will allow it to spin on the lock to acquire it
faster. So it can be beneficial to do a wakeup in this case.

Add a new rwsem_del_wake_waiter() helper function to do that consistently
for both reader and writer out_nolock paths.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220322152059.2182333-4-longman@redhat.com
2022-04-05 10:24:35 +02:00
Waiman Long
54c1ee4d61 locking/rwsem: Conditionally wake waiters in reader/writer slowpaths
In an analysis of a recent vmcore, a reader-owned rwsem was found with
385 readers but no writer in the wait queue. That is kind of unusual
but it may be caused by some race conditions that we have not fully
understood yet. In such a case, all the readers in the wait queue should
join the other reader-owners and acquire the read lock.

In rwsem_down_write_slowpath(), an incoming writer will try to
wake up the front readers under such circumstance. That is not
the case for rwsem_down_read_slowpath(), add a new helper function
rwsem_cond_wake_waiter() to do wakeup and use it in both reader and
writer slowpaths to have a consistent and correct behavior.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220322152059.2182333-3-longman@redhat.com
2022-04-05 10:24:35 +02:00
Waiman Long
f9e21aa9e6 locking/rwsem: No need to check for handoff bit if wait queue empty
Since commit d257cc8cb8 ("locking/rwsem: Make handoff bit handling
more consistent"), the handoff bit is always cleared if the wait queue
becomes empty. There is no need to check for RWSEM_FLAG_HANDOFF when
the wait list is known to be empty.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220322152059.2182333-2-longman@redhat.com
2022-04-05 10:24:34 +02:00
Minchan Kim
c441e934b6 locking: Add missing __sched attributes
This patch adds __sched attributes to a few missing places
to show blocked function rather than locking function
in get_wchan.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220115231657.84828-1-minchan@kernel.org
2022-02-11 12:13:55 +01:00
Ingo Molnar
6773cc31a9 Merge tag 'v5.16-rc5' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-12-13 10:48:46 +01:00
Kefeng Wang
c0bed69daf locking: Make owner_on_cpu() into <linux/sched.h>
Move the owner_on_cpu() from kernel/locking/rwsem.c into
include/linux/sched.h with under CONFIG_SMP, then use it
in the mutex/rwsem/rtmutex to simplify the code.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20211203075935.136808-2-wangkefeng.wang@huawei.com
2021-12-04 10:56:25 +01:00
Muchun Song
14c2404884 locking/rwsem: Optimize down_read_trylock() under highly contended case
We found that a process with 10 thousnads threads has been encountered
a regression problem from Linux-v4.14 to Linux-v5.4. It is a kind of
workload which will concurrently allocate lots of memory in different
threads sometimes. In this case, we will see the down_read_trylock()
with a high hotspot. Therefore, we suppose that rwsem has a regression
at least since Linux-v5.4. In order to easily debug this problem, we
write a simply benchmark to create the similar situation lile the
following.

  ```c++
  #include <sys/mman.h>
  #include <sys/time.h>
  #include <sys/resource.h>
  #include <sched.h>

  #include <cstdio>
  #include <cassert>
  #include <thread>
  #include <vector>
  #include <chrono>

  volatile int mutex;

  void trigger(int cpu, char* ptr, std::size_t sz)
  {
  	cpu_set_t set;
  	CPU_ZERO(&set);
  	CPU_SET(cpu, &set);
  	assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);

  	while (mutex);

  	for (std::size_t i = 0; i < sz; i += 4096) {
  		*ptr = '\0';
  		ptr += 4096;
  	}
  }

  int main(int argc, char* argv[])
  {
  	std::size_t sz = 100;

  	if (argc > 1)
  		sz = atoi(argv[1]);

  	auto nproc = std::thread::hardware_concurrency();
  	std::vector<std::thread> thr;
  	sz <<= 30;
  	auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
			 MAP_PRIVATE, -1, 0);
  	assert(ptr != MAP_FAILED);
  	char* cptr = static_cast<char*>(ptr);
  	auto run = sz / nproc;
  	run = (run >> 12) << 12;

  	mutex = 1;

  	for (auto i = 0U; i < nproc; ++i) {
  		thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
  		cptr += run;
  	}

  	rusage usage_start;
  	getrusage(RUSAGE_SELF, &usage_start);
  	auto start = std::chrono::system_clock::now();

  	mutex = 0;

  	for (auto& t : thr)
  		t.join();

  	rusage usage_end;
  	getrusage(RUSAGE_SELF, &usage_end);
  	auto end = std::chrono::system_clock::now();
  	timeval utime;
  	timeval stime;
  	timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
  	timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
  	printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
  	printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
  	printf("real: %lu\n",
  	       std::chrono::duration_cast<std::chrono::milliseconds>(end -
  	       start).count());

  	return 0;
  }
  ```

The functionality of above program is simply which creates `nproc`
threads and each of them are trying to touch memory (trigger page
fault) on different CPU. Then we will see the similar profile by
`perf top`.

  25.55%  [kernel]                  [k] down_read_trylock
  14.78%  [kernel]                  [k] handle_mm_fault
  13.45%  [kernel]                  [k] up_read
   8.61%  [kernel]                  [k] clear_page_erms
   3.89%  [kernel]                  [k] __do_page_fault

The highest hot instruction, which accounts for about 92%, in
down_read_trylock() is cmpxchg like the following.

  91.89 │      lock   cmpxchg %rdx,(%rdi)

Sice the problem is found by migrating from Linux-v4.14 to Linux-v5.4,
so we easily found that the commit ddb20d1d3a ("locking/rwsem: Optimize
down_read_trylock()") caused the regression. The reason is that the
commit assumes the rwsem is not contended at all. But it is not always
true for mmap lock which could be contended with thousands threads.
So most threads almost need to run at least 2 times of "cmpxchg" to
acquire the lock. The overhead of atomic operation is higher than
non-atomic instructions, which caused the regression.

By using the above benchmark, the real executing time on a x86-64 system
before and after the patch were:

                  Before Patch  After Patch
   # of Threads      real          real     reduced by
   ------------     ------        ------    ----------
         1          65,373        65,206       ~0.0%
         4          15,467        15,378       ~0.5%
        40           6,214         5,528      ~11.0%

For the uncontended case, the new down_read_trylock() is the same as
before. For the contended cases, the new down_read_trylock() is faster
than before. The more contended, the more fast.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com
2021-11-23 09:45:36 +01:00
Waiman Long
d257cc8cb8 locking/rwsem: Make handoff bit handling more consistent
There are some inconsistency in the way that the handoff bit is being
handled in readers and writers that lead to a race condition.

Firstly, when a queue head writer set the handoff bit, it will clear
it when the writer is being killed or interrupted on its way out
without acquiring the lock. That is not the case for a queue head
reader. The handoff bit will simply be inherited by the next waiter.

Secondly, in the out_nolock path of rwsem_down_read_slowpath(), both
the waiter and handoff bits are cleared if the wait queue becomes
empty.  For rwsem_down_write_slowpath(), however, the handoff bit is
not checked and cleared if the wait queue is empty. This can
potentially make the handoff bit set with empty wait queue.

Worse, the situation in rwsem_down_write_slowpath() relies on wstate,
a variable set outside of the critical section containing the ->count
manipulation, this leads to race condition where RWSEM_FLAG_HANDOFF
can be double subtracted, corrupting ->count.

To make the handoff bit handling more consistent and robust, extract
out handoff bit clearing code into the new rwsem_del_waiter() helper
function. Also, completely eradicate wstate; always evaluate
everything inside the same critical section.

The common function will only use atomic_long_andnot() to clear bits
when the wait queue is empty to avoid possible race condition.  If the
first waiter with handoff bit set is killed or interrupted to exit the
slowpath without acquiring the lock, the next waiter will inherit the
handoff bit.

While at it, simplify the trylock for loop in
rwsem_down_write_slowpath() to make it easier to read.

Fixes: 4f23dbc1e6 ("locking/rwsem: Implement lock handoff to prevent lock starvation")
Reported-by: Zhenhua Ma <mazhenhua@xiaomi.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211116012912.723980-1-longman@redhat.com
2021-11-23 09:45:35 +01:00
Yanfei Xu
5197fcd09a locking/rwsem: Fix comments about reader optimistic lock stealing conditions
After the commit 617f3ef951 ("locking/rwsem: Remove reader
optimistic spinning"), reader doesn't support optimistic spinning
anymore, there is no need meet the condition which OSQ is empty.

BTW, add an unlikely() for the max reader wakeup check in the loop.

Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211013134154.1085649-4-yanfei.xu@windriver.com
2021-10-19 17:27:06 +02:00
Yanfei Xu
6c2787f2a2 locking: Remove rcu_read_{,un}lock() for preempt_{dis,en}able()
preempt_disable/enable() is equal to RCU read-side crital section, and
the spinning codes in mutex and rwsem could ensure that the preemption
is disabled. So let's remove the unnecessary rcu_read_lock/unlock for
saving some cycles in hot codes.

Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211013134154.1085649-2-yanfei.xu@windriver.com
2021-10-19 17:27:06 +02:00
Yanfei Xu
7cdacc5f52 locking/rwsem: Disable preemption for spinning region
The spinning region rwsem_spin_on_owner() should not be preempted,
however the rwsem_down_write_slowpath() invokes it and don't disable
preemption. Fix it by adding a pair of preempt_disable/enable().

Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
[peterz: Fix CONFIG_RWSEM_SPIN_ON_OWNER=n build]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211013134154.1085649-3-yanfei.xu@windriver.com
2021-10-19 17:27:05 +02:00
Mike Galbraith
15eb7c888e locking/rwsem: Add missing __init_rwsem() for PREEMPT_RT
730633f0b7 became the first direct caller of __init_rwsem() vs the
usual init_rwsem(), exposing PREEMPT_RT's lack thereof.  Add it.

[ tglx: Move it out of line ]

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/50a936b7d8f12277d6ec7ed2ef0421a381056909.camel@gmx.de
2021-09-02 22:07:17 +02:00
Peter Zijlstra
add461325e locking/rtmutex: Extend the rtmutex core to support ww_mutex
Add a ww acquire context pointer to the waiter and various functions and
add the ww_mutex related invocations to the proper spots in the locking
code, similar to the mutex based variant.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211304.966139174@linutronix.de
2021-08-17 19:05:23 +02:00
Thomas Gleixner
e17ba59b7e locking/rtmutex: Guard regular sleeping locks specific functions
Guard the regular sleeping lock specific functionality, which is used for
rtmutex on non-RT enabled kernels and for mutex, rtmutex and semaphores on
RT enabled kernels so the code can be reused for the RT specific
implementation of spinlocks and rwlocks in a different compilation unit.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211303.311535693@linutronix.de
2021-08-17 17:23:27 +02:00
Thomas Gleixner
42254105df locking/rwsem: Add rtmutex based R/W semaphore implementation
The RT specific R/W semaphore implementation used to restrict the number of
readers to one, because a writer cannot block on multiple readers and
inherit its priority or budget.

The single reader restricting was painful in various ways:

 - Performance bottleneck for multi-threaded applications in the page fault
   path (mmap sem)

 - Progress blocker for drivers which are carefully crafted to avoid the
   potential reader/writer deadlock in mainline.

The analysis of the writer code paths shows that properly written RT tasks
should not take them. Syscalls like mmap(), file access which take mmap sem
write locked have unbound latencies, which are completely unrelated to mmap
sem. Other R/W sem users like graphics drivers are not suitable for RT tasks
either.

So there is little risk to hurt RT tasks when the RT rwsem implementation is
done in the following way:

 - Allow concurrent readers

 - Make writers block until the last reader left the critical section. This
   blocking is not subject to priority/budget inheritance.

 - Readers blocked on a writer inherit their priority/budget in the normal
   way.

There is a drawback with this scheme: R/W semaphores become writer unfair
though the applications which have triggered writer starvation (mostly on
mmap_sem) in the past are not really the typical workloads running on a RT
system. So while it's unlikely to hit writer starvation, it's possible. If
there are unexpected workloads on RT systems triggering it, the problem
has to be revisited.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211303.016885947@linutronix.de
2021-08-17 17:12:47 +02:00
Ingo Molnar
c87866ede4 Merge tag 'v5.14-rc6' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-08-17 16:16:29 +02:00
xuyehan
d4e5076c35 locking/rwsem: Remove an unused parameter of rwsem_wake()
The 2nd parameter 'count' is not used in this function.
The places where the function is called are also modified.

Signed-off-by: xuyehan <xuyehan@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/1625547043-28103-1-git-send-email-yehanxu1@gmail.com
2021-07-16 18:46:44 +02:00
Peter Zijlstra
2f064a59a1 sched: Change task_struct::state
Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
2021-06-18 11:43:09 +02:00
Ingo Molnar
e2db7592be locking: Fix typos in comments
Fix ~16 single-word typos in locking code comments.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-22 02:45:52 +01:00
Bhaskar Chowdhury
4faf62b1ef locking/rwsem: Fix comment typo
s/folowing/following/

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20210317041806.4096156-1-unixbhaskar@gmail.com
2021-03-17 09:34:39 +01:00
Randy Dunlap
c034f48e99 kernel: delete repeated words in comments
Drop repeated words in kernel/events/.
{if, the, that, with, time}

Drop repeated words in kernel/locking/.
{it, no, the}

Drop repeated words in kernel/sched/.
{in, not}

Link: https://lkml.kernel.org/r/20210127023412.26292-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Will Deacon <will@kernel.org>	[kernel/locking/]
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26 09:41:03 -08:00
Waiman Long
617f3ef951 locking/rwsem: Remove reader optimistic spinning
Reader optimistic spinning is helpful when the reader critical section
is short and there aren't that many readers around. It also improves
the chance that a reader can get the lock as writer optimistic spinning
disproportionally favors writers much more than readers.

Since commit d3681e269f ("locking/rwsem: Wake up almost all readers
in wait queue"), all the waiting readers are woken up so that they can
all get the read lock and run in parallel. When the number of contending
readers is large, allowing reader optimistic spinning will likely cause
reader fragmentation where multiple smaller groups of readers can get
the read lock in a sequential manner separated by writers. That reduces
reader parallelism.

One possible way to address that drawback is to limit the number of
readers (preferably one) that can do optimistic spinning. These readers
act as representatives of all the waiting readers in the wait queue as
they will wake up all those waiting readers once they get the lock.

Alternatively, as reader optimistic lock stealing has already enhanced
fairness to readers, it may be easier to just remove reader optimistic
spinning and simplifying the optimistic spinning code as a result.

Performance measurements (locking throughput kops/s) using a locking
microbenchmark with 50/50 reader/writer distribution and turbo-boost
disabled was done on a 2-socket Cascade Lake system (48-core 96-thread)
to see the impacts of these changes:

  1) Vanilla     - 5.10-rc3 kernel
  2) Before      - 5.10-rc3 kernel with previous patches in this series
  2) limit-rspin - 5.10-rc3 kernel with limited reader spinning patch
  3) no-rspin    - 5.10-rc3 kernel with reader spinning disabled

  # of threads  CS Load   Vanilla  Before   limit-rspin   no-rspin
  ------------  -------   -------  ------   -----------   --------
       2            1      5,185    5,662      5,214       5,077
       4            1      5,107    4,983      5,188       4,760
       8            1      4,782    4,564      4,720       4,628
      16            1      4,680    4,053      4,567       3,402
      32            1      4,299    1,115      1,118       1,098
      64            1      3,218      983      1,001         957
      96            1      1,938      944        957         930

       2           20      2,008    2,128      2,264       1,665
       4           20      1,390    1,033      1,046       1,101
       8           20      1,472    1,155      1,098       1,213
      16           20      1,332    1,077      1,089       1,122
      32           20        967      914        917         980
      64           20        787      874        891         858
      96           20        730      836        847         844

       2          100        372      356        360         355
       4          100        492      425        434         392
       8          100        533      537        529         538
      16          100        548      572        568         598
      32          100        499      520        527         537
      64          100        466      517        526         512
      96          100        406      497        506         509

The column "CS Load" represents the number of pause instructions issued
in the locking critical section. A CS load of 1 is extremely short and
is not likey in real situations. A load of 20 (moderate) and 100 (long)
are more realistic.

It can be seen that the previous patches in this series have reduced
performance in general except in highly contended cases with moderate
or long critical sections that performance improves a bit. This change
is mostly caused by the "Prevent potential lock starvation" patch that
reduce reader optimistic spinning and hence reduce reader fragmentation.

The patch that further limit reader optimistic spinning doesn't seem to
have too much impact on overall performance as shown in the benchmark
data.

The patch that disables reader optimistic spinning shows reduced
performance at lightly loaded cases, but comparable or slightly better
performance on with heavier contention.

This patch just removes reader optimistic spinning for now. As readers
are not going to do optimistic spinning anymore, we don't need to
consider if the OSQ is empty or not when doing lock stealing.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lkml.kernel.org/r/20201121041416.12285-6-longman@redhat.com
2020-12-09 17:08:48 +01:00
Waiman Long
1a728dff85 locking/rwsem: Enable reader optimistic lock stealing
If the optimistic spinning queue is empty and the rwsem does not have
the handoff or write-lock bits set, it is actually not necessary to
call rwsem_optimistic_spin() to spin on it. Instead, it can steal the
lock directly as its reader bias is in the count already.  If it is
the first reader in this state, it will try to wake up other readers
in the wait queue.

With this patch applied, the following were the lock event counts
after rebooting a 2-socket system and a "make -j96" kernel rebuild.

  rwsem_opt_rlock=4437
  rwsem_rlock=29
  rwsem_rlock_steal=19

So lock stealing represents about 0.4% of all the read locks acquired
in the slow path.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lkml.kernel.org/r/20201121041416.12285-4-longman@redhat.com
2020-12-09 17:08:48 +01:00
Waiman Long
2f06f70292 locking/rwsem: Prevent potential lock starvation
The lock handoff bit is added in commit 4f23dbc1e6 ("locking/rwsem:
Implement lock handoff to prevent lock starvation") to avoid lock
starvation. However, allowing readers to do optimistic spinning does
introduce an unlikely scenario where lock starvation can happen.

The lock handoff bit may only be set when a waiter is being woken up.
In the case of reader unlock, wakeup happens only when the reader count
reaches 0. If there is a continuous stream of incoming readers acquiring
read lock via optimistic spinning, it is possible that the reader count
may never reach 0 and so the handoff bit will never be asserted.

One way to prevent this scenario from happening is to disallow optimistic
spinning if the rwsem is currently owned by readers. If the previous
or current owner is a writer, optimistic spinning will be allowed.

If the previous owner is a reader but the reader count has reached 0
before, a wakeup should have been issued. So the handoff mechanism
will be kicked in to prevent lock starvation. As a result, it should
be OK to do optimistic spinning in this case.

This patch may have some impact on reader performance as it reduces
reader optimistic spinning especially if the lock critical sections
are short the number of contending readers are small.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lkml.kernel.org/r/20201121041416.12285-3-longman@redhat.com
2020-12-09 17:08:48 +01:00