The current Wound-Wait mutex algorithm is actually not Wound-Wait but
Wait-Die. Implement also Wound-Wait as a per-ww-class choice. Wound-Wait
is, contrary to Wait-Die a preemptive algorithm and is known to generate
fewer backoffs. Testing reveals that this is true if the
number of simultaneous contending transactions is small.
As the number of simultaneous contending threads increases, Wait-Wound
becomes inferior to Wait-Die in terms of elapsed time.
Possibly due to the larger number of held locks of sleeping transactions.
Update documentation and callers.
Timings using git://people.freedesktop.org/~thomash/ww_mutex_test
tag patch-18-06-15
Each thread runs 100000 batches of lock / unlock 800 ww mutexes randomly
chosen out of 100000. Four core Intel x86_64:
Algorithm #threads Rollbacks time
Wound-Wait 4 ~100 ~17s.
Wait-Die 4 ~150000 ~19s.
Wound-Wait 16 ~360000 ~109s.
Wait-Die 16 ~450000 ~82s.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Sean Paul <seanpaul@chromium.org>
Cc: David Airlie <airlied@linux.ie>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Co-authored-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
There are a few issues on some kernel-doc markups that was
causing troubles with kernel-doc output on ReST format:
./kernel/futex.c:492: WARNING: Inline emphasis start-string without end-string.
./kernel/futex.c:1264: WARNING: Block quote ends without a blank line; unexpected unindent.
./kernel/futex.c:1721: WARNING: Block quote ends without a blank line; unexpected unindent.
./kernel/futex.c:2338: WARNING: Block quote ends without a blank line; unexpected unindent.
./kernel/futex.c:2426: WARNING: Block quote ends without a blank line; unexpected unindent.
./kernel/futex.c:2899: WARNING: Block quote ends without a blank line; unexpected unindent.
./kernel/futex.c:2972: WARNING: Block quote ends without a blank line; unexpected unindent.
Fix them.
No functional changes.
Acked-by: Darren Hart (VMware) <dvhart@infradead.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/wake_q.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/wake_q.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- Implement wraparound-safe refcount_t and kref_t types based on
generic atomic primitives (Peter Zijlstra)
- Improve and fix the ww_mutex code (Nicolai Hähnle)
- Add self-tests to the ww_mutex code (Chris Wilson)
- Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
Bueso)
- Micro-optimize the current-task logic all around the core kernel
(Davidlohr Bueso)
- Tidy up after recent optimizations: remove stale code and APIs,
clean up the code (Waiman Long)
- ... plus misc fixes, updates and cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
fork: Fix task_struct alignment
locking/spinlock/debug: Remove spinlock lockup detection code
lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
lkdtm: Convert to refcount_t testing
kref: Implement 'struct kref' using refcount_t
refcount_t: Introduce a special purpose refcount type
sched/wake_q: Clarify queue reinit comment
sched/wait, rcuwait: Fix typo in comment
locking/mutex: Fix lockdep_assert_held() fail
locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
locking/rwsem: Reinit wake_q after use
locking/rwsem: Remove unnecessary atomic_long_t casts
jump_labels: Move header guard #endif down where it belongs
locking/atomic, kref: Implement kref_put_lock()
locking/ww_mutex: Turn off __must_check for now
locking/atomic, kref: Avoid more abuse
locking/atomic, kref: Use kref_get_unless_zero() more
locking/atomic, kref: Kill kref_sub()
locking/atomic, kref: Add kref_read()
locking/atomic, kref: Add KREF_INIT()
...
In commit:
659cf9f582 ("locking/ww_mutex: Optimize ww-mutexes by waking at most one waiter for backoff when acquiring the lock")
I replaced a comment with a lockdep_assert_held(). However it turns out
we hide that lock from lockdep for hysterical raisins, which results
in the assertion always firing.
Remove the old debug code as lockdep will easily spot the abuse it was
meant to catch, which will make the lock visible to lockdep and make
the assertion work as intended.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicolai Haehnle <Nicolai.Haehnle@amd.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 659cf9f582 ("locking/ww_mutex: Optimize ww-mutexes by waking at most one waiter for backoff when acquiring the lock")
Link: http://lkml.kernel.org/r/20170117150609.GB32474@worktop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the following scenario, thread #1 should back off its attempt to lock
ww1 and unlock ww2 (assuming the acquire context stamps are ordered
accordingly).
Thread #0 Thread #1
--------- ---------
successfully lock ww2
set ww1->base.owner
attempt to lock ww1
confirm ww1->ctx == NULL
enter mutex_spin_on_owner
set ww1->ctx
What was likely to happen previously is:
attempt to lock ww2
refuse to spin because
ww2->ctx != NULL
schedule()
detect thread #0 is off CPU
stop optimistic spin
return -EDEADLK
unlock ww2
wakeup thread #0
lock ww2
Now, we are more likely to see:
detect ww1->ctx != NULL
stop optimistic spin
return -EDEADLK
unlock ww2
successfully lock ww2
... because thread #1 will stop its optimistic spin as soon as possible.
The whole scenario is quite unlikely, since it requires thread #1 to get
between thread #0 setting the owner and setting the ctx. But since we're
idling here anyway, the additional check is basically free.
Found by inspection.
Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-10-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of inlining __mutex_lock_common() 5 times, once for each
{state,ww} variant. Reduce this to two, ww and !ww.
Then add __always_inline to mutex_optimistic_spin(), so that that will
get inlined all 4 remaining times, for all {waiter,ww} variants.
text data bss dec hex filename
6301 0 0 6301 189d defconfig-build/kernel/locking/mutex.o
4053 0 0 4053 fd5 defconfig-build/kernel/locking/mutex.o
4257 0 0 4257 10a1 defconfig-build/kernel/locking/mutex.o
This reduces total text size and better separates the ww and !ww mutex
code generation.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add regular waiters in stamp order. Keep adding waiters that have no
context in FIFO order and take care not to starve them.
While adding our task as a waiter, back off if we detect that there is
a waiter with a lower stamp in front of us.
Make sure to call lock_contended even when we back off early.
For w/w mutexes, being first in the wait list is only stable when
taking the lock without a context. Therefore, the purpose of the first
flag is split into two: 'first' remains to indicate whether we want to
spin optimistically, while 'handoff' indicates that we should be
prepared to accept a handoff.
For w/w locking with a context, we always accept handoffs after the
first schedule(), to handle the following sequence of events:
1. Task #0 unlocks and hands off to Task #2 which is first in line
2. Task #1 adds itself in front of Task #2
3. Task #2 wakes up and must accept the handoff even though it is no
longer first in line
Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: =?UTF-8?q?Nicolai=20H=C3=A4hnle?= <Nicolai.Haehnle@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-7-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While reviewing the ww_mutex patches, I noticed that it was still
possible to (incorrectly) succeed for (incorrect) code like:
mutex_lock(&a);
mutex_lock(&a);
This was possible if the second mutex_lock() would block (as expected)
but then receive a spurious wakeup. At that point it would find itself
at the front of the queue, request a handoff and instantly claim
ownership and continue, since owner would point to itself.
Avoid this scenario and simplify the code by introducing a third low
bit to signal handoff pickup. So once we request handoff, unlock
clears the handoff bit and sets the pickup bit along with the new
owner.
This also removes the need for the .handoff argument to
__mutex_trylock(), since that becomes superfluous with PICKUP.
In order to guarantee enough low bits, ensure task_struct alignment is
at least L1_CACHE_BYTES (which seems a good ideal regardless).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 9d659ae14b ("locking/mutex: Add lock handoff to avoid starvation")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a nasty interface and setting the state of a foreign task must
not be done. As of the following commit:
be628be095 ("bcache: Make gc wakeup sane, remove set_task_state()")
... everyone in the kernel calls set_task_state() with current, allowing
the helper to be removed.
However, as the comment indicates, it is still around for those archs
where computing current is more expensive than using a pointer, at least
in theory. An important arch that is affected is arm64, however this has
been addressed now [1] and performance is up to par making no difference
with either calls.
Of all the callers, if any, it's the locking bits that would care most
about this -- ie: we end up passing a tsk pointer to a lot of the lock
slowpath, and setting ->state on that. The following numbers are based
on two tests: a custom ad-hoc microbenchmark that just measures
latencies (for ~65 million calls) between get_task_state() vs
get_current_state().
Secondly for a higher overview, an unlink microbenchmark was used,
which pounds on a single file with open, close,unlink combos with
increasing thread counts (up to 4x ncpus). While the workload is quite
unrealistic, it does contend a lot on the inode mutex or now rwsem.
[1] https://lkml.kernel.org/r/1483468021-8237-1-git-send-email-mark.rutland@arm.com
== 1. x86-64 ==
Avg runtime set_task_state(): 601 msecs
Avg runtime set_current_state(): 552 msecs
vanilla dirty
Hmean unlink1-processes-2 36089.26 ( 0.00%) 38977.33 ( 8.00%)
Hmean unlink1-processes-5 28555.01 ( 0.00%) 29832.55 ( 4.28%)
Hmean unlink1-processes-8 37323.75 ( 0.00%) 44974.57 ( 20.50%)
Hmean unlink1-processes-12 43571.88 ( 0.00%) 44283.01 ( 1.63%)
Hmean unlink1-processes-21 34431.52 ( 0.00%) 38284.45 ( 11.19%)
Hmean unlink1-processes-30 34813.26 ( 0.00%) 37975.17 ( 9.08%)
Hmean unlink1-processes-48 37048.90 ( 0.00%) 39862.78 ( 7.59%)
Hmean unlink1-processes-79 35630.01 ( 0.00%) 36855.30 ( 3.44%)
Hmean unlink1-processes-110 36115.85 ( 0.00%) 39843.91 ( 10.32%)
Hmean unlink1-processes-141 32546.96 ( 0.00%) 35418.52 ( 8.82%)
Hmean unlink1-processes-172 34674.79 ( 0.00%) 36899.21 ( 6.42%)
Hmean unlink1-processes-203 37303.11 ( 0.00%) 36393.04 ( -2.44%)
Hmean unlink1-processes-224 35712.13 ( 0.00%) 36685.96 ( 2.73%)
== 2. ppc64le ==
Avg runtime set_task_state(): 938 msecs
Avg runtime set_current_state: 940 msecs
vanilla dirty
Hmean unlink1-processes-2 19269.19 ( 0.00%) 30704.50 ( 59.35%)
Hmean unlink1-processes-5 20106.15 ( 0.00%) 21804.15 ( 8.45%)
Hmean unlink1-processes-8 17496.97 ( 0.00%) 17243.28 ( -1.45%)
Hmean unlink1-processes-12 14224.15 ( 0.00%) 17240.21 ( 21.20%)
Hmean unlink1-processes-21 14155.66 ( 0.00%) 15681.23 ( 10.78%)
Hmean unlink1-processes-30 14450.70 ( 0.00%) 15995.83 ( 10.69%)
Hmean unlink1-processes-48 16945.57 ( 0.00%) 16370.42 ( -3.39%)
Hmean unlink1-processes-79 15788.39 ( 0.00%) 14639.27 ( -7.28%)
Hmean unlink1-processes-110 14268.48 ( 0.00%) 14377.40 ( 0.76%)
Hmean unlink1-processes-141 14023.65 ( 0.00%) 16271.69 ( 16.03%)
Hmean unlink1-processes-172 13417.62 ( 0.00%) 16067.55 ( 19.75%)
Hmean unlink1-processes-203 15293.08 ( 0.00%) 15440.40 ( 0.96%)
Hmean unlink1-processes-234 13719.32 ( 0.00%) 16190.74 ( 18.01%)
Hmean unlink1-processes-265 16400.97 ( 0.00%) 16115.22 ( -1.74%)
Hmean unlink1-processes-296 14388.60 ( 0.00%) 16216.13 ( 12.70%)
Hmean unlink1-processes-320 15771.85 ( 0.00%) 15905.96 ( 0.85%)
x86-64 (known to be fast for get_current()/this_cpu_read_stable() caching)
and ppc64 (with paca) show similar improvements in the unlink microbenches.
The small delta for ppc64 (2ms), does not represent the gains on the unlink
runs. In the case of x86, there was a decent amount of variation in the
latency runs, but always within a 20 to 50ms increase), ppc was more constant.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: mark.rutland@arm.com
Link: http://lkml.kernel.org/r/1483479794-14013-5-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>