Instead of taking the spinlock, the lockless versions atomically check
that the lock is not taken, and do the reference count update using a
cmpxchg() loop. This is semantically identical to doing the reference
count update protected by the lock, but avoids the "wait for lock"
contention that you get when accesses to the reference count are
contended.
Note that a "lockref" is absolutely _not_ equivalent to an atomic_t.
Even when the lockref reference counts are updated atomically with
cmpxchg, the fact that they also verify the state of the spinlock means
that the lockless updates can never happen while somebody else holds the
spinlock.
So while "lockref_put_or_lock()" looks a lot like just another name for
"atomic_dec_and_lock()", and both optimize to lockless updates, they are
fundamentally different: the decrement done by atomic_dec_and_lock() is
truly independent of any lock (as long as it doesn't decrement to zero),
so a locked region can still see the count change.
The lockref structure, in contrast, really is a *locked* reference
count. If you hold the spinlock, the reference count will be stable and
you can modify the reference count without using atomics, because even
the lockless updates will see and respect the state of the lock.
In order to enable the cmpxchg lockless code, the architecture needs to
do three things:
(1) Make sure that the "arch_spinlock_t" and an "unsigned int" can fit
in an aligned u64, and have a "cmpxchg()" implementation that works
on such a u64 data type.
(2) define a helper function to test for a spinlock being unlocked
("arch_spin_value_unlocked()")
(3) select the "ARCH_USE_CMPXCHG_LOCKREF" config variable in its
Kconfig file.
This enables it for x86-64 (but not 32-bit, we'd need to make sure
cmpxchg() turns into the proper cmpxchg8b in order to enable it for
32-bit mode).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
They aren't very good to inline, since they already call external
functions (the spinlock code), and we're going to create rather more
complicated versions of them that can do the reference count updates
locklessly.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves __d_rcu_to_refcount() from <linux/dcache.h> into fs/namei.c
and re-implements it using the lockref infrastructure instead. It also
adds a lot of comments about what is actually going on, because turning
a dentry that was looked up using RCU into a long-lived reference
counted entry is one of the more subtle parts of the rcu walk.
We also used to be _particularly_ subtle in unlazy_walk() where we
re-validate both the dentry and its parent using the same sequence
count. We used to do it by nesting the locks and then verifying the
sequence count just once.
That was silly, because nested locking is expensive, but the sequence
count check is not. So this just re-validates the dentry and the parent
separately, avoiding the nested locking, and making the lockref lookup
possible.
Acked-by: Waiman Long <waiman.long@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A valid parent pointer is always going to have a non-zero reference
count, but if we look up the parent optimistically without locking, we
have to protect against the (very unlikely) race against renaming
changing the parent from under us.
We do that by using lockref_get_not_zero(), and then re-checking the
parent pointer after getting a valid reference.
[ This is a re-implementation of a chunk from the original patch by
Waiman Long: "dcache: Enable lockless update of dentry's refcount".
I've completely rewritten the patch-series and split it up, but I'm
attributing this part to Waiman as it's close enough to his earlier
patch - Linus ]
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This behaves like "lockref_get_not_zero()", but instead of doing nothing
if the count was zero, it returns with the lock held.
This allows callers to revalidate the lockref-protected data structure
if required even if the count was zero to begin with, and possibly
increment the count if it passes muster.
In particular, the dentry code wants this when it wants to turn an
RCU-protected dentry into a stable refcounted one: if the dentry count
it zero, but the sequence number still validates the dentry, we can take
a reference to it.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This just replaces the dentry count/lock combination with the lockref
structure that contains both a count and a spinlock, and does the
mechanical conversion to use the lockref infrastructure.
There are no semantic changes here, it's purely syntactic. The
reference lockref implementation uses the spinlock exactly the same way
that the old dcache code did, and the bulk of this patch is just
expanding the internal "d_count" use in the dcache code to use
"d_lockref.count" instead.
This is purely preparation for the real change to make the reference
count updates be lockless during the 3.12 merge window.
[ As with the previous commit, this is a rewritten version of a concept
originally from Waiman, so credit goes to him, blame for any errors
goes to me.
Waiman's patch had some semantic differences for taking advantage of
the lockless update in dget_parent(), while this patch is
intentionally a pure search-and-replace change with no semantic
changes. - Linus ]
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This introduces a new "lockref" structure that supports the concept of
lockless updates of reference counts that still honor an attached
spinlock.
NOTE! This reference implementation is not the optimized lockless
version, rather it is the fallback implementation using standard
spinlocks. The actual optimized versions will be merged into 3.12, but
I wanted to get the infrastructure in place and document the new
interfaces.
[ Also note that this particular commit is drastically cut-down minimal
version of the original patch by Waiman. In order to properly credit
the original author I'm marking Waiman as the author here, but in the
end this patch bears little resemblance to the patch by Waiman. So
blame any errors on me editing things down to the point where I can
introduce the infrastructure before the merge window for 3.12 actually
opens. - Linus ]
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit bb2314b479.
It wasn't necessarily wrong per se, but we're still busily discussing
the exact details of this all, so I'm going to revert it for now.
It's true that you can already do flink() through /proc and that flink()
isn't new. But as Brad Spengler points out, some secure environments do
not mount proc, and flink adds a new interface that can avoid path
lookup of the source for those kinds of environments.
We may re-do this (and even mark it for stable backporting back in 3.11
and possibly earlier) once the whole discussion about the interface is done.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull regmap fixes from Mark Brown:
"Two changes here:
- Fix a bug in the rbtree code which could cause it to create two
different cache entries for the same register by adding a single
register at a time to the cache. This isn't awesome for
performance but it's non-invasive which we need for this late in
the release cycle and the I/O costs we're trying to avoid are high.
- Add another header used in the !CONFIG_REGMAP stubs where we had
been relying on implicit inclusion"
* tag 'regmap-v3.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: rbtree: Fix overlapping rbnodes.
regmap: Add another missing header for !CONFIG_REGMAP stubs
Pull powerpc fixes from Ben Herrenschmidt:
"Here are 3 bug fixes that should probably go into 3.11 since I'm also
tagging them for stable.
Once fixes our old /proc/powerpc/lparcfg file which provides partition
informations when running under our hypervisor and also acts as a
user-triggerable Oops when hot :-(
The other two respectively are a one liner to fix a HVSI protocol
handshake problem causing the console to fail to show up on a bunch of
machines until we reach userspace, which I deem annoying enough to
warrant going to stable, and a nasty gcc miscompile causing us to pass
virtual instead of physical addresses to the firmware under some
circumstances"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/hvsi: Increase handshake timeout from 200ms to 400ms.
powerpc: Work around gcc miscompilation of __pa() on 64-bit
powerpc: Don't Oops when accessing /proc/powerpc/lparcfg without hypervisor
Dave reported corrupted swap entries
| [ 4588.541886] swap_free: Unused swap offset entry 00002d15
| [ 4588.541952] BUG: Bad page map in process trinity-kid12 pte:005a2a80 pmd:22c01f067
and Hugh pointed that in move_ptes _PAGE_SOFT_DIRTY bit set regardless
the type of entry pte consists of. The trick here is that when we carry
soft dirty status in swap entries we are to use _PAGE_SWP_SOFT_DIRTY
instead, because this is the only place in pte which can be used for own
needs without intersecting with bits owned by swap entry type/offset.
Reported-and-tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Analyzed-by: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This solves a problem observed in kexec'ed kernel where 200ms timeout is
too short and bootconsole fails to initialize. Console did eventually
become workable but much later into the boot process.
Observed timeout was around 260ms, but I decided to make it a little bigger
for more reliability.
This has been tested on Power7 machine with Petitboot as a primary
bootloader and PowerNV firmware.
CC: <stable@vger.kernel.org>
Signed-off-by: Eugene Surovegin <surovegin@google.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On 64-bit, __pa(&static_var) gets miscompiled by recent versions of
gcc as something like:
addis 3,2,.LANCHOR1+4611686018427387904@toc@ha
addi 3,3,.LANCHOR1+4611686018427387904@toc@l
This ends up effectively ignoring the offset, since its bottom 32 bits
are zero, and means that the result of __pa() still has 0xC in the top
nibble. This happens with gcc 4.8.1, at least.
To work around this, for 64-bit we make __pa() use an AND operator,
and for symmetry, we make __va() use an OR operator. Using an AND
operator rather than a subtraction ends up with slightly shorter code
since it can be done with a single clrldi instruction, whereas it
takes three instructions to form the constant (-PAGE_OFFSET) and add
it on. (Note that MEMORY_START is always 0 on 64-bit.)
CC: <stable@vger.kernel.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
/proc/powerpc/lparcfg is an ancient facility (though still actively used)
which allows access to some informations relative to the partition when
running underneath a PAPR compliant hypervisor.
It makes no sense on non-pseries machines. However, currently, not only
can it be created on these if the kernel has pseries support, but accessing
it on such a machine will crash due to trying to do hypervisor calls.
In fact, it should also not do HV calls on older pseries that didn't have
an hypervisor either.
Finally, it has the plumbing to be a module but is a "bool" Kconfig option.
This fixes the whole lot by turning it into a machine_device_initcall
that is only created on pseries, and adding the necessary hypervisor
check before calling the H_GET_EM_PARMS hypercall
CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull USB bugfix from Greg KH:
"Here is a single bugfix that resolves the "can not build the OHCI
driver with CONFIG_PM disabled" problem that lots of people have been
reporting with 3.11-rc7. Sorry about that one, it missed my build
tests, and it seems, a number of others as well.
Thank goodness for Guenter :)"
* tag 'usb-3.11-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: OHCI: fix build error related to ohci_suspend/resume
Pull jfs fix from Dave Kleikamp:
"One JFS patch to fix an incompatibility with NFSv4 resulting in the
nfs client reporting a readdir loop"
* tag 'jfs-3.11-rc8' of git://github.com/kleikamp/linux-shaggy:
jfs: fix readdir cookie incompatibility with NFSv4
Commit 9a11899c5e (USB: OHCI: add missing PCI PM callbacks to
ohci-pci.c) added missing ohci_suspend and ohci_resume callback
pointers, but forgot that these callbacks are declared and defined
only when CONFIG_PM is enabled.
This patch adds a preprocessor conditional to avoid build errors when
PM is disabled.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Meelis Roos <mroos@linux.ee>,
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull staging fixes from Greg KH:
"Here are two tiny staging tree fixes (well, one is for an iio driver,
but those updates come through the staging tree due to dependancies)
One fixes a problem with an IIO driver, and the other fixes a bug in
the comedi driver core"
* tag 'staging-3.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: comedi: bug-fix NULL pointer dereference on failed attach
iio: adjd_s311: Fix non-scan mode data read
Pull USB fixes from Greg KH:
"Here are two USB fixes for 3.11-rc7
One fixes a reported regression in the OHCI driver, and the other
fixes a reported build breakage in the USB phy drivers"
* tag 'usb-3.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: phy: fix build breakage
USB: OHCI: add missing PCI PM callbacks to ohci-pci.c
Pull ARM fixes from Russell King:
"This round of fixes is smaller than previous: a couple more updates
for the security fixes, and a one-liner kexec fix"
* 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
ARM: 7816/1: CONFIG_KUSER_HELPERS: fix help text
ARM: 7815/1: kexec: offline non panic CPUs on Kdump panic
ARM: 7819/1: fiq: Cast the first argument of flush_icache_range()
Pull vfs fixes from Al Viro:
"Assorted fixes from the last week or so"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: collect_mounts() should return an ERR_PTR
bfs: iget_locked() doesn't return an ERR_PTR
efs: iget_locked() doesn't return an ERR_PTR()
proc: kill the extra proc_readfd_common()->dir_emit_dots()
cope with potentially long ->d_dname() output for shmem/hugetlb
Pull ACPI fix from Rafael Wysocki:
"I really hoped that it wouldn't be necessary to change anything in
ACPI at this point, but it turns out that we need to revert one more
ACPI video commit causing trouble.
This reverts a change in the ACPI video driver that caused the ACPI
backlight initialization to be carried out even if acpi_backlight=vendor
is passed in the kernel command line which turns out to break things
at least on one system"
* tag 'acpi-3.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Revert "ACPI / video: Always call acpi_video_init_brightness() on init"
Pull SCSI fixes from James Bottomley:
"This is a set of small bug fixes for lpfc and zfcp and a fix for a
fairly nasty bug in sg where a process which cancels I/O completes in
a kernel thread which would then try to write back to the now gone
userspace and end up writing to a random kernel address instead"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
[SCSI] zfcp: remove access control tables interface (keep sysfs files)
[SCSI] zfcp: fix schedule-inside-lock in scsi_device list loops
[SCSI] zfcp: fix lock imbalance by reworking request queue locking
[SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal
[SCSI] lpfc: Don't force CONFIG_GENERIC_CSUM on