Commit Graph

389255 Commits

Author SHA1 Message Date
Linus Torvalds bc08b449ee lockref: implement lockless reference count updates using cmpxchg()
Instead of taking the spinlock, the lockless versions atomically check
that the lock is not taken, and do the reference count update using a
cmpxchg() loop.  This is semantically identical to doing the reference
count update protected by the lock, but avoids the "wait for lock"
contention that you get when accesses to the reference count are
contended.

Note that a "lockref" is absolutely _not_ equivalent to an atomic_t.
Even when the lockref reference counts are updated atomically with
cmpxchg, the fact that they also verify the state of the spinlock means
that the lockless updates can never happen while somebody else holds the
spinlock.

So while "lockref_put_or_lock()" looks a lot like just another name for
"atomic_dec_and_lock()", and both optimize to lockless updates, they are
fundamentally different: the decrement done by atomic_dec_and_lock() is
truly independent of any lock (as long as it doesn't decrement to zero),
so a locked region can still see the count change.

The lockref structure, in contrast, really is a *locked* reference
count.  If you hold the spinlock, the reference count will be stable and
you can modify the reference count without using atomics, because even
the lockless updates will see and respect the state of the lock.

In order to enable the cmpxchg lockless code, the architecture needs to
do three things:

 (1) Make sure that the "arch_spinlock_t" and an "unsigned int" can fit
     in an aligned u64, and have a "cmpxchg()" implementation that works
     on such a u64 data type.

 (2) define a helper function to test for a spinlock being unlocked
     ("arch_spin_value_unlocked()")

 (3) select the "ARCH_USE_CMPXCHG_LOCKREF" config variable in its
     Kconfig file.

This enables it for x86-64 (but not 32-bit, we'd need to make sure
cmpxchg() turns into the proper cmpxchg8b in order to enable it for
32-bit mode).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-02 12:12:15 -07:00
Linus Torvalds 2f4f12e571 lockref: uninline lockref helper functions
They aren't very good to inline, since they already call external
functions (the spinlock code), and we're going to create rather more
complicated versions of them that can do the reference count updates
locklessly.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-02 11:58:20 -07:00
Linus Torvalds 15570086b5 vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock()
This moves __d_rcu_to_refcount() from <linux/dcache.h> into fs/namei.c
and re-implements it using the lockref infrastructure instead.  It also
adds a lot of comments about what is actually going on, because turning
a dentry that was looked up using RCU into a long-lived reference
counted entry is one of the more subtle parts of the rcu walk.

We also used to be _particularly_ subtle in unlazy_walk() where we
re-validate both the dentry and its parent using the same sequence
count.  We used to do it by nesting the locks and then verifying the
sequence count just once.

That was silly, because nested locking is expensive, but the sequence
count check is not.  So this just re-validates the dentry and the parent
separately, avoiding the nested locking, and making the lockref lookup
possible.

Acked-by: Waiman Long <waiman.long@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-02 11:38:06 -07:00
Waiman Long df3d0bbcdb vfs: use lockref_get_not_zero() for optimistic lockless dget_parent()
A valid parent pointer is always going to have a non-zero reference
count, but if we look up the parent optimistically without locking, we
have to protect against the (very unlikely) race against renaming
changing the parent from under us.

We do that by using lockref_get_not_zero(), and then re-checking the
parent pointer after getting a valid reference.

[ This is a re-implementation of a chunk from the original patch by
  Waiman Long: "dcache: Enable lockless update of dentry's refcount".
  I've completely rewritten the patch-series and split it up, but I'm
  attributing this part to Waiman as it's close enough to his earlier
  patch  - Linus ]

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-02 11:29:22 -07:00
Linus Torvalds b3abd80250 lockref: add 'lockref_get_or_lock() helper
This behaves like "lockref_get_not_zero()", but instead of doing nothing
if the count was zero, it returns with the lock held.

This allows callers to revalidate the lockref-protected data structure
if required even if the count was zero to begin with, and possibly
increment the count if it passes muster.

In particular, the dentry code wants this when it wants to turn an
RCU-protected dentry into a stable refcounted one: if the dentry count
it zero, but the sequence number still validates the dentry, we can take
a reference to it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-02 11:14:19 -07:00
Waiman Long 98474236f7 vfs: make the dentry cache use the lockref infrastructure
This just replaces the dentry count/lock combination with the lockref
structure that contains both a count and a spinlock, and does the
mechanical conversion to use the lockref infrastructure.

There are no semantic changes here, it's purely syntactic.  The
reference lockref implementation uses the spinlock exactly the same way
that the old dcache code did, and the bulk of this patch is just
expanding the internal "d_count" use in the dcache code to use
"d_lockref.count" instead.

This is purely preparation for the real change to make the reference
count updates be lockless during the 3.12 merge window.

[ As with the previous commit, this is a rewritten version of a concept
  originally from Waiman, so credit goes to him, blame for any errors
  goes to me.

  Waiman's patch had some semantic differences for taking advantage of
  the lockless update in dget_parent(), while this patch is
  intentionally a pure search-and-replace change with no semantic
  changes.     - Linus ]

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-28 18:24:59 -07:00
Waiman Long 0f8f2aaaab Add new lockref infrastructure reference implementation
This introduces a new "lockref" structure that supports the concept of
lockless updates of reference counts that still honor an attached
spinlock.

NOTE! This reference implementation is not the optimized lockless
version, rather it is the fallback implementation using standard
spinlocks.  The actual optimized versions will be merged into 3.12, but
I wanted to get the infrastructure in place and document the new
interfaces.

[ Also note that this particular commit is drastically cut-down minimal
  version of the original patch by Waiman.  In order to properly credit
  the original author I'm marking Waiman as the author here, but in the
  end this patch bears little resemblance to the patch by Waiman.  So
  blame any errors on me editing things down to the point where I can
  introduce the infrastructure before the merge window for 3.12 actually
  opens.     - Linus ]

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-28 18:13:27 -07:00
Linus Torvalds f0cc6ffb8c Revert "fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink"
This reverts commit bb2314b479.

It wasn't necessarily wrong per se, but we're still busily discussing
the exact details of this all, so I'm going to revert it for now.

It's true that you can already do flink() through /proc and that flink()
isn't new.  But as Brad Spengler points out, some secure environments do
not mount proc, and flink adds a new interface that can avoid path
lookup of the source for those kinds of environments.

We may re-do this (and even mark it for stable backporting back in 3.11
and possibly earlier) once the whole discussion about the interface is done.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-28 09:18:05 -07:00
Linus Torvalds fa8218def1 Merge tag 'regmap-v3.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap fixes from Mark Brown:
 "Two changes here:

   - Fix a bug in the rbtree code which could cause it to create two
     different cache entries for the same register by adding a single
     register at a time to the cache.  This isn't awesome for
     performance but it's non-invasive which we need for this late in
     the release cycle and the I/O costs we're trying to avoid are high.

   - Add another header used in the !CONFIG_REGMAP stubs where we had
     been relying on implicit inclusion"

* tag 'regmap-v3.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
  regmap: rbtree: Fix overlapping rbnodes.
  regmap: Add another missing header for !CONFIG_REGMAP stubs
2013-08-27 10:10:30 -07:00
Linus Torvalds 0c6b5c5b45 Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc fixes from Ben Herrenschmidt:
 "Here are 3 bug fixes that should probably go into 3.11 since I'm also
  tagging them for stable.

  Once fixes our old /proc/powerpc/lparcfg file which provides partition
  informations when running under our hypervisor and also acts as a
  user-triggerable Oops when hot :-(

  The other two respectively are a one liner to fix a HVSI protocol
  handshake problem causing the console to fail to show up on a bunch of
  machines until we reach userspace, which I deem annoying enough to
  warrant going to stable, and a nasty gcc miscompile causing us to pass
  virtual instead of physical addresses to the firmware under some
  circumstances"

* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
  powerpc/hvsi: Increase handshake timeout from 200ms to 400ms.
  powerpc: Work around gcc miscompilation of __pa() on 64-bit
  powerpc: Don't Oops when accessing /proc/powerpc/lparcfg without hypervisor
2013-08-27 10:09:22 -07:00
Cyrill Gorcunov 6dec97dc92 mm: move_ptes -- Set soft dirty bit depending on pte type
Dave reported corrupted swap entries

 | [ 4588.541886] swap_free: Unused swap offset entry 00002d15
 | [ 4588.541952] BUG: Bad page map in process trinity-kid12  pte:005a2a80 pmd:22c01f067

and Hugh pointed that in move_ptes _PAGE_SOFT_DIRTY bit set regardless
the type of entry pte consists of.  The trick here is that when we carry
soft dirty status in swap entries we are to use _PAGE_SWP_SOFT_DIRTY
instead, because this is the only place in pte which can be used for own
needs without intersecting with bits owned by swap entry type/offset.

Reported-and-tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Analyzed-by: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-27 09:36:17 -07:00
Eugene Surovegin d220980b70 powerpc/hvsi: Increase handshake timeout from 200ms to 400ms.
This solves a problem observed in kexec'ed kernel where 200ms timeout is
too short and bootconsole fails to initialize. Console did eventually
become workable but much later into the boot process.

Observed timeout was around 260ms, but I decided to make it a little bigger
for more reliability.

This has been tested on Power7 machine with Petitboot as a primary
bootloader and PowerNV firmware.

CC: <stable@vger.kernel.org>
Signed-off-by: Eugene Surovegin <surovegin@google.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-27 16:59:56 +10:00
Paul Mackerras bdbc29c19b powerpc: Work around gcc miscompilation of __pa() on 64-bit
On 64-bit, __pa(&static_var) gets miscompiled by recent versions of
gcc as something like:

        addis 3,2,.LANCHOR1+4611686018427387904@toc@ha
        addi 3,3,.LANCHOR1+4611686018427387904@toc@l

This ends up effectively ignoring the offset, since its bottom 32 bits
are zero, and means that the result of __pa() still has 0xC in the top
nibble.  This happens with gcc 4.8.1, at least.

To work around this, for 64-bit we make __pa() use an AND operator,
and for symmetry, we make __va() use an OR operator.  Using an AND
operator rather than a subtraction ends up with slightly shorter code
since it can be done with a single clrldi instruction, whereas it
takes three instructions to form the constant (-PAGE_OFFSET) and add
it on.  (Note that MEMORY_START is always 0 on 64-bit.)

CC: <stable@vger.kernel.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-27 16:59:30 +10:00
Benjamin Herrenschmidt f5f6cbb616 powerpc: Don't Oops when accessing /proc/powerpc/lparcfg without hypervisor
/proc/powerpc/lparcfg is an ancient facility (though still actively used)
which allows access to some informations relative to the partition when
running underneath a PAPR compliant hypervisor.

It makes no sense on non-pseries machines. However, currently, not only
can it be created on these if the kernel has pseries support, but accessing
it on such a machine will crash due to trying to do hypervisor calls.

In fact, it should also not do HV calls on older pseries that didn't have
an hypervisor either.

Finally, it has the plumbing to be a module but is a "bool" Kconfig option.

This fixes the whole lot by turning it into a machine_device_initcall
that is only created on pseries, and adding the necessary hypervisor
check before calling the H_GET_EM_PARMS hypercall

CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-27 16:38:33 +10:00
Linus Torvalds 9b50683321 Merge tag 'usb-3.11-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB bugfix from Greg KH:
 "Here is a single bugfix that resolves the "can not build the OHCI
  driver with CONFIG_PM disabled" problem that lots of people have been
  reporting with 3.11-rc7.  Sorry about that one, it missed my build
  tests, and it seems, a number of others as well.

  Thank goodness for Guenter :)"

* tag 'usb-3.11-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  USB: OHCI: fix build error related to ohci_suspend/resume
2013-08-26 19:23:29 -07:00
Linus Torvalds 83c425d222 Merge tag 'jfs-3.11-rc8' of git://github.com/kleikamp/linux-shaggy
Pull jfs fix from Dave Kleikamp:
 "One JFS patch to fix an incompatibility with NFSv4 resulting in the
  nfs client reporting a readdir loop"

* tag 'jfs-3.11-rc8' of git://github.com/kleikamp/linux-shaggy:
  jfs: fix readdir cookie incompatibility with NFSv4
2013-08-26 19:22:49 -07:00
Alan Stern d3474049ab USB: OHCI: fix build error related to ohci_suspend/resume
Commit 9a11899c5e (USB: OHCI: add missing PCI PM callbacks to
ohci-pci.c) added missing ohci_suspend and ohci_resume callback
pointers, but forgot that these callbacks are declared and defined
only when CONFIG_PM is enabled.

This patch adds a preprocessor conditional to avoid build errors when
PM is disabled.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Meelis Roos <mroos@linux.ee>,
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-26 15:22:15 -07:00
Linus Torvalds d8dfad3876 Linux 3.11-rc7 2013-08-25 17:43:22 -07:00
Linus Torvalds c1c008cc55 Merge tag 'staging-3.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging fixes from Greg KH:
 "Here are two tiny staging tree fixes (well, one is for an iio driver,
  but those updates come through the staging tree due to dependancies)

  One fixes a problem with an IIO driver, and the other fixes a bug in
  the comedi driver core"

* tag 'staging-3.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: comedi: bug-fix NULL pointer dereference on failed attach
  iio: adjd_s311: Fix non-scan mode data read
2013-08-25 12:44:15 -07:00
Linus Torvalds 5e25e4f304 Merge tag 'usb-3.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
 "Here are two USB fixes for 3.11-rc7

  One fixes a reported regression in the OHCI driver, and the other
  fixes a reported build breakage in the USB phy drivers"

* tag 'usb-3.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: phy: fix build breakage
  USB: OHCI: add missing PCI PM callbacks to ohci-pci.c
2013-08-25 12:43:44 -07:00
Linus Torvalds 1b4757ee6f Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm
Pull ARM fixes from Russell King:
 "This round of fixes is smaller than previous: a couple more updates
  for the security fixes, and a one-liner kexec fix"

* 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
  ARM: 7816/1: CONFIG_KUSER_HELPERS: fix help text
  ARM: 7815/1: kexec: offline non panic CPUs on Kdump panic
  ARM: 7819/1: fiq: Cast the first argument of flush_icache_range()
2013-08-25 12:41:37 -07:00
Linus Torvalds 4d4323ea2d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Assorted fixes from the last week or so"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: collect_mounts() should return an ERR_PTR
  bfs: iget_locked() doesn't return an ERR_PTR
  efs: iget_locked() doesn't return an ERR_PTR()
  proc: kill the extra proc_readfd_common()->dir_emit_dots()
  cope with potentially long ->d_dname() output for shmem/hugetlb
2013-08-25 12:25:38 -07:00
Linus Torvalds 8495e9c4a9 Merge tag 'acpi-3.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
 "I really hoped that it wouldn't be necessary to change anything in
  ACPI at this point, but it turns out that we need to revert one more
  ACPI video commit causing trouble.

  This reverts a change in the ACPI video driver that caused the ACPI
  backlight initialization to be carried out even if acpi_backlight=vendor
  is passed in the kernel command line which turns out to break things
  at least on one system"

* tag 'acpi-3.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  Revert "ACPI / video: Always call acpi_video_init_brightness() on init"
2013-08-24 11:34:33 -07:00
Linus Torvalds 5befb98b30 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
 "This is a set of small bug fixes for lpfc and zfcp and a fix for a
  fairly nasty bug in sg where a process which cancels I/O completes in
  a kernel thread which would then try to write back to the now gone
  userspace and end up writing to a random kernel address instead"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  [SCSI] zfcp: remove access control tables interface (keep sysfs files)
  [SCSI] zfcp: fix schedule-inside-lock in scsi_device list loops
  [SCSI] zfcp: fix lock imbalance by reworking request queue locking
  [SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal
  [SCSI] lpfc: Don't force CONFIG_GENERIC_CSUM on
2013-08-24 11:33:21 -07:00
Joern Rennecke b0f55f2a1a ARC: [lib] strchr breakage in Big-endian configuration
For a search buffer, 2 byte aligned, strchr() was returning pointer
outside of buffer (buf - 1)

------------->8----------------
    // Input buffer (default 4 byte aigned)
    char *buffer = "1AA_";

    // actual search start (to mimick 2 byte alignment)
    char *current_line = &(buffer[2]);

    // Character to search for
    char c = 'A';

    char *c_pos = strchr(current_line, c);

    printf("%s\n", c_pos) --> 'AA_' as oppose to 'A_'
------------->8----------------

Reported-by: Anton Kolesov <Anton.Kolesov@synopsys.com>
Debugged-by: Anton Kolesov <Anton.Kolesov@synopsys.com>
Cc: <stable@vger.kernel.org> # [3.9 and 3.10]
Cc: Noam Camus <noamc@ezchip.com>
Signed-off-by: Joern Rennecke  <joern.rennecke@embecosm.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-24 11:24:53 -07:00