Commit Graph

75 Commits

Author SHA1 Message Date
David Rientjes
668f9abbd4 mm: close PageTail race
Commit bf6bddf192 ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).

This results in a very rare NULL pointer dereference on the
aforementioned page_count(page).  Indeed, anything that does
compound_head(), including page_count(), is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.

This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation.  This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling.  The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.

This is the safest way to ensure we see the head page that we are
expecting.  PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.

Hugetlbfs is the exception: we don't enforce a store memory barrier
during init since no race is possible.
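
The ordering described above can be sketched in userspace C.  This is only
an analogue, not the kernel code: struct toy_page and both function names
are hypothetical, and C11 release/acquire ordering stands in for the store
and read memory barriers the commit mentions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; not the kernel's layout. */
struct toy_page {
    atomic_bool tail;                    /* models PageTail() */
    struct toy_page *_Atomic first_page; /* models page->first_page */
};

/* Writer side, loosely modelled on prep_compound_page(): store the head
 * pointer first, then publish the tail flag with release ordering so the
 * pointer is visible before the flag is. */
static void toy_prep_tail(struct toy_page *tail, struct toy_page *head)
{
    assert(head != NULL);
    atomic_store_explicit(&tail->first_page, head, memory_order_relaxed);
    atomic_store_explicit(&tail->tail, true, memory_order_release);
}

/* Reader side, loosely modelled on compound_head(): the head pointer is
 * only read after the tail flag was observed with acquire ordering, so it
 * cannot be NULL or stale once the flag is seen. */
static struct toy_page *toy_compound_head(struct toy_page *page)
{
    if (atomic_load_explicit(&page->tail, memory_order_acquire))
        return atomic_load_explicit(&page->first_page, memory_order_relaxed);
    return page;
}
```

A page that is not marked as a tail is returned unchanged; a tail page
always resolves to the head that was stored before the flag was set.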

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 07:55:47 -08:00
Linus Torvalds
1b17366d69 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Ben Herrenschmidt:
 "So here's my next branch for powerpc.  A bit late as I was on vacation
  last week.  It's mostly the same stuff that was in next already, I
  just added two patches today which are the wiring up of lockref for
  powerpc, which for some reason fell through the cracks last time and
  is trivial.

  The highlights are, in addition to a bunch of bug fixes:

   - Reworked Machine Check handling on kernels running without a
     hypervisor (or acting as a hypervisor).  Provides hooks to handle
     some errors in real mode such as TLB errors, handle SLB errors,
     etc...

   - Support for retrieving memory error information from the service
     processor on IBM servers running without a hypervisor and routing
     them to the memory poison infrastructure.

   - _PAGE_NUMA support on server processors

   - 32-bit BookE relocatable kernel support

   - FSL e6500 hardware tablewalk support

   - A bunch of new/revived board support

   - FSL e6500 deeper idle states and altivec powerdown support

  You'll notice a generic mm change here, it has been acked by the
  relevant authorities and is a pre-req for our _PAGE_NUMA support"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (121 commits)
  powerpc: Implement arch_spin_is_locked() using arch_spin_value_unlocked()
  powerpc: Add support for the optimised lockref implementation
  powerpc/powernv: Call OPAL sync before kexec'ing
  powerpc/eeh: Escalate error on non-existing PE
  powerpc/eeh: Handle multiple EEH errors
  powerpc: Fix transactional FP/VMX/VSX unavailable handlers
  powerpc: Don't corrupt transactional state when using FP/VMX in kernel
  powerpc: Reclaim two unused thread_info flag bits
  powerpc: Fix races with irq_work
  Move precessing of MCE queued event out from syscall exit path.
  pseries/cpuidle: Remove redundant call to ppc64_runlatch_off() in cpu idle routines
  powerpc: Make add_system_ram_resources() __init
  powerpc: add SATA_MV to ppc64_defconfig
  powerpc/powernv: Increase candidate fw image size
  powerpc: Add debug checks to catch invalid cpu-to-node mappings
  powerpc: Fix the setup of CPU-to-Node mappings during CPU online
  powerpc/iommu: Don't detach device without IOMMU group
  powerpc/eeh: Hotplug improvement
  powerpc/eeh: Call opal_pci_reinit() on powernv for restoring config space
  powerpc/eeh: Add restore_config operation
  ...
2014-01-27 21:11:26 -08:00
Linus Torvalds
2d08cd0ef8 Merge tag 'vfio-v3.14-rc1' of git://github.com/awilliam/linux-vfio
Pull vfio update from Alex Williamson:
 - convert to misc driver to support module auto loading
 - remove unnecessary and dangerous use of device_lock

* tag 'vfio-v3.14-rc1' of git://github.com/awilliam/linux-vfio:
  vfio-pci: Don't use device_lock around AER interrupt setup
  vfio: Convert control interface to misc driver
  misc: Reserve minor for VFIO
2014-01-24 17:42:31 -08:00
Alex Williamson
890ed578df vfio-pci: Use pci "try" reset interface
PCI resets will attempt to take the device_lock for any device to be
reset.  This is a problem if that lock is already held, for instance
in the device remove path.  It's not sufficient to simply kill the
user process or skip the reset if called after .remove as a race could
result in the same deadlock.  Instead, we handle all resets as "best
effort" using the PCI "try" reset interfaces.  This prevents the user
from being able to induce a deadlock by triggering a reset.
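
The "best effort" idea can be illustrated with a small userspace sketch.
struct toy_dev and toy_try_reset() are hypothetical names, and an
atomic_flag stands in for device_lock; the point is only that the reset
path never blocks on a lock that may already be held.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device; the flag stands in for device_lock. */
struct toy_dev {
    atomic_flag locked;
    int resets;
};

/* Try to take the lock without waiting; true on success. */
static bool toy_trylock(struct toy_dev *dev)
{
    return !atomic_flag_test_and_set_explicit(&dev->locked,
                                              memory_order_acquire);
}

static void toy_unlock(struct toy_dev *dev)
{
    atomic_flag_clear_explicit(&dev->locked, memory_order_release);
}

/* Best-effort reset: returns 0 if the reset ran, or -EAGAIN if the lock
 * was already held (for instance by a remove path) and the reset was
 * skipped instead of deadlocking. */
static int toy_try_reset(struct toy_dev *dev)
{
    assert(dev != NULL);
    if (!toy_trylock(dev))
        return -EAGAIN;
    dev->resets++;
    toy_unlock(dev);
    return 0;
}
```

A caller that sees -EAGAIN can retry later or give up; either way it does
not sleep while another path holds the lock.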

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2014-01-15 10:43:17 -07:00
Alex Williamson
3be3a074cf vfio-pci: Don't use device_lock around AER interrupt setup
device_lock is much too prone to lockups.  For instance if we have a
pending .remove then device_lock is already held.  If userspace
attempts to modify AER signaling after that point, a deadlock occurs.
eventfd setup/teardown is already protected in vfio with the igate
mutex.  AER is not a high performance interrupt, so we can also use
the same mutex to protect signaling versus setup races.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-01-14 16:12:55 -07:00
Alistair Popple
e589a4404f powerpc/iommu: Update constant names to reflect their hardcoded page size
The powerpc iommu uses a hardcoded page size of 4K. This patch changes
the name of the IOMMU_PAGE_* macros to reflect the hardcoded values. A
future patch will use the existing names to support dynamic page
sizes.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-12-30 14:17:06 +11:00
Alex Williamson
d10999016f vfio: Convert control interface to misc driver
This change allows us to support module auto loading using devname
support in userspace tools.  With this, /dev/vfio/vfio will always
be present and opening it will cause the vfio module to load.  This
should avoid needing to configure the system to statically load
vfio in order to get libvirt to correctly detect support for it.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-12-19 10:17:13 -07:00
Alex Williamson
274127a1fd PCI: Rename PCI_VC_PORT_REG1/2 to PCI_VC_PORT_CAP1/2
These are a set of two capability registers; it's pretty much a given
that they're registers, so reflect their purpose in the name.

Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2013-12-17 17:49:39 -07:00
Antonios Motakis
d93b3ac0ed VFIO: vfio_iommu_type1: fix bug caused by break in nested loop
In vfio_iommu_type1.c there is a bug in vfio_dma_do_map, when checking
that pages are not already mapped. Since the check is being done in a
for loop nested within the main loop, breaking out of it does not create
the intended behavior. If the underlying IOMMU driver returns a non-NULL
value, this will be ignored and mapping the DMA range will be attempted
anyway, leading to unpredictable behavior.

This interacts badly with the ARM SMMU driver issue fixed in the patch
that was submitted with the title:
"[PATCH 2/2] ARM: SMMU: return NULL on error in arm_smmu_iova_to_phys"
Both fixes are required in order to use the vfio_iommu_type1 driver
with an ARM SMMU.

This patch refactors the function slightly, in order to also make this
kind of bug less likely.
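
The class of bug described above can be reproduced in a few lines.  This
is a simplified illustration, not the vfio_dma_do_map() code: the
functions and the bool array standing in for IOMMU state are hypothetical.
The buggy variant breaks out of the inner check loop only, so the mapping
step still runs; the refactored variant moves the check into a helper
whose result cannot be ignored.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Buggy shape: break leaves only the inner loop, so an already-mapped
 * page is detected and then ignored. */
static int toy_map_buggy(bool *mapped, size_t npages, size_t chunk)
{
    assert(chunk > 0);
    for (size_t start = 0; start < npages; start += chunk) {
        size_t end = start + chunk < npages ? start + chunk : npages;
        for (size_t i = start; i < end; i++) {
            if (mapped[i])
                break;          /* meant to abort the whole mapping */
        }
        for (size_t i = start; i < end; i++)
            mapped[i] = true;   /* runs regardless of the check */
    }
    return 0;
}

/* Helper with a single, unambiguous result. */
static bool toy_range_busy(const bool *mapped, size_t start, size_t end)
{
    for (size_t i = start; i < end; i++)
        if (mapped[i])
            return true;
    return false;
}

/* Refactored shape: the check result decides whether mapping proceeds. */
static int toy_map_fixed(bool *mapped, size_t npages, size_t chunk)
{
    assert(chunk > 0);
    for (size_t start = 0; start < npages; start += chunk) {
        size_t end = start + chunk < npages ? start + chunk : npages;
        if (toy_range_busy(mapped, start, end))
            return -EEXIST;
        for (size_t i = start; i < end; i++)
            mapped[i] = true;
    }
    return 0;
}
```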

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-10-11 10:40:46 -06:00
Alex Williamson
8b27ee60bf vfio-pci: PCI hot reset interface
The current VFIO_DEVICE_RESET interface only maps to PCI use cases
where we can isolate the reset to the individual PCI function.  This
means the device must support FLR (PCIe or AF), PM reset on D3hot->D0
transition, device specific reset, or be a singleton device on a bus
for a secondary bus reset.  FLR does not have widespread support,
PM reset is not very reliable, and bus topology is dictated by the
system and device design.  We need to provide a means for a user to
induce a bus reset in cases where the existing mechanisms are not
available or not reliable.

This device specific extension to VFIO provides the user with this
ability.  Two new ioctls are introduced:
 - VFIO_DEVICE_PCI_GET_HOT_RESET_INFO
 - VFIO_DEVICE_PCI_HOT_RESET

The first provides the user with information about the extent of
devices affected by a hot reset.  This is essentially a list of
devices and the IOMMU groups they belong to.  The user may then
initiate a hot reset by calling the second ioctl.  We must be
careful that the user has ownership of all the affected devices
found via the first ioctl, so the second ioctl takes a list of file
descriptors for the VFIO groups affected by the reset.  Each group
must have IOMMU protection established for the ioctl to succeed.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-09-04 11:28:04 -06:00
Alex Williamson
17638db1b8 vfio-pci: Test for extended config space
Having PCIe/PCI-X capability isn't enough to assume that there are
extended capabilities.  Both specs define that the first capability
header is all zero if there are no extended capabilities.  Testing
for this avoids an erroneous message about hiding capability 0x0 at
offset 0x100.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-09-04 10:58:52 -06:00
Alex Williamson
20e7745784 vfio-pci: Use fdget() rather than eventfd_fget()
eventfd_fget() tests to see whether the file is an eventfd file, which
we then immediately pass to eventfd_ctx_fileget(), which again tests
whether the file is an eventfd file.  Simplify slightly by using
fdget() so that we only test that we're looking at an eventfd once.
fget() could also be used, but fdget() makes use of fget_light() for
another slight optimization.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-08-28 09:49:55 -06:00
Alex Williamson
5d042fbdbb vfio: Add O_CLOEXEC flag to vfio device fd
Add the default O_CLOEXEC flag for device file descriptors.  This is
generally considered a safer option as it allows the user a race free
option to decide whether file descriptors are inherited across exec,
with the default avoiding file descriptor leaks.

Reported-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-08-22 10:33:41 -06:00
Yann Droneaud
a5d550703d vfio: use get_unused_fd_flags(0) instead of get_unused_fd()
Macro get_unused_fd() is used to allocate a file descriptor with
default flags. Those default flags (0) can be "unsafe":
O_CLOEXEC must be used by default to not leak file descriptor
across exec().

Instead of macro get_unused_fd(), functions anon_inode_getfd()
or get_unused_fd_flags() should be used with flags given by userspace.
If not possible, flags should be set to O_CLOEXEC to provide userspace
with a default safe behavior.

In a further patch, get_unused_fd() will be removed so that
new code starts using anon_inode_getfd() or get_unused_fd_flags()
with correct flags.

This patch replaces calls to get_unused_fd() with an equivalent call to
get_unused_fd_flags(0) to preserve current behavior for existing code.

The hard coded flag value (0) should be reviewed on a per-subsystem basis,
and, if possible, set to O_CLOEXEC.

Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://lkml.kernel.org/r/cover.1376327678.git.ydroneaud@opteya.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-08-22 10:20:05 -06:00
Alexey Kardashevskiy
6cdd978213 vfio: add external user support
VFIO is designed to be used via ioctls on file descriptors
returned by VFIO.

However in some situations support for an external user is required.
The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
use the existing VFIO groups for exclusive access in real/virtual mode
on a host to avoid passing map/unmap requests to the user space which
would make things pretty slow.

The protocol includes:

1. do normal VFIO init operation:
	- opening a new container;
	- attaching group(s) to it;
	- setting an IOMMU driver for a container.
When IOMMU is set for a container, all groups in it are
considered ready to use by an external user.

2. User space passes a group fd to an external user.
The external user calls vfio_group_get_external_user()
to verify that:
	- the group is initialized;
	- IOMMU is set for it.
If both checks pass, vfio_group_get_external_user()
increments the container user counter to prevent
the VFIO group from disposal before KVM exits.

3. The external user calls vfio_external_user_iommu_id()
to know an IOMMU ID. PPC64 KVM uses it to link logical bus
number (LIOBN) with IOMMU ID.

4. When the external KVM finishes, it calls
vfio_group_put_external_user() to release the VFIO group.
This call decrements the container user counter.
Everything gets released.
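
Steps 2 to 4 amount to a checked reference count.  The toy model below
mirrors the names from the protocol with a toy_ prefix, but the structures
and signatures are assumptions made for illustration, not the in-kernel
API, and the counter is not thread-safe.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container state. */
struct toy_container {
    bool iommu_set;   /* an IOMMU driver has been set */
    int users;        /* container user counter */
    int iommu_id;
};

/* Hypothetical group; container is NULL until the group is attached. */
struct toy_group {
    struct toy_container *container;
};

/* Step 2: verify the group is initialized and has an IOMMU, then pin it. */
static int toy_group_get_external_user(struct toy_group *g)
{
    if (g->container == NULL || !g->container->iommu_set)
        return -EINVAL;
    g->container->users++;
    return 0;
}

/* Step 3: report the IOMMU ID of a group that was successfully pinned. */
static int toy_external_user_iommu_id(const struct toy_group *g)
{
    return g->container->iommu_id;
}

/* Step 4: drop the reference taken in step 2. */
static void toy_group_put_external_user(struct toy_group *g)
{
    assert(g->container->users > 0);
    g->container->users--;
}
```

While the counter is non-zero the group must not be torn down, which is
what keeps it alive until the external user exits.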

The "vfio: Limit group opens" patch is also required for consistency.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-08-05 10:52:36 -06:00
Alex Williamson
d24cdbfd28 vfio-pci: Avoid deadlock on remove
If an attempt is made to unbind a device from vfio-pci while that
device is in use, the request is blocked until the device becomes
unused.  Unfortunately, that unbind path still grabs the device_lock,
which certain things like __pci_reset_function() also want to take.
This means we need to try to acquire the locks ourselves and use the
pre-locked version, __pci_reset_function_locked().

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-07-24 16:36:41 -06:00
Alex Williamson
c64019302b vfio: Ignore sprurious notifies
Remove debugging WARN_ON if we get a spurious notify for a group that
no longer exists.  No reports of anyone hitting this, but it would
likely be a race and not a bug if they did.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-07-24 16:36:40 -06:00
Alex Williamson
de9c7602ca vfio: Don't overreact to DEL_DEVICE
BUS_NOTIFY_DEL_DEVICE triggers IOMMU drivers to remove devices from
their iommu group, but there's really nothing we can do about it at
this point.  If the device is in use, then the vfio sub-driver will
block the device_del from completing until it's released.  If the
device is not in use or not owned by a vfio sub-driver, then we
really don't care that it's being removed.

The current code can be triggered just by unloading an sr-iov driver
(ex. igb) while the VFs are attached to vfio-pci because it makes an
incorrect assumption about the ordering of driver remove callbacks
vs the DEL_DEVICE notification.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-07-24 16:36:00 -06:00
Linus Torvalds
15a49b9a90 Merge tag 'vfio-v3.11' of git://github.com/awilliam/linux-vfio
Pull vfio updates from Alex Williamson:
 "Largely hugepage support for vfio/type1 iommu and surrounding cleanups
  and fixes"

* tag 'vfio-v3.11' of git://github.com/awilliam/linux-vfio:
  vfio/type1: Fix leak on error path
  vfio: Limit group opens
  vfio/type1: Fix missed frees and zero sized removes
  vfio: fix documentation
  vfio: Provide module option to disable vfio_iommu_type1 hugepage support
  vfio: hugepage support for vfio_iommu_type1
  vfio: Convert type1 iommu to use rbtree
2013-07-10 14:50:08 -07:00
Linus Torvalds
65b97fb730 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Ben Herrenschmidt:
 "This is the powerpc changes for the 3.11 merge window.  In addition to
  the usual bug fixes and small updates, the main highlights are:

   - Support for transparent huge pages by Aneesh Kumar for 64-bit
     server processors.  This allows the use of 16M pages as transparent
     huge pages on kernels compiled with a 64K base page size.

   - Base VFIO support for KVM on power by Alexey Kardashevskiy

   - Wiring up of our nvram to the pstore infrastructure, including
     putting compressed oopses in there by Aruna Balakrishnaiah

   - Move, rework and improve our "EEH" (basically PCI error handling
     and recovery) infrastructure.  It is no longer specific to pseries
     but is now usable by the new "powernv" platform as well (no
     hypervisor) by Gavin Shan.

   - I fixed some bugs in our math-emu instruction decoding and made it
     usable to emulate some optional FP instructions on processors with
     hard FP that lack them (such as fsqrt on Freescale embedded
     processors).

   - Support for Power8 "Event Based Branch" facility by Michael
     Ellerman.  This facility allows what is basically "userspace
     interrupts" for performance monitor events.

   - A bunch of Transactional Memory vs.  Signals bug fixes and HW
     breakpoint/watchpoint fixes by Michael Neuling.

  And more ...  I apologize in advance if I've failed to highlight
  something that somebody deemed worth it."

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
  pstore: Add hsize argument in write_buf call of pstore_ftrace_call
  powerpc/fsl: add MPIC timer wakeup support
  powerpc/mpic: create mpic subsystem object
  powerpc/mpic: add global timer support
  powerpc/mpic: add irq_set_wake support
  powerpc/85xx: enable coreint for all the 64bit boards
  powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx
  powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig
  powerpc/mpic: Add get_version API both for internal and external use
  powerpc: Handle both new style and old style reserve maps
  powerpc/hw_brk: Fix off by one error when validating DAWR region end
  powerpc/pseries: Support compression of oops text via pstore
  powerpc/pseries: Re-organise the oops compression code
  pstore: Pass header size in the pstore write callback
  powerpc/powernv: Fix iommu initialization again
  powerpc/pseries: Inform the hypervisor we are using EBB regs
  powerpc/perf: Add power8 EBB support
  powerpc/perf: Core EBB support for 64-bit book3s
  powerpc/perf: Drop MMCRA from thread_struct
  powerpc/perf: Don't enable if we have zero events
  ...
2013-07-04 10:29:23 -07:00
Alex Williamson
8d38ef1948 vfio/type1: Fix leak on error path
We also don't handle unpinning zero pages as an error on other exits
so we can fix that inconsistency by rolling in the next conditional
return.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-07-01 08:28:58 -06:00
Al Viro
a47df1518e vfio: remap_pfn_range() sets all those flags...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:41 +04:00
Alex Williamson
6d6768c61b vfio: Limit group opens
vfio_group_fops_open attempts to limit concurrent sessions by
disallowing opens once group->container is set.  This really doesn't
do what we want and allows for inconsistent behavior, for instance a
group can be opened twice, then a container set giving the user two
file descriptors to the group.  But then it won't allow more to be
opened.  There's not much reason to have the group opened multiple
times since most access is through devices or the container, so
complete what the original code intended and only allow a single
instance.
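
Allowing only a single instance can be sketched with an atomic
compare-and-exchange on an "opened" flag.  The struct and function names
here are hypothetical and the code is a userspace illustration of the
idea, not the vfio_group_fops_open() implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical group that may be opened by one user at a time. */
struct single_open_group {
    atomic_int opened;   /* 0 = free, 1 = open */
};

/* Succeeds for the first opener only; later attempts get -EBUSY until
 * the group is released. */
static int group_open(struct single_open_group *g)
{
    int expected = 0;
    if (!atomic_compare_exchange_strong(&g->opened, &expected, 1))
        return -EBUSY;
    return 0;
}

/* Mark the group free again so that it can be reopened. */
static void group_release(struct single_open_group *g)
{
    assert(atomic_load(&g->opened) == 1);
    atomic_store(&g->opened, 0);
}
```

Because the check and the update happen in one atomic step, two
concurrent openers cannot both succeed, unlike a test of a field that is
only set later.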

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-06-25 16:06:54 -06:00
Alex Williamson
f5bfdbf252 vfio/type1: Fix missed frees and zero sized removes
With hugepage support we can only unmap properly aligned and sized ranges.
We only guarantee that we can unmap the same ranges mapped and not
arbitrary sub-ranges.  This means we might not free anything or might
free more than requested.  The vfio unmap interface started storing
the unmapped size to return to userspace to handle this.  This patch
fixes a few places where we don't properly handle those cases, moves
a memory allocation to a place where failure is an option and checks
our loops to make sure we don't get into an infinite loop trying to
remove an overlap.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-06-25 16:01:44 -06:00
Alex Williamson
5c6c2b21ec vfio: Provide module option to disable vfio_iommu_type1 hugepage support
Add a module option to vfio_iommu_type1 to disable IOMMU hugepage
support.  This causes iommu_map to only be called with single page
mappings, disabling the IOMMU driver's ability to use hugepages.
This option can be enabled by loading vfio_iommu_type1 with
disable_hugepages=1 or dynamically through sysfs.  If enabled
dynamically, only new mappings are restricted.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-06-21 09:38:11 -06:00