We want another single vector IRQ index to support signaling of
the device request to userspace. Generalize the error reporting
IRQ index to avoid code duplication.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When a request is made to unbind a device from a vfio bus driver,
we need to wait for the device to become unused, ie. for userspace
to release the device. However, we have a long standing TODO in
the code to do something proactive to make that happen. To enable
this, we add a request callback on the vfio bus driver struct,
which is intended to signal the user through the vfio device
interface to release the device. Instead of passively waiting for
the device to become unused, we can now pester the user to give
it up.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Move the iommu_group reference from the device to the vfio_group.
This ensures that the iommu_group persists as long as the vfio_group
remains. This can be important if all of the device from an
iommu_group are removed, but we still have an outstanding vfio_group
reference; we can still walk the empty list of devices.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There's a small window between the vfio bus driver calling
vfio_del_group_dev() and the device being completely unbound where
the vfio group appears to be non-viable. This creates a race for
users like QEMU/KVM where the kvm-vfio module tries to get an
external reference to the group in order to match and release an
existing reference, while the device is potentially being removed
from the vfio bus driver. If the group is momentarily non-viable,
kvm-vfio may not be able to release the group reference until VM
shutdown, making the group unusable until that point.
Bridge the gap between device removal from the group and completion
of the driver unbind by tracking it in a list. The device is added
to the list before the bus driver reference is released and removed
using the existing unbind notifier.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
IOMMU operations can be expensive and it's not very difficult for a
user to give us a lot of work to do for a map or unmap operation.
Killing a large VM will vfio assigned devices can result in soft
lockups and IOMMU tracing shows that we can easily spend 80% of our
time with need-resched set. A sprinkling of conf_resched() calls
after map and unmap calls has a very tiny affect on performance
while resulting in traces with <1% of calls overflowing into needs-
resched.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We currently map invalid and reserved pages, such as often occur from
mapping MMIO regions of a VM through the IOMMU, using single pages.
There's really no reason we can't instead follow the methodology we
use for normal pages and find the largest possible physically
contiguous chunk for mapping. The only difference is that we don't
do locked memory accounting for these since they're not back by RAM.
In most applications this will be a very minor improvement, but when
graphics and GPGPU devices are in play, MMIO BARs become non-trivial.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When unmapping DMA entries we try to rely on the IOMMU API behavior
that allows the IOMMU to unmap a larger area than requested, up to
the size of the original mapping. This works great when the IOMMU
supports superpages *and* they're in use. Otherwise, each PAGE_SIZE
increment is unmapped separately, resulting in poor performance.
Instead we can use the IOVA-to-physical-address translation provided
by the IOMMU API and unmap using the largest contiguous physical
memory chunk available, which is also how vfio/type1 would have
mapped the region. For a synthetic 1TB guest VM mapping and shutdown
test on Intel VT-d (2M IOMMU pagesize support), this achieves about
a 30% overall improvement mapping standard 4K pages, regardless of
IOMMU superpage enabling, and about a 40% improvement mapping 2M
hugetlbfs pages when IOMMU superpages are not available. Hugetlbfs
with IOMMU superpages enabled is effectively unchanged.
Unfortunately the same algorithm does not work well on IOMMUs with
fine-grained superpages, like AMD-Vi, costing about 25% extra since
the IOMMU will automatically unmap any power-of-two contiguous
mapping we've provided it. We add a routine and a domain flag to
detect this feature, leaving AMD-Vi unaffected by this unmap
optimization.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Pull ARM SoC fixes from Olof Johansson:
"One more week's worth of fixes. Worth pointing out here are:
- A patch fixing detaching of iommu registrations when a device is
removed -- earlier the ops pointer wasn't managed properly
- Another set of Renesas boards get the same GIC setup fixup as
others have in previous -rcs
- Serial port aliases fixups for sunxi. We did the same to tegra but
we caught that in time before the merge window due to more machines
being affected. Here it took longer for anyone to notice.
- A couple more DT tweaks on sunxi
- A follow-up patch for the mvebu coherency disabling in last -rc
batch"
* tag 'armsoc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
arm: dma-mapping: Set DMA IOMMU ops in arm_iommu_attach_device()
ARM: shmobile: r8a7790: Instantiate GIC from C board code in legacy builds
ARM: shmobile: r8a73a4: Instantiate GIC from C board code in legacy builds
ARM: mvebu: don't set the PL310 in I/O coherency mode when I/O coherency is disabled
ARM: sunxi: dt: Fix aliases
ARM: dts: sun4i: Add simplefb node with de_fe0-de_be0-lcd0-hdmi pipeline
ARM: dts: sun6i: ippo-q8h-v5: Fix serial0 alias
ARM: dts: sunxi: Fix usb-phy support for sun4i/sun5i
Pull input layer updates from Dmitry Torokhov:
"Just a few quirks for PS/2 this time"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: elantech - add more Fujtisu notebooks to force crc_enabled
Input: i8042 - add noloop quirk for Medion Akoya E7225 (MD98857)
Input: synaptics - adjust min/max for Lenovo ThinkPad X1 Carbon 2nd
Commit 8eb23b9f35 ("sched: Debug nested sleeps") added code to report
on nested sleep conditions, which we generally want to avoid because the
inner sleeping operation can re-set the thread state to TASK_RUNNING,
but that will then cause the outer sleep loop not actually sleep when it
calls schedule.
However, that's actually valid traditional behavior, with the inner
sleep being some fairly rare case (like taking a sleeping lock that
normally doesn't actually need to sleep).
And the debug code would actually change the state of the task to
TASK_RUNNING internally, which makes that kind of traditional and
working code not work at all, because now the nested sleep doesn't just
sometimes cause the outer one to not block, but will cause it to happen
every time.
In particular, it will cause the cardbus kernel daemon (pccardd) to
basically busy-loop doing scheduling, converting a laptop into a heater,
as reported by Bruno Prémont. But there may be other legacy uses of
that nested sleep model in other drivers that are also likely to never
get converted to the new model.
This fixes both cases:
- don't set TASK_RUNNING when the nested condition happens (note: even
if WARN_ONCE() only _warns_ once, the return value isn't whether the
warning happened, but whether the condition for the warning was true.
So despite the warning only happening once, the "if (WARN_ON(..))"
would trigger for every nested sleep.
- in the cases where we knowingly disable the warning by using
"sched_annotate_sleep()", don't change the task state (that is used
for all core scheduling decisions), instead use '->task_state_change'
that is used for the debugging decision itself.
(Credit for the second part of the fix goes to Oleg Nesterov: "Can't we
avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the
suggested change to use 'task_state_change' as part of the test)
Reported-and-bisected-by: Bruno Prémont <bonbons@linux-vserver.org>
Tested-by: Rafael J Wysocki <rjw@rjwysocki.net>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Cc: Ilya Dryomov <ilya.dryomov@inktank.com>,
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>,
Cc: Davidlohr Bueso <dave@stgolabs.net>,
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge "Third Round of Renesas ARM Based SoC Fixes for v3.19" from Simon Horman:
* Instantiate GIC from C board code in legacy builds on r8a7790 and r8a73a4
* tag 'renesas-soc-fixes3-for-v3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas:
ARM: shmobile: r8a7790: Instantiate GIC from C board code in legacy builds
ARM: shmobile: r8a73a4: Instantiate GIC from C board code in legacy builds
Signed-off-by: Olof Johansson <olof@lixom.net>
Pull i2c fixes from Wolfram Sang:
"i2c driver bugfixes (s3c2410, slave-eeprom, sh_mobile), size
regression "bugfix" (i2c slave), documentation bugfix (st).
Also, one documentation update (da9063), so some devicetrees can now
be verified"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: sh_mobile: terminate DMA reads properly
i2c: Only include slave support if selected
i2c: s3c2410: fix ABBA deadlock by keeping clock prepared
i2c: slave-eeprom: fix boundary check when using sysfs
i2c: st: Rename clock reference to something that exists
DT: i2c: Add devices handled by the da9063 MFD driver
Pull char/misc driver fixes from Greg KH:
"Here are two tiny patches, one fixing up the drivers/Kconfig file, and
one adding a MAINTAINERS entry for the UIO git tree"
* tag 'char-misc-3.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
drivers/Kconfig: remove duplicate entry for soc
MAINTAINERS: add git url entry for UIO
Pull staging tree fixes from Greg KH:
"Here are two tiny staging tree fixes. One for the nvec driver to
resolve a reported problem, and one to add a MAINTAINERS entry for the
Android drivers"
* tag 'staging-3.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
MAINTAINERS: add Android driver entries
staging: nvec: specify a platform-device base id
Pull USB fixes from Greg KH:
"Here are some small USB fixes and quirk additions for 3.19-rc7.
All have been in linux-next for a while with no reported problems"
* tag 'usb-3.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: Add OTG PET device to TPL
usb-storage/SCSI: blacklist FUA on JMicron 152d:2566 USB-SATA controller
uas: Add no-report-opcodes quirk for Simpletech devices with id 4971:8017
storage: Revise/fix quirk for 04E6:000F SCM USB-SCSI converter
usb: phy: never defer probe in non-OF case
usb: dwc2: call dwc2_is_controller_alive() under spinlock
Pull perf fixes from Ingo Molnar:
"Mostly tooling fixes, but also an event groups fix, two PMU driver
fixes and a CPU model variant addition"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Tighten (and fix) the grouping condition
perf/x86/intel: Add model number for Airmont
perf/rapl: Fix crash in rapl_scale()
perf/x86/intel/uncore: Move uncore_box_init() out of driver initialization
perf probe: Fix probing kretprobes
perf symbols: Introduce 'for' method to iterate over the symbols with a given name
perf probe: Do not rely on map__load() filter to find symbols
perf symbols: Introduce method to iterate symbols ordered by name
perf symbols: Return the first entry with a given name in find_by_name method
perf annotate: Fix memory leaks in LOCK handling
perf annotate: Handle ins parsing failures
perf scripting perl: Force to use stdbool
perf evlist: Remove extraneous 'was' on error message
Pull btrfs fix from Chris Mason:
"We have one more fix for btrfs in my for-linus branch - this was a bug
in the new raid5/6 scrubbing support"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix raid56 scrub failed in xfstests btrfs/072
Pull quota and UDF fix from Jan Kara:
"A fix for UDF to properly free preallocated blocks and a fix for quota
so that Q_GETQUOTA quotactl reports correct numbers for XFS filesystem
(and similarly Q_XGETQUOTA quotactl works properly for other
filesystems)"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: Switch ->get_dqblk() and ->set_dqblk() to use bytes as space units
udf: Release preallocation on last writeable close
Pull KVM fixes from Paolo Bonzini:
"The ARM changes are largish, but not too scary. And a simple fix for
x86 (bug introduced in 3.19)"
(Paolo sayus these are the "Final" fixes. We'll see).
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: check LAPIC presence when building apic_map
arm/arm64: KVM: Use kernel mapping to perform invalidation on page fault
arm/arm64: KVM: Invalidate data cache on unmap
arm/arm64: KVM: Use set/way op trapping to track the state of the caches
Pull IOMMU fixes from Joerg Roedel:
"Two small fixes for the Tegra GART IOMMU driver:
- provide a .map_sg function for iommu_ops
- do not register Tegra GART driver as a workaround because of issues
with it when used from DRM code"
* tag 'iommu-fixes-v3.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/tegra: gart: Provide default ->map_sg() callback
iommu/tegra: gart: Do not register with bus
Pull intel and dp mst drm fixes from Dave Airlie:
"Intel had a few more fixes lined up and no point me sitting on them,
along with a DP MST fix from Rob for a race at undock + vt switch"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm: fix fb-helper vs MST dangling connector ptrs (v2)
drm/i915: BDW Fix Halo PCI IDs marked as ULT.
drm/i915: Fix and clean BDW PCH identification
drm/i915: Only fence tiled region of object.
drm/i915: fix inconsistent brightness after resume
drm/i915: Init PPGTT before context enable
DMA read requests could miss proper termination, so two more bytes would
have been read via PIO overwriting the end of the buffer with wrong
data. Make DMA stop handling more readable while we are here.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>