Reported syzkaller:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: irq_bypass_unregister_consumer+0x9d/0xb70 [irqbypass]
PGD 0
Oops: 0002 [#1] SMP
CPU: 1 PID: 125 Comm: kworker/1:1 Not tainted 4.9.0+ #1
Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
task: ffff9bbe0dfbb900 task.stack: ffffb61802014000
RIP: 0010:irq_bypass_unregister_consumer+0x9d/0xb70 [irqbypass]
Call Trace:
irqfd_shutdown+0x66/0xa0 [kvm]
process_one_work+0x16b/0x480
worker_thread+0x4b/0x500
kthread+0x101/0x140
? process_one_work+0x480/0x480
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x25/0x30
RIP: irq_bypass_unregister_consumer+0x9d/0xb70 [irqbypass] RSP: ffffb61802017e20
CR2: 0000000000000008
The syzkaller folks reported a NULL pointer dereference that due to
unregister an consumer which fails registration before. The syzkaller
creates two VMs w/ an equal eventfd occasionally. So the second VM
fails to register an irqbypass consumer. It will make irqfd as inactive
and queue an workqueue work to shutdown irqfd and unregister the irqbypass
consumer when eventfd is closed. However, the second consumer has been
initialized though it fails registration. So the token(same as the first
VM's) is taken to unregister the consumer through the workqueue, the
consumer of the first VM is found and unregistered, then NULL deref incurred
in the path of deleting consumer from the consumers list.
This patch fixes it by making irq_bypass_register/unregister_consumer()
looks for the consumer entry based on consumer pointer itself instead of
token matching.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Cc: stable@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull timer type cleanups from Thomas Gleixner:
"This series does a tree wide cleanup of types related to
timers/timekeeping.
- Get rid of cycles_t and use a plain u64. The type is not really
helpful and caused more confusion than clarity
- Get rid of the ktime union. The union has become useless as we use
the scalar nanoseconds storage unconditionally now. The 32bit
timespec alike storage got removed due to the Y2038 limitations
some time ago.
That leaves the odd union access around for no reason. Clean it up.
Both changes have been done with coccinelle and a small amount of
manual mopping up"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ktime: Get rid of ktime_equal()
ktime: Cleanup ktime_set() usage
ktime: Get rid of the union
clocksource: Use a plain u64 instead of cycle_t
Pull SMP hotplug notifier removal from Thomas Gleixner:
"This is the final cleanup of the hotplug notifier infrastructure. The
series has been reintgrated in the last two days because there came a
new driver using the old infrastructure via the SCSI tree.
Summary:
- convert the last leftover drivers utilizing notifiers
- fixup for a completely broken hotplug user
- prevent setup of already used states
- removal of the notifiers
- treewide cleanup of hotplug state names
- consolidation of state space
There is a sphinx based documentation pending, but that needs review
from the documentation folks"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/armada-xp: Consolidate hotplug state space
irqchip/gic: Consolidate hotplug state space
coresight/etm3/4x: Consolidate hotplug state space
cpu/hotplug: Cleanup state names
cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions
staging/lustre/libcfs: Convert to hotplug state machine
scsi/bnx2i: Convert to hotplug state machine
scsi/bnx2fc: Convert to hotplug state machine
cpu/hotplug: Prevent overwriting of callbacks
x86/msr: Remove bogus cleanup from the error path
bus: arm-ccn: Prevent hotplug callback leak
perf/x86/intel/cstate: Prevent hotplug callback leak
ARM/imx/mmcd: Fix broken cpu hotplug handling
scsi: qedi: Convert to hotplug state machine
There is no point in having an extra type for extra confusion. u64 is
unambiguous.
Conversion was done with the following coccinelle script:
@rem@
@@
-typedef u64 cycle_t;
@fix@
typedef cycle_t;
@@
-cycle_t
+u64
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
When the state names got added a script was used to add the extra argument
to the calls. The script basically converted the state constant to a
string, but the cleanup to convert these strings into meaningful ones did
not happen.
Replace all the useless strings with 'subsys/xxx/yyy:state' strings which
are used in all the other places already.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20161221192112.085444152@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Unexport the low-level __get_user_pages_unlocked() function and replaces
invocations with calls to more appropriate higher-level functions.
In hva_to_pfn_slow() we are able to replace __get_user_pages_unlocked()
with get_user_pages_unlocked() since we can now pass gup_flags.
In async_pf_execute() and process_vm_rw_single_vec() we need to pass
different tsk, mm arguments so get_user_pages_remote() is the sane
replacement in these cases (having added manual acquisition and release
of mmap_sem.)
Additionally get_user_pages_remote() reintroduces use of the FOLL_TOUCH
flag. However, this flag was originally silently dropped by commit
1e9877902d ("mm/gup: Introduce get_user_pages_remote()"), so this
appears to have been unintentional and reintroducing it is therefore not
an issue.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20161027095141.2569-3-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull KVM updates from Paolo Bonzini:
"Small release, the most interesting stuff is x86 nested virt
improvements.
x86:
- userspace can now hide nested VMX features from guests
- nested VMX can now run Hyper-V in a guest
- support for AVX512_4VNNIW and AVX512_FMAPS in KVM
- infrastructure support for virtual Intel GPUs.
PPC:
- support for KVM guests on POWER9
- improved support for interrupt polling
- optimizations and cleanups.
s390:
- two small optimizations, more stuff is in flight and will be in
4.11.
ARM:
- support for the GICv3 ITS on 32bit platforms"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits)
arm64: KVM: pmu: Reset PMSELR_EL0.SEL to a sane value before entering the guest
KVM: arm/arm64: timer: Check for properly initialized timer on init
KVM: arm/arm64: vgic-v2: Limit ITARGETSR bits to number of VCPUs
KVM: x86: Handle the kthread worker using the new API
KVM: nVMX: invvpid handling improvements
KVM: nVMX: check host CR3 on vmentry and vmexit
KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry
KVM: nVMX: propagate errors from prepare_vmcs02
KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry
KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID
KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation
KVM: nVMX: support restore of VMX capability MSRs
KVM: nVMX: generate non-true VMX MSRs based on true versions
KVM: x86: Do not clear RFLAGS.TF when a singlestep trap occurs.
KVM: x86: Add kvm_skip_emulated_instruction and use it.
KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12
KVM: VMX: Reorder some skip_emulated_instruction calls
KVM: x86: Add a return value to kvm_emulate_cpuid
KVM: PPC: Book3S: Move prototypes for KVM functions into kvm_ppc.h
...
Pull VFIO updates from Alex Williamson:
- VFIO updates for v4.10 primarily include a new Mediated Device
interface, which essentially allows software defined devices to be
exposed to users through VFIO. The host vendor driver providing this
virtual device polices, or mediates user access to the device.
These devices often incorporate portions of real devices, for
instance the primary initial users of this interface expose vGPUs
which allow the user to map mediated devices, or mdevs, to a portion
of a physical GPU. QEMU composes these mdevs into PCI representations
using the existing VFIO user API. This enables both Intel KVM-GT
support, which is also expected to arrive into Linux mainline during
the v4.10 merge window, as well as NVIDIA vGPU, and also Channel I/O
devices (aka CCW devices) for s390 virtualization support. (Kirti
Wankhede, Neo Jia)
- Drop unnecessary uses of pcibios_err_to_errno() (Cao Jin)
- Fixes to VFIO capability chain handling (Eric Auger)
- Error handling fixes for fallout from mdev (Christophe JAILLET)
- Notifiers to expose struct kvm to mdev vendor drivers (Jike Song)
- type1 IOMMU model search fixes (Kirti Wankhede, Neo Jia)
* tag 'vfio-v4.10-rc1' of git://github.com/awilliam/linux-vfio: (30 commits)
vfio iommu type1: Fix size argument to vfio_find_dma() in pin_pages/unpin_pages
vfio iommu type1: Fix size argument to vfio_find_dma() during DMA UNMAP.
vfio iommu type1: WARN_ON if notifier block is not unregistered
kvm: set/clear kvm to/from vfio_group when group add/delete
vfio: support notifier chain in vfio_group
vfio: vfio_register_notifier: classify iommu notifier
vfio: Fix handling of error returned by 'vfio_group_get_from_dev()'
vfio: fix vfio_info_cap_add/shift
vfio/pci: Drop unnecessary pcibios_err_to_errno()
MAINTAINERS: Add entry VFIO based Mediated device drivers
docs: Sample driver to demonstrate how to use Mediated device framework.
docs: Sysfs ABI for mediated device framework
docs: Add Documentation for Mediated devices
vfio: Define device_api strings
vfio_platform: Updated to use vfio_set_irqs_validate_and_prepare()
vfio_pci: Updated to use vfio_set_irqs_validate_and_prepare()
vfio: Introduce vfio_set_irqs_validate_and_prepare()
vfio_pci: Update vfio_pci to use vfio_info_add_capability()
vfio: Introduce common function to add capabilities
vfio iommu: Add blocking notifier to notify DMA_UNMAP
...
When the arch timer code fails to initialize (for example because the
memory mapped timer doesn't work, which is currently seen with the AEM
model), then KVM just continues happily with a final result that KVM
eventually does a NULL pointer dereference of the uninitialized cycle
counter.
Check directly for this in the init path and give the user a reasonable
error in this case.
Cc: Shih-Wei Li <shihwei@cs.columbia.edu>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
The GICv2 spec says in section 4.3.12 that a "CPU targets field bit that
corresponds to an unimplemented CPU interface is RAZ/WI."
Currently we allow the guest to write any value in there and it can
read that back.
Mask the written value with the proper CPU mask to be spec compliant.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Sometimes users need to be aware when a vfio_group attaches to a
KVM or detaches from it. KVM already calls get/put method from vfio to
manipulate the vfio_group reference, it can notify vfio_group in
a similar way.
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Jike Song <jike.song@intel.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We should move the ops->destroy(dev) after the list_del(&dev->vm_node)
so that we don't use "dev" after freeing it.
Fixes: a28ebea2ad ("KVM: Protect device ops->create and list_add with kvm->lock")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The kvm module has the parameters halt_poll_ns, halt_poll_ns_grow, and
halt_poll_ns_shrink. Halt polling was recently added to the powerpc kvm-hv
module and these parameters were essentially duplicated for that. There is
no benefit to this duplication and it can lead to confusion when trying to
tune halt polling.
Thus move the definition of these variables to kvm_host.h and export them.
This will allow the kvm-hv module to use the same module parameters by
accessing these variables, which will be implemented in the next patch,
meaning that they will no longer be duplicated.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
When we inject a level triggerered interrupt (and unless it
is backed by the physical distributor - timer style), we request
a maintenance interrupt. Part of the processing for that interrupt
is to feed to the rest of KVM (and to the eventfd subsystem) the
information that the interrupt has been EOIed.
But that notification only makes sense for SPIs, and not PPIs
(such as the PMU interrupt). Skip over the notification if
the interrupt is not an SPI.
Cc: stable@vger.kernel.org # 4.7+
Fixes: 140b086dd1 ("KVM: arm/arm64: vgic-new: Add GICv2 world switch backend")
Fixes: 59529f69f5 ("KVM: arm/arm64: vgic-new: Add GICv3 world switch backend")
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
This was reported by syzkaller:
[ INFO: possible recursive locking detected ]
4.9.0-rc4+ #49 Not tainted
---------------------------------------------
kworker/2:1/5658 is trying to acquire lock:
([ 1644.769018] (&work->work)
[< inline >] list_empty include/linux/compiler.h:243
[<ffffffff8128dd60>] flush_work+0x0/0x660 kernel/workqueue.c:1511
but task is already holding lock:
([ 1644.769018] (&work->work)
[<ffffffff812916ab>] process_one_work+0x94b/0x1900 kernel/workqueue.c:2093
stack backtrace:
CPU: 2 PID: 5658 Comm: kworker/2:1 Not tainted 4.9.0-rc4+ #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: events async_pf_execute
ffff8800676ff630 ffffffff81c2e46b ffffffff8485b930 ffff88006b1fc480
0000000000000000 ffffffff8485b930 ffff8800676ff7e0 ffffffff81339b27
ffff8800676ff7e8 0000000000000046 ffff88006b1fcce8 ffff88006b1fccf0
Call Trace:
...
[<ffffffff8128ddf3>] flush_work+0x93/0x660 kernel/workqueue.c:2846
[<ffffffff812954ea>] __cancel_work_timer+0x17a/0x410 kernel/workqueue.c:2916
[<ffffffff81295797>] cancel_work_sync+0x17/0x20 kernel/workqueue.c:2951
[<ffffffff81073037>] kvm_clear_async_pf_completion_queue+0xd7/0x400 virt/kvm/async_pf.c:126
[< inline >] kvm_free_vcpus arch/x86/kvm/x86.c:7841
[<ffffffff810b728d>] kvm_arch_destroy_vm+0x23d/0x620 arch/x86/kvm/x86.c:7946
[< inline >] kvm_destroy_vm virt/kvm/kvm_main.c:731
[<ffffffff8105914e>] kvm_put_kvm+0x40e/0x790 virt/kvm/kvm_main.c:752
[<ffffffff81072b3d>] async_pf_execute+0x23d/0x4f0 virt/kvm/async_pf.c:111
[<ffffffff8129175c>] process_one_work+0x9fc/0x1900 kernel/workqueue.c:2096
[<ffffffff8129274f>] worker_thread+0xef/0x1480 kernel/workqueue.c:2230
[<ffffffff812a5a94>] kthread+0x244/0x2d0 kernel/kthread.c:209
[<ffffffff831f102a>] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433
The reason is that kvm_put_kvm is causing the destruction of the VM, but
the page fault is still on the ->queue list. The ->queue list is owned
by the VCPU, not by the work items, so we cannot just add list_del to
the work item.
Instead, use work->vcpu to note async page faults that have been resolved
and will be processed through the done list. There is no need to flush
those.
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
KVM calls kvm_pmu_set_counter_event_type() when PMCCFILTR is configured.
But this function can't deals with PMCCFILTR correctly because the evtCount
bits of PMCCFILTR, which is reserved 0, conflits with the SW_INCR event
type of other PMXEVTYPER<n> registers. To fix it, when eventsel == 0, this
function shouldn't return immediately; instead it needs to check further
if select_idx is ARMV8_PMU_CYCLE_IDX.
Another issue is that KVM shouldn't copy the eventsel bits of PMCCFILTER
blindly to attr.config. Instead it ought to convert the request to the
"cpu cycle" event type (i.e. 0x11).
To support this patch and to prevent duplicated definitions, a limited
set of ARMv8 perf event types were relocated from perf_event.c to
asm/perf_event.h.
Cc: stable@vger.kernel.org # 4.6+
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Wei Huang <wei@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
1) Since commit:41a54482 changed timer enabled variable to per-vcpu,
the correlative comment in kvm_timer_enable is useless now.
2) After the kvm module init successfully, the timecounter is always
non-null, so we can remove the checking of timercounter.
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>