To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the numbers now overlap, it makes sense to enumerate
them in asm/kvm_host.h rather than linux/kvm_host.h. Functions
that refer to architecture-specific requests are also moved
to arch/.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
[Takuya moved all subsequent constants to fill the void, but that
is useless in view of the following patches. So this change looks
nothing like the original. - Paolo]
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It looks like this in action:
kvm [5197]: vcpu0, guest rIP: 0xffffffff810187ba unhandled rdmsr: 0xc001102
and helps to pinpoint quickly where in the guest we did the unsupported
thing.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Per Hyper-V specification (and as required by Hyper-V-aware guests),
SynIC provides 4 per-vCPU timers. Each timer is programmed via a pair
of MSRs, and signals expiration by delivering a special format message
to the configured SynIC message slot and triggering the corresponding
synthetic interrupt.
Note: as implemented by this patch, all periodic timers are "lazy"
(i.e. if the vCPU wasn't scheduled for more than the timer period the
timer events are lost), regardless of the corresponding configuration
MSR. If deemed necessary, the "catch up" mode (the timer period is
shortened until the timer catches up) will be implemented later.
Changes v2:
* Use remainder to calculate periodic timer expiration time
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The SynIC message protocol mandates that the message slot is claimed
by atomically setting message type to something other than HVMSG_NONE.
If another message is to be delivered while the slot is still busy,
message pending flag is asserted to indicate to the guest that the
hypervisor wants to be notified when the slot is released.
To make sure the protocol works regardless of where the message
sources are (kernel or userspace), clear the pending flag on SINT ACK
notification, and let the message sources compete for the slot again.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM creates debugfs files to export VM statistics to userland. To be
able to remove them on kvm exit it tracks the files' dentries.
Since their parent directory is also tracked and since each parent
direntry knows its children we can easily remove them by using
debugfs_remove_recursive(kvm_debugfs_dir). Therefore we don't
need the extra tracking in the kvm_stats_debugfs_item anymore.
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-By: Sascha Silbe <silbe@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
This patch makes kvm_is_visible_gfn return bool due to this particular
function only using either one or zero as its return value.
No functional change.
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A new vcpu exit is introduced to notify the userspace of the
changes in Hyper-V SynIC configuration triggered by guest writing to the
corresponding MSRs.
Changes v4:
* exit into userspace only if guest writes into SynIC MSR's
Changes v3:
* added KVM_EXIT_HYPERV types and structs notes into docs
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SynIC (synthetic interrupt controller) is a lapic extension,
which is controlled via MSRs and maintains for each vCPU
- 16 synthetic interrupt "lines" (SINT's); each can be configured to
trigger a specific interrupt vector optionally with auto-EOI
semantics
- a message page in the guest memory with 16 256-byte per-SINT message
slots
- an event flag page in the guest memory with 16 2048-bit per-SINT
event flag areas
The host triggers a SINT whenever it delivers a new message to the
corresponding slot or flips an event flag bit in the corresponding area.
The guest informs the host that it can try delivering a message by
explicitly asserting EOI in lapic or writing to End-Of-Message (EOM)
MSR.
The userspace (qemu) triggers interrupts and receives EOM notifications
via irqfd with resampler; for that, a GSI is allocated for each
configured SINT, and irq_routing api is extended to support GSI-SINT
mapping.
Changes v4:
* added activation of SynIC by vcpu KVM_ENABLE_CAP
* added per SynIC active flag
* added deactivation of APICv upon SynIC activation
Changes v3:
* added KVM_CAP_HYPERV_SYNIC and KVM_IRQ_ROUTING_HV_SINT notes into
docs
Changes v2:
* do not use posted interrupts for Hyper-V SynIC AutoEOI vectors
* add Hyper-V SynIC vectors into EOI exit bitmap
* Hyper-V SyniIC SINT msr write logic simplified
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Actually kvm_arch_irq_routing_update() should be
kvm_arch_post_irq_routing_update() as it's called at the end
of irq routing update.
This renaming frees kvm_arch_irq_routing_update function name.
kvm_arch_irq_routing_update() weak function which will be used
to update mappings for arch-specific irq routing entries
(in particular, the upcoming Hyper-V synthetic interrupts).
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMX and SVM calculate the TSC scaling ratio in a similar logic, so this
patch generalizes it to a common TSC scaling function.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
[Inline the multiplication and shift steps into mul_u64_u64_shr. Remove
BUG_ON. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The function is not used outside device assignment, and
kvm_arch_set_irq_inatomic has a different prototype. Move it here and
make it static to avoid confusion.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We do not want to do too much work in atomic context, in particular
not walking all the VCPUs of the virtual machine. So we want
to distinguish the architecture-specific injection function for irqfd
from kvm_set_msi. Since it's still empty, reuse the newly added
kvm_arch_set_irq and rename it to kvm_arch_set_irq_inatomic.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM/ARM Changes for v4.4-rc1
Includes a number of fixes for the arch-timer, introducing proper
level-triggered semantics for the arch-timers, a series of patches to
synchronously halt a guest (prerequisite for IRQ forwarding), some tracepoint
improvements, a tweak for the EL2 panic handlers, some more VGIC cleanups
getting rid of redundant state, and finally a stylistic change that gets rid of
some ctags warnings.
Conflicts:
arch/x86/include/asm/kvm_host.h
Some times it is useful for architecture implementations of KVM to know
when the VCPU thread is about to block or when it comes back from
blocking (arm/arm64 needs to know this to properly implement timers, for
example).
Therefore provide a generic architecture callback function in line with
what we do elsewhere for KVM generic-arch interactions.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Allow for arch-specific interrupt types to be set. For that, add
kvm_arch_set_irq() which takes interrupt type-specific action if it
recognizes the interrupt type given, and -EWOULDBLOCK otherwise.
The default implementation always returns -EWOULDBLOCK.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch updates the Posted-Interrupts Descriptor when vCPU
is blocked.
pre-block:
- Add the vCPU to the blocked per-CPU list
- Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
post-block:
- Remove the vCPU from the per-CPU list
Signed-off-by: Feng Wu <feng.wu@intel.com>
[Concentrate invocation of pre/post-block hooks to vcpu_block. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds an arch specific hooks 'arch_update' in
'struct kvm_kernel_irqfd'. On Intel side, it is used to
update the IRTE when VT-d posted-interrupts is used.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch introduces
- kvm_arch_irq_bypass_add_producer
- kvm_arch_irq_bypass_del_producer
- kvm_arch_irq_bypass_stop
- kvm_arch_irq_bypass_start
They make possible to specialize the KVM IRQ bypass consumer in
case CONFIG_KVM_HAVE_IRQ_BYPASS is set.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
[Add weak implementations of the callbacks. - Feng]
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>