Pull KVM update from Paolo Bonzini:
"Fairly small update, but there are some interesting new features.
Common:
Optional support for adding a small amount of polling on each HLT
instruction executed in the guest (or equivalent for other
architectures). This can improve latency up to 50% on some
scenarios (e.g. O_DSYNC writes or TCP_RR netperf tests). This
also has to be enabled manually for now, but the plan is to
auto-tune this in the future.
ARM/ARM64:
The highlights are support for GICv3 emulation and dirty page
tracking
s390:
Several optimizations and bugfixes. Also a first: a feature
exposed by KVM (UUID and long guest name in /proc/sysinfo) before
it is available in IBM's hypervisor! :)
MIPS:
Bugfixes.
x86:
Support for PML (page modification logging, a new feature in
Broadwell Xeons that speeds up dirty page tracking), nested
virtualization improvements (nested APICv---a nice optimization),
usual round of emulation fixes.
There is also a new option to reduce latency of the TSC deadline
timer in the guest; this needs to be tuned manually.
Some commits are common between this pull and Catalin's; I see you
have already included his tree.
Powerpc:
Nothing yet.
The KVM/PPC changes will come in through the PPC maintainers,
because I haven't received them yet and I might end up being
offline for some part of next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits)
KVM: ia64: drop kvm.h from installed user headers
KVM: x86: fix build with !CONFIG_SMP
KVM: x86: emulate: correct page fault error code for NoWrite instructions
KVM: Disable compat ioctl for s390
KVM: s390: add cpu model support
KVM: s390: use facilities and cpu_id per KVM
KVM: s390/CPACF: Choose crypto control block format
s390/kernel: Update /proc/sysinfo file with Extended Name and UUID
KVM: s390: reenable LPP facility
KVM: s390: floating irqs: fix user triggerable endless loop
kvm: add halt_poll_ns module parameter
kvm: remove KVM_MMIO_SIZE
KVM: MIPS: Don't leak FPU/DSP to guest
KVM: MIPS: Disable HTW while in guest
KVM: nVMX: Enable nested posted interrupt processing
KVM: nVMX: Enable nested virtual interrupt delivery
KVM: nVMX: Enable nested apic register virtualization
KVM: nVMX: Make nested control MSRs per-cpu
KVM: nVMX: Enable nested virtualize x2apic mode
KVM: nVMX: Prepare for using hardware MSR bitmap
...
If vcpu has a interrupt in vmx non-root mode, injecting that interrupt
requires a vmexit. With posted interrupt processing, the vmexit
is not needed, and interrupts are fully taken care of by hardware.
In nested vmx, this feature avoids much more vmexits than non-nested vmx.
When L1 asks L0 to deliver L1's posted interrupt vector, and the target
VCPU is in non-root mode, we use a physical ipi to deliver POSTED_INTR_NV
to the target vCPU. Using POSTED_INTR_NV avoids unexpected interrupts
if a concurrent vmexit happens and L1's vector is different with L0's.
The IPI triggers posted interrupt processing in the target physical CPU.
In case the target vCPU was not in guest mode, complete the posted
interrupt delivery on the next entry to L2.
Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
With APICv, LAPIC timer interrupt is always delivered via IRR:
apic_find_highest_irr syncs PIR to IRR.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We forgot to re-check LAPIC after splitting the loop in commit
173beedc16 (KVM: x86: Software disabled APIC should still deliver
NMIs, 2014-11-02).
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Fixes: 173beedc16
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We cannot hit the bug now, but future patches will expose this path.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The majority of this patch turns
result = 0; if (CODE) result = 1; return result;
into
return CODE;
because we return bool now.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Adds a function kvm_vcpu_set_pending_timer instead of calling
kvm_make_request in lapic.c.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add tracepoint to wait_lapic_expire.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
[Remind reader if early or late. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For the hrtimer which emulates the tscdeadline timer in the guest,
add an option to advance expiration, and busy spin on VM-entry waiting
for the actual expiration time to elapse.
This allows achieving low latencies in cyclictest (or any scenario
which requires strict timing regarding timer expiration).
Reduces average cyclictest latency from 12us to 8us
on Core i5 desktop.
Note: this option requires tuning to find the appropriate value
for a particular hardware/guest combination. One method is to measure the
average delay between apic_timer_fn and VM-entry.
Another method is to start with 1000ns, and increase the value
in say 500ns increments until avg cyclictest numbers stop decreasing.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In most cases calling hwapic_isr_update(), we always check if
kvm_apic_vid_enabled() == 1, but actually,
kvm_apic_vid_enabled()
-> kvm_x86_ops->vm_has_apicv()
-> vmx_vm_has_apicv() or '0' in svm case
-> return enable_apicv && irqchip_in_kernel(kvm)
So its a little cost to recall vmx_vm_has_apicv() inside
hwapic_isr_update(), here just NULL out hwapic_isr_update() in
case of !enable_apicv inside hardware_setup() then make all
related stuffs follow this. Note we don't check this under that
condition of irqchip_in_kernel() since we should make sure
definitely any caller don't work without in-kernel irqchip.
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While fixing an x2apic bug,
17d68b7 KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
we've made only one cluster available. This means that the amount of
logically addressible x2APICs was reduced to 16 and VCPUs kept
overwriting themselves in that region, so even the first cluster wasn't
set up correctly.
This patch extends x2APIC support back to the logical_map's limit, and
keeps the CVE fixed as messages for non-present APICs are dropped.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
They can't be violated now, but play it safe for the future.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x2apic allows destinations > 0xff and we don't want them delivered to
lower APICs. They are correctly handled by doing nothing.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Physical mode can't address more than one APIC, but lowest-prio is
allowed, so we just reuse our paths.
SDM 10.6.2.1 Physical Destination:
Also, for any non-broadcast IPI or I/O subsystem initiated interrupt
with lowest priority delivery mode, software must ensure that APICs
defined in the interrupt address are present and enabled to receive
interrupts.
We could warn on top of that.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
False from kvm_irq_delivery_to_apic_fast() means that we don't handle it
in the fast path, but we still return false in cases that were perfectly
handled, fix that.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
0x830 MSR is 0x300 xAPIC MMIO, which is MSR_ICR.
Signed-off-by: Radim KrÄmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x2APIC has no registers for DFR and ICR2 (see Intel SDM 10.12.1.2 "x2APIC
Register Address Space"). KVM needs to cause #GP on such accesses.
Fix it (DFR and ICR2 on read, ICR2 on write, DFR already handled on writes).
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
cs.base is declared as a __u64 variable and vector is a u32 so this
causes a static checker warning. The user indeed can set "sipi_vector"
to any u32 value in kvm_vcpu_ioctl_x86_set_vcpu_events(), but the
value should really have 8-bit precision only.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
apic_find_highest_irr assumes irr_pending is set if any vector in APIC_IRR is
set. If this assumption is broken and apicv is disabled, the injection of
interrupts may be deferred until another interrupt is delivered to the guest.
Ultimately, if no other interrupt should be injected to that vCPU, the pending
interrupt may be lost.
commit 56cc2406d6 ("KVM: nVMX: fix "acknowledge interrupt on exit" when APICv
is in use") changed the behavior of apic_clear_irr so irr_pending is cleared
after setting APIC_IRR vector. After this commit, if apic_set_irr and
apic_clear_irr run simultaneously, a race may occur, resulting in APIC_IRR
vector set, and irr_pending cleared. In the following example, assume a single
vector is set in IRR prior to calling apic_clear_irr:
apic_set_irr apic_clear_irr
------------ --------------
apic->irr_pending = true;
apic_clear_vector(...);
vec = apic_search_irr(apic);
// => vec == -1
apic_set_vector(...);
apic->irr_pending = (vec != -1);
// => apic->irr_pending == false
Nonetheless, it appears the race might even occur prior to this commit:
apic_set_irr apic_clear_irr
------------ --------------
apic->irr_pending = true;
apic->irr_pending = false;
apic_clear_vector(...);
if (apic_search_irr(apic) != -1)
apic->irr_pending = true;
// => apic->irr_pending == false
apic_set_vector(...);
Fixing this issue by:
1. Restoring the previous behavior of apic_clear_irr: clear irr_pending, call
apic_clear_vector, and then if APIC_IRR is non-zero, set irr_pending.
2. On apic_set_irr: first call apic_set_vector, then set irr_pending.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Logical destination mode can be used to send NMI IPIs even when all
APICs are software disabled, so if all APICs are software disabled we
should still look at the DFRs.
So the DFRs should all be the same, even if some or all APICs are
software disabled. However, the SDM does not say this, so tweak
the logic as follows:
- if one APIC is enabled and has LDR != 0, use that one to build the map.
This picks the right DFR in case an OS is only setting it for the
software-enabled APICs, or in case an OS is using logical addressing
on some APICs while leaving the rest in reset state (using LDR was
suggested by Radim).
- if all APICs are disabled, pick a random one to build the map.
We use the last one with LDR != 0 for simplicity.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, the APIC logical map does not consider VCPUs whose local-apic is
software-disabled. However, NMIs, INIT, etc. should still be delivered to such
VCPUs. Therefore, the APIC mode should first be determined, and then the map,
considering all VCPUs should be constructed.
To address this issue, first find the APIC mode, and only then construct the
logical map.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
APIC base relocation is unsupported by KVM. If anyone uses it, the least should
be to report a warning in the hypervisor.
Note that KVM-unit-tests uses this feature for some reason, so running the
tests triggers the warning.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>