Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "On the x86 side, there are some optimizations and documentation
  updates.  The big ARM/KVM change for 3.11, support for AArch64, will
  come through Catalin Marinas's tree.  s390 and PPC have misc cleanups
  and bugfixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (87 commits)
  KVM: PPC: Ignore PIR writes
  KVM: PPC: Book3S PR: Invalidate SLB entries properly
  KVM: PPC: Book3S PR: Allow guest to use 1TB segments
  KVM: PPC: Book3S PR: Don't keep scanning HPTEG after we find a match
  KVM: PPC: Book3S PR: Fix invalidation of SLB entry 0 on guest entry
  KVM: PPC: Book3S PR: Fix proto-VSID calculations
  KVM: PPC: Guard doorbell exception with CONFIG_PPC_DOORBELL
  KVM: Fix RTC interrupt coalescing tracking
  kvm: Add a tracepoint write_tsc_offset
  KVM: MMU: Inform users of mmio generation wraparound
  KVM: MMU: document fast invalidate all mmio sptes
  KVM: MMU: document fast invalidate all pages
  KVM: MMU: document fast page fault
  KVM: MMU: document mmio page fault
  KVM: MMU: document write_flooding_count
  KVM: MMU: document clear_spte_count
  KVM: MMU: drop kvm_mmu_zap_mmio_sptes
  KVM: MMU: init kvm generation close to mmio wrap-around value
  KVM: MMU: add tracepoint for check_mmio_spte
  KVM: MMU: fast invalidate all mmio sptes
  ...
This commit is contained in:
Linus Torvalds
2013-07-03 13:21:40 -07:00
62 changed files with 1383 additions and 808 deletions
+4 -4
View File
@@ -2278,7 +2278,7 @@ return indicates the attribute is implemented. It does not necessarily
indicate that the attribute can be read or written in the device's
current state. "addr" is ignored.
4.77 KVM_ARM_VCPU_INIT
4.82 KVM_ARM_VCPU_INIT
Capability: basic
Architectures: arm, arm64
@@ -2304,7 +2304,7 @@ Possible features:
Depends on KVM_CAP_ARM_EL1_32BIT (arm64 only).
4.78 KVM_GET_REG_LIST
4.83 KVM_GET_REG_LIST
Capability: basic
Architectures: arm, arm64
@@ -2324,7 +2324,7 @@ This ioctl returns the guest registers that are supported for the
KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.
4.80 KVM_ARM_SET_DEVICE_ADDR
4.84 KVM_ARM_SET_DEVICE_ADDR
Capability: KVM_CAP_ARM_SET_DEVICE_ADDR
Architectures: arm, arm64
@@ -2362,7 +2362,7 @@ must be called after calling KVM_CREATE_IRQCHIP, but before calling
KVM_RUN on any of the VCPUs. Calling this ioctl twice for any of the
base addresses will return -EEXIST.
4.82 KVM_PPC_RTAS_DEFINE_TOKEN
4.85 KVM_PPC_RTAS_DEFINE_TOKEN
Capability: KVM_CAP_PPC_RTAS
Architectures: ppc
+83 -8
View File
@@ -191,12 +191,12 @@ Shadow pages contain the following information:
A counter keeping track of how many hardware registers (guest cr3 or
pdptrs) are now pointing at the page. While this counter is nonzero, the
page cannot be destroyed. See role.invalid.
multimapped:
Whether there exist multiple sptes pointing at this page.
parent_pte/parent_ptes:
If multimapped is zero, parent_pte points at the single spte that points at
this page's spt. Otherwise, parent_ptes points at a data structure
with a list of parent_ptes.
parent_ptes:
The reverse mapping for the pte/ptes pointing at this page's spt. If
parent_ptes bit 0 is zero, only one spte points at this pages and
parent_ptes points at this single spte, otherwise, there exists multiple
sptes pointing at this page and (parent_ptes & ~0x1) points at a data
structure with a list of parent_ptes.
unsync:
If true, then the translations in this page may not match the guest's
translation. This is equivalent to the state of the tlb when a pte is
@@ -210,6 +210,24 @@ Shadow pages contain the following information:
A bitmap indicating which sptes in spt point (directly or indirectly) at
pages that may be unsynchronized. Used to quickly locate all unsychronized
pages reachable from a given page.
mmu_valid_gen:
Generation number of the page. It is compared with kvm->arch.mmu_valid_gen
during hash table lookup, and used to skip invalidated shadow pages (see
"Zapping all pages" below.)
clear_spte_count:
Only present on 32-bit hosts, where a 64-bit spte cannot be written
atomically. The reader uses this while running out of the MMU lock
to detect in-progress updates and retry them until the writer has
finished the write.
write_flooding_count:
A guest may write to a page table many times, causing a lot of
emulations if the page needs to be write-protected (see "Synchronized
and unsynchronized pages" below). Leaf pages can be unsynchronized
so that they do not trigger frequent emulation, but this is not
possible for non-leafs. This field counts the number of emulations
since the last time the page table was actually used; if emulation
is triggered too frequently on this page, KVM will unmap the page
to avoid emulation in the future.
Reverse map
===========
@@ -258,14 +276,26 @@ This is the most complicated event. The cause of a page fault can be:
Handling a page fault is performed as follows:
- if the RSV bit of the error code is set, the page fault is caused by guest
accessing MMIO and cached MMIO information is available.
- walk shadow page table
- check for valid generation number in the spte (see "Fast invalidation of
MMIO sptes" below)
- cache the information to vcpu->arch.mmio_gva, vcpu->arch.access and
vcpu->arch.mmio_gfn, and call the emulator
- If both P bit and R/W bit of error code are set, this could possibly
be handled as a "fast page fault" (fixed without taking the MMU lock). See
the description in Documentation/virtual/kvm/locking.txt.
- if needed, walk the guest page tables to determine the guest translation
(gva->gpa or ngpa->gpa)
- if permissions are insufficient, reflect the fault back to the guest
- determine the host page
- if this is an mmio request, there is no host page; call the emulator
to emulate the instruction instead
- if this is an mmio request, there is no host page; cache the info to
vcpu->arch.mmio_gva, vcpu->arch.access and vcpu->arch.mmio_gfn
- walk the shadow page table to find the spte for the translation,
instantiating missing intermediate page tables as necessary
- If this is an mmio request, cache the mmio info to the spte and set some
reserved bit on the spte (see callers of kvm_mmu_set_mmio_spte_mask)
- try to unsynchronize the page
- if successful, we can let the guest continue and modify the gpte
- emulate the instruction
@@ -351,6 +381,51 @@ causes its write_count to be incremented, thus preventing instantiation of
a large spte. The frames at the end of an unaligned memory slot have
artificially inflated ->write_counts so they can never be instantiated.
Zapping all pages (page generation count)
=========================================
For the large memory guests, walking and zapping all pages is really slow
(because there are a lot of pages), and also blocks memory accesses of
all VCPUs because it needs to hold the MMU lock.
To make it be more scalable, kvm maintains a global generation number
which is stored in kvm->arch.mmu_valid_gen. Every shadow page stores
the current global generation-number into sp->mmu_valid_gen when it
is created. Pages with a mismatching generation number are "obsolete".
When KVM need zap all shadow pages sptes, it just simply increases the global
generation-number then reload root shadow pages on all vcpus. As the VCPUs
create new shadow page tables, the old pages are not used because of the
mismatching generation number.
KVM then walks through all pages and zaps obsolete pages. While the zap
operation needs to take the MMU lock, the lock can be released periodically
so that the VCPUs can make progress.
Fast invalidation of MMIO sptes
===============================
As mentioned in "Reaction to events" above, kvm will cache MMIO
information in leaf sptes. When a new memslot is added or an existing
memslot is changed, this information may become stale and needs to be
invalidated. This also needs to hold the MMU lock while walking all
shadow pages, and is made more scalable with a similar technique.
MMIO sptes have a few spare bits, which are used to store a
generation number. The global generation number is stored in
kvm_memslots(kvm)->generation, and increased whenever guest memory info
changes. This generation number is distinct from the one described in
the previous section.
When KVM finds an MMIO spte, it checks the generation number of the spte.
If the generation number of the spte does not equal the global generation
number, it will ignore the cached MMIO information and handle the page
fault through the slow path.
Since only 19 bits are used to store generation-number on mmio spte, all
pages are zapped when there is an overflow.
Further reading
===============