Commit Graph

1073038 Commits

Author SHA1 Message Date
David Matlack cb00a70bd4 KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG
When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not
write-protected when dirty logging is enabled on the memslot. Instead
they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for
the first time and only for the specific sub-region being cleared.

Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to
write-protecting to avoid causing write-protection faults on vCPU
threads. This also allows userspace to smear the cost of huge page
splitting across multiple ioctls, rather than splitting the entire
memslot as is the case when initially-all-set is not used.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-17-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:43 -05:00
David Matlack a3fe5dbda0 KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.

Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.

Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:

 1. Splitting on the vCPU thread interrupts vCPUs execution and is
    disruptive to customers whereas splitting on VM ioctl threads can
    run in parallel with vCPU execution.

 2. Splitting all huge pages at once is more efficient because it does
    not require performing VM-exit handling or walking the page table for
    every 4KiB page in the memslot, and greatly reduces the amount of
    contention on the mmu_lock.

For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.

Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.

Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:42 -05:00
David Matlack a82070b6e7 KVM: x86/mmu: Separate TDP MMU shadow page allocation and initialization
Separate the allocation of shadow pages from their initialization.  This
is in preparation for splitting huge pages outside of the vCPU fault
context, which requires a different allocation mechanism.

No functional changed intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-15-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:41 -05:00
David Matlack a3aca4de0d KVM: x86/mmu: Derive page role for TDP MMU shadow pages from parent
Derive the page role from the parent shadow page, since the only thing
that changes is the level. This is in preparation for splitting huge
pages during VM-ioctls which do not have access to the vCPU MMU context.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-14-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:41 -05:00
David Matlack a81399a573 KVM: x86/mmu: Remove redundant role overrides for TDP MMU shadow pages
The vCPU's mmu_role already has the correct values for direct,
has_4_byte_gpte, access, and ad_disabled. Remove the code that was
redundantly overwriting these fields with the same values.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-13-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:41 -05:00
David Matlack 77aa60753a KVM: x86/mmu: Refactor TDP MMU iterators to take kvm_mmu_page root
Instead of passing a pointer to the root page table and the root level
separately, pass in a pointer to the root kvm_mmu_page struct.  This
reduces the number of arguments by 1, cutting down on line lengths.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-12-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:40 -05:00
David Matlack 315d86da89 KVM: x86/mmu: Move restore_acc_track_spte() to spte.h
restore_acc_track_spte() is pure SPTE bit manipulation, making it a good
fit for spte.h. And now that the WARN_ON_ONCE() calls have been removed,
there isn't any good reason to not inline it.

This move also prepares for a follow-up commit that will need to call
restore_acc_track_spte() from spte.c

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-11-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:40 -05:00
David Matlack 77c23c77f9 KVM: x86/mmu: Drop new_spte local variable from restore_acc_track_spte()
The new_spte local variable is unnecessary. Deleting it can save a line
of code and simplify the remaining lines a bit.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-10-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:39 -05:00
David Matlack 59940e76d1 KVM: x86/mmu: Remove unnecessary warnings from restore_acc_track_spte()
The warnings in restore_acc_track_spte() can be removed because the only
caller checks is_access_track_spte(), and is_access_track_spte() checks
!spte_ad_enabled(). In other words, the warning can never be triggered.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-9-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:39 -05:00
David Matlack 7b7e1ab6fd KVM: x86/mmu: Consolidate logic to atomically install a new TDP MMU page table
Consolidate the logic to atomically replace an SPTE with an SPTE that
points to a new page table into a single helper function. This will be
used in a follow-up commit to split huge pages, which involves replacing
each huge page SPTE with an SPTE that points to a page table.

Opportunistically drop the call to trace_kvm_mmu_get_page() in
kvm_tdp_mmu_map() since it is redundant with the identical tracepoint in
tdp_mmu_alloc_sp().

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-8-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:39 -05:00
David Matlack 0f53dfa34e KVM: x86/mmu: Rename handle_removed_tdp_mmu_page() to handle_removed_pt()
First remove tdp_mmu_ from the name since it is redundant given that it
is a static function in tdp_mmu.c. There is a pattern of using tdp_mmu_
as a prefix in the names of static TDP MMU functions, but all of the
other handle_*() variants do not include such a prefix. So drop it
entirely.

Then change "page" to "pt" to convey that this is operating on a page
table rather than an struct page. Purposely use "pt" instead of "sp"
since this function takes the raw RCU-protected page table pointer as an
argument rather than  a pointer to the struct kvm_mmu_page.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-7-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:38 -05:00
David Matlack c298a30c28 KVM: x86/mmu: Rename TDP MMU functions that handle shadow pages
Rename 3 functions in tdp_mmu.c that handle shadow pages:

  alloc_tdp_mmu_page()  -> tdp_mmu_alloc_sp()
  tdp_mmu_link_page()   -> tdp_mmu_link_sp()
  tdp_mmu_unlink_page() -> tdp_mmu_unlink_sp()

These changed make tdp_mmu a consistent prefix before the verb in the
function name, and make it more clear that these functions deal with
kvm_mmu_page structs rather than struct pages.

One could argue that "shadow page" is the wrong term for a page table in
the TDP MMU since it never actually shadows a guest page table.
However, "shadow page" (or "sp" for short) has evolved to become the
standard term in KVM when referring to a kvm_mmu_page struct, and its
associated page table and other metadata, regardless of whether the page
table shadows a guest page table. So this commit just makes the TDP MMU
more consistent with the rest of KVM.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-6-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:38 -05:00
David Matlack 3e72c791fd KVM: x86/mmu: Change tdp_mmu_{set,zap}_spte_atomic() to return 0/-EBUSY
tdp_mmu_set_spte_atomic() and tdp_mmu_zap_spte_atomic() return a bool
with true indicating the SPTE modification was successful and false
indicating failure. Change these functions to return an int instead
since that is the common practice.

Opportunistically fix up the kernel-doc style for the Return section
above tdp_mmu_set_spte_atomic().

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-5-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:37 -05:00
David Matlack 3255530ab1 KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
Consolidate a bunch of code that was manually re-reading the spte if the
cmpxchg failed. There is no extra cost of doing this because we already
have the spte value as a result of the cmpxchg (and in fact this
eliminates re-reading the spte), and none of the call sites depend on
iter->old_spte retaining the stale spte value.

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:37 -05:00
David Matlack 1346bbb6b4 KVM: x86/mmu: Rename __rmap_write_protect() to rmap_write_protect()
The function formerly known as rmap_write_protect() has been renamed to
kvm_vcpu_write_protect_gfn(), so we can get rid of the double
underscores in front of __rmap_write_protect().

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:37 -05:00
David Matlack cf48f9e286 KVM: x86/mmu: Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn()
rmap_write_protect() is a poor name because it also write-protects SPTEs
in the TDP MMU, not just SPTEs in the rmap. It is also confusing that
rmap_write_protect() is not a simple wrapper around
__rmap_write_protect(), since that is the common pattern for functions
with double-underscore names.

Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn() to convey
that KVM is write-protecting a specific gfn in the context of a vCPU.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:36 -05:00
Sean Christopherson 413af6601f KVM: x86: Add checks for reserved-to-zero Hyper-V hypercall fields
Add checks for the three fields in Hyper-V's hypercall params that must
be zero.  Per the TLFS, HV_STATUS_INVALID_HYPERCALL_INPUT is returned if
"A reserved bit in the specified hypercall input value is non-zero."

Note, some versions of the TLFS have an off-by-one bug for the last
reserved field, and define it as being bits 64:60.  See
https://github.com/MicrosoftDocs/Virtualization-Documentation/pull/1682.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:36 -05:00
Sean Christopherson 40421f38f6 KVM: x86: Reject fixeds-size Hyper-V hypercalls with non-zero "var_cnt"
Reject Hyper-V hypercalls if the guest specifies a non-zero variable size
header (var_cnt in KVM) for a hypercall that has a fixed header size.
Per the TLFS:

  It is illegal to specify a non-zero variable header size for a
  hypercall that is not explicitly documented as accepting variable sized
  input headers. In such a case the hypercall will result in a return
  code of HV_STATUS_INVALID_HYPERCALL_INPUT.

Note, at least some of the various DEBUG commands likely aren't allowed
to use variable size headers, but the TLFS documentation doesn't clearly
state what is/isn't allowed.  Omit them for now to avoid unnecessary
breakage.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:35 -05:00
Sean Christopherson 9c52f6b3d8 KVM: x86: Shove vp_bitmap handling down into sparse_set_to_vcpu_mask()
Move the vp_bitmap "allocation" that's needed to handle mismatched vp_index
values down into sparse_set_to_vcpu_mask() and drop __always_inline from
said helper.  The need for an intermediate vp_bitmap is a detail that's
specific to the sparse translation with mismatched VP<=>vCPU indexes and
does not need to be exposed to the caller.

Regarding the __always_inline, prior to commit f21dd49450 ("KVM: x86:
hyperv: optimize sparse VP set processing") the helper, then named
hv_vcpu_in_sparse_set(), was a tiny bit of code that effectively boiled
down to a handful of bit ops.  The __always_inline was understandable, if
not justifiable.  Since the aforementioned change, sparse_set_to_vcpu_mask()
is a chunky 350-450+ bytes of code without KASAN=y, and balloons to 1100+
with KASAN=y.  In other words, it has no business being forcefully inlined.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:35 -05:00
Sean Christopherson 79661c3766 KVM: x86: Don't bother reading sparse banks that end up being ignored
When handling "sparse" VP_SET requests, don't read sparse banks that
can't possibly contain a legal VP index instead of ignoring such banks
later on in sparse_set_to_vcpu_mask().  This allows KVM to cap the size
of its sparse_banks arrays for VP_SET at KVM_HV_MAX_SPARSE_VCPU_SET_BITS.
Add a compile time assert that KVM_HV_MAX_SPARSE_VCPU_SET_BITS<=64, i.e.
that KVM_MAX_VCPUS<=4096, as the TLFS allows for at most 64 sparse banks,
and KVM will need to do _something_ to play nice with Hyper-V.

Reducing the size of sparse_banks fudges around a compilation warning
(that becomes error with KVM_WERROR=y) when CONFIG_KASAN_STACK=y, which
is selected (and can't be unselected) by CONFIG_KASAN=y when using gcc
(clang/LLVM is a stack hog in some cases so it's opt-in for clang).
KASAN_STACK adds a redzone around every stack variable, which pushes the
Hyper-V functions over the default limit of 1024.

Ideally, KVM would flat out reject such impossibilities, but the TLFS
explicitly allows providing empty banks, even if a bank can't possibly
contain a valid VP index due to its position exceeding KVM's max.

  Furthermore, for a bit 1 in ValidBankMask, it is valid state for the
  corresponding element in BanksContents can be all 0s, meaning no
  processors are specified in this bank.

Arguably KVM should reject and not ignore the "extra" banks, but that can
be done independently and without bloating sparse_banks, e.g. by reading
each "extra" 8-byte chunk individually.

Reported-by: Ajay Garg <ajaygargnsit@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:34 -05:00
Sean Christopherson a0dd008fe9 KVM: x86: Add a helper to get the sparse VP_SET for IPIs and TLB flushes
Add a helper, kvm_get_sparse_vp_set(), to handle sanity checks related to
the VARHEAD field and reading the sparse banks of a VP_SET.  A future
commit to reduce the memory footprint of sparse_banks will introduce more
common code to the sparse bank retrieval.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:34 -05:00
Sean Christopherson 25af908118 KVM: x86: Refactor kvm_hv_flush_tlb() to reduce indentation
Refactor the "extended" path of kvm_hv_flush_tlb() to reduce the nesting
depth for the non-fast sparse path, and to make the code more similar to
the extended path in kvm_hv_send_ipi().

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:34 -05:00
Sean Christopherson bd1ba5732b KVM: x86: Get the number of Hyper-V sparse banks from the VARHEAD field
Get the number of sparse banks from the VARHEAD field, which the guest is
required to provide as "The size of a variable header, in QWORDS.", where
the variable header is:

  Variable Header Bytes = {Total Header Bytes - sizeof(Fixed Header)}
                          rounded up to nearest multiple of 8
  Variable HeaderSize = Variable Header Bytes / 8

In other words, the VARHEAD should match the number of sparse banks.
Keep the manual count as a sanity check, but otherwise rely on the field
so as to more closely align with the logic defined in the TLFS and to
allow for future cleanups.

Tweak the tracepoint output to use "rep_cnt" instead of simply "cnt" now
that there is also "var_cnt".

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:33 -05:00
David Matlack 02844ac1eb KVM: x86/mmu: Consolidate comments about {Host,MMU}-writable
Consolidate the large comment above DEFAULT_SPTE_HOST_WRITABLE with the
large comment above is_writable_pte() into one comment. This comment
explains the different reasons why an SPTE may be non-writable and KVM
keeps track of that with the {Host,MMU}-writable bits.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220125230723.1701061-1-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:33 -05:00
David Matlack 1ca87e015d KVM: x86/mmu: Rename DEFAULT_SPTE_MMU_WRITEABLE to DEFAULT_SPTE_MMU_WRITABLE
Both "writeable" and "writable" are valid, but we should be consistent
about which we use. DEFAULT_SPTE_MMU_WRITEABLE was the odd one out in
the SPTE code, so rename it to DEFAULT_SPTE_MMU_WRITABLE.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220125230713.1700406-1-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:50:33 -05:00