Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM64:
- Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
- Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
option, which multi-process VMMs such as crosvm rely on (see merge
commit 382b5b87a9: "Fix a number of issues with MTE, such as
races on the tags being initialised vs the PG_mte_tagged flag as
well as the lack of support for VM_SHARED when KVM is involved.
Patches from Catalin Marinas and Peter Collingbourne").
- Merge the pKVM shadow vcpu state tracking that allows the
hypervisor to have its own view of a vcpu, keeping that state
private.
- Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
not-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
- Fix a handful of minor issues around 52bit VA/PA support (64kB
pages only) as a prelude to the upcoming support for 4kB and 16kB
pages.
- Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
s390:
- Second batch of the lazy destroy patches
- First batch of KVM changes for kernel virtual != physical address
support
- Removal of an unused function
x86:
- Allow compiling out SMM support
- Cleanup and documentation of SMM state save area format
- Preserve interrupt shadow in SMM state save area
- Respond to generic signals during slow page faults
- Fixes and optimizations for the non-executable huge page errata
fix.
- Reprogram all performance counters on PMU filter change
- Cleanups to Hyper-V emulation and tests
- Process Hyper-V TLB flushes from a nested guest (i.e. from an L2
guest running on top of an L1 Hyper-V hypervisor)
- Advertise several new Intel features
- x86 Xen-for-KVM:
- Allow the Xen runstate information to cross a page boundary
- Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
- Add support for 32-bit guests in SCHEDOP_poll
- Notable x86 fixes and cleanups:
- One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
- Reinstate IBPB on emulated VM-Exit that was incorrectly dropped
a few years back when eliminating unnecessary barriers when
switching between vmcs01 and vmcs02.
- Clean up vmread_error_trampoline() to make it more obvious that
params must be passed on the stack, even for x86-64.
- Let userspace set all supported bits in MSR_IA32_FEAT_CTL
irrespective of the current guest CPUID.
- Fudge around a race with TSC refinement that results in KVM
incorrectly thinking a guest needs TSC scaling when running on a
CPU with a constant TSC, but no hardware-enumerated TSC
frequency.
- Advertise (on AMD) that the SMM_CTL MSR is not supported
- Remove unnecessary exports
Generic:
- Support for responding to signals during page faults; introduces
new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
Selftests:
- Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
- Fix build errors that occur in certain setups (unsure exactly what
is unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
- Introduce actual atomics for clear/set_bit() in selftests
- Add support for pinning vCPUs in dirty_log_perf_test.
- Rename the so called "perf_util" framework to "memstress".
- Add a lightweight pseudo RNG for guest use, and use it to randomize
the access pattern and write vs. read percentage in the memstress
tests.
- Add a common ucall implementation; code dedup and pre-work for
running SEV (and beyond) guests in selftests.
- Provide a common constructor and arch hook, which will eventually
be used by x86 to automatically select the right hypercall (AMD vs.
Intel).
- A bunch of added/enabled/fixed selftests for ARM64, covering
memslots, breakpoints, stage-2 faults and access tracking.
- x86-specific selftest changes:
- Clean up x86's page table management.
- Clean up and enhance the "smaller maxphyaddr" test, and add a
related test to cover generic emulation failure.
- Clean up the nEPT support checks.
- Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
- Fix an ordering issue in the AMX test introduced by recent
conversions to use kvm_cpu_has(), and harden the code to guard
against similar bugs in the future. Anything that triggers
caching of KVM's supported CPUID, kvm_cpu_has() in this case,
effectively hides opt-in XSAVE features if the caching occurs
before the test opts in via prctl().
Documentation:
- Remove deleted ioctls from documentation
- Clean up the docs for the x86 MSR filter.
- Various fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
KVM: x86: Add proper ReST tables for userspace MSR exits/flags
KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
KVM: arm64: selftests: Align VA space allocator with TTBR0
KVM: arm64: Fix benign bug with incorrect use of VA_BITS
KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
KVM: x86: Advertise that the SMM_CTL MSR is not supported
KVM: x86: remove unnecessary exports
KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
tools: KVM: selftests: Convert clear/set_bit() to actual atomics
tools: Drop "atomic_" prefix from atomic test_and_set_bit()
tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
perf tools: Use dedicated non-atomic clear/set bit helpers
tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
KVM: arm64: selftests: Enable single-step without a "full" ucall()
KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
KVM: Remove stale comment about KVM_REQ_UNHALT
KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
KVM: Reference to kvm_userspace_memory_region in doc and comments
KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
...
@@ -272,18 +272,6 @@ the VCPU file descriptor can be mmap-ed, including:
KVM_CAP_DIRTY_LOG_RING, see section 8.3.


4.6 KVM_SET_MEMORY_REGION
-------------------------

:Capability: basic
:Architectures: all
:Type: vm ioctl
:Parameters: struct kvm_memory_region (in)
:Returns: 0 on success, -1 on error

This ioctl is obsolete and has been removed.


4.7 KVM_CREATE_VCPU
-------------------

@@ -368,17 +356,6 @@ see the description of the capability.
|
||||
Note that the Xen shared info page, if configured, shall always be assumed
|
||||
to be dirty. KVM will not explicitly mark it such.
|
||||
|
||||
4.9 KVM_SET_MEMORY_ALIAS
|
||||
------------------------
|
||||
|
||||
:Capability: basic
|
||||
:Architectures: x86
|
||||
:Type: vm ioctl
|
||||
:Parameters: struct kvm_memory_alias (in)
|
||||
:Returns: 0 (success), -1 (error)
|
||||
|
||||
This ioctl is obsolete and has been removed.
|
||||
|
||||
|
||||
4.10 KVM_RUN
|
||||
------------
|
||||
@@ -1332,7 +1309,7 @@ yet and must be cleared on entry.
|
||||
__u64 userspace_addr; /* start of the userspace allocated memory */
|
||||
};
|
||||
|
||||
/* for kvm_memory_region::flags */
|
||||
/* for kvm_userspace_memory_region::flags */
|
||||
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
|
||||
#define KVM_MEM_READONLY (1UL << 1)
|
||||
|
||||
@@ -1377,10 +1354,6 @@ the memory region are automatically reflected into the guest. For example, an
|
||||
mmap() that affects the region will be made visible immediately. Another
|
||||
example is madvise(MADV_DROP).
|
||||
|
||||
It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl.
|
||||
The KVM_SET_MEMORY_REGION does not allow fine grained control over memory
|
||||
allocation and is deprecated.
|
||||
|
||||
|
||||
4.36 KVM_SET_TSS_ADDR
|
||||
---------------------
|
||||
@@ -3293,6 +3266,7 @@ valid entries found.
|
||||
----------------------
|
||||
|
||||
:Capability: KVM_CAP_DEVICE_CTRL
|
||||
:Architectures: all
|
||||
:Type: vm ioctl
|
||||
:Parameters: struct kvm_create_device (in/out)
|
||||
:Returns: 0 on success, -1 on error
|
||||
@@ -3333,6 +3307,7 @@ number.
|
||||
:Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
|
||||
KVM_CAP_VCPU_ATTRIBUTES for vcpu device
|
||||
KVM_CAP_SYS_ATTRIBUTES for system (/dev/kvm) device (no set)
|
||||
:Architectures: x86, arm64, s390
|
||||
:Type: device ioctl, vm ioctl, vcpu ioctl
|
||||
:Parameters: struct kvm_device_attr
|
||||
:Returns: 0 on success, -1 on error
|
||||
@@ -4104,80 +4079,71 @@ flags values for ``struct kvm_msr_filter_range``:
|
||||
``KVM_MSR_FILTER_READ``
|
||||
|
||||
Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap
|
||||
indicates that a read should immediately fail, while a 1 indicates that
|
||||
a read for a particular MSR should be handled regardless of the default
|
||||
indicates that read accesses should be denied, while a 1 indicates that
|
||||
a read for a particular MSR should be allowed regardless of the default
|
||||
filter action.
|
||||
|
||||
``KVM_MSR_FILTER_WRITE``
|
||||
|
||||
Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap
|
||||
indicates that a write should immediately fail, while a 1 indicates that
|
||||
a write for a particular MSR should be handled regardless of the default
|
||||
indicates that write accesses should be denied, while a 1 indicates that
|
||||
a write for a particular MSR should be allowed regardless of the default
|
||||
filter action.
|
||||
|
||||
``KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE``
|
||||
|
||||
Filter both read and write accesses to MSRs using the given bitmap. A 0
|
||||
in the bitmap indicates that both reads and writes should immediately fail,
|
||||
while a 1 indicates that reads and writes for a particular MSR are not
|
||||
filtered by this range.
|
||||
|
||||
flags values for ``struct kvm_msr_filter``:
|
||||
|
||||
``KVM_MSR_FILTER_DEFAULT_ALLOW``
|
||||
|
||||
If no filter range matches an MSR index that is getting accessed, KVM will
|
||||
fall back to allowing access to the MSR.
|
||||
allow accesses to all MSRs by default.
|
||||
|
||||
``KVM_MSR_FILTER_DEFAULT_DENY``
|
||||
|
||||
If no filter range matches an MSR index that is getting accessed, KVM will
|
||||
fall back to rejecting access to the MSR. In this mode, all MSRs that should
|
||||
be processed by KVM need to explicitly be marked as allowed in the bitmaps.
|
||||
deny accesses to all MSRs by default.
|
||||
|
||||
This ioctl allows user space to define up to 16 bitmaps of MSR ranges to
|
||||
specify whether a certain MSR access should be explicitly filtered for or not.
|
||||
This ioctl allows userspace to define up to 16 bitmaps of MSR ranges to deny
|
||||
guest MSR accesses that would normally be allowed by KVM. If an MSR is not
|
||||
covered by a specific range, the "default" filtering behavior applies. Each
|
||||
bitmap range covers MSRs from [base .. base+nmsrs).
|
||||
|
||||
If this ioctl has never been invoked, MSR accesses are not guarded and the
|
||||
default KVM in-kernel emulation behavior is fully preserved.
|
||||
If an MSR access is denied by userspace, the resulting KVM behavior depends on
|
||||
whether or not KVM_CAP_X86_USER_SPACE_MSR's KVM_MSR_EXIT_REASON_FILTER is
|
||||
enabled. If KVM_MSR_EXIT_REASON_FILTER is enabled, KVM will exit to userspace
|
||||
on denied accesses, i.e. userspace effectively intercepts the MSR access. If
|
||||
KVM_MSR_EXIT_REASON_FILTER is not enabled, KVM will inject a #GP into the guest
|
||||
on denied accesses.
|
||||
|
||||
If an MSR access is allowed by userspace, KVM will emulate and/or virtualize
|
||||
the access in accordance with the vCPU model. Note, KVM may still ultimately
|
||||
inject a #GP if an access is allowed by userspace, e.g. if KVM doesn't support
|
||||
the MSR, or to follow architectural behavior for the MSR.
|
||||
|
||||
By default, KVM operates in KVM_MSR_FILTER_DEFAULT_ALLOW mode with no MSR range
|
||||
filters.
|
||||
|
||||
Calling this ioctl with an empty set of ranges (all nmsrs == 0) disables MSR
|
||||
filtering. In that mode, ``KVM_MSR_FILTER_DEFAULT_DENY`` is invalid and causes
|
||||
an error.
|
||||
|
||||
As soon as the filtering is in place, every MSR access is processed through
|
||||
the filtering except for accesses to the x2APIC MSRs (from 0x800 to 0x8ff);
|
||||
x2APIC MSRs are always allowed, independent of the ``default_allow`` setting,
|
||||
and their behavior depends on the ``X2APIC_ENABLE`` bit of the APIC base
|
||||
register.
|
||||
|
||||
.. warning::
|
||||
MSR accesses coming from nested vmentry/vmexit are not filtered.
|
||||
MSR accesses as part of nested VM-Enter/VM-Exit are not filtered.
|
||||
This includes both writes to individual VMCS fields and reads/writes
|
||||
through the MSR lists pointed to by the VMCS.
|
||||
|
||||
If a bit is within one of the defined ranges, read and write accesses are
|
||||
guarded by the bitmap's value for the MSR index if the kind of access
|
||||
is included in the ``struct kvm_msr_filter_range`` flags. If no range
|
||||
cover this particular access, the behavior is determined by the flags
|
||||
field in the kvm_msr_filter struct: ``KVM_MSR_FILTER_DEFAULT_ALLOW``
|
||||
and ``KVM_MSR_FILTER_DEFAULT_DENY``.
|
||||
|
||||
Each bitmap range specifies a range of MSRs to potentially allow access on.
|
||||
The range goes from MSR index [base .. base+nmsrs]. The flags field
|
||||
indicates whether reads, writes or both reads and writes are filtered
|
||||
by setting a 1 bit in the bitmap for the corresponding MSR index.
|
||||
|
||||
If an MSR access is not permitted through the filtering, it generates a
|
||||
#GP inside the guest. When combined with KVM_CAP_X86_USER_SPACE_MSR, that
|
||||
allows user space to deflect and potentially handle various MSR accesses
|
||||
into user space.
|
||||
x2APIC MSR accesses cannot be filtered (KVM silently ignores filters that
|
||||
cover any x2APIC MSRs).
|
||||
|
||||
Note, invoking this ioctl while a vCPU is running is inherently racy. However,
|
||||
KVM does guarantee that vCPUs will see either the previous filter or the new
|
||||
filter, e.g. MSRs with identical settings in both the old and new filter will
|
||||
have deterministic behavior.
|
||||
|
||||
Similarly, if userspace wishes to intercept on denied accesses,
|
||||
KVM_MSR_EXIT_REASON_FILTER must be enabled before activating any filters, and
|
||||
left enabled until after all filters are deactivated. Failure to do so may
|
||||
result in KVM injecting a #GP instead of exiting to userspace.
|
||||
|
||||
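As a concrete illustration of the deny flow described above, here is a minimal
userspace sketch (not part of this series; the helper name and the choice of
MSR are illustrative only) that installs a single write-deny range while
leaving the default action as "allow"::

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Deny guest writes to one MSR; everything else keeps the default (allow). */
    static int deny_writes_to_one_msr(int vm_fd, __u32 msr_index)
    {
            __u8 bitmap = 0x00;                     /* bit 0 clear == deny the covered access */
            struct kvm_msr_filter filter;

            memset(&filter, 0, sizeof(filter));
            filter.flags = KVM_MSR_FILTER_DEFAULT_ALLOW;
            filter.ranges[0].flags = KVM_MSR_FILTER_WRITE;  /* filter writes only */
            filter.ranges[0].base = msr_index;
            filter.ranges[0].nmsrs = 1;             /* range covers [base .. base+1) */
            filter.ranges[0].bitmap = &bitmap;

            /* Unused slots stay zeroed (nmsrs == 0) and are skipped by KVM. */
            return ioctl(vm_fd, KVM_X86_SET_MSR_FILTER, &filter);
    }

Whether a denied write then surfaces as a KVM_EXIT_X86_WRMSR exit or as a #GP
in the guest depends on KVM_MSR_EXIT_REASON_FILTER, as explained above.
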
4.98 KVM_CREATE_SPAPR_TCE_64
|
||||
----------------------------
|
||||
|
||||
@@ -5163,10 +5129,13 @@ KVM_PV_ENABLE
|
||||
===== =============================
|
||||
|
||||
KVM_PV_DISABLE
|
||||
Deregister the VM from the Ultravisor and reclaim the memory that
|
||||
had been donated to the Ultravisor, making it usable by the kernel
|
||||
again. All registered VCPUs are converted back to non-protected
|
||||
ones.
|
||||
Deregister the VM from the Ultravisor and reclaim the memory that had
|
||||
been donated to the Ultravisor, making it usable by the kernel again.
|
||||
All registered VCPUs are converted back to non-protected ones. If a
|
||||
previous protected VM had been prepared for asynchronous teardown with
|
||||
KVM_PV_ASYNC_CLEANUP_PREPARE and not subsequently torn down with
|
||||
KVM_PV_ASYNC_CLEANUP_PERFORM, it will be torn down in this call
|
||||
together with the current protected VM.
|
||||
|
||||
KVM_PV_VM_SET_SEC_PARMS
|
||||
Pass the image header from VM memory to the Ultravisor in
|
||||
@@ -5289,6 +5258,36 @@ KVM_PV_DUMP
|
||||
authentication tag all of which are needed to decrypt the dump at a
|
||||
later time.
|
||||
|
||||
KVM_PV_ASYNC_CLEANUP_PREPARE
|
||||
:Capability: KVM_CAP_S390_PROTECTED_ASYNC_DISABLE
|
||||
|
||||
Prepare the current protected VM for asynchronous teardown. Most
|
||||
resources used by the current protected VM will be set aside for a
|
||||
subsequent asynchronous teardown. The current protected VM will then
|
||||
resume execution immediately as non-protected. There can be at most
|
||||
one protected VM prepared for asynchronous teardown at any time. If
|
||||
a protected VM had already been prepared for teardown without
|
||||
subsequently calling KVM_PV_ASYNC_CLEANUP_PERFORM, this call will
|
||||
fail. In that case, the userspace process should issue a normal
|
||||
KVM_PV_DISABLE. The resources set aside with this call will need to
|
||||
be cleaned up with a subsequent call to KVM_PV_ASYNC_CLEANUP_PERFORM
|
||||
or KVM_PV_DISABLE, otherwise they will be cleaned up when KVM
|
||||
terminates. KVM_PV_ASYNC_CLEANUP_PREPARE can be called again as soon
|
||||
as cleanup starts, i.e. before KVM_PV_ASYNC_CLEANUP_PERFORM finishes.
|
||||
|
||||
KVM_PV_ASYNC_CLEANUP_PERFORM
|
||||
:Capability: KVM_CAP_S390_PROTECTED_ASYNC_DISABLE
|
||||
|
||||
Tear down the protected VM previously prepared for teardown with
|
||||
KVM_PV_ASYNC_CLEANUP_PREPARE. The resources that had been set aside
|
||||
will be freed during the execution of this command. This PV command
|
||||
should ideally be issued by userspace from a separate thread. If a
|
||||
fatal signal is received (or the process terminates naturally), the
|
||||
command will terminate immediately without completing, and the normal
|
||||
KVM shutdown procedure will take care of cleaning up all remaining
|
||||
protected VMs, including the ones whose teardown was interrupted by
|
||||
process termination.
|
||||
|
||||
4.126 KVM_XEN_HVM_SET_ATTR
|
||||
--------------------------
|
||||
|
||||
@@ -5306,6 +5305,7 @@ KVM_PV_DUMP
|
||||
union {
|
||||
__u8 long_mode;
|
||||
__u8 vector;
|
||||
__u8 runstate_update_flag;
|
||||
struct {
|
||||
__u64 gfn;
|
||||
} shared_info;
|
||||
@@ -5383,6 +5383,14 @@ KVM_XEN_ATTR_TYPE_XEN_VERSION
|
||||
event channel delivery, so responding within the kernel without
|
||||
exiting to userspace is beneficial.
|
||||
|
||||
KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG
|
||||
This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates
|
||||
support for KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG. It enables the
|
||||
XEN_RUNSTATE_UPDATE flag which allows guest vCPUs to safely read
|
||||
other vCPUs' vcpu_runstate_info. Xen guests enable this feature via
|
||||
the VM_ASST_TYPE_runstate_update_flag of the HYPERVISOR_vm_assist
|
||||
hypercall.
|
||||
|
||||
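A VMM that wants this more correct behaviour would typically react to the
guest's HYPERVISOR_vm_assist call by setting the attribute, along these lines
(hedged userspace sketch, not part of this series)::

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int enable_runstate_update_flag(int vm_fd)
    {
            struct kvm_xen_hvm_attr attr = {
                    .type = KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG,
                    .u.runstate_update_flag = 1,
            };

            return ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &attr);
    }
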
4.127 KVM_XEN_HVM_GET_ATTR
|
||||
--------------------------
|
||||
|
||||
@@ -6440,31 +6448,35 @@ if it decides to decode and emulate the instruction.

Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
enabled, MSR accesses to registers that would invoke a #GP by KVM kernel code
will instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
may instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
exit for writes.

The "reason" field specifies why the MSR trap occurred. User space will only
receive MSR exit traps when a particular reason was requested during through
The "reason" field specifies why the MSR interception occurred. Userspace will
only receive MSR exits when a particular reason was requested during through
ENABLE_CAP. Currently valid exit reasons are:

KVM_MSR_EXIT_REASON_UNKNOWN - access to MSR that is unknown to KVM
KVM_MSR_EXIT_REASON_INVAL - access to invalid MSRs or reserved bits
KVM_MSR_EXIT_REASON_FILTER - access blocked by KVM_X86_SET_MSR_FILTER
============================ ========================================
KVM_MSR_EXIT_REASON_UNKNOWN  access to MSR that is unknown to KVM
KVM_MSR_EXIT_REASON_INVAL    access to invalid MSRs or reserved bits
KVM_MSR_EXIT_REASON_FILTER   access blocked by KVM_X86_SET_MSR_FILTER
============================ ========================================

For KVM_EXIT_X86_RDMSR, the "index" field tells user space which MSR the guest
wants to read. To respond to this request with a successful read, user space
For KVM_EXIT_X86_RDMSR, the "index" field tells userspace which MSR the guest
wants to read. To respond to this request with a successful read, userspace
writes the respective data into the "data" field and must continue guest
execution to ensure the read data is transferred into guest register state.

If the RDMSR request was unsuccessful, user space indicates that with a "1" in
If the RDMSR request was unsuccessful, userspace indicates that with a "1" in
the "error" field. This will inject a #GP into the guest when the VCPU is
executed again.

For KVM_EXIT_X86_WRMSR, the "index" field tells user space which MSR the guest
wants to write. Once finished processing the event, user space must continue
vCPU execution. If the MSR write was unsuccessful, user space also sets the
For KVM_EXIT_X86_WRMSR, the "index" field tells userspace which MSR the guest
wants to write. Once finished processing the event, userspace must continue
vCPU execution. If the MSR write was unsuccessful, userspace also sets the
"error" field to "1".

See KVM_X86_SET_MSR_FILTER for details on the interaction with MSR filtering.

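A hedged sketch of the userspace side of this protocol (not from this series;
emulate_rdmsr()/emulate_wrmsr() are hypothetical VMM helpers) might look like::

    #include <linux/kvm.h>

    /* Hypothetical VMM helpers; return 0 on success, non-zero on failure. */
    int emulate_rdmsr(__u32 index, __u64 *data);
    int emulate_wrmsr(__u32 index, __u64 data);

    /* "run" is the vCPU's mmap-ed struct kvm_run, inspected after KVM_RUN. */
    static void handle_msr_exit(struct kvm_run *run)
    {
            switch (run->exit_reason) {
            case KVM_EXIT_X86_RDMSR:
                    if (emulate_rdmsr(run->msr.index, &run->msr.data))
                            run->msr.error = 1;     /* KVM injects #GP on next KVM_RUN */
                    break;
            case KVM_EXIT_X86_WRMSR:
                    if (emulate_wrmsr(run->msr.index, run->msr.data))
                            run->msr.error = 1;
                    break;
            }
            /* The result is consumed when the vCPU is run again. */
    }
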
::
|
||||
|
||||
|
||||
@@ -7229,19 +7241,29 @@ polling.
:Parameters: args[0] contains the mask of KVM_MSR_EXIT_REASON_* events to report
:Returns: 0 on success; -1 on error

This capability enables trapping of #GP invoking RDMSR and WRMSR instructions
into user space.
This capability allows userspace to intercept RDMSR and WRMSR instructions if
access to an MSR is denied. By default, KVM injects #GP on denied accesses.

When a guest requests to read or write an MSR, KVM may not implement all MSRs
that are relevant to a respective system. It also does not differentiate by
CPU type.

To allow more fine grained control over MSR handling, user space may enable
To allow more fine grained control over MSR handling, userspace may enable
this capability. With it enabled, MSR accesses that match the mask specified in
args[0] and trigger a #GP event inside the guest by KVM will instead trigger
KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications which user space
can then handle to implement model specific MSR handling and/or user notifications
to inform a user that an MSR was not handled.
args[0] and would trigger a #GP inside the guest will instead trigger
KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications. Userspace
can then implement model specific MSR handling and/or user notifications
to inform a user that an MSR was not emulated/virtualized by KVM.

The valid mask flags are:

============================ ===============================================
KVM_MSR_EXIT_REASON_UNKNOWN  intercept accesses to unknown (to KVM) MSRs
KVM_MSR_EXIT_REASON_INVAL    intercept accesses that are architecturally
                             invalid according to the vCPU model and/or mode
KVM_MSR_EXIT_REASON_FILTER   intercept accesses that are denied by userspace
                             via KVM_X86_SET_MSR_FILTER
============================ ===============================================

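For example, a VMM that only wants to see exits for accesses it has denied
through KVM_X86_SET_MSR_FILTER could enable the capability like this
(illustrative userspace sketch, not part of this series)::

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int enable_filtered_msr_exits(int vm_fd)
    {
            struct kvm_enable_cap cap = {
                    .cap = KVM_CAP_X86_USER_SPACE_MSR,
                    .args[0] = KVM_MSR_EXIT_REASON_FILTER, /* only filter-denied accesses exit */
            };

            return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
    }
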
7.22 KVM_CAP_X86_BUS_LOCK_EXIT
|
||||
-------------------------------
|
||||
@@ -7384,8 +7406,9 @@ hibernation of the host; however the VMM needs to manually save/restore the
|
||||
tags as appropriate if the VM is migrated.
|
||||
|
||||
When this capability is enabled all memory in memslots must be mapped as
|
||||
not-shareable (no MAP_SHARED), attempts to create a memslot with a
|
||||
MAP_SHARED mmap will result in an -EINVAL return.
|
||||
``MAP_ANONYMOUS`` or with a RAM-based file mapping (``tmpfs``, ``memfd``),
|
||||
attempts to create a memslot with an invalid mmap will result in an
|
||||
-EINVAL return.
|
||||
|
||||
When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
|
||||
perform a bulk copy of tags to/from the guest.
|
||||
@@ -7901,7 +7924,7 @@ KVM_EXIT_X86_WRMSR exit notifications.
|
||||
This capability indicates that KVM supports that accesses to user defined MSRs
|
||||
may be rejected. With this capability exposed, KVM exports new VM ioctl
|
||||
KVM_X86_SET_MSR_FILTER which user space can call to specify bitmaps of MSR
|
||||
ranges that KVM should reject access to.
|
||||
ranges that KVM should deny access to.
|
||||
|
||||
In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to
|
||||
trap and emulate MSRs that are outside of the scope of KVM as well as
|
||||
@@ -7920,7 +7943,7 @@ regardless of what has actually been exposed through the CPUID leaf.
|
||||
8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
|
||||
----------------------------------------------------------
|
||||
|
||||
:Architectures: x86
|
||||
:Architectures: x86, arm64
|
||||
:Parameters: args[0] - size of the dirty log ring
|
||||
|
||||
KVM is capable of tracking dirty memory using ring buffers that are
|
||||
@@ -8002,13 +8025,6 @@ flushing is done by the KVM_GET_DIRTY_LOG ioctl). To achieve that, one
|
||||
needs to kick the vcpu out of KVM_RUN using a signal. The resulting
|
||||
vmexit ensures that all dirty GFNs are flushed to the dirty rings.
|
||||
|
||||
NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
|
||||
ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls
|
||||
KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG. After enabling
|
||||
KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
|
||||
machine will switch to ring-buffer dirty page tracking and further
|
||||
KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.
|
||||
|
||||
NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
|
||||
should be exposed by weakly ordered architecture, in order to indicate
|
||||
the additional memory ordering requirements imposed on userspace when
|
||||
@@ -8017,6 +8033,33 @@ Architecture with TSO-like ordering (such as x86) are allowed to
|
||||
expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
|
||||
to userspace.
|
||||
|
||||
After enabling the dirty rings, the userspace needs to detect the
|
||||
capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the
|
||||
ring structures can be backed by per-slot bitmaps. With this capability
|
||||
advertised, it means the architecture can dirty guest pages without
|
||||
vcpu/ring context, so that some of the dirty information will still be
|
||||
maintained in the bitmap structure. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
|
||||
can't be enabled if the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL
|
||||
hasn't been enabled, or any memslot has been existing.
|
||||
|
||||
Note that the bitmap here is only a backup of the ring structure. The
|
||||
use of the ring and bitmap combination is only beneficial if there is
|
||||
only a very small amount of memory that is dirtied out of vcpu/ring
|
||||
context. Otherwise, the stand-alone per-slot bitmap mechanism needs to
|
||||
be considered.
|
||||
|
||||
To collect dirty bits in the backup bitmap, userspace can use the same
|
||||
KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG isn't needed as long as all
|
||||
the generation of the dirty bits is done in a single pass. Collecting
|
||||
the dirty bitmap should be the very last thing that the VMM does before
|
||||
considering the state as complete. VMM needs to ensure that the dirty
|
||||
state is final and avoid missing dirty pages from another ioctl ordered
|
||||
after the bitmap collection.
|
||||
|
||||
NOTE: One example of using the backup bitmap is saving arm64 vgic/its
|
||||
tables through KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} command on
|
||||
KVM device "kvm-arm-vgic-its" when dirty ring is enabled.
|
||||
|
||||
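For instance, once the vCPUs are stopped and their rings harvested, a VMM
could drain the backup bitmap of a memslot with the usual ioctl (hedged
userspace sketch, not from this series; slot_id and bitmap_size come from the
VMM's own memslot bookkeeping)::

    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static void *collect_backup_bitmap(int vm_fd, __u32 slot_id, size_t bitmap_size)
    {
            struct kvm_dirty_log log = { .slot = slot_id };

            log.dirty_bitmap = calloc(1, bitmap_size);
            if (!log.dirty_bitmap)
                    return NULL;

            /*
             * With the dirty ring enabled, this reports only pages dirtied
             * outside of vcpu/ring context (e.g. vgic/its table saves).
             */
            if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0) {
                    free(log.dirty_bitmap);
                    return NULL;
            }

            return log.dirty_bitmap;
    }
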
8.30 KVM_CAP_XEN_HVM
|
||||
--------------------
|
||||
|
||||
@@ -8025,12 +8068,13 @@ to userspace.
|
||||
This capability indicates the features that Xen supports for hosting Xen
|
||||
PVHVM guests. Valid flags are::
|
||||
|
||||
#define KVM_XEN_HVM_CONFIG_HYPERCALL_MSR (1 << 0)
|
||||
#define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL (1 << 1)
|
||||
#define KVM_XEN_HVM_CONFIG_SHARED_INFO (1 << 2)
|
||||
#define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 3)
|
||||
#define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4)
|
||||
#define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5)
|
||||
#define KVM_XEN_HVM_CONFIG_HYPERCALL_MSR (1 << 0)
|
||||
#define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL (1 << 1)
|
||||
#define KVM_XEN_HVM_CONFIG_SHARED_INFO (1 << 2)
|
||||
#define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 3)
|
||||
#define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4)
|
||||
#define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5)
|
||||
#define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG (1 << 6)
|
||||
|
||||
The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG
|
||||
ioctl is available, for the guest to set its hypercall page.
|
||||
@@ -8062,6 +8106,18 @@ KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID/TIMER/UPCALL_VECTOR vCPU attributes.
|
||||
related to event channel delivery, timers, and the XENVER_version
|
||||
interception.
|
||||
|
||||
The KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG flag indicates that KVM supports
|
||||
the KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG attribute in the KVM_XEN_SET_ATTR
|
||||
and KVM_XEN_GET_ATTR ioctls. This controls whether KVM will set the
|
||||
XEN_RUNSTATE_UPDATE flag in guest memory mapped vcpu_runstate_info during
|
||||
updates of the runstate information. Note that versions of KVM which support
|
||||
the RUNSTATE feature above, but not the RUNSTATE_UPDATE_FLAG feature, will
|
||||
always set the XEN_RUNSTATE_UPDATE flag when updating the guest structure,
|
||||
which is perhaps counterintuitive. When this flag is advertised, KVM will
|
||||
behave more correctly, not using the XEN_RUNSTATE_UPDATE flag until/unless
|
||||
specifically enabled (by the guest making the hypercall, causing the VMM
|
||||
to enable the KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG attribute).
|
||||
|
||||
8.31 KVM_CAP_PPC_MULTITCE
|
||||
-------------------------
|
||||
|
||||
|
||||
@@ -23,21 +23,23 @@ the PV_TIME_FEATURES hypercall should be probed using the SMCCC 1.1
|
||||
ARCH_FEATURES mechanism before calling it.
|
||||
|
||||
PV_TIME_FEATURES
|
||||
============= ======== ==========
|
||||
|
||||
============= ======== =================================================
|
||||
Function ID: (uint32) 0xC5000020
|
||||
PV_call_id: (uint32) The function to query for support.
|
||||
Currently only PV_TIME_ST is supported.
|
||||
Return value: (int64) NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
|
||||
PV-time feature is supported by the hypervisor.
|
||||
============= ======== ==========
|
||||
============= ======== =================================================
|
||||
|
||||
PV_TIME_ST
|
||||
============= ======== ==========
|
||||
|
||||
============= ======== ==============================================
|
||||
Function ID: (uint32) 0xC5000021
|
||||
Return value: (int64) IPA of the stolen time data structure for this
|
||||
VCPU. On failure:
|
||||
NOT_SUPPORTED (-1)
|
||||
============= ======== ==========
|
||||
============= ======== ==============================================
|
||||
|
||||
The IPA returned by PV_TIME_ST should be mapped by the guest as normal memory
|
||||
with inner and outer write back caching attributes, in the inner shareable
|
||||
@@ -76,5 +78,5 @@ It is advisable that one or more 64k pages are set aside for the purpose of
|
||||
these structures and not used for other purposes, this enables the guest to map
|
||||
the region using 64k pages and avoids conflicting attributes with other memory.
|
||||
|
||||
For the user space interface see Documentation/virt/kvm/devices/vcpu.rst
|
||||
section "3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL".
|
||||
For the user space interface see
|
||||
:ref:`Documentation/virt/kvm/devices/vcpu.rst <kvm_arm_vcpu_pvtime_ctrl>`.
|
||||
@@ -52,7 +52,10 @@ KVM_DEV_ARM_VGIC_GRP_CTRL
|
||||
|
||||
KVM_DEV_ARM_ITS_SAVE_TABLES
|
||||
save the ITS table data into guest RAM, at the location provisioned
|
||||
by the guest in corresponding registers/table entries.
|
||||
by the guest in corresponding registers/table entries. Should userspace
|
||||
require a form of dirty tracking to identify which pages are modified
|
||||
by the saving process, it should use a bitmap even if using another
|
||||
mechanism to track the memory dirtied by the vCPUs.
|
||||
|
||||
The layout of the tables in guest memory defines an ABI. The entries
|
||||
are laid out in little endian format as described in the last paragraph.
|
||||
|
||||
@@ -171,6 +171,8 @@ configured values on other VCPUs. Userspace should configure the interrupt
|
||||
numbers on at least one VCPU after creating all VCPUs and before running any
|
||||
VCPUs.
|
||||
|
||||
.. _kvm_arm_vcpu_pvtime_ctrl:
|
||||
|
||||
3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL
|
||||
==================================
|
||||
|
||||
|
||||
MAINTAINERS
@@ -11438,6 +11438,16 @@ F: arch/x86/kvm/svm/hyperv.*
|
||||
F: arch/x86/kvm/svm/svm_onhyperv.*
|
||||
F: arch/x86/kvm/vmx/evmcs.*
|
||||
|
||||
KVM X86 Xen (KVM/Xen)
|
||||
M: David Woodhouse <dwmw2@infradead.org>
|
||||
M: Paul Durrant <paul@xen.org>
|
||||
M: Sean Christopherson <seanjc@google.com>
|
||||
M: Paolo Bonzini <pbonzini@redhat.com>
|
||||
L: kvm@vger.kernel.org
|
||||
S: Supported
|
||||
T: git git://git.kernel.org/pub/scm/virt/kvm/kvm.git
|
||||
F: arch/x86/kvm/xen.*
|
||||
|
||||
KERNFS
|
||||
M: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||||
M: Tejun Heo <tj@kernel.org>
|
||||
|
||||
@@ -1988,6 +1988,7 @@ config ARM64_MTE
|
||||
depends on ARM64_PAN
|
||||
select ARCH_HAS_SUBPAGE_FAULTS
|
||||
select ARCH_USES_HIGH_VMA_FLAGS
|
||||
select ARCH_USES_PG_ARCH_X
|
||||
help
|
||||
Memory Tagging (part of the ARMv8.5 Extensions) provides
|
||||
architectural support for run-time, always-on detection of
|
||||
|
||||
@@ -135,7 +135,7 @@
|
||||
* 40 bits wide (T0SZ = 24). Systems with a PARange smaller than 40 bits are
|
||||
* not known to exist and will break with this configuration.
|
||||
*
|
||||
* The VTCR_EL2 is configured per VM and is initialised in kvm_arm_setup_stage2().
|
||||
* The VTCR_EL2 is configured per VM and is initialised in kvm_init_stage2_mmu.
|
||||
*
|
||||
* Note that when using 4K pages, we concatenate two first level page tables
|
||||
* together. With 16K pages, we concatenate 16 first level page tables.
|
||||
@@ -340,9 +340,13 @@
|
||||
* We have
|
||||
* PAR [PA_Shift - 1 : 12] = PA [PA_Shift - 1 : 12]
|
||||
* HPFAR [PA_Shift - 9 : 4] = FIPA [PA_Shift - 1 : 12]
|
||||
*
|
||||
* Always assume 52 bit PA since at this point, we don't know how many PA bits
|
||||
* the page table has been set up for. This should be safe since unused address
|
||||
* bits in PAR are res0.
|
||||
*/
|
||||
#define PAR_TO_HPFAR(par) \
|
||||
(((par) & GENMASK_ULL(PHYS_MASK_SHIFT - 1, 12)) >> 8)
|
||||
(((par) & GENMASK_ULL(52 - 1, 12)) >> 8)
|
||||
|
||||
#define ECN(x) { ESR_ELx_EC_##x, #x }
|
||||
|
||||
|
||||
@@ -76,6 +76,9 @@ enum __kvm_host_smccc_func {
|
||||
__KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs,
|
||||
__KVM_HOST_SMCCC_FUNC___vgic_v3_restore_aprs,
|
||||
__KVM_HOST_SMCCC_FUNC___pkvm_vcpu_init_traps,
|
||||
__KVM_HOST_SMCCC_FUNC___pkvm_init_vm,
|
||||
__KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu,
|
||||
__KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm,
|
||||
};
|
||||
|
||||
#define DECLARE_KVM_VHE_SYM(sym) extern char sym[]
|
||||
@@ -106,7 +109,7 @@ enum __kvm_host_smccc_func {
|
||||
#define per_cpu_ptr_nvhe_sym(sym, cpu) \
|
||||
({ \
|
||||
unsigned long base, off; \
|
||||
base = kvm_arm_hyp_percpu_base[cpu]; \
|
||||
base = kvm_nvhe_sym(kvm_arm_hyp_percpu_base)[cpu]; \
|
||||
off = (unsigned long)&CHOOSE_NVHE_SYM(sym) - \
|
||||
(unsigned long)&CHOOSE_NVHE_SYM(__per_cpu_start); \
|
||||
base ? (typeof(CHOOSE_NVHE_SYM(sym))*)(base + off) : NULL; \
|
||||
@@ -211,7 +214,7 @@ DECLARE_KVM_HYP_SYM(__kvm_hyp_vector);
|
||||
#define __kvm_hyp_init CHOOSE_NVHE_SYM(__kvm_hyp_init)
|
||||
#define __kvm_hyp_vector CHOOSE_HYP_SYM(__kvm_hyp_vector)
|
||||
|
||||
extern unsigned long kvm_arm_hyp_percpu_base[NR_CPUS];
|
||||
extern unsigned long kvm_nvhe_sym(kvm_arm_hyp_percpu_base)[];
|
||||
DECLARE_KVM_NVHE_SYM(__per_cpu_start);
|
||||
DECLARE_KVM_NVHE_SYM(__per_cpu_end);
|
||||
|
||||
|
||||
@@ -73,6 +73,63 @@ u32 __attribute_const__ kvm_target_cpu(void);
int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);

struct kvm_hyp_memcache {
	phys_addr_t head;
	unsigned long nr_pages;
};

static inline void push_hyp_memcache(struct kvm_hyp_memcache *mc,
				     phys_addr_t *p,
				     phys_addr_t (*to_pa)(void *virt))
{
	*p = mc->head;
	mc->head = to_pa(p);
	mc->nr_pages++;
}

static inline void *pop_hyp_memcache(struct kvm_hyp_memcache *mc,
				     void *(*to_va)(phys_addr_t phys))
{
	phys_addr_t *p = to_va(mc->head);

	if (!mc->nr_pages)
		return NULL;

	mc->head = *p;
	mc->nr_pages--;

	return p;
}

static inline int __topup_hyp_memcache(struct kvm_hyp_memcache *mc,
				       unsigned long min_pages,
				       void *(*alloc_fn)(void *arg),
				       phys_addr_t (*to_pa)(void *virt),
				       void *arg)
{
	while (mc->nr_pages < min_pages) {
		phys_addr_t *p = alloc_fn(arg);

		if (!p)
			return -ENOMEM;
		push_hyp_memcache(mc, p, to_pa);
	}

	return 0;
}

static inline void __free_hyp_memcache(struct kvm_hyp_memcache *mc,
					void (*free_fn)(void *virt, void *arg),
					void *(*to_va)(phys_addr_t phys),
					void *arg)
{
	while (mc->nr_pages)
		free_fn(pop_hyp_memcache(mc, to_va), arg);
}

void free_hyp_memcache(struct kvm_hyp_memcache *mc);
int topup_hyp_memcache(struct kvm_hyp_memcache *mc, unsigned long min_pages);

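/*
 * Illustrative sketch (not part of this patch): the memcache is a LIFO of
 * donated pages chained through their first word.  A host-side caller might
 * wire the generic helpers like this; example_alloc()/example_to_pa() are
 * placeholders standing in for the callbacks used by the in-tree callers.
 */
static void *example_alloc(void *arg)
{
	return (void *)__get_free_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
}

static phys_addr_t example_to_pa(void *virt)
{
	return __pa(virt);
}

/* Queue at least @min_pages pages before a hyp call that may consume them. */
static int example_topup(struct kvm_hyp_memcache *mc, unsigned long min_pages)
{
	return __topup_hyp_memcache(mc, min_pages, example_alloc,
				    example_to_pa, NULL);
}
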
struct kvm_vmid {
|
||||
atomic64_t id;
|
||||
};
|
||||
@@ -115,6 +172,13 @@ struct kvm_smccc_features {
|
||||
unsigned long vendor_hyp_bmap;
|
||||
};
|
||||
|
||||
typedef unsigned int pkvm_handle_t;
|
||||
|
||||
struct kvm_protected_vm {
|
||||
pkvm_handle_t handle;
|
||||
struct kvm_hyp_memcache teardown_mc;
|
||||
};
|
||||
|
||||
struct kvm_arch {
|
||||
struct kvm_s2_mmu mmu;
|
||||
|
||||
@@ -163,9 +227,19 @@ struct kvm_arch {
|
||||
|
||||
u8 pfr0_csv2;
|
||||
u8 pfr0_csv3;
|
||||
struct {
|
||||
u8 imp:4;
|
||||
u8 unimp:4;
|
||||
} dfr0_pmuver;
|
||||
|
||||
/* Hypercall features firmware registers' descriptor */
|
||||
struct kvm_smccc_features smccc_feat;
|
||||
|
||||
/*
|
||||
* For an untrusted host VM, 'pkvm.handle' is used to lookup
|
||||
* the associated pKVM instance in the hypervisor.
|
||||
*/
|
||||
struct kvm_protected_vm pkvm;
|
||||
};
|
||||
|
||||
struct kvm_vcpu_fault_info {
|
||||
@@ -925,8 +999,6 @@ int kvm_set_ipa_limit(void);
|
||||
#define __KVM_HAVE_ARCH_VM_ALLOC
|
||||
struct kvm *kvm_arch_alloc_vm(void);
|
||||
|
||||
int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type);
|
||||
|
||||
static inline bool kvm_vm_is_protected(struct kvm *kvm)
|
||||
{
|
||||
return false;
|
||||
|
||||
@@ -123,4 +123,7 @@ extern u64 kvm_nvhe_sym(id_aa64mmfr0_el1_sys_val);
|
||||
extern u64 kvm_nvhe_sym(id_aa64mmfr1_el1_sys_val);
|
||||
extern u64 kvm_nvhe_sym(id_aa64mmfr2_el1_sys_val);
|
||||
|
||||
extern unsigned long kvm_nvhe_sym(__icache_flags);
|
||||
extern unsigned int kvm_nvhe_sym(kvm_arm_vmid_bits);
|
||||
|
||||
#endif /* __ARM64_KVM_HYP_H__ */
|
||||
|
||||
@@ -166,7 +166,7 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
|
||||
void free_hyp_pgds(void);
|
||||
|
||||
void stage2_unmap_vm(struct kvm *kvm);
|
||||
int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu);
|
||||
int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long type);
|
||||
void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
|
||||
int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
|
||||
phys_addr_t pa, unsigned long size, bool writable);
|
||||
|
||||
@@ -42,6 +42,8 @@ typedef u64 kvm_pte_t;
|
||||
#define KVM_PTE_ADDR_MASK GENMASK(47, PAGE_SHIFT)
|
||||
#define KVM_PTE_ADDR_51_48 GENMASK(15, 12)
|
||||
|
||||
#define KVM_PHYS_INVALID (-1ULL)
|
||||
|
||||
static inline bool kvm_pte_valid(kvm_pte_t pte)
|
||||
{
|
||||
return pte & KVM_PTE_VALID;
|
||||
@@ -57,6 +59,18 @@ static inline u64 kvm_pte_to_phys(kvm_pte_t pte)
|
||||
return pa;
|
||||
}
|
||||
|
||||
static inline kvm_pte_t kvm_phys_to_pte(u64 pa)
|
||||
{
|
||||
kvm_pte_t pte = pa & KVM_PTE_ADDR_MASK;
|
||||
|
||||
if (PAGE_SHIFT == 16) {
|
||||
pa &= GENMASK(51, 48);
|
||||
pte |= FIELD_PREP(KVM_PTE_ADDR_51_48, pa >> 48);
|
||||
}
|
||||
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline u64 kvm_granule_shift(u32 level)
|
||||
{
|
||||
/* Assumes KVM_PGTABLE_MAX_LEVELS is 4 */
|
||||
@@ -85,6 +99,8 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
|
||||
* allocation is physically contiguous.
|
||||
* @free_pages_exact: Free an exact number of memory pages previously
|
||||
* allocated by zalloc_pages_exact.
|
||||
* @free_removed_table: Free a removed paging structure by unlinking and
|
||||
* dropping references.
|
||||
* @get_page: Increment the refcount on a page.
|
||||
* @put_page: Decrement the refcount on a page. When the
|
||||
* refcount reaches 0 the page is automatically
|
||||
@@ -103,6 +119,7 @@ struct kvm_pgtable_mm_ops {
|
||||
void* (*zalloc_page)(void *arg);
|
||||
void* (*zalloc_pages_exact)(size_t size);
|
||||
void (*free_pages_exact)(void *addr, size_t size);
|
||||
void (*free_removed_table)(void *addr, u32 level);
|
||||
void (*get_page)(void *addr);
|
||||
void (*put_page)(void *addr);
|
||||
int (*page_count)(void *addr);
|
||||
@@ -161,6 +178,121 @@ enum kvm_pgtable_prot {
|
||||
typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
|
||||
enum kvm_pgtable_prot prot);
|
||||
|
||||
/**
|
||||
* enum kvm_pgtable_walk_flags - Flags to control a depth-first page-table walk.
|
||||
* @KVM_PGTABLE_WALK_LEAF: Visit leaf entries, including invalid
|
||||
* entries.
|
||||
* @KVM_PGTABLE_WALK_TABLE_PRE: Visit table entries before their
|
||||
* children.
|
||||
* @KVM_PGTABLE_WALK_TABLE_POST: Visit table entries after their
|
||||
* children.
|
||||
* @KVM_PGTABLE_WALK_SHARED: Indicates the page-tables may be shared
|
||||
* with other software walkers.
|
||||
*/
|
||||
enum kvm_pgtable_walk_flags {
|
||||
KVM_PGTABLE_WALK_LEAF = BIT(0),
|
||||
KVM_PGTABLE_WALK_TABLE_PRE = BIT(1),
|
||||
KVM_PGTABLE_WALK_TABLE_POST = BIT(2),
|
||||
KVM_PGTABLE_WALK_SHARED = BIT(3),
|
||||
};
|
||||
|
||||
struct kvm_pgtable_visit_ctx {
|
||||
kvm_pte_t *ptep;
|
||||
kvm_pte_t old;
|
||||
void *arg;
|
||||
struct kvm_pgtable_mm_ops *mm_ops;
|
||||
u64 addr;
|
||||
u64 end;
|
||||
u32 level;
|
||||
enum kvm_pgtable_walk_flags flags;
|
||||
};
|
||||
|
||||
typedef int (*kvm_pgtable_visitor_fn_t)(const struct kvm_pgtable_visit_ctx *ctx,
|
||||
enum kvm_pgtable_walk_flags visit);
|
||||
|
||||
static inline bool kvm_pgtable_walk_shared(const struct kvm_pgtable_visit_ctx *ctx)
|
||||
{
|
||||
return ctx->flags & KVM_PGTABLE_WALK_SHARED;
|
||||
}
|
||||
|
||||
/**
|
||||
* struct kvm_pgtable_walker - Hook into a page-table walk.
|
||||
* @cb: Callback function to invoke during the walk.
|
||||
* @arg: Argument passed to the callback function.
|
||||
* @flags: Bitwise-OR of flags to identify the entry types on which to
|
||||
* invoke the callback function.
|
||||
*/
|
||||
struct kvm_pgtable_walker {
|
||||
const kvm_pgtable_visitor_fn_t cb;
|
||||
void * const arg;
|
||||
const enum kvm_pgtable_walk_flags flags;
|
||||
};
|
||||
|
||||
/*
|
||||
* RCU cannot be used in a non-kernel context such as the hyp. As such, page
|
||||
* table walkers used in hyp do not call into RCU and instead use other
|
||||
* synchronization mechanisms (such as a spinlock).
|
||||
*/
|
||||
#if defined(__KVM_NVHE_HYPERVISOR__) || defined(__KVM_VHE_HYPERVISOR__)
|
||||
|
||||
typedef kvm_pte_t *kvm_pteref_t;
|
||||
|
||||
static inline kvm_pte_t *kvm_dereference_pteref(struct kvm_pgtable_walker *walker,
|
||||
kvm_pteref_t pteref)
|
||||
{
|
||||
return pteref;
|
||||
}
|
||||
|
||||
static inline int kvm_pgtable_walk_begin(struct kvm_pgtable_walker *walker)
|
||||
{
|
||||
/*
|
||||
* Due to the lack of RCU (or a similar protection scheme), only
|
||||
* non-shared table walkers are allowed in the hypervisor.
|
||||
*/
|
||||
if (walker->flags & KVM_PGTABLE_WALK_SHARED)
|
||||
return -EPERM;
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static inline void kvm_pgtable_walk_end(struct kvm_pgtable_walker *walker) {}
|
||||
|
||||
static inline bool kvm_pgtable_walk_lock_held(void)
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
#else
|
||||
|
||||
typedef kvm_pte_t __rcu *kvm_pteref_t;
|
||||
|
||||
static inline kvm_pte_t *kvm_dereference_pteref(struct kvm_pgtable_walker *walker,
|
||||
kvm_pteref_t pteref)
|
||||
{
|
||||
return rcu_dereference_check(pteref, !(walker->flags & KVM_PGTABLE_WALK_SHARED));
|
||||
}
|
||||
|
||||
static inline int kvm_pgtable_walk_begin(struct kvm_pgtable_walker *walker)
|
||||
{
|
||||
if (walker->flags & KVM_PGTABLE_WALK_SHARED)
|
||||
rcu_read_lock();
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static inline void kvm_pgtable_walk_end(struct kvm_pgtable_walker *walker)
|
||||
{
|
||||
if (walker->flags & KVM_PGTABLE_WALK_SHARED)
|
||||
rcu_read_unlock();
|
||||
}
|
||||
|
||||
static inline bool kvm_pgtable_walk_lock_held(void)
|
||||
{
|
||||
return rcu_read_lock_held();
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
/**
|
||||
* struct kvm_pgtable - KVM page-table.
|
||||
* @ia_bits: Maximum input address size, in bits.
|
||||
@@ -175,7 +307,7 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
|
||||
struct kvm_pgtable {
|
||||
u32 ia_bits;
|
||||
u32 start_level;
|
||||
kvm_pte_t *pgd;
|
||||
kvm_pteref_t pgd;
|
||||
struct kvm_pgtable_mm_ops *mm_ops;
|
||||
|
||||
/* Stage-2 only */
|
||||
@@ -184,39 +316,6 @@ struct kvm_pgtable {
|
||||
kvm_pgtable_force_pte_cb_t force_pte_cb;
|
||||
};
|
||||
|
||||
/**
|
||||
* enum kvm_pgtable_walk_flags - Flags to control a depth-first page-table walk.
|
||||
* @KVM_PGTABLE_WALK_LEAF: Visit leaf entries, including invalid
|
||||
* entries.
|
||||
* @KVM_PGTABLE_WALK_TABLE_PRE: Visit table entries before their
|
||||
* children.
|
||||
* @KVM_PGTABLE_WALK_TABLE_POST: Visit table entries after their
|
||||
* children.
|
||||
*/
|
||||
enum kvm_pgtable_walk_flags {
|
||||
KVM_PGTABLE_WALK_LEAF = BIT(0),
|
||||
KVM_PGTABLE_WALK_TABLE_PRE = BIT(1),
|
||||
KVM_PGTABLE_WALK_TABLE_POST = BIT(2),
|
||||
};
|
||||
|
||||
typedef int (*kvm_pgtable_visitor_fn_t)(u64 addr, u64 end, u32 level,
|
||||
kvm_pte_t *ptep,
|
||||
enum kvm_pgtable_walk_flags flag,
|
||||
void * const arg);
|
||||
|
||||
/**
|
||||
* struct kvm_pgtable_walker - Hook into a page-table walk.
|
||||
* @cb: Callback function to invoke during the walk.
|
||||
* @arg: Argument passed to the callback function.
|
||||
* @flags: Bitwise-OR of flags to identify the entry types on which to
|
||||
* invoke the callback function.
|
||||
*/
|
||||
struct kvm_pgtable_walker {
|
||||
const kvm_pgtable_visitor_fn_t cb;
|
||||
void * const arg;
|
||||
const enum kvm_pgtable_walk_flags flags;
|
||||
};
|
||||
|
||||
/**
|
||||
* kvm_pgtable_hyp_init() - Initialise a hypervisor stage-1 page-table.
|
||||
* @pgt: Uninitialised page-table structure to initialise.
|
||||
@@ -296,6 +395,14 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
|
||||
*/
|
||||
u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
|
||||
|
||||
/**
|
||||
* kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD
|
||||
* @vtcr: Content of the VTCR register.
|
||||
*
|
||||
* Return: the size (in bytes) of the stage-2 PGD
|
||||
*/
|
||||
size_t kvm_pgtable_stage2_pgd_size(u64 vtcr);
|
||||
|
||||
/**
|
||||
* __kvm_pgtable_stage2_init() - Initialise a guest stage-2 page-table.
|
||||
* @pgt: Uninitialised page-table structure to initialise.
|
||||
@@ -324,6 +431,17 @@ int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
|
||||
*/
|
||||
void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
|
||||
|
||||
/**
|
||||
* kvm_pgtable_stage2_free_removed() - Free a removed stage-2 paging structure.
|
||||
* @mm_ops: Memory management callbacks.
|
||||
* @pgtable: Unlinked stage-2 paging structure to be freed.
|
||||
* @level: Level of the stage-2 paging structure to be freed.
|
||||
*
|
||||
* The page-table is assumed to be unreachable by any hardware walkers prior to
|
||||
* freeing and therefore no TLB invalidation is performed.
|
||||
*/
|
||||
void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level);
|
||||
|
||||
/**
|
||||
* kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table.
|
||||
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
|
||||
@@ -333,6 +451,7 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
|
||||
* @prot: Permissions and attributes for the mapping.
|
||||
* @mc: Cache of pre-allocated and zeroed memory from which to allocate
|
||||
* page-table pages.
|
||||
* @flags: Flags to control the page-table walk (ex. a shared walk)
|
||||
*
|
||||
* The offset of @addr within a page is ignored, @size is rounded-up to
|
||||
* the next page boundary and @phys is rounded-down to the previous page
|
||||
@@ -354,7 +473,7 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
|
||||
*/
|
||||
int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
|
||||
u64 phys, enum kvm_pgtable_prot prot,
|
||||
void *mc);
|
||||
void *mc, enum kvm_pgtable_walk_flags flags);
|
||||
|
||||
/**
|
||||
* kvm_pgtable_stage2_set_owner() - Unmap and annotate pages in the IPA space to
|
||||
|
||||
@@ -9,11 +9,49 @@
|
||||
#include <linux/memblock.h>
|
||||
#include <asm/kvm_pgtable.h>
|
||||
|
||||
/* Maximum number of VMs that can co-exist under pKVM. */
|
||||
#define KVM_MAX_PVMS 255
|
||||
|
||||
#define HYP_MEMBLOCK_REGIONS 128
|
||||
|
||||
int pkvm_init_host_vm(struct kvm *kvm);
|
||||
int pkvm_create_hyp_vm(struct kvm *kvm);
|
||||
void pkvm_destroy_hyp_vm(struct kvm *kvm);
|
||||
|
||||
extern struct memblock_region kvm_nvhe_sym(hyp_memory)[];
|
||||
extern unsigned int kvm_nvhe_sym(hyp_memblock_nr);
|
||||
|
||||
static inline unsigned long
|
||||
hyp_vmemmap_memblock_size(struct memblock_region *reg, size_t vmemmap_entry_size)
|
||||
{
|
||||
unsigned long nr_pages = reg->size >> PAGE_SHIFT;
|
||||
unsigned long start, end;
|
||||
|
||||
start = (reg->base >> PAGE_SHIFT) * vmemmap_entry_size;
|
||||
end = start + nr_pages * vmemmap_entry_size;
|
||||
start = ALIGN_DOWN(start, PAGE_SIZE);
|
||||
end = ALIGN(end, PAGE_SIZE);
|
||||
|
||||
return end - start;
|
||||
}
|
||||
|
||||
static inline unsigned long hyp_vmemmap_pages(size_t vmemmap_entry_size)
|
||||
{
|
||||
unsigned long res = 0, i;
|
||||
|
||||
for (i = 0; i < kvm_nvhe_sym(hyp_memblock_nr); i++) {
|
||||
res += hyp_vmemmap_memblock_size(&kvm_nvhe_sym(hyp_memory)[i],
|
||||
vmemmap_entry_size);
|
||||
}
|
||||
|
||||
return res >> PAGE_SHIFT;
|
||||
}
|
||||
|
||||
static inline unsigned long hyp_vm_table_pages(void)
|
||||
{
|
||||
return PAGE_ALIGN(KVM_MAX_PVMS * sizeof(void *)) >> PAGE_SHIFT;
|
||||
}
|
||||
|
||||
static inline unsigned long __hyp_pgtable_max_pages(unsigned long nr_pages)
|
||||
{
|
||||
unsigned long total = 0, i;
|
||||
|
||||
@@ -25,7 +25,7 @@ unsigned long mte_copy_tags_to_user(void __user *to, void *from,
|
||||
unsigned long n);
|
||||
int mte_save_tags(struct page *page);
|
||||
void mte_save_page_tags(const void *page_addr, void *tag_storage);
|
||||
bool mte_restore_tags(swp_entry_t entry, struct page *page);
|
||||
void mte_restore_tags(swp_entry_t entry, struct page *page);
|
||||
void mte_restore_page_tags(void *page_addr, const void *tag_storage);
|
||||
void mte_invalidate_tags(int type, pgoff_t offset);
|
||||
void mte_invalidate_tags_area(int type);
|
||||
@@ -36,6 +36,58 @@ void mte_free_tag_storage(char *storage);
|
||||
|
||||
/* track which pages have valid allocation tags */
|
||||
#define PG_mte_tagged PG_arch_2
|
||||
/* simple lock to avoid multiple threads tagging the same page */
|
||||
#define PG_mte_lock PG_arch_3
|
||||
|
||||
static inline void set_page_mte_tagged(struct page *page)
|
||||
{
|
||||
/*
|
||||
* Ensure that the tags written prior to this function are visible
|
||||
* before the page flags update.
|
||||
*/
|
||||
smp_wmb();
|
||||
set_bit(PG_mte_tagged, &page->flags);
|
||||
}
|
||||
|
||||
static inline bool page_mte_tagged(struct page *page)
|
||||
{
|
||||
bool ret = test_bit(PG_mte_tagged, &page->flags);
|
||||
|
||||
/*
|
||||
* If the page is tagged, ensure ordering with a likely subsequent
|
||||
* read of the tags.
|
||||
*/
|
||||
if (ret)
|
||||
smp_rmb();
|
||||
return ret;
|
||||
}
|
||||
|
||||
/*
|
||||
* Lock the page for tagging and return 'true' if the page can be tagged,
|
||||
* 'false' if already tagged. PG_mte_tagged is never cleared and therefore the
|
||||
* locking only happens once for page initialisation.
|
||||
*
|
||||
* The page MTE lock state:
|
||||
*
|
||||
* Locked: PG_mte_lock && !PG_mte_tagged
|
||||
* Unlocked: !PG_mte_lock || PG_mte_tagged
|
||||
*
|
||||
* Acquire semantics only if the page is tagged (returning 'false').
|
||||
*/
|
||||
static inline bool try_page_mte_tagging(struct page *page)
|
||||
{
|
||||
if (!test_and_set_bit(PG_mte_lock, &page->flags))
|
||||
return true;
|
||||
|
||||
/*
|
||||
* The tags are either being initialised or may have been initialised
|
||||
* already. Check if the PG_mte_tagged flag has been set or wait
|
||||
* otherwise.
|
||||
*/
|
||||
smp_cond_load_acquire(&page->flags, VAL & (1UL << PG_mte_tagged));
|
||||
|
||||
return false;
|
||||
}
|
||||
|
||||
void mte_zero_clear_page_tags(void *addr);
|
||||
void mte_sync_tags(pte_t old_pte, pte_t pte);
|
||||
@@ -56,6 +108,17 @@ size_t mte_probe_user_range(const char __user *uaddr, size_t size);
|
||||
/* unused if !CONFIG_ARM64_MTE, silence the compiler */
|
||||
#define PG_mte_tagged 0
|
||||
|
||||
static inline void set_page_mte_tagged(struct page *page)
|
||||
{
|
||||
}
|
||||
static inline bool page_mte_tagged(struct page *page)
|
||||
{
|
||||
return false;
|
||||
}
|
||||
static inline bool try_page_mte_tagging(struct page *page)
|
||||
{
|
||||
return false;
|
||||
}
|
||||
static inline void mte_zero_clear_page_tags(void *addr)
|
||||
{
|
||||
}
|
||||
|
||||
@@ -1046,8 +1046,8 @@ static inline void arch_swap_invalidate_area(int type)
|
||||
#define __HAVE_ARCH_SWAP_RESTORE
|
||||
static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
|
||||
{
|
||||
if (system_supports_mte() && mte_restore_tags(entry, &folio->page))
|
||||
set_bit(PG_mte_tagged, &folio->flags);
|
||||
if (system_supports_mte())
|
||||
mte_restore_tags(entry, &folio->page);
|
||||
}
|
||||
|
||||
#endif /* CONFIG_ARM64_MTE */
|
||||
|
||||
@@ -43,6 +43,7 @@
|
||||
#define __KVM_HAVE_VCPU_EVENTS
|
||||
|
||||
#define KVM_COALESCED_MMIO_PAGE_OFFSET 1
|
||||
#define KVM_DIRTY_LOG_PAGE_OFFSET 64
|
||||
|
||||
#define KVM_REG_SIZE(id) \
|
||||
(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
|
||||
|
||||
@@ -2076,8 +2076,10 @@ static void cpu_enable_mte(struct arm64_cpu_capabilities const *cap)
|
||||
* Clear the tags in the zero page. This needs to be done via the
|
||||
* linear map which has the Tagged attribute.
|
||||
*/
|
||||
if (!test_and_set_bit(PG_mte_tagged, &ZERO_PAGE(0)->flags))
|
||||
if (try_page_mte_tagging(ZERO_PAGE(0))) {
|
||||
mte_clear_page_tags(lm_alias(empty_zero_page));
|
||||
set_page_mte_tagged(ZERO_PAGE(0));
|
||||
}
|
||||
|
||||
kasan_init_hw_tags_cpu();
|
||||
}
|
||||
|
||||
@@ -47,7 +47,7 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
|
||||
* Pages mapped in user space as !pte_access_permitted() (e.g.
|
||||
* PROT_EXEC only) may not have the PG_mte_tagged flag set.
|
||||
*/
|
||||
if (!test_bit(PG_mte_tagged, &page->flags)) {
|
||||
if (!page_mte_tagged(page)) {
|
||||
put_page(page);
|
||||
dump_skip(cprm, MTE_PAGE_TAG_STORAGE);
|
||||
continue;
|
||||
|
||||
@@ -271,7 +271,7 @@ static int swsusp_mte_save_tags(void)
|
||||
if (!page)
|
||||
continue;
|
||||
|
||||
if (!test_bit(PG_mte_tagged, &page->flags))
|
||||
if (!page_mte_tagged(page))
|
||||
continue;
|
||||
|
||||
ret = save_tags(page, pfn);
|
||||
|
||||
@@ -63,12 +63,6 @@ KVM_NVHE_ALIAS(nvhe_hyp_panic_handler);
|
||||
/* Vectors installed by hyp-init on reset HVC. */
|
||||
KVM_NVHE_ALIAS(__hyp_stub_vectors);
|
||||
|
||||
/* Kernel symbol used by icache_is_vpipt(). */
|
||||
KVM_NVHE_ALIAS(__icache_flags);
|
||||
|
||||
/* VMID bits set by the KVM VMID allocator */
|
||||
KVM_NVHE_ALIAS(kvm_arm_vmid_bits);
|
||||
|
||||
/* Static keys which are set if a vGIC trap should be handled in hyp. */
|
||||
KVM_NVHE_ALIAS(vgic_v2_cpuif_trap);
|
||||
KVM_NVHE_ALIAS(vgic_v3_cpuif_trap);
|
||||
@@ -84,9 +78,6 @@ KVM_NVHE_ALIAS(gic_nonsecure_priorities);
|
||||
KVM_NVHE_ALIAS(__start___kvm_ex_table);
|
||||
KVM_NVHE_ALIAS(__stop___kvm_ex_table);
|
||||
|
||||
/* Array containing bases of nVHE per-CPU memory regions. */
|
||||
KVM_NVHE_ALIAS(kvm_arm_hyp_percpu_base);
|
||||
|
||||
/* PMU available static key */
|
||||
#ifdef CONFIG_HW_PERF_EVENTS
|
||||
KVM_NVHE_ALIAS(kvm_arm_pmu_available);
|
||||
@@ -103,12 +94,6 @@ KVM_NVHE_ALIAS_HYP(__memcpy, __pi_memcpy);
|
||||
KVM_NVHE_ALIAS_HYP(__memset, __pi_memset);
|
||||
#endif
|
||||
|
||||
/* Kernel memory sections */
|
||||
KVM_NVHE_ALIAS(__start_rodata);
|
||||
KVM_NVHE_ALIAS(__end_rodata);
|
||||
KVM_NVHE_ALIAS(__bss_start);
|
||||
KVM_NVHE_ALIAS(__bss_stop);
|
||||
|
||||
/* Hyp memory sections */
|
||||
KVM_NVHE_ALIAS(__hyp_idmap_text_start);
|
||||
KVM_NVHE_ALIAS(__hyp_idmap_text_end);
|
||||
|
||||