Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "s390:

   - More phys_to_virt conversions

   - Improvement of AP management for VSIE (nested virtualization)

  ARM64:

   - Numerous fixes for the pathological lock inversion issue that
     plagued KVM/arm64 since... forever.

   - New framework allowing SMCCC-compliant hypercalls to be forwarded
     to userspace, hopefully paving the way for some more features being
     moved to VMMs rather than being implemented in the kernel.

   - Large rework of the timer code to allow a VM-wide offset to be
     applied to both virtual and physical counters as well as a
     per-timer, per-vcpu offset that complements the global one. This
     last part allows the NV timer code to be implemented on top.

   - A small set of fixes to make sure that we don't change anything
     affecting the EL1&0 translation regime just after having
     taken an exception to EL2 until we have executed a DSB. This
     ensures that speculative walks started in EL1&0 have completed.

   - The usual selftest fixes and improvements.

  x86:

   - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is
     enabled, and by giving the guest control of CR0.WP when EPT is
     enabled on VMX (VMX-only because SVM doesn't support per-bit
     controls)

   - Add CR0/CR4 helpers to query single bits, and clean up related code
     where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long"
     return as a bool

   - Move AMD_PSFD to cpufeatures.h and purge KVM's definition

   - Avoid unnecessary writes+flushes when the guest is only adding new
     PTEs

   - Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s
     optimizations when emulating invalidations

   - Clean up the range-based flushing APIs

   - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a
     single A/D bit using a LOCK AND instead of XCHG, and skip all of
     the "handle changed SPTE" overhead associated with writing the
     entire entry

   - Track the number of "tail" entries in a pte_list_desc to avoid
     having to walk (potentially) all descriptors during insertion and
     deletion, which gets quite expensive if the guest is spamming
     fork()

   - Disallow virtualizing legacy LBRs if architectural LBRs are
     available, as the two are mutually exclusive in hardware

   - Disallow writes to immutable feature MSRs (notably
     PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features

   - Overhaul the vmx_pmu_caps selftest to better validate
     PERF_CAPABILITIES

   - Apply PMU filters to emulated events and add test coverage to the
     pmu_event_filter selftest

   - AMD SVM:
       - Add support for virtual NMIs
       - Fixes for edge cases related to virtual interrupts

   - Intel AMX:
       - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if
         XTILE_DATA is not being reported due to userspace not opting in
         via prctl()
       - Fix a bug in emulation of ENCLS in compatibility mode
       - Allow emulation of NOP and PAUSE for L2
       - AMX selftests improvements
       - Misc cleanups

  MIPS:

   - Constify MIPS's internal callbacks (a leftover from the hardware
     enabling rework that landed in 6.3)

  Generic:

   - Drop unnecessary casts from "void *" throughout kvm_main.c

   - Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the
     struct size by 8 bytes on 64-bit kernels by utilizing a padding
     hole

  Documentation:

   - Fix goof introduced by the conversion to rST"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits)
  KVM: s390: pci: fix virtual-physical confusion on module unload/load
  KVM: s390: vsie: clarifications on setting the APCB
  KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
  KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
  KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
  KVM: selftests: Test the PMU event "Instructions retired"
  KVM: selftests: Copy full counter values from guest in PMU event filter test
  KVM: selftests: Use error codes to signal errors in PMU event filter test
  KVM: selftests: Print detailed info in PMU event filter asserts
  KVM: selftests: Add helpers for PMC asserts in PMU event filter test
  KVM: selftests: Add a common helper for the PMU event filter guest code
  KVM: selftests: Fix spelling mistake "perrmited" -> "permitted"
  KVM: arm64: vhe: Drop extra isb() on guest exit
  KVM: arm64: vhe: Synchronise with page table walker on MMU update
  KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
  KVM: arm64: nvhe: Synchronise with page table walker on TLBI
  KVM: arm64: Handle 32bit CNTPCTSS traps
  KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
  KVM: arm64: vgic: Don't acquire its_lock before config_lock
  KVM: selftests: Add test to verify KVM's supported XCR0
  ...
Merged by Linus Torvalds, 2023-05-01 12:06:20 -07:00
111 changed files with 4021 additions and 1867 deletions


@@ -5645,7 +5645,8 @@ with the KVM_XEN_VCPU_GET_ATTR ioctl.
};
Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
-``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned.
+``length`` must not be bigger than 2^31 - PAGE_SIZE bytes. The ``addr``
field must point to a buffer which the tags will be copied to or from.
``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
@@ -6029,6 +6030,44 @@ delivery must be provided via the "reg_aen" struct.
The "pad" and "reserved" fields may be used for future extensions and should be
set to 0s by userspace.
4.138 KVM_ARM_SET_COUNTER_OFFSET
--------------------------------
:Capability: KVM_CAP_COUNTER_OFFSET
:Architectures: arm64
:Type: vm ioctl
:Parameters: struct kvm_arm_counter_offset (in)
:Returns: 0 on success, < 0 on error
This capability indicates that userspace is able to apply a single VM-wide
offset to both the virtual and physical counters as viewed by the guest
using the KVM_ARM_SET_COUNTER_OFFSET ioctl and the following data structure:
::
struct kvm_arm_counter_offset {
__u64 counter_offset;
__u64 reserved;
};
The offset describes a number of counter cycles that are subtracted from
both virtual and physical counter views (similar to the effects of the
CNTVOFF_EL2 and CNTPOFF_EL2 system registers, but only global). The offset
always applies to all vcpus (already created or created after this ioctl)
for this VM.
It is userspace's responsibility to compute the offset based, for example,
on previous values of the guest counters.
Any value other than 0 for the "reserved" field may result in an error
(-EINVAL) being returned. This ioctl can also return -EBUSY if any vcpu
ioctl is issued concurrently.
Note that using this ioctl results in KVM ignoring subsequent userspace
writes to the CNTVCT_EL0 and CNTPCT_EL0 registers using the SET_ONE_REG
interface. No error will be returned, but the resulting offset will not be
applied.
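
As a rough illustration of the flow described above, a minimal userspace
sketch might look as follows (``vm_fd`` is assumed to be an open VM file
descriptor and ``host_cnt`` a previously sampled host counter value; the
helper name and error handling are illustrative only)::

  #include <stdint.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Make the guest's counters appear to start from zero. */
  static int set_counter_offset(int vm_fd, uint64_t host_cnt)
  {
          struct kvm_arm_counter_offset off;

          memset(&off, 0, sizeof(off));   /* "reserved" must be zero */
          off.counter_offset = host_cnt;  /* subtracted from both views */

          /* May fail with -EBUSY if a vcpu ioctl runs concurrently. */
          return ioctl(vm_fd, KVM_ARM_SET_COUNTER_OFFSET, &off);
  }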
5. The kvm_run structure
========================
@@ -6218,15 +6257,40 @@ to the byte array.
__u64 nr;
__u64 args[6];
__u64 ret;
-	__u32 longmode;
-	__u32 pad;
+	__u64 flags;
} hypercall;
-Unused. This was once used for 'hypercall to userspace'. To implement
-such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390).
+It is strongly recommended that userspace use ``KVM_EXIT_IO`` (x86) or
+``KVM_EXIT_MMIO`` (all except s390) to implement functionality that
+requires a guest to interact with host userspace.
.. note:: KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO.
For arm64:
----------
SMCCC exits can be enabled depending on the configuration of the SMCCC
filter. See the ``KVM_ARM_SMCCC_FILTER`` description in
Documentation/virt/kvm/devices/vm.rst for more details.
``nr`` contains the function ID of the guest's SMCCC call. Userspace is
expected to use the ``KVM_GET_ONE_REG`` ioctl to retrieve the call
parameters from the vCPU's GPRs.
Definition of ``flags``:
- ``KVM_HYPERCALL_EXIT_SMC``: Indicates that the guest used the SMC
conduit to initiate the SMCCC call. If this bit is 0 then the guest
used the HVC conduit for the SMCCC call.
- ``KVM_HYPERCALL_EXIT_16BIT``: Indicates that the guest used a 16bit
instruction to initiate the SMCCC call. If this bit is 0 then the
guest used a 32bit instruction. An AArch64 guest always has this
bit set to 0.
At the point of exit, PC points to the instruction immediately following
the trapping instruction.
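
As a hedged sketch of the exit handling described above (``vcpu_fd`` and
the mmap'd ``run`` structure are assumed to exist, and the emulation of
the call itself is elided)::

  #include <stddef.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* ID of core register x<n> for KVM_{GET,SET}_ONE_REG */
  #define ARM64_CORE_REG(n) \
          (KVM_REG_ARM64 | KVM_REG_SIZE_U64 | KVM_REG_ARM_CORE | \
           KVM_REG_ARM_CORE_REG(regs.regs[n]))

  static void handle_smccc_exit(int vcpu_fd, struct kvm_run *run)
  {
          uint64_t x1, ret0 = ~0UL;  /* default to an error return */
          struct kvm_one_reg reg = {
                  .id   = ARM64_CORE_REG(1),
                  .addr = (uint64_t)&x1,
          };
          int used_smc = !!(run->hypercall.flags & KVM_HYPERCALL_EXIT_SMC);

          /* Call parameters live in the vCPU's GPRs, not in args[]. */
          ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);

          /* ... emulate run->hypercall.nr using x1 etc., set ret0 ... */
          (void)used_smc;  /* the conduit may matter for policy */

          reg.id   = ARM64_CORE_REG(0);
          reg.addr = (uint64_t)&ret0;
          ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);  /* result goes in x0 */
  }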
::
/* KVM_EXIT_TPR_ACCESS */
@@ -7266,6 +7330,7 @@ and injected exceptions.
will clear DR6.RTM.
7.18 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
--------------------------------------
:Architectures: x86, arm64, mips
:Parameters: args[0] whether feature should be enabled or not


@@ -321,3 +321,82 @@ Allows userspace to query the status of migration mode.
if it is enabled
:Returns: -EFAULT if the given address is not accessible from kernel space;
0 in case of success.
6. GROUP: KVM_ARM_VM_SMCCC_CTRL
===============================
:Architectures: arm64
6.1. ATTRIBUTE: KVM_ARM_VM_SMCCC_FILTER (w/o)
---------------------------------------------
:Parameters: Pointer to a ``struct kvm_smccc_filter``
:Returns:
====== ===========================================
EEXIST Range intersects with a previously inserted
or reserved range
EBUSY A vCPU in the VM has already run
EINVAL Invalid filter configuration
ENOMEM Failed to allocate memory for the in-kernel
representation of the SMCCC filter
====== ===========================================
Requests the installation of an SMCCC call filter described as follows::
enum kvm_smccc_filter_action {
KVM_SMCCC_FILTER_HANDLE = 0,
KVM_SMCCC_FILTER_DENY,
KVM_SMCCC_FILTER_FWD_TO_USER,
};
struct kvm_smccc_filter {
__u32 base;
__u32 nr_functions;
__u8 action;
__u8 pad[15];
};
The filter is defined as a set of non-overlapping ranges. Each
range defines an action to be applied to SMCCC calls within the range.
Userspace can insert multiple ranges into the filter by using
successive calls to this attribute.
The default configuration of KVM is such that all implemented SMCCC
calls are allowed. Thus, the SMCCC filter can be defined sparsely
by userspace, only describing ranges that modify the default behavior.
The range expressed by ``struct kvm_smccc_filter`` is
[``base``, ``base + nr_functions``). The range is not allowed to wrap,
i.e. userspace cannot rely on ``base + nr_functions`` overflowing.
The SMCCC filter applies to both SMC and HVC calls initiated by the
guest. The SMCCC filter gates the in-kernel emulation of SMCCC calls
and as such takes effect before other interfaces that interact with
SMCCC calls (e.g. hypercall bitmap registers).
Actions:
- ``KVM_SMCCC_FILTER_HANDLE``: Allows the guest SMCCC call to be
handled in-kernel. It is strongly recommended that userspace *not*
explicitly describe the allowed SMCCC call ranges.
- ``KVM_SMCCC_FILTER_DENY``: Rejects the guest SMCCC call in-kernel
and returns to the guest.
- ``KVM_SMCCC_FILTER_FWD_TO_USER``: The guest SMCCC call is forwarded
to userspace with an exit reason of ``KVM_EXIT_HYPERCALL``.
The ``pad`` field is reserved for future use and must be zero. KVM may
return ``-EINVAL`` if the field is nonzero.
KVM reserves the 'Arm Architecture Calls' range of function IDs and
will reject attempts to define a filter for any portion of these ranges:
=========== ===============
Start End (inclusive)
=========== ===============
0x8000_0000 0x8000_FFFF
0xC000_0000 0xC000_FFFF
=========== ===============
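
A hedged userspace sketch of installing a single filter range through the
Device Control API (``vm_fd`` is assumed to be an open VM file descriptor;
the function ID range below is purely illustrative)::

  #include <stdint.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Forward an example range to userspace; must run before any vCPU. */
  static int install_smccc_filter(int vm_fd)
  {
          struct kvm_smccc_filter filter;
          struct kvm_device_attr attr = {
                  .group = KVM_ARM_VM_SMCCC_CTRL,
                  .attr  = KVM_ARM_VM_SMCCC_FILTER,
                  .addr  = (uint64_t)&filter,
          };

          memset(&filter, 0, sizeof(filter));  /* "pad" must be zero */
          filter.base = 0xc6000000;            /* illustrative base ID */
          filter.nr_functions = 0x10000;
          filter.action = KVM_SMCCC_FILTER_FWD_TO_USER;

          return ioctl(vm_fd, KVM_SET_DEVICE_ATTR, &attr);
  }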


@@ -21,7 +21,7 @@ The acquisition orders for mutexes are as follows:
- kvm->mn_active_invalidate_count ensures that pairs of
invalidate_range_start() and invalidate_range_end() callbacks
use the same memslots array. kvm->slots_lock and kvm->slots_arch_lock
-are taken on the waiting side in install_new_memslots, so MMU notifiers
+are taken on the waiting side when modifying memslots, so MMU notifiers
must not take either kvm->slots_lock or kvm->slots_arch_lock.
For SRCU:


@@ -16,6 +16,7 @@
#include <linux/types.h>
#include <linux/jump_label.h>
#include <linux/kvm_types.h>
#include <linux/maple_tree.h>
#include <linux/percpu.h>
#include <linux/psci.h>
#include <asm/arch_gicv3.h>
@@ -199,6 +200,9 @@ struct kvm_arch {
/* Mandated version of PSCI */
u32 psci_version;
/* Protects VM-scoped configuration data */
struct mutex config_lock;
/*
* If we encounter a data abort without valid instruction syndrome
* information, report this to user space. User space can (and
@@ -221,7 +225,12 @@ struct kvm_arch {
#define KVM_ARCH_FLAG_EL1_32BIT 4
/* PSCI SYSTEM_SUSPEND enabled for the guest */
#define KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED 5
/* VM counter offset */
#define KVM_ARCH_FLAG_VM_COUNTER_OFFSET 6
/* Timer PPIs made immutable */
#define KVM_ARCH_FLAG_TIMER_PPIS_IMMUTABLE 7
/* SMCCC filter initialized for the VM */
#define KVM_ARCH_FLAG_SMCCC_FILTER_CONFIGURED 8
unsigned long flags;
/*
@@ -242,6 +251,7 @@ struct kvm_arch {
/* Hypercall features firmware registers' descriptor */
struct kvm_smccc_features smccc_feat;
struct maple_tree smccc_filter;
/*
* For an untrusted host VM, 'pkvm.handle' is used to lookup
@@ -365,6 +375,10 @@ enum vcpu_sysreg {
TPIDR_EL2, /* EL2 Software Thread ID Register */
CNTHCTL_EL2, /* Counter-timer Hypervisor Control register */
SP_EL2, /* EL2 Stack Pointer */
CNTHP_CTL_EL2,
CNTHP_CVAL_EL2,
CNTHV_CTL_EL2,
CNTHV_CVAL_EL2,
NR_SYS_REGS /* Nothing after this line! */
};
@@ -522,6 +536,7 @@ struct kvm_vcpu_arch {
/* vcpu power state */
struct kvm_mp_state mp_state;
spinlock_t mp_state_lock;
/* Cache some mmu pages needed inside spinlock regions */
struct kvm_mmu_memory_cache mmu_page_cache;
@@ -939,6 +954,9 @@ void kvm_reset_sys_regs(struct kvm_vcpu *vcpu);
int __init kvm_sys_reg_table_init(void);
bool lock_all_vcpus(struct kvm *kvm);
void unlock_all_vcpus(struct kvm *kvm);
/* MMIO helpers */
void kvm_mmio_write_buf(void *buf, unsigned int len, unsigned long data);
unsigned long kvm_mmio_read_buf(const void *buf, unsigned int len);
@@ -1022,8 +1040,10 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
struct kvm_device_attr *attr);
-long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
-                                struct kvm_arm_copy_mte_tags *copy_tags);
+int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+                               struct kvm_arm_copy_mte_tags *copy_tags);
int kvm_vm_ioctl_set_counter_offset(struct kvm *kvm,
struct kvm_arm_counter_offset *offset);
/* Guest/host FPSIMD coordination helpers */
int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
@@ -1078,6 +1098,9 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
(system_supports_32bit_el0() && \
!static_branch_unlikely(&arm64_mismatched_32bit_el0))
#define kvm_vm_has_ran_once(kvm) \
(test_bit(KVM_ARCH_FLAG_HAS_RAN_ONCE, &(kvm)->arch.flags))
int kvm_trng_call(struct kvm_vcpu *vcpu);
#ifdef CONFIG_KVM
extern phys_addr_t hyp_mem_base;


@@ -63,6 +63,7 @@
* specific registers encoded in the instructions).
*/
.macro kern_hyp_va reg
#ifndef __KVM_VHE_HYPERVISOR__
alternative_cb ARM64_ALWAYS_SYSTEM, kvm_update_va_mask
and \reg, \reg, #1 /* mask with va_mask */
ror \reg, \reg, #1 /* rotate to the first tag bit */
@@ -70,6 +71,7 @@ alternative_cb ARM64_ALWAYS_SYSTEM, kvm_update_va_mask
add \reg, \reg, #0, lsl 12 /* insert the top 12 bits of the tag */
ror \reg, \reg, #63 /* rotate back */
alternative_cb_end
#endif
.endm
/*
@@ -127,6 +129,7 @@ void kvm_apply_hyp_relocations(void);
static __always_inline unsigned long __kern_hyp_va(unsigned long v)
{
#ifndef __KVM_VHE_HYPERVISOR__
asm volatile(ALTERNATIVE_CB("and %0, %0, #1\n"
"ror %0, %0, #1\n"
"add %0, %0, #0\n"
@@ -135,6 +138,7 @@ static __always_inline unsigned long __kern_hyp_va(unsigned long v)
ARM64_ALWAYS_SYSTEM,
kvm_update_va_mask)
: "+r" (v));
#endif
return v;
}


@@ -388,6 +388,7 @@
#define SYS_CNTFRQ_EL0 sys_reg(3, 3, 14, 0, 0)
#define SYS_CNTPCT_EL0 sys_reg(3, 3, 14, 0, 1)
#define SYS_CNTPCTSS_EL0 sys_reg(3, 3, 14, 0, 5)
#define SYS_CNTVCTSS_EL0 sys_reg(3, 3, 14, 0, 6)
@@ -400,7 +401,9 @@
#define SYS_AARCH32_CNTP_TVAL sys_reg(0, 0, 14, 2, 0)
#define SYS_AARCH32_CNTP_CTL sys_reg(0, 0, 14, 2, 1)
#define SYS_AARCH32_CNTPCT sys_reg(0, 0, 0, 14, 0)
#define SYS_AARCH32_CNTP_CVAL sys_reg(0, 2, 0, 14, 0)
#define SYS_AARCH32_CNTPCTSS sys_reg(0, 8, 0, 14, 0)
#define __PMEV_op2(n) ((n) & 0x7)
#define __CNTR_CRm(n) (0x8 | (((n) >> 3) & 0x3))


@@ -198,6 +198,15 @@ struct kvm_arm_copy_mte_tags {
__u64 reserved[2];
};
/*
* Counter/Timer offset structure. Describe the virtual/physical offset.
* To be used with KVM_ARM_SET_COUNTER_OFFSET.
*/
struct kvm_arm_counter_offset {
__u64 counter_offset;
__u64 reserved;
};
#define KVM_ARM_TAGS_TO_GUEST 0
#define KVM_ARM_TAGS_FROM_GUEST 1
@@ -372,6 +381,10 @@ enum {
#endif
};
/* Device Control API on vm fd */
#define KVM_ARM_VM_SMCCC_CTRL 0
#define KVM_ARM_VM_SMCCC_FILTER 0
/* Device Control API: ARM VGIC */
#define KVM_DEV_ARM_VGIC_GRP_ADDR 0
#define KVM_DEV_ARM_VGIC_GRP_DIST_REGS 1
@@ -411,6 +424,8 @@ enum {
#define KVM_ARM_VCPU_TIMER_CTRL 1
#define KVM_ARM_VCPU_TIMER_IRQ_VTIMER 0
#define KVM_ARM_VCPU_TIMER_IRQ_PTIMER 1
#define KVM_ARM_VCPU_TIMER_IRQ_HVTIMER 2
#define KVM_ARM_VCPU_TIMER_IRQ_HPTIMER 3
#define KVM_ARM_VCPU_PVTIME_CTRL 2
#define KVM_ARM_VCPU_PVTIME_IPA 0
@@ -469,6 +484,27 @@ enum {
/* run->fail_entry.hardware_entry_failure_reason codes. */
#define KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED (1ULL << 0)
enum kvm_smccc_filter_action {
KVM_SMCCC_FILTER_HANDLE = 0,
KVM_SMCCC_FILTER_DENY,
KVM_SMCCC_FILTER_FWD_TO_USER,
#ifdef __KERNEL__
NR_SMCCC_FILTER_ACTIONS
#endif
};
struct kvm_smccc_filter {
__u32 base;
__u32 nr_functions;
__u8 action;
__u8 pad[15];
};
/* arm64-specific KVM_EXIT_HYPERCALL flags */
#define KVM_HYPERCALL_EXIT_SMC (1U << 0)
#define KVM_HYPERCALL_EXIT_16BIT (1U << 1)
#endif
#endif /* __ARM_KVM_H__ */


@@ -2230,6 +2230,17 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
.matches = has_cpuid_feature,
ARM64_CPUID_FIELDS(ID_AA64MMFR0_EL1, ECV, IMP)
},
{
.desc = "Enhanced Counter Virtualization (CNTPOFF)",
.capability = ARM64_HAS_ECV_CNTPOFF,
.type = ARM64_CPUCAP_SYSTEM_FEATURE,
.matches = has_cpuid_feature,
.sys_reg = SYS_ID_AA64MMFR0_EL1,
.field_pos = ID_AA64MMFR0_EL1_ECV_SHIFT,
.field_width = 4,
.sign = FTR_UNSIGNED,
.min_field_value = ID_AA64MMFR0_EL1_ECV_CNTPOFF,
},
#ifdef CONFIG_ARM64_PAN
{
.desc = "Privileged Access Never",

File diff suppressed because it is too large.


@@ -126,6 +126,16 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
{
int ret;
mutex_init(&kvm->arch.config_lock);
#ifdef CONFIG_LOCKDEP
/* Clue in lockdep that the config_lock must be taken inside kvm->lock */
mutex_lock(&kvm->lock);
mutex_lock(&kvm->arch.config_lock);
mutex_unlock(&kvm->arch.config_lock);
mutex_unlock(&kvm->lock);
#endif
ret = kvm_share_hyp(kvm, kvm + 1);
if (ret)
return ret;
@@ -146,6 +156,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
kvm_vgic_early_init(kvm);
kvm_timer_init_vm(kvm);
/* The maximum number of VCPUs is limited by the host's GIC model */
kvm->max_vcpus = kvm_arm_default_max_vcpus();
@@ -190,6 +202,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
kvm_destroy_vcpus(kvm);
kvm_unshare_hyp(kvm, kvm + 1);
kvm_arm_teardown_hypercalls(kvm);
}
int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
@@ -219,6 +233,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_PTP_KVM:
case KVM_CAP_ARM_SYSTEM_SUSPEND:
case KVM_CAP_IRQFD_RESAMPLE:
case KVM_CAP_COUNTER_OFFSET:
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
@@ -325,6 +340,16 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
{
int err;
spin_lock_init(&vcpu->arch.mp_state_lock);
#ifdef CONFIG_LOCKDEP
/* Inform lockdep that the config_lock is acquired after vcpu->mutex */
mutex_lock(&vcpu->mutex);
mutex_lock(&vcpu->kvm->arch.config_lock);
mutex_unlock(&vcpu->kvm->arch.config_lock);
mutex_unlock(&vcpu->mutex);
#endif
/* Force users to call KVM_ARM_VCPU_INIT */
vcpu->arch.target = -1;
bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
@@ -442,34 +467,41 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
vcpu->cpu = -1;
}
-void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu)
+static void __kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu)
{
-	vcpu->arch.mp_state.mp_state = KVM_MP_STATE_STOPPED;
+	WRITE_ONCE(vcpu->arch.mp_state.mp_state, KVM_MP_STATE_STOPPED);
kvm_make_request(KVM_REQ_SLEEP, vcpu);
kvm_vcpu_kick(vcpu);
}
void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu)
{
spin_lock(&vcpu->arch.mp_state_lock);
__kvm_arm_vcpu_power_off(vcpu);
spin_unlock(&vcpu->arch.mp_state_lock);
}
bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu)
{
-	return vcpu->arch.mp_state.mp_state == KVM_MP_STATE_STOPPED;
+	return READ_ONCE(vcpu->arch.mp_state.mp_state) == KVM_MP_STATE_STOPPED;
}
static void kvm_arm_vcpu_suspend(struct kvm_vcpu *vcpu)
{
-	vcpu->arch.mp_state.mp_state = KVM_MP_STATE_SUSPENDED;
+	WRITE_ONCE(vcpu->arch.mp_state.mp_state, KVM_MP_STATE_SUSPENDED);
kvm_make_request(KVM_REQ_SUSPEND, vcpu);
kvm_vcpu_kick(vcpu);
}
static bool kvm_arm_vcpu_suspended(struct kvm_vcpu *vcpu)
{
-	return vcpu->arch.mp_state.mp_state == KVM_MP_STATE_SUSPENDED;
+	return READ_ONCE(vcpu->arch.mp_state.mp_state) == KVM_MP_STATE_SUSPENDED;
}
int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
struct kvm_mp_state *mp_state)
{
-	*mp_state = vcpu->arch.mp_state;
+	*mp_state = READ_ONCE(vcpu->arch.mp_state);
return 0;
}
@@ -479,12 +511,14 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
{
int ret = 0;
spin_lock(&vcpu->arch.mp_state_lock);
switch (mp_state->mp_state) {
case KVM_MP_STATE_RUNNABLE:
-		vcpu->arch.mp_state = *mp_state;
+		WRITE_ONCE(vcpu->arch.mp_state, *mp_state);
break;
case KVM_MP_STATE_STOPPED:
-		kvm_arm_vcpu_power_off(vcpu);
+		__kvm_arm_vcpu_power_off(vcpu);
break;
case KVM_MP_STATE_SUSPENDED:
kvm_arm_vcpu_suspend(vcpu);
@@ -493,6 +527,8 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
ret = -EINVAL;
}
spin_unlock(&vcpu->arch.mp_state_lock);
return ret;
}
@@ -592,9 +628,9 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
if (kvm_vm_is_protected(kvm))
kvm_call_hyp_nvhe(__pkvm_vcpu_init_traps, vcpu);
-	mutex_lock(&kvm->lock);
+	mutex_lock(&kvm->arch.config_lock);
	set_bit(KVM_ARCH_FLAG_HAS_RAN_ONCE, &kvm->arch.flags);
-	mutex_unlock(&kvm->lock);
+	mutex_unlock(&kvm->arch.config_lock);
return ret;
}
@@ -1209,10 +1245,14 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
/*
* Handle the "start in power-off" case.
*/
+	spin_lock(&vcpu->arch.mp_state_lock);
+
	if (test_bit(KVM_ARM_VCPU_POWER_OFF, vcpu->arch.features))
-		kvm_arm_vcpu_power_off(vcpu);
+		__kvm_arm_vcpu_power_off(vcpu);
	else
-		vcpu->arch.mp_state.mp_state = KVM_MP_STATE_RUNNABLE;
+		WRITE_ONCE(vcpu->arch.mp_state.mp_state, KVM_MP_STATE_RUNNABLE);
+
+	spin_unlock(&vcpu->arch.mp_state_lock);
return 0;
}
@@ -1438,11 +1478,31 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
}
}
-long kvm_arch_vm_ioctl(struct file *filp,
-		       unsigned int ioctl, unsigned long arg)
+static int kvm_vm_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
{
switch (attr->group) {
case KVM_ARM_VM_SMCCC_CTRL:
return kvm_vm_smccc_has_attr(kvm, attr);
default:
return -ENXIO;
}
}
static int kvm_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
{
switch (attr->group) {
case KVM_ARM_VM_SMCCC_CTRL:
return kvm_vm_smccc_set_attr(kvm, attr);
default:
return -ENXIO;
}
}
int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
{
struct kvm *kvm = filp->private_data;
void __user *argp = (void __user *)arg;
struct kvm_device_attr attr;
switch (ioctl) {
case KVM_CREATE_IRQCHIP: {
@@ -1478,11 +1538,73 @@ long kvm_arch_vm_ioctl(struct file *filp,
return -EFAULT;
return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
}
case KVM_ARM_SET_COUNTER_OFFSET: {
struct kvm_arm_counter_offset offset;
if (copy_from_user(&offset, argp, sizeof(offset)))
return -EFAULT;
return kvm_vm_ioctl_set_counter_offset(kvm, &offset);
}
case KVM_HAS_DEVICE_ATTR: {
if (copy_from_user(&attr, argp, sizeof(attr)))
return -EFAULT;
return kvm_vm_has_attr(kvm, &attr);
}
case KVM_SET_DEVICE_ATTR: {
if (copy_from_user(&attr, argp, sizeof(attr)))
return -EFAULT;
return kvm_vm_set_attr(kvm, &attr);
}
default:
return -EINVAL;
}
}
/* unlocks vcpus from @vcpu_lock_idx and smaller */
static void unlock_vcpus(struct kvm *kvm, int vcpu_lock_idx)
{
struct kvm_vcpu *tmp_vcpu;
for (; vcpu_lock_idx >= 0; vcpu_lock_idx--) {
tmp_vcpu = kvm_get_vcpu(kvm, vcpu_lock_idx);
mutex_unlock(&tmp_vcpu->mutex);
}
}
void unlock_all_vcpus(struct kvm *kvm)
{
lockdep_assert_held(&kvm->lock);
unlock_vcpus(kvm, atomic_read(&kvm->online_vcpus) - 1);
}
/* Returns true if all vcpus were locked, false otherwise */
bool lock_all_vcpus(struct kvm *kvm)
{
struct kvm_vcpu *tmp_vcpu;
unsigned long c;
lockdep_assert_held(&kvm->lock);
/*
* Any time a vcpu is in an ioctl (including running), the
* core KVM code tries to grab the vcpu->mutex.
*
* By grabbing the vcpu->mutex of all VCPUs we ensure that no
* other VCPUs can fiddle with the state while we access it.
*/
kvm_for_each_vcpu(c, tmp_vcpu, kvm) {
if (!mutex_trylock(&tmp_vcpu->mutex)) {
unlock_vcpus(kvm, c - 1);
return false;
}
}
return true;
}
static unsigned long nvhe_percpu_size(void)
{
return (unsigned long)CHOOSE_NVHE_SYM(__per_cpu_end) -


@@ -590,11 +590,16 @@ static unsigned long num_core_regs(const struct kvm_vcpu *vcpu)
return copy_core_reg_indices(vcpu, NULL);
}
/**
* ARM64 versions of the TIMER registers, always available on arm64
*/
static const u64 timer_reg_list[] = {
KVM_REG_ARM_TIMER_CTL,
KVM_REG_ARM_TIMER_CNT,
KVM_REG_ARM_TIMER_CVAL,
KVM_REG_ARM_PTIMER_CTL,
KVM_REG_ARM_PTIMER_CNT,
KVM_REG_ARM_PTIMER_CVAL,
};
-#define NUM_TIMER_REGS 3
+#define NUM_TIMER_REGS ARRAY_SIZE(timer_reg_list)
static bool is_timer_reg(u64 index)
{
@@ -602,6 +607,9 @@ static bool is_timer_reg(u64 index)
case KVM_REG_ARM_TIMER_CTL:
case KVM_REG_ARM_TIMER_CNT:
case KVM_REG_ARM_TIMER_CVAL:
case KVM_REG_ARM_PTIMER_CTL:
case KVM_REG_ARM_PTIMER_CNT:
case KVM_REG_ARM_PTIMER_CVAL:
return true;
}
return false;
@@ -609,14 +617,11 @@ static bool is_timer_reg(u64 index)
static int copy_timer_indices(struct kvm_vcpu *vcpu, u64 __user *uindices)
{
-	if (put_user(KVM_REG_ARM_TIMER_CTL, uindices))
-		return -EFAULT;
-	uindices++;
-	if (put_user(KVM_REG_ARM_TIMER_CNT, uindices))
-		return -EFAULT;
-	uindices++;
-	if (put_user(KVM_REG_ARM_TIMER_CVAL, uindices))
-		return -EFAULT;
+	for (int i = 0; i < NUM_TIMER_REGS; i++) {
+		if (put_user(timer_reg_list[i], uindices))
+			return -EFAULT;
+		uindices++;
+	}
return 0;
}
@@ -957,7 +962,9 @@ int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
switch (attr->group) {
case KVM_ARM_VCPU_PMU_V3_CTRL:
mutex_lock(&vcpu->kvm->arch.config_lock);
ret = kvm_arm_pmu_v3_set_attr(vcpu, attr);
mutex_unlock(&vcpu->kvm->arch.config_lock);
break;
case KVM_ARM_VCPU_TIMER_CTRL:
ret = kvm_arm_timer_set_attr(vcpu, attr);
@@ -1019,8 +1026,8 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
return ret;
}
-long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
-                                struct kvm_arm_copy_mte_tags *copy_tags)
+int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+                               struct kvm_arm_copy_mte_tags *copy_tags)
{
gpa_t guest_ipa = copy_tags->guest_ipa;
size_t length = copy_tags->length;
@@ -1041,6 +1048,10 @@ long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
return -EINVAL;
/* Lengths above INT_MAX cannot be represented in the return value */
if (length > INT_MAX)
return -EINVAL;
gfn = gpa_to_gfn(guest_ipa);
mutex_lock(&kvm->slots_lock);


@@ -36,8 +36,6 @@ static void kvm_handle_guest_serror(struct kvm_vcpu *vcpu, u64 esr)
static int handle_hvc(struct kvm_vcpu *vcpu)
{
-	int ret;
trace_kvm_hvc_arm64(*vcpu_pc(vcpu), vcpu_get_reg(vcpu, 0),
kvm_vcpu_hvc_get_imm(vcpu));
vcpu->stat.hvc_exit_stat++;
@@ -52,33 +50,29 @@ static int handle_hvc(struct kvm_vcpu *vcpu)
return 1;
}
-	ret = kvm_hvc_call_handler(vcpu);
-	if (ret < 0) {
-		vcpu_set_reg(vcpu, 0, ~0UL);
-		return 1;
-	}
-
-	return ret;
+	return kvm_smccc_call_handler(vcpu);
}
static int handle_smc(struct kvm_vcpu *vcpu)
{
-	int ret;
/*
* "If an SMC instruction executed at Non-secure EL1 is
* trapped to EL2 because HCR_EL2.TSC is 1, the exception is a
* Trap exception, not a Secure Monitor Call exception [...]"
*
* We need to advance the PC after the trap, as it would
-	 * otherwise return to the same address...
-	 *
-	 * Only handle SMCs from the virtual EL2 with an immediate of zero and
-	 * skip it otherwise.
+	 * otherwise return to the same address. Furthermore, pre-incrementing
+	 * the PC before potentially exiting to userspace maintains the same
+	 * abstraction for both SMCs and HVCs.
	 */
-	if (!vcpu_is_el2(vcpu) || kvm_vcpu_hvc_get_imm(vcpu)) {
+	kvm_incr_pc(vcpu);
+
+	/*
+	 * SMCs with a nonzero immediate are reserved according to DEN0028E 2.9
+	 * "SMC and HVC immediate value".
+	 */
+	if (kvm_vcpu_hvc_get_imm(vcpu)) {
		vcpu_set_reg(vcpu, 0, ~0UL);
-		kvm_incr_pc(vcpu);
return 1;
}
@@ -89,13 +83,7 @@ static int handle_smc(struct kvm_vcpu *vcpu)
* at Non-secure EL1 is trapped to EL2 if HCR_EL2.TSC==1, rather than
* being treated as UNDEFINED.
*/
-	ret = kvm_hvc_call_handler(vcpu);
-	if (ret < 0)
-		vcpu_set_reg(vcpu, 0, ~0UL);
-
-	kvm_incr_pc(vcpu);
-
-	return ret;
+	return kvm_smccc_call_handler(vcpu);
}
/*


@@ -26,6 +26,7 @@
#include <asm/kvm_emulate.h>
#include <asm/kvm_hyp.h>
#include <asm/kvm_mmu.h>
#include <asm/kvm_nested.h>
#include <asm/fpsimd.h>
#include <asm/debug-monitors.h>
#include <asm/processor.h>
@@ -326,6 +327,55 @@ static bool kvm_hyp_handle_ptrauth(struct kvm_vcpu *vcpu, u64 *exit_code)
return true;
}
static bool kvm_hyp_handle_cntpct(struct kvm_vcpu *vcpu)
{
struct arch_timer_context *ctxt;
u32 sysreg;
u64 val;
/*
* We only get here for 64bit guests, 32bit guests will hit
* the long and winding road all the way to the standard
* handling. Yes, it sucks to be irrelevant.
*/
sysreg = esr_sys64_to_sysreg(kvm_vcpu_get_esr(vcpu));
switch (sysreg) {
case SYS_CNTPCT_EL0:
case SYS_CNTPCTSS_EL0:
if (vcpu_has_nv(vcpu)) {
if (is_hyp_ctxt(vcpu)) {
ctxt = vcpu_hptimer(vcpu);
break;
}
/* Check for guest hypervisor trapping */
val = __vcpu_sys_reg(vcpu, CNTHCTL_EL2);
if (!vcpu_el2_e2h_is_set(vcpu))
val = (val & CNTHCTL_EL1PCTEN) << 10;
if (!(val & (CNTHCTL_EL1PCTEN << 10)))
return false;
}
ctxt = vcpu_ptimer(vcpu);
break;
default:
return false;
}
val = arch_timer_read_cntpct_el0();
if (ctxt->offset.vm_offset)
val -= *kern_hyp_va(ctxt->offset.vm_offset);
if (ctxt->offset.vcpu_offset)
val -= *kern_hyp_va(ctxt->offset.vcpu_offset);
vcpu_set_reg(vcpu, kvm_vcpu_sys_get_rt(vcpu), val);
__kvm_skip_instr(vcpu);
return true;
}
static bool kvm_hyp_handle_sysreg(struct kvm_vcpu *vcpu, u64 *exit_code)
{
if (cpus_have_final_cap(ARM64_WORKAROUND_CAVIUM_TX2_219_TVM) &&
@@ -339,6 +389,9 @@ static bool kvm_hyp_handle_sysreg(struct kvm_vcpu *vcpu, u64 *exit_code)
if (esr_is_ptrauth_trap(kvm_vcpu_get_esr(vcpu)))
return kvm_hyp_handle_ptrauth(vcpu, exit_code);
if (kvm_hyp_handle_cntpct(vcpu))
return true;
return false;
}


@@ -37,7 +37,6 @@ static void __debug_save_spe(u64 *pmscr_el1)
/* Now drain all buffered data to memory */
psb_csync();
-	dsb(nsh);
}
static void __debug_restore_spe(u64 pmscr_el1)
@@ -69,7 +68,6 @@ static void __debug_save_trace(u64 *trfcr_el1)
isb();
/* Drain the trace buffer to memory */
tsb_csync();
-	dsb(nsh);
}
static void __debug_restore_trace(u64 trfcr_el1)


@@ -297,6 +297,13 @@ int __pkvm_prot_finalize(void)
params->vttbr = kvm_get_vttbr(mmu);
params->vtcr = host_mmu.arch.vtcr;
params->hcr_el2 |= HCR_VM;
/*
* The CMO below not only cleans the updated params to the
* PoC, but also provides the DSB that ensures ongoing
* page-table walks that have started before we trapped to EL2
* have completed.
*/
kvm_flush_dcache_to_poc(params, sizeof(*params));
write_sysreg(params->hcr_el2, hcr_el2);


@@ -272,6 +272,17 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
*/
__debug_save_host_buffers_nvhe(vcpu);
/*
* We're about to restore some new MMU state. Make sure
* ongoing page-table walks that have started before we
* trapped to EL2 have completed. This also synchronises the
* above disabling of SPE and TRBE.
*
* See DDI0487I.a D8.1.5 "Out-of-context translation regimes",
* rule R_LFHQG and subsequent information statements.
*/
dsb(nsh);
__kvm_adjust_pc(vcpu);
/*
@@ -306,6 +317,13 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
__timer_disable_traps(vcpu);
__hyp_vgic_save_state(vcpu);
/*
* Same thing as before the guest run: we're about to switch
* the MMU context, so let's make sure we don't have any
* ongoing EL1&0 translations.
*/
dsb(nsh);
__deactivate_traps(vcpu);
__load_host_stage2();


@@ -9,6 +9,7 @@
#include <linux/kvm_host.h>
#include <asm/kvm_hyp.h>
#include <asm/kvm_mmu.h>
void __kvm_timer_set_cntvoff(u64 cntvoff)
{
@@ -35,14 +36,19 @@ void __timer_disable_traps(struct kvm_vcpu *vcpu)
*/
void __timer_enable_traps(struct kvm_vcpu *vcpu)
{
-	u64 val;
+	u64 clr = 0, set = 0;
	/*
	 * Disallow physical timer access for the guest
-	 * Physical counter access is allowed
+	 * Physical counter access is allowed if no offset is enforced
+	 * or running protected (we don't offset anything in this case).
	 */
-	val = read_sysreg(cnthctl_el2);
-	val &= ~CNTHCTL_EL1PCEN;
-	val |= CNTHCTL_EL1PCTEN;
-	write_sysreg(val, cnthctl_el2);
+	clr = CNTHCTL_EL1PCEN;
+	if (is_protected_kvm_enabled() ||
+	    !kern_hyp_va(vcpu->kvm)->arch.timer_data.poffset)
+		set |= CNTHCTL_EL1PCTEN;
+	else
+		clr |= CNTHCTL_EL1PCTEN;
+
+	sysreg_clear_set(cnthctl_el2, clr, set);
}


@@ -15,8 +15,31 @@ struct tlb_inv_context {
};
static void __tlb_switch_to_guest(struct kvm_s2_mmu *mmu,
-				  struct tlb_inv_context *cxt)
+				  struct tlb_inv_context *cxt,
+				  bool nsh)
{
/*
* We have two requirements:
*
* - ensure that the page table updates are visible to all
* CPUs, for which a dsb(DOMAIN-st) is what we need, DOMAIN
* being either ish or nsh, depending on the invalidation
* type.
*
* - complete any speculative page table walk started before
* we trapped to EL2 so that we can mess with the MM
* registers out of context, for which dsb(nsh) is enough
*
* The composition of these two barriers is a dsb(DOMAIN), and
* the 'nsh' parameter tracks the distinction between
* Inner-Shareable and Non-Shareable, as specified by the
* callers.
*/
if (nsh)
dsb(nsh);
else
dsb(ish);
if (cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT)) {
u64 val;
@@ -60,10 +83,8 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
{
struct tlb_inv_context cxt;
-	dsb(ishst);
	/* Switch to requested VMID */
-	__tlb_switch_to_guest(mmu, &cxt);
+	__tlb_switch_to_guest(mmu, &cxt, false);
/*
* We could do so much better if we had the VA as well.
@@ -113,10 +134,8 @@ void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
{
struct tlb_inv_context cxt;
-	dsb(ishst);
	/* Switch to requested VMID */
-	__tlb_switch_to_guest(mmu, &cxt);
+	__tlb_switch_to_guest(mmu, &cxt, false);
__tlbi(vmalls12e1is);
dsb(ish);
@@ -130,7 +149,7 @@ void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu)
struct tlb_inv_context cxt;
/* Switch to requested VMID */
-	__tlb_switch_to_guest(mmu, &cxt);
+	__tlb_switch_to_guest(mmu, &cxt, false);
__tlbi(vmalle1);
asm volatile("ic iallu");
@@ -142,7 +161,8 @@ void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu)
void __kvm_flush_vm_context(void)
{
-	dsb(ishst);
+	/* Same remark as in __tlb_switch_to_guest() */
+	dsb(ish);
__tlbi(alle1is);
/*


@@ -227,11 +227,10 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
/*
* When we exit from the guest we change a number of CPU configuration
-	 * parameters, such as traps. Make sure these changes take effect
-	 * before running the host or additional guests.
+	 * parameters, such as traps. We rely on the isb() in kvm_call_hyp*()
+	 * to make sure these changes take effect before running the host or
+	 * additional guests.
	 */
-	isb();
return ret;
}


@@ -13,6 +13,7 @@
#include <asm/kvm_asm.h>
#include <asm/kvm_emulate.h>
#include <asm/kvm_hyp.h>
#include <asm/kvm_nested.h>
/*
* VHE: Host and guest must save mdscr_el1 and sp_el0 (and the PC and
@@ -69,6 +70,17 @@ void kvm_vcpu_load_sysregs_vhe(struct kvm_vcpu *vcpu)
host_ctxt = &this_cpu_ptr(&kvm_host_data)->host_ctxt;
__sysreg_save_user_state(host_ctxt);
/*
* When running a normal EL1 guest, we only load a new vcpu
* after a context switch, which involves a DSB, so all
* speculative EL1&0 walks will have already completed.
* If running NV, the vcpu may transition between vEL1 and
* vEL2 without a context switch, so make sure we complete
* those walks before loading a new context.
*/
if (vcpu_has_nv(vcpu))
dsb(nsh);
/*
* Load guest EL1 and user state
*

Some files were not shown because too many files have changed in this diff.