Merge tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Radim Krčmář:
 "ARM:

   - icache invalidation optimizations, improving VM startup time

   - support for forwarded level-triggered interrupts, improving
     performance for timers and passthrough platform devices

   - a small fix for power-management notifiers, and some cosmetic
     changes

  PPC:

   - add MMIO emulation for vector loads and stores

   - allow HPT guests to run on a radix host on POWER9 v2.2 CPUs
     without requiring the complex thread synchronization of older
     CPU versions

   - improve the handling of escalation interrupts with the XIVE
     interrupt controller

   - support decrement register migration

   - various cleanups and bugfixes.

  s390:

   - Cornelia Huck passed maintainership to Janosch Frank

   - exitless interrupts for emulated devices

   - cleanup of cpuflag handling

   - kvm_stat counter improvements

   - VSIE improvements

   - mm cleanup

  x86:

   - hypervisor part of SEV

   - UMIP, RDPID, and MSR_SMI_COUNT emulation

   - paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit

   - allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more
     AVX512 features

   - show vcpu id in its anonymous inode name

   - many fixes and cleanups

   - per-VCPU MSR bitmaps (already merged through x86/pti branch)

   - stable KVM clock when nesting on Hyper-V (merged through
     x86/hyperv)"

* tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (197 commits)
  KVM: PPC: Book3S: Add MMIO emulation for VMX instructions
  KVM: PPC: Book3S HV: Branch inside feature section
  KVM: PPC: Book3S HV: Make HPT resizing work on POWER9
  KVM: PPC: Book3S HV: Fix handling of secondary HPTEG in HPT resizing code
  KVM: PPC: Book3S PR: Fix broken select due to misspelling
  KVM: x86: don't forget vcpu_put() in kvm_arch_vcpu_ioctl_set_sregs()
  KVM: PPC: Book3S PR: Fix svcpu copying with preemption enabled
  KVM: PPC: Book3S HV: Drop locks before reading guest memory
  kvm: x86: remove efer_reload entry in kvm_vcpu_stat
  KVM: x86: AMD Processor Topology Information
  x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested
  kvm: embed vcpu id to dentry of vcpu anon inode
  kvm: Map PFN-type memory regions as writable (if possible)
  x86/kvm: Make it compile on 32bit and with HYPYERVISOR_GUEST=n
  KVM: arm/arm64: Fixup userspace irqchip static key optimization
  KVM: arm/arm64: Fix userspace_irqchip_in_use counting
  KVM: arm/arm64: Fix incorrect timer_is_pending logic
  MAINTAINERS: update KVM/s390 maintainers
  MAINTAINERS: add Halil as additional vfio-ccw maintainer
  MAINTAINERS: add David as a reviewer for KVM/s390
  ...
2026-05-01 15:00:59 -07:00 · 2018-02-10 13:16:35 -08:00
parent 9a61df9e5f 1ab03c072f
commit 15303ba5d1
123 changed files with 6610 additions and 1449 deletions
@@ -26,3 +26,6 @@ s390-diag.txt
 	- Diagnose hypercall description (for IBM S/390)
 timekeeping.txt
 	- timekeeping virtualization for x86-based architectures.
 amd-memory-encryption.txt
 	- notes on AMD Secure Encrypted Virtualization feature and SEV firmware
 	  command description
@@ -0,0 +1,247 @@
 ======================================
 Secure Encrypted Virtualization (SEV)
 ======================================
 Overview
 ========
 Secure Encrypted Virtualization (SEV) is a feature found on AMD processors.
 SEV is an extension to the AMD-V architecture which supports running
 virtual machines (VMs) under the control of a hypervisor. When enabled,
 the memory contents of a VM will be transparently encrypted with a key
 unique to that VM.
 The hypervisor can determine the SEV support through the CPUID
 instruction. The CPUID function 0x8000001f reports information related
 to SEV::
 	0x8000001f[eax]:
 			Bit[1] 	indicates support for SEV
 	    ...
 		  [ecx]:
 			Bits[31:0]  Number of encrypted guests supported simultaneously
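As a minimal sketch of the check described above, the helper below decodes register values that have already been returned by CPUID function 0x8000001f. The names `sev_cpuid_info` and `sev_decode_cpuid` are hypothetical and exist only for illustration; executing the CPUID instruction itself (for example through a compiler intrinsic) is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two fields listed above. */
struct sev_cpuid_info {
	bool     sev_supported;	/* CPUID 0x8000001f EAX bit 1 */
	uint32_t max_guests;	/* CPUID 0x8000001f ECX bits 31:0 */
};

/* Decode the EAX and ECX values returned by CPUID function 0x8000001f. */
static struct sev_cpuid_info sev_decode_cpuid(uint32_t eax, uint32_t ecx)
{
	struct sev_cpuid_info info;

	info.sev_supported = (eax >> 1) & 1u;
	/* Assumption of this sketch: report zero guests when SEV is absent. */
	info.max_guests = info.sev_supported ? ecx : 0;
	return info;
}
```

Keeping the decode step separate from the CPUID instruction makes the bit layout easy to check on any host, including ones without SEV.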
 If support for SEV is present, MSR 0xc001_0010 (MSR_K8_SYSCFG) and MSR 0xc001_0015
 (MSR_K7_HWCR) can be used to determine if it can be enabled::
 	0xc001_0010:
 		Bit[23]	   1 = memory encryption can be enabled
 			   0 = memory encryption can not be enabled
 	0xc001_0015:
 		Bit[0]	   1 = memory encryption can be enabled
 			   0 = memory encryption can not be enabled
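The two MSR bits above can be tested with a small helper such as the following sketch. It takes MSR values that were read elsewhere (reading MSRs needs kernel privileges). The function and macro names are hypothetical, and requiring both bits to be set is an assumption of this sketch; the text above does not say how the two bits combine.

```c
#include <stdbool.h>
#include <stdint.h>

#define SYSCFG_MEM_ENCRYPT_BIT	23	/* MSR 0xc001_0010, bit listed above */
#define HWCR_MEM_ENCRYPT_BIT	0	/* MSR 0xc001_0015, bit listed above */

/*
 * Assumption: memory encryption is treated as available only when both
 * bits are set.
 */
static bool sev_msrs_allow_enable(uint64_t syscfg, uint64_t hwcr)
{
	bool syscfg_ok = (syscfg >> SYSCFG_MEM_ENCRYPT_BIT) & 1u;
	bool hwcr_ok = (hwcr >> HWCR_MEM_ENCRYPT_BIT) & 1u;

	return syscfg_ok && hwcr_ok;
}
```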
 When SEV support is available, it can be enabled in a specific VM by
 setting the SEV bit before executing VMRUN::
 	VMCB[0x90]:
 		Bit[1]	    1 = SEV is enabled
 			    0 = SEV is disabled
 SEV hardware uses ASIDs to associate a memory encryption key with a VM.
 Hence, the ASID for the SEV-enabled guests must be from 1 to a maximum value
 defined in the CPUID 0x8000001f[ecx] field.
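The ASID constraint above amounts to a simple range check. A minimal sketch follows; `sev_asid_is_valid` is a hypothetical name, and `max_asid` stands for the value read from CPUID 0x8000001f[ecx].

```c
#include <stdbool.h>
#include <stdint.h>

/* An SEV guest ASID must lie in [1, max_asid]; ASID 0 is never valid here. */
static bool sev_asid_is_valid(uint32_t asid, uint32_t max_asid)
{
	return asid >= 1 && asid <= max_asid;
}
```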
 SEV Key Management
 ==================
 The SEV guest key management is handled by a separate processor called the AMD
 Secure Processor (AMD-SP). Firmware running inside the AMD-SP provides a secure
 key management interface to perform common hypervisor activities such as
 encrypting bootstrap code, snapshotting, migrating and debugging the guest. For
 more information, see the SEV Key Management spec [api-spec]_.
 KVM implements the following commands to support common lifecycle events of SEV
 guests, such as launching, running, snapshotting, migrating and decommissioning.
 1. KVM_SEV_INIT
 ---------------
 The KVM_SEV_INIT command is used by the hypervisor to initialize the SEV platform
 context. In a typical workflow, this command should be the first command issued.
 Returns: 0 on success, -negative on error
 2. KVM_SEV_LAUNCH_START
 -----------------------
 The KVM_SEV_LAUNCH_START command is used for creating the memory encryption
 context. To create the encryption context, user must provide a guest policy,
 the owner's public Diffie-Hellman (PDH) key and session information.
 Parameters: struct  kvm_sev_launch_start (in/out)
 Returns: 0 on success, -negative on error
 ::
        struct kvm_sev_launch_start {
                __u32 handle;           /* if zero then firmware creates a new handle */
                __u32 policy;           /* guest's policy */
                __u64 dh_uaddr;         /* userspace address pointing to the guest owner's PDH key */
                __u32 dh_len;
                __u64 session_addr;     /* userspace address which points to the guest session information */
                __u32 session_len;
        };
 On success, the 'handle' field contains a new handle and on error, a negative value.
 For more details, see SEV spec Section 6.2.
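The sketch below shows one way userspace might populate the launch-start parameters before issuing the command. It uses a local mirror of the structure listed above rather than the kernel header, so `launch_start_args` and `prepare_launch_start` are hypothetical names, and actually submitting the command to KVM is not shown.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of struct kvm_sev_launch_start as listed above (illustrative). */
struct launch_start_args {
	uint32_t handle;
	uint32_t policy;
	uint64_t dh_uaddr;
	uint32_t dh_len;
	uint64_t session_addr;
	uint32_t session_len;
};

/* Fill the arguments for a fresh launch; buffers stay owned by the caller. */
static void prepare_launch_start(struct launch_start_args *args, uint32_t policy,
				 const void *pdh, uint32_t pdh_len,
				 const void *session, uint32_t session_len)
{
	/* handle == 0 asks the firmware to create a new handle. */
	memset(args, 0, sizeof(*args));
	args->policy = policy;
	args->dh_uaddr = (uint64_t)(uintptr_t)pdh;
	args->dh_len = pdh_len;
	args->session_addr = (uint64_t)(uintptr_t)session;
	args->session_len = session_len;
}
```

Zeroing the structure first keeps the handle at zero and avoids passing uninitialized padding to the kernel.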
 3. KVM_SEV_LAUNCH_UPDATE_DATA
 -----------------------------
 The KVM_SEV_LAUNCH_UPDATE_DATA command is used for encrypting a memory region. It also
 calculates a measurement of the memory contents. The measurement is a signature
 of the memory contents that can be sent to the guest owner as an attestation
 that the memory was encrypted correctly by the firmware.
 Parameters (in): struct  kvm_sev_launch_update_data
 Returns: 0 on success, -negative on error
 ::
        struct kvm_sev_launch_update_data {
                __u64 uaddr;    /* userspace address to be encrypted (must be 16-byte aligned) */
                __u32 len;      /* length of the data to be encrypted (must be 16-byte aligned) */
        };
 For more details, see SEV spec Section 6.3.
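Because both the address and the length must be 16-byte aligned, a caller may want to validate them before issuing the command. The following is a minimal sketch; `launch_update_args_aligned` and the macro name are hypothetical, and whether a zero length is accepted is not covered by the text above.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEV_LAUNCH_UPDATE_ALIGN	16u	/* alignment stated in the struct comments */

/* True when both the userspace address and the length are 16-byte aligned. */
static bool launch_update_args_aligned(uint64_t uaddr, uint32_t len)
{
	return (uaddr % SEV_LAUNCH_UPDATE_ALIGN) == 0 &&
	       (len % SEV_LAUNCH_UPDATE_ALIGN) == 0;
}
```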
 4. KVM_SEV_LAUNCH_MEASURE
 -------------------------
 The KVM_SEV_LAUNCH_MEASURE command is used to retrieve the measurement of the
 data encrypted by the KVM_SEV_LAUNCH_UPDATE_DATA command. The guest owner may
 wait to provide the guest with confidential information until it can verify the
 measurement. Since the guest owner knows the initial contents of the guest at
 boot, the measurement can be verified by comparing it to what the guest owner
 expects.
 Parameters (in): struct  kvm_sev_launch_measure
 Returns: 0 on success, -negative on error
 ::
        struct kvm_sev_launch_measure {
                __u64 uaddr;    /* where to copy the measurement */
                __u32 len;      /* length of measurement blob */
        };
 For more details on the measurement verification flow, see SEV spec Section 6.4.
 5. KVM_SEV_LAUNCH_FINISH
 ------------------------
 After completion of the launch flow, the KVM_SEV_LAUNCH_FINISH command can be
 issued to make the guest ready for execution.
 Returns: 0 on success, -negative on error
 6. KVM_SEV_GUEST_STATUS
 -----------------------
 The KVM_SEV_GUEST_STATUS command is used to retrieve status information about a
 SEV-enabled guest.
 Parameters (out): struct kvm_sev_guest_status
 Returns: 0 on success, -negative on error
 ::
        struct kvm_sev_guest_status {
                __u32 handle;   /* guest handle */
                __u32 policy;   /* guest policy */
                __u8 state;     /* guest state (see enum below) */
        };
 SEV guest state:
 ::
        enum {
        SEV_STATE_INVALID = 0,
        SEV_STATE_LAUNCHING,    /* guest is currently being launched */
        SEV_STATE_SECRET,       /* guest is being launched and ready to accept the ciphertext data */
        SEV_STATE_RUNNING,      /* guest is fully launched and running */
        SEV_STATE_RECEIVING,    /* guest is being migrated in from another SEV machine */
        SEV_STATE_SENDING       /* guest is getting migrated out to another SEV machine */
        };
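For logging or debugging, the state value returned by KVM_SEV_GUEST_STATUS can be mapped to a readable name. The sketch below repeats the enum from above so that it is self-contained; the enum tag `sev_guest_state` and the helper `sev_state_name` are hypothetical.

```c
#include <stddef.h>

/* Same values as the enum listed above, repeated to keep the sketch standalone. */
enum sev_guest_state {
	SEV_STATE_INVALID = 0,
	SEV_STATE_LAUNCHING,
	SEV_STATE_SECRET,
	SEV_STATE_RUNNING,
	SEV_STATE_RECEIVING,
	SEV_STATE_SENDING
};

/* Map a state value to a short name; out-of-range values yield "unknown". */
static const char *sev_state_name(unsigned int state)
{
	static const char *const names[] = {
		"invalid", "launching", "secret",
		"running", "receiving", "sending",
	};

	if (state >= sizeof(names) / sizeof(names[0]))
		return "unknown";
	return names[state];
}
```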
 7. KVM_SEV_DBG_DECRYPT
 ----------------------
 The KVM_SEV_DBG_DECRYPT command can be used by the hypervisor to request the
 firmware to decrypt the data at the given memory region.
 Parameters (in): struct kvm_sev_dbg
 Returns: 0 on success, -negative on error
 ::
        struct kvm_sev_dbg {
                __u64 src_uaddr;        /* userspace address of data to decrypt */
                __u64 dst_uaddr;        /* userspace address of destination */
                __u32 len;              /* length of memory region to decrypt */
        };
 The command returns an error if the guest policy does not allow debugging.
 8. KVM_SEV_DBG_ENCRYPT
 ----------------------
 The KVM_SEV_DBG_ENCRYPT command can be used by the hypervisor to request the
 firmware to encrypt the data at the given memory region.
 Parameters (in): struct kvm_sev_dbg
 Returns: 0 on success, -negative on error
 ::
        struct kvm_sev_dbg {
                __u64 src_uaddr;        /* userspace address of data to encrypt */
                __u64 dst_uaddr;        /* userspace address of destination */
                __u32 len;              /* length of memory region to encrypt */
        };
 The command returns an error if the guest policy does not allow debugging.
 9. KVM_SEV_LAUNCH_SECRET
 ------------------------
 The KVM_SEV_LAUNCH_SECRET command can be used by the hypervisor to inject secret
 data after the measurement has been validated by the guest owner.
 Parameters (in): struct kvm_sev_launch_secret
 Returns: 0 on success, -negative on error
 ::
        struct kvm_sev_launch_secret {
                __u64 hdr_uaddr;        /* userspace address containing the packet header */
                __u32 hdr_len;
                __u64 guest_uaddr;      /* the guest memory region where the secret should be injected */
                __u32 guest_len;
                __u64 trans_uaddr;      /* the hypervisor memory region which contains the secret */
                __u32 trans_len;
        };
 References
 ==========
 .. [white-paper] http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
 .. [api-spec] http://support.amd.com/TechDocs/55766_SEV-KM%20API_Specification.pdf
 .. [amd-apm] http://support.amd.com/TechDocs/24593.pdf (section 15.34)
 .. [kvm-forum]  http://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
@@ -1841,6 +1841,7 @@ registers, find a list below:
  PPC	| KVM_REG_PPC_DBSR              | 32
  PPC   | KVM_REG_PPC_TIDR              | 64
  PPC   | KVM_REG_PPC_PSSCR             | 64
  PPC   | KVM_REG_PPC_DEC_EXPIRY        | 64
  PPC   | KVM_REG_PPC_TM_GPR0           | 64
          ...
  PPC   | KVM_REG_PPC_TM_GPR31          | 64
@@ -3403,7 +3404,7 @@ invalid, if invalid pages are written to (e.g. after the end of memory)
 or if no page table is present for the addresses (e.g. when using
 hugepages).
-4.108 KVM_PPC_GET_CPU_CHAR
+4.109 KVM_PPC_GET_CPU_CHAR
 Capability: KVM_CAP_PPC_GET_CPU_CHAR
 Architectures: powerpc
@@ -3449,6 +3450,57 @@ array bounds check and the array access.
 These fields use the same bit definitions as the new
 H_GET_CPU_CHARACTERISTICS hypercall.
 4.110 KVM_MEMORY_ENCRYPT_OP
 Capability: basic
 Architectures: x86
 Type: system
 Parameters: an opaque platform specific structure (in/out)
 Returns: 0 on success; -1 on error
 If the platform supports creating encrypted VMs then this ioctl can be used
 for issuing platform-specific memory encryption commands to manage those
 encrypted VMs.
 Currently, this ioctl is used for issuing Secure Encrypted Virtualization
 (SEV) commands on AMD Processors. The SEV commands are defined in
 Documentation/virtual/kvm/amd-memory-encryption.txt.
 4.111 KVM_MEMORY_ENCRYPT_REG_REGION
 Capability: basic
 Architectures: x86
 Type: system
 Parameters: struct kvm_enc_region (in)
 Returns: 0 on success; -1 on error
 This ioctl can be used to register a guest memory region which may
 contain encrypted data (e.g. guest RAM, SMRAM etc).
 It is used in the SEV-enabled guest. When encryption is enabled, a guest
 memory region may contain encrypted data. The SEV memory encryption
 engine uses a tweak such that two identical plaintext pages, each at a
 different location, will have differing ciphertexts. Swapping or
 moving the ciphertext of those pages therefore does not result in the
 plaintext being swapped, so relocating (or migrating) the physical backing
 pages of an SEV guest requires some additional steps.
 Note: The current SEV key management spec does not provide commands to
 swap or migrate (move) ciphertext pages. Hence, for now we pin the guest
 memory region registered with the ioctl.
 4.112 KVM_MEMORY_ENCRYPT_UNREG_REGION
 Capability: basic
 Architectures: x86
 Type: system
 Parameters: struct kvm_enc_region (in)
 Returns: 0 on success; -1 on error
 This ioctl can be used to unregister the guest memory region registered
 with KVM_MEMORY_ENCRYPT_REG_REGION ioctl above.
 5. The kvm_run structure
 ------------------------
@@ -1,187 +0,0 @@
 KVM/ARM VGIC Forwarded Physical Interrupts
 ==========================================
 The KVM/ARM code implements software support for the ARM Generic
 Interrupt Controller's (GIC's) hardware support for virtualization by
 allowing software to inject virtual interrupts to a VM, which the guest
 OS sees as regular interrupts.  The code is famously known as the VGIC.
 Some of these virtual interrupts, however, correspond to physical
 interrupts from real physical devices.  One example could be the
 architected timer, which itself supports virtualization, and therefore
 lets a guest OS program the hardware device directly to raise an
 interrupt at some point in time.  When such an interrupt is raised, the
 host OS initially handles the interrupt and must somehow signal this
 event as a virtual interrupt to the guest.  Another example could be a
 passthrough device, where the physical interrupts are initially handled
 by the host, but the device driver for the device lives in the guest OS
 and KVM must therefore somehow inject a virtual interrupt on behalf of
 the physical one to the guest OS.
 These virtual interrupts corresponding to a physical interrupt on the
 host are called forwarded physical interrupts, but are also sometimes
 referred to as 'virtualized physical interrupts' and 'mapped interrupts'.
 Forwarded physical interrupts are handled slightly differently compared
 to virtual interrupts generated purely by a software emulated device.
 The HW bit
 ----------
 Virtual interrupts are signalled to the guest by programming the List
 Registers (LRs) on the GIC before running a VCPU.  The LR is programmed
 with the virtual IRQ number and the state of the interrupt (Pending,
 Active, or Pending+Active).  When the guest ACKs and EOIs a virtual
 interrupt, the LR state moves from Pending to Active, and finally to
 inactive.
 The LRs include an extra bit, called the HW bit.  When this bit is set,
 KVM must also program an additional field in the LR, the physical IRQ
 number, to link the virtual with the physical IRQ.
 When the HW bit is set, KVM must EITHER set the Pending OR the Active
 bit, never both at the same time.
 Setting the HW bit causes the hardware to deactivate the physical
 interrupt on the physical distributor when the guest deactivates the
 corresponding virtual interrupt.
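The constraint on the HW bit can be expressed as a small consistency check. The sketch below works on a hypothetical software description of a List Register, `struct lr_desc`, not on the real hardware encoding, and it checks only the rule stated above that Pending and Active are never set together when the HW bit is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software view of one List Register (not the hardware layout). */
struct lr_desc {
	uint32_t virq;		/* virtual IRQ number */
	uint32_t pirq;		/* physical IRQ number, meaningful only if hw is set */
	bool     hw;		/* HW bit */
	bool     pending;
	bool     active;
};

/* Reject descriptors that set HW together with both Pending and Active. */
static bool lr_desc_is_valid(const struct lr_desc *lr)
{
	if (lr->hw && lr->pending && lr->active)
		return false;
	return true;
}
```

Whether a HW descriptor with neither state bit set is meaningful is outside the scope of this sketch.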
 Forwarded Physical Interrupts Life Cycle
 ----------------------------------------
 The state of forwarded physical interrupts is managed in the following way:
  - The physical interrupt is acked by the host, and becomes active on
    the physical distributor (*).
  - KVM sets the LR.Pending bit, because this is the only way the GICV
    interface is going to present it to the guest.
  - LR.Pending will stay set as long as the guest has not acked the interrupt.
  - LR.Pending transitions to LR.Active on the guest read of the IAR, as
    expected.
  - On guest EOI, the *physical distributor* active bit gets cleared,
    but the LR.Active is left untouched (set).
  - KVM clears the LR on VM exits when the physical distributor
    active state has been cleared.
 (*): The host handling is slightly more complicated.  For some forwarded
 interrupts (shared), KVM directly sets the active state on the physical
 distributor before entering the guest, because the interrupt is never actually
 handled on the host (see details on the timer as an example below).  For other
 forwarded interrupts (non-shared) the host does not deactivate the interrupt
 when the host ISR completes, but leaves the interrupt active until the guest
 deactivates it.  Leaving the interrupt active is allowed, because Linux
 configures the physical GIC with EOIMode=1, which causes EOI operations to
 perform a priority drop allowing the GIC to receive other interrupts of the
 default priority.
 Forwarded Edge and Level Triggered PPIs and SPIs
 ------------------------------------------------
 Forwarded physical interrupts injected should always be active on the
 physical distributor when injected to a guest.
 Level-triggered interrupts will keep the interrupt line to the GIC
 asserted, typically until the guest programs the device to deassert the
 line.  This means that the interrupt will remain pending on the physical
 distributor until the guest has reprogrammed the device.  Since we
 always run the VM with interrupts enabled on the CPU, a pending
 interrupt will exit the guest as soon as we switch into the guest,
 preventing the guest from ever making progress as the process repeats
 over and over.  Therefore, the active state on the physical distributor
 must be set when entering the guest, preventing the GIC from forwarding
 the pending interrupt to the CPU.  As soon as the guest deactivates the
 interrupt, the physical line is sampled by the hardware again and the host
 takes a new interrupt if and only if the physical line is still asserted.
 Edge-triggered interrupts do not exhibit the same problem with
 preventing guest execution that level-triggered interrupts do.  One
 option is to not use HW bit at all, and inject edge-triggered interrupts
 from a physical device as pure virtual interrupts.  But that would
 potentially slow down handling of the interrupt in the guest, because a
 physical interrupt occurring in the middle of the guest ISR would
 preempt the guest for the host to handle the interrupt.  Additionally,
 if you configure the system to handle interrupts on a separate physical
 core from that running your VCPU, you still have to interrupt the VCPU
 to queue the pending state onto the LR, even though the guest won't use
 this information until the guest ISR completes.  Therefore, the HW
 bit should always be set for forwarded edge-triggered interrupts.  With
 the HW bit set, the virtual interrupt is injected and additional
 physical interrupts occurring before the guest deactivates the interrupt
 simply mark the state on the physical distributor as Pending+Active.  As
 soon as the guest deactivates the interrupt, the host takes another
 interrupt if and only if there was a physical interrupt between injecting
 the forwarded interrupt to the guest and the guest deactivating the
 interrupt.
 Consequently, whenever we schedule a VCPU with one or more LRs with the
 HW bit set, the interrupt must also be active on the physical
 distributor.
 Forwarded LPIs
 --------------
 LPIs, introduced in GICv3, are always edge-triggered and do not have an
 active state.  They become pending when a device signals them, and as
 soon as they are acked by the CPU, they are inactive again.
 It therefore doesn't make sense, and is not supported, to set the HW bit
 for physical LPIs that are forwarded to a VM as virtual interrupts,
 typically virtual SPIs.
 For LPIs, there is no other choice than to preempt the VCPU thread if
 necessary, and queue the pending state onto the LR.
 Putting It Together: The Architected Timer
 ------------------------------------------
 The architected timer is a device that signals interrupts with level
 triggered semantics.  The timer hardware is directly accessed by VCPUs
 which program the timer to fire at some point in time.  Each VCPU on a
 system programs the timer to fire at different times, and therefore the
 hardware is multiplexed between multiple VCPUs.  This is implemented by
 context-switching the timer state along with each VCPU thread.
 However, this means that a scenario like the following is entirely
 possible, and in fact, typical:
 1.  KVM runs the VCPU
 2.  The guest programs the timer to fire in T+100
 3.  The guest is idle and calls WFI (wait-for-interrupts)
 4.  The hardware traps to the host
 5.  KVM stores the timer state to memory and disables the hardware timer
 6.  KVM schedules a soft timer to fire in T+(100 - time since step 2)
 7.  KVM puts the VCPU thread to sleep (on a waitqueue)
 8.  The soft timer fires, waking up the VCPU thread
 9.  KVM reprograms the timer hardware with the VCPU's values
 10. KVM marks the timer interrupt as active on the physical distributor
 11. KVM injects a forwarded physical interrupt to the guest
 12. KVM runs the VCPU
 Notice that KVM injects a forwarded physical interrupt in step 11 without
 the corresponding interrupt having actually fired on the host.  That is
 exactly why we mark the timer interrupt as active in step 10, because
 the active state on the physical distributor is part of the state
 belonging to the timer hardware, which is context-switched along with
 the VCPU thread.
 If the guest does not idle because it is busy, the flow looks like this
 instead:
 1.  KVM runs the VCPU
 2.  The guest programs the timer to fire in T+100
 3.  At T+100 the timer fires and a physical IRQ causes the VM to exit
     (note that this initially only traps to EL2 and does not run the host ISR
     until KVM has returned to the host).
 4.  With interrupts still disabled on the CPU coming back from the guest, KVM
     stores the virtual timer state to memory and disables the virtual hw timer.
 5.  KVM looks at the timer state (in memory) and injects a forwarded physical
     interrupt because it concludes the timer has expired.
 6.  KVM marks the timer interrupt as active on the physical distributor
 7.  KVM enables the timer, enables interrupts, and runs the VCPU
 Notice that again the forwarded physical interrupt is injected to the
 guest without having actually been handled on the host.  In this case it
 is because the physical interrupt is never actually seen by the host because the
 timer is disabled upon guest return, and the virtual forwarded interrupt is
 injected on the KVM guest entry path.
@@ -54,6 +54,10 @@ KVM_FEATURE_PV_UNHALT              ||     7 || guest checks this feature bit
                                   ||       || before enabling paravirtualized
                                   ||       || spinlock support.
 ------------------------------------------------------------------------------
 KVM_FEATURE_PV_TLB_FLUSH           ||     9 || guest checks this feature bit
                                   ||       || before enabling paravirtualized
                                   ||       || tlb flush.
 ------------------------------------------------------------------------------
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||    24 || host will warn if no guest-side
                                   ||       || per-cpu warps are expected in
                                   ||       || kvmclock.
@@ -7748,7 +7748,9 @@ F:	arch/powerpc/kernel/kvm*
 KERNEL VIRTUAL MACHINE for s390 (KVM/s390)
 M:	Christian Borntraeger <borntraeger@de.ibm.com>
-M:	Cornelia Huck <cohuck@redhat.com>
+M:	Janosch Frank <frankja@linux.vnet.ibm.com>
 R:	David Hildenbrand <david@redhat.com>
 R:	Cornelia Huck <cohuck@redhat.com>
 L:	linux-s390@vger.kernel.org
 W:	http://www.ibm.com/developerworks/linux/linux390/
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git
@@ -12026,6 +12028,7 @@ F:	drivers/pci/hotplug/s390_pci_hpc.c
 S390 VFIO-CCW DRIVER
 M:	Cornelia Huck <cohuck@redhat.com>
 M:	Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
 M:	Halil Pasic <pasic@linux.vnet.ibm.com>
 L:	linux-s390@vger.kernel.org
 L:	kvm@vger.kernel.org
 S:	Supported
@@ -131,7 +131,7 @@ static inline bool mode_has_spsr(struct kvm_vcpu *vcpu)
 static inline bool vcpu_mode_priv(struct kvm_vcpu *vcpu)
 {
 	unsigned long cpsr_mode = vcpu->arch.ctxt.gp_regs.usr_regs.ARM_cpsr & MODE_MASK;
-	return cpsr_mode > USR_MODE;;
+	return cpsr_mode > USR_MODE;
 }
 static inline u32 kvm_vcpu_get_hsr(const struct kvm_vcpu *vcpu)
@@ -48,6 +48,8 @@
 	KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_IRQ_PENDING	KVM_ARCH_REQ(1)
 DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
 u32 *kvm_vcpu_reg(struct kvm_vcpu *vcpu, u8 reg_num, u32 mode);
 int __attribute_const__ kvm_target_cpu(void);
 int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
@@ -21,7 +21,6 @@
 #include <linux/compiler.h>
 #include <linux/kvm_host.h>
 #include <asm/cp15.h>
 #include <asm/kvm_mmu.h>
 #include <asm/vfp.h>
 #define __hyp_text __section(.hyp.text) notrace
@@ -69,6 +68,8 @@
 #define HIFAR		__ACCESS_CP15(c6, 4, c0, 2)
 #define HPFAR		__ACCESS_CP15(c6, 4, c0, 4)
 #define ICIALLUIS	__ACCESS_CP15(c7, 0, c1, 0)
 #define BPIALLIS	__ACCESS_CP15(c7, 0, c1, 6)
 #define ICIMVAU		__ACCESS_CP15(c7, 0, c5, 1)
 #define ATS1CPR		__ACCESS_CP15(c7, 0, c8, 0)
 #define TLBIALLIS	__ACCESS_CP15(c8, 0, c3, 0)
 #define TLBIALL		__ACCESS_CP15(c8, 0, c7, 0)
@@ -37,6 +37,8 @@
 #include <linux/highmem.h>
 #include <asm/cacheflush.h>
 #include <asm/cputype.h>
 #include <asm/kvm_hyp.h>
 #include <asm/pgalloc.h>
 #include <asm/stage2_pgtable.h>
@@ -83,6 +85,18 @@ static inline pmd_t kvm_s2pmd_mkwrite(pmd_t pmd)
 	return pmd;
 }
 static inline pte_t kvm_s2pte_mkexec(pte_t pte)
 {
 	pte_val(pte) &= ~L_PTE_XN;
 	return pte;
 }
 static inline pmd_t kvm_s2pmd_mkexec(pmd_t pmd)
 {
 	pmd_val(pmd) &= ~PMD_SECT_XN;
 	return pmd;
 }
 static inline void kvm_set_s2pte_readonly(pte_t *pte)
 {
 	pte_val(*pte) = (pte_val(*pte) & ~L_PTE_S2_RDWR) | L_PTE_S2_RDONLY;
@@ -93,6 +107,11 @@ static inline bool kvm_s2pte_readonly(pte_t *pte)
 	return (pte_val(*pte) & L_PTE_S2_RDWR) == L_PTE_S2_RDONLY;
 }
 static inline bool kvm_s2pte_exec(pte_t *pte)
 {
 	return !(pte_val(*pte) & L_PTE_XN);
 }
 static inline void kvm_set_s2pmd_readonly(pmd_t *pmd)
 {
 	pmd_val(*pmd) = (pmd_val(*pmd) & ~L_PMD_S2_RDWR) | L_PMD_S2_RDONLY;
@@ -103,6 +122,11 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
 	return (pmd_val(*pmd) & L_PMD_S2_RDWR) == L_PMD_S2_RDONLY;
 }
 static inline bool kvm_s2pmd_exec(pmd_t *pmd)
 {
 	return !(pmd_val(*pmd) & PMD_SECT_XN);
 }
 static inline bool kvm_page_empty(void *ptr)
 {
 	struct page *ptr_page = virt_to_page(ptr);
@@ -126,21 +150,10 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu_cp15(vcpu, c1_SCTLR) & 0b101) == 0b101;
 }
-static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
-					       kvm_pfn_t pfn,
-					       unsigned long size)
+static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
 {
 	/*
-	 * If we are going to insert an instruction page and the icache is
-	 * either VIPT or PIPT, there is a potential problem where the host
-	 * (or another VM) may have used the same page as this guest, and we
-	 * read incorrect data from the icache.  If we're using a PIPT cache,
-	 * we can invalidate just that page, but if we are using a VIPT cache
-	 * we need to invalidate the entire icache - damn shame - as written
-	 * in the ARM ARM (DDI 0406C.b - Page B3-1393).
-	 *
-	 * VIVT caches are tagged using both the ASID and the VMID and doesn't
-	 * need any kind of flushing (DDI 0406C.b - Page B3-1392).
+	 * Clean the dcache to the Point of Coherency.
 	 *
 	 * We need to do this through a kernel mapping (using the
 	 * user-space mapping has proved to be the wrong
@@ -155,9 +168,63 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
 		kvm_flush_dcache_to_poc(va, PAGE_SIZE);
-		if (icache_is_pipt())
+		size -= PAGE_SIZE;
-			__cpuc_coherent_user_range((unsigned long)va,
+		pfn++;
-						   (unsigned long)va + PAGE_SIZE);
+
 		kunmap_atomic(va);
 	}
 }
 static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
 						  unsigned long size)
 {
 	u32 iclsz;
 	/*
 	 * If we are going to insert an instruction page and the icache is
 	 * either VIPT or PIPT, there is a potential problem where the host
 	 * (or another VM) may have used the same page as this guest, and we
 	 * read incorrect data from the icache.  If we're using a PIPT cache,
 	 * we can invalidate just that page, but if we are using a VIPT cache
 	 * we need to invalidate the entire icache - damn shame - as written
 	 * in the ARM ARM (DDI 0406C.b - Page B3-1393).
 	 *
 	 * VIVT caches are tagged using both the ASID and the VMID and doesn't
 	 * need any kind of flushing (DDI 0406C.b - Page B3-1392).
 	 */
 	VM_BUG_ON(size & ~PAGE_MASK);
 	if (icache_is_vivt_asid_tagged())
 		return;
 	if (!icache_is_pipt()) {
 		/* any kind of VIPT cache */
 		__flush_icache_all();
 		return;
 	}
 	/*
 	 * CTR IminLine contains Log2 of the number of words in the
 	 * cache line, so we can get the number of words as
 	 * 2 << (IminLine - 1).  To get the number of bytes, we
 	 * multiply by 4 (the number of bytes in a 32-bit word), and
 	 * get 4 << (IminLine).
 	 */
 	iclsz = 4 << (read_cpuid(CPUID_CACHETYPE) & 0xf);
 	while (size) {
 		void *va = kmap_atomic_pfn(pfn);
 		void *end = va + PAGE_SIZE;
 		void *addr = va;
 		do {
 			write_sysreg(addr, ICIMVAU);
 			addr += iclsz;
 		} while (addr < end);
 		dsb(ishst);
 		isb();
 		size -= PAGE_SIZE;
 		pfn++;
@@ -165,9 +232,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
 		kunmap_atomic(va);
 	}
-	if (!icache_is_pipt() && !icache_is_vivt_asid_tagged()) {
+	/* Check if we need to invalidate the BTB */
-		/* any kind of VIPT cache */
+	if ((read_cpuid_ext(CPUID_EXT_MMFR1) >> 28) != 4) {
-		__flush_icache_all();
+		write_sysreg(0, BPIALLIS);
 		dsb(ishst);
 		isb();
 	}
 }
@@ -102,8 +102,8 @@ extern pgprot_t		pgprot_s2_device;
 #define PAGE_HYP_EXEC		_MOD_PROT(pgprot_kernel, L_PTE_HYP | L_PTE_RDONLY)
 #define PAGE_HYP_RO		_MOD_PROT(pgprot_kernel, L_PTE_HYP | L_PTE_RDONLY | L_PTE_XN)
 #define PAGE_HYP_DEVICE		_MOD_PROT(pgprot_hyp_device, L_PTE_HYP)
-#define PAGE_S2			_MOD_PROT(pgprot_s2, L_PTE_S2_RDONLY)
+#define PAGE_S2			_MOD_PROT(pgprot_s2, L_PTE_S2_RDONLY | L_PTE_XN)
-#define PAGE_S2_DEVICE		_MOD_PROT(pgprot_s2_device, L_PTE_S2_RDONLY)
+#define PAGE_S2_DEVICE		_MOD_PROT(pgprot_s2_device, L_PTE_S2_RDONLY | L_PTE_XN)
 #define __PAGE_NONE		__pgprot(_L_PTE_DEFAULT | L_PTE_RDONLY | L_PTE_XN | L_PTE_NONE)
 #define __PAGE_SHARED		__pgprot(_L_PTE_DEFAULT | L_PTE_USER | L_PTE_XN)
@@ -18,6 +18,7 @@
 #include <asm/kvm_asm.h>
 #include <asm/kvm_hyp.h>
 #include <asm/kvm_mmu.h>
 __asm__(".arch_extension     virt");
@@ -19,6 +19,7 @@
 */
 #include <asm/kvm_hyp.h>
 #include <asm/kvm_mmu.h>
 /**
 * Flush per-VMID TLBs
@@ -435,6 +435,27 @@ alternative_endif
 	dsb	\domain
 	.endm
 /*
 * Macro to perform an instruction cache maintenance for the interval
 * [start, end)
 *
 * 	start, end:	virtual addresses describing the region
 *	label:		A label to branch to on user fault.
 * 	Corrupts:	tmp1, tmp2
 */
 	.macro invalidate_icache_by_line start, end, tmp1, tmp2, label
 	icache_line_size \tmp1, \tmp2
 	sub	\tmp2, \tmp1, #1
 	bic	\tmp2, \start, \tmp2
 9997:
 USER(\label, ic	ivau, \tmp2)			// invalidate I line PoU
 	add	\tmp2, \tmp2, \tmp1
 	cmp	\tmp2, \end
 	b.lo	9997b
 	dsb	ish
 	isb
 	.endm
 /*
 * reset_pmuserenr_el0 - reset PMUSERENR_EL0 if PMUv3 present
 */
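The control flow of the `invalidate_icache_by_line` macro above (align the start address down to a line boundary, then step one line at a time until reaching `end`) can be modelled in plain C. This is only an illustrative sketch: `icache_lines_touched()` is a hypothetical function that counts the `ic ivau` operations the loop would issue, and it assumes `line_size` is a power of two, as a cache line size is.

```c
#include <stdint.h>

/*
 * Hypothetical model of the macro's loop (not part of the patch).
 * Returns how many cache lines would be invalidated for [start, end).
 */
static unsigned long icache_lines_touched(uintptr_t start, uintptr_t end,
					  uintptr_t line_size)
{
	/* sub tmp2, tmp1, #1 ; bic tmp2, start, tmp2 */
	uintptr_t addr = start & ~(line_size - 1);
	unsigned long count = 0;

	do {
		count++;		/* stands in for: ic ivau, addr */
		addr += line_size;	/* add tmp2, tmp2, tmp1 */
	} while (addr < end);		/* cmp ; b.lo (unsigned compare) */

	return count;
}
```

As in the macro, the body runs before the comparison, so at least one line is always invalidated, even for an empty interval.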
@@ -52,6 +52,12 @@
 *		- start  - virtual start address
 *		- end    - virtual end address
 *
 *	invalidate_icache_range(start, end)
 *
 *		Invalidate the I-cache in the region described by start, end.
 *		- start  - virtual start address
 *		- end    - virtual end address
 *
 *	__flush_cache_user_range(start, end)
 *
 *		Ensure coherency between the I-cache and the D-cache in the
@@ -66,6 +72,7 @@
 *		- size   - region size
 */
 extern void flush_icache_range(unsigned long start, unsigned long end);
 extern int  invalidate_icache_range(unsigned long start, unsigned long end);
 extern void __flush_dcache_area(void *addr, size_t len);
 extern void __inval_dcache_area(void *addr, size_t len);
 extern void __clean_dcache_area_poc(void *addr, size_t len);
@@ -48,6 +48,8 @@
 	KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_IRQ_PENDING	KVM_ARCH_REQ(1)
 DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
 int __attribute_const__ kvm_target_cpu(void);
 int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
 int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext);
@@ -20,7 +20,6 @@
 #include <linux/compiler.h>
 #include <linux/kvm_host.h>
 #include <asm/kvm_mmu.h>
 #include <asm/sysreg.h>
 #define __hyp_text __section(.hyp.text) notrace
@@ -173,6 +173,18 @@ static inline pmd_t kvm_s2pmd_mkwrite(pmd_t pmd)
 	return pmd;
 }
 static inline pte_t kvm_s2pte_mkexec(pte_t pte)
 {
 	pte_val(pte) &= ~PTE_S2_XN;
 	return pte;
 }
 static inline pmd_t kvm_s2pmd_mkexec(pmd_t pmd)
 {
 	pmd_val(pmd) &= ~PMD_S2_XN;
 	return pmd;
 }
 static inline void kvm_set_s2pte_readonly(pte_t *pte)
 {
 	pteval_t old_pteval, pteval;
@@ -191,6 +203,11 @@ static inline bool kvm_s2pte_readonly(pte_t *pte)
 	return (pte_val(*pte) & PTE_S2_RDWR) == PTE_S2_RDONLY;
 }
 static inline bool kvm_s2pte_exec(pte_t *pte)
 {
 	return !(pte_val(*pte) & PTE_S2_XN);
 }
 static inline void kvm_set_s2pmd_readonly(pmd_t *pmd)
 {
 	kvm_set_s2pte_readonly((pte_t *)pmd);
@@ -201,6 +218,11 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
 	return kvm_s2pte_readonly((pte_t *)pmd);
 }
 static inline bool kvm_s2pmd_exec(pmd_t *pmd)
 {
 	return !(pmd_val(*pmd) & PMD_S2_XN);
 }
 static inline bool kvm_page_empty(void *ptr)
 {
 	struct page *ptr_page = virt_to_page(ptr);
@@ -230,21 +252,25 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
 }
-static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
-					       kvm_pfn_t pfn,
-					       unsigned long size)
+static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
 {
 	void *va = page_address(pfn_to_page(pfn));
 	kvm_flush_dcache_to_poc(va, size);
 }
 static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
 						  unsigned long size)
 {
 	if (icache_is_aliasing()) {
 		/* any kind of VIPT cache */
 		__flush_icache_all();
 	} else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
 		/* PIPT or VPIPT at EL2 (see comment in __kvm_tlb_flush_vmid_ipa) */
-		flush_icache_range((unsigned long)va,
-				   (unsigned long)va + size);
+		void *va = page_address(pfn_to_page(pfn));
+
+		invalidate_icache_range((unsigned long)va,
+					(unsigned long)va + size);
 	}
 }
@@ -187,9 +187,11 @@
 */
 #define PTE_S2_RDONLY		(_AT(pteval_t, 1) << 6)   /* HAP[2:1] */
 #define PTE_S2_RDWR		(_AT(pteval_t, 3) << 6)   /* HAP[2:1] */
 #define PTE_S2_XN		(_AT(pteval_t, 2) << 53)  /* XN[1:0] */
 #define PMD_S2_RDONLY		(_AT(pmdval_t, 1) << 6)   /* HAP[2:1] */
 #define PMD_S2_RDWR		(_AT(pmdval_t, 3) << 6)   /* HAP[2:1] */
 #define PMD_S2_XN		(_AT(pmdval_t, 2) << 53)  /* XN[1:0] */
 /*
 * Memory Attribute override for Stage-2 (MemAttr[3:0])
@@ -67,8 +67,8 @@
 #define PAGE_HYP_RO		__pgprot(_HYP_PAGE_DEFAULT | PTE_HYP | PTE_RDONLY | PTE_HYP_XN)
 #define PAGE_HYP_DEVICE		__pgprot(PROT_DEVICE_nGnRE | PTE_HYP)
-#define PAGE_S2			__pgprot(_PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_NORMAL) | PTE_S2_RDONLY)
+#define PAGE_S2			__pgprot(_PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_NORMAL) | PTE_S2_RDONLY | PTE_S2_XN)
-#define PAGE_S2_DEVICE		__pgprot(_PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_UXN)
+#define PAGE_S2_DEVICE		__pgprot(_PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_S2_XN)
 #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
 #define PAGE_SHARED		__pgprot(_PAGE_DEFAULT | PTE_USER | PTE_NG | PTE_PXN | PTE_UXN | PTE_WRITE)
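With `PTE_S2_XN` now part of `PAGE_S2`, stage-2 mappings start out non-executable, and execute permission is granted by clearing the XN field as `kvm_s2pte_mkexec()` does. The sketch below models that bit manipulation in user space, assuming a plain 64-bit integer in place of the kernel's `pte_t`; the names `s2pte_mkexec()` and `s2pte_exec()` are hypothetical stand-ins for the kernel helpers, and only the two bit definitions shown in the diff are reproduced.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;	/* stand-in for the kernel type */

/* Values copied from the arm64 definitions in the diff above */
#define PTE_S2_RDONLY	((pteval_t)1 << 6)	/* HAP[2:1] */
#define PTE_S2_XN	((pteval_t)2 << 53)	/* XN[1:0] */

/* Mirrors kvm_s2pte_mkexec(): clear XN to grant execute permission */
static pteval_t s2pte_mkexec(pteval_t pte)
{
	return pte & ~PTE_S2_XN;
}

/* Mirrors kvm_s2pte_exec(): the entry is executable when XN is clear */
static bool s2pte_exec(pteval_t pte)
{
	return !(pte & PTE_S2_XN);
}
```

An entry built from `PTE_S2_RDONLY | PTE_S2_XN` reports as non-executable until it has been passed through `s2pte_mkexec()`, which leaves the read-only permission bits untouched.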