Pull virtio updates from Rusty Russell:
"OK, this has the big virtio 1.0 implementation, as specified by OASIS.
On top of tht is the major rework of lguest, to use PCI and virtio
1.0, to double-check the implementation.
Then comes the inevitable fixes and cleanups from that work"
* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (80 commits)
virtio: don't set VIRTIO_CONFIG_S_DRIVER_OK twice.
virtio_net: unconditionally define struct virtio_net_hdr_v1.
tools/lguest: don't use legacy definitions for net device in example launcher.
virtio: Don't expose legacy net features when VIRTIO_NET_NO_LEGACY defined.
tools/lguest: use common error macros in the example launcher.
tools/lguest: give virtqueues names for better error messages
tools/lguest: more documentation and checking of virtio 1.0 compliance.
lguest: don't look in console features to find emerg_wr.
tools/lguest: don't start devices until DRIVER_OK status set.
tools/lguest: handle indirect partway through chain.
tools/lguest: insert driver references from the 1.0 spec (4.1 Virtio Over PCI)
tools/lguest: insert device references from the 1.0 spec (4.1 Virtio Over PCI)
tools/lguest: rename virtio_pci_cfg_cap field to match spec.
tools/lguest: fix features_accepted logic in example launcher.
tools/lguest: handle device reset correctly in example launcher.
virtual: Documentation: simplify and generalize paravirt_ops.txt
lguest: remove NOTIFY call and eventfd facility.
lguest: remove NOTIFY facility from demonstration launcher.
lguest: use the PCI console device's emerg_wr for early boot messages.
lguest: always put console in PCI slot #1.
...
This patch enables cpu model support in kvm/s390 via the vm attribute
interface.
During KVM initialization, the host properties cpuid, IBC value and the
facility list are stored in the architecture specific cpu model structure.
During vcpu setup, these properties are taken to initialize the related SIE
state. This mechanism allows to adjust the properties from user space and thus
to implement different selectable cpu models.
This patch uses the IBC functionality to block instructions that have not
been implemented at the requested CPU type and GA level compared to the
full host capability.
Userspace has to initialize the cpu model before vcpu creation. A cpu model
change of running vcpus is not possible.
Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: fixes and features for kvm/next (3.20)
1. Generic
- sparse warning (make function static)
- optimize locking
- bugfixes for interrupt injection
- fix MVPG addressing modes
2. hrtimer/wakeup fun
A recent change can cause KVM hangs if adjtime is used in the host.
The hrtimer might wake up too early or too late. Too early is fatal
as vcpu_block will see that the wakeup condition is not met and
sleep again. This CPU might never wake up again.
This series addresses this problem. adjclock slowing down the host
clock will result in too late wakeups. This will require more work.
In addition to that we also change the hrtimer from REALTIME to
MONOTONIC to avoid similar problems with timedatectl set-time.
3. sigp rework
We will move all "slow" sigps to QEMU (protected with a capability that
can be enabled) to avoid several races between concurrent SIGP orders.
4. Optimize the shadow page table
Provide an interface to announce the maximum guest size. The kernel
will use that to make the pagetable 2,3,4 (or theoretically) 5 levels.
5. Provide an interface to set the guest TOD
We now use two vm attributes instead of two oneregs, as oneregs are
vcpu ioctl and we don't want to call them from other threads.
6. Protected key functions
The real HMC allows to enable/disable protected key CPACF functions.
Lets provide an implementation + an interface for QEMU to activate
this the protected key instructions.
Most SIGP orders are handled partially in kernel and partially in
user space. In order to:
- Get a correct SIGP SET PREFIX handler that informs user space
- Avoid race conditions between concurrently executed SIGP orders
- Serialize SIGP orders per VCPU
We need to handle all "slow" SIGP orders in user space. The remaining
ones to be handled completely in kernel are:
- SENSE
- SENSE RUNNING
- EXTERNAL CALL
- EMERGENCY SIGNAL
- CONDITIONAL EMERGENCY SIGNAL
According to the PoP, they have to be fast. They can be executed
without conflicting to the actions of other pending/concurrently
executing orders (e.g. STOP vs. START).
This patch introduces a new capability that will - when enabled -
forward all but the mentioned SIGP orders to user space. The
instruction counters in the kernel are still updated.
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
In order to get rid of the action_flags and to properly migrate pending SIGP
STOP irqs triggered e.g. by SIGP STOP AND STORE STATUS, we need to remember
whether to store the status when stopping.
For this reason, a new parameter (flags) for the SIGP STOP irq is introduced.
These flags further define details of the requested STOP and can be easily
migrated.
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
With commit c6c956b80b ("KVM: s390/mm: support gmap page tables with less
than 5 levels") we are able to define a limit for the guest memory size.
As we round up the guest size in respect to the levels of page tables
we get to guest limits of: 2048 MB, 4096 GB, 8192 TB and 16384 PB.
We currently limit the guest size to 16 TB, which means we end up
creating a page table structure supporting guest sizes up to 8192 TB.
This patch introduces an interface that allows userspace to tune
this limit. This may bring performance improvements for small guests.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Although the GIC architecture requires us to map the MMIO regions
only at page aligned addresses, we currently do not enforce this from
the kernel side.
Restrict any vGICv2 regions to be 4K aligned and any GICv3 regions
to be 64K aligned. Document this requirement.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
With all of the GICv3 code in place now we allow userland to ask the
kernel for using a virtual GICv3 in the guest.
Also we provide the necessary support for guests setting the memory
addresses for the virtual distributor and redistributors.
This requires some userland code to make use of that feature and
explicitly ask for a virtual GICv3.
Document that KVM_CREATE_IRQCHIP only works for GICv2, but is
considered legacy and using KVM_CREATE_DEVICE is preferred.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Since the advent of VGIC dynamic initialization, this latter is
initialized quite late on the first vcpu run or "on-demand", when
injecting an IRQ or when the guest sets its registers.
This initialization could be initiated explicitly much earlier
by the users-space, as soon as it has provided the requested
dimensioning parameters.
This patch adds a new entry to the VGIC KVM device that allows
the user to manually request the VGIC init:
- a new KVM_DEV_ARM_VGIC_GRP_CTRL group is introduced.
- Its first attribute is KVM_DEV_ARM_VGIC_CTRL_INIT
The rationale behind introducing a group is to be able to add other
controls later on, if needed.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Second round of changes for KVM for arm/arm64 for v3.19; fixes reboot
problems, clarifies VCPU init, and fixes a regression concerning the
VGIC init flow.
Conflicts:
arch/ia64/kvm/kvm-ia64.c [deleted in HEAD and modified in kvmarm]
When a vcpu calls SYSTEM_OFF or SYSTEM_RESET with PSCI v0.2, the vcpus
should really be turned off for the VM adhering to the suggestions in
the PSCI spec, and it's the sane thing to do.
Also, clarify the behavior and expectations for exits to user space with
the KVM_EXIT_SYSTEM_EVENT case.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
It is not clear that this ioctl can be called multiple times for a given
vcpu. Userspace already does this, so clarify the ABI.
Also specify that userspace is expected to always make secondary and
subsequent calls to the ioctl with the same parameters for the VCPU as
the initial call (which userspace also already does).
Add code to check that userspace doesn't violate that ABI in the future,
and move the kvm_vcpu_set_target() function which is currently
duplicated between the 32-bit and 64-bit versions in guest.c to a common
static function in arm.c, shared between both architectures.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
The implementation of KVM_ARM_VCPU_INIT is currently not doing what
userspace expects, namely making sure that a vcpu which may have been
turned off using PSCI is returned to its initial state, which would be
powered on if userspace does not set the KVM_ARM_VCPU_POWER_OFF flag.
Implement the expected functionality and clarify the ABI.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
KVM: s390: Fixes for kvm/next (3.19) and stable
1. We should flush TLBs for load control instruction emulation (stable)
2. A workaround for a compiler bug that renders ACCESS_ONCE broken (stable)
3. Fix program check handling for load control
4. Documentation Fix
Documentation uses incorrect attribute names for some vm device
attributes: fix this.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
No kernel ever reported KVM_CAP_DEVICE_MSIX, KVM_CAP_DEVICE_MSI,
KVM_CAP_DEVICE_ASSIGNMENT, KVM_CAP_DEVICE_DEASSIGNMENT.
This makes the documentation wrong, and no application ever
written to use these capabilities has a chance to work correctly.
The only way to detect support is to try, and test errno for ENOTTY.
That's unfortunate, but we can't fix the past.
Document the actual semantics, and drop the definitions from
the exported header to make it easier for application
developers to note and fix the bug.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When commit 6adba52742 (KVM: Let host know whether the guest can
handle async PF in non-userspace context.) is introduced, actually
bit 2 still is reserved and should be zero. Instead, bit 1 is 1 to
indicate if asynchronous page faults can be injected when vcpu is
in cpl == 0, and also please see this,
in the file kvm_para.h, #define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1).
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Changes for KVM for arm/arm64 for 3.18
This includes a bunch of changes:
- Support read-only memory slots on arm/arm64
- Various changes to fix Sparse warnings
- Correctly detect write vs. read Stage-2 faults
- Various VGIC cleanups and fixes
- Dynamic VGIC data strcuture sizing
- Fix SGI set_clear_pend offset bug
- Fix VTTBR_BADDR Mask
- Correctly report the FSC on Stage-2 faults
Conflicts:
virt/kvm/eventfd.c
[duplicate, different patch where the kvm-arm version broke x86.
The kvm tree instead has the right one]
This was missed in respective one_reg implementation patch.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
In order to make the number of interrupts configurable, use the new
fancy device management API to add KVM_DEV_ARM_VGIC_GRP_NR_IRQS as
a VGIC configurable attribute.
Userspace can now specify the exact size of the GIC (by increments
of 32 interrupts).
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
vcpu exits and memslot mutations can run concurrently as long as the
vcpu does not aquire the slots mutex. Thus it is theoretically possible
for memslots to change underneath a vcpu that is handling an exit.
If we increment the memslot generation number again after
synchronize_srcu_expedited(), vcpus can safely cache memslot generation
without maintaining a single rcu_dereference through an entire vm exit.
And much of the x86/kvm code does not maintain a single rcu_dereference
of the current memslots during each exit.
We can prevent the following case:
vcpu (CPU 0) | thread (CPU 1)
--------------------------------------------+--------------------------
1 vm exit |
2 srcu_read_unlock(&kvm->srcu) |
3 decide to cache something based on |
old memslots |
4 | change memslots
| (increments generation)
5 | synchronize_srcu(&kvm->srcu);
6 retrieve generation # from new memslots |
7 tag cache with new memslot generation |
8 srcu_read_unlock(&kvm->srcu) |
... |
<action based on cache occurs even |
though the caching decision was based |
on the old memslots> |
... |
<action *continues* to occur until next |
memslot generation change, which may |
be never> |
|
By incrementing the generation after synchronizing with kvm->srcu readers,
we ensure that the generation retrieved in (6) will become invalid soon
after (8).
Keeping the existing increment is not strictly necessary, but we
do keep it and just move it for consistency from update_memslots to
install_new_memslots. It invalidates old cached MMIOs immediately,
instead of having to wait for the end of synchronize_srcu_expedited,
which makes the code more clearly correct in case CPU 1 is preempted
right after synchronize_srcu() returns.
To avoid halving the generation space in SPTEs, always presume that the
low bit of the generation is zero when reconstructing a generation number
out of an SPTE. This effectively disables MMIO caching in SPTEs during
the call to synchronize_srcu_expedited. Using the low bit this way is
somewhat like a seqcount---where the protected thing is a cache, and
instead of retrying we can simply punt if we observe the low bit to be 1.
Cc: stable@vger.kernel.org
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>