linux

mirror of https://github.com/Dasharo/linux.git synced 2026-03-06 15:25:10 -08:00

Author	SHA1	Message	Date
Linus Torvalds	4aca98a8a1	Merge tag 'vfio-v6.13-rc1' of https://github.com/awilliam/linux-vfio Pull VFIO updates from Alex Williamson: - Constify an unmodified structure used in linking vfio and kvm (Christophe JAILLET) - Add ID for an additional hardware SKU supported by the nvgrace-gpu vfio-pci variant driver (Ankit Agrawal) - Fix incorrect signed cast in QAT vfio-pci variant driver, negating test in check_add_overflow(), though still caught by later tests (Giovanni Cabiddu) - Additional debugfs attributes exposed in hisi_acc vfio-pci variant driver for migration debugging (Longfang Liu) - Migration support is added to the virtio vfio-pci variant driver, becoming the primary feature of the driver while retaining emulation of virtio legacy support as a secondary option (Yishai Hadas) - Fixes to a few unwind flows in the mlx5 vfio-pci driver discovered through reviews of the virtio variant driver (Yishai Hadas) - Fix an unlikely issue where a PCI device exposed to userspace with an unknown capability at the base of the extended capability chain can overflow an array index (Avihai Horon) * tag 'vfio-v6.13-rc1' of https://github.com/awilliam/linux-vfio: vfio/pci: Properly hide first-in-list PCIe extended capability vfio/mlx5: Fix unwind flows in mlx5vf_pci_save/resume_device_data() vfio/mlx5: Fix an unwind issue in mlx5vf_add_migration_pages() vfio/virtio: Enable live migration once VIRTIO_PCI was configured vfio/virtio: Add PRE_COPY support for live migration vfio/virtio: Add support for the basic live migration functionality virtio-pci: Introduce APIs to execute device parts admin commands virtio: Manage device and driver capabilities via the admin commands virtio: Extend the admin command to include the result size virtio_pci: Introduce device parts access commands Documentation: add debugfs description for hisi migration hisi_acc_vfio_pci: register debugfs for hisilicon migration driver hisi_acc_vfio_pci: create subfunction for data reading hisi_acc_vfio_pci: extract public functions for container_of vfio/qat: fix overflow check in qat_vf_resume_write() vfio/nvgrace-gpu: Add a new GH200 SKU to the devid table kvm/vfio: Constify struct kvm_device_ops	2024-11-27 12:57:03 -08:00
Linus Torvalds	9f16d5e6f2	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "The biggest change here is eliminating the awful idea that KVM had of essentially guessing which pfns are refcounted pages. The reason to do so was that KVM needs to map both non-refcounted pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP VMAs that contain refcounted pages. However, the result was security issues in the past, and more recently the inability to map VM_IO and VM_PFNMAP memory that _is_ backed by struct page but is not refcounted. In particular this broke virtio-gpu blob resources (which directly map host graphics buffers into the guest as "vram" for the virtio-gpu device) with the amdgpu driver, because amdgpu allocates non-compound higher order pages and the tail pages could not be mapped into KVM. This requires adjusting all uses of struct page in the per-architecture code, to always work on the pfn whenever possible. The large series that did this, from David Stevens and Sean Christopherson, also cleaned up substantially the set of functions that provided arch code with the pfn for a host virtual addresses. The previous maze of twisty little passages, all different, is replaced by five functions (__gfn_to_page, __kvm_faultin_pfn, the non-__ versions of these two, and kvm_prefetch_pages) saving almost 200 lines of code. ARM: - Support for stage-1 permission indirection (FEAT_S1PIE) and permission overlays (FEAT_S1POE), including nested virt + the emulated page table walker - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call was introduced in PSCIv1.3 as a mechanism to request hibernation, similar to the S4 state in ACPI - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As part of it, introduce trivial initialization of the host's MPAM context so KVM can use the corresponding traps - PMU support under nested virtualization, honoring the guest hypervisor's trap configuration and event filtering when running a nested guest - Fixes to vgic ITS serialization where stale device/interrupt table entries are not zeroed when the mapping is invalidated by the VM - Avoid emulated MMIO completion if userspace has requested synchronous external abort injection - Various fixes and cleanups affecting pKVM, vCPU initialization, and selftests LoongArch: - Add iocsr and mmio bus simulation in kernel. - Add in-kernel interrupt controller emulation. - Add support for virtualization extensions to the eiointc irqchip. PPC: - Drop lingering and utterly obsolete references to PPC970 KVM, which was removed 10 years ago. - Fix incorrect documentation references to non-existing ioctls RISC-V: - Accelerate KVM RISC-V when running as a guest - Perf support to collect KVM guest statistics from host side s390: - New selftests: more ucontrol selftests and CPU model sanity checks - Support for the gen17 CPU model - List registers supported by KVM_GET/SET_ONE_REG in the documentation x86: - Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve documentation, harden against unexpected changes. Even if the hardware A/D tracking is disabled, it is possible to use the hardware-defined A/D bits to track if a PFN is Accessed and/or Dirty, and that removes a lot of special cases. - Elide TLB flushes when aging secondary PTEs, as has been done in x86's primary MMU for over 10 years. - Recover huge pages in-place in the TDP MMU when dirty page logging is toggled off, instead of zapping them and waiting until the page is re-accessed to create a huge mapping. This reduces vCPU jitter. - Batch TLB flushes when dirty page logging is toggled off. This reduces the time it takes to disable dirty logging by ~3x. - Remove the shrinker that was (poorly) attempting to reclaim shadow page tables in low-memory situations. - Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE. - Advertise CPUIDs for new instructions in Clearwater Forest - Quirk KVM's misguided behavior of initialized certain feature MSRs to their maximum supported feature set, which can result in KVM creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero value results in the vCPU having invalid state if userspace hides PDCM from the guest, which in turn can lead to save/restore failures. - Fix KVM's handling of non-canonical checks for vCPUs that support LA57 to better follow the "architecture", in quotes because the actual behavior is poorly documented. E.g. most MSR writes and descriptor table loads ignore CR4.LA57 and operate purely on whether the CPU supports LA57. - Bypass the register cache when querying CPL from kvm_sched_out(), as filling the cache from IRQ context is generally unsafe; harden the cache accessors to try to prevent similar issues from occuring in the future. The issue that triggered this change was already fixed in 6.12, but was still kinda latent. - Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM over-advertises SPEC_CTRL when trying to support cross-vendor VMs. - Minor cleanups - Switch hugepage recovery thread to use vhost_task. These kthreads can consume significant amounts of CPU time on behalf of a VM or in response to how the VM behaves (for example how it accesses its memory); therefore KVM tried to place the thread in the VM's cgroups and charge the CPU time consumed by that work to the VM's container. However the kthreads did not process SIGSTOP/SIGCONT, and therefore cgroups which had KVM instances inside could not complete freezing. Fix this by replacing the kthread with a PF_USER_WORKER thread, via the vhost_task abstraction. Another 100+ lines removed, with generally better behavior too like having these threads properly parented in the process tree. - Revert a workaround for an old CPU erratum (Nehalem/Westmere) that didn't really work; there was really nothing to work around anyway: the broken patch was meant to fix nested virtualization, but the PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the erratum. - Fix 6.12 regression where CONFIG_KVM will be built as a module even if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is 'y'. x86 selftests: - x86 selftests can now use AVX. Documentation: - Use rST internal links - Reorganize the introduction to the API document Generic: - Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead of RCU, so that running a vCPU on a different task doesn't encounter long due to having to wait for all CPUs become quiescent. In general both reads and writes are rare, but userspace that supports confidential computing is introducing the use of "helper" vCPUs that may jump from one host processor to another. Those will be very happy to trigger a synchronize_rcu(), and the effect on performance is quite the disaster" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits) KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL \|\| KVM_AMD KVM: x86: add back X86_LOCAL_APIC dependency Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()" KVM: x86: switch hugepage recovery thread to vhost_task KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest Documentation: KVM: fix malformed table irqchip/loongson-eiointc: Add virt extension support LoongArch: KVM: Add irqfd support LoongArch: KVM: Add PCHPIC user mode read and write functions LoongArch: KVM: Add PCHPIC read and write functions LoongArch: KVM: Add PCHPIC device support LoongArch: KVM: Add EIOINTC user mode read and write functions LoongArch: KVM: Add EIOINTC read and write functions LoongArch: KVM: Add EIOINTC device support LoongArch: KVM: Add IPI user mode read and write function LoongArch: KVM: Add IPI read and write function LoongArch: KVM: Add IPI device support LoongArch: KVM: Add iocsr and mmio bus simulation in kernel KVM: arm64: Pass on SVE mapping failures ...	2024-11-23 16:00:50 -08:00
Linus Torvalds	0f25f0e4ef	Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct fd' class updates from Al Viro: "The bulk of struct fd memory safety stuff Making sure that struct fd instances are destroyed in the same scope where they'd been created, getting rid of reassignments and passing them by reference, converting to CLASS(fd{,_pos,_raw}). We are getting very close to having the memory safety of that stuff trivial to verify" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits) deal with the last remaing boolean uses of fd_file() css_set_fork(): switch to CLASS(fd_raw, ...) memcg_write_event_control(): switch to CLASS(fd) assorted variants of irqfd setup: convert to CLASS(fd) do_pollfd(): convert to CLASS(fd) convert do_select() convert vfs_dedupe_file_range(). convert cifs_ioctl_copychunk() convert media_request_get_by_fd() convert spu_run(2) switch spufs_calls_{get,put}() to CLASS() use convert cachestat(2) convert do_preadv()/do_pwritev() fdget(), more trivial conversions fdget(), trivial conversions privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget() o2hb_region_dev_store(): avoid goto around fdget()/fdput() introduce "fd_pos" class, convert fdget_pos() users to it. fdget_raw() users: switch to CLASS(fd_raw) convert vmsplice() to CLASS(fd) ...	2024-11-18 12:24:06 -08:00
Paolo Bonzini	d96c77bd4e	KVM: x86: switch hugepage recovery thread to vhost_task kvm_vm_create_worker_thread() is meant to be used for kthreads that can consume significant amounts of CPU time on behalf of a VM or in response to how the VM behaves (for example how it accesses its memory). Therefore it wants to charge the CPU time consumed by that work to the VM's container. However, because of these threads, cgroups which have kvm instances inside never complete freezing. This can be trivially reproduced: root@test ~# mkdir /sys/fs/cgroup/test root@test ~# echo $$ > /sys/fs/cgroup/test/cgroup.procs root@test ~# qemu-system-x86_64 -nographic -enable-kvm and in another terminal: root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze root@test ~# cat /sys/fs/cgroup/test/cgroup.events populated 1 frozen 0 The cgroup freezing happens in the signal delivery path but kvm_nx_huge_page_recovery_worker, while joining non-root cgroups, never calls into the signal delivery path and thus never gets frozen. Because the cgroup freezer determines whether a given cgroup is frozen by comparing the number of frozen threads to the total number of threads in the cgroup, the cgroup never becomes frozen and users waiting for the state transition may hang indefinitely. Since the worker kthread is tied to a user process, it's better if it behaves similarly to user tasks as much as possible, including being able to send SIGSTOP and SIGCONT. In fact, vhost_task is all that kvm_vm_create_worker_thread() wanted to be and more: not only it inherits the userspace process's cgroups, it has other niceties like being parented properly in the process tree. Use it instead of the homegrown alternative. Incidentally, the new code is also better behaved when you flip recovery back and forth to disabled and back to enabled. If your recovery period is 1 minute, it will run the next recovery after 1 minute independent of how many times you flipped the parameter. (Commit message based on emails from Tejun). Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Luca Boccassi <bluca@debian.org> Acked-by: Tejun Heo <tj@kernel.org> Tested-by: Luca Boccassi <bluca@debian.org> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-11-14 13:20:04 -05:00
Paolo Bonzini	c59de14133	Merge tag 'kvm-x86-mmu-6.13' of https://github.com/kvm-x86/linux into HEAD KVM x86 MMU changes for 6.13 - Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve documentation, harden against unexpected changes, and to simplify A/D-disabled MMUs by using the hardware-defined A/D bits to track if a PFN is Accessed and/or Dirty. - Elide TLB flushes when aging SPTEs, as has been done in x86's primary MMU for over 10 years. - Batch TLB flushes when zapping collapsible TDP MMU SPTEs, i.e. when dirty logging is toggled off, which reduces the time it takes to disable dirty logging by ~3x. - Recover huge pages in-place in the TDP MMU instead of zapping the SP and waiting until the page is re-accessed to create a huge mapping. Proactively installing huge pages can reduce vCPU jitter in extreme scenarios. - Remove support for (poorly) reclaiming page tables in shadow MMUs via the primary MMU's shrinker interface.	2024-11-13 06:31:54 -05:00
Paolo Bonzini	b39d1578d5	Merge tag 'kvm-x86-generic-6.13' of https://github.com/kvm-x86/linux into HEAD KVM generic changes for 6.13 - Rework kvm_vcpu_on_spin() to use a single for-loop instead of making two partial poasses over "all" vCPUs. Opportunistically expand the comment to better explain the motivation and logic. - Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead of RCU, so that running a vCPU on a different task doesn't encounter long stalls due to having to wait for all CPUs become quiescent.	2024-11-13 06:24:19 -05:00
Paolo Bonzini	e3e0f9b7ae	Merge tag 'kvm-riscv-6.13-1' of https://github.com/kvm-riscv/linux into HEAD KVM/riscv changes for 6.13 - Accelerate KVM RISC-V when running as a guest - Perf support to collect KVM guest statistics from host side	2024-11-08 12:13:48 -05:00
Al Viro	66635b0776	assorted variants of irqfd setup: convert to CLASS(fd) in all of those failure exits prior to fdget() are plain returns and the only thing done after fdput() is (on failure exits) a kfree(), which can be done before fdput() just fine. NOTE: in acrn_irqfd_assign() 'fail:' failure exit is wrong for eventfd_ctx_fileget() failure (we only want fdput() there) and once we stop doing that, it doesn't need to check if eventfd is NULL or ERR_PTR(...) there. NOTE: in privcmd we move fdget() up before the allocation - more to the point, before the copy_from_user() attempt. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:07 -05:00
Al Viro	8152f82010	fdget(), more trivial conversions all failure exits prior to fdget() leave the scope, all matching fdput() are immediately followed by leaving the scope. [xfs_ioc_commit_range() chunk moved here as well] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Al Viro	6348be02ee	fdget(), trivial conversions fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Sean Christopherson	2ebbe0308c	KVM: Allow arch code to elide TLB flushes when aging a young page Add a Kconfig to allow architectures to opt-out of a TLB flush when a young page is aged, as invalidating TLB entries is not functionally required on most KVM-supported architectures. Stale TLB entries can result in false negatives and theoretically lead to suboptimal reclaim, but in practice all observations have been that the performance gained by skipping TLB flushes outweighs any performance lost by reclaiming hot pages. E.g. the primary MMUs for x86 RISC-V, s390, and PPC Book3S elide the TLB flush for ptep_clear_flush_young(), and arm64's MMU skips the trailing DSB that's required for ordering (presumably because there are optimizations related to eliding other TLB flushes when doing make-before-break). Link: https://lore.kernel.org/r/20241011021051.1557902-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-10-30 15:25:40 -07:00
Sean Christopherson	3e7f43188e	KVM: Protect vCPU's "last run PID" with rwlock, not RCU To avoid jitter on KVM_RUN due to synchronize_rcu(), use a rwlock instead of RCU to protect vcpu->pid, a.k.a. the pid of the task last used to a vCPU. When userspace is doing M:N scheduling of tasks to vCPUs, e.g. to run SEV migration helper vCPUs during post-copy, the synchronize_rcu() needed to change the PID associated with the vCPU can stall for hundreds of milliseconds, which is problematic for latency sensitive post-copy operations. In the directed yield path, do not acquire the lock if it's contended, i.e. if the associated PID is changing, as that means the vCPU's task is already running. Reported-by: Steve Rutherford <srutherford@google.com> Reviewed-by: Steve Rutherford <srutherford@google.com> Acked-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240802200136.329973-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-10-30 14:41:22 -07:00
Sean Christopherson	6cf9ef23d9	KVM: Return '0' directly when there's no task to yield to Do "return 0" instead of initializing and returning a local variable in kvm_vcpu_yield_to(), e.g. so that it's more obvious what the function returns if there is no task. No functional change intended. Acked-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240802200136.329973-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-10-30 14:29:31 -07:00
Sean Christopherson	7e513617da	KVM: Rework core loop of kvm_vcpu_on_spin() to use a single for-loop Rework kvm_vcpu_on_spin() to use a single for-loop instead of making "two" passes over all vCPUs. Given N=kvm->last_boosted_vcpu, the logic is to iterate from vCPU[N+1]..vcpu[N-1], i.e. using two loops is just a kludgy way of handling the wrap from the last vCPU to vCPU0 when a boostable vCPU isn't found in vcpu[N+1]..vcpu[MAX]. Open code the xa_load() instead of using kvm_get_vcpu() to avoid reading online_vcpus in every loop, as well as the accompanying smp_rmb(), i.e. make it a custom kvm_for_each_vcpu(), for all intents and purposes. Oppurtunistically clean up the comment explaining the logic. Link: https://lore.kernel.org/r/20240802202121.341348-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-10-30 14:27:51 -07:00
Christophe JAILLET	bbee049d8e	kvm/vfio: Constify struct kvm_device_ops 'struct kvm_device_ops' is not modified in this driver. Constifying this structure moves some data to a read-only section, so increases overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig: Before: ====== text data bss dec hex filename 2605 169 16 2790 ae6 virt/kvm/vfio.o After: ===== text data bss dec hex filename 2685 89 16 2790 ae6 virt/kvm/vfio.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/e7361a1bb7defbb0f7056b884e83f8d75ac9fe21.1727517084.git.christophe.jaillet@wanadoo.fr Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2024-10-30 13:31:26 -06:00
Sean Christopherson	8b15c3764c	KVM: Don't grab reference on VM_MIXEDMAP pfns that have a "struct page" Now that KVM no longer relies on an ugly heuristic to find its struct page references, i.e. now that KVM can't get false positives on VM_MIXEDMAP pfns, remove KVM's hack to elevate the refcount for pfns that happen to have a valid struct page. In addition to removing a long-standing wart in KVM, this allows KVM to map non-refcounted struct page memory into the guest, e.g. for exposing GPU TTM buffers to KVM guests. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-86-seanjc@google.com>	2024-10-25 13:01:35 -04:00
Sean Christopherson	93b7da404f	KVM: Drop APIs that manipulate "struct page" via pfns Remove all kvm_{release,set}_pfn_*() APIs now that all users are gone. No functional change intended. Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-85-seanjc@google.com>	2024-10-25 13:01:35 -04:00
Sean Christopherson	31fccdd212	KVM: Make kvm_follow_pfn.refcounted_page a required field Now that the legacy gfn_to_pfn() APIs are gone, and all callers of hva_to_pfn() pass in a refcounted_page pointer, make it a required field to ensure all future usage in KVM plays nice. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-82-seanjc@google.com>	2024-10-25 13:01:35 -04:00
Sean Christopherson	06cdaff80e	KVM: Drop gfn_to_pfn() APIs now that all users are gone Drop gfn_to_pfn() and all its variants now that all users are gone. No functional change intended. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-80-seanjc@google.com>	2024-10-25 13:01:35 -04:00
Sean Christopherson	f42e289a20	KVM: Add support for read-only usage of gfn_to_page() Rework gfn_to_page() to support read-only accesses so that it can be used by arm64 to get MTE tags out of guest memory. Opportunistically rewrite the comment to be even more stern about using gfn_to_page(), as there are very few scenarios where requiring a struct page is actually the right thing to do (though there are such scenarios). Add a FIXME to call out that KVM probably should be pinning pages, not just getting pages. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-77-seanjc@google.com>	2024-10-25 13:00:50 -04:00
Sean Christopherson	ce6bf70346	KVM: Convert gfn_to_page() to use kvm_follow_pfn() Convert gfn_to_page() to the new kvm_follow_pfn() internal API, which will eventually allow removing gfn_to_pfn() and kvm_pfn_to_refcounted_page(). Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-76-seanjc@google.com>	2024-10-25 13:00:49 -04:00
Sean Christopherson	1fbee5b01a	KVM: guest_memfd: Provide "struct page" as output from kvm_gmem_get_pfn() Provide the "struct page" associated with a guest_memfd pfn as an output from __kvm_gmem_get_pfn() so that KVM guest page fault handlers can directly put the page instead of having to rely on kvm_pfn_to_refcounted_page(). Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-47-seanjc@google.com>	2024-10-25 13:00:47 -04:00
Sean Christopherson	4af18dc6a9	KVM: guest_memfd: Pass index, not gfn, to __kvm_gmem_get_pfn() Refactor guest_memfd usage of __kvm_gmem_get_pfn() to pass the index into the guest_memfd file instead of the gfn, i.e. resolve the index based on the slot+gfn in the caller instead of in __kvm_gmem_get_pfn(). This will allow kvm_gmem_get_pfn() to retrieve and return the specific "struct page", which requires the index into the folio, without a redoing the index calculation multiple times (which isn't costly, just hard to follow). Opportunistically add a kvm_gmem_get_index() helper to make the copy+pasted code easier to understand. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-46-seanjc@google.com>	2024-10-25 13:00:47 -04:00
Sean Christopherson	1c7b627e93	KVM: Add kvm_faultin_pfn() to specifically service guest page faults Add a new dedicated API, kvm_faultin_pfn(), for servicing guest page faults, i.e. for getting pages/pfns that will be mapped into the guest via an mmu_notifier-protected KVM MMU. Keep struct kvm_follow_pfn buried in internal code, as having __kvm_faultin_pfn() take "out" params is actually cleaner for several architectures, e.g. it allows the caller to have its own "page fault" structure without having to marshal data to/from kvm_follow_pfn. Long term, common KVM would ideally provide a kvm_page_fault structure, a la x86's struct of the same name. But all architectures need to be converted to a common API before that can happen. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-44-seanjc@google.com>	2024-10-25 13:00:47 -04:00
Sean Christopherson	68e51d0a43	KVM: Disallow direct access (w/o mmu_notifier) to unpinned pfn by default Add an off-by-default module param to control whether or not KVM is allowed to map memory that isn't pinned, i.e. that KVM can't guarantee won't be freed while it is mapped into KVM and/or the guest. Don't remove the functionality entirely, as there are use cases where mapping unpinned memory is safe (as defined by the platform owner), e.g. when memory is hidden from the kernel and managed by userspace, in which case userspace is already fully trusted to not muck with guest memory mappings. But for more typical setups, mapping unpinned memory is wildly unsafe, and unnecessary. The APIs are used exclusively by x86's nested virtualization support, and there is no known (or sane) use case for mapping PFN-mapped memory a KVM guest _and_ letting the guest use it for virtualization structures. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-36-seanjc@google.com>	2024-10-25 12:59:08 -04:00

1 2 3 4 5 ...

2642 Commits