gfn_to_memslot() can return NULL or invalid slot. We need to check slot
validity before accessing it.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
In commit e935b8372c ("KVM: Convert kvm_lock to raw_spinlock"),
the kvm_lock was made a raw lock. However, the kvm mmu_shrink()
function tries to grab the (non-raw) mmu_lock within the scope of
the raw locked kvm_lock being held. This leads to the following:
BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
in_atomic(): 1, irqs_disabled(): 0, pid: 55, name: kswapd0
Preemption disabled at:[<ffffffffa0376eac>] mmu_shrink+0x5c/0x1b0 [kvm]
Pid: 55, comm: kswapd0 Not tainted 3.4.34_preempt-rt
Call Trace:
[<ffffffff8106f2ad>] __might_sleep+0xfd/0x160
[<ffffffff817d8d64>] rt_spin_lock+0x24/0x50
[<ffffffffa0376f3c>] mmu_shrink+0xec/0x1b0 [kvm]
[<ffffffff8111455d>] shrink_slab+0x17d/0x3a0
[<ffffffff81151f00>] ? mem_cgroup_iter+0x130/0x260
[<ffffffff8111824a>] balance_pgdat+0x54a/0x730
[<ffffffff8111fe47>] ? set_pgdat_percpu_threshold+0xa7/0xd0
[<ffffffff811185bf>] kswapd+0x18f/0x490
[<ffffffff81070961>] ? get_parent_ip+0x11/0x50
[<ffffffff81061970>] ? __init_waitqueue_head+0x50/0x50
[<ffffffff81118430>] ? balance_pgdat+0x730/0x730
[<ffffffff81060d2b>] kthread+0xdb/0xe0
[<ffffffff8106e122>] ? finish_task_switch+0x52/0x100
[<ffffffff817e1e94>] kernel_thread_helper+0x4/0x10
[<ffffffff81060c50>] ? __init_kthread_worker+0x
After the previous patch, kvm_lock need not be a raw spinlock anymore,
so change it back.
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: kvm@vger.kernel.org
Cc: gleb@redhat.com
Cc: jan.kiszka@siemens.com
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
'.done' is used to mark the completion of 'async_pf_execute()', but
'cancel_work_sync()' returns true when the work was canceled, so we
use it instead.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When we cancel 'async_pf_execute()', we should behave as if the work was
never scheduled in 'kvm_setup_async_pf()'.
Fixes a bug when we can't unload module because the vm wasn't destroyed.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Page tables in a read-only memory slot will currently cause a triple
fault because the page walker uses gfn_to_hva and it fails on such a slot.
OVMF uses such a page table; however, real hardware seems to be fine with
that as long as the accessed/dirty bits are set. Save whether the slot
is readonly, and later check it when updating the accessed and dirty bits.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull vfs pile 1 from Al Viro:
"Unfortunately, this merge window it'll have a be a lot of small piles -
my fault, actually, for not keeping #for-next in anything that would
resemble a sane shape ;-/
This pile: assorted fixes (the first 3 are -stable fodder, IMO) and
cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last
components) + several long-standing patches from various folks.
There definitely will be a lot more (starting with Miklos'
check_submount_and_drop() series)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
direct-io: Handle O_(D)SYNC AIO
direct-io: Implement generic deferred AIO completions
add formats for dentry/file pathnames
kvm eventfd: switch to fdget
powerpc kvm: use fdget
switch fchmod() to fdget
switch epoll_ctl() to fdget
switch copy_module_from_fd() to fdget
git simplify nilfs check for busy subtree
ibmasmfs: don't bother passing superblock when not needed
don't pass superblock to hypfs_{mkdir,create*}
don't pass superblock to hypfs_diag_create_files
don't pass superblock to hypfs_vm_create_files()
oprofile: get rid of pointless forward declarations of struct super_block
oprofilefs_create_...() do not need superblock argument
oprofilefs_mkdir() doesn't need superblock argument
don't bother with passing superblock to oprofile_create_stats_files()
oprofile: don't bother with passing superblock to ->create_files()
don't bother passing sb to oprofile_create_files()
coh901318: don't open-code simple_read_from_buffer()
...
For bytemaps each IRQ field is 1 byte wide, so we pack 4 irq fields in
one word and since there are 32 private (per cpu) irqs, we have 8
private u32 fields on the vgic_bytemap struct. We shift the offset from
the base of the register group right by 2, giving us the word index
instead of the field index. But then there are 8 private words, not 4,
which is also why we subtract 8 words from the offset of the shared
words.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
All the code in handle_mmio_cfg_reg() assumes the offset has
been shifted right to accomodate for the 2:1 bit compression,
but this is only done when getting the register address.
Shift the offset early so the code works mostly unchanged.
Reported-by: Zhaobo (Bob, ERC) <zhaobo@huawei.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
vgic_get_target_reg is quite complicated, for no good reason.
Actually, it is fairly easy to write it in a much more efficient
way by using the target CPU array instead of the bitmap.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
This is the type-safe comparison function, so the double-underscore is
not related.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
The checks on PG_reserved in the page structure on head and tail pages
aren't necessary because split_huge_page wouldn't transfer the
PG_reserved bit from head to tail anyway.
This was a forward-thinking check done in the case PageReserved was
set by a driver-owned page mapped in userland with something like
remap_pfn_range in a VM_PFNMAP region, but using hugepmds (not
possible right now). It was meant to be very safe, but it's overkill
as it's unlikely split_huge_page could ever run without the driver
noticing and tearing down the hugepage itself.
And if a driver in the future will really want to map a reserved
hugepage in userland using an huge pmd it should simply take care of
marking all subpages reserved too to keep KVM safe. This of course
would require such a hypothetical driver to tear down the huge pmd
itself and splitting the hugepage itself, instead of relaying on
split_huge_page, but that sounds very reasonable, especially
considering split_huge_page wouldn't currently transfer the reserved
bit anyway.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
KVM uses anon_inode_get() to allocate file descriptors as part
of some of its ioctls. But those ioctls are lacking a flag argument
allowing userspace to choose options for the newly opened file descriptor.
In such case it's advised to use O_CLOEXEC by default so that
userspace is allowed to choose, without race, if the file descriptor
is going to be inherited across exec().
This patch set O_CLOEXEC flag on all file descriptors created
with anon_inode_getfd() to not leak file descriptors across exec().
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://lkml.kernel.org/r/cover.1377372576.git.ydroneaud@opteya.com
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
kvm_io_bus_sort_cmp is used also directly, not just as a callback for
sort and bsearch. In these cases, it is handy to have a type-safe
variant. This patch introduces such a variant, __kvm_io_bus_sort_cmp,
and uses it throughout kvm_main.c.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is called right after the memslots is updated, i.e. when the result
of update_memslots() gets installed in install_new_memslots(). Since
the memslots needs to be updated twice when we delete or move a memslot,
kvm_arch_commit_memory_region() does not correspond to this exactly.
In the following patch, x86 will use this new API to check if the mmio
generation has reached its maximum value, in which case mmio sptes need
to be flushed out.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Acked-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add new functions kvm_io_bus_{read,write}_cookie() that allows users of
the kvm io infrastructure to use a cookie value to speed up lookup of a
device on an io bus.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Pull KVM fixes from Paolo Bonzini:
"On the x86 side, there are some optimizations and documentation
updates. The big ARM/KVM change for 3.11, support for AArch64, will
come through Catalin Marinas's tree. s390 and PPC have misc cleanups
and bugfixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (87 commits)
KVM: PPC: Ignore PIR writes
KVM: PPC: Book3S PR: Invalidate SLB entries properly
KVM: PPC: Book3S PR: Allow guest to use 1TB segments
KVM: PPC: Book3S PR: Don't keep scanning HPTEG after we find a match
KVM: PPC: Book3S PR: Fix invalidation of SLB entry 0 on guest entry
KVM: PPC: Book3S PR: Fix proto-VSID calculations
KVM: PPC: Guard doorbell exception with CONFIG_PPC_DOORBELL
KVM: Fix RTC interrupt coalescing tracking
kvm: Add a tracepoint write_tsc_offset
KVM: MMU: Inform users of mmio generation wraparound
KVM: MMU: document fast invalidate all mmio sptes
KVM: MMU: document fast invalidate all pages
KVM: MMU: document fast page fault
KVM: MMU: document mmio page fault
KVM: MMU: document write_flooding_count
KVM: MMU: document clear_spte_count
KVM: MMU: drop kvm_mmu_zap_mmio_sptes
KVM: MMU: init kvm generation close to mmio wrap-around value
KVM: MMU: add tracepoint for check_mmio_spte
KVM: MMU: fast invalidate all mmio sptes
...
KVM/ARM pull request for 3.11 merge window
* tag 'kvm-arm-3.11' of git://git.linaro.org/people/cdall/linux-kvm-arm.git:
ARM: kvm: don't include drivers/virtio/Kconfig
Update MAINTAINERS: KVM/ARM work now funded by Linaro
arm/kvm: Cleanup KVM_ARM_MAX_VCPUS logic
ARM: KVM: clear exclusive monitor on all exception returns
ARM: KVM: add missing dsb before invalidating Stage-2 TLBs
ARM: KVM: perform save/restore of PAR
ARM: KVM: get rid of S2_PGD_SIZE
ARM: KVM: don't special case PC when doing an MMIO
ARM: KVM: use phys_addr_t instead of unsigned long long for HYP PGDs
ARM: KVM: remove dead prototype for __kvm_tlb_flush_vmid
ARM: KVM: Don't handle PSCI calls via SMC
ARM: KVM: Allow host virt timer irq to be different from guest timer virt irq
This reverts most of the f1ed0450a5. After
the commit kvm_apic_set_irq() no longer returns accurate information
about interrupt injection status if injection is done into disabled
APIC. RTC interrupt coalescing tracking relies on the information to be
accurate and cannot recover if it is not.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
The arch_timer irq numbers (or PPI numbers) are implementation dependent,
so the host virtual timer irq number can be different from guest virtual
timer irq number.
This patch ensures that host virtual timer irq number is read from DTB and
guest virtual timer irq is determined based on vcpu target type.
Signed-off-by: Anup Patel <anup.patel@linaro.org>
Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Signed-off-by: Christoffer Dall <cdall@cs.columbia.edu>
We can easily reach the 1000 limit by start VM with a couple
hundred I/O devices (multifunction=on). The hardcode limit
already been adjusted 3 times (6 ~ 200 ~ 300 ~ 1000).
In userspace, we already have maximum file descriptor to
limit ioeventfd count. But kvm_io_bus devices also are used
for pit, pic, ioapic, coalesced_mmio. They couldn't be limited
by maximum file descriptor.
Currently only ioeventfds take too much kvm_io_bus devices,
so just exclude it from counting kvm_io_range limit.
Also fixed one indent issue in kvm_host.h
Signed-off-by: Amos Kong <akong@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
As KVM/arm64 is looming on the horizon, it makes sense to move some
of the common code to a single location in order to reduce duplication.
The code could live anywhere. Actually, most of KVM is already built
with a bunch of ugly ../../.. hacks in the various Makefiles, so we're
not exactly talking about style here. But maybe it is time to start
moving into a less ugly direction.
The include files must be in a "public" location, as they are accessed
from non-KVM files (arch/arm/kernel/asm-offsets.c).
For this purpose, introduce two new locations:
- virt/kvm/arm/ : x86 and ia64 already share the ioapic code in
virt/kvm, so this could be seen as a (very ugly) precedent.
- include/kvm/ : there is already an include/xen, and while the
intent is slightly different, this seems as good a location as
any
Eventually, we should probably have independant Makefiles at every
levels (just like everywhere else in the kernel), but this is just
the first step.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Since the arrival of posted interrupt support we can no longer guarantee
that coalesced IRQs are always reported to the IRQ source. Moreover,
accumulated APIC timer events could cause a busy loop when a VCPU should
rather be halted. The consensus is to remove coalesced tracking from the
LAPIC.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>