The original implementation of vq_present() relied on aggressive
inlining in order for the compiler to know that the code is
correct, due to some const-casting issues. This was causing sparse
and clang to complain, while GCC compiled cleanly.
Commit 0c529ff789 addressed this problem, but since vq_present()
is no longer a function, there is now no implicit casting of the
returned value to the return type (bool).
In set_sve_vls(), this uncast bit value is compared against a bool,
and so may spuriously compare as unequal when both are nonzero. As
a result, KVM may reject valid SVE vector length configurations as
invalid, and vice versa.
Fix it by forcing the returned value to a bool.
Signed-off-by: Zhang Lei <zhang.lei@jp.fujitsu.com>
Fixes: 0c529ff789 ("KVM: arm64: Implement vq_present() as a macro")
Signed-off-by: Dave Martin <Dave.Martin@arm.com> [commit message rewrite]
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Thomas reported that:
| Background:
|
| In preparation of supporting IPI shorthands I changed the CPU offline
| code to software disable the local APIC instead of just masking it.
| That's done by clearing the APIC_SPIV_APIC_ENABLED bit in the APIC_SPIV
| register.
|
| Failure:
|
| When the CPU comes back online the startup code triggers occasionally
| the warning in apic_pending_intr_clear(). That complains that the IRRs
| are not empty.
|
| The offending vector is the local APIC timer vector who's IRR bit is set
| and stays set.
|
| It took me quite some time to reproduce the issue locally, but now I can
| see what happens.
|
| It requires apicv_enabled=0, i.e. full apic emulation. With apicv_enabled=1
| (and hardware support) it behaves correctly.
|
| Here is the series of events:
|
| Guest CPU
|
| goes down
|
| native_cpu_disable()
|
| apic_soft_disable();
|
| play_dead()
|
| ....
|
| startup()
|
| if (apic_enabled())
| apic_pending_intr_clear() <- Not taken
|
| enable APIC
|
| apic_pending_intr_clear() <- Triggers warning because IRR is stale
|
| When this happens then the deadline timer or the regular APIC timer -
| happens with both, has fired shortly before the APIC is disabled, but the
| interrupt was not serviced because the guest CPU was in an interrupt
| disabled region at that point.
|
| The state of the timer vector ISR/IRR bits:
|
| ISR IRR
| before apic_soft_disable() 0 1
| after apic_soft_disable() 0 1
|
| On startup 0 1
|
| Now one would assume that the IRR is cleared after the INIT reset, but this
| happens only on CPU0.
|
| Why?
|
| Because our CPU0 hotplug is just for testing to make sure nothing breaks
| and goes through an NMI wakeup vehicle because INIT would send it through
| the boots-trap code which is not really working if that CPU was not
| physically unplugged.
|
| Now looking at a real world APIC the situation in that case is:
|
| ISR IRR
| before apic_soft_disable() 0 1
| after apic_soft_disable() 0 1
|
| On startup 0 0
|
| Why?
|
| Once the dying CPU reenables interrupts the pending interrupt gets
| delivered as a spurious interupt and then the state is clear.
|
| While that CPU0 hotplug test case is surely an esoteric issue, the APIC
| emulation is still wrong, Even if the play_dead() code would not enable
| interrupts then the pending IRR bit would turn into an ISR .. interrupt
| when the APIC is reenabled on startup.
From SDM 10.4.7.2 Local APIC State After It Has Been Software Disabled
* Pending interrupts in the IRR and ISR registers are held and require
masking or handling by the CPU.
In Thomas's testing, hardware cpu will not respect soft disable LAPIC
when IRR has already been set or APICv posted-interrupt is in flight,
so we can skip soft disable APIC checking when clearing IRR and set ISR,
continue to respect soft disable APIC when attempting to set IRR.
Reported-by: Rong Chen <rong.a.chen@intel.com>
Reported-by: Feng Tang <feng.tang@intel.com>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rong Chen <rong.a.chen@intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently KVM_STATE_NESTED_EVMCS is used to signal that eVMCS
capability is enabled on vCPU.
As indicated by vmx->nested.enlightened_vmcs_enabled.
This is quite bizarre as userspace VMM should make sure to expose
same vCPU with same CPUID values in both source and destination.
In case vCPU is exposed with eVMCS support on CPUID, it is also
expected to enable KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability.
Therefore, KVM_STATE_NESTED_EVMCS is redundant.
KVM_STATE_NESTED_EVMCS is currently used on restore path
(vmx_set_nested_state()) only to enable eVMCS capability in KVM
and to signal need_vmcs12_sync such that on next VMEntry to guest
nested_sync_from_vmcs12() will be called to sync vmcs12 content
into eVMCS in guest memory.
However, because restore nested-state is rare enough, we could
have just modified vmx_set_nested_state() to always signal
need_vmcs12_sync.
From all the above, it seems that we could have just removed
the usage of KVM_STATE_NESTED_EVMCS. However, in order to preserve
backwards migration compatibility, we cannot do that.
(vmx_get_nested_state() needs to signal flag when migrating from
new kernel to old kernel).
Returning KVM_STATE_NESTED_EVMCS when just vCPU have eVMCS enabled
have a bad side-effect of userspace VMM having to send nested-state
from source to destination as part of migration stream. Even if
guest have never used eVMCS as it doesn't even run a nested
hypervisor workload. This requires destination userspace VMM and
KVM to support setting nested-state. Which make it more difficult
to migrate from new host to older host.
To avoid this, change KVM_STATE_NESTED_EVMCS to signal eVMCS is
not only enabled but also active. i.e. Guest have made some
eVMCS active via an enlightened VMEntry. i.e. vmcs12 is copied
from eVMCS and therefore should be restored into eVMCS resident
in memory (by copy_vmcs12_to_enlightened()).
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maran Wilson <maran.wilson@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As comment in code specifies, SMM temporarily disables VMX so we cannot
be in guest mode, nor can VMLAUNCH/VMRESUME be pending.
However, code currently assumes that these are the only flags that can be
set on kvm_state->flags. This is not true as KVM_STATE_NESTED_EVMCS
can also be set on this field to signal that eVMCS should be enabled.
Therefore, fix code to check for guest-mode and pending VMLAUNCH/VMRESUME
explicitly.
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull powerpc fix from Michael Ellerman:
"One fix for a regression in my commit adding KUAP (Kernel User Access
Prevention) on Radix, which incorrectly touched the AMR in the early
machine check handler.
Thanks to Nicholas Piggin"
* tag 'powerpc-5.2-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s/exception: Fix machine check early corrupting AMR
Pull SMP fixes from Thomas Gleixner:
"Two small changes for the cpu hotplug code:
- Prevent out of bounds access which actually might crash the machine
caused by a missing bounds check in the fail injection code
- Warn about unsupported migitation mode command line arguments to
make people aware that they typoed the paramater. Not necessarily a
fix but quite some people tripped over that"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Fix out-of-bounds read when setting fail state
cpu/speculation: Warn on unsupported mitigations= parameter
Pull x86 fixes from Ingo Molnar:
"Misc fixes all over the place:
- might_sleep() atomicity fix in the microcode loader
- resctrl boundary condition fix
- APIC arithmethics bug fix for frequencies >= 4.2 GHz
- three 5-level paging crash fixes
- two speculation fixes
- a perf/stacktrace fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/unwind/orc: Fall back to using frame pointers for generated code
perf/x86: Always store regs->ip in perf_callchain_kernel()
x86/speculation: Allow guests to use SSBD even if host does not
x86/mm: Handle physical-virtual alignment mismatch in phys_p4d_init()
x86/boot/64: Add missing fixup_pointer() for next_early_pgt access
x86/boot/64: Fix crash if kernel image crosses page table boundary
x86/apic: Fix integer overflow on 10 bit left shift of cpu_khz
x86/resctrl: Prevent possible overrun during bitmap operations
x86/microcode: Fix the microcode load on CPU hotplug for real
Pull perf fixes from Ingo Molnar:
"Various fixes, most of them related to bugs perf fuzzing found in the
x86 code"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/regs: Use PERF_REG_EXTENDED_MASK
perf/x86: Remove pmu->pebs_no_xmm_regs
perf/x86: Clean up PEBS_XMM_REGS
perf/x86/regs: Check reserved bits
perf/x86: Disable extended registers for non-supported PMUs
perf/ioctl: Add check for the sample_period value
perf/core: Fix perf_sample_regs_user() mm check
Pull irq fixes from Ingo Molnar:
"Diverse irqchip driver fixes"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3-its: Fix command queue pointer comparison bug
irqchip/mips-gic: Use the correct local interrupt map registers
irqchip/ti-sci-inta: Fix kernel crash if irq_create_fwspec_mapping fail
irqchip/irq-csky-mpintc: Support auto irq deliver to all cpus
Pull EFI fixes from Ingo Molnar:
"Four fixes:
- fix a kexec crash on arm64
- fix a reboot crash on some Android platforms
- future-proof the code for upcoming ACPI 6.2 changes
- fix a build warning on x86"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efibc: Replace variable set function in notifier call
x86/efi: fix a -Wtype-limits compilation warning
efi/bgrt: Drop BGRT status field reserved bits check
efi/memreserve: deal with memreserve entries in unmapped memory
Pull power management fix from Rafael Wysocki:
"Avoid skipping bus-level PCI power management during system resume for
PCIe ports left in D0 during the preceding suspend transition on
platforms where the power states of those ports can change out of the
PCI layer's control"
* tag 'pm-5.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PCI: PM: Avoid skipping bus-level PM on platforms without ACPI
Pull XArray fixes from Matthew Wilcox:
- Account XArray nodes for the page cache to the appropriate cgroup
(Johannes Weiner)
- Fix idr_get_next() when called under the RCU lock (Matthew Wilcox)
- Add a test for xa_insert() (Matthew Wilcox)
* tag 'xarray-5.2-rc6' of git://git.infradead.org/users/willy/linux-dax:
XArray tests: Add check_insert
idr: Fix idr_get_next race with idr_remove
mm: fix page cache convergence regression
Merge misc fixes from Andrew Morton:
"15 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
linux/kernel.h: fix overflow for DIV_ROUND_UP_ULL
mm, swap: fix THP swap out
fork,memcg: alloc_thread_stack_node needs to set tsk->stack
MAINTAINERS: add CLANG/LLVM BUILD SUPPORT info
mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning
mm/page_idle.c: fix oops because end_pfn is larger than max_pfn
initramfs: fix populate_initrd_image() section mismatch
mm/oom_kill.c: fix uninitialized oc->constraint
mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge
mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails
signal: remove the wrong signal_pending() check in restore_user_sigmask()
fs/binfmt_flat.c: make load_flat_shared_library() work
mm/mempolicy.c: fix an incorrect rebind node in mpol_rebind_nodemask
fs/proc/array.c: allow reporting eip/esp for all coredumping threads
mm/dev_pfn: exclude MEMORY_DEVICE_PRIVATE while computing virtual address
Pull RISC-V fixes from Paul Walmsley:
"Minor RISC-V fixes and one defconfig update.
The fixes have no functional impact:
- Fix some comment text in the memory management vmalloc_fault path.
- Fix some warnings from the DT compiler in our newly-added DT files.
- Change the newly-added DT bindings such that SoC IP blocks with
external I/O are marked as "disabled" by default, then enable them
explicitly in board DT files when the devices are used on the
board. This aligns the bindings with existing upstream practice.
- Add the MIT license as an option for a minor header file, at the
request of one of the U-Boot maintainers.
The RISC-V defconfig update builds the SiFive SPI driver and the
MMC-SPI driver by default. The intention here is to make v5.2 more
usable for testers and users with RISC-V hardware"
* tag 'riscv-for-v5.2/fixes-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: mm: Fix code comment
dt-bindings: clock: sifive: add MIT license as an option for the header file
dt-bindings: riscv: resolve 'make dt_binding_check' warnings
riscv: dts: Re-organize the DT nodes
RISC-V: defconfig: enable MMC & SPI for RISC-V
Pull two more NFS client fixes from Anna Schumaker:
"These are both stable fixes.
One to calculate the correct client message length in the case of
partial transmissions. And the other to set the proper TCP timeout for
flexfiles"
* tag 'nfs-for-5.2-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS/flexfiles: Use the correct TCP timeout for flexfiles I/O
SUNRPC: Fix up calculation of client message length
Pull ceph fix from Ilya Dryomov:
"A small fix for a potential -rc1 regression from Jeff"
* tag 'ceph-for-5.2-rc7' of git://github.com/ceph/ceph-client:
ceph: fix ceph_mdsc_build_path to not stop on first component
Pull SCSI fix from James Bottomley:
"One simple fix for a driver use after free"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: vmw_pscsi: Fix use-after-free in pvscsi_queue_lck()
Pull block fixes from Jens Axboe:
"Just two small fixes.
One from Paolo, fixing a silly mistake in BFQ. The other one is from
me, ensuring that we have ->file cleared in the io_uring request a bit
earlier. That avoids a use-before-free, if we encounter an error
before ->file is assigned"
* tag 'for-linus-20190628' of git://git.kernel.dk/linux-block:
block, bfq: fix operator in BFQQ_TOTALLY_SEEKY
io_uring: ensure req->file is cleared on allocation
Pull pin control fixes from Linus Walleij:
"Sorry to bomb in fixes this late. Maybe I can comfort you by saying it
is only driver fixes, and mostly IRQ handling which is something GPIO
and pin control drivers never get right. You think it works and then
it doesn't.
Summary:
- Fix IRQ setup in the MCP23s08.
- Fix pin setup on pins > 31 in the Ocelot driver.
- Fix IRQs in the Mediatek driver"
* tag 'pinctrl-v5.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: mediatek: Update cur_mask in mask/mask ops
pinctrl: mediatek: Ignore interrupts that are wake only during resume
pinctrl: ocelot: fix pinmuxing for pins after 31
pinctrl: ocelot: fix gpio direction for pins after 31
pinctrl: mcp23s08: Fix add_data and irqchip_add_nested call order
0-Day test system reported some OOM regressions for several THP
(Transparent Huge Page) swap test cases. These regressions are bisected
to 6861428921 ("block: always define BIO_MAX_PAGES as 256"). In the
commit, BIO_MAX_PAGES is set to 256 even when THP swap is enabled. So the
bio_alloc(gfp_flags, 512) in get_swap_bio() may fail when swapping out
THP. That causes the OOM.
As in the patch description of 6861428921 ("block: always define
BIO_MAX_PAGES as 256"), THP swap should use multi-page bvec to write THP
to swap space. So the issue is fixed via doing that in get_swap_bio().
BTW: I remember I have checked the THP swap code when 6861428921
("block: always define BIO_MAX_PAGES as 256") was merged, and thought the
THP swap code needn't to be changed. But apparently, I was wrong. I
should have done this at that time.
Link: http://lkml.kernel.org/r/20190624075515.31040-1-ying.huang@intel.com
Fixes: 6861428921 ("block: always define BIO_MAX_PAGES as 256")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 5eed6f1dff ("fork,memcg: fix crash in free_thread_stack on
memcg charge fail") corrected two instances, but there was a third
instance of this bug.
Without setting tsk->stack, if memcg_charge_kernel_stack fails, it'll
execute free_thread_stack() on a dangling pointer.
Enterprise kernels are compiled with VMAP_STACK=y so this isn't
critical, but custom VMAP_STACK=n builds should have some performance
advantage, with the drawback of risking to fail fork because compaction
didn't succeed. So as long as VMAP_STACK=n is a supported option it's
worth fixing it upstream.
Link: http://lkml.kernel.org/r/20190619011450.28048-1-aarcange@redhat.com
Fixes: 9b6f7e163c ("mm: rework memcg kernel stack accounting")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>