In commit 11f1a4b975 ("x86: reorganize SMAP handling in user space
accesses") I changed how the stac/clac instructions were generated
around the user space accesses, which then made it possible to do
batched accesses efficiently for user string copies etc.
However, in doing so, I completely spaced out, and didn't even think
about the 32-bit case. And nobody really even seemed to notice, because
SMAP doesn't even exist until modern Skylake processors, and you'd have
to be crazy to run 32-bit kernels on a modern CPU.
Which brings us to Andy Lutomirski.
He actually tested the 32-bit kernel on new hardware, and noticed that
it doesn't work. My bad. The trivial fix is to add the required
uaccess begin/end markers around the raw accesses in <asm/uaccess_32.h>.
I feel a bit bad about this patch, just because that header file really
should be cleaned up to avoid all the duplicated code in it, and this
commit just expands on the problem. But this just fixes the bug without
any bigger cleanup surgery.
Reported-and-tested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull xen bug fixes from David Vrabel:
- Two scsiback fixes (resource leak and spurious warning).
- Fix DMA mapping of compound pages on arm/arm64.
- Fix some pciback regressions in MSI-X handling.
- Fix a pcifront crash due to some uninitialize state.
* tag 'for-linus-4.5-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/pcifront: Fix mysterious crashes when NUMA locality information was extracted.
xen/pcifront: Report the errors better.
xen/pciback: Save the number of MSI-X entries to be copied later.
xen/pciback: Check PF instead of VF for PCI_COMMAND_MEMORY
xen: fix potential integer overflow in queue_reply
xen/arm: correctly handle DMA mapping of compound pages
xen/scsiback: avoid warnings when adding multiple LUNs to a domain
xen/scsiback: correct frontend counting
Pull x86 fixes from Ingo Molnar:
"This is unusually large, partly due to the EFI fixes that prevent
accidental deletion of EFI variables through efivarfs that may brick
machines. These fixes are somewhat involved to maintain compatibility
with existing install methods and other usage modes, while trying to
turn off the 'rm -rf' bricking vector.
Other fixes are for large page ioremap()s and for non-temporal
user-memcpy()s"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Fix vmalloc_fault() to handle large pages properly
hpet: Drop stale URLs
x86/uaccess/64: Handle the caching of 4-byte nocache copies properly in __copy_user_nocache()
x86/uaccess/64: Make the __copy_user_nocache() assembly code more readable
lib/ucs2_string: Correct ucs2 -> utf8 conversion
efi: Add pstore variables to the deletion whitelist
efi: Make efivarfs entries immutable by default
efi: Make our variable validation list include the guid
efi: Do variable name validation tests in utf8
efi: Use ucs2_as_utf8 in efivarfs instead of open coding a bad version
lib/ucs2_string: Add ucs2 -> utf8 helper functions
Pull perf fixes from Ingo Molnar:
"A handful of CPU hotplug related fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Plug potential memory leak in CPU_UP_PREPARE
perf/core: Remove the bogus and dangerous CPU_DOWN_FAILED hotplug state
perf/core: Remove bogus UP_CANCELED hotplug state
perf/x86/amd/uncore: Plug reference leak
Merge fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: slab: free kmem_cache_node after destroy sysfs file
ipc/shm: handle removed segments gracefully in shm_mmap()
MAINTAINERS: update Kselftest Framework mailing list
devm_memremap_release(): fix memremap'd addr handling
mm/hugetlb.c: fix incorrect proc nr_hugepages value
mm, x86: fix pte_page() crash in gup_pte_range()
fsnotify: turn fsnotify reaper thread into a workqueue job
Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread"
mm: fix regression in remap_file_pages() emulation
thp, dax: do not try to withdraw pgtable from non-anon VMA
Pull livepatching fixes from Jiri Kosina:
- regression (from 4.4) fix for ordering issue, introduced by an
earlier ftrace change, that broke live patching of modules.
The fix replaces the ftrace module notifier by direct call in order
to make the ordering guaranteed and well-defined. The patch, from
Jessica Yu, has been acked both by Steven and Rusty
- error message fix from Miroslav Benes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
ftrace/module: remove ftrace module notifier
livepatch: change the error message in asm/livepatch.h header files
Commit 3565fce3a6 ("mm, x86: get_user_pages() for dax mappings") has
moved up the pte_page(pte) in x86's fast gup_pte_range(), for no
discernible reason: put it back where it belongs, after the pte_flags
check and the pfn_valid cross-check.
That may be the cause of the NULL pointer dereference in
gup_pte_range(), seen when vfio called vaddr_get_pfn() when starting a
qemu-kvm based VM.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Michael Long <Harn-Solo@gmx.de>
Tested-by: Michael Long <Harn-Solo@gmx.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A kernel page fault oops with the callstack below was observed
when a read syscall was made to a pmem device after a huge amount
(>512GB) of vmalloc ranges was allocated by ioremap() on a x86_64
system:
BUG: unable to handle kernel paging request at ffff880840000ff8
IP: vmalloc_fault+0x1be/0x300
PGD c7f03a067 PUD 0
Oops: 0000 [#1] SM
Call Trace:
__do_page_fault+0x285/0x3e0
do_page_fault+0x2f/0x80
? put_prev_entity+0x35/0x7a0
page_fault+0x28/0x30
? memcpy_erms+0x6/0x10
? schedule+0x35/0x80
? pmem_rw_bytes+0x6a/0x190 [nd_pmem]
? schedule_timeout+0x183/0x240
btt_log_read+0x63/0x140 [nd_btt]
:
? __symbol_put+0x60/0x60
? kernel_read+0x50/0x80
SyS_finit_module+0xb9/0xf0
entry_SYSCALL_64_fastpath+0x1a/0xa4
Since v4.1, ioremap() supports large page (pud/pmd) mappings in
x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
range is limited to pte mappings.
vmalloc faults do not normally happen in ioremap'd ranges since
ioremap() sets up the kernel page tables, which are shared by
user processes. pgd_ctor() sets the kernel's PGD entries to
user's during fork(). When allocation of the vmalloc ranges
crosses a 512GB boundary, ioremap() allocates a new pud table
and updates the kernel PGD entry to point it. If user process's
PGD entry does not have this update yet, a read/write syscall
to the range will cause a vmalloc fault, which hits the Oops
above as it does not handle a large page properly.
Following changes are made to vmalloc_fault().
64-bit:
- No change for the PGD sync operation as it handles large
pages already.
- Add pud_huge() and pmd_huge() to the validation code to
handle large pages.
- Change pud_page_vaddr() to pud_pfn() since an ioremap range
is not directly mapped (while the if-statement still works
with a bogus addr).
- Change pmd_page() to pmd_pfn() since an ioremap range is not
backed by struct page (while the if-statement still works
with a bogus addr).
32-bit:
- No change for the sync operation since the index3 PGD entry
covers the entire vmalloc range, which is always valid.
(A separate change to sync PGD entry is necessary if this
memory layout is changed regardless of the page size.)
- Add pmd_huge() to the validation code to handle large pages.
This is for completeness since vmalloc_fault() won't happen
in ioremap'd ranges as its PGD entry is always valid.
Reported-by: Henning Schild <henning.schild@siemens.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: <stable@vger.kernel.org> # 4.1+
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-mm@kvack.org
Cc: linux-nvdimm@lists.01.org
Link: http://lkml.kernel.org/r/1455758214-24623-1-git-send-email-toshi.kani@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The messages should be different depending on the type of error.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Pull x86 fixes from Thomas Gleixner:
"Two small fixlets for x86:
- Prevent a KASAN false positive in thread_saved_pc()
- Fix a 32-bit truncation problem in the x86 numa code"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/numa: Fix 32-bit memblock range truncation bug on 32-bit NUMA kernels
x86: Fix KASAN false positives in thread_saved_pc()
The following commit:
a0acda9172 ("acpi, numa, mem_hotplug: mark all nodes the kernel resides un-hotpluggable")
Introduced numa_clear_kernel_node_hotplug(), which function is executed
during early bootup, and which marks all currently reserved memblock
regions as hot-memory-unswappable as well.
y14sg1 <y14sg1@comcast.net> reported that when running 32-bit NUMA kernels,
the grsecurity/PAX kernel patch flagged a size overflow in this function:
PAX: size overflow detected in function x86_numa_init arch/x86/mm/numa.c:691 [...]
... the reason for the overflow is that memblock_clear_hotplug() takes physical
addresses as arguments, while the start/end variables used by
numa_clear_kernel_node_hotplug() are 'unsigned long', which is 32-bit on PAE
kernels, but which has 64-bit physical addresses.
So on 32-bit PAE kernels that have physical memory above the 4GB boundary,
we truncate a 64-bit physical address range to 32 bits and pass it to
memblock_clear_hotplug(), which at minimum prevents the original memory-hotplug
bugfix from working, but might have other side effects as well.
The fix is to use the proper type to handle physical addresses, phys_addr_t.
Reported-by: y14sg1 <y14sg1@comcast.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Chen Tang <imtangchen@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 944d9fec8d ("hugetlb: add support for gigantic page allocation
at runtime") has added the runtime gigantic page allocation via
alloc_contig_range(), making this support available only when CONFIG_CMA
is enabled. Because it doesn't depend on MIGRATE_CMA pageblocks and the
associated infrastructure, it is possible with few simple adjustments to
require only CONFIG_MEMORY_ISOLATION instead of full CONFIG_CMA.
After this patch, alloc_contig_range() and related functions are
available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION
enabled. Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION. This allows
supporting runtime gigantic pages without the CMA-specific checks in
page allocator fastpaths.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
thread_saved_pc() reads stack of a potentially running task.
This can cause false KASAN stack-out-of-bounds reports,
because the running task concurrently poisons and unpoisons
own stack.
The same happens in get_wchan(), and get get_wchan() was fixed
by using READ_ONCE_NOCHECK(). Do the same here.
Example KASAN report triggered by sysrq-t:
BUG: KASAN: out-of-bounds in sched_show_task+0x306/0x3b0 at addr ffff880043c97c18
Read of size 8 by task syz-executor/23839
[...]
page dumped because: kasan: bad access detected
[...]
Call Trace:
[<ffffffff8175ea0e>] __asan_report_load8_noabort+0x3e/0x40
[<ffffffff813e7a26>] sched_show_task+0x306/0x3b0
[<ffffffff813e7bf4>] show_state_filter+0x124/0x1a0
[<ffffffff82d2ca00>] fn_show_state+0x10/0x20
[<ffffffff82d2cf98>] k_spec+0xa8/0xe0
[<ffffffff82d3354f>] kbd_event+0xb9f/0x4000
[<ffffffff843ca8a7>] input_to_handler+0x3a7/0x4b0
[<ffffffff843d1954>] input_pass_values.part.5+0x554/0x6b0
[<ffffffff843d29bc>] input_handle_event+0x2ac/0x1070
[<ffffffff843d3a47>] input_inject_event+0x237/0x280
[<ffffffff843e8c28>] evdev_write+0x478/0x680
[<ffffffff817ac653>] __vfs_write+0x113/0x480
[<ffffffff817ae0e7>] vfs_write+0x167/0x4a0
[<ffffffff817b13d1>] SyS_write+0x111/0x220
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: glider@google.com
Cc: kasan-dev@googlegroups.com
Cc: kcc@google.com
Cc: linux-kernel@vger.kernel.org
Cc: ryabinin.a.a@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Thomas Gleixner:
"A bit on the largish side due to a series of fixes for a regression in
the x86 vector management which was introduced in 4.3. This work was
started in December already, but it took some time to fix all corner
cases and a couple of older bugs in that area which were detected
while at it
Aside of that a few platform updates for intel-mid, quark and UV and
two fixes for in the mm code:
- Use proper types for pgprot values to avoid truncation
- Prevent a size truncation in the pageattr code when setting page
attributes for large mappings"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/mm/pat: Avoid truncation when converting cpa->numpages to address
x86/mm: Fix types used in pgprot cacheability flags translations
x86/platform/quark: Print boundaries correctly
x86/platform/UV: Remove EFI memmap quirk for UV2+
x86/platform/intel-mid: Join string and fix SoC name
x86/platform/intel-mid: Enable 64-bit build
x86/irq: Plug vector cleanup race
x86/irq: Call irq_force_move_complete with irq descriptor
x86/irq: Remove outgoing CPU from vector cleanup mask
x86/irq: Remove the cpumask allocation from send_cleanup_vector()
x86/irq: Clear move_in_progress before sending cleanup IPI
x86/irq: Remove offline cpus from vector cleanup
x86/irq: Get rid of code duplication
x86/irq: Copy vectormask instead of an AND operation
x86/irq: Check vector allocation early
x86/irq: Reorganize the search in assign_irq_vector
x86/irq: Reorganize the return path in assign_irq_vector
x86/irq: Do not use apic_chip_data.old_domain as temporary buffer
x86/irq: Validate that irq descriptor is still active
x86/irq: Fix a race in x86_vector_free_irqs()
...
Pull perf fixes from Thomas Gleixner:
"This is much bigger than typical fixes, but Peter found a category of
races that spurred more fixes and more debugging enhancements. Work
started before the merge window, but got finished only now.
Aside of that this contains the usual small fixes to perf and tools.
Nothing particular exciting"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
perf: Remove/simplify lockdep annotation
perf: Synchronously clean up child events
perf: Untangle 'owner' confusion
perf: Add flags argument to perf_remove_from_context()
perf: Clean up sync_child_event()
perf: Robustify event->owner usage and SMP ordering
perf: Fix STATE_EXIT usage
perf: Update locking order
perf: Remove __free_event()
perf/bpf: Convert perf_event_array to use struct file
perf: Fix NULL deref
perf/x86: De-obfuscate code
perf/x86: Fix uninitialized value usage
perf: Fix race in perf_event_exit_task_context()
perf: Fix orphan hole
perf stat: Do not clean event's private stats
perf hists: Fix HISTC_MEM_DCACHELINE width setting
perf annotate browser: Fix behaviour of Shift-Tab with nothing focussed
perf tests: Remove wrong semicolon in while loop in CQM test
perf: Synchronously free aux pages in case of allocation failure
...
There are a couple of nasty truncation bugs lurking in the pageattr
code that can be triggered when mapping EFI regions, e.g. when we pass
a cpa->pgd pointer. Because cpa->numpages is a 32-bit value, shifting
left by PAGE_SHIFT will truncate the resultant address to 32-bits.
Viorel-Cătălin managed to trigger this bug on his Dell machine that
provides a ~5GB EFI region which requires 1236992 pages to be mapped.
When calling populate_pud() the end of the region gets calculated
incorrectly in the following buggy expression,
end = start + (cpa->numpages << PAGE_SHIFT);
And only 188416 pages are mapped. Next, populate_pud() gets invoked
for a second time because of the loop in __change_page_attr_set_clr(),
only this time no pages get mapped because shifting the remaining
number of pages (1048576) by PAGE_SHIFT is zero. At which point the
loop in __change_page_attr_set_clr() spins forever because we fail to
map progress.
Hitting this bug depends very much on the virtual address we pick to
map the large region at and how many pages we map on the initial run
through the loop. This explains why this issue was only recently hit
with the introduction of commit
a5caa209ba ("x86/efi: Fix boot crash by mapping EFI memmap
entries bottom-up at runtime, instead of top-down")
It's interesting to note that safe uses of cpa->numpages do exist in
the pageattr code. If instead of shifting ->numpages we multiply by
PAGE_SIZE, no truncation occurs because PAGE_SIZE is a UL value, and
so the result is unsigned long.
To avoid surprises when users try to convert very large cpa->numpages
values to addresses, change the data type from 'int' to 'unsigned
long', thereby making it suitable for shifting by PAGE_SHIFT without
any type casting.
The alternative would be to make liberal use of casting, but that is
far more likely to cause problems in the future when someone adds more
code and fails to cast properly; this bug was difficult enough to
track down in the first place.
Reported-and-tested-by: Viorel-Cătălin Răpițeanu <rapiteanu.catalin@gmail.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=110131
Link: http://lkml.kernel.org/r/1454067370-10374-1-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When calling intel_alt_er() with .idx != EXTRA_REG_RSP_* we will not
initialize alt_idx and then use this uninitialized value to index an
array.
When that is not fatal, it can result in an infinite loop in its
caller __intel_shared_reg_get_constraints(), with IRQs disabled.
Alternative error modes are random memory corruption due to the
cpuc->shared_regs->regs[] array overrun, which manifest in either
get_constraints or put_constraints doing weird stuff.
Only took 6 hours of painful debugging to find this. Neither GCC nor
Smatch warnings flagged this bug.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: ae3f011fc2 ("perf/x86/intel: Fix SLM MSR_OFFCORE_RSP1 valid_mask")
Signed-off-by: Ingo Molnar <mingo@kernel.org>