radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
leading to crash:
RIP: radix_tree_next_slot include/linux/radix-tree.h:473
find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
....
Call Trace:
pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
do_writepages+0x97/0x100 mm/page-writeback.c:2364
__filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
vfs_fsync_range+0x10a/0x250 fs/sync.c:195
vfs_fsync fs/sync.c:209
do_fsync+0x42/0x70 fs/sync.c:219
SYSC_fdatasync fs/sync.c:232
SyS_fdatasync+0x19/0x20 fs/sync.c:230
entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
We must reset iterator's tags to bail out from radix_tree_next_slot()
and go to the slow-path in radix_tree_next_chunk().
Fixes: 46437f9a55 ("radix-tree: fix race in gang lookup")
Link: http://lkml.kernel.org/r/1468495196-10604-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memory controller has quite a bit of state that usually outlives the
cgroup and pins its CSS until said state disappears. At the same time
it imposes a 16-bit limit on the CSS ID space to economically store IDs
in the wild. Consequently, when we use cgroups to contain frequent but
small and short-lived jobs that leave behind some page cache, we quickly
run into the 64k limitations of outstanding CSSs. Creating a new cgroup
fails with -ENOSPC while there are only a few, or even no user-visible
cgroups in existence.
Although pinning CSSs past cgroup removal is common, there are only two
instances that actually need an ID after a cgroup is deleted: cache
shadow entries and swapout records.
Cache shadow entries reference the ID weakly and can deal with the CSS
having disappeared when it's looked up later. They pose no hurdle.
Swap-out records do need to pin the css to hierarchically attribute
swapins after the cgroup has been deleted; though the only pages that
remain swapped out after offlining are tmpfs/shmem pages. And those
references are under the user's control, so they are manageable.
This patch introduces a private 16-bit memcg ID and switches swap and
cache shadow entries over to using that. This ID can then be recycled
after offlining when the CSS remains pinned only by objects that don't
specifically need it.
This script demonstrates the problem by faulting one cache page in a new
cgroup and deleting it again:
set -e
mkdir -p pages
for x in `seq 128000`; do
[ $((x % 1000)) -eq 0 ] && echo $x
mkdir /cgroup/foo
echo $$ >/cgroup/foo/cgroup.procs
echo trex >pages/$x
echo $$ >/cgroup/cgroup.procs
rmdir /cgroup/foo
done
When run on an unpatched kernel, we eventually run out of possible IDs
even though there are no visible cgroups:
[root@ham ~]# ./cssidstress.sh
[...]
65000
mkdir: cannot create directory '/cgroup/foo': No space left on device
After this patch, the IDs get released upon cgroup destruction and the
cache and css objects get released once memory reclaim kicks in.
[hannes@cmpxchg.org: init the IDR]
Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
Fixes: b2052564e6 ("mm: memcontrol: continue cache reclaim from offlined groups")
Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: John Garcia <john.garcia@mesosphere.io>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Nikolay Borisov <kernel@kyup.com>
Cc: <stable@vger.kernel.org> [3.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull MTD fix from Brian Norris:
"Late MTD fix for v4.7:
One regression in the Device Tree handling for OMAP NAND handling of
the ELM node. TI migrated to using the property name "ti,elm-id", but
forgot to keep compatibility with the old "elm_id" property.
Also, might as well send out this MAINTAINERS fixup now"
* tag 'for-linus-20160715' of git://git.infradead.org/linux-mtd:
mtd: nand: omap2: Add check for old elm binding
MAINTAINERS: Add file patterns for mtd device tree bindings
Pull input fixes from Dmitry Torokhov:
"A few last-minute updates for the input subsystem"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: ts4800-ts - add missing of_node_put after calling of_parse_phandle
Input: synaptics-rmi4 - use of_get_child_by_name() to fix refcount
Revert "Input: wacom_w8001 - drop use of ABS_MT_TOOL_TYPE"
Input: xpad - validate USB endpoint count during probe
Input: add SW_PEN_INSERTED define
Pull workqueue fix from Tejun Heo:
"The optimization for setting unbound worker affinity masks collided
with recent scheduler changes triggering warning messages.
This late pull request fixes the bug by removing the optimization"
* 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix setting affinity of unbound worker threads
Without this check, the following XFS_I invocations would return bad
pointers when used on non-XFS inodes (perhaps pointers into preceding
allocator chunks).
This could be used by an attacker to trick xfs_swap_extents into
performing locking operations on attacker-chosen structures in kernel
memory, potentially leading to code execution in the kernel. (I have
not investigated how likely this is to be usable for an attack in
practice.)
Signed-off-by: Jann Horn <jann@thejh.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a memory leak on probe error of the airspy usb device driver.
The problem is triggered when more than 64 usb devices register with
v4l2 of type VFL_TYPE_SDR or VFL_TYPE_SUBDEV.
The memory leak is caused by the probe function of the airspy driver
mishandeling errors and not freeing the corresponding control structures
when an error occours registering the device to v4l2 core.
A badusb device can emulate 64 of these devices, and then through
continual emulated connect/disconnect of the 65th device, cause the
kernel to run out of RAM and crash the kernel, thus causing a local DOS
vulnerability.
Fixes CVE-2016-5400
Signed-off-by: James Patrick-Evans <james@jmp-e.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org # 3.17+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 2c1ea4c700 ("EDAC, sb_edac: Use cpu family/model in driver
detection") I broke Knights Landing because I failed to notice that it
called a wrapper macro "sbridge_get_all_devices_knl" instead of
"sbridge_get_all_devices" like all the other types.
Now that we include the processor type in the pci_id_table structure we
can skip the wrappers and just have the sbridge_get_all_devices() check
the type to decide whether to allow duplicate devices and controllers to
have registers spread across buses.
Fixes: 2c1ea4c700 ("EDAC, sb_edac: Use cpu family/model in driver detection")
Tested-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
of_node_put needs to be called when the device node which is got
from of_parse_phandle has finished using.
Signed-off-by: Peter Chen <peter.chen@nxp.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Merge misc fixes from Andrew Morton:
"20 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
m32r: fix build warning about putc
mm: workingset: printk missing log level, use pr_info()
mm: thp: refix false positive BUG in page_move_anon_rmap()
mm: rmap: call page_check_address() with sync enabled to avoid racy check
mm: thp: move pmd check inside ptl for freeze_page()
vmlinux.lds: account for destructor sections
gcov: add support for gcc version >= 6
mm, meminit: ensure node is online before checking whether pages are uninitialised
mm, meminit: always return a valid node from early_pfn_to_nid
kasan/quarantine: fix bugs on qlist_move_cache()
uapi: export lirc.h header
madvise_free, thp: fix madvise_free_huge_pmd return value after splitting
Revert "scripts/gdb: add documentation example for radix tree"
Revert "scripts/gdb: add a Radix Tree Parser"
scripts/gdb: Perform path expansion to lx-symbol's arguments
scripts/gdb: add constants.py to .gitignore
scripts/gdb: rebuild constants.py on dependancy change
scripts/gdb: silence 'nothing to do' message
kasan: add newline to messages
mm, compaction: prevent VM_BUG_ON when terminating freeing scanner
Pull rdma fixes from Doug Ledford:
"Round three of 4.7 rc fixes:
- two fixes for hfi1
- two fixes for i40iw
- one omission correction in the port table counter arrays"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
i40iw: Enable remote access rights for stag allocation
i40iw: do not print unitialized variables in error message
IB core: Add port_xmit_wait counter
IB/hfi1: Fix sleep inside atomic issue in init_asic_data
IB/hfi1: Correct issues with sc5 computation
Pull i2c fixes from Wolfram Sang:
"Four driver bugfixes for the I2C subsystem"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: mux: reg: wrong condition checked for of_address_to_resource return value
i2c: tegra: Correct error path in probe
i2c: remove __init from i2c_register_board_info()
i2c: qup: Fix wrong value of index variable
Pull drm vmware fixes from Dave Airlie:
"These are some fixes for the vmware graphics driver, that fix some
black screen issues on at least Ubuntu 16.04, I think VMware would
like to get these in so stable can pick them up ASAP"
* tag 'drm-fixes-for-v4.7-rc8-vmware' of git://people.freedesktop.org/~airlied/linux:
drm/vmwgfx: Fix error paths when mapping framebuffer
drm/vmwgfx: Fix corner case screen target management
drm/vmwgfx: Delay pinning fbdev framebuffer until after mode set
drm/vmwgfx: Check pin count before attempting to move a buffer
drm/ttm: Make ttm_bo_mem_compat available
drm/vmwgfx: Add an option to change assumed FB bpp
drm/vmwgfx: Work around mode set failure in 2D VMs
drm/vmwgfx: Add a check to handle host message failure
Pull drm fixes from Dave Airlie:
"These are just some i915 and amdgpu fixes that shows up, the amdgpu
ones are polaris fixes, and the i915 one is a major regression fix"
* tag 'drm-fixes-for-v4.7-rc8' of git://people.freedesktop.org/~airlied/linux:
drm/amdgpu: fix power distribution issue for Polaris10 XT
drm/amdgpu: Add a missing register to Polaris golden setting
drm/i915: Ignore panel type from OpRegion on SKL
drm/i915: Update ifdeffery for mutex->owner
Pull scheduler fix from Ingo Molnar:
"Fix a CPU hotplug related corruption of the load average that got
introduced in this merge window"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Correct off by one bug in load migration calculation
The VM_BUG_ON_PAGE in page_move_anon_rmap() is more trouble than it's
worth: the syzkaller fuzzer hit it again. It's still wrong for some THP
cases, because linear_page_index() was never intended to apply to
addresses before the start of a vma.
That's easily fixed with a signed long cast inside linear_page_index();
and Dmitry has tested such a patch, to verify the false positive. But
why extend linear_page_index() just for this case? when the avoidance in
page_move_anon_rmap() has already grown ugly, and there's no reason for
the check at all (nothing else there is using address or index).
Remove address arg from page_move_anon_rmap(), remove VM_BUG_ON_PAGE,
remove CONFIG_DEBUG_VM PageTransHuge adjustment.
And one more thing: should the compound_head(page) be done inside or
outside page_move_anon_rmap()? It's usually pushed down to the lowest
level nowadays (and mm/memory.c shows no other explicit use of it), so I
think it's better done in page_move_anon_rmap() than by caller.
Fixes: 0798d3c022 ("mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()")
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1607120444540.12528@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The previous patch addresses the race between split_huge_pmd_address()
and someone changing the pmd. The fix is only for splitting of normal
thp (i.e. pmd-mapped thp,) and for splitting of pte-mapped thp there
still is the similar race.
For splitting pte-mapped thp, the pte's conversion is done by
try_to_unmap_one(TTU_MIGRATION). This function checks
page_check_address() to get the target pte, but it can return NULL under
some race, leading to VM_BUG_ON() in freeze_page(). Fortunately,
page_check_address() already has an argument to decide whether we do a
quick/racy check or not, so let's flip it when called from
freeze_page().
Link: http://lkml.kernel.org/r/1466990929-7452-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If CONFIG_KASAN is enabled and gcc is configured with
--disable-initfini-array and/or gold linker is used, gcc emits
.ctors/.dtors and .text.startup/.text.exit sections instead of
.init_array/.fini_array. .dtors section is not explicitly accounted in
the linker script and messes vvar/percpu layout.
We want:
ffffffff822bfd80 D _edata
ffffffff822c0000 D __vvar_beginning_hack
ffffffff822c0000 A __vvar_page
ffffffff822c0080 0000000000000098 D vsyscall_gtod_data
ffffffff822c1000 A __init_begin
ffffffff822c1000 D init_per_cpu__irq_stack_union
ffffffff822c1000 A __per_cpu_load
ffffffff822d3000 D init_per_cpu__gdt_page
We got:
ffffffff8279a600 D _edata
ffffffff8279b000 A __vvar_page
ffffffff8279c000 A __init_begin
ffffffff8279c000 D init_per_cpu__irq_stack_union
ffffffff8279c000 A __per_cpu_load
ffffffff8279e000 D __vvar_beginning_hack
ffffffff8279e080 0000000000000098 D vsyscall_gtod_data
ffffffff827ae000 D init_per_cpu__gdt_page
This happens because __vvar_page and .vvar get different addresses in
arch/x86/kernel/vmlinux.lds.S:
. = ALIGN(PAGE_SIZE);
__vvar_page = .;
.vvar : AT(ADDR(.vvar) - LOAD_OFFSET) {
/* work around gold bug 13023 */
__vvar_beginning_hack = .;
Discard .dtors/.fini_array/.text.exit, since we don't call dtors.
Merge .text.startup into init text.
Link: http://lkml.kernel.org/r/1467386363-120030-1-git-send-email-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
early_pfn_to_nid can return node 0 if a PFN is invalid on machines that
has no node 0. A machine with only node 1 was observed to crash with
the following message:
BUG: unable to handle kernel paging request at 000000000002a3c8
PGD 0
Modules linked in:
Hardware name: Supermicro H8DSP-8/H8DSP-8, BIOS 080011 06/30/2006
task: ffffffff81c0d500 ti: ffffffff81c00000 task.ti: ffffffff81c00000
RIP: reserve_bootmem_region+0x6a/0xef
CR2: 000000000002a3c8 CR3: 0000000001c06000 CR4: 00000000000006b0
Call Trace:
free_all_bootmem+0x4b/0x12a
mem_init+0x70/0xa3
start_kernel+0x25b/0x49b
The problem is that early_page_uninitialised uses the early_pfn_to_nid
helper which returns node 0 for invalid PFNs. No caller of
early_pfn_to_nid cares except early_page_uninitialised. This patch has
early_pfn_to_nid always return a valid node.
Link: http://lkml.kernel.org/r/1468008031-3848-3-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>