The kmemleak_init() definition in mm/kmemleak.c is marked __init but its
prototype in include/linux/kmemleak.h is marked __ref since commit
a6186d89c9 ("kmemleak: Mark the early log buffer as __initdata").
This causes a section mismatch which is reported as a warning when
building with clang -Wsection, because kmemleak_init() is declared in
section .ref.text but defined in .init.text.
Fix this by marking kmemleak_init() prototype __init.
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 42cb14b110 ("mm: migrate dirty page without
clear_page_dirty_for_io etc") simplified the migration of a PageDirty
pagecache page: one stat needs moving from zone to zone and that's about
all.
It's convenient and safest for it to shift the PageDirty bit from old
page to new, just before updating the zone stats: before copying data
and marking the new PageUptodate. This is all done while both pages are
isolated and locked, just as before; and just as before, there's a
moment when the new page is visible in the radix_tree, but not yet
PageUptodate. What's new is that it may now be briefly visible as
PageDirty before it is PageUptodate.
When I scoured the tree to see if this could cause a problem anywhere,
the only places I found were in two similar functions __r4w_get_page():
which look up a page with find_get_page() (not using page lock), then
claim it's uptodate if it's PageDirty or PageWriteback or PageUptodate.
I'm not sure whether that was right before, but now it might be wrong
(on rare occasions): only claim the page is uptodate if PageUptodate.
Or perhaps the page in question could never be migratable anyway?
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Boaz Harrosh <ooo@electrozaur.com>
Cc: Benny Halevy <bhalevy@panasas.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa has reported that the system might basically livelock in
OOM condition without triggering the OOM killer.
The issue is caused by internal dependency of the direct reclaim on
vmstat counter updates (via zone_reclaimable) which are performed from
the workqueue context. If all the current workers get assigned to an
allocation request, though, they will be looping inside the allocator
trying to reclaim memory but zone_reclaimable can see stalled numbers so
it will consider a zone reclaimable even though it has been scanned way
too much. WQ concurrency logic will not consider this situation as a
congested workqueue because it relies that worker would have to sleep in
such a situation. This also means that it doesn't try to spawn new
workers or invoke the rescuer thread if the one is assigned to the
queue.
In order to fix this issue we need to do two things. First we have to
let wq concurrency code know that we are in trouble so we have to do a
short sleep. In order to prevent from issues handled by 0e093d9976
("writeback: do not sleep on the congestion queue if there are no
congested BDIs or if significant congestion is not being encountered in
the current zone") we limit the sleep only to worker threads which are
the ones of the interest anyway.
The second thing to do is to create a dedicated workqueue for vmstat and
mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
have a spare worker thread for it.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Cc: Cristopher Lameter <clameter@sgi.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 016c13daa5 ("mm, page_alloc: use masks and shifts when
converting GFP flags to migrate types") has swapped MIGRATE_MOVABLE and
MIGRATE_RECLAIMABLE in the enum definition. However, migratetype_names
wasn't updated to reflect that.
As a result, the file /proc/pagetypeinfo shows the counts for Movable as
Reclaimable and vice versa.
Additionally, commit 0aaa29a56e ("mm, page_alloc: reserve pageblocks
for high-order atomic allocations on demand") introduced
MIGRATE_HIGHATOMIC, but did not add a letter to distinguish it into
show_migration_types(), so it doesn't appear in the listing of free
areas during page alloc failures or oom kills.
This patch fixes both problems. The atomic reserves will show with a
letter 'H' in the free areas listings.
Fixes: 016c13daa5 ("mm, page_alloc: use masks and shifts when converting GFP flags to migrate types")
Fixes: 0aaa29a56e ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the memory.high threshold is exceeded, try_charge() schedules a
task_work to reclaim the excess. The reclaim target is set to the
number of pages requested by try_charge().
This is wrong, because try_charge() usually charges more pages than
requested (batch > nr_pages) in order to refill per cpu stocks. As a
result, a process in a cgroup can easily exceed memory.high
significantly when doing a lot of charges w/o returning to userspace
(e.g. reading a file in big chunks).
Fix this issue by assuring that when exceeding memory.high a process
reclaims as many pages as were actually charged (i.e. batch).
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When dequeue_huge_page_vma() in alloc_huge_page() fails, we fall back on
alloc_buddy_huge_page() to directly create a hugepage from the buddy
allocator.
In that case, however, if alloc_buddy_huge_page() succeeds we don't
decrement h->resv_huge_pages, which means that successful
hugetlb_fault() returns without releasing the reserve count. As a
result, subsequent hugetlb_fault() might fail despite that there are
still free hugepages.
This patch simply adds decrementing code on that code path.
I reproduced this problem when testing v4.3 kernel in the following situation:
- the test machine/VM is a NUMA system,
- hugepage overcommiting is enabled,
- most of hugepages are allocated and there's only one free hugepage
which is on node 0 (for example),
- another program, which calls set_mempolicy(MPOL_BIND) to bind itself to
node 1, tries to allocate a hugepage,
- the allocation should fail but the reserve count is still hold.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org> [3.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull device mapper fixes from Mike Snitzer:
"Five stable fixes:
- Two DM btree bufio buffer leak fixes that resolve reported BUG_ONs
during DM thinp metadata close's dm_bufio_client_destroy().
- A DM thinp range discard fix to handle discarding a partially
mapped range.
- A DM thinp metadata snapshot fix to make sure the btree roots saved
in the metadata snapshot are the most current.
- A DM space map metadata refcounting fix that improves both DM thinp
and DM cache metadata"
* tag 'dm-4.4-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm btree: fix bufio buffer leaks in dm_btree_del() error path
dm space map metadata: fix ref counting bug when bootstrapping a new space map
dm thin metadata: fix bug when taking a metadata snapshot
dm thin metadata: fix bug in dm_thin_remove_range()
dm btree: fix leak of bufio-backed block in btree_split_sibling error path
Pull drm fixes from Dave Airlie:
"Not too much this time.
- One nouveau workaround extended to a few more GPUs
- Some amdgpu big endian fixes, and a regression fixer
- Some vmwgfx fixes
- One ttm locking fix
- One vgaarb fix"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
vgaarb: fix signal handling in vga_get()
radeon: Fix VCE IB test on Big-Endian systems
radeon: Fix VCE ring test for Big-Endian systems
radeon/cik: Fix GFX IB test on Big-Endian
drm/amdgpu: fix the lost duplicates checking
drm/nouveau/pmu: remove whitelist for PGOB-exit WAR, enable by default
drm/vmwgfx: Implement the cursor_set2 callback v2
drm/vmwgfx: fix a warning message
drm/ttm: Fixed a read/write lock imbalance
There are few defects in vga_get() related to signal hadning:
- we shouldn't check for pending signals for TASK_UNINTERRUPTIBLE
case;
- if we found pending signal we must remove ourself from wait queue
and change task state back to running;
- -ERESTARTSYS is more appropriate, I guess.
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: stable@vger.kernel.org
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
some big endian fixes and one regression fix.
* 'drm-fixes-4.4' of git://people.freedesktop.org/~agd5f/linux:
radeon: Fix VCE IB test on Big-Endian systems
radeon: Fix VCE ring test for Big-Endian systems
radeon/cik: Fix GFX IB test on Big-Endian
drm/amdgpu: fix the lost duplicates checking
Pull rdma fixes from Doug Ledford:
"Most are minor to important fixes.
There is one performance enhancement that I took on the grounds that
failing to check if other processes can run before running what's
intended to be a background, idle-time task is a bug, even though the
primary effect of the fix is to improve performance (and it was a very
simple patch)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
IB/mlx5: Postpone remove_keys under knowledge of coming preemption
IB/mlx4: Use vmalloc for WR buffers when needed
IB/mlx4: Use correct order of variables in log message
iser-target: Remove explicit mlx4 work-around
mlx4: Expose correct max_sge_rd limit
IB/mad: Require CM send method for everything except ClassPortInfo
IB/cma: Add a missing rcu_read_unlock()
IB core: Fix ib_sg_to_pages()
IB/srp: Fix srp_map_sg_fr()
IB/srp: Fix indirect data buffer rkey endianness
IB/srp: Initialize dma_length in srp_map_idb
IB/srp: Fix possible send queue overflow
IB/srp: Fix a memory leak
IB/sa: Put netlink request into the request list before sending
IB/iser: use sector_div instead of do_div
IB/core: use RCU for uverbs id lookup
IB/qib: Minor fixes to qib per SFF 8636
IB/core: Fix user mode post wr corruption
IB/qib: Fix qib_mr structure
Pull sound fixes from Takashi Iwai:
"Again less intensive changes in this rc: you can find only a few
HD-audio fixes (noise fixes for Intel Broxton chip and a few Thinkpad
models, quirks for Alienware 17 and Packard Bell DOTS) in addition to
a long-standing rme96 bug fix"
* tag 'sound-4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/ca0132 - quirk for Alienware 17 2015
ALSA: hda - Fix noise problems on Thinkpad T440s
ALSA: hda - Fixing speaker noise on the two latest thinkpad models
ALSA: hda - Add inverted dmic for Packard Bell DOTS
ALSA: hda - Fix playback noise with 24/32 bit sample size on BXT
ALSA: rme96: Fix unexpected volume reset after rate changes
If dm_btree_del()'s call to push_frame() fails, e.g. due to
btree_node_validator finding invalid metadata, the dm_btree_del() error
path must unlock all frames (which have active dm-bufio buffers) that
were pushed onto the del_stack.
Otherwise, dm_bufio_client_destroy() will BUG_ON() because dm-bufio
buffers have leaked, e.g.:
device-mapper: bufio: leaked buffer 3, hold count 1, list 0
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Pull VFIO fixes from Alex Williamson:
- Various fixes for removing redundancy, const'ifying structs, avoiding
stack usage, fixing WARN usage (Krzysztof Kozlowski, Julia Lawall,
Kees Cook, Dan Carpenter)
- Revert No-IOMMU mode as the intended user has not emerged (Alex
Williamson)
* tag 'vfio-v4.4-rc5' of git://github.com/awilliam/linux-vfio:
Revert: "vfio: Include No-IOMMU mode"
vfio: fix a warning message
vfio: platform: remove needless stack usage
vfio-pci: constify pci_error_handlers structures
vfio: Drop owner assignment from platform_driver
Pull DT fixes from Rob Herring:
"I think this should be all for 4.4:
- Fix incorrect warning about overlapping memory regions
- Export of_irq_find_parent again which was made static in 4.4, but
has users pending for 4.5.
- Fix of_msi_map_rid declaration location
- Fix re-entrancy for of_fdt_unflatten_tree
- Clean-up of phys_addr_t printks"
* tag 'devicetree-fixes-for-4.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
of/irq: move of_msi_map_rid declaration to the correct ifdef section
of/irq: Export of_irq_find_parent again
of/fdt: Add mutex protection for calls to __unflatten_device_tree()
of/address: fix typo in comment block of of_translate_one()
of: do not use 0x in front of %pa
of: Fix comparison of reserved memory regions
Pull clk fixes from Stephen Boyd:
"One small build fix, a couple do_div() fixes, and a fix for the gpio
basic clock type are the major changes here. There's also a couple
fixes for the TI, sunxi, and scpi clock drivers"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: sunxi: pll2: Fix clock running too fast
clk: scpi: add missing of_node_put
clk: qoriq: fix memory leak
imx/clk-pllv2: fix wrong do_div() usage
imx/clk-pllv1: fix wrong do_div() usage
clk: mmp: add linux/clk.h includes
clk: ti: drop locking code from mux/divider drivers
clk: ti816x: Add missing dmtimer clkdev entries
clk: ti: fapll: fix wrong do_div() usage
clk: ti: clkt_dpll: fix wrong do_div() usage
clk: gpio: Get parent clk names in of_gpio_clk_setup()
Pull IPMI fix from Corey Minyard:
"Fix an Oops if an interrupt occurs at startup. This can happen on
some hardware"
* tag 'for-linus-4.4-1' of git://git.code.sf.net/p/openipmi/linux-ipmi:
ipmi: move timer init to before irq is setup
We encountered a panic on boot in ipmi_si on a dell per320 due to an
uninitialized timer as follows.
static int smi_start_processing(void *send_info,
ipmi_smi_t intf)
{
/* Try to claim any interrupts. */
if (new_smi->irq_setup)
new_smi->irq_setup(new_smi);
--> IRQ arrives here and irq handler tries to modify uninitialized timer
which triggers BUG_ON(!timer->function) in __mod_timer().
Call Trace:
<IRQ>
[<ffffffffa0532617>] start_new_msg+0x47/0x80 [ipmi_si]
[<ffffffffa053269e>] start_check_enables+0x4e/0x60 [ipmi_si]
[<ffffffffa0532bd8>] smi_event_handler+0x1e8/0x640 [ipmi_si]
[<ffffffff810f5584>] ? __rcu_process_callbacks+0x54/0x350
[<ffffffffa053327c>] si_irq_handler+0x3c/0x60 [ipmi_si]
[<ffffffff810efaf0>] handle_IRQ_event+0x60/0x170
[<ffffffff810f245e>] handle_edge_irq+0xde/0x180
[<ffffffff8100fc59>] handle_irq+0x49/0xa0
[<ffffffff8154643c>] do_IRQ+0x6c/0xf0
[<ffffffff8100ba53>] ret_from_intr+0x0/0x11
/* Set up the timer that drives the interface. */
setup_timer(&new_smi->si_timer, smi_timeout, (long)new_smi);
The following patch fixes the problem.
To: Openipmi-developer@lists.sourceforge.net
To: Corey Minyard <minyard@acm.org>
CC: linux-kernel@vger.kernel.org
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Tony Camuso <tcamuso@redhat.com>
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Cc: stable@vger.kernel.org # Applies cleanly to 3.10-, needs small rework before
ROL on a 32 bit integer with a shift of 32 or more is undefined and the
result is arch-dependent. Avoid this by handling the trivial case of
roling by 0 correctly.
The trivial solution of checking if shift is 0 breaks gcc's detection
of this code as a ROL instruction, which is unacceptable.
This bug was reported and fixed in GCC
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57157):
The standard rotate idiom,
(x << n) | (x >> (32 - n))
is recognized by gcc (for concreteness, I discuss only the case that x
is an uint32_t here).
However, this is portable C only for n in the range 0 < n < 32. For n
== 0, we get x >> 32 which gives undefined behaviour according to the
C standard (6.5.7, Bitwise shift operators). To portably support n ==
0, one has to write the rotate as something like
(x << n) | (x >> ((-n) & 31))
And this is apparently not recognized by gcc.
Note that this is broken on older GCCs and will result in slower ROL.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When applying block operations (BOPs) do not remove them from the
uncommitted BOP ring-buffer until after they've been applied -- in case
we recurse.
Also, perform BOP_INC operation, in dm_sm_metadata_create() and
sm_metadata_extend(), in terms of the uncommitted BOP ring-buffer rather
than using direct calls to sm_ll_inc().
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
When you take a metadata snapshot the btree roots for the mapping and
details tree need to have their reference counts incremented so they
persist for the lifetime of the metadata snap.
The roots being incremented were those currently written in the
superblock, which could possibly be out of date if concurrent IO is
triggering new mappings, breaking of sharing, etc.
Fix this by performing a commit with the metadata lock held while taking
a metadata snapshot.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org