Pull percpu fixes from Tejun Heo:
"While adding GFP_ATOMIC support to the percpu allocator, the
synchronization for the fast-path which doesn't require external
allocations was separated into pcpu_lock.
Unfortunately, it incorrectly decoupled async paths and percpu
chunks could get destroyed while still being operated on. This
contains two patches to fix the bug"
* 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: fix synchronization between synchronous map extension and chunk destruction
percpu: fix synchronization between chunk->map_extend_work and chunk destruction
Pull block layer fixes from Jens Axboe:
"A small collection of fixes for the current series. This contains:
- Two fixes for xen-blkfront, from Bob Liu.
- A bug fix for NVMe, releasing only the specific resources we
requested.
- Fix for a debugfs flags entry for nbd, from Josef.
- Plug fix from Omar, fixing up a case of code being switched between
two functions.
- A missing bio_put() for the new discard callers of
submit_bio_wait(), fixing a regression causing a leak of the bio.
From Shaun.
- Improve dirty limit calculation precision in the writeback code,
fixing a case where setting a limit lower than 1% of memory would
end up being zero. From Tejun"
* 'for-linus' of git://git.kernel.dk/linux-block:
NVMe: Only release requested regions
xen-blkfront: fix resume issues after a migration
xen-blkfront: don't call talk_to_blkback when already connected to blkback
nbd: pass the nbd pointer for flags debugfs
block: missing bio_put following submit_bio_wait
blk-mq: really fix plug list flushing for nomerge queues
writeback: use higher precision calculation in domain_dirty_limits()
I noticed that the logic in the fadvise64_64 syscall is incorrect for
partial pages. While first page of the region is correctly skipped if
it is partial, the last page of the region is mistakenly discarded.
This leads to problems for applications that read data in
non-page-aligned chunks discarding already processed data between the
reads.
A somewhat misguided application that does something like write(XX bytes
(non-page-alligned)); drop the data it just wrote; repeat gets a
significant penalty in performance as a result.
Link: http://lkml.kernel.org/r/1464917140-1506698-1-git-send-email-green@linuxhacker.ru
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is based on https://patchwork.ozlabs.org/patch/574623/.
Tejun submitted commit 23d11a58a9 ("workqueue: skip flush dependency
checks for legacy workqueues") for the legacy create*_workqueue()
interface.
But some workq created by alloc_workqueue still reports warning on
memory reclaim, e.g nvme_workq with flag WQ_MEM_RECLAIM set:
workqueue: WQ_MEM_RECLAIM nvme:nvme_reset_work is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
------------[ cut here ]------------
WARNING: CPU: 0 PID: 6 at SoC/linux/kernel/workqueue.c:2448 check_flush_dependency+0xb4/0x10c
...
check_flush_dependency+0xb4/0x10c
flush_work+0x54/0x140
lru_add_drain_all+0x138/0x188
migrate_prep+0xc/0x18
alloc_contig_range+0xf4/0x350
cma_alloc+0xec/0x1e4
dma_alloc_from_contiguous+0x38/0x40
__dma_alloc+0x74/0x25c
nvme_alloc_queue+0xcc/0x36c
nvme_reset_work+0x5c4/0xda8
process_one_work+0x128/0x2ec
worker_thread+0x58/0x434
kthread+0xd4/0xe8
ret_from_fork+0x10/0x50
That's because lru_add_drain_all() will schedule the drain work on
system_wq, whose flag is set to 0, !WQ_MEM_RECLAIM.
Introduce a dedicated WQ_MEM_RECLAIM workqueue to do
lru_add_drain_all(), aiding in getting memory freed.
Link: http://lkml.kernel.org/r/1464917521-9775-1-git-send-email-shhuiw@foxmail.com
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christian Borntraeger reported a kernel panic after corrupt page counts,
and it turned out to be a regression introduced with commit aa88b68c3b
("thp: keep huge zero page pinned until tlb flush"), at least on s390.
put_huge_zero_page() was moved over from zap_huge_pmd() to
release_pages(), and it was replaced by tlb_remove_page(). However,
release_pages() might not always be triggered by (the arch-specific)
tlb_remove_page().
On s390 we call free_page_and_swap_cache() from tlb_remove_page(), and
not tlb_flush_mmu() -> free_pages_and_swap_cache() like the generic
version, because we don't use the MMU-gather logic. Although both
functions have very similar names, they are doing very unsimilar things,
in particular free_page_xxx is just doing a put_page(), while
free_pages_xxx calls release_pages().
This of course results in very harmful put_page()s on the huge zero
page, on architectures where tlb_remove_page() is implemented in this
way. It seems to affect only s390 and sh, but sh doesn't have THP
support, so the problem (currently) probably only exists on s390.
The following quick hack fixed the issue:
Link: http://lkml.kernel.org/r/20160602172141.75c006a9@thinkpad
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@vger.kernel.org> [4.6.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When creating a private mapping of a hugetlbfs file, it is possible to
unmap pages via ftruncate or fallocate hole punch. If subsequent faults
repopulate these mappings, the reserve counts will go negative. This is
because the code currently assumes all faults to private mappings will
consume reserves. The problem can be recreated as follows:
- mmap(MAP_PRIVATE) a file in hugetlbfs filesystem
- write fault in pages in the mapping
- fallocate(FALLOC_FL_PUNCH_HOLE) some pages in the mapping
- write fault in pages in the hole
This will result in negative huge page reserve counts and negative
subpool usage counts for the hugetlbfs. Note that this can also be
recreated with ftruncate, but fallocate is more straight forward.
This patch modifies the routines vma_needs_reserves and vma_has_reserves
to examine the reserve map associated with private mappings similar to
that for shared mappings. However, the reserve map semantics for
private and shared mappings are very different. This results in subtly
different code that is explained in the comments.
Link: http://lkml.kernel.org/r/1464720957-15698-1-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The optimistic fast path may use cpuset_current_mems_allowed instead of
of a NULL nodemask supplied by the caller for cpuset allocations. The
preferred zone is calculated on this basis for statistic purposes and as
a starting point in the zonelist iterator.
However, if the context can ignore memory policies due to being atomic
or being able to ignore watermarks then the starting point in the
zonelist iterator is no longer correct. This patch resets the zonelist
iterator in the allocator slowpath if the context can ignore memory
policies. This will alter the zone used for statistics but only after
it is known that it makes sense for that context. Resetting it before
entering the slowpath would potentially allow an ALLOC_CPUSET allocation
to be accounted for against the wrong zone. Note that while nodemask is
not explicitly set to the original nodemask, it would only have been
overwritten if cpuset_enabled() and it was reset before the slowpath was
entered.
Link: http://lkml.kernel.org/r/20160602103936.GU2527@techsingularity.net
Fixes: c33d6c06f6 ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Geert Uytterhoeven reported the following problem that bisected to
commit c33d6c06f6 ("mm, page_alloc: avoid looking up the first zone
in a zonelist twice") on m68k/ARAnyM
BUG: scheduling while atomic: cron/668/0x10c9a0c0
Modules linked in:
CPU: 0 PID: 668 Comm: cron Not tainted 4.6.0-atari-05133-gc33d6c06f60f710f #364
Call Trace: [<0003d7d0>] __schedule_bug+0x40/0x54
__schedule+0x312/0x388
__schedule+0x0/0x388
prepare_to_wait+0x0/0x52
schedule+0x64/0x82
schedule_timeout+0xda/0x104
set_next_entity+0x18/0x40
pick_next_task_fair+0x78/0xda
io_schedule_timeout+0x36/0x4a
bit_wait_io+0x0/0x40
bit_wait_io+0x12/0x40
__wait_on_bit+0x46/0x76
wait_on_page_bit_killable+0x64/0x6c
bit_wait_io+0x0/0x40
wake_bit_function+0x0/0x4e
__lock_page_or_retry+0xde/0x124
do_scan_async+0x114/0x17c
lookup_swap_cache+0x24/0x4e
handle_mm_fault+0x626/0x7de
find_vma+0x0/0x66
down_read+0x0/0xe
wait_on_page_bit_killable_timeout+0x77/0x7c
find_vma+0x16/0x66
do_page_fault+0xe6/0x23a
res_func+0xa3c/0x141a
buserr_c+0x190/0x6d4
res_func+0xa3c/0x141a
buserr+0x20/0x28
res_func+0xa3c/0x141a
buserr+0x20/0x28
The relationship is not obvious but it's due to a failure to rescan the
full zonelist after the fair zone allocation policy exhausts the batch
count. While this is a functional problem, it's also a performance
issue. A page allocator microbenchmark showed the following
4.7.0-rc1 4.7.0-rc1
vanilla reset-v1r2
Min alloc-odr0-1 327.00 ( 0.00%) 326.00 ( 0.31%)
Min alloc-odr0-2 235.00 ( 0.00%) 235.00 ( 0.00%)
Min alloc-odr0-4 198.00 ( 0.00%) 198.00 ( 0.00%)
Min alloc-odr0-8 170.00 ( 0.00%) 170.00 ( 0.00%)
Min alloc-odr0-16 156.00 ( 0.00%) 156.00 ( 0.00%)
Min alloc-odr0-32 150.00 ( 0.00%) 150.00 ( 0.00%)
Min alloc-odr0-64 146.00 ( 0.00%) 146.00 ( 0.00%)
Min alloc-odr0-128 145.00 ( 0.00%) 145.00 ( 0.00%)
Min alloc-odr0-256 155.00 ( 0.00%) 155.00 ( 0.00%)
Min alloc-odr0-512 168.00 ( 0.00%) 165.00 ( 1.79%)
Min alloc-odr0-1024 175.00 ( 0.00%) 174.00 ( 0.57%)
Min alloc-odr0-2048 180.00 ( 0.00%) 180.00 ( 0.00%)
Min alloc-odr0-4096 187.00 ( 0.00%) 186.00 ( 0.53%)
Min alloc-odr0-8192 190.00 ( 0.00%) 190.00 ( 0.00%)
Min alloc-odr0-16384 191.00 ( 0.00%) 191.00 ( 0.00%)
Min alloc-odr1-1 736.00 ( 0.00%) 445.00 ( 39.54%)
Min alloc-odr1-2 343.00 ( 0.00%) 335.00 ( 2.33%)
Min alloc-odr1-4 277.00 ( 0.00%) 270.00 ( 2.53%)
Min alloc-odr1-8 238.00 ( 0.00%) 233.00 ( 2.10%)
Min alloc-odr1-16 224.00 ( 0.00%) 218.00 ( 2.68%)
Min alloc-odr1-32 210.00 ( 0.00%) 208.00 ( 0.95%)
Min alloc-odr1-64 207.00 ( 0.00%) 203.00 ( 1.93%)
Min alloc-odr1-128 276.00 ( 0.00%) 202.00 ( 26.81%)
Min alloc-odr1-256 206.00 ( 0.00%) 202.00 ( 1.94%)
Min alloc-odr1-512 207.00 ( 0.00%) 202.00 ( 2.42%)
Min alloc-odr1-1024 208.00 ( 0.00%) 205.00 ( 1.44%)
Min alloc-odr1-2048 213.00 ( 0.00%) 212.00 ( 0.47%)
Min alloc-odr1-4096 218.00 ( 0.00%) 216.00 ( 0.92%)
Min alloc-odr1-8192 341.00 ( 0.00%) 219.00 ( 35.78%)
Note that order-0 allocations are unaffected but higher orders get a
small boost from this patch and a large reduction in system CPU usage
overall as can be seen here:
4.7.0-rc1 4.7.0-rc1
vanilla reset-v1r2
User 85.32 86.31
System 2221.39 2053.36
Elapsed 2368.89 2202.47
Fixes: c33d6c06f6 ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Link: http://lkml.kernel.org/r/20160531100848.GR2527@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In DEBUG_VM kernel, we can hit infinite loop for order == 0 in
buffered_rmqueue() when check_new_pcp() returns 1, because the bad page
is never removed from the pcp list. Fix this by removing the page
before retrying. Also we don't need to check if page is non-NULL,
because we simply grab it from the list which was just tested for being
non-empty.
Fixes: 479f854a20 ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
Link: http://lkml.kernel.org/r/20160530090154.GM2527@techsingularity.net
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memcg_offline_kmem() may be called from memcg_free_kmem() after a css
init failure. memcg_free_kmem() is a ->css_free callback which is
called without cgroup_mutex and memcg_offline_kmem() ends up using
css_for_each_descendant_pre() without any locking. Fix it by adding rcu
read locking around it.
mkdir: cannot create directory `65530': No space left on device
===============================
[ INFO: suspicious RCU usage. ]
4.6.0-work+ #321 Not tainted
-------------------------------
kernel/cgroup.c:4008 cgroup_mutex or RCU read lock required!
[ 527.243970] other info that might help us debug this:
[ 527.244715]
rcu_scheduler_active = 1, debug_locks = 0
2 locks held by kworker/0:5/1664:
#0: ("cgroup_destroy"){.+.+..}, at: [<ffffffff81060ab5>] process_one_work+0x165/0x4a0
#1: ((&css->destroy_work)#3){+.+...}, at: [<ffffffff81060ab5>] process_one_work+0x165/0x4a0
[ 527.248098] stack backtrace:
CPU: 0 PID: 1664 Comm: kworker/0:5 Not tainted 4.6.0-work+ #321
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
Workqueue: cgroup_destroy css_free_work_fn
Call Trace:
dump_stack+0x68/0xa1
lockdep_rcu_suspicious+0xd7/0x110
css_next_descendant_pre+0x7d/0xb0
memcg_offline_kmem.part.44+0x4a/0xc0
mem_cgroup_css_free+0x1ec/0x200
css_free_work_fn+0x49/0x5e0
process_one_work+0x1c5/0x4a0
worker_thread+0x49/0x490
kthread+0xea/0x100
ret_from_fork+0x1f/0x40
Link: http://lkml.kernel.org/r/20160526203018.GG23194@mtj.duckdns.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As vm.dirty_[background_]bytes can't be applied verbatim to multiple
cgroup writeback domains, they get converted to percentages in
domain_dirty_limits() and applied the same way as
vm.dirty_[background]ratio. However, if the specified bytes is lower
than 1% of available memory, the calculated ratios become zero and the
writeback domain gets throttled constantly.
Fix it by using per-PAGE_SIZE instead of percentage for ratio
calculations. Also, the updated DIV_ROUND_UP() usages now should
yield 1/4096 (0.0244%) as the minimum ratio as long as the specified
bytes are above zero.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Miao Xie <miaoxie@huawei.com>
Link: http://lkml.kernel.org/g/57333E75.3080309@huawei.com
Cc: stable@vger.kernel.org # v4.2+
Fixes: 9fc3a43e17 ("writeback: separate out domain_dirty_limits()")
Reviewed-by: Jan Kara <jack@suse.cz>
Adjusted comment based on Jan's suggestion.
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull vfs fixes from Al Viro:
"Followups to the parallel lookup work:
- update docs
- restore killability of the places that used to take ->i_mutex
killably now that we have down_write_killable() merged
- Additionally, it turns out that I missed a prerequisite for
security_d_instantiate() stuff - ->getxattr() wasn't the only thing
that could be called before dentry is attached to inode; with smack
we needed the same treatment applied to ->setxattr() as well"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch ->setxattr() to passing dentry and inode separately
switch xattr_handler->set() to passing dentry and inode separately
restore killability of old mutex_lock_killable(&inode->i_mutex) users
add down_write_killable_nested()
update D/f/directory-locking
The do_brk() and vm_brk() return value was "unsigned long" and returned
the starting address on success, and an error value on failure. The
reasons are entirely historical, and go back to it basically behaving
like the mmap() interface does.
However, nobody actually wanted that interface, and it causes totally
pointless IS_ERR_VALUE() confusion.
What every single caller actually wants is just the simpler integer
return of zero for success and negative error number on failure.
So just convert to that much clearer and more common calling convention,
and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The register_page_bootmem_info_node() function needs to be marked __init
in order to avoid a new warning introduced by commit f65e91df25 ("mm:
use early_pfn_to_nid in register_page_bootmem_info_node").
Otherwise you'll get a warning about how a non-init function calls
early_pfn_to_nid (which is __meminit)
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pageblock_order can be (at least) an unsigned int or an unsigned long
depending on the kernel config and architecture, so use max_t(unsigned
long, ...) when comparing it.
fixes these warnings:
In file included from include/asm-generic/bug.h:13:0,
from arch/powerpc/include/asm/bug.h:127,
from include/linux/bug.h:4,
from include/linux/mmdebug.h:4,
from include/linux/mm.h:8,
from include/linux/memblock.h:18,
from mm/cma.c:28:
mm/cma.c: In function 'cma_init_reserved_mem':
include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
(void) (&_max1 == &_max2); ^
mm/cma.c:186:27: note: in expansion of macro 'max'
alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
^
mm/cma.c: In function 'cma_declare_contiguous':
include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
(void) (&_max1 == &_max2); ^
include/linux/kernel.h:747:9: note: in definition of macro 'max'
typeof(y) _max2 = (y); ^
mm/cma.c:270:29: note: in expansion of macro 'max'
(phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
^
include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
(void) (&_max1 == &_max2); ^
include/linux/kernel.h:747:21: note: in definition of macro 'max'
typeof(y) _max2 = (y); ^
mm/cma.c:270:29: note: in expansion of macro 'max'
(phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
^
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20160526150748.5be38a4f@canb.auug.org.au
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>