Currently to allocate a page that should be charged to kmemcg (e.g.
threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page
allocated is then to be freed by free_memcg_kmem_pages. Apart from
looking asymmetrical, this also requires intrusion to the general
allocation path. So let's introduce separate functions that will
alloc/free pages charged to kmemcg.
The new functions are called alloc_kmem_pages and free_kmem_pages. They
should be used when the caller actually would like to use kmalloc, but
has to fall back to the page allocator for the allocation is large.
They only differ from alloc_pages and free_pages in that besides
allocating or freeing pages they also charge them to the kmem resource
counter of the current memory cgroup.
[sfr@canb.auug.org.au: export kmalloc_order() to modules]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have only a few places where we actually want to charge kmem so
instead of intruding into the general page allocation path with
__GFP_KMEMCG it's better to explictly charge kmem there. All kmem
charges will be easier to follow that way.
This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG
from memcg caches' allocflags. Instead it makes slab allocation path
call memcg_charge_kmem directly getting memcg to charge from the cache's
memcg params.
This also eliminates any possibility of misaccounting an allocation
going from one memcg's cache to another memcg, because now we always
charge slabs against the memcg the cache belongs to. That's why this
patch removes the big comment to memcg_kmem_get_cache.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH
got bumped in that exit path. Now there are two, and a bunch of gotos.
ALLOC_SLOWPATH can now get set more than once during a single call to
__slab_alloc() which is pretty bogus. Here's the sequence:
1. Enter __slab_alloc(), fall through all the way to the
stat(s, ALLOC_SLOWPATH);
2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to
new_slab (goto #1)
3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo
(goto #2)
4. Fall through in the same path we did before all the way to
stat(s, ALLOC_SLOWPATH)
5. bump ALLOC_REFILL stat, then return
Doing this is obviously bogus. It keeps us from being able to
accurately compare ALLOC_SLOWPATH vs. ALLOC_FASTPATH. It also means
that the total number of allocs always exceeds the total number of
frees.
This patch moves stat(s, ALLOC_SLOWPATH) to be called from the same
place that __slab_alloc() is. This makes it much less likely that
ALLOC_SLOWPATH will get botched again in the spaghetti-code inside
__slab_alloc().
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the slab or slub allocators cannot allocate additional slab pages,
they emit diagnostic information to the kernel log such as current
number of slabs, number of objects, active objects, etc. This is always
coupled with a page allocation failure warning since it is controlled by
!__GFP_NOWARN.
Suppress this out of memory warning if the allocator is configured
without debug supported. The page allocation failure warning will
indicate it is a failed slab allocation, the order, and the gfp mask, so
this is only useful to diagnose allocator issues.
Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
allocator, there is no functional change with this patch. If debug is
disabled, however, the warnings are now suppressed.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On system with 2TiB ram, current x86_64 have 128M as section size, and
one memory_block only include one section. So will have 16400 entries
under /sys/devices/system/memory/.
Current code try to use block id to find block pointer in /sys for any
section, and reuse that block pointer. that finding will take some time
even after commit 7c243c7168 ("mm: speedup in __early_pfn_to_nid")
that will skip the search in that case during booting up.
So solution could be increase block size just like SGI UV system did.
(harded code to 2g).
This patch is trying to probe the block size to make it match mmio remap
size. for example, Intel Nehalem later system will have memory range [0,
TOML), [4g, TOMH]. If the memory hole is 2g and total is 128g, TOM will
be 2g, and TOM2 will be 130g.
We could use 2g as block size instead of default 128M. That will reduce
number of entries in /sys/devices/system/memory/
On system 6TiB system will reduce boot time by 35 seconds.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
_PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
faults on x86. Care is taken such that _PAGE_NUMA is used only in
situations where the VMA flags distinguish between NUMA hinting faults
and prot_none faults. This decision was x86-specific and conceptually
it is difficult requiring special casing to distinguish between PROTNONE
and NUMA ptes based on context.
Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
between an entry that is really unmapped and a page that is protected
for NUMA hinting faults as if the PTE is not present then a fault will
be trapped.
Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
This patch shrinks the maximum possible swap size and uses the bit to
uniquely distinguish between NUMA hinting ptes and swap ptes.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Noonan <steven@uplinklabs.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Description by Jan Kara:
"A lot of older filesystems don't properly flush volatile disk caches
on fsync(2) which can lead to loss of fsynced data after power failure.
This patch makes generic_file_fsync() issue proper cache flush to fix the
problem. Sysadmin can use /sys/devices/.../cache_type to tell the system
it should not send the cache flush."
[akpm@linux-foundation.org: nuke ifdef]
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ocfs2 cluster size may be 1MB, which has 20 bits. When resize, the
input new clusters is mostly the number of clusters in a group
descriptor(32256).
Since the input clusters is defined as type int, so it will overflow
when shift left 20 bits and then lead to incorrect global bitmap i_size.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We found a race situation when dlm recovery and node joining occurs
simultaneously if the network state is bad.
N1 N4
start joining dlm and send
query join to all live nodes
set joining node to N1, return OK
send query join to other
live nodes and it may take
a while
call dlm_send_join_assert()
to send assert join message
when N2 is down, so keep
trying to send message to N2
until find N2 is down
send assert join message to
N3, but connection is down
with N3, so it may take a
while
become the recovery master for N2
and send begin reco message to other
nodes in domain map but no N1
connection with N3 is rebuild,
then send assert join to N4
call dlm_assert_joined_handler(),
add N1 to domain_map
dlm recovery done, send finalize message
to nodes in domain map, including N1
receiving finalize message,
trigger the BUG() because
recovery master mismatch.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit 75f82eaa50 ("ocfs2: fix NULL pointer dereference when
dismount and ocfs2rec simultaneously") because it may cause a umount
hang while shutting down the truncate log.
fix NULL pointer dereference when dismount and ocfs2rec simultaneously
The situation is as followes:
ocfs2_dismout_volume
-> ocfs2_recovery_exit
-> free osb->recovery_map
-> ocfs2_truncate_shutdown
-> lock global bitmap inode
-> ocfs2_wait_for_recovery
-> check whether osb->recovery_map->rm_used is zero
Because osb->recovery_map is already freed, rm_used can be any other
values, so it may yield umount hang.
To prevent NULL pointer dereference while getting sys_root_inode, we use
a osb_tl_disable flag to disable schedule osb_truncate_log_wq after
truncate log shutdown.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We found there is a conversion deadlock when the owner of lockres
happened to crash before send DLM_PROXY_AST_MSG for a downconverting
lock. The situation is as follows:
Node1 Node2 Node3
the owner of lockresA
lock_1 granted at EX mode
and call ocfs2_cluster_unlock
to decrease ex_holders.
converting lock_3 from
NL to EX
send DLM_PROXY_AST_MSG
to Node1, asking Node 1
to downconvert.
receiving DLM_PROXY_AST_MSG,
thread ocfs2dc send
DLM_CONVERT_LOCK_MSG
to Node2 to downconvert
lock_1(EX->NL).
lock_1 can be granted and
put it into pending_asts
list, return DLM_NORMAL.
then something happened
and Node2 crashed.
received DLM_NORMAL, waiting
for DLM_PROXY_AST_MSG.
selected as the recovery
master, receving migrate
lock from Node1, queue
lock_1 to the tail of
converting list.
After dlm recovery, converting list in the master of lockresA(Node3)
will be: converting list head <-> lock_3(NL->EX) <->lock_1(EX<->NL).
Requested mode of lock_3 is not compatible with the granted mode of
lock_1, so it can not be granted. and lock_1 can not downconvert
because covnerting queue is strictly FIFO. So a deadlock is created.
We think function dlm_process_recovery_data() should queue_ast for
lock_1 or alter the order of lock_1 and lock_3, so dlm_thread can
process lock_1 first. And if there are multiple downconverting locks,
they must convert form PR to NL, so no need to sort them.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Once JBD2_ABORT is set, ocfs2_commit_cache will fail in
ocfs2_commit_thread. Then it will get into a loop with mass logs. This
will meaninglessly consume a larger number of resource and may lead to
the system hanging. So limit printk in this case.
[akpm@linux-foundation.org: document the msleep]
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two standard techniques for dereferencing structures pointed
to by void *: cast to the right type each time they're used, or assign
to local variables of the right type.
But there's no need to do *both*.
Signed-off-by: George Spelvin <linux@horizon.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Direct conversion of one KERN_DEBUG message without DEBUG definition
(suggested by Josh Triplett)
That message will now be disabled by default. (see
Documentation/CodingStyle Chapter 13)
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>