Commit Graph

568031 Commits

Author SHA1 Message Date
Daniel Cashman d07e22597d mm: mmap: add new /proc tunable for mmap_base ASLR
Address Space Layout Randomization (ASLR) provides a barrier to
exploitation of user-space processes in the presence of security
vulnerabilities by making it more difficult to find desired code/data
which could help an attack.  This is done by adding a random offset to
the location of regions in the process address space, with a greater
range of potential offset values corresponding to better protection/a
larger search-space for brute force, but also to greater potential for
fragmentation.

The offset added to the mmap_base address, which provides the basis for
the majority of the mappings for a process, is set once on process exec
in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
which reflect, hopefully, the best compromise for all systems.  The
trade-off between increased entropy in the offset value generation and
the corresponding increased variability in address space fragmentation
is not absolute, however, and some platforms may tolerate higher amounts
of entropy.  This patch introduces both new Kconfig values and a sysctl
interface which may be used to change the amount of entropy used for
offset generation on a system.

The direct motivation for this change was in response to the
libstagefright vulnerabilities that affected Android, specifically to
information provided by Google's project zero at:

  http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html

The attack presented therein, by Google's project zero, specifically
targeted the limited randomness used to generate the offset added to the
mmap_base address in order to craft a brute-force-based attack.
Concretely, the attack was against the mediaserver process, which was
limited to respawning every 5 seconds, on an arm device.  The hard-coded
8 bits used resulted in an average expected success rate of defeating
the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
piece).  With this patch, and an accompanying increase in the entropy
value to 16 bits, the same attack would take an average expected time of
over 45 hours (32768 tries), which makes it both less feasible and more
likely to be noticed.

The introduced Kconfig and sysctl options are limited by per-arch
minimum and maximum values, the minimum of which was chosen to match the
current hard-coded value and the maximum of which was chosen so as to
give the greatest flexibility without generating an invalid mmap_base
address, generally a 3-4 bits less than the number of bits in the
user-space accessible virtual address space.

When decided whether or not to change the default value, a system
developer should consider that mmap_base address could be placed
anywhere up to 2^(value) bits away from the non-randomized location,
which would introduce variable-sized areas above and below the mmap_base
address such that the maximum vm_area_struct size may be reduced,
preventing very large allocations.

This patch (of 4):

ASLR only uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures.  This value was chosen to
prevent a poorly chosen value from dividing the address space in such a
way as to prevent large allocations.  This may not be an issue on all
platforms.  Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.

Signed-off-by: Daniel Cashman <dcashman@google.com>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Borislav Petkov <bp@suse.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Piotr Kwapulinski bc36f7017c mm/mmap.c: remove incorrect MAP_FIXED flag comparison from mmap_region
The following flag comparison in mmap_region makes no sense:

    if (!(vm_flags & MAP_FIXED))
        return -ENOMEM;

The condition is always false and thus the above "return -ENOMEM" is
never executed.  The vm_flags must not be compared with MAP_FIXED flag.
The vm_flags may only be compared with VM_* flags.  MAP_FIXED has the
same value as VM_MAYREAD.

Hitting the rlimit is a slow path and find_vma_intersection should
realize that there is no overlapping VMA for !MAP_FIXED case pretty
quickly.

Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Michal Hocko 9f6c399ddc mm, vmscan: consider isolated pages in zone_reclaimable_pages
zone_reclaimable_pages counts how many pages are reclaimable in the
given zone.  This currently includes all pages on file lrus and anon
lrus if there is an available swap storage.  We do not consider
NR_ISOLATED_{ANON,FILE} counters though which is not correct because
these counters reflect temporarily isolated pages which are still
reclaimable because they either get back to their LRU or get freed
either by the page reclaim or page migration.

The number of these pages might be sufficiently high to confuse users of
zone_reclaimable_pages (e.g.  mbind can migrate large ranges of memory
at once).

Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Andrew Morton b832861cca fs/block_dev.c:bdev_write_page(): use blk_queue_enter(..., GFP_NOIO)
bdev_write_page() is used by swapout and by writepage where we cannot
use __GFP_FS or __GFP_IO.  So it is misleading to mention GFP_KERNEL
here.

blk_queue_enter() only actually looks at __GFP_DIRECT_RECLAIM, so no
bugs were harmed in the making of this patch.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Vladimir Davydov 9ee11ba425 memcg: do not allow to disable tcp accounting after limit is set
There are two bits defined for cg_proto->flags - MEMCG_SOCK_ACTIVATED
and MEMCG_SOCK_ACTIVE - both are set in tcp_update_limit, but the former
is never cleared while the latter can be cleared by unsetting the limit.
This allows to disable tcp socket accounting for new sockets after it
was enabled by writing -1 to memory.kmem.tcp.limit_in_bytes while still
guaranteeing that memcg_socket_limit_enabled static key will be
decremented on memcg destruction.

This functionality looks dubious, because it is not clear what a use
case would be.  By enabling tcp accounting a user accepts the price.  If
they then find the performance degradation unacceptable, they can always
restart their workload with tcp accounting disabled.  It does not seem
there is any need to flip it while the workload is running.

Besides, it contradicts to how kmem accounting API works: writing
whatever to memory.kmem.limit_in_bytes enables kmem accounting for the
cgroup in question, after which it cannot be disabled.  Therefore one
might expect that writing -1 to memory.kmem.tcp.limit_in_bytes just
enables socket accounting w/o limiting it, which might be useful by
itself, but it isn't true.

Since this API peculiarity is not documented anywhere, I propose to drop
it.  This will allow to simplify the code by dropping cg_proto->flags.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Vladimir Davydov 316bda0e6c vmscan: do not force-scan file lru if its absolute size is small
We assume there is enough inactive page cache if the size of inactive
file lru is greater than the size of active file lru, in which case we
force-scan file lru ignoring anonymous pages.  While this logic works
fine when there are plenty of page cache pages, it fails if the size of
file lru is small (several MB): in this case (lru_size >> prio) will be
0 for normal scan priorities, as a result, if inactive file lru happens
to be larger than active file lru, anonymous pages of a cgroup will
never get evicted unless the system experiences severe memory pressure,
even if there are gigabytes of unused anonymous memory there, which is
unfair in respect to other cgroups, whose workloads might be page cache
oriented.

This patch attempts to fix this by elaborating the "enough inactive page
cache" check: it makes it not only check that inactive lru size > active
lru size, but also that we will scan something from the cgroup at the
current scan priority.  If these conditions do not hold, we proceed to
SCAN_FRACT as usual.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
David Rientjes 244d63ee34 mm, vmalloc: remove VM_VPAGES
VM_VPAGES is unnecessary, it's easier to check is_vmalloc_addr() when
reading /proc/vmallocinfo.

[akpm@linux-foundation.org: remove VM_VPAGES reference via kvfree()]
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Geliang Tang 146693471f mm, thp: use list_first_entry_or_null()
Simplify the code with list_first_entry_or_null().

Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Jerome Marchand 8cee852ec5 mm, procfs: breakdown RSS for anon, shmem and file in /proc/pid/status
There are several shortcomings with the accounting of shared memory
(SysV shm, shared anonymous mapping, mapping of a tmpfs file).  The
values in /proc/<pid>/status and <...>/statm don't allow to distinguish
between shmem memory and a shared mapping to a regular file, even though
theirs implication on memory usage are quite different: during reclaim,
file mapping can be dropped or written back on disk, while shmem needs a
place in swap.

Also, to distinguish the memory occupied by anonymous and file mappings,
one has to read the /proc/pid/statm file, which has a field for the file
mappings (again, including shmem) and total memory occupied by these
mappings (i.e.  equivalent to VmRSS in the <...>/status file.  Getting
the value for anonymous mappings only is thus not exactly user-friendly
(the statm file is intended to be rather efficiently machine-readable).

To address both of these shortcomings, this patch adds a breakdown of
VmRSS in /proc/<pid>/status via new fields RssAnon, RssFile and
RssShmem, making use of the previous preparatory patch.  These fields
tell the user the memory occupied by private anonymous pages, mapped
regular files and shmem, respectively.  Other existing fields in /status
and /statm files are left without change.  The /statm file can be
extended in the future, if there's a need for that.

Example (part of) /proc/pid/status output including the new Rss* fields:

VmPeak:  2001008 kB
VmSize:  2001004 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      5108 kB
VmRSS:      5108 kB
RssAnon:              92 kB
RssFile:            1324 kB
RssShmem:           3692 kB
VmData:      192 kB
VmStk:       136 kB
VmExe:         4 kB
VmLib:      1784 kB
VmPTE:      3928 kB
VmPMD:        20 kB
VmSwap:        0 kB
HugetlbPages:          0 kB

[vbabka@suse.cz: forward-porting, tweak changelog]
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Jerome Marchand eca56ff906 mm, shmem: add internal shmem resident memory accounting
Currently looking at /proc/<pid>/status or statm, there is no way to
distinguish shmem pages from pages mapped to a regular file (shmem pages
are mapped to /dev/zero), even though their implication in actual memory
use is quite different.

The internal accounting currently counts shmem pages together with
regular files.  As a preparation to extend the userspace interfaces,
this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
shmem pages separately from MM_FILEPAGES.  The next patch will expose it
to userspace - this patch doesn't change the exported values yet, by
adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
used before.  The only user-visible change after this patch is the OOM
killer message that separates the reported "shmem-rss" from "file-rss".

[vbabka@suse.cz: forward-porting, tweak changelog]
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Vlastimil Babka 48131e03ca mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings
Following the previous patch, further reduction of /proc/pid/smaps cost
is possible for private writable shmem mappings with unpopulated areas
where the page walk invokes the .pte_hole function.  We can use radix
tree iterator for each such area instead of calling find_get_entry() in
a loop.  This is possible at the extra maintenance cost of introducing
another shmem function shmem_partial_swap_usage().

To demonstrate the diference, I have measured this on a process that
creates a private writable 2GB mapping of a partially swapped out
/dev/shm/file (which cannot employ the optimizations from the prvious
patch) and doesn't populate it at all.  I time how long does it take to
cat /proc/pid/smaps of this process 100 times.

Before this patch:

real    0m3.831s
user    0m0.180s
sys     0m3.212s

After this patch:

real    0m1.176s
user    0m0.180s
sys     0m0.684s

The time is similar to the case where a radix tree iterator is employed
on the whole mapping.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Vlastimil Babka 6a15a37097 mm, proc: reduce cost of /proc/pid/smaps for shmem mappings
The previous patch has improved swap accounting for shmem mapping, which
however made /proc/pid/smaps more expensive for shmem mappings, as we
consult the radix tree for each pte_none entry, so the overal complexity
is O(n*log(n)).

We can reduce this significantly for mappings that cannot contain COWed
pages, because then we can either use the statistics tha shmem object
itself tracks (if the mapping contains the whole object, or the swap
usage of the whole object is zero), or use the radix tree iterator,
which is much more effective than repeated find_get_entry() calls.

This patch therefore introduces a function shmem_swap_usage(vma) and
makes /proc/pid/smaps use it when possible.  Only for writable private
mappings of shmem objects (i.e.  tmpfs files) with the shmem object
itself (partially) swapped outwe have to resort to the find_get_entry()
approach.

Hopefully such mappings are relatively uncommon.

To demonstrate the diference, I have measured this on a process that
creates a 2GB mapping and dirties single pages with a stride of 2MB, and
time how long does it take to cat /proc/pid/smaps of this process 100
times.

Private writable mapping of a /dev/shm/file (the most complex case):

real    0m3.831s
user    0m0.180s
sys     0m3.212s

Shared mapping of an almost full mapping of a partially swapped /dev/shm/file
(which needs to employ the radix tree iterator).

real    0m1.351s
user    0m0.096s
sys     0m0.768s

Same, but with /dev/shm/file not swapped (so no radix tree walk needed)

real    0m0.935s
user    0m0.128s
sys     0m0.344s

Private anonymous mapping:

real    0m0.949s
user    0m0.116s
sys     0m0.348s

The cost is now much closer to the private anonymous mapping case, unless
the shmem mapping is private and writable.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Vlastimil Babka c261e7d94f mm, proc: account for shmem swap in /proc/pid/smaps
Currently, /proc/pid/smaps will always show "Swap: 0 kB" for
shmem-backed mappings, even if the mapped portion does contain pages
that were swapped out.  This is because unlike private anonymous
mappings, shmem does not change pte to swap entry, but pte_none when
swapping the page out.  In the smaps page walk, such page thus looks
like it was never faulted in.

This patch changes smaps_pte_entry() to determine the swap status for
such pte_none entries for shmem mappings, similarly to how
mincore_page() does it.  Swapped out shmem pages are thus accounted for.
For private mappings of tmpfs files that COWed some of the pages, swaped
out status of the original shmem pages is naturally ignored.  If some of
the private copies was also swapped out, they are accounted via their
page table swap entries, so the resulting reported swap usage is then a
sum of both swapped out private copies, and swapped out shmem pages that
were not COWed.  No double accounting can thus happen.

The accounting is arguably still not as precise as for private anonymous
mappings, since now we will count also pages that the process in
question never accessed, but another process populated them and then let
them become swapped out.  I believe it is still less confusing and
subtle than not showing any swap usage by shmem mappings at all.
Swapped out counter might of interest of users who would like to prevent
from future swapins during performance critical operation and pre-fault
them at their convenience.  Especially for larger swapped out regions
the cost of swapin is much higher than a fresh page allocation.  So a
differentiation between pte_none vs.  swapped out is important for those
usecases.

One downside of this patch is that it makes /proc/pid/smaps more
expensive for shmem mappings, as we consult the radix tree for each
pte_none entry, so the overal complexity is O(n*log(n)).  I have
measured this on a process that creates a 2GB mapping and dirties single
pages with a stride of 2MB, and time how long does it take to cat
/proc/pid/smaps of this process 100 times.

Private anonymous mapping:

real    0m0.949s
user    0m0.116s
sys     0m0.348s

Mapping of a /dev/shm/file:

real    0m3.831s
user    0m0.180s
sys     0m3.212s

The difference is rather substantial, so the next patch will reduce the
cost for shared or read-only mappings.

In a less controlled experiment, I've gathered pids of processes on my
desktop that have either '/dev/shm/*' or 'SYSV*' in smaps.  This
included the Chrome browser and some KDE processes.  Again, I've run cat
/proc/pid/smaps on each 100 times.

Before this patch:

real    0m9.050s
user    0m0.518s
sys     0m8.066s

After this patch:

real    0m9.221s
user    0m0.541s
sys     0m8.187s

This suggests low impact on average systems.

Note that this patch doesn't attempt to adjust the SwapPss field for
shmem mappings, which would need extra work to determine who else could
have the pages mapped.  Thus the value stays zero except for COWed
swapped out pages in a shmem mapping, which are accounted as usual.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Vlastimil Babka bf9683d699 mm, documentation: clarify /proc/pid/status VmSwap limitations for shmem
This series is based on Jerome Marchand's [1] so let me quote the first
paragraph from there:

There are several shortcomings with the accounting of shared memory
(sysV shm, shared anonymous mapping, mapping to a tmpfs file).  The
values in /proc/<pid>/status and statm don't allow to distinguish
between shmem memory and a shared mapping to a regular file, even though
their implications on memory usage are quite different: at reclaim, file
mapping can be dropped or written back on disk while shmem needs a place
in swap.  As for shmem pages that are swapped-out or in swap cache, they
aren't accounted at all.

The original motivation for myself is that a customer found (IMHO
rightfully) confusing that e.g.  top output for process swap usage is
unreliable with respect to swapped out shmem pages, which are not
accounted for.

The fundamental difference between private anonymous and shmem pages is
that the latter has PTE's converted to pte_none, and not swapents.  As
such, they are not accounted to the number of swapents visible e.g.  in
/proc/pid/status VmSwap row.  It might be theoretically possible to use
swapents when swapping out shmem (without extra cost, as one has to
change all mappers anyway), and on swap in only convert the swapent for
the faulting process, leaving swapents in other processes until they
also fault (so again no extra cost).  But I don't know how many
assumptions this would break, and it would be too disruptive change for
a relatively small benefit.

Instead, my approach is to document the limitation of VmSwap, and
provide means to determine the swap usage for shmem areas for those who
are interested and willing to pay the price, using /proc/pid/smaps.
Because outside of ipcs, I don't think it's possible to currently to
determine the usage at all.  The previous patchset [1] did introduce new
shmem-specific fields into smaps output, and functions to determine the
values.  I take a simpler approach, noting that smaps output already has
a "Swap: X kB" line, where currently X == 0 always for shmem areas.  I
think we can just consider this a bug and provide the proper value by
consulting the radix tree, as e.g.  mincore_page() does.  In the patch
changelog I explain why this is also not perfect (and cannot be without
swapents), but still arguably much better than showing a 0.

The last two patches are adapted from Jerome's patchset and provide a
VmRSS breakdown to RssAnon, RssFile and RssShm in /proc/pid/status.
Hugh noted that this is a welcome addition, and I agree that it might
help e.g.  debugging process memory usage at albeit non-zero, but still
rather low cost of extra per-mm counter and some page flag checks.

[1] http://lwn.net/Articles/611966/

This patch (of 6):

The documentation for /proc/pid/status does not mention that the value
of VmSwap counts only swapped out anonymous private pages, and not
swapped out pages of the underlying shmem objects (for shmem mappings).
This is not obvious, so document this limitation.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Yaowei Bai 5b80287a65 mm/mmzone.c: memmap_valid_within() can be boolean
Make memmap_valid_within return bool due to this particular function
only using either one or zero as its return value.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Geliang Tang 6219c2a2ec mm/vmalloc.c: use list_{next,first}_entry
To make the intention clearer, use list_{next,first}_entry instead of
list_entry.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Michal Hocko 33d5310306 mm/page_alloc.c: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim
__alloc_pages_slowpath is looping over ALLOC_NO_WATERMARKS requests if
__GFP_NOFAIL is requested.  This is fragile because we are basically
relying on somebody else to make the reclaim (be it the direct reclaim
or OOM killer) for us.  The caller might be holding resources (e.g.
locks) which block other other reclaimers from making any progress for
example.  Remove the retry loop and rely on __alloc_pages_slowpath to
invoke all allowed reclaim steps and retry logic.

We have to be careful about __GFP_NOFAIL allocations from the
PF_MEMALLOC context even though this is a very bad idea to begin with
because no progress can be gurateed at all.  We shouldn't break the
__GFP_NOFAIL semantic here though.  It could be argued that this is
essentially GFP_NOWAIT context which we do not support but PF_MEMALLOC
is much harder to check for existing users because they might happen
deep down the code path performed much later after setting the flag so
we cannot really rule out there is no kernel path triggering this
combination.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Michal Hocko fde82aaa73 mm/page_alloc.c: get rid of __alloc_pages_high_priority()
__alloc_pages_high_priority doesn't do anything special other than it
calls get_page_from_freelist and loops around GFP_NOFAIL allocation
until it succeeds.  It would be better if the first part was done in
__alloc_pages_slowpath where we modify the zonelist because this would
be easier to read and understand.  Opencoding the function into its only
caller allows to simplify it a bit as well.

This patch doesn't introduce any functional changes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Yaowei Bai c00eb15a89 mm/zonelist: enumerate zonelists array index
Hardcoding index to zonelists array in gfp_zonelist() is not a good
idea, let's enumerate it to improve readability.

No functional change.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
[n-horiguchi@ah.jp.nec.com: fix warning in comparing enumerator]
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Yaowei Bai 06640290bf include/linux/mmzone.h: remove unused is_unevictable_lru()
Since commit a0b8cab3b9 ("mm: remove lru parameter from
__pagevec_lru_add and remove parts of pagevec API") there's no
user of this function anymore, so remove it.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Yaowei Bai b4ad0c7e00 mm/memblock.c: memblock_is_memory()/reserved() can be boolean
Make memblock_is_memory() and memblock_is_reserved return bool to
improve readability due to these particular functions only using either
one or zero as their return value.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Yaowei Bai 719ff32162 include/linux/hugetlb.h: is_file_hugepages() can be boolean
Make is_file_hugepages() return bool to improve readability due to this
particular function only using either one or zero as its return value.

This patch also removed the if condition to make is_file_hugepages
return directly.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
yalin wang ba5e957943 mm: change mm_vmscan_lru_shrink_inactive() proto types
Move node_id zone_idx shrink flags into trace function, so thay we don't
need caculate these args if the trace is disabled, and will make this
function have less arguments.

Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Joonsoo Kim 8ef5849fa8 mm/cma: always check which page caused allocation failure
Now, we have tracepoint in test_pages_isolated() to notify pfn which
cannot be isolated.  But, in alloc_contig_range(), some error path
doesn't call test_pages_isolated() so it's still hard to know exact pfn
that causes allocation failure.

This patch change this situation by calling test_pages_isolated() in
almost error path.  In allocation failure case, some overhead is added
by this change, but, allocation failure is really rare event so it would
not matter.

In fatal signal pending case, we don't call test_pages_isolated()
because this failure is intentional one.

There was a bogus outer_start problem due to unchecked buddy order and
this patch also fix it.  Before this patch, it didn't matter, because
end result is same thing.  But, after this patch, tracepoint will report
failed pfn so it should be accurate.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Joonsoo Kim 0f0848e511 mm/page_isolation.c: add new tracepoint, test_pages_isolated
cma allocation should be guranteeded to succeed.  But sometimes it can
fail in the current implementation.  To track down the problem, we need
to know which page is problematic and this new tracepoint will report
it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00