Commit Graph

500735 Commits

Author SHA1 Message Date
Vladimir Davydov dbcf73e26c memcg: rename some cache id related variables
memcg_limited_groups_array_size, which defines the size of memcg_caches
arrays, sounds rather cumbersome.  Also it doesn't point anyhow that
it's related to kmem/caches stuff.  So let's rename it to
memcg_nr_cache_ids.  It's concise and points us directly to
memcg_cache_id.

Also, rename kmem_limited_groups to memcg_cache_ida.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:09 -08:00
Vladimir Davydov cb731d6c62 vmscan: per memory cgroup slab shrinkers
This patch adds SHRINKER_MEMCG_AWARE flag.  If a shrinker has this flag
set, it will be called per memory cgroup.  The memory cgroup to scan
objects from is passed in shrink_control->memcg.  If the memory cgroup
is NULL, a memcg aware shrinker is supposed to scan objects from the
global list.  Unaware shrinkers are only called on global pressure with
memcg=NULL.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:09 -08:00
Vladimir Davydov 4101b62435 fs: consolidate {nr,free}_cached_objects args in shrink_control
We are going to make FS shrinkers memcg-aware.  To achieve that, we will
have to pass the memcg to scan to the nr_cached_objects and
free_cached_objects VFS methods, which currently take only the NUMA node
to scan.  Since the shrink_control structure already holds the node, and
the memcg to scan will be added to it when we introduce memcg-aware
vmscan, let us consolidate the methods' arguments in this structure to
keep things clean.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Vladimir Davydov 503c358cf1 list_lru: introduce list_lru_shrink_{count,walk}
Kmem accounting of memcg is unusable now, because it lacks slab shrinker
support.  That means when we hit the limit we will get ENOMEM w/o any
chance to recover.  What we should do then is to call shrink_slab, which
would reclaim old inode/dentry caches from this cgroup.  This is what
this patch set is intended to do.

Basically, it does two things.  First, it introduces the notion of
per-memcg slab shrinker.  A shrinker that wants to reclaim objects per
cgroup should mark itself as SHRINKER_MEMCG_AWARE.  Then it will be
passed the memory cgroup to scan from in shrink_control->memcg.  For
such shrinkers shrink_slab iterates over the whole cgroup subtree under
the target cgroup and calls the shrinker for each kmem-active memory
cgroup.

Secondly, this patch set makes the list_lru structure per-memcg.  It's
done transparently to list_lru users - everything they have to do is to
tell list_lru_init that they want memcg-aware list_lru.  Then the
list_lru will automatically distribute objects among per-memcg lists
basing on which cgroup the object is accounted to.  This way to make FS
shrinkers (icache, dcache) memcg-aware we only need to make them use
memcg-aware list_lru, and this is what this patch set does.

As before, this patch set only enables per-memcg kmem reclaim when the
pressure goes from memory.limit, not from memory.kmem.limit.  Handling
memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
it is still unclear whether we will have this knob in the unified
hierarchy.

This patch (of 9):

NUMA aware slab shrinkers use the list_lru structure to distribute
objects coming from different NUMA nodes to different lists.  Whenever
such a shrinker needs to count or scan objects from a particular node,
it issues commands like this:

        count = list_lru_count_node(lru, sc->nid);
        freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                                   isolate_arg, &sc->nr_to_scan);

where sc is an instance of the shrink_control structure passed to it
from vmscan.

To simplify this, let's add special list_lru functions to be used by
shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
consolidate the nid and nr_to_scan arguments in the shrink_control
structure.

This will also allow us to avoid patching shrinkers that use list_lru
when we make shrink_slab() per-memcg - all we will have to do is extend
the shrink_control structure to include the target memcg and make
list_lru_shrink_{count,walk} handle this appropriately.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Mel Gorman 10c1045f28 mm: numa: avoid unnecessary TLB flushes when setting NUMA hinting entries
If a PTE or PMD is already marked NUMA when scanning to mark entries for
NUMA hinting then it is not necessary to update the entry and incur a TLB
flush penalty.  Avoid the avoidhead where possible.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Mel Gorman c0e7cad9f2 mm: numa: add paranoid check around pte_protnone_numa
pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are
complete.  Treating a real PROT_NONE PTE as a NUMA hinting fault is going
to result in strangeness so add a check for it.  BUG_ON looks like
overkill but if this is hit then it's a serious bug that could result in
corruption so do not even try recovering.  It would have been more
comprehensive to check VMA flags in pte_protnone_numa but it would have
made the API ugly just for a debugging check.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Mel Gorman c819f37e7e x86: mm: restore original pte_special check
Commit b38af4721f ("x86,mm: fix pte_special versus pte_numa") adjusted
the pte_special check to take into account that a special pte had
SPECIAL and neither PRESENT nor PROTNONE.  Now that NUMA hinting PTEs
are no longer modifying _PAGE_PRESENT it should be safe to restore the
original pte_special behaviour.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Mel Gorman e944fd67b6 mm: numa: do not trap faults on the huge zero page
Faults on the huge zero page are pointless and there is a BUG_ON to catch
them during fault time.  This patch reintroduces a check that avoids
marking the zero page PAGE_NONE.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Mel Gorman 21d9ee3eda mm: remove remaining references to NUMA hinting bits and helpers
This patch removes the NUMA PTE bits and associated helpers.  As a
side-effect it increases the maximum possible swap space on x86-64.

One potential source of problems is races between the marking of PTEs
PROT_NONE, NUMA hinting faults and migration.  It must be guaranteed that
a PTE being protected is not faulted in parallel, seen as a pte_none and
corrupting memory.  The base case is safe but transhuge has problems in
the past due to an different migration mechanism and a dependance on page
lock to serialise migrations and warrants a closer look.

task_work hinting update			parallel fault
------------------------			--------------
change_pmd_range
  change_huge_pmd
    __pmd_trans_huge_lock
      pmdp_get_and_clear
						__handle_mm_fault
						pmd_none
						  do_huge_pmd_anonymous_page
						  read? pmd_lock blocks until hinting complete, fail !pmd_none test
						  write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none
      pmd_modify
      set_pmd_at

task_work hinting update			parallel migration
------------------------			------------------
change_pmd_range
  change_huge_pmd
    __pmd_trans_huge_lock
      pmdp_get_and_clear
						__handle_mm_fault
						  do_huge_pmd_numa_page
						    migrate_misplaced_transhuge_page
						    pmd_lock waits for updates to complete, recheck pmd_same
      pmd_modify
      set_pmd_at

Both of those are safe and the case where a transhuge page is inserted
during a protection update is unchanged.  The case where two processes try
migrating at the same time is unchanged by this series so should still be
ok.  I could not find a case where we are accidentally depending on the
PTE not being cleared and flushed.  If one is missed, it'll manifest as
corruption problems that start triggering shortly after this series is
merged and only happen when NUMA balancing is enabled.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Mel Gorman 4d94246699 mm: convert p[te|md]_mknonnuma and remaining page table manipulations
With PROT_NONE, the traditional page table manipulation functions are
sufficient.

[andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
[akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Mel Gorman 842915f566 ppc64: add paranoid warnings for unexpected DSISR_PROTFAULT
ppc64 should not be depending on DSISR_PROTFAULT and it's unexpected if
they are triggered.  This patch adds warnings just in case they are being
accidentally depended upon.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Mel Gorman 8a0516ed8b mm: convert p[te|md]_numa users to p[te|md]_protnone_numa
Convert existing users of pte_numa and friends to the new helper.  Note
that the kernel is broken after this patch is applied until the other page
table modifiers are also altered.  This patch layout is to make review
easier.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Mel Gorman e7bb4b6d16 mm: add p[te|md] protnone helpers for use by NUMA balancing
This is a preparatory patch that introduces protnone helpers for automatic
NUMA balancing.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Mel Gorman 5d83306213 mm: numa: do not dereference pmd outside of the lock during NUMA hinting fault
Automatic NUMA balancing depends on being able to protect PTEs to trap a
fault and gather reference locality information.  Very broadly speaking
it would mark PTEs as not present and use another bit to distinguish
between NUMA hinting faults and other types of faults.  It was
universally loved by everybody and caused no problems whatsoever.  That
last sentence might be a lie.

This series is very heavily based on patches from Linus and Aneesh to
replace the existing PTE/PMD NUMA helper functions with normal change
protections.  I did alter and add parts of it but I consider them
relatively minor contributions.  At their suggestion, acked-bys are in
there but I've no problem converting them to Signed-off-by if requested.

AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh
for that.  I tested trinity under kvm-tool and passed and ran a few
other basic tests.  At the time of writing, only the short-lived tests
have completed but testing of V2 indicated that long-term testing had no
surprises.  In most cases I'm leaving out detail as it's not that
interesting.

specjbb single JVM: There was negligible performance difference in the
	benchmark itself for short runs. However, system activity is
	higher and interrupts are much higher over time -- possibly TLB
	flushes. Migrations are also higher. Overall, this is more overhead
	but considering the problems faced with the old approach I think
	we just have to suck it up and find another way of reducing the
	overhead.

specjbb multi JVM: Negligible performance difference to the actual benchmark
	but like the single JVM case, the system overhead is noticeably
	higher.  Again, interrupts are a major factor.

autonumabench: This was all over the place and about all that can be
	reasonably concluded is that it's different but not necessarily
	better or worse.

autonumabench
                                     3.18.0-rc5            3.18.0-rc5
                                 mmotm-20141119         protnone-v3r3
User    NUMA01               32380.24 (  0.00%)    21642.92 ( 33.16%)
User    NUMA01_THEADLOCAL    22481.02 (  0.00%)    22283.22 (  0.88%)
User    NUMA02                3137.00 (  0.00%)     3116.54 (  0.65%)
User    NUMA02_SMT            1614.03 (  0.00%)     1543.53 (  4.37%)
System  NUMA01                 322.97 (  0.00%)     1465.89 (-353.88%)
System  NUMA01_THEADLOCAL       91.87 (  0.00%)       49.32 ( 46.32%)
System  NUMA02                  37.83 (  0.00%)       14.61 ( 61.38%)
System  NUMA02_SMT               7.36 (  0.00%)        7.45 ( -1.22%)
Elapsed NUMA01                 716.63 (  0.00%)      599.29 ( 16.37%)
Elapsed NUMA01_THEADLOCAL      553.98 (  0.00%)      539.94 (  2.53%)
Elapsed NUMA02                  83.85 (  0.00%)       83.04 (  0.97%)
Elapsed NUMA02_SMT              86.57 (  0.00%)       79.15 (  8.57%)
CPU     NUMA01                4563.00 (  0.00%)     3855.00 ( 15.52%)
CPU     NUMA01_THEADLOCAL     4074.00 (  0.00%)     4136.00 ( -1.52%)
CPU     NUMA02                3785.00 (  0.00%)     3770.00 (  0.40%)
CPU     NUMA02_SMT            1872.00 (  0.00%)     1959.00 ( -4.65%)

System CPU usage of NUMA01 is worse but it's an adverse workload on this
machine so I'm reluctant to conclude that it's a problem that matters.  On
the other workloads that are sensible on this machine, system CPU usage is
great.  Overall time to complete the benchmark is comparable

          3.18.0-rc5  3.18.0-rc5
        mmotm-20141119protnone-v3r3
User        59612.50    48586.44
System        460.22     1537.45
Elapsed      1442.20     1304.29

NUMA alloc hit                 5075182     5743353
NUMA alloc miss                      0           0
NUMA interleave hit                  0           0
NUMA alloc local               5075174     5743339
NUMA base PTE updates        637061448   443106883
NUMA huge PMD updates          1243434      864747
NUMA page range updates     1273699656   885857347
NUMA hint faults               1658116     1214277
NUMA hint local faults          959487      754113
NUMA hint local percent             57          62
NUMA pages migrated            5467056    61676398

The NUMA pages migrated look terrible but when I looked at a graph of the
activity over time I see that the massive spike in migration activity was
during NUMA01.  This correlates with high system CPU usage and could be
simply down to bad luck but any modifications that affect that workload
would be related to scan rates and migrations, not the protection
mechanism.  For all other workloads, migration activity was comparable.

Overall, headline performance figures are comparable but the overhead is
higher, mostly in interrupts.  To some extent, higher overhead from this
approach was anticipated but not to this degree.  It's going to be
necessary to reduce this again with a separate series in the future.  It's
still worth going ahead with this series though as it's likely to avoid
constant headaches with Xen and is probably easier to maintain.

This patch (of 10):

A transhuge NUMA hinting fault may find the page is migrating and should
wait until migration completes.  The check is race-prone because the pmd
is deferenced outside of the page lock and while the race is tiny, it'll
be larger if the PMD is cleared while marking PMDs for hinting fault.
This patch closes the race.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Linus Torvalds 87c9172f71 Merge tag 'jfs-3.20' of git://github.com/kleikamp/linux-shaggy
Pull jfs updates from David Kleikamp:
 "A couple cleanups for jfs"

* tag 'jfs-3.20' of git://github.com/kleikamp/linux-shaggy:
  jfs: Deletion of an unnecessary check before the function call "unload_nls"
  jfs: get rid of homegrown endianness helpers
2015-02-12 10:58:59 -08:00
Linus Torvalds 61845143fe Merge branch 'for-3.20' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 "The main change is the pNFS block server support from Christoph, which
  allows an NFS client connected to shared disk to do block IO to the
  shared disk in place of NFS reads and writes.  This also requires xfs
  patches, which should arrive soon through the xfs tree, barring
  unexpected problems.  Support for other filesystems is also possible
  if there's interest.

  Thanks also to Chuck Lever for continuing work to get NFS/RDMA into
  shape"

* 'for-3.20' of git://linux-nfs.org/~bfields/linux: (32 commits)
  nfsd: default NFSv4.2 to on
  nfsd: pNFS block layout driver
  exportfs: add methods for block layout exports
  nfsd: add trace events
  nfsd: update documentation for pNFS support
  nfsd: implement pNFS layout recalls
  nfsd: implement pNFS operations
  nfsd: make find_any_file available outside nfs4state.c
  nfsd: make find/get/put file available outside nfs4state.c
  nfsd: make lookup/alloc/unhash_stid available outside nfs4state.c
  nfsd: add fh_fsid_match helper
  nfsd: move nfsd_fh_match to nfsfh.h
  fs: add FL_LAYOUT lease type
  fs: track fl_owner for leases
  nfs: add LAYOUT_TYPE_MAX enum value
  nfsd: factor out a helper to decode nfstime4 values
  sunrpc/lockd: fix references to the BKL
  nfsd: fix year-2038 nfs4 state problem
  svcrdma: Handle additional inline content
  svcrdma: Move read list XDR round-up logic
  ...
2015-02-12 10:39:41 -08:00
Linus Torvalds a26be149fa Merge tag 'iommu-updates-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
 "This time with:

   - Generic page-table framework for ARM IOMMUs using the LPAE
     page-table format, ARM-SMMU and Renesas IPMMU make use of it
     already.

   - Break out the IO virtual address allocator from the Intel IOMMU so
     that it can be used by other DMA-API implementations too.  The
     first user will be the ARM64 common DMA-API implementation for
     IOMMUs

   - Device tree support for Renesas IPMMU

   - Various fixes and cleanups all over the place"

* tag 'iommu-updates-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (36 commits)
  iommu/amd: Convert non-returned local variable to boolean when relevant
  iommu: Update my email address
  iommu/amd: Use wait_event in put_pasid_state_wait
  iommu/amd: Fix amd_iommu_free_device()
  iommu/arm-smmu: Avoid build warning
  iommu/fsl: Various cleanups
  iommu/fsl: Use %pa to print phys_addr_t
  iommu/omap: Print phys_addr_t using %pa
  iommu: Make more drivers depend on COMPILE_TEST
  iommu/ipmmu-vmsa: Fix IOMMU lookup when multiple IOMMUs are registered
  iommu: Disable on !MMU builds
  iommu/fsl: Remove unused fsl_of_pamu_ids[]
  iommu/fsl: Fix section mismatch
  iommu/ipmmu-vmsa: Use the ARM LPAE page table allocator
  iommu: Fix trace_map() to report original iova and original size
  iommu/arm-smmu: add support for iova_to_phys through ATS1PR
  iopoll: Introduce memory-mapped IO polling macros
  iommu/arm-smmu: don't touch the secure STLBIALL register
  iommu/arm-smmu: make use of generic LPAE allocator
  iommu: io-pgtable-arm: add non-secure quirk
  ...
2015-02-12 09:16:56 -08:00
Linus Torvalds cdd305454e Merge tag 'devicetree-for-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull DeviceTree changes from Rob Herring:

 - DT unittests for I2C probing and overlays from Pantelis Antoniou

 - Remove DT unittest dependency on OF_DYNAMIC from Gaurav Minocha

 - Add Tegra compatible strings missing for newer parts from Paul
   Walmsley

 - Various vendor prefix additions

* tag 'devicetree-for-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  of: Add vendor prefix for OmniVision Technologies
  of: Use ovti for Omnivision
  of: Add vendor prefix for Truly Semiconductors Limited
  of: Add vendor prefix for Himax Technologies Inc.
  of/fdt: fix sparse warning
  of: unitest: Add I2C overlay unit tests.
  Documentation: DT: document compatible string existence requirement
  Documentation: DT bindings: add nvidia, tegra132-denver compatible string
  Documentation: DT bindings: add more Tegra chip compatible strings
  of: EXPORT_SYMBOL_GPL of_property_read_u64_array
  of: Fix brace position for struct of_device_id definition
  of/unittest: Remove obsolete code
  dt-bindings: use isil prefix for Intersil in vendor-prefixes.txt
  Add AD Holdings Plc. to vendor-prefixes.
  dt-bindings: Add Silicon Mitus vendor prefix
  Removes OF_UNITTEST dependency on OF_DYNAMIC config symbol
  pinctrl: fix up device tree bindings
  DT: Vendors: Add Everspin
  doc: add bindings document for altera fpga manager
  drivers: of: Export of_reserved_mem_device_{init,release}
2015-02-12 08:58:43 -08:00
Linus Torvalds 42cf0f203e Merge branch 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm
Pull ARM updates from Russell King:

 - clang assembly fixes from Ard

 - optimisations and cleanups for Aurora L2 cache support

 - efficient L2 cache support for secure monitor API on Exynos SoCs

 - debug menu cleanup from Daniel Thompson to allow better behaviour for
   multiplatform kernels

 - StrongARM SA11x0 conversion to irq domains, and pxa_timer

 - kprobes updates for older ARM CPUs

 - move probes support out of arch/arm/kernel to arch/arm/probes

 - add inline asm support for the rbit (reverse bits) instruction

 - provide an ARM mode secondary CPU entry point (for Qualcomm CPUs)

 - remove the unused ARMv3 user access code

 - add driver_override support to AMBA Primecell bus

* 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (55 commits)
  ARM: 8256/1: driver coamba: add device binding path 'driver_override'
  ARM: 8301/1: qcom: Use secondary_startup_arm()
  ARM: 8302/1: Add a secondary_startup that assumes ARM mode
  ARM: 8300/1: teach __asmeq that r11 == fp and r12 == ip
  ARM: kprobes: Fix compilation error caused by superfluous '*'
  ARM: 8297/1: cache-l2x0: optimize aurora range operations
  ARM: 8296/1: cache-l2x0: clean up aurora cache handling
  ARM: 8284/1: sa1100: clear RCSR_SMR on resume
  ARM: 8283/1: sa1100: collie: clear PWER register on machine init
  ARM: 8282/1: sa1100: use handle_domain_irq
  ARM: 8281/1: sa1100: move GPIO-related IRQ code to gpio driver
  ARM: 8280/1: sa1100: switch to irq_domain_add_simple()
  ARM: 8279/1: sa1100: merge both GPIO irqdomains
  ARM: 8278/1: sa1100: split irq handling for low GPIOs
  ARM: 8291/1: replace magic number with PAGE_SHIFT macro in fixup_pv code
  ARM: 8290/1: decompressor: fix a wrong comment
  ARM: 8286/1: mm: Fix dma_contiguous_reserve comment
  ARM: 8248/1: pm: remove outdated comment
  ARM: 8274/1: Fix DEBUG_LL for multi-platform kernels (without PL01X)
  ARM: 8273/1: Seperate DEBUG_UART_PHYS from DEBUG_LL on EP93XX
  ...
2015-02-12 08:51:56 -08:00
Linus Torvalds a2f0bb03f7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32
Pull AVR32 update from Hans-Christian Egtvedt.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32:
  avr32: update all default configurations
  avr32: remove fake at91 cpu identification
  avr32: wire up missing syscalls
2015-02-12 08:47:09 -08:00
Linus Torvalds 41cbc01f6e Merge tag 'trace-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
 "The updates included in this pull request for ftrace are:

   o Several clean ups to the code

     One such clean up was to convert to 64 bit time keeping, in the
     ring buffer benchmark code.

   o Adding of __print_array() helper macro for TRACE_EVENT()

   o Updating the sample/trace_events/ to add samples of different ways
     to make trace events.  Lots of features have been added since the
     sample code was made, and these features are mostly unknown.
     Developers have been making their own hacks to do things that are
     already available.

   o Performance improvements.  Most notably, I found a performance bug
     where a waiter that is waiting for a full page from the ring buffer
     will see that a full page is not available, and go to sleep.  The
     sched event caused by it going to sleep would cause it to wake up
     again.  It would see that there was still not a full page, and go
     back to sleep again, and that would wake it up again, until finally
     it would see a full page.  This change has been marked for stable.

  Other improvements include removing global locks from fast paths"

* tag 'trace-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Do not wake up a splice waiter when page is not full
  tracing: Fix unmapping loop in tracing_mark_write
  tracing: Add samples of DECLARE_EVENT_CLASS() and DEFINE_EVENT()
  tracing: Add TRACE_EVENT_FN example
  tracing: Add TRACE_EVENT_CONDITION sample
  tracing: Update the TRACE_EVENT fields available in the sample code
  tracing: Separate out initializing top level dir from instances
  tracing: Make tracing_init_dentry_tr() static
  trace: Use 64-bit timekeeping
  tracing: Add array printing helper
  tracing: Remove newline from trace_printk warning banner
  tracing: Use IS_ERR() check for return value of tracing_init_dentry()
  tracing: Remove unneeded includes of debugfs.h and fs.h
  tracing: Remove taking of trace_types_lock in pipe files
  tracing: Add ref count to tracer for when they are being read by pipe
2015-02-12 08:37:41 -08:00
Linus Torvalds 12df4289ee Merge tag 'ktest-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest
Pull ktest updates from Steven Rostedt:
 "The following ktest updates were done:

   o Added timings to various parts of the test (build, install, boot,
     tests) and report them so that the users can keep track of changes.

   o Josh Poimboeuf fixed the console output to work better with virtual
     machine targets.

   o Various clean ups and fixes"

* tag 'ktest-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
  ktest: Place quotes around item variable
  ktest: Cleanup terminal on dodie() failure
  ktest: Print build,install,boot,test times at success and failure
  ktest: Enable user input to the console
  ktest: Give console process a dedicated tty
  ktest: Rename start_monitor_and_boot to start_monitor_and_install
  ktest: Show times for build, install, boot and test
  ktest: Restore tty settings after closing console
  ktest: Add timings for commands
2015-02-12 08:36:38 -08:00
Linus Torvalds 8cc748aa76 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security layer updates from James Morris:
 "Highlights:

   - Smack adds secmark support for Netfilter
   - /proc/keys is now mandatory if CONFIG_KEYS=y
   - TPM gets its own device class
   - Added TPM 2.0 support
   - Smack file hook rework (all Smack users should review this!)"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (64 commits)
  cipso: don't use IPCB() to locate the CIPSO IP option
  SELinux: fix error code in policydb_init()
  selinux: add security in-core xattr support for pstore and debugfs
  selinux: quiet the filesystem labeling behavior message
  selinux: Remove unused function avc_sidcmp()
  ima: /proc/keys is now mandatory
  Smack: Repair netfilter dependency
  X.509: silence asn1 compiler debug output
  X.509: shut up about included cert for silent build
  KEYS: Make /proc/keys unconditional if CONFIG_KEYS=y
  MAINTAINERS: email update
  tpm/tpm_tis: Add missing ifdef CONFIG_ACPI for pnp_acpi_device
  smack: fix possible use after frees in task_security() callers
  smack: Add missing logging in bidirectional UDS connect check
  Smack: secmark support for netfilter
  Smack: Rework file hooks
  tpm: fix format string error in tpm-chip.c
  char/tpm/tpm_crb: fix build error
  smack: Fix a bidirectional UDS connect check typo
  smack: introduce a special case for tmpfs in smack_d_instantiate()
  ...
2015-02-11 20:25:11 -08:00
Linus Torvalds 7184487f14 Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit
Pull audit fix from Paul Moore:
 "Just one patch from the audit tree for v3.20, and a very minor one at
  that.

  The patch simply removes an old, unused field from the audit_krule
  structure, a private audit-only struct.  In audit related news, we did
  a proper overhaul of the audit pathname code and removed the nasty
  getname()/putname() hacks for audit, you should see those patches in
  Al's vfs tree if you haven't already.

  That's it for audit this time, let's hope for a quiet -rcX series"

* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
  audit: remove vestiges of vers_ops
2015-02-11 20:07:47 -08:00
Rob Herring 3c3c8e3618 Merge remote-tracking branch 'grant/devicetree/next' into for-next 2015-02-11 21:28:45 -06:00