Commit Graph

222 Commits

Author SHA1 Message Date
Linus Torvalds
988adfdffd Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
Pull drm updates from Dave Airlie:
 "Highlights:

   - AMD KFD driver merge

     This is the AMD HSA interface for exposing a lowlevel interface for
     GPGPU use.  They have an open source userspace built on top of this
     interface, and the code looks as good as it was going to get out of
     tree.

   - Initial atomic modesetting work

     The need for an atomic modesetting interface to allow userspace to
     try and send a complete set of modesetting state to the driver has
     arisen, and been suffering from neglect this past year.  No more,
     the start of the common code and changes for msm driver to use it
     are in this tree.  Ongoing work to get the userspace ioctl finished
     and the code clean will probably wait until next kernel.

   - DisplayID 1.3 and tiled monitor exposed to userspace.

     Tiled monitor property is now exposed for userspace to make use of.

   - Rockchip drm driver merged.

   - imx gpu driver moved out of staging

  Other stuff:

   - core:
        panel - MIPI DSI + new panels.
        expose suggested x/y properties for virtual GPUs

   - i915:
        Initial Skylake (SKL) support
        gen3/4 reset work
        start of dri1/ums removal
        infoframe tracking
        fixes for lots of things.

   - nouveau:
        tegra k1 voltage support
        GM204 modesetting support
        GT21x memory reclocking work

   - radeon:
        CI dpm fixes
        GPUVM improvements
        Initial DPM fan control

   - rcar-du:
        HDMI support added
        removed some support for old boards
        slave encoder driver for Analog Devices adv7511

   - exynos:
        Exynos4415 SoC support

   - msm:
        a4xx gpu support
        atomic helper conversion

   - tegra:
        iommu support
        universal plane support
        ganged-mode DSI support

   - sti:
        HDMI i2c improvements

   - vmwgfx:
        some late fixes.

   - qxl:
        use suggested x/y properties"

* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits)
  drm: sti: fix module compilation issue
  drm/i915: save/restore GMBUS freq across suspend/resume on gen4
  drm: sti: correctly cleanup CRTC and planes
  drm: sti: add HQVDP plane
  drm: sti: add cursor plane
  drm: sti: enable auxiliary CRTC
  drm: sti: fix delay in VTG programming
  drm: sti: prepare sti_tvout to support auxiliary crtc
  drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
  drm: sti: fix hdmi avi infoframe
  drm: sti: remove event lock while disabling vblank
  drm: sti: simplify gdp code
  drm: sti: clear all mixer control
  drm: sti: remove gpio for HDMI hot plug detection
  drm: sti: allow to change hdmi ddc i2c adapter
  drm/doc: Document drm_add_modes_noedid() usage
  drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
  drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
  drm: Zero out DRM object memory upon cleanup
  drm/i915/bdw: Fix the write setting up the WIZ hashing mode
  ...
2014-12-15 15:52:01 -08:00
Linus Torvalds
27afc5dbda Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Martin Schwidefsky:
 "The most notable change for this pull request is the ftrace rework
  from Heiko.  It brings a small performance improvement and the ground
  work to support a new gcc option to replace the mcount blocks with a
  single nop.

  Two new s390 specific system calls are added to emulate user space
  mmio for PCI, an artifact of the how PCI memory is accessed.

  Two patches for the memory management with changes to common code.
  For KVM mm_forbids_zeropage is added which disables the empty zero
  page for an mm that is used by a KVM process.  And an optimization,
  pmdp_get_and_clear_full is added analog to ptep_get_and_clear_full.

  Some micro optimization for the cmpxchg and the spinlock code.

  And as usual bug fixes and cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (46 commits)
  s390/cputime: fix 31-bit compile
  s390/scm_block: make the number of reqs per HW req configurable
  s390/scm_block: handle multiple requests in one HW request
  s390/scm_block: allocate aidaw pages only when necessary
  s390/scm_block: use mempool to manage aidaw requests
  s390/eadm: change timeout value
  s390/mm: fix memory leak of ptlock in pmd_free_tlb
  s390: use local symbol names in entry[64].S
  s390/ptrace: always include vector registers in core files
  s390/simd: clear vector register pointer on fork/clone
  s390: translate cputime magic constants to macros
  s390/idle: convert open coded idle time seqcount
  s390/idle: add missing irq off lockdep annotation
  s390/debug: avoid function call for debug_sprintf_*
  s390/kprobes: fix instruction copy for out of line execution
  s390: remove diag 44 calls from cpu_relax()
  s390/dasd: retry partition detection
  s390/dasd: fix list corruption for sleep_on requests
  s390/dasd: fix infinite term I/O loop
  s390/dasd: remove unused code
  ...
2014-12-11 17:30:55 -08:00
Kirill A. Shutemov
e544a4e74e thp: do not mark zero-page pmd write-protected explicitly
Zero pages can be used only in anonymous mappings, which never have
writable vma->vm_page_prot: see protection_map in mm/mmap.c and __PX1X
definitions.

Let's drop redundant pmd_wrprotect() in set_huge_zero_page().

Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:08 -08:00
David Rientjes
6d50e60cd2 mm, thp: fix collapsing of hugepages on madvise
If an anonymous mapping is not allowed to fault thp memory and then
madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never
collapse this memory into thp memory.

This occurs because the madvise(2) handler for thp, hugepage_madvise(),
clears VM_NOHUGEPAGE on the stack and it isn't stored in vma->vm_flags
until the final action of madvise_behavior().  This causes the
khugepaged_enter_vma_merge() to be a no-op in hugepage_madvise() when
the vma had previously had VM_NOHUGEPAGE set.

Fix this by passing the correct vma flags to the khugepaged mm slot
handler.  There's no chance khugepaged can run on this vma until after
madvise_behavior() returns since we hold mm->mmap_sem.

It would be possible to clear VM_NOHUGEPAGE directly from vma->vm_flags
in hugepage_advise(), but I didn't want to introduce special case
behavior into madvise_behavior().  I think it's best to just let it
always set vma->vm_flags itself.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Suleiman Souhlal <suleiman@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-29 16:33:14 -07:00
Yu Zhao
5ddacbe92b mm: free compound page with correct order
Compound page should be freed by put_page() or free_pages() with correct
order.  Not doing so will cause tail pages leaked.

The compound order can be obtained by compound_order() or use
HPAGE_PMD_ORDER in our case.  Some people would argue the latter is
faster but I prefer the former which is more general.

This bug was observed not just on our servers (the worst case we saw is
11G leaked on a 48G machine) but also on our workstations running Ubuntu
based distro.

  $ cat /proc/vmstat  | grep thp_zero_page_alloc
  thp_zero_page_alloc 55
  thp_zero_page_alloc_failed 0

This means there is (thp_zero_page_alloc - 1) * (2M - 4K) memory leaked.

Fixes: 97ae17497e ("thp: implement refcounting for huge zero page")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: <stable@vger.kernel.org>	[3.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-29 16:33:14 -07:00
Martin Schwidefsky
fcbe08d66f s390/mm: pmdp_get_and_clear_full optimization
Analog to ptep_get_and_clear_full define a variant of the
pmpd_get_and_clear primitive which gets the full hint from the
mmu_gather struct. This allows s390 to avoid a costly instruction
when destroying an address space.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2014-10-27 13:27:30 +01:00
Dominik Dingel
593befa6ab mm: introduce mm_forbids_zeropage function
Add a new function stub to allow architectures to disable for
an mm_structthe backing of non-present, anonymous pages with
read-only empty zero pages.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2014-10-27 13:27:24 +01:00
Sasha Levin
96dad67ff2 mm: use VM_BUG_ON_MM where possible
Dump the contents of the relevant struct_mm when we hit the bug condition.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:58 -04:00
Sasha Levin
81d1b09c6b mm: convert a few VM_BUG_ON callers to VM_BUG_ON_VMA
Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract
more information when they trigger.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:57 -04:00
Vlastimil Babka
8b1645685a mm, THP: don't hold mmap_sem in khugepaged when allocating THP
When allocating huge page for collapsing, khugepaged currently holds
mmap_sem for reading on the mm where collapsing occurs.  Afterwards the
read lock is dropped before write lock is taken on the same mmap_sem.

Holding mmap_sem during whole huge page allocation is therefore useless,
the vma needs to be rechecked after taking the write lock anyway.
Furthemore, huge page allocation might involve a rather long sync
compaction, and thus block any mmap_sem writers and i.e.  affect workloads
that perform frequent m(un)map or mprotect oterations.

This patch simply releases the read lock before allocating a huge page.
It also deletes an outdated comment that assumed vma must be stable, as it
was using alloc_hugepage_vma().  This is no longer true since commit
9f1b868a13 ("mm: thp: khugepaged: add policy for finding target node").

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:53 -04:00
Mel Gorman
abc40bd2ee mm: numa: Do not mark PTEs pte_numa when splitting huge pages
This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
NUMA type from the pmd to the pte"). If a huge page is being split due
a protection change and the tail will be in a PROT_NONE vma then NUMA
hinting PTEs are temporarily created in the protected VMA.

 VM_RW|VM_PROTNONE
|-----------------|
      ^
      split here

In the specific case above, it should get fixed up by change_pte_range()
but there is a window of opportunity for weirdness to happen. Similarly,
if a huge page is shrunk and split during a protection update but before
pmd_numa is cleared then a pte_numa can be left behind.

Instead of adding complexity trying to deal with the case, this patch
will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
will not be triggered which is marginal in comparison to the complexity
in dealing with the corner cases during THP split.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-02 11:57:18 -07:00
Johannes Weiner
00501b531c mm: memcontrol: rewrite charge API
These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages.  This drastically simplifies the code and
reduces charging and uncharging overhead.  The most expensive part of
charging and uncharging is the page_cgroup bit spinlock, which is removed
entirely after this series.

Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
 executing in the root memcg).  Before:

    15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
    13.31%              cat  [kernel.kallsyms]   [k] memset
    11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
     4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
     2.38%              cat  [kernel.kallsyms]   [k] put_page
     2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
     2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
     1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
     1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
     1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn

After:

    15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
    13.48%           cat  [kernel.kallsyms]   [k] memset
    11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
     3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
     2.46%           cat  [kernel.kallsyms]   [k] put_page
     2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
     1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
     1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
     1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
     1.30%           cat  [kernel.kallsyms]   [k] kfree

As you can see, the memcg footprint has shrunk quite a bit.

   text    data     bss     dec     hex filename
  37970    9892     400   48262    bc86 mm/memcontrol.o.old
  35239    9892     400   45531    b1db mm/memcontrol.o

This patch (of 4):

The memcg charge API charges pages before they are rmapped - i.e.  have an
actual "type" - and so every callsite needs its own set of charge and
uncharge functions to know what type is being operated on.  Worse,
uncharge has to happen from a context that is still type-specific, rather
than at the end of the page's lifetime with exclusive access, and so
requires a lot of synchronization.

Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:

  mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
  pages from the memcg if necessary.

  mem_cgroup_commit_charge() commits the page to the charge once it
  has a valid page->mapping and PageAnon() reliably tells the type.

  mem_cgroup_cancel_charge() aborts the transaction.

This reduces the charge API and enables subsequent patches to
drastically simplify uncharging.

As pages need to be committed after rmap is established but before they
are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
additions again.  Revive lru_cache_add_active_or_unevictable().

[hughd@google.com: fix shmem_unuse]
[hughd@google.com: Add comments on the private use of -EAGAIN]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:17 -07:00
David Rientjes
14a4e2141e mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode
Commit 9f1b868a13 ("mm: thp: khugepaged: add policy for finding target
node") improved the previous khugepaged logic which allocated a
transparent hugepages from the node of the first page being collapsed.

However, it is still possible to collapse pages to remote memory which
may suffer from additional access latency.  With the current policy, it
is possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed
remotely if the majority are allocated from that node.

When zone_reclaim_mode is enabled, it means the VM should make every
attempt to allocate locally to prevent NUMA performance degradation.  In
this case, we do not want to collapse hugepages to remote nodes that
would suffer from increased access latency.  Thus, when
zone_reclaim_mode is enabled, only allow collapsing to nodes with
RECLAIM_DISTANCE or less.

There is no functional change for systems that disable
zone_reclaim_mode.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:20 -07:00
Johannes Weiner
d51d885bbb mm: huge_memory: use GFP_TRANSHUGE when charging huge pages
Transparent huge page charges prefer falling back to regular pages
rather than spending a lot of time in direct reclaim.

Desired reclaim behavior is usually declared in the gfp mask, but THP
charges use GFP_KERNEL and then rely on the fact that OOM is disabled
for THP charges, and that OOM-disabled charges don't retry reclaim.
Needless to say, this is anything but obvious and quite error prone.

Convert THP charges to use GFP_TRANSHUGE instead, which implies
__GFP_NORETRY, to indicate the low-latency requirement.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:17 -07:00
Waiman Long
3a79d52aa3 mm, thp: replace smp_mb after atomic_add by smp_mb__after_atomic
In some architectures like x86, atomic_add() is a full memory barrier.
In that case, an additional smp_mb() is just a waste of time.  This
patch replaces that smp_mb() by smp_mb__after_atomic() which will avoid
the redundant memory barrier in some architectures.

With a 3.16-rc1 based kernel, this patch reduced the execution time of
breaking 1000 transparent huge pages from 38,245us to 30,964us.  A
reduction of 19% which is quite sizeable.  It also reduces the %cpu time
of the __split_huge_page_refcount function in the perf profile from
2.18% to 1.15%.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Scott J Norton <scott.norton@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:17 -07:00
Waiman Long
f8303c2582 mm, thp: move invariant bug check out of loop in __split_huge_page_map
In __split_huge_page_map(), the check for page_mapcount(page) is
invariant within the for loop.  Because of the fact that the macro is
implemented using atomic_read(), the redundant check cannot be optimized
away by the compiler leading to unnecessary read to the page structure.

This patch moves the invariant bug check out of the loop so that it will
be done only once.  On a 3.16-rc1 based kernel, the execution time of a
microbenchmark that broke up 1000 transparent huge pages using munmap()
had an execution time of 38,245us and 38,548us with and without the
patch respectively.  The performance gain is about 1%.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Scott J Norton <scott.norton@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:16 -07:00
Hugh Dickins
f72e7dcdd2 mm: let mm_find_pmd fix buggy race with THP fault
Trinity has reported:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
    CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G        W
                            3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
    lock_acquire (arch/x86/include/asm/current.h:14
                  kernel/locking/lockdep.c:3602)
    _raw_spin_lock (include/linux/spinlock_api_smp.h:143
                    kernel/locking/spinlock.c:151)
    remove_migration_pte (mm/migrate.c:137)
    rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
    remove_migration_ptes (mm/migrate.c:224)
    migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
    migrate_misplaced_page (mm/migrate.c:1733)
    __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
    handle_mm_fault (mm/memory.c:3948)
    __get_user_pages (mm/memory.c:1851)
    __mlock_vma_pages_range (mm/mlock.c:255)
    __mm_populate (mm/mlock.c:711)
    SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)

I believe this comes about because, whereas collapsing and splitting THP
functions take anon_vma lock in write mode (which excludes concurrent
rmap walks), faulting THP functions (write protection and misplaced
NUMA) do not - and mostly they do not need to.

But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
an instant (indeed, for a long instant, given the inter-CPU TLB flush in
there), leaves *pmd neither present not trans_huge.

Which can confuse a concurrent rmap walk, as when removing migration
ptes, seen in the dumped trace.  Although that rmap walk has a 4k page
to insert, anon_vmas containing THPs are in no way segregated from
4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
that instant when a trans_huge pmd is temporarily absent.

I don't think we need strengthen the locking at the THP end: it's easily
handled with an ACCESS_ONCE() before testing both conditions.

And since mm_find_pmd() had only one caller who wanted a THP rather than
a pmd, let's slightly repurpose it to fail when it hits a THP or
non-present pmd, and open code split_huge_page_address() again.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Dave Jones <davej@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:44 -07:00
Hugh Dickins
5338a93722 mm: thp: fix DEBUG_PAGEALLOC oops in copy_page_rep()
Trinity has for over a year been reporting a CONFIG_DEBUG_PAGEALLOC oops
in copy_page_rep() called from copy_user_huge_page() called from
do_huge_pmd_wp_page().

I believe this is a DEBUG_PAGEALLOC false positive, due to the source
page being split, and a tail page freed, while copy is in progress; and
not a problem without DEBUG_PAGEALLOC, since the pmd_same() check will
prevent a miscopy from being made visible.

Fix by adding get_user_huge_page() and put_user_huge_page(): reducing to
the usual get_page() and put_page() on head page in the usual config;
but get and put references to all of the tail pages when
DEBUG_PAGEALLOC.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:44 -07:00
Andrew Morton
ae3a8c1c23 mm/huge_memory.c: complete conversion to pr_foo()
It was using a mix of pr_foo() and printk(KERN_ERR ...).

Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:53:58 -07:00
Kirill A. Shutemov
ff9e43eb4f thp: consolidate assert checks in __split_huge_page()
It doesn't make sense to have two assert checks for each invariant: one
for printing and one for BUG().

Let's trigger BUG() if we print error message.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:53:57 -07:00
Ingo Molnar
2fe5de9ce7 Merge branch 'sched/urgent' into sched/core, to avoid conflicts
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 13:15:46 +02:00
Kirill A. Shutemov
b5a8cad376 thp: close race between split and zap huge pages
Sasha Levin has reported two THP BUGs[1][2].  I believe both of them
have the same root cause.  Let's look to them one by one.

The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".  It's
BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().  From my
testing I see that page_mapcount() is higher than mapcount here.

I think it happens due to race between zap_huge_pmd() and
page_check_address_pmd().  page_check_address_pmd() misses PMD which is
under zap:

	CPU0						CPU1
						zap_huge_pmd()
						  pmdp_get_and_clear()
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
	  /*
	   * We check if PMD present without taking ptl: no
	   * serialization against zap_huge_pmd(). We miss this PMD,
	   * it's not accounted to 'mapcount' in __split_huge_page().
	   */
	  pmd_present(pmd) == 0

  BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!

						  page_remove_rmap(page)
						    atomic_add_negative(-1, &page->_mapcount)

The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().

This happens in similar way:

	CPU0						CPU1
						zap_huge_pmd()
						  pmdp_get_and_clear()
						  page_remove_rmap(page)
						    atomic_add_negative(-1, &page->_mapcount)
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
	  pmd_present(pmd) == 0	/* The same comment as above */
  /*
   * No crash this time since we already decremented page->_mapcount in
   * zap_huge_pmd().
   */
  BUG_ON(mapcount != page_mapcount(page))

  /*
   * We split the compound page here into small pages without
   * serialization against zap_huge_pmd()
   */
  __split_huge_page_refcount()
						VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!

So my understanding the problem is pmd_present() check in mm_find_pmd()
without taking page table lock.

The bug was introduced by me commit with commit 117b0791ac. Sorry for
that. :(

Let's open code mm_find_pmd() in page_check_address_pmd() and do the
check under page table lock.

Note that __page_check_address() does the same for PTE entires
if sync != 0.

I've stress tested split and zap code paths for 36+ hours by now and
don't see crashes with the patch applied. Before it took <20 min to
trigger the first bug and few hours for second one (if we ignore
first).

[1] https://lkml.kernel.org/g/<53440991.9090001@oracle.com>
[2] https://lkml.kernel.org/g/<5310C56C.60709@oracle.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>	[3.13+]

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18 16:40:09 -07:00
Dongsheng Yang
8698a745d8 sched, treewide: Replace hardcoded nice values with MIN_NICE/MAX_NICE
Replace various -20/+19 hardcoded nice values with MIN_NICE/MAX_NICE.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/ff13819fd09b7a5dba5ab5ae797f2e7019bdfa17.1394532288.git.yangds.fnst@cn.fujitsu.com
Cc: devel@driverdev.osuosl.org
Cc: devicetree@vger.kernel.org
Cc: fcoe-devel@open-fcoe.org
Cc: linux390@de.ibm.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-s390@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: nbd-general@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: openipmi-developer@lists.sourceforge.net
Cc: qla2xxx-upstream@qlogic.com
Cc: linux-arch@vger.kernel.org
[ Consolidated the patches, twiddled the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18 12:07:24 +02:00
Michal Hocko
d715ae08f2 memcg: rename high level charging functions
mem_cgroup_newpage_charge is used only for charging anonymous memory so
it is better to rename it to mem_cgroup_charge_anon.

mem_cgroup_cache_charge is used for file backed memory so rename it to
mem_cgroup_charge_file.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:35:57 -07:00
Alex Thorlton
1e1836e84f mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags"
The main motivation behind this patch is to provide a way to disable THP
for jobs where the code cannot be modified, and using a malloc hook with
madvise is not an option (i.e.  statically allocated data).  This patch
allows us to do just that, without affecting other jobs running on the
system.

We need to do this sort of thing for jobs where THP hurts performance,
due to the possibility of increased remote memory accesses that can be
created by situations such as the following:

When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will
be handed out, and the THP will be stuck on whatever node the chunk was
originally referenced from.  If many remote nodes need to do work on
that same chunk, they'll be making remote accesses.

With THP disabled, 4K pages can be handed out to separate nodes as
they're needed, greatly reducing the amount of remote accesses to
memory.

This patch is based on some of my work combined with some
suggestions/patches given by Oleg Nesterov.  The main goal here is to
add a prctl switch to allow us to disable to THP on a per mm_struct
basis.

Here's a bit of test data with the new patch in place...

First with the flag unset:

  # perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
  Setting thp_disabled for this task...
  thp_disable: 0
  Set thp_disabled state to 0
  Process pid = 18027

                                                                                                                       PF/
                                  MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
  TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
   512      1.120      0.060      0.000    0.110      0.110     0.000    28571428864 -9223372036854775808  55803572      23

   Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

    273719072.841402 task-clock                #  641.026 CPUs utilized           [100.00%]
           1,008,986 context-switches          #    0.000 M/sec                   [100.00%]
               7,717 CPU-migrations            #    0.000 M/sec                   [100.00%]
           1,698,932 page-faults               #    0.000 M/sec
  355,222,544,890,379 cycles                   #    1.298 GHz                     [100.00%]
  536,445,412,234,588 stalled-cycles-frontend  #  151.02% frontend cycles idle    [100.00%]
  409,110,531,310,223 stalled-cycles-backend   #  115.17% backend  cycles idle    [100.00%]
  148,286,797,266,411 instructions             #    0.42  insns per cycle
                                               #    3.62  stalled cycles per insn [100.00%]
  27,061,793,159,503 branches                  #   98.867 M/sec                   [100.00%]
       1,188,655,196 branch-misses             #    0.00% of all branches

       427.001706337 seconds time elapsed

Now with the flag set:

  # perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
  Setting thp_disabled for this task...
  thp_disable: 1
  Set thp_disabled state to 1
  Process pid = 144957

                                                                                                                       PF/
                                  MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
  TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
   512      0.620      0.260      0.250    0.320      0.570     0.001    51612901376 128000000000 100806448      23

   Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

    138789390.540183 task-clock                #  641.959 CPUs utilized           [100.00%]
             534,205 context-switches          #    0.000 M/sec                   [100.00%]
               4,595 CPU-migrations            #    0.000 M/sec                   [100.00%]
          63,133,119 page-faults               #    0.000 M/sec
  147,977,747,269,768 cycles                   #    1.066 GHz                     [100.00%]
  200,524,196,493,108 stalled-cycles-frontend  #  135.51% frontend cycles idle    [100.00%]
  105,175,163,716,388 stalled-cycles-backend   #   71.07% backend  cycles idle    [100.00%]
  180,916,213,503,160 instructions             #    1.22  insns per cycle
                                               #    1.11  stalled cycles per insn [100.00%]
  26,999,511,005,868 branches                  #  194.536 M/sec                   [100.00%]
         714,066,351 branch-misses             #    0.00% of all branches

       216.196778807 seconds time elapsed

As with previous versions of the patch, We're getting about a 2x
performance increase here.  Here's a link to the test case I used, along
with the little wrapper to activate the flag:

  http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz

This patch (of 3):

Revert commit 8e72033f2a and add in code to fix up any issues caused
by the revert.

The revert is necessary because hugepage_madvise would return -EINVAL
when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
patch set.

Here's a snip of an e-mail from Gerald detailing the original purpose of
this code, and providing justification for the revert:

  "The intent of commit 8e72033f2a was to guard against any future
   programming errors that may result in an madvice(MADV_HUGEPAGE) on
   guest mappings, which would crash the kernel.

   Martin suggested adding the bit to arch/s390/mm/pgtable.c, if
   8e72033f2a was to be reverted, because that check will also prevent
   a kernel crash in the case described above, it will now send a
   SIGSEGV instead.

   This would now also allow to do the madvise on other parts, if
   needed, so it is a more flexible approach.  One could also say that
   it would have been better to do it this way right from the
   beginning..."

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:35:51 -07:00