Commit Graph

6751 Commits

Author SHA1 Message Date
Linus Torvalds d652e1eb8e Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "Main changes:

   - scheduler side full-dynticks (user-space execution is undisturbed
     and receives no timer IRQs) preparation changes that convert the
     cputime accounting code to be full-dynticks ready, from Frederic
     Weisbecker.

   - Initial sched.h split-up changes, by Clark Williams

   - select_idle_sibling() performance improvement by Mike Galbraith:

        " 1 tbench pair (worst case) in a 10 core + SMT package:

          pre   15.22 MB/sec 1 procs
          post 252.01 MB/sec 1 procs "

  - sched_rr_get_interval() ABI fix/change.  We think this detail is not
    used by apps (so it's not an ABI in practice), but lets keep it
    under observation.

  - misc RT scheduling cleanups, optimizations"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h>
  cputime: Remove irqsave from seqlock readers
  sched, powerpc: Fix sched.h split-up build failure
  cputime: Restore CPU_ACCOUNTING config defaults for PPC64
  sched/rt: Move rt specific bits into new header file
  sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
  sched: Move sched.h sysctl bits into separate header
  sched: Fix signedness bug in yield_to()
  sched: Fix select_idle_sibling() bouncing cow syndrome
  sched/rt: Further simplify pick_rt_task()
  sched/rt: Do not account zero delta_exec in update_curr_rt()
  cputime: Safely read cputime of full dynticks CPUs
  kvm: Prepare to add generic guest entry/exit callbacks
  cputime: Use accessors to read task cputime stats
  cputime: Allow dynamic switch between tick/virtual based cputime accounting
  cputime: Generic on-demand virtual cputime accounting
  cputime: Move default nsecs_to_cputime() to jiffies based cputime file
  cputime: Librarize per nsecs resolution cputime definitions
  cputime: Avoid multiplication overflow on utime scaling
  context_tracking: Export context state for generic vtime
  ...

Fix up conflict in kernel/context_tracking.c due to comment additions.
2013-02-19 18:19:48 -08:00
Linus Torvalds 7c45512df9 mm: fix pageblock bitmap allocation
Commit c060f943d0 ("mm: use aligned zone start for pfn_to_bitidx
calculation") fixed out calculation of the index into the pageblock
bitmap when a !SPARSEMEM zome was not aligned to pageblock_nr_pages.

However, the _allocation_ of that bitmap had never taken this alignment
requirement into accout, so depending on the exact size and alignment of
the zone, the use of that index could then access past the allocation,
resulting in some very subtle memory corruption.

This was reported (and bisected) by Ingo Molnar: one of his random
config builds would hang with certain very specific kernel command line
options.

In the meantime, commit c060f943d0 has been marked for stable, so this
fix needs to be back-ported to the stable kernels that backported the
commit to use the right alignment.

Bisected-and-tested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-18 09:58:02 -08:00
Marek Szyprowski 41a7973447 mm: cma: fix accounting of CMA pages placed in high memory
The total number of low memory pages is determined as totalram_pages -
totalhigh_pages, so without this patch all CMA pageblocks placed in
highmem were accounted to low memory.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-12 14:34:00 -08:00
Glauber Costa 4ba902b574 memcg: fix kmemcg registration for late caches
The designed workflow for the caches in kmemcg is: register it with
memcg_register_cache() if kmemcg is already available or later on when a
new kmemcg appears at memcg_update_cache_sizes() which will handle all
caches in the system.  The caches created at boot time will be handled
by the later, and the memcg-caches as well as any system caches that are
registered later on by the former.

There is a bug, however, in memcg_register_cache: we correctly set up
the array size, but do not mark the cache as a root cache.

This means that allocations for any cache appearing late in the game
will see memcg->memcg_params->is_root_cache == false, and in particular,
trigger VM_BUG_ON(!cachep->memcg_params->is_root_cache) in
__memcg_kmem_cache_get.

The obvious fix is to include the missing assignment.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-12 14:34:00 -08:00
Gerald Schaefer 9977f0f164 mm: don't overwrite mm->def_flags in do_mlockall()
With commit 8e72033f2a ("thp: make MADV_HUGEPAGE check for
mm->def_flags") the VM_NOHUGEPAGE flag may be set on s390 in
mm->def_flags for certain processes, to prevent future thp mappings.
This would be overwritten by do_mlockall(), which sets it back to 0 with
an optional VM_LOCKED flag set.

To fix this, instead of overwriting mm->def_flags in do_mlockall(), only
the VM_LOCKED flag should be set or cleared.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-12 14:34:00 -08:00
Clark Williams 8bd75c77b7 sched/rt: Move rt specific bits into new header file
Move rt scheduler definitions out of include/linux/sched.h into
new file include/linux/sched/rt.h

Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-07 20:51:08 +01:00
Clark Williams cf4aebc292 sched: Move sched.h sysctl bits into separate header
Move the sysctl-related bits from include/linux/sched.h into
a new file: include/linux/sched/sysctl.h. Then update source
files requiring access to those bits by including the new
header file.

Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-07 20:50:54 +01:00
Yuanhan Liu 631b0cfdbd mm: fix wrong comments about anon_vma lock
We use rwsem since commit 5a505085f0 ("mm/rmap: Convert the struct
anon_vma::mutex to an rwsem").  And most of comments are converted to
the new rwsem lock; while just 2 more missed from:

	 $ git grep 'anon_vma->mutex'

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-05 20:38:48 +11:00
Tony Lu be7517d6ab mm/hugetlb: set PTE as huge in hugetlb_change_protection and remove_migration_pte
When setting a huge PTE, besides calling pte_mkhuge(), we also need to
call arch_make_huge_pte(), which we indeed do in make_huge_pte(), but we
forget to do in hugetlb_change_protection() and remove_migration_pte().

Signed-off-by: Zhigang Lu <zlu@tilera.com>
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-05 20:38:47 +11:00
Kirill A. Shutemov 85facf2570 thp: avoid dumping huge zero page
No reason to preserve the huge zero page in core dumps.

Reported-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-05 20:38:46 +11:00
Mel Gorman 8fb74b9fb2 mm: compaction: partially revert capture of suitable high-order page
Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
waiting for POLLIN on a local TCP socket.  It was easier to trigger if
there was disk IO and dirty pages at the same time and he bisected it to
commit 1fb3f8ca0e ("mm: compaction: capture a suitable high-order page
immediately when it is made available").

The intention of that patch was to improve high-order allocations under
memory pressure after changes made to reclaim in 3.6 drastically hurt
THP allocations but the approach was flawed.  For Eric, the problem was
that page->pfmemalloc was not being cleared for captured pages leading
to a poor interaction with swap-over-NFS support causing the packets to
be dropped.  However, I identified a few more problems with the patch
including the fact that it can increase contention on zone->lock in some
cases which could result in async direct compaction being aborted early.

In retrospect the capture patch took the wrong approach.  What it should
have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
was allocating for THP and avoided races that way.  While the patch was
showing to improve allocation success rates at the time, the benefit is
marginal given the relative complexity and it should be revisited from
scratch in the context of the other reclaim-related changes that have
taken place since the patch was first written and tested.  This patch
partially reverts commit 1fb3f8ca0e ("mm: compaction: capture a
suitable high-order page immediately when it is made available").

Reported-and-tested-by: Eric Wong <normalperson@yhbt.net>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11 14:54:56 -08:00
Mel Gorman 062f1af217 mm: thp: acquire the anon_vma rwsem for write during split
Zhouping Liu reported the following against 3.8-rc1 when running a mmap
testcase from LTP.

  mapcount 0 page_mapcount 3
  ------------[ cut here ]------------
  kernel BUG at mm/huge_memory.c:1798!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables bnep bluetooth rfkill iptable_mangle ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat dm_mirror dm_region_hash dm_log dm_mod cdc_ether iTCO_wdt i7core_edac coretemp usbnet iTCO_vendor_support mii crc32c_intel edac_core lpc_ich shpchp ioatdma mfd_core i2c_i801 pcspkr serio_raw bnx2 microcode dca vhost_net tun macvtap macvlan kvm_intel kvm uinput mgag200 sr_mod cdrom i2c_algo_bit sd_mod drm_kms_helper crc_t10dif ata_generic pata_acpi ttm ata_piix drm libata i2c_core megaraid_sas
  CPU 1
  Pid: 23217, comm: mmap10 Not tainted 3.8.0-rc1mainline+ #17 IBM IBM System x3400 M3 Server -[7379I08]-/69Y4356
  RIP: __split_huge_page+0x677/0x6d0
  RSP: 0000:ffff88017a03fc08  EFLAGS: 00010293
  RAX: 0000000000000003 RBX: ffff88027a6c22e0 RCX: 00000000000034d2
  RDX: 000000000000748b RSI: 0000000000000046 RDI: 0000000000000246
  RBP: ffff88017a03fcb8 R08: ffffffff819d2440 R09: 000000000000054a
  R10: 0000000000aaaaaa R11: 00000000ffffffff R12: 0000000000000000
  R13: 00007f4f11a00000 R14: ffff880179e96e00 R15: ffffea0005c08000
  FS:  00007f4f11f4a740(0000) GS:ffff88017bc20000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 00000037e9ebb404 CR3: 000000017a436000 CR4: 00000000000007e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process mmap10 (pid: 23217, threadinfo ffff88017a03e000, task ffff880172dd32e0)
  Stack:
   ffff88017a540ec8 ffff88017a03fc20 ffffffff816017b5 ffff88017a03fc88
   ffffffff812fa014 0000000000000000 ffff880279ebd5c0 00000000f4f11a4c
   00000007f4f11f49 00000007f4f11a00 ffff88017a540ef0 ffff88017a540ee8
  Call Trace:
    split_huge_page+0x68/0xb0
    __split_huge_page_pmd+0x134/0x330
    split_huge_page_pmd_mm+0x51/0x60
    split_huge_page_address+0x3b/0x50
    __vma_adjust_trans_huge+0x9c/0xf0
    vma_adjust+0x684/0x750
    __split_vma.isra.28+0x1fa/0x220
    do_munmap+0xf9/0x420
    vm_munmap+0x4e/0x70
    sys_munmap+0x2b/0x40
    system_call_fastpath+0x16/0x1b

Alexander Beregalov and Alex Xu reported similar bugs and Hillf Danton
identified that commit 5a505085f0 ("mm/rmap: Convert the struct
anon_vma::mutex to an rwsem") and commit 4fc3f1d66b ("mm/rmap,
migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable")
were likely the problem.  Reverting these commits was reported to solve
the problem for Alexander.

Despite the reason for these commits, NUMA balancing is not the direct
source of the problem.  split_huge_page() expects the anon_vma lock to
be exclusive to serialise the whole split operation.  Ordinarily it is
expected that the anon_vma lock would only be required when updating the
avcs but THP also uses the anon_vma rwsem for collapse and split
operations where the page lock or compound lock cannot be used (as the
page is changing from base to THP or vice versa) and the page table
locks are insufficient.

This patch takes the anon_vma lock for write to serialise against parallel
split_huge_page as THP expected before the conversion to rwsem.

Reported-and-tested-by: Zhouping Liu <zliu@redhat.com>
Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Reported-by: Alex Xu <alex_y_xu@yahoo.ca>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11 14:54:55 -08:00
Jiri Kosina 572043c90d mm: mmap: annotate vm_lock_anon_vma locking properly for lockdep
Commit 5a505085f0 ("mm/rmap: Convert the struct anon_vma::mutex to an
rwsem") turned anon_vma mutex to rwsem.

However, the properly annotated nested locking in mm_take_all_locks()
has been converted from

	mutex_lock_nest_lock(&anon_vma->root->mutex, &mm->mmap_sem);

to

	down_write(&anon_vma->root->rwsem);

which is incomplete, and causes the false positive report from lockdep
below.

Annotate the fact that mmap_sem is used as an outter lock to serialize
taking of all the anon_vma rwsems at once no matter the order, using the
down_write_nest_lock() primitive.

This patch fixes this lockdep report:

 =============================================
 [ INFO: possible recursive locking detected ]
 3.8.0-rc2-00036-g5f73896 #171 Not tainted
 ---------------------------------------------
 qemu-kvm/2315 is trying to acquire lock:
  (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0

 but task is already holding lock:
  (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&anon_vma->rwsem);
   lock(&anon_vma->rwsem);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 4 locks held by qemu-kvm/2315:
  #0:  (&mm->mmap_sem){++++++}, at: do_mmu_notifier_register+0xfc/0x170
  #1:  (mm_all_locks_mutex){+.+...}, at: mm_take_all_locks+0x36/0x1b0
  #2:  (&mapping->i_mmap_mutex){+.+...}, at: mm_take_all_locks+0xc9/0x1b0
  #3:  (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0

 stack backtrace:
 Pid: 2315, comm: qemu-kvm Not tainted 3.8.0-rc2-00036-g5f73896 #171
 Call Trace:
   print_deadlock_bug+0xf2/0x100
   validate_chain+0x4f6/0x720
   __lock_acquire+0x359/0x580
   lock_acquire+0x121/0x190
   down_write+0x3f/0x70
   mm_take_all_locks+0x149/0x1b0
   do_mmu_notifier_register+0x68/0x170
   mmu_notifier_register+0xe/0x10
   kvm_create_vm+0x22b/0x330 [kvm]
   kvm_dev_ioctl+0xf8/0x1a0 [kvm]
   do_vfs_ioctl+0x9d/0x350
   sys_ioctl+0x91/0xb0
   system_call_fastpath+0x16/0x1b

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mel Gorman <mel@csn.ul.ie>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11 14:54:55 -08:00
Max Filippov 10d73e655c mm: bootmem: fix free_all_bootmem_core() with odd bitmap alignment
Currently free_all_bootmem_core ignores that node_min_pfn may be not
multiple of BITS_PER_LONG.  Eg commit 6dccdcbe2c ("mm: bootmem: fix
checking the bitmap when finally freeing bootmem") shifts vec by lower
bits of start instead of lower bits of idx.  Also

  if (IS_ALIGNED(start, BITS_PER_LONG) && vec == ~0UL)

assumes that vec bit 0 corresponds to start pfn, which is only true when
node_min_pfn is a multiple of BITS_PER_LONG.  Also loop in the else
clause can double-free pages (e.g.  with node_min_pfn == start == 1,
map[0] == ~0 on 32-bit machine page 32 will be double-freed).

This bug causes the following message during xtensa kernel boot:

  bootmem::free_all_bootmem_core nid=0 start=1 end=8000
  BUG: Bad page state in process swapper  pfn:00001
  page:d04bd020 count:0 mapcount:-127 mapping:  (null) index:0x2
  page flags: 0x0()
  Call Trace:
    bad_page+0x8c/0x9c
    free_pages_prepare+0x5e/0x88
    free_hot_cold_page+0xc/0xa0
    __free_pages+0x24/0x38
    __free_pages_bootmem+0x54/0x56
    free_all_bootmem_core$part$11+0xeb/0x138
    free_all_bootmem+0x46/0x58
    mem_init+0x25/0xa4
    start_kernel+0x11e/0x25c
    should_never_return+0x0/0x3be7

The fix is the following:
 - always align vec so that its bit 0 corresponds to start
 - provide BITS_PER_LONG bits in vec, if those bits are available in the
   map
 - don't free pages past next start position in the else clause.

Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Prasad Koya <prasad.koya@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11 14:54:55 -08:00
Laura Abbott c060f943d0 mm: use aligned zone start for pfn_to_bitidx calculation
The current calculation in pfn_to_bitidx assumes that (pfn -
zone->zone_start_pfn) >> pageblock_order will return the same bit for
all pfn in a pageblock.  If zone_start_pfn is not aligned to
pageblock_nr_pages, this may not always be correct.

Consider the following with pageblock order = 10, zone start 2MB:

  pfn     | pfn - zone start | (pfn - zone start) >> page block order
  ----------------------------------------------------------------
  0x26000 | 0x25e00	   |  0x97
  0x26100 | 0x25f00	   |  0x97
  0x26200 | 0x26000	   |  0x98
  0x26300 | 0x26100	   |  0x98

This means that calling {get,set}_pageblock_migratetype on a single page
will not set the migratetype for the full block.  Fix this by rounding
down zone_start_pfn when doing the bitidx calculation.

For our use case, the effects of this bug were mostly tied to the fact
that CMA allocations would either take a long time or fail to happen.
Depending on the driver using CMA, this could result in anything from
visual glitches to application failures.

Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11 14:54:55 -08:00
Jason Liu 7964c06d66 mm: compaction: fix echo 1 > compact_memory return error issue
when run the folloing command under shell, it will return error

  sh/$ echo 1 > /proc/sys/vm/compact_memory
  sh/$ sh: write error: Bad address

After strace, I found the following log:

  ...
  write(1, "1\n", 2)               = 3
  write(1, "", 4294967295)         = -1 EFAULT (Bad address)
  write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address
  ) = 31

This tells system return 3(COMPACT_COMPLETE) after write data to
compact_memory.

The fix is to make the system just return 0 instead 3(COMPACT_COMPLETE)
from sysctl_compaction_handler after compaction_nodes finished.

Signed-off-by: Jason Liu <r64343@freescale.com>
Suggested-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11 14:54:54 -08:00
Lin Feng c0232ae861 mm: memblock: fix wrong memmove size in memblock_merge_regions()
The memmove span covers from (next+1) to the end of the array, and the
index of next is (i+1), so the index of (next+1) is (i+2).  So the size
of remaining array elements is (type->cnt - (i + 2)).

Since the remaining elements of the memblock array are move forward by
one element and there is only one additional element caused by this bug.
So there won't be any write overflow here but read overflow.  It may
read one more element out of the array address if the array happens to
be full.  Commonly it doesn't matter at all but if the array happens to
be located at the end a memblock, it may cause a invalid read operation
for the physical address doesn't exist.

There are 2 *happens to be* here, so I think the probability is quite
low, I don't know if any guy is haunted by this bug before.

Mostly I think it's user-invisible.

Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11 14:54:54 -08:00
Mel Gorman 04fa5d6a65 mm: migrate: check page_count of THP before migrating
Hugh Dickins pointed out that migrate_misplaced_transhuge_page() does
not check page_count before migrating like base page migration and
khugepage.  He could not see why this was safe and he is right.

The potential impact of the bug is avoided due to the limitations of
NUMA balancing.  The page_mapcount() check ensures that only a single
address space is using this page and as THPs are typically private it
should not be possible for another address space to fault it in
parallel.  If the address space has one associated task then it's
difficult to have both a GUP pin and be referencing the page at the same
time.  If there are multiple tasks then a buggy scenario requires that
another thread be accessing the page while the direct IO is in flight.
This is dodgy behaviour as there is a possibility of corruption with or
without THP migration.  It would be

While we happen to be safe for the most part it is shoddy to depend on
such "safety" so this patch checks the page count similar to anonymous
pages.  Note that this does not mean that the page_mapcount() check can
go away.  If we were to remove the page_mapcount() check the the THP
would have to be unmapped from all referencing PTEs, replaced with
migration PTEs and restored properly afterwards.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11 14:54:54 -08:00
Linus Torvalds e53289c0c5 mm: reinstante dropped pmd_trans_splitting() check
The check for a pmd being in the process of being split was dropped by
mistake by commit d10e63f294 ("mm: numa: Create basic numa page
hinting infrastructure"). Put it back.

Reported-by: Dave Jones <davej@redhat.com>
Debugged-by: Hillf Danton <dhillf@gmail.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Kirill Shutemov <kirill@shutemov.name>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-09 08:36:54 -08:00
Michal Hocko 53a59fc67f mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT
Since commit e303297e6c ("mm: extended batches for generic
mmu_gather") we are batching pages to be freed until either
tlb_next_batch cannot allocate a new batch or we are done.

This works just fine most of the time but we can get in troubles with
non-preemptible kernel (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY)
on large machines where too aggressive batching might lead to soft
lockups during process exit path (exit_mmap) because there are no
scheduling points down the free_pages_and_swap_cache path and so the
freeing can take long enough to trigger the soft lockup.

The lockup is harmless except when the system is setup to panic on
softlockup which is not that unusual.

The simplest way to work around this issue is to limit the maximum
number of batches in a single mmu_gather.  10k of collected pages should
be safe to prevent from soft lockups (we would have 2ms for one) even if
they are all freed without an explicit scheduling point.

This patch doesn't add any new explicit scheduling points because it
relies on zap_pmd_range during page tables zapping which calls
cond_resched per PMD.

The following lockup has been reported for 3.0 kernel with a huge
process (in order of hundreds gigs but I do know any more details).

  BUG: soft lockup - CPU#56 stuck for 22s! [kernel:31053]
  Modules linked in: af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc mptctl mptbase autofs4 binfmt_misc dm_round_robin dm_multipath bonding cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode fuse loop osst sg sd_mod crc_t10dif st qla2xxx scsi_transport_fc scsi_tgt netxen_nic i7core_edac iTCO_wdt joydev e1000e serio_raw pcspkr edac_core iTCO_vendor_support acpi_power_meter rtc_cmos hpwdt hpilo button container usbhid hid dm_mirror dm_region_hash dm_log linear uhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh dm_snapshot pcnet32 mii edd dm_mod raid1 ext3 mbcache jbd fan thermal processor thermal_sys hwmon cciss scsi_mod
  Supported: Yes
  CPU 56
  Pid: 31053, comm: kernel Not tainted 3.0.31-0.9-default #1 HP ProLiant DL580 G7
  RIP: 0010:  _raw_spin_unlock_irqrestore+0x8/0x10
  RSP: 0018:ffff883ec1037af0  EFLAGS: 00000206
  RAX: 0000000000000e00 RBX: ffffea01a0817e28 RCX: ffff88803ffd9e80
  RDX: 0000000000000200 RSI: 0000000000000206 RDI: 0000000000000206
  RBP: 0000000000000002 R08: 0000000000000001 R09: ffff887ec724a400
  R10: 0000000000000000 R11: dead000000200200 R12: ffffffff8144c26e
  R13: 0000000000000030 R14: 0000000000000297 R15: 000000000000000e
  FS:  00007ed834282700(0000) GS:ffff88c03f200000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 000000000068b240 CR3: 0000003ec13c5000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process kernel (pid: 31053, threadinfo ffff883ec1036000, task ffff883ebd5d4100)
  Call Trace:
    release_pages+0xc5/0x260
    free_pages_and_swap_cache+0x9d/0xc0
    tlb_flush_mmu+0x5c/0x80
    tlb_finish_mmu+0xe/0x50
    exit_mmap+0xbd/0x120
    mmput+0x49/0x120
    exit_mm+0x122/0x160
    do_exit+0x17a/0x430
    do_group_exit+0x3d/0xb0
    get_signal_to_deliver+0x247/0x480
    do_signal+0x71/0x1b0
    do_notify_resume+0x98/0xb0
    int_signal+0x12/0x17
  DWARF2 unwinder stuck at int_signal+0x12/0x17

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org>	[3.0+]
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-04 16:11:46 -08:00
Bartlomiej Zolnierkiewicz a458431e17 mm: fix zone_watermark_ok_safe() accounting of isolated pages
Commit 702d1a6e07 ("memory-hotplug: fix kswapd looping forever
problem") added an isolated pageblocks counter (nr_pageblock_isolate in
struct zone) and used it to adjust free pages counter in
zone_watermark_ok_safe() to prevent kswapd looping forever problem.

Then later, commit 2139cbe627 ("cma: fix counting of isolated pages")
fixed accounting of isolated pages in global free pages counter.  It
made the previous zone_watermark_ok_safe() fix unnecessary and
potentially harmful (cause now isolated pages may be accounted twice
making free pages counter incorrect).

This patch removes the special isolated pageblocks counter altogether
which fixes zone_watermark_ok_safe() free pages check.

Reported-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-04 16:11:46 -08:00
Greg Kroah-Hartman fcb35a9bac MM: vmscan: remove __devinit attribute.
CONFIG_HOTPLUG is going away as an option.  As a result, the __dev*
markings need to be removed.

This change removes the use of __devinit from the file.

Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.

Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-03 15:57:13 -08:00
Mel Gorman 42288fe366 mm: mempolicy: Convert shared_policy mutex to spinlock
Sasha was fuzzing with trinity and reported the following problem:

  BUG: sleeping function called from invalid context at kernel/mutex.c:269
  in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
  2 locks held by trinity-main/6361:
   #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>] __do_page_fault+0x1e4/0x4f0
   #1:  (&(&mm->page_table_lock)->rlock){+.+...}, at: [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0
  Pid: 6361, comm: trinity-main Tainted: G        W
  3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
  Call Trace:
    __might_sleep+0x1c3/0x1e0
    mutex_lock_nested+0x29/0x50
    mpol_shared_policy_lookup+0x2e/0x90
    shmem_get_policy+0x2e/0x30
    get_vma_policy+0x5a/0xa0
    mpol_misplaced+0x41/0x1d0
    handle_pte_fault+0x465/0x6a0

This was triggered by a different version of automatic NUMA balancing
but in theory the current version is vunerable to the same problem.

do_numa_page
  -> numa_migrate_prep
    -> mpol_misplaced
      -> get_vma_policy
        -> shmem_get_policy

It's very unlikely this will happen as shared pages are not marked
pte_numa -- see the page_mapcount() check in change_pte_range() -- but
it is possible.

To address this, this patch restores sp->lock as originally implemented
by Kosaki Motohiro.  In the path where get_vma_policy() is called, it
should not be calling sp_alloc() so it is not necessary to treat the PTL
specially.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-02 17:32:13 -08:00
Hugh Dickins a7a88b2373 mempolicy: remove arg from mpol_parse_str, mpol_to_str
Remove the unused argument (formerly no_context) from mpol_parse_str()
and from mpol_to_str().

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-02 09:27:10 -08:00
Hugh Dickins f2a07f40db tmpfs mempolicy: fix /proc/mounts corrupting memory
Recently I suggested using "mount -o remount,mpol=local /tmp" in NUMA
mempolicy testing.  Very nasty.  Reading /proc/mounts, /proc/pid/mounts
or /proc/pid/mountinfo may then corrupt one bit of kernel memory, often
in a page table (causing "Bad swap" or "Bad page map" warning or "Bad
pagetable" oops), sometimes in a vm_area_struct or rbnode or somewhere
worse.  "mpol=prefer" and "mpol=prefer:Node" are equally toxic.

Recent NUMA enhancements are not to blame: this dates back to 2.6.35,
when commit e17f74af35 "mempolicy: don't call mpol_set_nodemask() when
no_context" skipped mpol_parse_str()'s call to mpol_set_nodemask(),
which used to initialize v.preferred_node, or set MPOL_F_LOCAL in flags.
With slab poisoning, you can then rely on mpol_to_str() to set the bit
for node 0x6b6b, probably in the next page above the caller's stack.

mpol_parse_str() is only called from shmem_parse_options(): no_context
is always true, so call it unused for now, and remove !no_context code.
Set v.nodes or v.preferred_node or MPOL_F_LOCAL as mpol_to_str() might
expect.  Then mpol_to_str() can ignore its no_context argument also,
the mpol being appropriately initialized whether contextualized or not.
Rename its no_context unused too, and let subsequent patch remove them
(that's not needed for stable backporting, which would involve rejects).

I don't understand why MPOL_LOCAL is described as a pseudo-policy:
it's a reasonable policy which suffers from a confusing implementation
in terms of MPOL_PREFERRED with MPOL_F_LOCAL.  I believe this would be
much more robust if MPOL_LOCAL were recognized in switch statements
throughout, MPOL_F_LOCAL deleted, and MPOL_PREFERRED use the (possibly
empty) nodes mask like everyone else, instead of its preferred_node
variant (I presume an optimization from the days before MPOL_LOCAL).
But that would take me too long to get right and fully tested.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-02 09:27:10 -08:00