Commit Graph

9019 Commits

Author SHA1 Message Date
Linus Torvalds f788baadbd Merge branch 'gadget' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull gadgetfs fixes from Al Viro:
 "Assorted fixes around AIO on gadgetfs: leaks, use-after-free, troubles
  caused by ->f_op flipping"

* 'gadget' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  gadgetfs: really get rid of switching ->f_op
  gadgetfs: get rid of flipping ->f_op in ep_config()
  gadget: switch ep_io_operations to ->read_iter/->write_iter
  gadgetfs: use-after-free in ->aio_read()
  gadget/function/f_fs.c: switch to ->{read,write}_iter()
  gadget/function/f_fs.c: use put iov_iter into io_data
  gadget/function/f_fs.c: close leaks
  move iov_iter.c from mm/ to lib/
  new helper: dup_iter()
2015-03-13 10:55:32 -07:00
Linus Torvalds c202baf017 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "13 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  memcg: disable hierarchy support if bound to the legacy cgroup hierarchy
  mm: reorder can_do_mlock to fix audit denial
  kasan, module: move MODULE_ALIGN macro into <linux/moduleloader.h>
  kasan, module, vmalloc: rework shadow allocation for modules
  fanotify: fix event filtering with FAN_ONDIR set
  mm/nommu.c: export symbol max_mapnr
  arch/c6x/include/asm/pgtable.h: define dummy pgprot_writecombine for !MMU
  nilfs2: fix deadlock of segment constructor during recovery
  mm: cma: fix CMA aligned offset calculation
  mm, hugetlb: close race when setting PageTail for gigantic pages
  mm, oom: do not fail __GFP_NOFAIL allocation if oom killer is disabled
  drivers/rtc/rtc-s3c.c: add .needs_src_clk to s3c6410 RTC data
  ocfs2: make append_dio an incompat feature
2015-03-12 18:46:19 -07:00
Vladimir Davydov 7feee590bb memcg: disable hierarchy support if bound to the legacy cgroup hierarchy
If the memory cgroup controller is initially mounted in the scope of the
default cgroup hierarchy and then remounted to a legacy hierarchy, it will
still have hierarchy support enabled, which is incorrect.  We should
disable hierarchy support if bound to the legacy cgroup hierarchy.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:08 -07:00
Jeff Vander Stoep a5a6579db3 mm: reorder can_do_mlock to fix audit denial
A userspace call to mmap(MAP_LOCKED) may result in the successful locking
of memory while also producing a confusing audit log denial.  can_do_mlock
checks capable and rlimit.  If either of these return positive
can_do_mlock returns true.  The capable check leads to an LSM hook used by
apparmour and selinux which produce the audit denial.  Reordering so
rlimit is checked first eliminates the denial on success, only recording a
denial when the lock is unsuccessful as a result of the denial.

Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
Acked-by: Nick Kralevich <nnk@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Paul Cassella <cassella@cray.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:08 -07:00
Andrey Ryabinin a5af5aa8b6 kasan, module, vmalloc: rework shadow allocation for modules
Current approach in handling shadow memory for modules is broken.

Shadow memory could be freed only after memory shadow corresponds it is no
longer used.  vfree() called from interrupt context could use memory its
freeing to store 'struct llist_node' in it:

    void vfree(const void *addr)
    {
    ...
        if (unlikely(in_interrupt())) {
            struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);
            if (llist_add((struct llist_node *)addr, &p->list))
                    schedule_work(&p->wq);

Later this list node used in free_work() which actually frees memory.
Currently module_memfree() called in interrupt context will free shadow
before freeing module's memory which could provoke kernel crash.

So shadow memory should be freed after module's memory.  However, such
deallocation order could race with kasan_module_alloc() in module_alloc().

Free shadow right before releasing vm area.  At this point vfree()'d
memory is not used anymore and yet not available for other allocations.
New VM_KASAN flag used to indicate that vm area has dynamically allocated
shadow memory so kasan frees shadow only if it was previously allocated.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:08 -07:00
gchen gchen 5b8bf30721 mm/nommu.c: export symbol max_mapnr
Several modules may need max_mapnr, so export, the related error with
allmodconfig under c6x:

  MODPOST 3327 modules
  ERROR: "max_mapnr" [fs/pstore/ramoops.ko] undefined!
  ERROR: "max_mapnr" [drivers/media/v4l2-core/videobuf2-dma-contig.ko] undefined!

Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:08 -07:00
Danesh Petigara 850fc430f4 mm: cma: fix CMA aligned offset calculation
The CMA aligned offset calculation is incorrect for non-zero order_per_bit
values.

For example, if cma->order_per_bit=1, cma->base_pfn= 0x2f800000 and
align_order=12, the function returns a value of 0x17c00 instead of 0x400.

This patch fixes the CMA aligned offset calculation.

The previous calculation was wrong and would return too-large values for
the offset, so that when cma_alloc looks for free pages in the bitmap with
the requested alignment > order_per_bit, it starts too far into the bitmap
and so CMA allocations will fail despite there actually being plenty of
free pages remaining.  It will also probably have the wrong alignment.
With this change, we will get the correct offset into the bitmap.

One affected user is powerpc KVM, which has kvm_cma->order_per_bit set to
KVM_CMA_CHUNK_ORDER - PAGE_SHIFT, or 18 - 12 = 6.

[gregory.0xf0@gmail.com: changelog additions]
Signed-off-by: Danesh Petigara <dpetigara@broadcom.com>
Reviewed-by: Gregory Fong <gregory.0xf0@gmail.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:07 -07:00
David Rientjes 44fc80573c mm, hugetlb: close race when setting PageTail for gigantic pages
Now that gigantic pages are dynamically allocatable, care must be taken to
ensure that p->first_page is valid before setting PageTail.

If this isn't done, then it is possible to race and have compound_head()
return NULL.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:07 -07:00
Michal Hocko e009d5dc0a mm, oom: do not fail __GFP_NOFAIL allocation if oom killer is disabled
Tetsuo Handa has pointed out that __GFP_NOFAIL allocations might fail
after OOM killer is disabled if the allocation is performed by a kernel
thread.  This behavior was introduced from the very beginning by
7f33d49a2e ("mm, PM/Freezer: Disable OOM killer when tasks are frozen").
 This means that the basic contract for the allocation request is broken
and the context requesting such an allocation might blow up unexpectedly.

There are basically two ways forward.

1) move oom_killer_disable after kernel threads are frozen.  This has a
   risk that the OOM victim wouldn't be able to finish because it would
   depend on an already frozen kernel thread.  This would be really tricky
   to debug.

2) do not fail GFP_NOFAIL allocation no matter what and risk a
   potential Freezable kernel threads will loop and fail the suspend.
   Incidental allocations after kernel threads are frozen will at least
   dump a warning - if we are lucky and the serial console is still active
   of course...

This patch implements the later option because it is safer.  We would see
warning rather than allocation failures for the kernel threads which would
blow up otherwise and have a higher chances to identify __GFP_NOFAIL users
from deeper pm code.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@gooogle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:07 -07:00
Mel Gorman ba68bc0115 mm: thp: Return the correct value for change_huge_pmd
The wrong value is being returned by change_huge_pmd since commit
10c1045f28 ("mm: numa: avoid unnecessary TLB flushes when setting
NUMA hinting entries") which allows a fallthrough that tries to adjust
non-existent PTEs. This patch corrects it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 14:07:41 -07:00
Linus Torvalds 53da3bc2ba mm: fix up numa read-only thread grouping logic
Dave Chinner reported that commit 4d94246699 ("mm: convert
p[te|md]_mknonnuma and remaining page table manipulations") slowed down
his xfsrepair test enormously.  In particular, it was using more system
time due to extra TLB flushing.

The ultimate reason turns out to be how the change to use the regular
page table accessor functions broke the NUMA grouping logic.  The old
special mknuma/mknonnuma code accessed the page table present bit and
the magic NUMA bit directly, while the new code just changes the page
protections using PROT_NONE and the regular vma protections.

That sounds equivalent, and from a fault standpoint it really is, but a
subtle side effect is that the *other* protection bits of the page table
entries also change.  And the code to decide how to group the NUMA
entries together used the writable bit to decide whether a particular
page was likely to be shared read-only or not.

And with the change to make the NUMA handling use the regular permission
setting functions, that writable bit was basically always cleared for
private mappings due to COW.  So even if the page actually ends up being
written to in the end, the NUMA balancing would act as if it was always
shared RO.

This code is a heuristic anyway, so the fix - at least for now - is to
instead check whether the page is dirty rather than writable.  The bit
doesn't change with protection changes.

NOTE! This also adds a FIXME comment to revisit this issue,

Not only should we probably re-visit the whole "is this a shared
read-only page" heuristic (we might want to take the vma permissions
into account and base this more on those than the per-page ones, and
also look at whether the particular access that triggers it is a write
or not), but the whole COW issue shows that we should think about the
NUMA fault handling some more.

For example, maybe we should do the early-COW thing that a regular fault
does.  Or maybe we should accept that while using the same bits as
PROTNONE was a good thing (and got rid of the specual NUMA bit), we
might still want to just preseve the other protection bits across NUMA
faulting.

Those are bigger questions, left for later.  This just fixes up the
heuristic so that it at least approximates working again.  More analysis
and work needed.

Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Mel Gorman <mgorman@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>,
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 08:45:46 -07:00
Linus Torvalds a015d33c98 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "Two smaller fixes for this cycle:

   - A fixup from Keith so that NVMe compiles without BLK_INTEGRITY,
     basically just moving the code around appropriately.

   - A fixup for shm, fixing an oops in shmem_mapping() for mapping with
     no inode.  From Sasha"

[ The shmem fix doesn't look block-layer-related, but fixes a bug that
  happened due to the backing_dev_info removal..  - Linus ]

* 'for-linus' of git://git.kernel.dk/linux-block:
  mm: shmem: check for mapping owner before dereferencing
  NVMe: Fix for BLK_DEV_INTEGRITY not set
2015-02-28 10:21:57 -08:00
Johannes Weiner cc87317726 mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change
Historically, !__GFP_FS allocations were not allowed to invoke the OOM
killer once reclaim had failed, but nevertheless kept looping in the
allocator.

Commit 9879de7373 ("mm: page_alloc: embed OOM killing naturally into
allocation slowpath"), which should have been a simple cleanup patch,
accidentally changed the behavior to aborting the allocation at that
point.  This creates problems with filesystem callers (?) that currently
rely on the allocator waiting for other tasks to intervene.

Revert the behavior as it shouldn't have been changed as part of a
cleanup patch.

Fixes: 9879de7373 ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>	[3.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-28 09:57:51 -08:00
Johannes Weiner d2973697b3 mm: memcontrol: use "max" instead of "infinity" in control knobs
The memcg control knobs indicate the highest possible value using the
symbolic name "infinity", which is long and awkward to type.

Switch to the string "max", which is just as descriptive but shorter and
sweeter.

This changes a user interface, so do it before the release and before
the development flag is dropped from the default hierarchy.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-28 09:57:51 -08:00
Michal Hocko 4e54dede38 memcg: fix low limit calculation
A memcg is considered low limited even when the current usage is equal to
the low limit.  This leads to interesting side effects e.g.
groups/hierarchies with no memory accounted are considered protected and
so the reclaim will emit MEMCG_LOW event when encountering them.

Another and much bigger issue was reported by Joonsoo Kim.  He has hit a
NULL ptr dereference with the legacy cgroup API which even doesn't have
low limit exposed.  The limit is 0 by default but the initial check fails
for memcg with 0 consumption and parent_mem_cgroup() would return NULL if
use_hierarchy is 0 and so page_counter_read would try to dereference NULL.

I suppose that the current implementation is just an overlook because the
documentation in Documentation/cgroups/unified-hierarchy.txt says:

  "The memory.low boundary on the other hand is a top-down allocated
  reserve.  A cgroup enjoys reclaim protection when it and all its
  ancestors are below their low boundaries"

Fix the usage and the low limit comparision in mem_cgroup_low accordingly.

Fixes: 241994ed86 (mm: memcontrol: default hierarchy interface for memory)
Reported-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-28 09:57:51 -08:00
Joonsoo Kim da616534ed mm/nommu: fix memory leak
Maxime reported the following memory leak regression due to commit
dbc8358c72 ("mm/nommu: use alloc_pages_exact() rather than its own
implementation").

On v3.19, I am facing a memory leak.  Each time I run a command one page
is lost.  Here an example with busybox's free command:

  / # free
               total       used       free     shared    buffers     cached
  Mem:          7928       1972       5956          0          0        492
  -/+ buffers/cache:       1480       6448
  / # free
               total       used       free     shared    buffers     cached
  Mem:          7928       1976       5952          0          0        492
  -/+ buffers/cache:       1484       6444
  / # free
               total       used       free     shared    buffers     cached
  Mem:          7928       1980       5948          0          0        492
  -/+ buffers/cache:       1488       6440
  / # free
               total       used       free     shared    buffers     cached
  Mem:          7928       1984       5944          0          0        492
  -/+ buffers/cache:       1492       6436
  / # free
               total       used       free     shared    buffers     cached
  Mem:          7928       1988       5940          0          0        492
  -/+ buffers/cache:       1496       6432

At some point, the system fails to sastisfy 256KB allocations:

  free: page allocation failure: order:6, mode:0xd0
  CPU: 0 PID: 67 Comm: free Not tainted 3.19.0-05389-gacf2cf1-dirty #64
  Hardware name: STM32 (Device Tree Support)
    show_stack+0xb/0xc
    warn_alloc_failed+0x97/0xbc
    __alloc_pages_nodemask+0x295/0x35c
    __get_free_pages+0xb/0x24
    alloc_pages_exact+0x19/0x24
    do_mmap_pgoff+0x423/0x658
    vm_mmap_pgoff+0x3f/0x4e
    load_flat_file+0x20d/0x4f8
    load_flat_binary+0x3f/0x26c
    search_binary_handler+0x51/0xe4
    do_execveat_common+0x271/0x35c
    do_execve+0x19/0x1c
    ret_fast_syscall+0x1/0x4a
  Mem-info:
  Normal per-cpu:
  CPU    0: hi:    0, btch:   1 usd:   0
  active_anon:0 inactive_anon:0 isolated_anon:0
   active_file:0 inactive_file:0 isolated_file:0
   unevictable:123 dirty:0 writeback:0 unstable:0
   free:1515 slab_reclaimable:17 slab_unreclaimable:139
   mapped:0 shmem:0 pagetables:0 bounce:0
   free_cma:0
  Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
  lowmem_reserve[]: 0 0
  Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
  123 total pagecache pages
  2048 pages of RAM
  1538 free pages
  66 reserved pages
  109 slab pages
  -46 pages shared
  0 pages swap cached
  nommu: Allocation of length 221184 from process 67 (free) failed
  Normal per-cpu:
  CPU    0: hi:    0, btch:   1 usd:   0
  active_anon:0 inactive_anon:0 isolated_anon:0
   active_file:0 inactive_file:0 isolated_file:0
   unevictable:123 dirty:0 writeback:0 unstable:0
   free:1515 slab_reclaimable:17 slab_unreclaimable:139
   mapped:0 shmem:0 pagetables:0 bounce:0
   free_cma:0
  Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
  lowmem_reserve[]: 0 0
  Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
  123 total pagecache pages
  Unable to allocate RAM for process text/data, errno 12 SEGV

This problem happens because we allocate ordered page through
__get_free_pages() in do_mmap_private() in some cases and we try to free
individual pages rather than ordered page in free_page_series().  In
this case, freeing pages whose refcount is not 0 won't be freed to the
page allocator so memory leak happens.

To fix the problem, this patch changes __get_free_pages() to
alloc_pages_exact() since alloc_pages_exact() returns
physically-contiguous pages but each pages are refcounted.

Fixes: dbc8358c72 ("mm/nommu: use alloc_pages_exact() rather than its own implementation").
Reported-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Tested-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>	[3.19]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-28 09:57:50 -08:00
Sasha Levin f0774d884b mm: shmem: check for mapping owner before dereferencing
mapping->host can be NULL and shouldn't be dereferenced before being checked.

[ 1295.741844] GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] SMP KASAN
[ 1295.746387] Dumping ftrace buffer:
[ 1295.748217]    (ftrace buffer empty)
[ 1295.749527] Modules linked in:
[ 1295.750268] CPU: 62 PID: 23410 Comm: trinity-c70 Not tainted 3.19.0-next-20150219-sasha-00045-g9130270f #1939
[ 1295.750268] task: ffff8803a49db000 ti: ffff8803a4dc8000 task.ti: ffff8803a4dc8000
[ 1295.750268] RIP: shmem_mapping (mm/shmem.c:1458)
[ 1295.750268] RSP: 0000:ffff8803a4dcfbf8  EFLAGS: 00010206
[ 1295.750268] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 00000000000f2804
[ 1295.750268] RDX: 0000000000000005 RSI: 0400000000000794 RDI: 0000000000000028
[ 1295.750268] RBP: ffff8803a4dcfc08 R08: 0000000000000000 R09: 00000000031de000
[ 1295.750268] R10: dffffc0000000000 R11: 00000000031c1000 R12: 0400000000000794
[ 1295.750268] R13: 00000000031c2000 R14: 00000000031de000 R15: ffff880e3bdc1000
[ 1295.750268] FS:  00007f8703c7e700(0000) GS:ffff881164800000(0000) knlGS:0000000000000000
[ 1295.750268] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1295.750268] CR2: 0000000004e58000 CR3: 00000003a9f3c000 CR4: 00000000000007a0
[ 1295.750268] DR0: ffffffff81000000 DR1: 0000009494949494 DR2: 0000000000000000
[ 1295.750268] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000d0602
[ 1295.750268] Stack:
[ 1295.750268]  ffff8803a4dcfec8 ffffffffbb1dc770 ffff8803a4dcfc38 ffffffffad6f230b
[ 1295.750268]  ffffffffad6f2b0d 0000014100000000 ffff88001e17c08b ffff880d9453fe08
[ 1295.750268]  ffff8803a4dcfd18 ffffffffad6f2ce2 ffff8803a49dbcd8 ffff8803a49dbce0
[ 1295.750268] Call Trace:
[ 1295.750268] mincore_page (mm/mincore.c:61)
[ 1295.750268] ? mincore_pte_range (include/linux/spinlock.h:312 mm/mincore.c:131)
[ 1295.750268] mincore_pte_range (mm/mincore.c:151)
[ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113)
[ 1295.750268] __walk_page_range (mm/pagewalk.c:51 mm/pagewalk.c:90 mm/pagewalk.c:116 mm/pagewalk.c:204)
[ 1295.750268] walk_page_range (mm/pagewalk.c:275)
[ 1295.750268] SyS_mincore (mm/mincore.c:191 mm/mincore.c:253 mm/mincore.c:220)
[ 1295.750268] ? mincore_pte_range (mm/mincore.c:220)
[ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113)
[ 1295.750268] ? __mincore_unmapped_range (mm/mincore.c:105)
[ 1295.750268] ? ptlock_free (mm/mincore.c:24)
[ 1295.750268] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1610)
[ 1295.750268] ia32_do_call (arch/x86/ia32/ia32entry.S:446)
[ 1295.750268] Code: e5 48 c1 ea 03 53 48 89 fb 48 83 ec 08 80 3c 02 00 75 4f 48 b8 00 00 00 00 00 fc ff df 48 8b 1b 48 8d 7b 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 3f 48 b8 00 00 00 00 00 fc ff df 48 8b 5b 28 48

All code
========
   0:	e5 48                	in     $0x48,%eax
   2:	c1 ea 03             	shr    $0x3,%edx
   5:	53                   	push   %rbx
   6:	48 89 fb             	mov    %rdi,%rbx
   9:	48 83 ec 08          	sub    $0x8,%rsp
   d:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
  11:	75 4f                	jne    0x62
  13:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  1a:	fc ff df
  1d:	48 8b 1b             	mov    (%rbx),%rbx
  20:	48 8d 7b 28          	lea    0x28(%rbx),%rdi
  24:	48 89 fa             	mov    %rdi,%rdx
  27:	48 c1 ea 03          	shr    $0x3,%rdx
  2b:*	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)		<-- trapping instruction
  2f:	75 3f                	jne    0x70
  31:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  38:	fc ff df
  3b:	48 8b 5b 28          	mov    0x28(%rbx),%rbx
  3f:	48                   	rex.W
	...

Code starting with the faulting instruction
===========================================
   0:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
   4:	75 3f                	jne    0x45
   6:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
   d:	fc ff df
  10:	48 8b 5b 28          	mov    0x28(%rbx),%rbx
  14:	48                   	rex.W
	...
[ 1295.750268] RIP shmem_mapping (mm/shmem.c:1458)
[ 1295.750268]  RSP <ffff8803a4dcfbf8>

Fixes: 97b713ba3e ("fs: kill BDI_CAP_SWAP_BACKED")
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-02-23 10:00:11 -08:00
Linus Torvalds be5e6616dd Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted stuff from this cycle.  The big ones here are multilayer
  overlayfs from Miklos and beginning of sorting ->d_inode accesses out
  from David"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (51 commits)
  autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation
  procfs: fix race between symlink removals and traversals
  debugfs: leave freeing a symlink body until inode eviction
  Documentation/filesystems/Locking: ->get_sb() is long gone
  trylock_super(): replacement for grab_super_passive()
  fanotify: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
  Cachefiles: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
  VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
  SELinux: Use d_is_positive() rather than testing dentry->d_inode
  Smack: Use d_is_positive() rather than testing dentry->d_inode
  TOMOYO: Use d_is_dir() rather than d_inode and S_ISDIR()
  Apparmor: Use d_is_positive/negative() rather than testing dentry->d_inode
  Apparmor: mediated_filesystem() should use dentry->d_sb not inode->i_sb
  VFS: Split DCACHE_FILE_TYPE into regular and special types
  VFS: Add a fallthrough flag for marking virtual dentries
  VFS: Add a whiteout dentry type
  VFS: Introduce inode-getting helpers for layered/unioned fs environments
  Infiniband: Fix potential NULL d_inode dereference
  posix_acl: fix reference leaks in posix_acl_create
  autofs4: Wrong format for printing dentry
  ...
2015-02-22 17:42:14 -08:00
David Howells e36cb0b89c VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
Convert the following where appropriate:

 (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).

 (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).

 (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry).  This is actually more
     complicated than it appears as some calls should be converted to
     d_can_lookup() instead.  The difference is whether the directory in
     question is a real dir with a ->lookup op or whether it's a fake dir with
     a ->d_automount op.

In some circumstances, we can subsume checks for dentry->d_inode not being
NULL into this, provided we the code isn't in a filesystem that expects
d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
use d_inode() rather than d_backing_inode() to get the inode pointer).

Note that the dentry type field may be set to something other than
DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
manages the fall-through from a negative dentry to a lower layer.  In such a
case, the dentry type of the negative union dentry is set to the same as the
type of the lower dentry.

However, if you know d_inode is not NULL at the call site, then you can use
the d_is_xxx() functions even in a filesystem.

There is one further complication: a 0,0 chardev dentry may be labelled
DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE.  Strictly, this was
intended for special directory entry types that don't have attached inodes.

The following perl+coccinelle script was used:

use strict;

my @callers;
open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
    die "Can't grep for S_ISDIR and co. callers";
@callers = <$fd>;
close($fd);
unless (@callers) {
    print "No matches\n";
    exit(0);
}

my @cocci = (
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISLNK(E->d_inode->i_mode)',
    '+ d_is_symlink(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISDIR(E->d_inode->i_mode)',
    '+ d_is_dir(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISREG(E->d_inode->i_mode)',
    '+ d_is_reg(E)' );

my $coccifile = "tmp.sp.cocci";
open($fd, ">$coccifile") || die $coccifile;
print($fd "$_\n") || die $coccifile foreach (@cocci);
close($fd);

foreach my $file (@callers) {
    chomp $file;
    print "Processing ", $file, "\n";
    system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
	die "spatch failed";
}

[AV: overlayfs parts skipped]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:38:41 -05:00
Linus Torvalds b11a278397 Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kconfig updates from Michal Marek:
 "Yann E Morin was supposed to take over kconfig maintainership, but
  this hasn't happened.  So I'm sending a few kconfig patches that I
  collected:

   - Fix for missing va_end in kconfig
   - merge_config.sh displays used if given too few arguments
   - s/boolean/bool/ in Kconfig files for consistency, with the plan to
     only support bool in the future"

* 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  kconfig: use va_end to match corresponding va_start
  merge_config.sh: Display usage if given too few arguments
  kconfig: use bool instead of boolean for type definition attributes
2015-02-19 10:36:45 -08:00
Al Viro d879cb8341 move iov_iter.c from mm/ to lib/
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-17 22:22:17 -05:00
Al Viro 4b8164b91d new helper: dup_iter()
Copy iter and kmemdup the underlying array for the copy.  Returns
a pointer to result of kmemdup() to be kfree()'d later.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-17 22:21:11 -05:00
Linus Torvalds 038911597e Merge branch 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull lazytime mount option support from Al Viro:
 "Lazytime stuff from tytso"

* 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ext4: add optimization for the lazytime mount option
  vfs: add find_inode_nowait() function
  vfs: add support for a lazytime mount option
2015-02-17 16:12:34 -08:00
Linus Torvalds 66dc830d14 Merge branch 'iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
 "More iov_iter work - missing counterpart of iov_iter_init() for
  bvec-backed ones and vfs_read_iter()/vfs_write_iter() - wrappers for
  sync calls of ->read_iter()/->write_iter()"

* 'iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: add vfs_iter_{read,write} helpers
  new helper: iov_iter_bvec()
2015-02-17 15:48:33 -08:00
Matthew Wilcox e748dcd095 vfs: remove get_xip_mem
All callers of get_xip_mem() are now gone.  Remove checks for it,
initialisers of it, documentation of it and the only implementation of it.
 Also remove mm/filemap_xip.c as it is now empty.  Also remove
documentation of the long-gone get_xip_page().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:03 -08:00