Commit Graph

32197 Commits

Author SHA1 Message Date
Oleg Nesterov e7fd1549ae coredump: format_corename() can leak cn->corename
do_coredump() assumes that format_corename() can only fail if
expand_corename() fails and frees cn->corename.  This is not true, for
example cn_print_exe_file() can fail and in this case nobody frees
cn->corename.

Change do_coredump() to always do kfree(cn->corename) after it calls
format_corename() (NULL is fine), change expand_corename() to do nothing
if kmalloc() fails.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Colin Walters <walters@verbum.org>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Lucas De Marchi <lucas.de.marchi@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:02 -07:00
Oleg Nesterov 77d5591802 signals: eventpoll: do not use sigprocmask()
sigprocmask() should die. None of the current callers actually
need this strange interface.

Change fs/eventpoll.c to use set_current_blocked(). This also
means we should not worry about SIGKILL/SIGSTOP.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Jason Baron <jbaron@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:01 -07:00
Gu Zheng e68e96d2a7 fs/fat: use fat_msg() to replace printk() in __fat_fs_error()
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:01 -07:00
Vyacheslav Dubeyko e5f7f84843 ] nilfs2: use atomic64_t type for inodes_count and blocks_count fields in nilfs_root struct
The cp_inodes_count and cp_blocks_count are represented as __le64 type in
on-disk structure (struct nilfs_checkpoint).  But analogous fields in
in-core structure (struct nilfs_root) are represented by atomic_t type.

This patch replaces atomic_t on atomic64_t type in representation of
inodes_count and blocks_count fields in struct nilfs_root.

Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Joern Engel <joern@logfs.org>
Cc: Clemens Eisserer <linuxhippy@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:01 -07:00
Vyacheslav Dubeyko c7ef972c44 nilfs2: implement calculation of free inodes count
Currently, NILFS2 returns 0 as free inodes count (f_ffree) and current
used inodes count as total file nodes in file system (f_files):

df -i
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/loop0           2      2       0  100% /mnt/nilfs2

This patch implements real calculation of free inodes count.  First of
all, it is calculated total file nodes in file system as
(desc_blocks_count * groups_per_desc_block * entries_per_group).  Then, it
is calculated free inodes count as difference the total file nodes and
used inodes count.  As a result, we have such output for NILFS2:

df -i
Filesystem       Inodes   IUsed    IFree IUse% Mounted on
/dev/loop0      4194304 2114701  2079603   51% /mnt/nilfs2

Reported-by: Clemens Eisserer <linuxhippy@gmail.com>
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Joern Engel <joern@logfs.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:01 -07:00
Kees Cook f170168b9a drivers: avoid parsing names as kthread_run() format strings
Calling kthread_run with a single name parameter causes it to be handled
as a format string. Many callers are passing potentially dynamic string
content, so use "%s" in those cases to avoid any potential accidents.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:41 -07:00
Kees Cook 096a8aac6b clean up scary strncpy(dst, src, strlen(src)) uses
Fix various weird constructions of strncpy(dst, src, strlen(src)).

Length limits should be about the space available in the destination,
not repurposed as a method to either always include or always exclude a
trailing NULL byte.  Either the NULL should always be copied (using
strlcpy), or it should not be copied (using something like memcpy).
Readable code should not depend on the weird behavior of strncpy when it
hits the length limit.  Better to avoid the anti-pattern entirely.

[akpm@linux-foundation.org: revert getdelays.c part due to missing bsd/string.h]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	[staging]
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	[acpi]
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ursula Braun <ursula.braun@de.ibm.com>
Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:41 -07:00
Jiang Liu 0ed5fd1385 mm: use totalram_pages instead of num_physpages at runtime
The global variable num_physpages is scheduled to be removed, so use
totalram_pages instead of num_physpages at runtime.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:35 -07:00
Mel Gorman a0b8cab3b9 mm: remove lru parameter from __pagevec_lru_add and remove parts of pagevec API
Now that the LRU to add a page to is decided at LRU-add time, remove the
misleading lru parameter from __pagevec_lru_add.  A consequence of this
is that the pagevec_lru_add_file, pagevec_lru_add_anon and similar
helpers are misleading as the caller no longer has direct control over
what LRU the page is added to.  Unused helpers are removed by this patch
and existing users of pagevec_lru_add_file() are converted to use
lru_cache_add_file() directly and use the per-cpu pagevecs instead of
creating their own pagevec.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
Cc: Andrew Perepechko <anserper@ya.ru>
Cc: Robin Dong <sanbai@taobao.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
Cc: David Howells <dhowells@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:31 -07:00
HATAYAMA Daisuke 83086978c6 vmcore: support mmap() on /proc/vmcore
This patch introduces mmap_vmcore().

Don't permit writable nor executable mapping even with mprotect()
because this mmap() is aimed at reading crash dump memory.  Non-writable
mapping is also requirement of remap_pfn_range() when mapping linear
pages on non-consecutive physical pages; see is_cow_mapping().

Set VM_MIXEDMAP flag to remap memory by remap_pfn_range and by
remap_vmalloc_range_pertial at the same time for a single vma.
do_munmap() can correctly clean partially remapped vma with two
functions in abnormal case.  See zap_pte_range(), vm_normal_page() and
their comments for details.

On x86-32 PAE kernels, mmap() supports at most 16TB memory only.  This
limitation comes from the fact that the third argument of
remap_pfn_range(), pfn, is of 32-bit length on x86-32: unsigned long.

[akpm@linux-foundation.org: use min(), switch to conventional error-unwinding approach]
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Tested-by: Maxim Uvarov <muvarov@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:30 -07:00
HATAYAMA Daisuke 591ff71664 vmcore: calculate vmcore file size from buffer size and total size of vmcore objects
The previous patches newly added holes before each chunk of memory and
the holes need to be count in vmcore file size.  There are two ways to
count file size in such a way:

1) suppose m is a poitner to the last vmcore object in vmcore_list.
   Then file size is (m->offset + m->size), or

2) calculate sum of size of buffers for ELF header, program headers,
   ELF note segments and objects in vmcore_list.

Although 1) is more direct and simpler than 2), 2) seems better in that
it reflects internal object structure of /proc/vmcore.  Thus, this patch
changes get_vmcore_size_elf{64, 32} so that it calculates size in the
way of 2).

As a result, both get_vmcore_size_elf{64, 32} have the same definition.
Merge them as get_vmcore_size.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:30 -07:00
HATAYAMA Daisuke ef9e78fd27 vmcore: allow user process to remap ELF note segment buffer
Now ELF note segment has been copied in the buffer on vmalloc memory.
To allow user process to remap the ELF note segment buffer with
remap_vmalloc_page, the corresponding VM area object has to have
VM_USERMAP flag set.

[akpm@linux-foundation.org: use the conventional comment layout]
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:30 -07:00
HATAYAMA Daisuke 087350c9dc vmcore: allocate ELF note segment in the 2nd kernel vmalloc memory
The reasons why we don't allocate ELF note segment in the 1st kernel
(old memory) on page boundary is to keep backward compatibility for old
kernels, and that if doing so, we waste not a little memory due to
round-up operation to fit the memory to page boundary since most of the
buffers are in per-cpu area.

ELF notes are per-cpu, so total size of ELF note segments depends on
number of CPUs.  The current maximum number of CPUs on x86_64 is 5192,
and there's already system with 4192 CPUs in SGI, where total size
amounts to 1MB.  This can be larger in the near future or possibly even
now on another architecture that has larger size of note per a single
cpu.  Thus, to avoid the case where memory allocation for large block
fails, we allocate vmcore objects on vmalloc memory.

This patch adds elfnotes_buf and elfnotes_sz variables to keep pointer
to the ELF note segment buffer and its size.  There's no longer the
vmcore object that corresponds to the ELF note segment in vmcore_list.
Accordingly, read_vmcore() has new case for ELF note segment and
set_vmcore_list_offsets_elf{64,32}() and other helper functions starts
calculating offset from sum of size of ELF headers and size of ELF note
segment.

[akpm@linux-foundation.org: use min(), fix error-path vzalloc() leaks]
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:30 -07:00
HATAYAMA Daisuke 7f614cd1e0 vmcore: treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list
Treat memory chunks referenced by PT_LOAD program header entries in
page-size boundary in vmcore_list.  Formally, for each range [start,
end], we set up the corresponding vmcore object in vmcore_list to
[rounddown(start, PAGE_SIZE), roundup(end, PAGE_SIZE)].

This change affects layout of /proc/vmcore.  The gaps generated by the
rearrangement are newly made visible to applications as holes.
Concretely, they are two ranges [rounddown(start, PAGE_SIZE), start] and
[end, roundup(end, PAGE_SIZE)].

Suppose variable m points at a vmcore object in vmcore_list, and
variable phdr points at the program header of PT_LOAD type the variable
m corresponds to.  Then, pictorially:

  m->offset                    +---------------+
                               | hole          |
phdr->p_offset =               +---------------+
  m->offset + (paddr - start)  |               |\
                               | kernel memory | phdr->p_memsz
                               |               |/
                               +---------------+
                               | hole          |
  m->offset + m->size          +---------------+

where m->offset and m->offset + m->size are always page-size aligned.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:30 -07:00
HATAYAMA Daisuke f2bdacdd59 vmcore: allocate buffer for ELF headers on page-size alignment
Allocate ELF headers on page-size boundary using __get_free_pages()
instead of kmalloc().

Later patch will merge PT_NOTE entries into a single unique one and
decrease the buffer size actually used.  Keep original buffer size in
variable elfcorebuf_sz_orig to kfree the buffer later and actually used
buffer size with rounded up to page-size boundary in variable
elfcorebuf_sz separately.

The size of part of the ELF buffer exported from /proc/vmcore is
elfcorebuf_sz.

The merged, removed PT_NOTE entries, i.e.  the range [elfcorebuf_sz,
elfcorebuf_sz_orig], is filled with 0.

Use size of the ELF headers as an initial offset value in
set_vmcore_list_offsets_elf{64,32} and
process_ptload_program_headers_elf{64,32} in order to indicate that the
offset includes the holes towards the page boundary.

As a result, both set_vmcore_list_offsets_elf{64,32} have the same
definition.  Merge them as set_vmcore_list_offsets.

[akpm@linux-foundation.org: add free_elfcorebuf(), cleanups]
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:30 -07:00
HATAYAMA Daisuke b27eb18660 vmcore: clean up read_vmcore()
Rewrite part of read_vmcore() that reads objects in vmcore_list in the
same way as part reading ELF headers, by which some duplicated and
redundant codes are removed.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:30 -07:00
Mel Gorman f919b19614 fs: nfs: inform the VM about pages being committed or unstable
VM page reclaim uses dirty and writeback page states to determine if
flushers are cleaning pages too slowly and that page reclaim should
stall waiting on flushers to catch up.  Page state in NFS is a bit more
complex and a clean page can be unreclaimable due to being unstable
which is effectively "dirty" from the perspective of the VM from reclaim
context.  Similarly, if the inode is currently being committed then it's
similar to being under writeback.

This patch adds a is_dirty_writeback() handled for NFS that checks if a
pages backing inode is being committed and should be accounted as
writeback and if a page has private state indicating that it is
effectively dirty.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:29 -07:00
Mel Gorman b45972265f mm: vmscan: take page buffers dirty and locked state into account
Page reclaim keeps track of dirty and under writeback pages and uses it
to determine if wait_iff_congested() should stall or if kswapd should
begin writing back pages.  This fails to account for buffer pages that
can be under writeback but not PageWriteback which is the case for
filesystems like ext3 ordered mode.  Furthermore, PageDirty buffer pages
can have all the buffers clean and writepage does no IO so it should not
be accounted as congested.

This patch adds an address_space operation that filesystems may
optionally use to check if a page is really dirty or really under
writeback.  An implementation is provided for for buffer_heads is added
and used for block operations and ext3 in ordered mode.  By default the
page flags are obeyed.

Credit goes to Jan Kara for identifying that the page flags alone are
not sufficient for ext3 and sanity checking a number of ideas on how the
problem could be addressed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:29 -07:00
Libin ef9f515a4c ncpfs: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT
(*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
as a inline funcion vma_pages() in linux/mm.h, so using it.

Signed-off-by: Libin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:26 -07:00
Pavel Emelyanov 541c237c09 pagemap: prepare to reuse constant bits with page-shift
In order to reuse bits from pagemap entries gracefully, we leave the
entries as is but on pagemap open emit a warning in dmesg, that bits
55-60 are about to change in a couple of releases.  Next, if a user
issues soft-dirty clear command via the clear_refs file (it was disabled
before v3.9) we assume that he's aware of the new pagemap format, note
that fact and report the bits in pagemap in the new manner.

The "migration strategy" looks like this then:

1. existing users are not affected -- they don't touch soft-dirty feature, thus
   see old bits in pagemap, but are warned and have time to fix themselves
2. those who use soft-dirty know about new pagemap format
3. some time soon we get rid of any signs of page-shift in pagemap as well as
   this trick with clear-soft-dirty affecting pagemap format.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:26 -07:00
Pavel Emelyanov 0f8975ec4d mm: soft-dirty bits for user memory changes tracking
The soft-dirty is a bit on a PTE which helps to track which pages a task
writes to.  In order to do this tracking one should

  1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
  2. Wait some time.
  3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)

To do this tracking, the writable bit is cleared from PTEs when the
soft-dirty bit is.  Thus, after this, when the task tries to modify a
page at some virtual address the #PF occurs and the kernel sets the
soft-dirty bit on the respective PTE.

Note, that although all the task's address space is marked as r/o after
the soft-dirty bits clear, the #PF-s that occur after that are processed
fast.  This is so, since the pages are still mapped to physical memory,
and thus all the kernel does is finds this fact out and puts back
writable, dirty and soft-dirty bits on the PTE.

Another thing to note, is that when mremap moves PTEs they are marked
with soft-dirty as well, since from the user perspective mremap modifies
the virtual memory at mremap's new address.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:26 -07:00
Pavel Emelyanov 2b0a9f0175 pagemap: introduce pagemap_entry_t without pmshift bits
These bits are always constant (== PAGE_SHIFT) and just occupy space in
the entry.  Moreover, in next patch we will need to report one more bit
in the pagemap, but all bits are already busy on it.

That said, describe the pagemap entry that has 6 more free zero bits.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:25 -07:00
Pavel Emelyanov af9de7eb18 clear_refs: introduce private struct for mm_walk
In the next patch the clear-refs-type will be required in
clear_refs_pte_range funciton, so prepare the walk->private to carry
this info.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:25 -07:00
Pavel Emelyanov 040fa02077 clear_refs: sanitize accepted commands declaration
This is the implementation of the soft-dirty bit concept that should
help keep track of changes in user memory, which in turn is very-very
required by the checkpoint-restore project (http://criu.org).

To create a dump of an application(s) we save all the information about
it to files, and the biggest part of such dump is the contents of tasks'
memory.  However, there are usage scenarios where it's not required to
get _all_ the task memory while creating a dump.  For example, when
doing periodical dumps, it's only required to take full memory dump only
at the first step and then take incremental changes of memory.  Another
example is live migration.  We copy all the memory to the destination
node without stopping all tasks, then stop them, check for what pages
has changed, dump it and the rest of the state, then copy it to the
destination node.  This decreases freeze time significantly.

That said, some help from kernel to watch how processes modify the
contents of their memory is required.

The proposal is to track changes with the help of new soft-dirty bit
this way:

1. First do "echo 4 > /proc/$pid/clear_refs".
   At that point kernel clears the soft dirty _and_ the writable bits from all
   ptes of process $pid. From now on every write to any page will result in #pf
   and the subsequent call to pte_mkdirty/pmd_mkdirty, which in turn will set
   the soft dirty flag.

2. Then read the /proc/$pid/pagemap2 and check the soft-dirty bit reported there
   (the 55'th one). If set, the respective pte was written to since last call
   to clear refs.

The soft-dirty bit is the _PAGE_BIT_HIDDEN one.  Although it's used by
kmemcheck, the latter one marks kernel pages with it, while the former
bit is put on user pages so they do not conflict to each other.

This patch:

A new clear-refs type will be added in the next patch, so prepare
code for that.

[akpm@linux-foundation.org: don't assume that sizeof(enum clear_refs_types) == sizeof(int)]
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:25 -07:00
Xue jiufei 4a184b4ff4 ocfs2: fix NULL pointer dereference when traversing o2hb_all_regions
There may exist NULL pointer dereference in config_item_name() when one
volume (say Volume A) unmounts while another (say Volume B) mounting.

     Volume A                          Volume B

  already Mounted.
  Unmounting, call
  o2hb_heartbeat_group_drop_item()
    -> config_item_put(item)
    set reg(A)->item.ci_name to NULL
    in function config_item_cleanup().

                                    begin mounting, call
                                    o2hb_region_pin() and tranverse all
                                    regions. When reading
                                    reg(A)->item.ci_name, it causes
                                    NULL pointer dereference.

  call o2hb_region_release() and
  del reg(A) from list.

So we should skip accessing regions that is going to release when
tranverse o2hb_all_regions.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: joyce <xuejiufei@huawei.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:25 -07:00