Commit Graph

290 Commits

Author SHA1 Message Date
Tang Chen 0197518cd3 memory-hotplug: remove memmap of sparse-vmemmap
Introduce a new API vmemmap_free() to free and remove vmemmap
pagetables.  Since pagetable implements are different, each architecture
has to provide its own version of vmemmap_free(), just like
vmemmap_populate().

Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.

[mhocko@suse.cz: fix implicit declaration of remove_pagetable]
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:12 -08:00
Wen Congyang 24d335ca36 memory-hotplug: introduce new arch_remove_memory() for removing page table
For removing memory, we need to remove page tables.  But it depends on
architecture.  So the patch introduce arch_remove_memory() for removing
page table.  Now it only calls __remove_pages().

Note: __remove_pages() for some archtecuture is not implemented
      (I don't know how to implement it for s390).

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:12 -08:00
Martin Schwidefsky abf09bed3c s390/mm: implement software dirty bits
The s390 architecture is unique in respect to dirty page detection,
it uses the change bit in the per-page storage key to track page
modifications. All other architectures track dirty bits by means
of page table entries. This property of s390 has caused numerous
problems in the past, e.g. see git commit ef5d437f71
"mm: fix XFS oops due to dirty pages without buffers on s390".

To avoid future issues in regard to per-page dirty bits convert
s390 to a fault based software dirty bit detection mechanism. All
user page table entries which are marked as clean will be hardware
read-only, even if the pte is supposed to be writable. A write by
the user process will trigger a protection fault which will cause
the user pte to be marked as dirty and the hardware read-only bit
is removed.

With this change the dirty bit in the storage key is irrelevant
for Linux as a host, but the storage key is still required for
KVM guests. The effect is that page_test_and_clear_dirty and the
related code can be removed. The referenced bit in the storage
key is still used by the page_test_and_clear_young primitive to
provide page age information.

For page cache pages of mappings with mapping_cap_account_dirty
there will not be any change in behavior as the dirty bit tracking
already uses read-only ptes to control the amount of dirty pages.
Only for swap cache pages and pages of mappings without
mapping_cap_account_dirty there can be additional protection faults.
To avoid an excessive number of additional faults the mk_pte
primitive checks for PageDirty if the pgprot value allows for writes
and pre-dirties the pte. That avoids all additional faults for
tmpfs and shmem pages until these pages are added to the swap cache.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-02-14 15:55:23 +01:00
Hendrik Brueckner 486c0a0bc8 s390/mm: Fix crst upgrade of mmap with MAP_FIXED
Right now the page table upgrade does not happen if the end address
of a fixed mapping is greater than TASK_SIZE.
Enhance s390_mmap_check() to handle MAP_FIXED mappings correctly.

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-02-14 15:55:22 +01:00
Heiko Carstens 420f42ecf4 s390/irq: remove split irq fields from /proc/stat
Now that irq sum accounting for /proc/stat's "intr" line works again we
have the oddity that the sum field (first field) contains only the sum
of the second (external irqs) and third field (I/O interrupts).
The reason for that is that these two fields are already sums of all other
fields. So if we would sum up everything we would count every interrupt
twice.
This is broken since the split interrupt accounting was merged two years
ago: 052ff461c8 "[S390] irq: have detailed
statistics for interrupt types".
To fix this remove the split interrupt fields from /proc/stat's "intr"
line again and only have them in /proc/interrupts.

This restores the old behaviour, seems to be the only sane fix and mimics
a behaviour from other architectures where /proc/interrupts also contains
more than /proc/stat's "intr" line does.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-01-08 10:57:07 +01:00
Martin Schwidefsky 39efd4ec9a s390/ptrace: race of single stepping vs signal delivery
The current single step code is racy in regard to concurrent delivery
of signals. If a signal is delivered after a PER program check occurred
but before the TIF_PER_TRAP bit has been checked in entry[64].S the code
clears TIF_PER_TRAP and then calls do_signal. This is wrong, if the
instruction completed (or has been suppressed) a SIGTRAP should be
delivered to the debugger in any case. Only if the instruction has been
nullified the SIGTRAP may not be send.

The new logic always sets TIF_PER_TRAP if the program check indicates PER
tracing but removes it again for all program checks that are nullifying.
The effect is that for each change in the PSW address we now get a
single SIGTRAP.

Reported-by: Andreas Arnez <arnez@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-11-23 11:14:33 +01:00
Heiko Carstens 0a4ccc9929 s390/mm: move kernel_page_present/kernel_map_pages to page_attr.c
Keep related functions together and move to appropriate file.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-11-23 11:14:31 +01:00
Heiko Carstens 6b70a92080 s390/memory hotplug: use pfmf instruction to initialize storage keys
Move and rename init_storage_keys() to pageattr.c, so it can also be
used from the sclp memory hotplug code in order to initialize
storage keys.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-11-23 11:14:30 +01:00
Heiko Carstens a4f32bdbd9 s390/mm: keep fault_init() private to fault.c
Just convert fault_init() to an early initcall. That's still early
enough since it only needs be called before user space processes get
executed. No reason to externalize it.
Also add the function to the init section and move the store_indication
variable to the read_mostly section.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-11-23 11:14:29 +01:00
Heiko Carstens f7817968d0 s390/mm,vmemmap: use 1MB frames for vmemmap
Use 1MB frames for vmemmap if EDAT1 is available in order to
reduce TLB pressure
Always use a 1MB frame even if its only partially needed for
struct pages. Otherwise we would end up with a mix of large
frame and page mappings, because vmemmap_populate gets called
for each section (256MB -> 3.5MB memmap) separately.
Worst case is that we would waste 512KB.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-11-23 11:14:25 +01:00
Heiko Carstens 18da236908 s390/mm,vmem: use 2GB frames for identity mapping
Use 2GB frames for indentity mapping if EDAT2 is
available to reduce TLB pressure.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-11-23 11:14:24 +01:00
Heiko Carstens 516bad44b9 s390/gup: fix access_ok() usage in __get_user_pages_fast()
access_ok() returns always "true" on s390. Therefore all access_ok()
invocations are rather pointless.
However when walking page tables we need to make sure that everything
is within bounds of the ASCE limit of the task's address space.
So remove the access_ok() call and add the same check we have in
get_user_pages_fast().

Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-11-13 11:02:28 +01:00
Heiko Carstens d55c4c613f s390/gup: add missing TASK_SIZE check to get_user_pages_fast()
When walking page tables we need to make sure that everything
is within bounds of the ASCE limit of the task's address space.
Otherwise we might calculate e.g. a pud pointer which is not
within a pud and dereference it.
So check against TASK_SIZE (which is the ASCE limit) before
walking page tables.

Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-11-13 11:02:26 +01:00
Gerald Schaefer 156152f84e s390/mm: use pmd_large() instead of pmd_huge()
Without CONFIG_HUGETLB_PAGE, pmd_huge() will always return 0. So
pmd_large() should be used instead in places where both transparent
huge pages and hugetlbfs pages can occur.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-10-26 16:44:23 +02:00
Heiko Carstens fc7e48aad3 s390/mm,vmem: fix vmem_add_mem()/vmem_remove_range()
vmem_add_mem() should only then insert a large page if pmd_none() is true
for the specific entry. We might have a leftover from a previous mapping.
In addition make vmem_remove_range()'s page table walk code more complete
and fix a couple of potential endless loops (which can never happen :).

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-10-09 14:17:01 +02:00
Heiko Carstens c972cc60c2 s390/vmalloc: have separate modules area
Add a special module area on top of the vmalloc area, which may be only
used for modules and bpf jit generated code.
This makes sure that inter module branches will always happen without a
trampoline and in addition having all the code within a 2GB frame is
branch prediction unit friendly.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-10-09 14:17:01 +02:00
Heiko Carstens 8fe234d3c8 s390/mm: fix mapping of read-only kernel text section
Within the identity mapping the kernel text section is mapped read-only.
However when mapping the first and last page of the text section we must
round upwards and downwards respectively, if only parts of a page belong
to the section.
Otherwise potential rw data can be mapped read-only. So the rounding must
be done just the other way we have it right now.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-10-09 14:16:59 +02:00
Heiko Carstens e76e82d772 s390/mm: add page table dumper
This is more or less the same as the x86 page table dumper which was
merged four years ago: 926e5392 "x86: add code to dump the (kernel)
page tables for visual inspection by kernel developers".

We add a file at /sys/kernel/debug/kernel_page_tables for debugging
purposes so it's quite easy to see the kernel page table layout and
possible odd mappings:

---[ Identity Mapping ]---
0x0000000000000000-0x0000000000100000        1M PTE RW
---[ Kernel Image Start ]---
0x0000000000100000-0x0000000000800000        7M PMD RO
0x0000000000800000-0x00000000008a9000      676K PTE RO
0x00000000008a9000-0x0000000000900000      348K PTE RW
0x0000000000900000-0x0000000001500000       12M PMD RW
---[ Kernel Image End ]---
0x0000000001500000-0x0000000280000000    10219M PMD RW
0x0000000280000000-0x000003d280000000     3904G PUD I
---[ vmemmap Area ]---
0x000003d280000000-0x000003d288c00000      140M PTE RW
0x000003d288c00000-0x000003d300000000     1908M PMD I
0x000003d300000000-0x000003e000000000       52G PUD I
---[ vmalloc Area ]---
0x000003e000000000-0x000003e000009000       36K PTE RW
0x000003e000009000-0x000003e0000ee000      916K PTE I
0x000003e0000ee000-0x000003e000146000      352K PTE RW
0x000003e000146000-0x000003e000200000      744K PTE I
0x000003e000200000-0x000003e080000000     2046M PMD I
0x000003e080000000-0x0000040000000000      126G PUD I

This usually makes only sense for kernel developers. The output
with CONFIG_DEBUG_PAGEALLOC is not very helpful, because of the
huge number of mapped out pages, however I decided for the time
being to not add a !DEBUG_PAGEALLOC dependency.
Maybe it's helpful for somebody even with that option.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-10-09 14:16:58 +02:00
Heiko Carstens cc5b9a4518 s390/mm,pageattr: remove superfluous EXPORT_SYMBOLs
Remove some EXPORT_SYMBOLs. There is no module code anywhere
which calls these pageattr functions.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-10-09 14:16:57 +02:00
Heiko Carstens 5b1ba9e30c s390/mm,pageattr: add more page table walk sanity checks
The current page table walk code in pageattr.c only checks for large pages
while walking the kernel page table, but happily assumes that everything
else is just fine.
Add more checks so we never access invalid memory regions.

Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-10-09 14:16:57 +02:00
Heiko Carstens 378b1e7a80 s390/mm: fix pmd_huge() usage for kernel mapping
pmd_huge() will always return 0 on !HUGETLBFS, however we use that helper
function when walking the kernel page tables to decide if we have a
1MB page frame or not.
Since we create 1MB frames for the kernel 1:1 mapping independently of
HUGETLBFS this can lead to incorrect storage accesses since the code
can assume that we have a pointer to a page table instead of a pointer
to a 1MB frame.

Fix this by adding a pmd_large() primitive like other architectures have
it already and remove all references to HUGETLBFS/HUGETLBPAGE from the
code that walks kernel page tables.

Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-10-09 14:16:56 +02:00
Shaohua Li 45cac65b0f readahead: fault retry breaks mmap file read random detection
.fault now can retry.  The retry can break state machine of .fault.  In
filemap_fault, if page is miss, ra->mmap_miss is increased.  In the second
try, since the page is in page cache now, ra->mmap_miss is decreased.  And
these are done in one fault, so we can't detect random mmap file access.

Add a new flag to indicate .fault is tried once.  In the second try, skip
ra->mmap_miss decreasing.  The filemap_fault state machine is ok with it.

I only tested x86, didn't test other archs, but looks the change for other
archs is obvious, but who knows :)

Signed-off-by: Shaohua Li <shaohua.li@fusionio.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:47 +09:00
Gerald Schaefer 1ae1c1d09f thp, s390: architecture backend for thp on s390
This implements the architecture backend for transparent hugepages
on s390.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:31 +09:00
Gerald Schaefer 274023da1e thp, s390: disable thp for kvm host on s390
This patch is part of the architecture backend for thp on s390.  It
disables thp for kvm hosts, because there is no kvm host hugepage support
so far.  Existing thp mappings are split by follow_page() with FOLL_SPLIT,
and future thp mappings are prevented by setting VM_NOHUGEPAGE in
mm->def_flags.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:30 +09:00
Gerald Schaefer 9501d09fa3 thp, s390: thp pagetable pre-allocation for s390
This patch is part of the architecture backend for thp on s390.  It
provides the pagetable pre-allocation functions
pgtable_trans_huge_deposit() and pgtable_trans_huge_withdraw().  Unlike
other archs, s390 has no struct page * as pgtable_t, but rather a pointer
to the page table.  So instead of saving the pagetable pre- allocation
list info inside the struct page, it is being saved within the pagetable
itself.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:30 +09:00