This is a change that was requested some time ago by Mel Gorman. Makes sense
to me, so here it is.
Note: I retain the name "mpol_free_shared_policy()" because it actually does
free the shared_policy, which is NOT a reference counted object. However, ...
The mempolicy object[s] referenced by the shared_policy are reference counted,
so mpol_put() is used to release the reference held by the shared_policy. The
mempolicy might not be freed at this time, because some task attached to the
shared object associated with the shared policy may be in the process of
allocating a page based on the mempolicy. In that case, the task performing
the allocation will hold a reference on the mempolicy, obtained via
mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
holding such a reference have called mpol_put() for the mempolicy.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allocating huge pages directly from the buddy allocator is not guaranteed to
succeed. Success depends on several factors (such as the amount of physical
memory available and the level of fragmentation). With the addition of
dynamic hugetlb pool resizing, allocations can occur much more frequently.
For these reasons it is desirable to keep track of huge page allocation
successes and failures.
Add two new vmstat entries to track huge page allocations that succeed and
fail. The presence of the two entries is contingent upon CONFIG_HUGETLB_PAGE
being enabled.
[akpm@linux-foundation.org: reduced ifdeffery]
Signed-off-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Eric Munson <ebmunson@us.ibm.com>
Tested-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andy Whitcroft <apw@shadowen.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To reduce hugetlb_lock acquisitions and releases when freeing excess surplus
pages, scan the page list in two parts. First, transfer the needed pages to
the hugetlb pool. Then drop the lock and free the remaining pages back to the
buddy allocator.
In the common case there are zero excess pages and no lock operations are
required.
Thanks Mel Gorman for this improvement.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The MPOL_BIND policy creates a zonelist that is used for allocations
controlled by that mempolicy. As the per-node zonelist is already being
filtered based on a zone id, this patch adds a version of __alloc_pages() that
takes a nodemask for further filtering. This eliminates the need for
MPOL_BIND to create a custom zonelist.
A positive benefit of this is that allocations using MPOL_BIND now use the
local node's distance-ordered zonelist instead of a custom node-id-ordered
zonelist. I.e., pages will be allocated from the closest allowed node with
available memory.
[Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
[Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
[Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Filtering zonelists requires very frequent use of zone_idx(). This is costly
as it involves a lookup of another structure and a substraction operation. As
the zone_idx is often required, it should be quickly accessible. The node idx
could also be stored here if it was found that accessing zone->node is
significant which may be the case on workloads where nodemasks are heavily
used.
This patch introduces a struct zoneref to store a zone pointer and a zone
index. The zonelist then consists of an array of these struct zonerefs which
are looked up as necessary. Helpers are given for accessing the zone index as
well as the node index.
[kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
[hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
[hugh@veritas.com: just return do_try_to_free_pages]
[hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently a node has two sets of zonelists, one for each zone type in the
system and a second set for GFP_THISNODE allocations. Based on the zones
allowed by a gfp mask, one of these zonelists is selected. All of these
zonelists consume memory and occupy cache lines.
This patch replaces the multiple zonelists per-node with two zonelists. The
first contains all populated zones in the system, ordered by distance, for
fallback allocations when the target/preferred node has no free pages. The
second contains all populated zones in the node suitable for GFP_THISNODE
allocations.
An iterator macro is introduced called for_each_zone_zonelist() that interates
through each zone allowed by the GFP flags in the selected zonelist.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Running the counters testcase from libhugetlbfs results in on 2.6.25-rc5
and 2.6.25-rc5-mm1:
BUG: soft lockup - CPU#3 stuck for 61s! [counters:10531]
NIP: c0000000000d1f3c LR: c0000000000d1f2c CTR: c0000000001b5088
REGS: c000005db12cb360 TRAP: 0901 Not tainted (2.6.25-rc5-autokern1)
MSR: 8000000000009032 <EE,ME,IR,DR> CR: 48008448 XER: 20000000
TASK = c000005dbf3d6000[10531] 'counters' THREAD: c000005db12c8000 CPU: 3
GPR00: 0000000000000004 c000005db12cb5e0 c000000000879228 0000000000000004
GPR04: 0000000000000010 0000000000000000 0000000000200200 0000000000100100
GPR08: c0000000008aba10 000000000000ffff 0000000000000004 0000000000000000
GPR12: 0000000028000442 c000000000770080
NIP [c0000000000d1f3c] .return_unused_surplus_pages+0x84/0x18c
LR [c0000000000d1f2c] .return_unused_surplus_pages+0x74/0x18c
Call Trace:
[c000005db12cb5e0] [c000005db12cb670] 0xc000005db12cb670 (unreliable)
[c000005db12cb670] [c0000000000d24c4] .hugetlb_acct_memory+0x2e0/0x354
[c000005db12cb740] [c0000000001b5048] .truncate_hugepages+0x1d4/0x214
[c000005db12cb890] [c0000000001b50a4] .hugetlbfs_delete_inode+0x1c/0x3c
[c000005db12cb920] [c000000000103fd8] .generic_delete_inode+0xf8/0x1c0
[c000005db12cb9b0] [c0000000001b5100] .hugetlbfs_drop_inode+0x3c/0x24c
[c000005db12cba50] [c00000000010287c] .iput+0xdc/0xf8
[c000005db12cbad0] [c0000000000fee54] .dentry_iput+0x12c/0x194
[c000005db12cbb60] [c0000000000ff050] .d_kill+0x6c/0xa4
[c000005db12cbbf0] [c0000000000ffb74] .dput+0x18c/0x1b0
[c000005db12cbc70] [c0000000000e9e98] .__fput+0x1a4/0x1e8
[c000005db12cbd10] [c0000000000e61ec] .filp_close+0xb8/0xe0
[c000005db12cbda0] [c0000000000e62d0] .sys_close+0xbc/0x134
[c000005db12cbe30] [c00000000000872c] syscall_exit+0x0/0x40
Instruction dump:
ebbe8038 38800010 e8bf0002 3bbd0008 7fa3eb78 38a50001 7ca507b4 4818df25
60000000 38800010 38a00000 7c601b78 <7fa3eb78> 2f800010 409d0008 38000010
This was tracked down to a potential livelock in
return_unused_surplus_hugepages(). In the case where we have surplus
pages on some node, but no free pages on the same node, we may never
break out of the loop. To avoid this livelock, terminate the search if
we iterate a number of times equal to the number of online nodes without
freeing a page.
Thanks to Andy Whitcroft and Adam Litke for helping with debugging and
the patch.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we show the surplus hugetlb pool state in /proc/meminfo, but
not in the per-node meminfo files, even though we track the information
on a per-node basis. Printing it there can help track down dynamic pool
bugs including the one in the follow-on patch.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Free pages in the hugetlb pool are free and as such have a reference count of
zero. Regular allocations into the pool from the buddy are "freed" into the
pool which results in their page_count dropping to zero. However, surplus
pages can be directly utilized by the caller without first being freed to the
pool. Therefore, a call to put_page_testzero() is in order so that such a
page will be handed to the caller with a correct count.
This has not affected end users because the bad page count is reset before the
page is handed off. However, under CONFIG_DEBUG_VM this triggers a BUG when
the page count is validated.
Thanks go to Mel for first spotting this issue and providing an initial fix.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adam Litke noticed that currently we grow the hugepage pool independent of any
cpuset the running process may be in, but when shrinking the pool, the cpuset
is checked. This leads to inconsistency when shrinking the pool in a
restricted cpuset -- an administrator may have been able to grow the pool on a
node restricted by a containing cpuset, but they cannot shrink it there.
There are two options: either prevent growing of the pool outside of the
cpuset or allow shrinking outside of the cpuset. >From previous discussions
on linux-mm, /proc/sys/vm/nr_hugepages is an administrative interface that
should not be restricted by cpusets. So allow shrinking the pool by removing
pages from nodes outside of current's cpuset.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Irwin <wli@holomorphy.com>
Cc: Lee Schermerhorn <Lee.Schermerhonr@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A hugetlb reservation may be inadequately backed in the event of racing
allocations and frees when utilizing surplus huge pages. Consider the
following series of events in processes A and B:
A) Allocates some surplus pages to satisfy a reservation
B) Frees some huge pages
A) A notices the extra free pages and drops hugetlb_lock to free some of
its surplus pages back to the buddy allocator.
B) Allocates some huge pages
A) Reacquires hugetlb_lock and returns from gather_surplus_huge_pages()
Avoid this by commiting the reservation after pages have been allocated but
before dropping the lock to free excess pages. For parity, release the
reservation in return_unused_surplus_pages().
This patch also corrects the cpuset_mems_nr() error path in
hugetlb_acct_memory(). If the cpuset check fails, uncommit the
reservation, but also be sure to return any surplus huge pages that may
have been allocated to back the failed reservation.
Thanks to Andy Whitcroft for discovering this.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we free a page via free_huge_page and we detect that we are in surplus
the page will be returned to the buddy. After this we no longer own the page.
However at the end free_huge_page we clear out our mapping pointer from
page private. Even where the page is not a surplus we free the page to
the hugepage pool, drop the pool locks and then clear page private. In
either case the page may have been reallocated. BAD.
Make sure we clear out page private before we free the page.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_doulongvec_minmax() calls copy_to_user()/copy_from_user(), so we can't
hold hugetlb_lock over the call. Use a dummy variable to store the sysctl
result, like in hugetlb_sysctl_handler(), then grab the lock to update
nr_overcommit_huge_pages.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Reported-by: Miles Lane <miles.lane@gmail.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When I replaced hugetlb_dynamic_pool with nr_overcommit_hugepages I used
proc_doulongvec_minmax() directly. However, hugetlb.c's locking rules
require that all counter modifications occur under the hugetlb_lock. Add a
callback into the hugetlb code similar to the one for nr_hugepages. Grab
the lock around the manipulation of nr_overcommit_hugepages in
proc_doulongvec_minmax().
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After running SetPageUptodate, preceeding stores to the page contents to
actually bring it uptodate may not be ordered with the store to set the
page uptodate.
Therefore, another CPU which checks PageUptodate is true, then reads the
page contents can get stale data.
Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
PageUptodate.
Many places that test PageUptodate, do so with the page locked, and this
would be enough to ensure memory ordering in those places if
SetPageUptodate were only called while the page is locked. Unfortunately
that is not always the case for some filesystems, but it could be an idea
for the future.
Also bring the handling of anonymous page uptodateness in line with that of
file backed page management, by marking anon pages as uptodate when they
_are_ uptodate, rather than when our implementation requires that they be
marked as such. Doing allows us to get rid of the smp_wmb's in the page
copying functions, which were especially added for anonymous pages for an
analogous memory ordering problem. Both file and anonymous pages are
handled with the same barriers.
FAQ:
Q. Why not do this in flush_dcache_page?
A. Firstly, flush_dcache_page handles only one side (the smb side) of the
ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
memory barriers in a completely unrelated function is nasty; at least in the
PageUptodate macros, they are located together with (half) the operations
involved in the ordering. Thirdly, the smp_wmb is only required when first
bringing the page uptodate, wheras flush_dcache_page should be called each time
it is written to through the kernel mapping. It is logically the wrong place to
put it.
Q. Why does this increase my text size / reduce my performance / etc.
A. Because it is adding the necessary instructions to eliminate the data-race.
Q. Can it be improved?
A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
run under the page lock, we could avoid the smp_rmb places where PageUptodate
is queried under the page lock. Requires audit of all filesystems and at least
some would need reworking. That's great you're interested, I'm eagerly awaiting
your patches.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The shared page table code for hugetlb memory on x86 and x86_64
is causing a leak. When a user of hugepages exits using this code
the system leaks some of the hugepages.
-------------------------------------------------------
Part of /proc/meminfo just before database startup:
HugePages_Total: 5500
HugePages_Free: 5500
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
Just before shutdown:
HugePages_Total: 5500
HugePages_Free: 4475
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
After shutdown:
HugePages_Total: 5500
HugePages_Free: 4988
HugePages_Rsvd:
0 Hugepagesize: 2048 kB
----------------------------------------------------------
The problem occurs durring a fork, in copy_hugetlb_page_range(). It
locates the dst_pte using huge_pte_alloc(). Since huge_pte_alloc() calls
huge_pmd_share() it will share the pmd page if can, yet the main loop in
copy_hugetlb_page_range() does a get_page() on every hugepage. This is a
violation of the shared hugepmd pagetable protocol and creates additional
referenced to the hugepages causing a leak when the unmap of the VMA
occurs. We can skip the entire replication of the ptes when the hugepage
pagetables are shared. The attached patch skips copying the ptes and the
get_page() calls if the hugetlbpage pagetable is shared.
[akpm@linux-foundation.org: coding-style cleanups]
Signed-off-by: Larry Woodman <lwoodman@redhat.com>
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Ken Chen <kenchen@google.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the error path of both shared and private hugetlb page allocation,
the file system quota is never undone, leading to fs quota leak. Fix
them up.
[akpm@linux-foundation.org: cleanup, micro-optimise]
Signed-off-by: Ken Chen <kenchen@google.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 54f9f80d65 ("hugetlb:
Add hugetlb_dynamic_pool sysctl")
Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
sysctl is not needed, as its semantics can be expressed by 0 in the
overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
(pool enabled).
(Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hugetlb: introduce nr_overcommit_hugepages sysctl
While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
became convinced that having a boolean sysctl was insufficient:
1) To support per-node control of hugepages, I have previously submitted
patches to add a sysfs attribute related to nr_hugepages. However, with
a boolean global value and per-mount quota enforcement constraining the
dynamic pool, adding corresponding control of the dynamic pool on a
per-node basis seems inconsistent to me.
2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
mount points is, arguably, more arduous than it needs to be. Each quota
would need to be set separately, and the sum would need to be monitored.
To ease the administration, and to help make the way for per-node
control of the static & dynamic hugepage pool, I added a separate
sysctl, nr_overcommit_hugepages. This value serves as a high watermark
for the overall hugepage pool, while nr_hugepages serves as a low
watermark. The boolean sysctl can then be removed, as the condition
nr_overcommit_hugepages > 0
indicates the same administrative setting as
hugetlb_dynamic_pool == 1
Quotas still serve as local enforcement of the size of the pool on a
per-mount basis.
A few caveats:
1) There is a race whereby the global surplus huge page counter is
incremented before a hugepage has allocated. Another process could then
try grow the pool, and fail to convert a surplus huge page to a normal
huge page and instead allocate a fresh huge page. I believe this is
benign, as no memory is leaked (the actual pages are still tracked
correctly) and the counters won't go out of sync.
2) Shrinking the static pool while a surplus is in effect will allow the
number of surplus huge pages to exceed the overcommit value. As long as
this condition holds, however, no more surplus huge pages will be
allowed on the system until one of the two sysctls are increased
sufficiently, or the surplus huge pages go out of use and are freed.
Successfully tested on x86_64 with the current libhugetlbfs snapshot,
modified to use the new sysctl.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The follow_hugetlb_page() fix I posted (merged as git commit
5b23dbe817) missed one case. If the pte is
present, but not writable and write access is requested by the caller to
get_user_pages(), the code will do the wrong thing. Rather than calling
hugetlb_fault to make the pte writable, it notes the presence of the pte
and continues.
This simple one-liner makes sure we also fault on the pte for this case.
Please apply.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Dave Kleikamp <shaggy@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a MAP_SHARED mmap of a hugetlbfs file succeeds, huge pages are reserved
to guarantee no problems will occur later when instantiating pages. If quotas
are in force, page instantiation could fail due to a race with another process
or an oversized (but approved) shared mapping.
To prevent these scenarios, debit the quota for the full reservation amount up
front and credit the unused quota when the reservation is released.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <hermes@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that quota is credited by free_huge_page(), calls to hugetlb_get_quota()
seem out of place. The alloc/free API is unbalanced because we handle the
hugetlb_put_quota() but expect the caller to open-code hugetlb_get_quota().
Move the get inside alloc_huge_page to clean up this disparity.
This patch has been kept apart from the previous patch because of the somewhat
dodgy ERR_PTR() use herein. Moving the quota logic means that
alloc_huge_page() has two failure modes. Quota failure must result in a
SIGBUS while a standard allocation failure is OOM. Unfortunately, ERR_PTR()
doesn't like the small positive errnos we have in VM_FAULT_* so they must be
negated before they are used.
Does anyone take issue with the way I am using PTR_ERR. If so, what are your
thoughts on how to clean this up (without needing an if,else if,else block at
each alloc_huge_page() callsite)?
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <hermes@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>