This patch separates the parts of hugetlbpage.c which are inherently
specific to the hash MMU into a new hugelbpage-hash64.c file.
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch simplifies the logic used to initialize hugepages on
powerpc. The somewhat oddly named set_huge_psize() is renamed to
add_huge_page_size() and now does all necessary verification of
whether it's given a valid hugepage sizes (instead of just some) and
instantiates the generic hstate structure (but no more).
hugetlbpage_init() now steps through the available pagesizes, checks
if they're valid for hugepages by calling add_huge_page_size() and
initializes the kmem_caches for the hugepage pagetables. This means
we can now eliminate the mmu_huge_psizes array, since we no longer
need to pass the sizing information for the pagetable caches from
set_huge_psize() into hugetlbpage_init()
Determination of the default huge page size is also moved from the
hash code into the general hugepage code.
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently each available hugepage size uses a slightly different
pagetable layout: that is, the bottem level table of pointers to
hugepages is a different size, and may branch off from the normal page
tables at a different level. Every hugepage aware path that needs to
walk the pagetables must therefore look up the hugepage size from the
slice info first, and work out the correct way to walk the pagetables
accordingly. Future hardware is likely to add more possible hugepage
sizes, more layout options and more mess.
This patch, therefore reworks the handling of hugepage pagetables to
reduce this complexity. In the new scheme, instead of having to
consult the slice mask, pagetable walking code can check a flag in the
PGD/PUD/PMD entries to see where to branch off to hugepage pagetables,
and the entry also contains the information (eseentially hugepage
shift) necessary to then interpret that table without recourse to the
slice mask. This scheme can be extended neatly to handle multiple
levels of self-describing "special" hugepage pagetables, although for
now we assume only one level exists.
This approach means that only the pagetable allocation path needs to
know how the pagetables should be set out. All other (hugepage)
pagetable walking paths can just interpret the structure as they go.
There already was a flag bit in PGD/PUD/PMD entries for hugepage
directory pointers, but it was only used for debug. We alter that
flag bit to instead be a 0 in the MSB to indicate a hugepage pagetable
pointer (normally it would be 1 since the pointer lies in the linear
mapping). This means that asm pagetable walking can test for (and
punt on) hugepage pointers with the same test that checks for
unpopulated page directory entries (beq becomes bge), since hugepage
pointers will always be positive, and normal pointers always negative.
While we're at it, we get rid of the confusing (and grep defeating)
#defining of hugepte_shift to be the same thing as mmu_huge_psizes.
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently we have a fair bit of rather fiddly code to manage the
various kmem_caches used to store page tables of various levels. We
generally have two caches holding some combination of PGD, PUD and PMD
tables, plus several more for the special hugepage pagetables.
This patch cleans this all up by taking a different approach. Rather
than the caches being designated as for PUDs or for hugeptes for 16M
pages, the caches are simply allocated to be a specific size. Thus
sharing of caches between different types/levels of pagetables happens
naturally. The pagetable size, where needed, is passed around encoded
in the same way as {PGD,PUD,PMD}_INDEX_SIZE; that is n where the
pagetable contains 2^n pointers.
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, hpte_need_flush() only correctly flushes the given address
for normal pages. Callers for hugepages are required to mask the
address themselves.
But hpte_need_flush() already looks up the page sizes for its own
reasons, so this is a rather silly imposition on the callers. This
patch alters it to mask based on the pagesize it has looked up itself,
and removes the awkward masking code in the hugepage caller.
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When running Active Memory Sharing, the Collaborative Memory Manager (CMM)
may mark some pages as "loaned" with the hypervisor. Periodically, the
CMM will query the hypervisor for a loan request, which is a single signed
value. When kexec'ing into a kdump kernel, the CMM driver in the kdump
kernel is not aware of the pages the previous kernel had marked as "loaned",
so the hypervisor and the CMM driver are out of sync. Fix the CMM driver
to handle this scenario by ignoring requests to decrease the number of loaned
pages if we don't think we have any pages loaned. Pages that are marked as
"loaned" which are not in the balloon will automatically get switched to "active"
the next time we touch the page. This also fixes the case where totalram_pages
is smaller than min_mem_mb, which can occur during kdump.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
get_irq_desc() is a powerpc-specific version of irq_to_desc(). That
is reason enough to remove it, but it also doesn't know about sparse
irq_desc support which irq_to_desc() does (when we enable it).
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Rather than open-coding our own check, use irq_has_action()
to check if an irq has an action - ie. is "in use".
irq_has_action() doesn't take the descriptor lock, but it
shouldn't matter - we're just using it as an indicator
that the irq is in use. disable_irq_nosync() will take
the descriptor lock before doing anything also.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The irq_desc array consumes quite a lot of space, and for systems
that don't need or can't have 512 irqs it's just wasted space.
The first 16 are reserved for ISA, so the minimum of 32 is really
16 - and no one has asked for more than 512 so leave that as the
maximum.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Linux power management subsystem supports vast amount of new PM
callbacks that are crucial for proper suspend and hibernation support
in drivers.
This patch implements support for dev_pm_ops, preserving support
for legacy callbacks.
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The CHRP code has some fishy timer based code to scan the RTAS event
log, which uses a 1KB stack buffer and doesn't even use the results.
The pSeries code as a nicer daemon that allows userspace to read the
event log and basically uses the same RTAS interface
This patch moves rtasd.c out of platform/pseries and makes it usable
by CHRP, after removing the old crufty event log mechanism in there.
The nvram logging part of the daemon is still only available on 64-bit
since the underlying nvram management routines aren't currently shared.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Some of the stuff in /proc/ppc64 such as the RTAS bits are actually
useful to some 32-bit platforms. Rename the file, and create a
symlink on 64-bit for backward compatibility
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Just as with kexec, hibernation may fail even on well-tested platforms:
some PCI device, a driver of which doesn't play well with hibernation,
is enough to break resuming.
Hibernation code is not much platform dependent, and hiding features only
because these were not verified on a particular hardware is
counterproductive: we just prevent the features from being widely tested.
For example, with this patch I just tested hibernation on a MPC83xx
board, and it works quite well, modulo a few drivers that need some
fixing.
So, let's make it possible to select hibernation support for all
PowerPCs, then let's wait for any possible bug reports, and actually fix
(or just collect ;-) the bugs instead of hiding them. If some platforms
really can't stand hibernation, we can make a blacklist, with proper
comments why exactly hibernation doesn't work, whether it is possible to
fix, and what needs to be done to fix it.
CONFIG_HIBERNATION is still =n by default, so the commit doesn't change
anything apart from ability to set it to =y.
I'm not sure if EXPERIMENTAL dependency is needed, I'd rather not add it
for a few reasons:
1) It doesn't matter much, for distro kernels user has no clue that some
feature is experimental. Majority of defconfigs enable EXPERIMENTAL
anyway (90 vs. 4, which, btw, means that EXPERIMENTAL is overused
in Kconfigs);
2) EXPERIMENTAL is a good thing for features that change default
behaviour of a kernel, while for hibernation user has to explicitly
issue 'echo disk > /sys/power/state' to trigger any hibernation bugs;
3) Per init/Kconfig, EXPERIMENTAL is a good thing to scare and discourage
users from 'widespread use of a feature', while we want to encourage
that use.
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The non-debug case in ps3/mm.c uses pr_debug(), so that the compiler
still does type checks etc. and doesn't complain about unused
variables in the non-debug case.
However with DEBUG=n and CONFIG_DYNAMIC_DEBUG=y there's still code
generated for those pr_debugs().
size before:
text data bss dec hex filename
17553 4112 88 21753 54f9 arch/powerpc/platforms/ps3/mm.o
size after:
text data bss dec hex filename
7377 776 88 8241 2031 arch/powerpc/platforms/ps3/mm.o
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Geoff Levand <geoffrey.levand@am.sony.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (43 commits)
net: Fix 'Re: PACKET_TX_RING: packet size is too long'
netdev: usb: dm9601.c can drive a device not supported yet, add support for it
qlge: Fix firmware mailbox command timeout.
qlge: Fix EEH handling.
AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl (v2)
bonding: fix a race condition in calls to slave MII ioctls
virtio-net: fix data corruption with OOM
sfc: Set ip_summed correctly for page buffers passed to GRO
cnic: Fix L2CTX_STATUSB_NUM offset in context memory.
MAINTAINERS: rt2x00 list is moderated
airo: Reorder tests, check bounds before element
mac80211: fix for incorrect sequence number on hostapd injected frames
libertas spi: fix sparse errors
mac80211: trivial: fix spelling in mesh_hwmp
cfg80211: sme: deauthenticate on assoc failure
mac80211: keep auth state when assoc fails
mac80211: fix ibss joining
b43: add 'struct b43_wl' missing declaration
b43: Fix Bugzilla #14181 and the bug from the previous 'fix'
rt2x00: Fix crypto in TX frame for rt2800usb
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
sched: move rq_weight data array out of .percpu
percpu: allow pcpu_alloc() to be called with IRQs off
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-param-fixes:
param: fix setting arrays of bool
param: fix NULL comparison on oom
param: fix lots of bugs with writing charp params from sysfs, by leaking mem.
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
virtio: order used ring after used index read
virtio-pci: fix per-vq MSI-X request logic
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
backing-dev: ensure that a removed bdi no longer has super_block referencing it
block: use after free bug in __blkdev_get
block: silently error unsupported empty barriers too