44 Commits

Author SHA1 Message Date
Ingo Molnar
00d1c5e057 x86: add gbpages switches
These new controls toggle experimental support for a new CPU feature,
the straightforward extension of largepages from the pmd level to the
pud level, which allows 1GB (kernel) TLBs instead of 2MB TLBs.

Turn it off by default, as this code has not been tested well enough yet.

The CONFIG_DIRECT_GBPAGES=y .config option or gbpages on the
boot line can be used to enable it. If it is enabled in the .config,
the nogbpages boot option disables it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-17 17:40:45 +02:00
Rob Landley
c66315e0a7 documentation: add Documentation/x86-64/00-INDEX
Signed-off-by: Rob Landley <rob@landley.net>
Cc: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-02-04 16:48:02 +01:00
Huang, Ying
8b2cb7a8f5 x86: 32-bit EFI runtime service support: fixes in sync with 64-bit support
support according to fixes of x86_64 support.

- Delete efi_rt_lock because it is used during system early boot,
  before SMP is initialized.

- Change local_flush_tlb() to __flush_tlb_all() to flush global page
  mapping.

- Clean up includes.

- Revise Kconfig description.

- Enable noefi kernel parameter on i386.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:32:11 +01:00
Huang, Ying
9ad65e4748 x86: EFI runtime service support: document for EFI runtime services
This patch adds documentation for EFI x86_64 runtime services support.

Signed-off-by: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:31:19 +01:00
Aaron Durbin
fa20efd2fc x86: add ACPI reboot option
Add the ability to reboot an x86_64 based machine using the RESET_REG in the
FADT ACPI table.

Signed-off-by: Aaron Durbin <adurbin@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:31:17 +01:00
Huang, Ying
164e217cec x86_64 EFI boot support: EFI boot document
This patch adds documentation for EFI x86_64 boot support.  The setup and
operation guide for EFI based systems is in Documentation/x86_64/uefi.txt.

Signed-off-by: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-29 09:24:54 -08:00
Christoph Lameter
0889eba5b3 x86_64: SPARSEMEM_VMEMMAP 2M page size support
x86_64 uses 2M page table entries to map its 1-1 kernel space.  We also
implement the virtual memmap using 2M page table entries.  So there is no
additional runtime overhead over FLATMEM, although initialisation is slightly more
complex.  As FLATMEM still references memory to obtain the mem_map pointer and
SPARSEMEM_VMEMMAP uses a compile time constant, SPARSEMEM_VMEMMAP should be
superior.

With this SPARSEMEM becomes the most efficient way of handling virt_to_page,
pfn_to_page and friends for UP, SMP and NUMA on x86_64.
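
The difference between the two lookups can be sketched in user space as
follows.  This is only an illustration under stated assumptions: the
struct page layout, the 1024-entry array standing in for the virtual
memmap, and all function names are hypothetical, not the kernel's code.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's per-page descriptor. */
struct page {
	unsigned long flags;
};

/* Stand-in for the virtually contiguous memmap; its base is a constant. */
struct page vmemmap_area[1024];
#define VMEMMAP_BASE (vmemmap_area)

/* FLATMEM style: the base pointer is a variable read from memory. */
struct page *mem_map;

struct page *flat_pfn_to_page(unsigned long pfn)
{
	return mem_map + pfn;		/* one memory load, then an add */
}

/* VMEMMAP style: the base is fixed, so only the add remains. */
struct page *vmemmap_pfn_to_page(unsigned long pfn)
{
	return VMEMMAP_BASE + pfn;
}

/* The reverse lookup is a pointer subtraction from the same base. */
unsigned long vmemmap_page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - VMEMMAP_BASE);
}
```

Both variants return the same descriptor when mem_map points at the
same array; the VMEMMAP form simply avoids reading the base pointer.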

[apw@shadowen.org: code resplit, style fixups]
[apw@shadowen.org: vmemmap x86_64: ensure end of section memmap is initialised]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Tim Hockin
bd78432c8f x86_64: mcelog tolerant level cleanup
Background:
 The MCE handler has several paths that it can take, depending on various
 conditions of the MCE status and the value of the 'tolerant' knob.  The
 exact semantics are not well defined and the code is a bit twisty.

Description:
 This patch makes the MCE handler's behavior more clear by documenting the
 behavior for various 'tolerant' levels.  It also fixes or enhances
 several small things in the handler.  Specifically:
     * If RIPV is not set it is not safe to restart, so set the 'no way out'
       flag rather than the 'kill it' flag.
     * Don't panic() on correctable MCEs.
     * If the _OVER bit is set *and* the _UC bit is set (meaning possibly
       dropped uncorrected errors), set the 'no way out' flag.
     * Use EIPV for testing whether an app can be killed (SIGBUS) rather
       than RIPV.  According to docs, EIPV indicates that the error is
       related to the IP, while RIPV simply means the IP is valid to
       restart from.
     * Don't clear the MCi_STATUS registers until after the panic() path.
       This leaves the status bits set after the panic() so clever BIOSes
       can find them (and dumb BIOSes can do nothing).

 This patch also calls nonseekable_open() in mce_open (as suggested by akpm).

Result:
 Tolerant levels behave almost identically to how they always have, but
 now it's well defined.  There's a slightly higher chance of panic()ing
 when multiple errors happen (a good thing, IMHO).  If you take an MBE and
 panic(), the error status bits are not cleared.

Alternatives:
 None.

Testing:
 I used software to inject correctable and uncorrectable errors.  With
 tolerant = 3, the system usually survives.  With tolerant = 2, the system
 usually panic()s (PCC) but not always.  With tolerant = 1, the system
 always panic()s.  When the system panic()s, the BIOS is able to detect
 that the cause of death was an MC4.  I was not able to reproduce the
 case of a non-PCC error in userspace, with EIPV, with (tolerant < 3).
 That will be rare at best.

Signed-off-by: Tim Hockin <thockin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-21 18:37:10 -07:00
Jan Beulich
d567b6a955 x86_64: remove unused variable maxcpus
.. and adjust documentation to properly reflect options that are
x86-64 specific.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-21 18:37:09 -07:00
Tim Hockin
8a336b0a4b [PATCH] x86-64: Dynamically adjust machine check interval
Background:
 We've found that MCEs (specifically DRAM SBEs) tend to come in bunches,
 especially when we are trying really hard to stress the system out.  The
 current MCE poller uses a static interval which does not care whether it
 has or has not found MCEs recently.

Description:
 This patch makes the MCE poller adjust the polling interval dynamically.
 If we find an MCE, poll 2x faster (down to 10 ms).  When we stop finding
 MCEs, poll 2x slower (up to check_interval seconds).  The check_interval
 tunable becomes the max polling interval.  The "Machine check events
 logged" printk() is rate limited to the check_interval, which should be
 identical behavior to the old functionality.
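
The adjustment rule described above could look roughly like the sketch
below.  The function and macro names are hypothetical and the interval
is kept in milliseconds for simplicity; only the 10 ms floor, the
check_interval ceiling, and the halving and doubling come from the
description.

```c
#include <assert.h>
#include <stdbool.h>

/* Lower bound on the polling interval, per the description (10 ms). */
#define MIN_INTERVAL_MS 10UL

/*
 * Return the next polling interval in milliseconds.
 *   cur_ms    - interval used for the poll that just finished
 *   found_mce - whether that poll found at least one event
 *   max_ms    - the check_interval tunable, converted to milliseconds
 */
unsigned long next_poll_interval_ms(unsigned long cur_ms, bool found_mce,
				    unsigned long max_ms)
{
	assert(max_ms >= MIN_INTERVAL_MS);

	if (found_mce) {
		cur_ms /= 2;			/* poll 2x faster */
		if (cur_ms < MIN_INTERVAL_MS)
			cur_ms = MIN_INTERVAL_MS;
	} else {
		if (cur_ms > max_ms / 2)	/* doubling would pass the cap */
			cur_ms = max_ms;
		else
			cur_ms *= 2;		/* poll 2x slower */
	}
	return cur_ms;
}
```

With a 5 minute cap, a burst of errors takes the interval from 300000 ms
toward 10 ms in a handful of polls, and a quiet period walks it back up.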

Result:
 If you start to take a lot of correctable errors (not exceptions), you
 log them faster and more accurately (less chance of overflowing the MCA
 registers).  If you don't take a lot of errors, you will see no change.

Alternatives:
 I considered simply reducing the polling interval to 10 ms immediately
 and keeping it there as long as we continue to find errors.  This felt a
 bit heavy handed, but does perform significantly better for the default
 check_interval of 5 minutes (we're using a few seconds when testing for
 DRAM errors).  I could be convinced to go with this, if anyone felt it
 was not too aggressive.

Testing:
 I used an error-injecting DIMM to create lots of correctable DRAM errors
 and verified that the polling interval accelerates.  The printk() only
 happens once per check_interval seconds.

Patch:
 This patch is against 2.6.21-rc7.

Signed-off-by: Tim Hockin <thockin@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:19 +02:00
David Rientjes
20280195f2 [PATCH] x86-64: fake numa for cpusets document
Create a document to explain how to use numa=fake in conjunction with cpusets
for coarse memory resource management.

An attempt to get more awareness and testing for this feature.

Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:09 +02:00
David Rientjes
382591d500 [PATCH] x86-64: fixed size remaining fake nodes
Extends the numa=fake x86_64 command-line option to split the remaining system
memory into nodes of fixed size.  Any leftover memory is allocated to a final
node unless the command-line ends with a comma.

For example:
  numa=fake=2*512,*128	gives two 512M nodes and the remaining system
			memory is split into nodes of 128M each.

This is beneficial for systems where the exact size of RAM is unknown or not
necessarily relevant, but the size of the remaining nodes to be allocated is
known based on their capacity for resource management.
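
As a rough illustration of the node count this produces, assuming sizes
are tracked in whole megabytes (the real code works on available pages,
and the function name here is hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Number of fake nodes created from the remaining memory when the
 * option ends in "*SIZE".
 *   remaining_mb   - memory left after the explicitly sized nodes
 *   node_mb        - the fixed node size
 *   trailing_comma - true if the command line ended with a comma, in
 *                    which case leftover memory gets no node of its own
 */
unsigned long remaining_fixed_nodes(unsigned long remaining_mb,
				    unsigned long node_mb,
				    bool trailing_comma)
{
	unsigned long nodes;

	assert(node_mb > 0);

	nodes = remaining_mb / node_mb;		/* full-sized nodes */
	if (remaining_mb % node_mb != 0 && !trailing_comma)
		nodes++;			/* final, smaller node */
	return nodes;
}
```

For instance, 1000M of remaining memory with *128 gives seven full nodes
plus one smaller final node, or seven nodes if the option ends with a
comma.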

Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:09 +02:00
David Rientjes
14694d736b [PATCH] x86-64: split remaining fake nodes equally
Extends the numa=fake x86_64 command-line option to split the remaining
system memory into equal-sized nodes.

For example:
  numa=fake=2*512,4*	gives two 512M nodes and the remaining system
			memory is split into four approximately equal
			chunks.

This is beneficial for systems where the exact size of RAM is unknown or not
necessarily relevant, but the granularity with which nodes shall be allocated
is known.

Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:09 +02:00
David Rientjes
8b8ca80e19 [PATCH] x86-64: configurable fake numa node sizes
Extends the numa=fake x86_64 command-line option to allow for configurable
node sizes.  These nodes can be used in conjunction with cpusets for coarse
memory resource management.

The old command-line option is still supported:
  numa=fake=32	gives 32 fake NUMA nodes, ignoring the NUMA setup of the
		actual machine.

But now you may configure your system for the node sizes of your choice:
  numa=fake=2*512,1024,2*256
		gives two 512M nodes, one 1024M node, two 256M nodes, and
		the rest of system memory to a sixth node.
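
A minimal sketch of how such a size list might be parsed is shown below.
It is not the kernel's parser: the function name is hypothetical, it
handles only the "N*SIZE" and "SIZE" terms, it rejects a trailing comma,
and it does not special-case the old single-number form (numa=fake=32),
which a caller would have to detect separately.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Parse a value such as "2*512,1024,2*256" into node sizes in MB.
 * Returns the number of sizes stored, or -1 on a syntax error or if
 * more than max_nodes entries would be needed.
 */
int parse_fake_nodes(const char *s, unsigned long *sizes_mb, size_t max_nodes)
{
	size_t n = 0;

	while (*s) {
		char *end;
		unsigned long count = 1;
		unsigned long val;

		if (*s < '0' || *s > '9')
			return -1;
		val = strtoul(s, &end, 10);

		if (*end == '*') {		/* "N*SIZE": val was the count */
			count = val;
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -1;
			val = strtoul(s, &end, 10);
		}
		if (val == 0 || count == 0)
			return -1;

		while (count--) {		/* emit one entry per node */
			if (n == max_nodes)
				return -1;
			sizes_mb[n++] = val;
		}

		if (*end == ',') {
			s = end + 1;
			if (*s == '\0')		/* trailing comma not handled */
				return -1;
		} else if (*end == '\0') {
			s = end;
		} else {
			return -1;
		}
	}
	return (int)n;
}
```

Parsing "2*512,1024,2*256" this way yields the five explicit sizes; the
sixth node holding the rest of memory would be added by the caller.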

The existing hash function is maintained to support the various node sizes
that are possible with this implementation.

Each node of the same size receives roughly the same amount of available
pages, regardless of any reserved memory within its address range.  The total
available pages on the system is calculated and divided by the number of equal
nodes to allocate.  These nodes are then dynamically allocated and their
borders extended until such time as their number of available pages reaches
the required size.

Configurable node sizes are recommended when used in conjunction with cpusets
for memory control because it eliminates the overhead associated with scanning
the zonelists of many smaller full nodes on page_alloc().

Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:09 +02:00
Andi Kleen
9ce883becb [PATCH] x86: Remove noreplacement option
noreplacement is dangerous on modern systems because it will not replace the
context switch FNSAVE with SSE aware FXSAVE. But other places in the kernel still assume
SSE and do FXSAVE and the CPU will then access FXSAVE information with
FNSAVE and cause corruption.

Easiest way to avoid this is to remove the option. It was mostly for paranoia
reasons anyways and alternative()s have been stable for some time.

Thanks to Jeremy F. for reporting and helping debug it.

Signed-off-by: Andi Kleen <ak@suse.de>
2007-04-24 13:05:37 +02:00
Andi Kleen
a98f0dd34d [PATCH] x86-64: Allow to run a program when a machine check event is detected
When a machine check event is detected (including an AMD RevF threshold
overflow event), allow a "trigger" program to be run. This allows user space
to react to such events sooner.

The trigger is configured using a new trigger entry in the
machinecheck sysfs interface. It is currently shared between
all CPUs.

I also fixed the AMD threshold handler to run the machine
check polling code immediately to actually log any events
that might have caused the threshold interrupt.

Also added some documentation for the mce sysfs interface.

Signed-off-by: Andi Kleen <ak@suse.de>
2007-02-13 13:26:23 +01:00
Randy Dunlap
57d307720c [PATCH] x86-64: cleanup Doc/x86_64/ files
Fix typos.
Lots of whitespace changes for readability and consistency.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-02-13 13:26:23 +01:00
Karsten Weiss
5558870bfb [PATCH] x86-64: improved iommu documentation
- add SWIOTLB config help text
- mention Documentation/x86_64/boot-options.txt in
  Documentation/kernel-parameters.txt
- remove the duplication of the iommu kernel parameter documentation.
- Better explanation of some of the iommu kernel parameter options.
- "32MB<<order" instead of "32MB^order".
- Mention the default "order" value.
- list the four existing PCI-DMA mapping implementations of arch x86_64
- group the iommu= option keywords by PCI-DMA mapping implementation.
- Distinguish iommu= option keywords from number arguments.
- Explain the meaning of DAC and SAC.
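
The notation fix matters because the size is a shift, not a power.  A
small sketch, with a hypothetical function name, of what "32MB<<order"
evaluates to in bytes:

```c
#include <stdint.h>

/*
 * Size in bytes for a given "order", following the documented
 * "32MB<<order" formula.  Each increment of order doubles the size.
 * Orders large enough to overflow 64 bits are not meaningful here.
 */
uint64_t size_for_order_bytes(unsigned int order)
{
	return (UINT64_C(32) << 20) << order;
}
```

Order 0 is 32MB, order 1 is 64MB, order 3 is 256MB.  Read as C, the old
"32MB^order" spelling would be an XOR, which is not what was meant.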

Signed-off-by: Karsten Weiss <knweiss@science-computing.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
2007-02-13 13:26:21 +01:00
Linus Torvalds
fea5f1e196 Revert "[PATCH] x86-64: Try multiple timer variants in check_timer"
This reverts commit b026872601, which has
been linked to several problem reports with IO-APIC and the timer.
Machines either don't boot because the timer doesn't happen, or we get
double timer interrupts because we end up double-routing the timer irq
through multiple interfaces.

See for example

	http://lkml.org/lkml/2006/12/16/101
	http://lkml.org/lkml/2007/1/3/9
	http://bugzilla.kernel.org/show_bug.cgi?id=7789

about some of the discussion.

Patches to fix this cleanup exist (and have been confirmed to work fine
at least for some of the affected cases) and we'll revisit it for
2.6.21, but this late in the -rc series we're better off just reverting
the incomplete commit that caused the problems.

Suggested-by: Adrian Bunk <bunk@stusta.de>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Yinghai Lu <yinghai.lu@amd.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-08 15:04:46 -08:00
Muli Ben-Yehuda
bff6547bb6 [PATCH] Calgary: allow compiling Calgary in but not using it by default
This patch makes it possible to compile Calgary in but not use it by
default. In this mode, use 'iommu=calgary' to activate it.

Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:07 +01:00
Andi Kleen
b026872601 [PATCH] x86-64: Try multiple timer variants in check_timer
Instead of adding all kinds of more quirks try various timer
routing variants in check_timer.

In particular this tries to handle quirks from:
- Nvidia NF2-4 reference BIOS: wrong timer override
- Asus: Wrong timer override but no HPET table
- ATI: require timer disabled in 8259
- Some boards: require timer enabled in 8259

We just try many of the known variants in the hopefully right order
in check_timer.

Trying pin 0/2 on Nvidia suggested by Tim Hockin.

TBD Experimental. Needs a lot of testing

Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:06 +01:00
Matt LaPlante
d6bc8ac9e1 Fix typos in Documentation/: 'Q'-'R'
This patch fixes typos in various Documentation txts. The patch addresses
some words starting with the letters 'Q'-'R'.

Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-10-03 22:54:15 +02:00
Andi Kleen
d802ab981d [PATCH] Document iommu=panic
Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-30 01:47:55 +02:00
Andi Kleen
ece6684012 [PATCH] Allow disabling DAC using command line options
Might or might not work around some reported bugs on VIA systems.

Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-30 01:47:55 +02:00
Andi Kleen
352f7bae81 [PATCH] Add stack documentation document from Keith Owens
Describes the stack organization on x86-64.

I changed it a bit and removed some obsolete information and the
questions.

Cc: kaos@sgi.com
Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:31 +02:00