Commit Graph

161844 Commits

Author SHA1 Message Date
Andi Kleen
4db96cf077 HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
This allows processes to override their early/late kill
behaviour on hardware memory errors.

Typically applications which are memory error aware is
better of with early kill (see the error as soon
as possible), all others with late kill (only
see the error when the error is really impacting execution)

There's a global sysctl, but this way an application
can set its specific policy.

We're using two bits, one to signify that the process
stated its intention and that

I also made the prctl future proof by enforcing
the unused arguments are 0.

The state is inherited to children.

Note this makes us officially run out of process flags
on 32bit, but the next patch can easily add another field.

Manpage patch will be supplied separately.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:14 +02:00
Wu Fengguang
6746aff74d HWPOISON: shmem: call set_page_dirty() with locked page
The dirtying of page and set_page_dirty() can be moved into the page lock.

- In shmem_write_end(), the page was dirtied while the page lock was held,
  but it's being marked dirty just after dropping the page lock.
- In shmem_symlink(), both dirtying and marking can be moved into page lock.

It's valuable for the hwpoison code to know whether one bad page can be dropped
without losing data. It mainly judges by testing the PG_dirty bit after taking
the page lock. So it becomes important that the dirtying of page and the
marking of dirtiness are both done inside the page lock. Which is a common
practice, but sadly not a rule.

The noticeable exceptions are
- mapped pages
- pages with buffer_heads
The above pages could go dirty at any time. Fortunately the hwpoison will
unmap the page and release the buffer_heads beforehand anyway.

Many other types of pages (eg. metadata pages) can also be dirtied at will by
their owners, the hwpoison code cannot do meaningful things to them anyway.
Only the dirtiness of pagecache pages owned by regular files are interested.

v2: AK: Add comment about set_page_dirty rules (suggested by Peter Zijlstra)

Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:14 +02:00
Andi Kleen
2571873621 HWPOISON: Define a new error_remove_page address space op for async truncation
Truncating metadata pages is not safe right now before
we haven't audited all file systems.

To enable truncation only for data address space define
a new address_space callback error_remove_page.

This is used for memory_failure.c memory error handling.

This can be then set to truncate_inode_page()

This patch just defines the new operation and adds documentation.

Callers and users come in followon patches.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:13 +02:00
Wu Fengguang
83f786680a HWPOISON: Add invalidate_inode_page
Add a simple way to invalidate a single page
This is just a refactoring of the truncate.c code.
Originally from Fengguang, modified by Andi Kleen.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:13 +02:00
Nick Piggin
750b4987b0 HWPOISON: Refactor truncate to allow direct truncating of page v2
Extract out truncate_inode_page() out of the truncate path so that
it can be used by memory-failure.c

[AK: description, headers, fix typos]
v2: Some white space changes from Fengguang Wu

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:12 +02:00
Wu Fengguang
2a7684a23e HWPOISON: check and isolate corrupted free pages v2
If memory corruption hits the free buddy pages, we can safely ignore them.
No one will access them until page allocation time, then prep_new_page()
will automatically check and isolate PG_hwpoison page for us (for 0-order
allocation).

This patch expands prep_new_page() to check every component page in a high
order page allocation, in order to completely stop PG_hwpoison pages from
being recirculated.

Note that the common case -- only allocating a single page, doesn't
do any more work than before. Allocating > order 0 does a bit more work,
but that's relatively uncommon.

This simple implementation may drop some innocent neighbor pages, hopefully
it is not a big problem because the event should be rare enough.

This patch adds some runtime costs to high order page users.

[AK: Improved description]

v2: Andi Kleen:
Port to -mm code
Move check into separate function.
Don't dump stack in bad_pages for hwpoisoned pages.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:12 +02:00
Andi Kleen
888b9f7c58 HWPOISON: Handle hardware poisoned pages in try_to_unmap
When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.

v2: add a new flag to not do that (needed for the memory-failure handler
later) (Fengguang)
v3: remove unnecessary is_migration_entry() test (Fengguang, Minchan)

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:11 +02:00
Andi Kleen
14fa31b89c HWPOISON: Use bitmask/action code for try_to_unmap behaviour
try_to_unmap currently has multiple modi (migration, munlock, normal unmap)
which are selected by magic flag variables. The logic is not very straight
forward, because each of these flag change multiple behaviours (e.g.
migration turns off aging, not only sets up migration ptes etc.)
Also the different flags interact in magic ways.

A later patch in this series adds another mode to try_to_unmap, so
this becomes quickly unmanageable.

Replace the different flags with a action code (migration, munlock, munmap)
and some additional flags as modifiers (ignore mlock, ignore aging).
This makes the logic more straight forward and allows easier extension
to new behaviours. Change all the caller to declare what they want to
do.

This patch is supposed to be a nop in behaviour. If anyone can prove
it is not that would be a bug.

Cc: Lee.Schermerhorn@hp.com
Cc: npiggin@suse.de

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:10 +02:00
Andi Kleen
a6e04aa929 HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
Add VM_FAULT_HWPOISON handling to the x86 page fault handler. This is
very similar to VM_FAULT_OOM, the only difference is that a different
si_code is passed to user space and the new addr_lsb field is initialized.

v2: Make the printk more verbose/unique

Cc: x86@kernel.org

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:09 +02:00
Andi Kleen
a3b947eacf HWPOISON: Add poison check to page fault handling
Bail out early when hardware poisoned pages are found in page fault handling.
Since they are poisoned they should not be mapped freshly into processes,
because that would cause another (potentially deadly) machine check

This is generally handled in the same way as OOM, just a different
error code is returned to the architecture code.

v2: Do a page unlock if needed (Fengguang Wu)

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:08 +02:00
Andi Kleen
d1737fdbec HWPOISON: Add basic support for poisoned pages in fault handler v3
- Add a new VM_FAULT_HWPOISON error code to handle_mm_fault. Right now
architectures have to explicitely enable poison page support, so
this is forward compatible to all architectures. They only need
to add it when they enable poison page support.
- Add poison page handling in swap in fault code

v2: Add missing delayacct_clear_flag (Hidehiro Kawai)
v3: Really use delayacct_clear_flag (Hidehiro Kawai)

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:06 +02:00
Andi Kleen
ad5fa91399 HWPOISON: Add new SIGBUS error codes for hardware poison signals
Add new SIGBUS codes for reporting machine checks as signals. When
the hardware detects an uncorrected ECC error it can trigger these
signals.

This is needed for telling KVM's qemu about machine checks that happen to
guests, so that it can inject them, but might be also useful for other programs.
I find it useful in my test programs.

This patch merely defines the new types.

- Define two new si_codes for SIGBUS.  BUS_MCEERR_AO and BUS_MCEERR_AR
* BUS_MCEERR_AO is for "Action Optional" machine checks, which means that some
corruption has been detected in the background, but nothing has been consumed
so far. The program can ignore those if it wants (but most programs would
already get killed)
* BUS_MCEERR_AR is for "Action Required" machine checks. This happens
when corrupted data is consumed or the application ran into an area
which has been known to be corrupted earlier. These require immediate
action and cannot just returned to. Most programs would kill themselves.
- They report the address of the corruption in the user address space
in si_addr.
- Define a new si_addr_lsb field that reports the extent of the corruption
to user space. That's currently always a (small) page. The user application
cannot tell where in this page the corruption happened.

AK: I plan to write a man page update before anyone asks.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:06 +02:00
Andi Kleen
a7420aa54d HWPOISON: Add support for poison swap entries v2
Memory migration uses special swap entry types to trigger special actions on
page faults. Extend this mechanism to also support poisoned swap entries, to
trigger poison handling on page faults. This allows follow-on patches to
prevent processes from faulting in poisoned pages again.

v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
v3: Better overflow fix (Hidehiro Kawai)

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:05 +02:00
Andi Kleen
10be22dfe1 HWPOISON: Export some rmap vma locking to outside world
Needed for later patch that walks rmap entries on its own.

This used to be very frowned upon, but memory-failure.c does
some rather specialized rmap walking and rmap has been stable
for quite some time, so I think it's ok now to export it.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:04 +02:00
Andi Kleen
d466f2fcb3 HWPOISON: Add page flag for poisoned pages
Hardware poisoned pages need special handling in the VM and shouldn't be
touched again. This requires a new page flag. Define it here.

The page flags wars seem to be over, so it shouldn't be a problem
to get a new one.

v2: Add TestSetHWPoison (suggested by Johannes Weiner)

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:03 +02:00
Linus Torvalds
0cb583fd28 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide-next-2.6:
  ide: fixup for fujitsu disk
  ide: convert to ->proc_fops
  at91_ide: remove headers specific for at91sam9263
  IDE: palm_bk3710: convert clock usage after clkdev conversion
  ide: fix races in handling of user-space SET XFER commands
  ide: allow ide_dev_read_id() to be called from the IRQ context
  ide: ide-taskfile.c fix style problems
  drivers/ide/ide-cd.c: Use DIV_ROUND_CLOSEST
  ide-tape: fix handling of postponed rqs
  ide-tape: convert to ide_debug_log macro
  ide-tape: fix debug call
  ide: Fix annoying warning in ide_pio_bytes().
  IDE: Save a call to PageHighMem()
2009-09-15 10:01:16 -07:00
Linus Torvalds
723e9db7a4 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (134 commits)
  powerpc/nvram: Enable use Generic NVRAM driver for different size chips
  powerpc/iseries: Fix oops reading from /proc/iSeries/mf/*/cmdline
  powerpc/ps3: Workaround for flash memory I/O error
  powerpc/booke: Don't set DABR on 64-bit BookE, use DAC1 instead
  powerpc/perf_counters: Reduce stack usage of power_check_constraints
  powerpc: Fix bug where perf_counters breaks oprofile
  powerpc/85xx: Fix SMP compile error and allow NULL for smp_ops
  powerpc/irq: Improve nanodoc
  powerpc: Fix some late PowerMac G5 with PCIe ATI graphics
  powerpc/fsl-booke: Use HW PTE format if CONFIG_PTE_64BIT
  powerpc/book3e: Add missing page sizes
  powerpc/pseries: Fix to handle slb resize across migration
  powerpc/powermac: Thermal control turns system off too eagerly
  powerpc/pci: Merge ppc32 and ppc64 versions of phb_scan()
  powerpc/405ex: support cuImage via included dtb
  powerpc/405ex: provide necessary fixup function to support cuImage
  powerpc/40x: Add support for the ESTeem 195E (PPC405EP) SBC
  powerpc/44x: Add Eiger AMCC (AppliedMicro) PPC460SX evaluation board support.
  powerpc/44x: Update Arches defconfig
  powerpc/44x: Update Arches dts
  ...

Fix up conflicts in drivers/char/agp/uninorth-agp.c
2009-09-15 09:51:09 -07:00
Linus Torvalds
ada3fa1505 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
  powerpc64: convert to dynamic percpu allocator
  sparc64: use embedding percpu first chunk allocator
  percpu: kill lpage first chunk allocator
  x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
  percpu: update embedding first chunk allocator to handle sparse units
  percpu: use group information to allocate vmap areas sparsely
  vmalloc: implement pcpu_get_vm_areas()
  vmalloc: separate out insert_vmalloc_vm()
  percpu: add chunk->base_addr
  percpu: add pcpu_unit_offsets[]
  percpu: introduce pcpu_alloc_info and pcpu_group_info
  percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
  percpu: add @align to pcpu_fc_alloc_fn_t
  percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
  percpu: drop @static_size from first chunk allocators
  percpu: generalize first chunk allocator selection
  percpu: build first chunk allocators selectively
  percpu: rename 4k first chunk allocator to page
  percpu: improve boot messages
  percpu: fix pcpu_reclaim() locking
  ...

Fix trivial conflict as by Tejun Heo in kernel/sched.c
2009-09-15 09:39:44 -07:00
Nicolas Pitre
2f82af08fc Nicolas Pitre has a new email address
Due to problems at cam.org, my nico@cam.org email address is no longer
valid.  FRom now on, nico@fluxnic.net should be used instead.

Signed-off-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-15 09:37:12 -07:00
Linus Torvalds
f199fd9906 Merge branch 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf_counter: Fix buffer overflow in perf_copy_attr()
2009-09-15 09:34:27 -07:00
Linus Torvalds
043fe50f80 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6: (213 commits)
  V4L/DVB (12720): em28xx-cards: Add vendor/product id for Kworld DVD Maker 2
  V4L/DVB (12713): em28xx: Cleanups at ir_i2c handler
  V4L/DVB (12712): em28xx: properly load ir-kbd-i2c when needed
  V4L/DVB (12701): saa7134: ir-kbd-i2c init data needs a persistent object
  V4L/DVB (12699): cx18: ir-kbd-i2c initialization data should point to a persistent object
  V4L/DVB (12698): em28xx: ir-kbd-i2c init data needs a persistent object
  V4L/DVB (12707): gspca - sn9c20x: Add SXGA support to MT9M111
  V4L/DVB (12706): gspca - sn9c20x: disable exposure/gain controls for MT9M111 sensors.
  V4L/DVB (12705): gspca - sn9c20x: Add SXGA support to SOI968
  V4L/DVB (12703): gspca - sn9c20x: Reduces size of object
  V4L/DVB (12704): gspca - sn9c20x: Fix exposure on SOI968 sensors
  V4L/DVB (12696): gspca - sonixj / sn9c102: Two drivers for 0c45:60fc and 0c45:613e.
  V4L/DVB (12695): gspca - vc032x: Do the LED work with the sensor hv7131r.
  V4L/DVB (12694): gspca - vc032x: Change the start exchanges of the sensor hv7131r.
  V4L/DVB (12693): gspca - sunplus: The brightness is signed.
  V4L/DVB (12692): gspca - sunplus: Optimize code.
  V4L/DVB (12691): gspca - sonixj: Don't use mdelay().
  V4L/DVB (12690): gspca - pac7311: Webcam 06f8:3009 added.
  V4L/DVB (12686): dvb-core: check supported QAM modulations
  V4L/DVB (12685): dvb-core: check fe->ops.set_frontend return value
  ...
2009-09-15 09:22:18 -07:00
Linus Torvalds
227423904c Merge branch 'x86-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, pat: Fix cacheflush address in change_page_attr_set_clr()
  mm: remove !NUMA condition from PAGEFLAGS_EXTENDED condition set
  x86: Fix earlyprintk=dbgp for machines without NX
  x86, pat: Sanity check remap_pfn_range for RAM region
  x86, pat: Lookup the protection from memtype list on vm_insert_pfn()
  x86, pat: Add lookup_memtype to get the current memtype of a paddr
  x86, pat: Use page flags to track memtypes of RAM pages
  x86, pat: Generalize the use of page flag PG_uncached
  x86, pat: Add rbtree to do quick lookup in memtype tracking
  x86, pat: Add PAT reserve free to io_mapping* APIs
  x86, pat: New i/f for driver to request memtype for IO regions
  x86, pat: ioremap to follow same PAT restrictions as other PAT users
  x86, pat: Keep identity maps consistent with mmaps even when pat_disabled
  x86, mtrr: make mtrr_aps_delayed_init static bool
  x86, pat/mtrr: Rendezvous all the cpus for MTRR/PAT init
  generic-ipi: Allow cpus not yet online to call smp_call_function with irqs disabled
  x86: Fix an incorrect argument of reserve_bootmem()
  x86: Fix system crash when loading with "reservetop" parameter
2009-09-15 09:19:38 -07:00
Linus Torvalds
1aaf2e5913 Merge branch 'x86-txt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-txt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, intel_txt: clean up the impact on generic code, unbreak non-x86
  x86, intel_txt: Handle ACPI_SLEEP without X86_TRAMPOLINE
  x86, intel_txt: Fix typos in Kconfig help
  x86, intel_txt: Factor out the code for S3 setup
  x86, intel_txt: tboot.c needs <asm/fixmap.h>
  intel_txt: Force IOMMU on for Intel TXT launch
  x86, intel_txt: Intel TXT Sx shutdown support
  x86, intel_txt: Intel TXT reboot/halt shutdown support
  x86, intel_txt: Intel TXT boot support
2009-09-15 09:19:20 -07:00
Linus Torvalds
66a4fe0cb8 Merge branch 'agp-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6
* 'agp-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6:
  agp/intel: remove restore in resume
  agp: fix uninorth build
  intel-agp: Set dma mask for i915
  agp: kill phys_to_gart() and gart_to_phys()
  intel-agp: fix sglist allocation to avoid vmalloc()
  intel-agp: Move repeated sglist free into separate function
  agp: Switch agp_{un,}map_page() to take struct page * argument
  agp: tidy up handling of scratch pages w.r.t. DMA API
  intel_agp: Use PCI DMA API correctly on chipsets new enough to have IOMMU
  agp: Add generic support for graphics dma remapping
  agp: Switch mask_memory() method to take address argument again, not page
2009-09-15 09:18:07 -07:00
Wu Zhangjin
a2d10568fd ide: fixup for fujitsu disk
This patch will fix the following problem on Yeeloong netbook with
fujitsu disk.

irq 14: nobody cared (try booting with the "irqpoll" option)
Call Trace:
[<ffffffff8020d438>] dump_stack+0x8/0x40
[<ffffffff8027ec64>] __report_bad_irq+0x58/0xe4
[<ffffffff8027ee6c>] note_interrupt+0x17c/0x23c
[<ffffffff8027f9b8>] handle_level_irq+0xcc/0x134
[<ffffffff802125b0>] mach_irq_dispatch+0xb8/0x1e0
[<ffffffff8020041c>] ret_from_irq+0x0/0x4
[<ffffffff8029e678>] free_hot_cold_page+0x224/0x2a0
[<ffffffff8026f794>] swsusp_free+0xb0/0x14c
[<ffffffff8026ec08>] hibernate+0x198/0x218
[<ffffffff8026cfa8>] state_store+0x90/0x138
[<ffffffff8032b5a4>] sysfs_write_file+0x130/0x194
[<ffffffff802c94fc>] vfs_write+0xb8/0x180
[<ffffffff802c96b8>] SyS_write+0x50/0x98
[<ffffffff80203fd8>] handle_sys+0x158/0x174

handlers:
[<ffffffff80429670>] (ide_intr+0x0/0x300)
Disabling IRQ #14

References:

1. commit 1fde02e7146d4a1bab80fd1506f9018fe71e8521 of
git://dev.lemote.com/linux_loongson.git
2. 8bc1e5aa06 (ide: respect quirk_drives[]
list on all controllers)

Signed-off-by: Yan Hua <yanh@lemote.com>
Signed-off-by: Wu Zhangjin <wuzhangjin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-15 01:36:25 -07:00