Merge branch 'pending-misc' (early part) into devel

Russell King
2009-12-04 14:59:47 +00:00
1222 changed files with 22215 additions and 12662 deletions
+1
@@ -25,6 +25,7 @@
*.elf
*.bin
*.gz
*.bz2
*.lzma
*.patch
*.gcno
@@ -1,18 +0,0 @@
What: /sys/devices/system/cpu/cpu*/cache/index*/cache_disable_X
Date: August 2008
KernelVersion: 2.6.27
Contact: mark.langsdorf@amd.com
Description: These files exist in every cpu's cache index directories.
There are currently 2 cache_disable_# files in each
directory. Reading from these files on a supported
processor will return that cache disable index value
for that processor and node. Writing to one of these
files will cause the specified cache index to be disabled.
Currently, only AMD Family 10h Processors support cache index
disable, and only for their L3 caches. See the BIOS and
Kernel Developer's Guide at
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/31116-Public-GH-BKDG_3.20_2-4-09.pdf
for formatting information and other details on the
cache index disable.
Users: joachim.deguara@amd.com
@@ -0,0 +1,156 @@
What: /sys/devices/system/cpu/
Date: pre-git history
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description:
A collection of both global and individual CPU attributes.
Individual CPU attributes are contained in subdirectories
named by the kernel's logical CPU number, e.g.:
/sys/devices/system/cpu/cpu#/
What: /sys/devices/system/cpu/sched_mc_power_savings
/sys/devices/system/cpu/sched_smt_power_savings
Date: June 2006
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description: Discover and adjust the kernel's multi-core scheduler support.
Possible values are:
0 - No power saving load balance (default value)
1 - Fill one thread/core/package first for long running threads
2 - Also bias task wakeups to semi-idle cpu package for power
savings
sched_mc_power_savings is dependent upon SCHED_MC, which is
itself architecture dependent.
sched_smt_power_savings is dependent upon SCHED_SMT, which
is itself architecture dependent.
The two files are independent of each other. It is possible
that one file may be present without the other.
Introduced by git commit 5c45bf27.
What: /sys/devices/system/cpu/kernel_max
/sys/devices/system/cpu/offline
/sys/devices/system/cpu/online
/sys/devices/system/cpu/possible
/sys/devices/system/cpu/present
Date: December 2008
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description: CPU topology files that describe kernel limits related to
hotplug. Briefly:
kernel_max: the maximum cpu index allowed by the kernel
configuration.
offline: cpus that are not online because they have been
HOTPLUGGED off or exceed the limit of cpus allowed by the
kernel configuration (kernel_max above).
online: cpus that are online and being scheduled.
possible: cpus that have been allocated resources and can be
brought online if they are present.
present: cpus that have been identified as being present in
the system.
See Documentation/cputopology.txt for more information.
What: /sys/devices/system/cpu/cpu#/node
Date: October 2009
Contact: Linux memory management mailing list <linux-mm@kvack.org>
Description: Discover NUMA node a CPU belongs to
When CONFIG_NUMA is enabled, a symbolic link that points
to the corresponding NUMA node directory.
For example, the following symlink is created for cpu42
in NUMA node 2:
/sys/devices/system/cpu/cpu42/node2 -> ../../node/node2
What: /sys/devices/system/cpu/cpu#/topology/core_id
/sys/devices/system/cpu/cpu#/topology/core_siblings
/sys/devices/system/cpu/cpu#/topology/core_siblings_list
/sys/devices/system/cpu/cpu#/topology/physical_package_id
/sys/devices/system/cpu/cpu#/topology/thread_siblings
/sys/devices/system/cpu/cpu#/topology/thread_siblings_list
Date: December 2008
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description: CPU topology files that describe a logical CPU's relationship
to other cores and threads in the same physical package.
One cpu# directory is created per logical CPU in the system,
e.g. /sys/devices/system/cpu/cpu42/.
Briefly, the files above are:
core_id: the CPU core ID of cpu#. Typically it is the
hardware platform's identifier (rather than the kernel's).
The actual value is architecture and platform dependent.
core_siblings: internal kernel map of cpu#'s hardware threads
within the same physical_package_id.
core_siblings_list: human-readable list of the logical CPU
numbers within the same physical_package_id as cpu#.
physical_package_id: physical package id of cpu#. Typically
corresponds to a physical socket number, but the actual value
is architecture and platform dependent.
thread_siblings: internal kernel map of cpu#'s hardware
threads within the same core as cpu#
thread_siblings_list: human-readable list of cpu#'s hardware
threads within the same core as cpu#
See Documentation/cputopology.txt for more information.
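The sibling maps above can be decoded from user space. The following is a minimal Python sketch, not kernel code. It assumes the usual kernel bitmap text format (comma-separated groups of 32-bit hexadecimal digits, most significant group first), which this entry does not spell out; parse_cpumask is a hypothetical helper name.

```python
def parse_cpumask(text):
    """Turn a hex CPU mask such as "00000000,00000003" into a list of CPU numbers.

    Assumption: every group after the first is zero-padded to 8 hex digits,
    so dropping the commas yields one large hexadecimal number.
    """
    bits = int(text.strip().replace(",", ""), 16)
    # Bit N set means logical CPU N is a sibling.
    return [cpu for cpu in range(bits.bit_length()) if (bits >> cpu) & 1]


def read_thread_siblings(cpu, base="/sys/devices/system/cpu"):
    """Read and decode cpu#/topology/thread_siblings for one logical CPU."""
    with open(f"{base}/cpu{cpu}/topology/thread_siblings") as f:
        return parse_cpumask(f.read())
```

The *_list variants of these files already give the same information in human-readable form, so decoding the mask is only needed when a tool must consume the raw map.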
What: /sys/devices/system/cpu/cpuidle/current_driver
/sys/devices/system/cpu/cpuidle/current_governor_ro
Date: September 2007
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description: Discover cpuidle policy and mechanism
Various CPUs today support multiple idle levels that are
differentiated by varying exit latencies and power
consumption during idle.
Idle policy (governor) is differentiated from idle mechanism
(driver)
current_driver: displays current idle mechanism
current_governor_ro: displays current idle policy
See files in Documentation/cpuidle/ for more information.
What: /sys/devices/system/cpu/cpu*/cache/index*/cache_disable_X
Date: August 2008
KernelVersion: 2.6.27
Contact: mark.langsdorf@amd.com
Description: These files exist in every cpu's cache index directories.
There are currently 2 cache_disable_# files in each
directory. Reading from these files on a supported
processor will return that cache disable index value
for that processor and node. Writing to one of these
files will cause the specified cache index to be disabled.
Currently, only AMD Family 10h Processors support cache index
disable, and only for their L3 caches. See the BIOS and
Kernel Developer's Guide at
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/31116-Public-GH-BKDG_3.20_2-4-09.pdf
for formatting information and other details on the
cache index disable.
Users: joachim.deguara@amd.com
+30 -17
@@ -1,15 +1,28 @@
Export cpu topology info via sysfs. Items (attributes) are similar
Export CPU topology info via sysfs. Items (attributes) are similar
to /proc/cpuinfo.
1) /sys/devices/system/cpu/cpuX/topology/physical_package_id:
represent the physical package id of cpu X;
physical package id of cpuX. Typically corresponds to a physical
socket number, but the actual value is architecture and platform
dependent.
2) /sys/devices/system/cpu/cpuX/topology/core_id:
represent the cpu core id to cpu X;
the CPU core ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
architecture and platform dependent.
3) /sys/devices/system/cpu/cpuX/topology/thread_siblings:
represent the thread siblings to cpu X in the same core;
internal kernel map of cpuX's hardware threads within the same
core as cpuX
4) /sys/devices/system/cpu/cpuX/topology/core_siblings:
represent the thread siblings to cpu X in the same physical package;
internal kernel map of cpuX's hardware threads within the same
physical_package_id.
To implement it in an architecture-neutral way, a new source file,
drivers/base/topology.c, is to export the 4 attributes.
@@ -32,32 +45,32 @@ not defined by include/asm-XXX/topology.h:
3) thread_siblings: just the given CPU
4) core_siblings: just the given CPU
Additionally, cpu topology information is provided under
Additionally, CPU topology information is provided under
/sys/devices/system/cpu and includes these files. The internal
source for the output is in brackets ("[]").
kernel_max: the maximum cpu index allowed by the kernel configuration.
kernel_max: the maximum CPU index allowed by the kernel configuration.
[NR_CPUS-1]
offline: cpus that are not online because they have been
offline: CPUs that are not online because they have been
HOTPLUGGED off (see cpu-hotplug.txt) or exceed the limit
of cpus allowed by the kernel configuration (kernel_max
of CPUs allowed by the kernel configuration (kernel_max
above). [~cpu_online_mask + cpus >= NR_CPUS]
online: cpus that are online and being scheduled [cpu_online_mask]
online: CPUs that are online and being scheduled [cpu_online_mask]
possible: cpus that have been allocated resources and can be
possible: CPUs that have been allocated resources and can be
brought online if they are present. [cpu_possible_mask]
present: cpus that have been identified as being present in the
present: CPUs that have been identified as being present in the
system. [cpu_present_mask]
The format for the above output is compatible with cpulist_parse()
[see <linux/cpumask.h>]. Some examples follow.
In this example, there are 64 cpus in the system but cpus 32-63 exceed
In this example, there are 64 CPUs in the system but cpus 32-63 exceed
the kernel max which is limited to 0..31 by the NR_CPUS config option
being 32. Note also that cpus 2 and 4-31 are not online but could be
being 32. Note also that CPUs 2 and 4-31 are not online but could be
brought online as they are both present and possible.
kernel_max: 31
@@ -67,8 +80,8 @@ brought online as they are both present and possible.
present: 0-31
In this example, the NR_CPUS config option is 128, but the kernel was
started with possible_cpus=144. There are 4 cpus in the system and cpu2
was manually taken offline (and is the only cpu that can be brought
started with possible_cpus=144. There are 4 CPUs in the system and cpu2
was manually taken offline (and is the only CPU that can be brought
online.)
kernel_max: 127
@@ -78,4 +91,4 @@ online.)
present: 0-3
See cpu-hotplug.txt for the possible_cpus=NUM kernel start parameter
as well as more information on the various cpumask's.
as well as more information on the various cpumasks.
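The list format used by kernel_max, offline, online, possible and present is easy to consume from user space. The following is a minimal Python sketch, not part of the kernel; parse_cpulist and read_cpulist are hypothetical helper names, and only plain comma-separated numbers and ranges are handled.

```python
def parse_cpulist(text):
    """Expand a cpulist string such as "0-1,3" into a sorted list of CPU numbers."""
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue  # an empty file (e.g. no offline CPUs) yields one empty chunk
        first, sep, last = chunk.partition("-")
        low = int(first)
        high = int(last) if sep else low  # a bare number is a one-element range
        cpus.update(range(low, high + 1))
    return sorted(cpus)


def read_cpulist(name, base="/sys/devices/system/cpu"):
    """Read one of the files described above, e.g. read_cpulist("online")."""
    with open(f"{base}/{name}") as f:
        return parse_cpulist(f.read())
```

With such a helper, the CPUs that could still be brought online can be computed as the set difference between the present and online lists.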
+2 -4
@@ -312,10 +312,8 @@ and to the following documentation:
8. Mailing list
---------------
There are several frame buffer device related mailing lists at SourceForge:
- linux-fbdev-announce@lists.sourceforge.net, for announcements,
- linux-fbdev-user@lists.sourceforge.net, for generic user support,
- linux-fbdev-devel@lists.sourceforge.net, for project developers.
There is a frame buffer device related mailing list at kernel.org:
linux-fbdev@vger.kernel.org.
Point your web browser to http://sourceforge.net/projects/linux-fbdev/ for
subscription information and archive browsing.
@@ -418,6 +418,14 @@ When: 2.6.33
Why: Should be implemented in userspace, policy daemon.
Who: Johannes Berg <johannes@sipsolutions.net>
---------------------------
What: CONFIG_INOTIFY
When: 2.6.33
Why: last user (audit) will be converted to the newer more generic
and more easily maintained fsnotify subsystem
Who: Eric Paris <eparis@redhat.com>
----------------------------
What: lock_policy_rwsem_* and unlock_policy_rwsem_* will not be
@@ -235,6 +235,7 @@ proc files.
neg=N Number of negative lookups made
pos=N Number of positive lookups made
crt=N Number of objects created by lookup
tmo=N Number of lookups timed out and requeued
Updates n=N Number of update cookie requests seen
nul=N Number of upd reqs given a NULL parent
run=N Number of upd reqs granted CPU time
@@ -250,8 +251,10 @@ proc files.
ok=N Number of successful alloc reqs
wt=N Number of alloc reqs that waited on lookup completion
nbf=N Number of alloc reqs rejected -ENOBUFS
int=N Number of alloc reqs aborted -ERESTARTSYS
ops=N Number of alloc reqs submitted
owt=N Number of alloc reqs waited for CPU time
abt=N Number of alloc reqs aborted due to object death
Retrvls n=N Number of retrieval (read) requests seen
ok=N Number of successful retr reqs
wt=N Number of retr reqs that waited on lookup completion
@@ -261,6 +264,7 @@ proc files.
oom=N Number of retr reqs failed -ENOMEM
ops=N Number of retr reqs submitted
owt=N Number of retr reqs waited for CPU time
abt=N Number of retr reqs aborted due to object death
Stores n=N Number of storage (write) requests seen
ok=N Number of successful store reqs
agn=N Number of store reqs on a page already pending storage
@@ -268,12 +272,37 @@ proc files.
oom=N Number of store reqs failed -ENOMEM
ops=N Number of store reqs submitted
run=N Number of store reqs granted CPU time
pgs=N Number of pages given store req processing time
rxd=N Number of store reqs deleted from tracking tree
olm=N Number of store reqs over store limit
VmScan nos=N Number of release reqs against pages with no pending store
gon=N Number of release reqs against pages stored by time lock granted
bsy=N Number of release reqs ignored due to in-progress store
can=N Number of page stores cancelled due to release req
Ops pend=N Number of times async ops added to pending queues
run=N Number of times async ops given CPU time
enq=N Number of times async ops queued for processing
can=N Number of async ops cancelled
rej=N Number of async ops rejected due to object lookup/create failure
dfr=N Number of async ops queued for deferred release
rel=N Number of async ops released
gc=N Number of deferred-release async ops garbage collected
CacheOp alo=N Number of in-progress alloc_object() cache ops
luo=N Number of in-progress lookup_object() cache ops
luc=N Number of in-progress lookup_complete() cache ops
gro=N Number of in-progress grab_object() cache ops
upo=N Number of in-progress update_object() cache ops
dro=N Number of in-progress drop_object() cache ops
pto=N Number of in-progress put_object() cache ops
syn=N Number of in-progress sync_cache() cache ops
atc=N Number of in-progress attr_changed() cache ops
rap=N Number of in-progress read_or_alloc_page() cache ops
ras=N Number of in-progress read_or_alloc_pages() cache ops
alp=N Number of in-progress allocate_page() cache ops
als=N Number of in-progress allocate_pages() cache ops
wrp=N Number of in-progress write_page() cache ops
ucp=N Number of in-progress uncache_page() cache ops
dsp=N Number of in-progress dissociate_pages() cache ops
(*) /proc/fs/fscache/histogram
@@ -299,6 +328,87 @@ proc files.
jiffy range covered, and the SECS field the equivalent number of seconds.
===========
OBJECT LIST
===========
If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a
list of all the objects currently allocated and allow them to be viewed
through:
/proc/fs/fscache/objects
This will look something like:
[root@andromeda ~]# head /proc/fs/fscache/objects
OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA
======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================
17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 8 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a
1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 8 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a
where the first set of columns before the '|' describe the object:
COLUMN DESCRIPTION
======= ===============================================================
OBJECT Object debugging ID (appears as OBJ%x in some debug messages)
PARENT Debugging ID of parent object
STAT Object state
CHLDN Number of child objects of this object
OPS Number of outstanding operations on this object
OOP Number of outstanding child object management operations
IPR
EX Number of outstanding exclusive operations
READS Number of outstanding read operations
EM Object's event mask
EV Events raised on this object
F Object flags
S Object slow-work work item flags
and the second set of columns describe the object's cookie, if present:
COLUMN DESCRIPTION
=============== =======================================================
NETFS_COOKIE_DEF Name of netfs cookie definition
TY Cookie type (IX - index, DT - data, hex - special)
FL Cookie flags
NETFS_DATA Netfs private data stored in the cookie
OBJECT_KEY Object key } 1 column, with separating comma
AUX_DATA Object aux data } presence may be configured
The data shown may be filtered by attaching a key to an appropriate keyring
before viewing the file. Something like:
keyctl add user fscache:objlist <restrictions> @s
where <restrictions> are a selection of the following letters:
K Show hexdump of object key (don't show if not given)
A Show hexdump of object aux data (don't show if not given)
and the following paired letters:
C Show objects that have a cookie
c Show objects that don't have a cookie
B Show objects that are busy
b Show objects that aren't busy
W Show objects that have pending writes
w Show objects that don't have pending writes
R Show objects that have outstanding reads
r Show objects that don't have outstanding reads
S Show objects that have slow work queued
s Show objects that don't have slow work queued
If neither side of a letter pair is given, then both are implied. For example:
keyctl add user fscache:objlist KB @s
shows objects that are busy, and lists their object keys, but does not dump
their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
not implied.
By default all objects and all fields will be shown.
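The pairing rule for the restriction letters can be sketched as follows. This is an illustrative Python model of the rules described above, not the kernel's implementation; effective_filter is a hypothetical name, and it models only the case where a key has been attached.

```python
# Paired letters: upper case selects objects with the property, lower case without.
PAIRS = ["Cc", "Bb", "Ww", "Rr", "Ss"]
# Field letters: a hexdump is shown only if the letter is given.
FIELDS = "KA"


def effective_filter(restrictions):
    """Return the set of letters in effect for a given restriction string."""
    given = set(restrictions)
    unknown = given - set(FIELDS) - set("".join(PAIRS))
    if unknown:
        raise ValueError("unknown restriction letters: " + "".join(sorted(unknown)))
    result = given & set(FIELDS)
    for pair in PAIRS:
        chosen = given & set(pair)
        # If neither side of a pair is given, both sides are implied.
        result |= chosen or set(pair)
    return result
```

For the "KB" example, this yields K and B plus "CcWwRrSs", with neither 'b' nor 'A' present.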
=========
DEBUGGING
=========
@@ -641,7 +641,7 @@ data file must be retired (see the relinquish cookie function below).
Furthermore, note that this does not cancel the asynchronous read or write
operation started by the read/alloc and write functions, so the page
invalidation and release functions must use:
invalidation functions must use:
bool fscache_check_page_write(struct fscache_cookie *cookie,
struct page *page);
@@ -654,6 +654,25 @@ to see if a page is being written to the cache, and:
to wait for it to finish if it is.
When releasepage() is being implemented, a special FS-Cache function exists to
manage the heuristics of coping with vmscan trying to eject pages, which may
conflict with the cache trying to write pages to the cache (which may itself
need to allocate memory):
bool fscache_maybe_release_page(struct fscache_cookie *cookie,
struct page *page,
gfp_t gfp);
This takes the netfs cookie, and the page and gfp arguments as supplied to
releasepage(). It will return false if the page cannot be released yet for
some reason and if it returns true, the page has been uncached and can now be
released.
To make a page available for release, this function may wait for an outstanding
storage request to complete, or it may attempt to cancel the storage request -
in which case the page will not be stored in the cache this time.
==========================
INDEX AND DATA FILE UPDATE
==========================
+7 -1
@@ -134,9 +134,15 @@ ro Mount filesystem read only. Note that ext4 will
mount options "ro,noload" can be used to prevent
writes to the filesystem.
journal_checksum Enable checksumming of the journal transactions.
This will allow the recovery code in e2fsck and the
kernel to detect corruption in the kernel. It is a
compatible change and will be ignored by older kernels.
journal_async_commit Commit block can be written to disk without waiting
for descriptor blocks. If enabled older kernels cannot
mount the device.
mount the device. This will enable 'journal_checksum'
internally.
journal=update Update the ext4 file system's journal to the current
format.
+3 -3
@@ -20,15 +20,16 @@ Lots of code taken from ext3 and other projects.
Authors in alphabetical order:
Joel Becker <joel.becker@oracle.com>
Zach Brown <zach.brown@oracle.com>
Mark Fasheh <mark.fasheh@oracle.com>
Mark Fasheh <mfasheh@suse.com>
Kurt Hackel <kurt.hackel@oracle.com>
Tao Ma <tao.ma@oracle.com>
Sunil Mushran <sunil.mushran@oracle.com>
Manish Singh <manish.singh@oracle.com>
Tiger Yang <tiger.yang@oracle.com>
Caveats
=======
Features which OCFS2 does not support yet:
- quotas
- Directory change notification (F_NOTIFY)
- Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease)
@@ -70,7 +71,6 @@ commit=nrsec (*) Ocfs2 can be told to sync all its data and metadata
performance.
localalloc=8(*) Allows custom localalloc size in MB. If the value is too
large, the fs will silently revert it to the default.
Localalloc is not enabled for local mounts.
localflocks This disables cluster aware flock.
inode64 Indicates that Ocfs2 is allowed to create inodes at
any location in the filesystem, including those which
+56 -1
@@ -353,10 +353,20 @@ power[1-*]_average Average power use
Unit: microWatt
RO
power[1-*]_average_interval Power use averaging interval
power[1-*]_average_interval Power use averaging interval. A poll
notification is sent to this file if the
hardware changes the averaging interval.
Unit: milliseconds
RW
power[1-*]_average_interval_max Maximum power use averaging interval
Unit: milliseconds
RO
power[1-*]_average_interval_min Minimum power use averaging interval
Unit: milliseconds
RO
power[1-*]_average_highest Historical average maximum power use
Unit: microWatt
RO
@@ -365,6 +375,18 @@ power[1-*]_average_lowest Historical average minimum power use
Unit: microWatt
RO
power[1-*]_average_max A poll notification is sent to
power[1-*]_average when power use
rises above this value.
Unit: microWatt
RW
power[1-*]_average_min A poll notification is sent to
power[1-*]_average when power use
sinks below this value.
Unit: microWatt
RW
power[1-*]_input Instantaneous power use
Unit: microWatt
RO
@@ -381,6 +403,39 @@ power[1-*]_reset_history Reset input_highest, input_lowest,
average_highest and average_lowest.
WO
power[1-*]_accuracy Accuracy of the power meter.
Unit: Percent
RO
power[1-*]_alarm 1 if the system is drawing more power than the
cap allows; 0 otherwise. A poll notification is
sent to this file when the power use exceeds the
cap. This file only appears if the cap is known
to be enforced by hardware.
RO
power[1-*]_cap If power use rises above this limit, the
system should take action to reduce power use.
A poll notification is sent to this file if the
cap is changed by the hardware. The *_cap
files only appear if the cap is known to be
enforced by hardware.
Unit: microWatt
RW
power[1-*]_cap_hyst Margin of hysteresis built around capping and
notification.
Unit: microWatt
RW
power[1-*]_cap_max Maximum cap that can be set.
Unit: microWatt
RO
power[1-*]_cap_min Minimum cap that can be set.
Unit: microWatt
RO
**********
* Energy *
**********
+1 -1
@@ -8,7 +8,7 @@ Supported adapters:
Datasheet: Only available via NDA from ServerWorks
* ATI IXP200, IXP300, IXP400, SB600, SB700 and SB800 southbridges
Datasheet: Not publicly available
* AMD SB900
* AMD Hudson-2
Datasheet: Not publicly available
* Standard Microsystems (SMSC) SLC90E66 (Victory66) southbridge
Datasheet: Publicly available at the SMSC website http://www.smsc.com
-1
@@ -42,7 +42,6 @@
#include <signal.h>
#include "linux/lguest_launcher.h"
#include "linux/virtio_config.h"
#include <linux/virtio_ids.h>
#include "linux/virtio_net.h"
#include "linux/virtio_blk.h"
#include "linux/virtio_console.h"
+154 -6
@@ -41,6 +41,13 @@ expand files, provided the time taken to do so isn't too long.
Operations of both types may sleep during execution, thus tying up the thread
loaned to it.
A further class of work item is available, based on the slow work item class:
(*) Delayed slow work items.
These are slow work items that have a timer to defer queueing of the item for
a while.
THREAD-TO-CLASS ALLOCATION
--------------------------
@@ -64,9 +71,11 @@ USING SLOW WORK ITEMS
Firstly, a module or subsystem wanting to make use of slow work items must
register its interest:
int ret = slow_work_register_user();
int ret = slow_work_register_user(struct module *module);
This will return 0 if successful, or a -ve error upon failure.
This will return 0 if successful, or a -ve error upon failure. The module
pointer should be the module interested in using this facility (almost
certainly THIS_MODULE).
Slow work items may then be set up by:
@@ -91,6 +100,10 @@ Slow work items may then be set up by:
slow_work_init(&myitem, &myitem_ops);
or:
delayed_slow_work_init(&myitem, &myitem_ops);
or:
vslow_work_init(&myitem, &myitem_ops);
@@ -102,15 +115,92 @@ A suitably set up work item can then be enqueued for processing:
int ret = slow_work_enqueue(&myitem);
This will return a -ve error if the thread pool is unable to gain a reference
on the item, 0 otherwise.
on the item, 0 otherwise, or (for delayed work):
int ret = delayed_slow_work_enqueue(&myitem, my_jiffy_delay);
The items are reference counted, so there ought to be no need for a flush
operation. When all a module's slow work items have been processed, and the
operation. But as the reference counting is optional, means to cancel
existing work items are also included:
cancel_slow_work(&myitem);
cancel_delayed_slow_work(&myitem);
can be used to cancel pending work. The above cancel functions wait for
existing work to have been executed (or prevent its execution, depending
on timing).
When all a module's slow work items have been processed, and the
module has no further interest in the facility, it should unregister its
interest:
slow_work_unregister_user();
slow_work_unregister_user(struct module *module);
The module pointer is used to wait for all outstanding work items for that
module before completing the unregistration. This prevents the put_ref() code
from being taken away before it completes. module should almost certainly be
THIS_MODULE.
================
HELPER FUNCTIONS
================
The slow-work facility provides a function by which it can be determined
whether or not an item is queued for later execution:
bool queued = slow_work_is_queued(struct slow_work *work);
If it returns false, then the item is not on the queue (it may be executing
with a requeue pending). This can be used to work out whether an item on which
another depends is on the queue, thus allowing a dependent item to be queued
after it.
If the above shows an item on which another depends not to be queued, then the
owner of the dependent item might need to wait. However, to avoid locking up
the threads unnecessarily by sleeping in them, it can make sense under some
circumstances to return the work item to the queue, thus deferring it until
some other items have had a chance to make use of the yielded thread.
To yield a thread and defer an item, the work function should simply enqueue
the work item again and return. However, this doesn't work if there's nothing
actually on the queue, as the thread just vacated will jump straight back into
the item's work function, thus busy waiting on a CPU.
Instead, the item should use the thread to wait for the dependency to go away,
but rather than using schedule() or schedule_timeout() to sleep, it should use
the following function:
bool requeue = slow_work_sleep_till_thread_needed(
struct slow_work *work,
signed long *_timeout);
This will add a second wait and then sleep, such that it will be woken up if
either something appears on the queue that could usefully make use of the
thread - and behind which this item can be queued, or if the event the caller
set up to wait for happens. True will be returned if something else appeared
on the queue and this work function should perhaps return, or false if
something else woke it up. The timeout is as for schedule_timeout().
For example:
	wq = bit_waitqueue(&my_flags, MY_BIT);
	init_wait(&wait);
	requeue = false;
	do {
		prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
		if (!test_bit(MY_BIT, &my_flags))
			break;
		requeue = slow_work_sleep_till_thread_needed(&my_work,
							      &timeout);
	} while (timeout > 0 && !requeue);
	finish_wait(wq, &wait);
	if (!test_bit(MY_BIT, &my_flags))
		goto do_my_thing;
	if (requeue)
		return; // to slow_work
===============
@@ -118,7 +208,8 @@ ITEM OPERATIONS
===============
Each work item requires a table of operations of type struct slow_work_ops.
All members are required:
Only ->execute() is required; the getting and putting of a reference and the
describing of an item are all optional.
(*) Get a reference on an item:
@@ -148,6 +239,16 @@ All members are required:
This should perform the work required of the item. It may sleep, it may
perform disk I/O and it may wait for locks.
(*) View an item through /proc:
void (*desc)(struct slow_work *work, struct seq_file *m);
If supplied, this should print to 'm' a small string describing the work
the item is to do. This should be no more than about 40 characters, and
shouldn't include a newline character.
See the 'Viewing executing and queued items' section below.
==================
POOL CONFIGURATION
@@ -172,3 +273,50 @@ The slow-work thread pool has a number of configurables:
is bounded to between 1 and one fewer than the number of active threads.
This ensures there is always at least one thread that can process very
slow work items, and always at least one thread that won't.
==================================
VIEWING EXECUTING AND QUEUED ITEMS
==================================
If CONFIG_SLOW_WORK_PROC is enabled, a proc file is made available:
/proc/slow_work_rq
through which the list of work items being executed and the queues of items to
be executed may be viewed. The owner of a work item is given the chance to
add some information of its own.
The contents look something like the following:
THR PID ITEM ADDR FL MARK DESC
=== ===== ================ == ===== ==========
0 3005 ffff880023f52348 a 952ms FSC: OBJ17d3: LOOK
1 3006 ffff880024e33668 2 160ms FSC: OBJ17e5 OP60d3b: Write1/Store fl=2
2 3165 ffff8800296dd180 a 424ms FSC: OBJ17e4: LOOK
3 4089 ffff8800262c8d78 a 212ms FSC: OBJ17ea: CRTN
4 4090 ffff88002792bed8 2 388ms FSC: OBJ17e8 OP60d36: Write1/Store fl=2
5 4092 ffff88002a0ef308 2 388ms FSC: OBJ17e7 OP60d2e: Write1/Store fl=2
6 4094 ffff88002abaf4b8 2 132ms FSC: OBJ17e2 OP60d4e: Write1/Store fl=2
7 4095 ffff88002bb188e0 a 388ms FSC: OBJ17e9: CRTN
vsq - ffff880023d99668 1 308ms FSC: OBJ17e0 OP60f91: Write1/EnQ fl=2
vsq - ffff8800295d1740 1 212ms FSC: OBJ16be OP4d4b6: Write1/EnQ fl=2
vsq - ffff880025ba3308 1 160ms FSC: OBJ179a OP58dec: Write1/EnQ fl=2
vsq - ffff880024ec83e0 1 160ms FSC: OBJ17ae OP599f2: Write1/EnQ fl=2
vsq - ffff880026618e00 1 160ms FSC: OBJ17e6 OP60d33: Write1/EnQ fl=2
vsq - ffff880025a2a4b8 1 132ms FSC: OBJ16a2 OP4d583: Write1/EnQ fl=2
vsq - ffff880023cbe6d8 9 212ms FSC: OBJ17eb: LOOK
vsq - ffff880024d37590 9 212ms FSC: OBJ17ec: LOOK
vsq - ffff880027746cb0 9 212ms FSC: OBJ17ed: LOOK
vsq - ffff880024d37ae8 9 212ms FSC: OBJ17ee: LOOK
vsq - ffff880024d37cb0 9 212ms FSC: OBJ17ef: LOOK
vsq - ffff880025036550 9 212ms FSC: OBJ17f0: LOOK
vsq - ffff8800250368e0 9 212ms FSC: OBJ17f1: LOOK
vsq - ffff880025036aa8 9 212ms FSC: OBJ17f2: LOOK
In the 'THR' column, executing items show the thread they're occupying and
queued items indicate which queue they're on. 'PID' shows the process ID of
a slow-work thread that's executing something. 'FL' shows the work item flags.
'MARK' indicates how long since an item was queued or began executing. Lastly,
the 'DESC' column permits the owner of an item to give some information.
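A row of this listing can be split into its columns as sketched below. This is an illustrative Python helper under stated assumptions, not part of the kernel: the name parse_slow_work_row is hypothetical, the field layout is inferred only from the sample output above, 'FL' is assumed to be hexadecimal, and 'MARK' is kept as raw text because its unit may vary.

```python
from typing import NamedTuple


class SlowWorkRow(NamedTuple):
    thr: str    # thread number for executing items, queue name (e.g. "vsq") otherwise
    pid: str    # process ID, or "-" for queued items
    addr: str   # work item address as printed
    flags: int  # work item flags (assumed hexadecimal)
    mark: str   # time since the item was queued or began executing, as printed
    desc: str   # owner-supplied description; empty if none was given


def parse_slow_work_row(line):
    """Split one data row (not a header or separator line) into its columns."""
    parts = line.split(None, 5)  # DESC may itself contain spaces
    if len(parts) < 5:
        raise ValueError("not a data row: " + line)
    thr, pid, addr, fl, mark = parts[:5]
    desc = parts[5].rstrip("\n") if len(parts) > 5 else ""
    return SlowWorkRow(thr, pid, addr, int(fl, 16), mark, desc)
```

The header line and the "===" separator line would need to be skipped before calling this helper.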
@@ -522,7 +522,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
pcm_devs - Number of PCM devices assigned to each card
(default = 1, up to 4)
pcm_substreams - Number of PCM substreams assigned to each PCM
(default = 8, up to 16)
(default = 8, up to 128)
hrtimer - Use hrtimer (=1, default) or system timer (=0)
fake_buffer - Fake buffer allocations (default = 1)
+188 -175
@@ -1,5 +1,5 @@
Generic Thermal Sysfs driver How To
=========================
===================================
Written by Sujith Thomas <sujith.thomas@intel.com>, Zhang Rui <rui.zhang@intel.com>
@@ -10,20 +10,20 @@ Copyright (c) 2008 Intel Corporation
0. Introduction
The generic thermal sysfs provides a set of interfaces for thermal zone devices (sensors)
and thermal cooling devices (fan, processor...) to register with the thermal management
solution and to be a part of it.
The generic thermal sysfs provides a set of interfaces for thermal zone
devices (sensors) and thermal cooling devices (fan, processor...) to register
with the thermal management solution and to be a part of it.
This how-to focuses on enabling new thermal zone and cooling devices to participate
in thermal management.
This solution is platform independent and any type of thermal zone devices and
cooling devices should be able to make use of the infrastructure.
This how-to focuses on enabling new thermal zone and cooling devices to
participate in thermal management.
This solution is platform independent and any type of thermal zone devices
and cooling devices should be able to make use of the infrastructure.
The main task of the thermal sysfs driver is to expose thermal zone attributes as well
as cooling device attributes to the user space.
An intelligent thermal management application can make decisions based on inputs
from thermal zone attributes (the current temperature and trip point temperature)
and throttle appropriate devices.
The main task of the thermal sysfs driver is to expose thermal zone attributes
as well as cooling device attributes to the user space.
An intelligent thermal management application can make decisions based on
inputs from thermal zone attributes (the current temperature and trip point
temperature) and throttle appropriate devices.
[0-*] denotes any positive number starting from 0
[1-*] denotes any positive number starting from 1
@@ -31,77 +31,77 @@ and throttle appropriate devices.
1. thermal sysfs driver interface functions
1.1 thermal zone device interface
1.1.1 struct thermal_zone_device *thermal_zone_device_register(char *name, int trips,
void *devdata, struct thermal_zone_device_ops *ops)
1.1.1 struct thermal_zone_device *thermal_zone_device_register(char *name,
int trips, void *devdata, struct thermal_zone_device_ops *ops)
This interface function adds a new thermal zone device (sensor) to
/sys/class/thermal folder as thermal_zone[0-*].
It tries to bind all the thermal cooling devices registered at the same time.
This interface function adds a new thermal zone device (sensor) to
/sys/class/thermal folder as thermal_zone[0-*]. It tries to bind all the
thermal cooling devices registered at the same time.
name: the thermal zone name.
trips: the total number of trip points this thermal zone supports.
devdata: device private data
ops: thermal zone device call-backs.
.bind: bind the thermal zone device with a thermal cooling device.
.unbind: unbind the thermal zone device from a thermal cooling device.
.get_temp: get the current temperature of the thermal zone.
.get_mode: get the current mode (user/kernel) of the thermal zone.
"kernel" means thermal management is done in kernel.
"user" will prevent kernel thermal driver actions upon trip points
so that user applications can take charge of thermal management.
.set_mode: set the mode (user/kernel) of the thermal zone.
.get_trip_type: get the type of a certain trip point.
.get_trip_temp: get the temperature above which a certain trip point
will be fired.
name: the thermal zone name.
trips: the total number of trip points this thermal zone supports.
devdata: device private data
ops: thermal zone device call-backs.
.bind: bind the thermal zone device with a thermal cooling device.
.unbind: unbind the thermal zone device from a thermal cooling device.
.get_temp: get the current temperature of the thermal zone.
.get_mode: get the current mode (user/kernel) of the thermal zone.
- "kernel" means thermal management is done in kernel.
- "user" will prevent kernel thermal driver actions upon trip points
so that user applications can take charge of thermal management.
.set_mode: set the mode (user/kernel) of the thermal zone.
.get_trip_type: get the type of a certain trip point.
.get_trip_temp: get the temperature above which a certain trip point
will be fired.
1.1.2 void thermal_zone_device_unregister(struct thermal_zone_device *tz)
This interface function removes the thermal zone device.
It deletes the corresponding entry from /sys/class/thermal folder and unbinds all
the thermal cooling devices it uses.
This interface function removes the thermal zone device.
It deletes the corresponding entry from /sys/class/thermal folder and
unbinds all the thermal cooling devices it uses.
1.2 thermal cooling device interface
1.2.1 struct thermal_cooling_device *thermal_cooling_device_register(char *name,
void *devdata, struct thermal_cooling_device_ops *)
void *devdata, struct thermal_cooling_device_ops *)
This interface function adds a new thermal cooling device (fan/processor/...) to
/sys/class/thermal/ folder as cooling_device[0-*].
It tries to bind itself to all the thermal zone devices registered at the same time.
name: the cooling device name.
devdata: device private data.
ops: thermal cooling devices call-backs.
.get_max_state: get the Maximum throttle state of the cooling device.
.get_cur_state: get the Current throttle state of the cooling device.
.set_cur_state: set the Current throttle state of the cooling device.
This interface function adds a new thermal cooling device (fan/processor/...)
to /sys/class/thermal/ folder as cooling_device[0-*]. It tries to bind itself
to all the thermal zone devices registered at the same time.
name: the cooling device name.
devdata: device private data.
ops: thermal cooling devices call-backs.
.get_max_state: get the Maximum throttle state of the cooling device.
.get_cur_state: get the Current throttle state of the cooling device.
.set_cur_state: set the Current throttle state of the cooling device.
1.2.2 void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev)
This interface function removes the thermal cooling device.
It deletes the corresponding entry from /sys/class/thermal folder and unbinds
itself from all the thermal zone devices using it.
This interface function removes the thermal cooling device.
It deletes the corresponding entry from /sys/class/thermal folder and
unbinds itself from all the thermal zone devices using it.
1.3 interface for binding a thermal zone device with a thermal cooling device
1.3.1 int thermal_zone_bind_cooling_device(struct thermal_zone_device *tz,
int trip, struct thermal_cooling_device *cdev);
int trip, struct thermal_cooling_device *cdev);
This interface function binds a thermal cooling device to a certain trip point
of a thermal zone device.
This function is usually called in the thermal zone device .bind callback.
tz: the thermal zone device
cdev: thermal cooling device
trip: indicates which trip point the cooling device is associated with
in this thermal zone.
This interface function binds a thermal cooling device to a certain trip
point of a thermal zone device.
This function is usually called in the thermal zone device .bind callback.
tz: the thermal zone device
cdev: thermal cooling device
trip: indicates which trip point the cooling device is associated with
in this thermal zone.
1.3.2 int thermal_zone_unbind_cooling_device(struct thermal_zone_device *tz,
int trip, struct thermal_cooling_device *cdev);
int trip, struct thermal_cooling_device *cdev);
This interface function unbinds a thermal cooling device from a certain trip point
of a thermal zone device.
This function is usually called in the thermal zone device .unbind callback.
tz: the thermal zone device
cdev: thermal cooling device
trip: indicates which trip point the cooling device is associated with
in this thermal zone.
This interface function unbinds a thermal cooling device from a certain
trip point of a thermal zone device. This function is usually called in
the thermal zone device .unbind callback.
tz: the thermal zone device
cdev: thermal cooling device
trip: indicates which trip point the cooling device is associated with
in this thermal zone.
2. sysfs attributes structure
@@ -114,153 +114,166 @@ if hwmon is compiled in or built as a module.
Thermal zone device sys I/F, created once it's registered:
/sys/class/thermal/thermal_zone[0-*]:
|-----type: Type of the thermal zone
|-----temp: Current temperature
|-----mode: Working mode of the thermal zone
|-----trip_point_[0-*]_temp: Trip point temperature
|-----trip_point_[0-*]_type: Trip point type
|---type: Type of the thermal zone
|---temp: Current temperature
|---mode: Working mode of the thermal zone
|---trip_point_[0-*]_temp: Trip point temperature
|---trip_point_[0-*]_type: Trip point type
Thermal cooling device sys I/F, created once it's registered:
/sys/class/thermal/cooling_device[0-*]:
|-----type : Type of the cooling device(processor/fan/...)
|-----max_state: Maximum cooling state of the cooling device
|-----cur_state: Current cooling state of the cooling device
|---type: Type of the cooling device(processor/fan/...)
|---max_state: Maximum cooling state of the cooling device
|---cur_state: Current cooling state of the cooling device
These two dynamic attributes are created/removed in pairs.
They represent the relationship between a thermal zone and its associated cooling device.
They are created/removed for each
thermal_zone_bind_cooling_device/thermal_zone_unbind_cooling_device successful execution.
The next two dynamic attributes are created/removed in pairs. They represent
the relationship between a thermal zone and its associated cooling device.
They are created/removed for each successful execution of
thermal_zone_bind_cooling_device/thermal_zone_unbind_cooling_device.
/sys/class/thermal/thermal_zone[0-*]
|-----cdev[0-*]: The [0-*]th cooling device in the current thermal zone
|-----cdev[0-*]_trip_point: Trip point that cdev[0-*] is associated with
/sys/class/thermal/thermal_zone[0-*]:
|---cdev[0-*]: [0-*]th cooling device in current thermal zone
|---cdev[0-*]_trip_point: Trip point that cdev[0-*] is associated with
Besides the thermal zone device sysfs I/F and cooling device sysfs I/F,
the generic thermal driver also creates a hwmon sysfs I/F for each _type_ of
thermal zone device. E.g. the generic thermal driver registers one hwmon class device
and builds the associated hwmon sysfs I/F for all the registered ACPI thermal zones.
the generic thermal driver also creates a hwmon sysfs I/F for each _type_
of thermal zone device. E.g. the generic thermal driver registers one hwmon
class device and builds the associated hwmon sysfs I/F for all the registered
ACPI thermal zones.
/sys/class/hwmon/hwmon[0-*]:
|-----name: The type of the thermal zone devices.
|-----temp[1-*]_input: The current temperature of thermal zone [1-*].
|-----temp[1-*]_critical: The critical trip point of thermal zone [1-*].
|---name: The type of the thermal zone devices
|---temp[1-*]_input: The current temperature of thermal zone [1-*]
|---temp[1-*]_critical: The critical trip point of thermal zone [1-*]
Please read Documentation/hwmon/sysfs-interface for additional information.
***************************
* Thermal zone attributes *
***************************
type Strings which represent the thermal zone type.
This is given by thermal zone driver as part of registration.
Eg: "acpitz" indicates it's an ACPI thermal device.
In order to keep it consistent with hwmon sys attribute,
this should be a short, lowercase string,
not containing spaces nor dashes.
RO
Required
type
Strings which represent the thermal zone type.
This is given by thermal zone driver as part of registration.
E.g.: "acpitz" indicates it's an ACPI thermal device.
To keep it consistent with the hwmon sys attribute, this should
be a short, lowercase string containing neither spaces nor dashes.
RO, Required
temp Current temperature as reported by thermal zone (sensor)
Unit: millidegree Celsius
RO
Required
temp
Current temperature as reported by thermal zone (sensor).
Unit: millidegree Celsius
RO, Required
mode One of the predefined values in [kernel, user]
This file gives information about the algorithm
that is currently managing the thermal zone.
It can be either default kernel based algorithm
or user space application.
RW
Optional
kernel = Thermal management in kernel thermal zone driver.
user = Preventing kernel thermal zone driver actions upon
trip points so that user application can take full
charge of the thermal management.
mode
One of the predefined values in [kernel, user].
This file gives information about the algorithm that is currently
managing the thermal zone. It can be either default kernel based
algorithm or user space application.
kernel = Thermal management in kernel thermal zone driver.
user = Preventing kernel thermal zone driver actions upon
trip points so that user application can take full
charge of the thermal management.
RW, Optional
trip_point_[0-*]_temp The temperature above which trip point will be fired
Unit: millidegree Celsius
RO
Optional
trip_point_[0-*]_temp
The temperature above which trip point will be fired.
Unit: millidegree Celsius
RO, Optional
trip_point_[0-*]_type Strings which indicate the type of the trip point
E.g. it can be one of critical, hot, passive,
active[0-*] for ACPI thermal zone.
RO
Optional
trip_point_[0-*]_type
Strings which indicate the type of the trip point.
E.g. it can be one of critical, hot, passive, active[0-*] for ACPI
thermal zone.
RO, Optional
cdev[0-*] Sysfs link to the thermal cooling device node where the sys I/F
for cooling device throttling control represents.
RO
Optional
cdev[0-*]
Sysfs link to the thermal cooling device node, which provides the sys I/F
for cooling device throttling control.
RO, Optional
cdev[0-*]_trip_point The trip point with which cdev[0-*] is associated in this thermal zone
-1 means the cooling device is not associated with any trip point.
RO
Optional
cdev[0-*]_trip_point
The trip point with which cdev[0-*] is associated in this thermal
zone; -1 means the cooling device is not associated with any trip
point.
RO, Optional
******************************
* Cooling device attributes *
******************************
passive
Attribute is only present for zones in which the passive cooling
policy is not supported by native thermal driver. Default is zero
and can be set to a temperature (in millidegrees) to enable a
passive trip point for the zone. Activation is done by polling with
an interval of 1 second.
Unit: millidegrees Celsius
RW, Optional
type String which represents the type of device
eg: For generic ACPI: this should be "Fan",
"Processor" or "LCD"
eg. For memory controller device on intel_menlow platform:
this should be "Memory controller"
RO
Required
*****************************
* Cooling device attributes *
*****************************
max_state The maximum permissible cooling state of this cooling device.
RO
Required
type
String which represents the type of device, e.g.:
- for generic ACPI: should be "Fan", "Processor" or "LCD"
- for memory controller device on intel_menlow platform:
should be "Memory controller".
RO, Required
cur_state The current cooling state of this cooling device.
the value can be any integer between 0 and max_state,
cur_state == 0 means no cooling
cur_state == max_state means the maximum cooling.
RW
Required
max_state
The maximum permissible cooling state of this cooling device.
RO, Required
cur_state
The current cooling state of this cooling device.
The value can be any integer between 0 and max_state:
- cur_state == 0 means no cooling
- cur_state == max_state means the maximum cooling.
RW, Required
3. A simple implementation
ACPI thermal zone may support multiple trip points like critical/hot/passive/active.
If an ACPI thermal zone supports critical, passive, active[0] and active[1] at the same time,
it may register itself as a thermal_zone_device (thermal_zone1) with 4 trip points in all.
It has one processor and one fan, which are both registered as thermal_cooling_device.
If the processor is listed in _PSL method, and the fan is listed in _AL0 method,
the sys I/F structure will be built like this:
ACPI thermal zone may support multiple trip points like critical, hot,
passive, active. If an ACPI thermal zone supports critical, passive,
active[0] and active[1] at the same time, it may register itself as a
thermal_zone_device (thermal_zone1) with 4 trip points in all.
It has one processor and one fan, which are both registered as
thermal_cooling_device.
If the processor is listed in _PSL method, and the fan is listed in _AL0
method, the sys I/F structure will be built like this:
/sys/class/thermal:
|thermal_zone1:
|-----type: acpitz
|-----temp: 37000
|-----mode: kernel
|-----trip_point_0_temp: 100000
|-----trip_point_0_type: critical
|-----trip_point_1_temp: 80000
|-----trip_point_1_type: passive
|-----trip_point_2_temp: 70000
|-----trip_point_2_type: active0
|-----trip_point_3_temp: 60000
|-----trip_point_3_type: active1
|-----cdev0: --->/sys/class/thermal/cooling_device0
|-----cdev0_trip_point: 1 /* cdev0 can be used for passive */
|-----cdev1: --->/sys/class/thermal/cooling_device3
|-----cdev1_trip_point: 2 /* cdev1 can be used for active[0]*/
|---type: acpitz
|---temp: 37000
|---mode: kernel
|---trip_point_0_temp: 100000
|---trip_point_0_type: critical
|---trip_point_1_temp: 80000
|---trip_point_1_type: passive
|---trip_point_2_temp: 70000
|---trip_point_2_type: active0
|---trip_point_3_temp: 60000
|---trip_point_3_type: active1
|---cdev0: --->/sys/class/thermal/cooling_device0
|---cdev0_trip_point: 1 /* cdev0 can be used for passive */
|---cdev1: --->/sys/class/thermal/cooling_device3
|---cdev1_trip_point: 2 /* cdev1 can be used for active[0]*/
|cooling_device0:
|-----type: Processor
|-----max_state: 8
|-----cur_state: 0
|---type: Processor
|---max_state: 8
|---cur_state: 0
|cooling_device3:
|-----type: Fan
|-----max_state: 2
|-----cur_state: 0
|---type: Fan
|---max_state: 2
|---cur_state: 0
/sys/class/hwmon:
|hwmon0:
|-----name: acpitz
|-----temp1_input: 37000
|-----temp1_crit: 100000
|---name: acpitz
|---temp1_input: 37000
|---temp1_crit: 100000
+2
@@ -1231,6 +1231,7 @@ something like this simple program:
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#define _STR(x) #x
#define STR(x) _STR(x)
@@ -1265,6 +1266,7 @@ const char *find_debugfs(void)
return NULL;
}
strcat(debugfs, "/tracing/");
debugfs_found = 1;
return debugfs;
+136
@@ -0,0 +1,136 @@
What is hwpoison?
Upcoming Intel CPUs have support for recovering from some memory errors
(``MCA recovery''). This requires the OS to declare a page "poisoned",
kill the processes associated with it and avoid using it in the future.
This patchkit implements the necessary infrastructure in the VM.
To quote the overview comment:
* High level machine check handler. Handles pages reported by the
* hardware as being corrupted usually due to a 2bit ECC memory or cache
* failure.
*
* This focusses on pages detected as corrupted in the background.
* When the current CPU tries to consume corruption the currently
* running process can just be killed directly instead. This implies
* that if the error cannot be handled for some reason it's safe to
* just ignore it because no corruption has been consumed yet. Instead
* when that happens another machine check will happen.
*
* Handles page cache pages in various states. The tricky part
* here is that we can access any page asynchronous to other VM
* users, because memory failures could happen anytime and anywhere,
* possibly violating some of their assumptions. This is why this code
* has to be extremely careful. Generally it tries to use normal locking
* rules, as in get the standard locks, even if that means the
* error handling takes potentially a long time.
*
* Some of the operations here are somewhat inefficient and have non
* linear algorithmic complexity, because the data structures have not
* been optimized for this case. This is in particular the case
* for the mapping from a vma to a process. Since this case is expected
* to be rare we hope we can get away with this.
The code consists of the high level handler in mm/memory-failure.c,
a new page poison bit and various checks in the VM to handle poisoned
pages.
The main target right now is KVM guests, but it works for all kinds
of applications. KVM support requires a recent qemu-kvm release.
For the KVM use there was need for a new signal type so that
KVM can inject the machine check into the guest with the proper
address. This in theory allows other applications to handle
memory failures too. The expectation is that nearly all applications
won't do that, but some very specialized ones might.
---
There are two (actually three) modes memory failure recovery can be in:
vm.memory_failure_recovery sysctl set to zero:
All memory failures cause a panic. Do not attempt recovery.
(on x86 this can be also affected by the tolerant level of the
MCE subsystem)
early kill
(can be controlled globally and per process)
Send SIGBUS to the application as soon as the error is detected.
This allows applications that can process memory errors gracefully
(e.g. by dropping the affected object) to do so.
This is the mode used by KVM qemu.
late kill
Send SIGBUS when the application runs into the corrupted page.
This is best for applications unaware of memory errors and is the default.
Note some pages are always handled as late kill.
---
User control:
vm.memory_failure_recovery
See sysctl.txt
vm.memory_failure_early_kill
Enable early kill mode globally
PR_MCE_KILL
Set early/late kill mode/revert to system default
arg1: PR_MCE_KILL_CLEAR: Revert to system default
arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode
PR_MCE_KILL_EARLY: Early kill
PR_MCE_KILL_LATE: Late kill
PR_MCE_KILL_DEFAULT: Use system global default
PR_MCE_KILL_GET
return current mode
---
Testing:
madvise(MADV_HWPOISON, ....)
(as root)
Poison a page in the process for testing
hwpoison-inject module through debugfs
/sys/debug/hwpoison/corrupt-pfn
Inject hwpoison fault at PFN echoed into this file
Architecture specific MCE injector
x86 has mce-inject, mce-test
Some portable hwpoison test programs in mce-test, see below.
---
References:
http://halobates.de/mce-lc09-2.pdf
Overview presentation from LinuxCon 09
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
Test suite (hwpoison specific portable tests in tsrc)
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
x86 specific injector
---
Limitations:
- Not all page types are supported and never will be. Most kernel internal
objects cannot be recovered, only LRU pages for now.
- Right now hugepage support is missing.
---
Andi Kleen, Oct 2009
+1 -1
@@ -218,7 +218,7 @@ static void fatal(const char *x, ...)
exit(EXIT_FAILURE);
}
int checked_open(const char *pathname, int flags)
static int checked_open(const char *pathname, int flags)
{
int fd = open(pathname, flags);