Merge branch 'upstream'

Jeff Garzik
2006-01-17 10:29:06 -05:00
5476 changed files with 318127 additions and 155285 deletions
+2
@@ -10,6 +10,7 @@
*.a *.a
*.s *.s
*.ko *.ko
*.so
*.mod.c *.mod.c
# #
@@ -23,6 +24,7 @@ Module.symvers
# Generated include files # Generated include files
# #
include/asm include/asm
include/asm-*/asm-offsets.h
include/config include/config
include/linux/autoconf.h include/linux/autoconf.h
include/linux/compile.h include/linux/compile.h
+2 -1
@@ -1883,6 +1883,7 @@ N: Jaya Kumar
E: jayalk@intworks.biz E: jayalk@intworks.biz
W: http://www.intworks.biz W: http://www.intworks.biz
D: Arc monochrome LCD framebuffer driver, x86 reboot fixups D: Arc monochrome LCD framebuffer driver, x86 reboot fixups
D: pirq addr, CS5535 alsa audio driver
S: Gurgaon, India S: Gurgaon, India
S: Kuala Lumpur, Malaysia S: Kuala Lumpur, Malaysia
@@ -3202,7 +3203,7 @@ N: Eugene Surovegin
E: ebs@ebshome.net E: ebs@ebshome.net
W: http://kernel.ebshome.net/ W: http://kernel.ebshome.net/
P: 1024D/AE5467F1 FF22 39F1 6728 89F6 6E6C 2365 7602 F33D AE54 67F1 P: 1024D/AE5467F1 FF22 39F1 6728 89F6 6E6C 2365 7602 F33D AE54 67F1
D: Embedded PowerPC 4xx: I2C, PIC and random hacks/fixes D: Embedded PowerPC 4xx: EMAC, I2C, PIC and random hacks/fixes
S: Sunnyvale, California 94085 S: Sunnyvale, California 94085
S: USA S: USA
+5 -26
@@ -31,8 +31,6 @@ al espa
Eine deutsche Version dieser Datei finden Sie unter Eine deutsche Version dieser Datei finden Sie unter
<http://www.stefan-winter.de/Changes-2.4.0.txt>. <http://www.stefan-winter.de/Changes-2.4.0.txt>.
Last updated: October 29th, 2002
Chris Ricker (kaboom@gatech.edu or chris.ricker@genetics.utah.edu). Chris Ricker (kaboom@gatech.edu or chris.ricker@genetics.utah.edu).
Current Minimal Requirements Current Minimal Requirements
@@ -48,7 +46,7 @@ necessary on all systems; obviously, if you don't have any ISDN
hardware, for example, you probably needn't concern yourself with hardware, for example, you probably needn't concern yourself with
isdn4k-utils. isdn4k-utils.
o Gnu C 2.95.3 # gcc --version o Gnu C 3.2 # gcc --version
o Gnu make 3.79.1 # make --version o Gnu make 3.79.1 # make --version
o binutils 2.12 # ld -v o binutils 2.12 # ld -v
o util-linux 2.10o # fdformat --version o util-linux 2.10o # fdformat --version
@@ -74,26 +72,7 @@ GCC
--- ---
The gcc version requirements may vary depending on the type of CPU in your The gcc version requirements may vary depending on the type of CPU in your
computer. The next paragraph applies to users of x86 CPUs, but not computer.
necessarily to users of other CPUs. Users of other CPUs should obtain
information about their gcc version requirements from another source.
The recommended compiler for the kernel is gcc 2.95.x (x >= 3), and it
should be used when you need absolute stability. You may use gcc 3.0.x
instead if you wish, although it may cause problems. Later versions of gcc
have not received much testing for Linux kernel compilation, and there are
almost certainly bugs (mainly, but not exclusively, in the kernel) that
will need to be fixed in order to use these compilers. In any case, using
pgcc instead of plain gcc is just asking for trouble.
The Red Hat gcc 2.96 compiler subtree can also be used to build this tree.
You should ensure you use gcc-2.96-74 or later. gcc-2.96-54 will not build
the kernel correctly.
In addition, please pay attention to compiler optimization. Anything
greater than -O2 may not be wise. Similarly, if you choose to use gcc-2.95.x
or derivatives, be sure not to use -fstrict-aliasing (which, depending on
your version of gcc 2.95.x, may necessitate using -fno-strict-aliasing).
Make Make
---- ----
@@ -322,9 +301,9 @@ Getting updated software
Kernel compilation Kernel compilation
****************** ******************
gcc 2.95.3 gcc
---------- ---
o <ftp://ftp.gnu.org/gnu/gcc/gcc-2.95.3.tar.gz> o <ftp://ftp.gnu.org/gnu/gcc/>
Make Make
---- ----
+37 -6
@@ -199,7 +199,7 @@ The rationale is:
modifications are prevented modifications are prevented
- saves the compiler work to optimize redundant code away ;) - saves the compiler work to optimize redundant code away ;)
int fun(int ) int fun(int a)
{ {
int result = 0; int result = 0;
char *buffer = kmalloc(SIZE); char *buffer = kmalloc(SIZE);
@@ -344,7 +344,7 @@ Remember: if another thread can find your data structure, and you don't
have a reference count on it, you almost certainly have a bug. have a reference count on it, you almost certainly have a bug.
Chapter 11: Macros, Enums, Inline functions and RTL Chapter 11: Macros, Enums and RTL
Names of macros defining constants and labels in enums are capitalized. Names of macros defining constants and labels in enums are capitalized.
@@ -429,7 +429,35 @@ from void pointer to any other pointer type is guaranteed by the C programming
language. language.
Chapter 14: References Chapter 14: The inline disease
There appears to be a common misperception that gcc has a magic "make me
faster" speedup option called "inline". While the use of inlines can be
appropriate (for example as a means of replacing macros, see Chapter 11), it
very often is not. Abundant use of the inline keyword leads to a much bigger
kernel, which in turn slows the system as a whole down, due to a bigger
icache footprint for the CPU and simply because there is less memory
available for the pagecache. Just think about it; a pagecache miss causes a
disk seek, which easily takes 5 milliseconds. There are a LOT of cpu cycles
that can go into these 5 milliseconds.
A reasonable rule of thumb is to not put inline at functions that have more
than 3 lines of code in them. An exception to this rule are the cases where
a parameter is known to be a compiletime constant, and as a result of this
constantness you *know* the compiler will be able to optimize most of your
function away at compile time. For a good example of this latter case, see
the kmalloc() inline function.
Often people argue that adding inline to functions that are static and used
only once is always a win since there is no space tradeoff. While this is
technically correct, gcc is capable of inlining these automatically without
help, and the maintenance issue of removing the inline when a second user
appears outweighs the potential value of the hint that tells gcc to do
something it would have done anyway.
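The rule of thumb in the new "inline disease" chapter can be illustrated with a small sketch. This is not kernel code: is_power_of_two() and size_class() are hypothetical helpers, and size_class() only loosely imitates the idea behind the kmalloc() inline mentioned above. The first helper is a one-liner, so an inline hint is reasonable. The second is longer than three lines, but when it is called with a compile-time constant the compiler can usually fold the whole chain of comparisons down to a single value, which is the exception the chapter describes.

```c
#include <stddef.h>

/* One line of code: short enough that "inline" is a reasonable hint. */
static inline int is_power_of_two(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical size-class lookup (an assumption for illustration only).
 * It is longer than three lines, but a call such as size_class(40)
 * passes a compile-time constant, so an optimizing compiler can
 * typically reduce the whole body to the constant 1.
 */
static inline int size_class(size_t size)
{
	if (size <= 32)
		return 0;
	if (size <= 64)
		return 1;
	if (size <= 128)
		return 2;
	return 3;
}
```

If size_class() were mostly called with run-time values instead, the same rule of thumb would suggest dropping the inline keyword and leaving the decision to gcc.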
Chapter 15: References
The C Programming Language, Second Edition The C Programming Language, Second Edition
by Brian W. Kernighan and Dennis M. Ritchie. by Brian W. Kernighan and Dennis M. Ritchie.
@@ -444,10 +472,13 @@ ISBN 0-201-61586-X.
URL: http://cm.bell-labs.com/cm/cs/tpop/ URL: http://cm.bell-labs.com/cm/cs/tpop/
GNU manuals - where in compliance with K&R and this text - for cpp, gcc, GNU manuals - where in compliance with K&R and this text - for cpp, gcc,
gcc internals and indent, all available from http://www.gnu.org gcc internals and indent, all available from http://www.gnu.org/manual/
WG14 is the international standardization working group for the programming WG14 is the international standardization working group for the programming
language C, URL: http://std.dkuug.dk/JTC1/SC22/WG14/ language C, URL: http://www.open-std.org/JTC1/SC22/WG14/
Kernel CodingStyle, by greg@kroah.com at OLS 2002:
http://www.kroah.com/linux/talks/ols_2002_kernel_codingstyle_talk/html/
-- --
Last updated on 16 February 2004 by a community effort on LKML. Last updated on 30 December 2005 by a community effort on LKML.
+6
@@ -0,0 +1,6 @@
*.xml
*.ps
*.pdf
*.html
*.9.gz
*.9
+6
@@ -53,6 +53,11 @@
!Iinclude/linux/sched.h !Iinclude/linux/sched.h
!Ekernel/sched.c !Ekernel/sched.c
!Ekernel/timer.c !Ekernel/timer.c
</sect1>
<sect1><title>High-resolution timers</title>
!Iinclude/linux/ktime.h
!Iinclude/linux/hrtimer.h
!Ekernel/hrtimer.c
</sect1> </sect1>
<sect1><title>Internal Functions</title> <sect1><title>Internal Functions</title>
!Ikernel/exit.c !Ikernel/exit.c
@@ -369,6 +374,7 @@ X!Edrivers/acpi/motherboard.c
X!Edrivers/acpi/bus.c X!Edrivers/acpi/bus.c
--> -->
!Edrivers/acpi/scan.c !Edrivers/acpi/scan.c
!Idrivers/acpi/scan.c
<!-- No correct structured comments <!-- No correct structured comments
X!Edrivers/acpi/pci_bind.c X!Edrivers/acpi/pci_bind.c
--> -->
+14 -8
@@ -222,7 +222,7 @@
<title>Two Main Types of Kernel Locks: Spinlocks and Semaphores</title> <title>Two Main Types of Kernel Locks: Spinlocks and Semaphores</title>
<para> <para>
There are two main types of kernel locks. The fundamental type There are three main types of kernel locks. The fundamental type
is the spinlock is the spinlock
(<filename class="headerfile">include/asm/spinlock.h</filename>), (<filename class="headerfile">include/asm/spinlock.h</filename>),
which is a very simple single-holder lock: if you can't get the which is a very simple single-holder lock: if you can't get the
@@ -230,16 +230,22 @@
very small and fast, and can be used anywhere. very small and fast, and can be used anywhere.
</para> </para>
<para> <para>
The second type is a semaphore The second type is a mutex
(<filename class="headerfile">include/linux/mutex.h</filename>): it
is like a spinlock, but you may block holding a mutex.
If you can't lock a mutex, your task will suspend itself, and be woken
up when the mutex is released. This means the CPU can do something
else while you are waiting. There are many cases when you simply
can't sleep (see <xref linkend="sleeping-things"/>), and so have to
use a spinlock instead.
</para>
<para>
The third type is a semaphore
(<filename class="headerfile">include/asm/semaphore.h</filename>): it (<filename class="headerfile">include/asm/semaphore.h</filename>): it
can have more than one holder at any time (the number decided at can have more than one holder at any time (the number decided at
initialization time), although it is most commonly used as a initialization time), although it is most commonly used as a
single-holder lock (a mutex). If you can't get a semaphore, single-holder lock (a mutex). If you can't get a semaphore, your
your task will put itself on the queue, and be woken up when the task will be suspended and later on woken up - just like for mutexes.
semaphore is released. This means the CPU will do something
else while you are waiting, but there are many cases when you
simply can't sleep (see <xref linkend="sleeping-things"/>), and so
have to use a spinlock instead.
</para> </para>
<para> <para>
Neither type of lock is recursive: see Neither type of lock is recursive: see
+1
@@ -253,6 +253,7 @@
!Edrivers/usb/core/urb.c !Edrivers/usb/core/urb.c
!Edrivers/usb/core/message.c !Edrivers/usb/core/message.c
!Edrivers/usb/core/file.c !Edrivers/usb/core/file.c
!Edrivers/usb/core/driver.c
!Edrivers/usb/core/usb.c !Edrivers/usb/core/usb.c
!Edrivers/usb/core/hub.c !Edrivers/usb/core/hub.c
</chapter> </chapter>
+2 -2
@@ -229,7 +229,7 @@ int __init myradio_init(struct video_init *v)
static int users = 0; static int users = 0;
static int radio_open(stuct video_device *dev, int flags) static int radio_open(struct video_device *dev, int flags)
{ {
if(users) if(users)
return -EBUSY; return -EBUSY;
@@ -949,7 +949,7 @@ int __init mycamera_init(struct video_init *v)
static int users = 0; static int users = 0;
static int camera_open(stuct video_device *dev, int flags) static int camera_open(struct video_device *dev, int flags)
{ {
if(users) if(users)
return -EBUSY; return -EBUSY;
+40 -47
@@ -1,74 +1,67 @@
Refcounter framework for elements of lists/arrays protected by Refcounter design for elements of lists/arrays protected by RCU.
RCU.
Refcounting on elements of lists which are protected by traditional Refcounting on elements of lists which are protected by traditional
reader/writer spinlocks or semaphores are straight forward as in: reader/writer spinlocks or semaphores are straight forward as in:
1. 2. 1. 2.
add() search_and_reference() add() search_and_reference()
{ { { {
alloc_object read_lock(&list_lock); alloc_object read_lock(&list_lock);
... search_for_element ... search_for_element
atomic_set(&el->rc, 1); atomic_inc(&el->rc); atomic_set(&el->rc, 1); atomic_inc(&el->rc);
write_lock(&list_lock); ... write_lock(&list_lock); ...
add_element read_unlock(&list_lock); add_element read_unlock(&list_lock);
... ... ... ...
write_unlock(&list_lock); } write_unlock(&list_lock); }
} }
3. 4. 3. 4.
release_referenced() delete() release_referenced() delete()
{ { { {
... write_lock(&list_lock); ... write_lock(&list_lock);
atomic_dec(&el->rc, relfunc) ... atomic_dec(&el->rc, relfunc) ...
... delete_element ... delete_element
} write_unlock(&list_lock); } write_unlock(&list_lock);
... ...
if (atomic_dec_and_test(&el->rc)) if (atomic_dec_and_test(&el->rc))
kfree(el); kfree(el);
... ...
} }
If this list/array is made lock free using rcu as in changing the If this list/array is made lock free using rcu as in changing the
write_lock in add() and delete() to spin_lock and changing read_lock write_lock in add() and delete() to spin_lock and changing read_lock
in search_and_reference to rcu_read_lock(), the rcuref_get in in search_and_reference to rcu_read_lock(), the atomic_get in
search_and_reference could potentially hold reference to an element which search_and_reference could potentially hold reference to an element which
has already been deleted from the list/array. rcuref_lf_get_rcu takes has already been deleted from the list/array. atomic_inc_not_zero takes
care of this scenario. search_and_reference should look as; care of this scenario. search_and_reference should look as;
1. 2. 1. 2.
add() search_and_reference() add() search_and_reference()
{ { { {
alloc_object rcu_read_lock(); alloc_object rcu_read_lock();
... search_for_element ... search_for_element
atomic_set(&el->rc, 1); if (rcuref_inc_lf(&el->rc)) { atomic_set(&el->rc, 1); if (atomic_inc_not_zero(&el->rc)) {
write_lock(&list_lock); rcu_read_unlock(); write_lock(&list_lock); rcu_read_unlock();
return FAIL; return FAIL;
add_element } add_element }
... ... ... ...
write_unlock(&list_lock); rcu_read_unlock(); write_unlock(&list_lock); rcu_read_unlock();
} } } }
3. 4. 3. 4.
release_referenced() delete() release_referenced() delete()
{ { { {
... write_lock(&list_lock); ... write_lock(&list_lock);
rcuref_dec(&el->rc, relfunc) ... atomic_dec(&el->rc, relfunc) ...
... delete_element ... delete_element
} write_unlock(&list_lock); } write_unlock(&list_lock);
... ...
if (rcuref_dec_and_test(&el->rc)) if (atomic_dec_and_test(&el->rc))
call_rcu(&el->head, el_free); call_rcu(&el->head, el_free);
... ...
} }
Sometimes, reference to the element need to be obtained in the Sometimes, reference to the element need to be obtained in the
update (write) stream. In such cases, rcuref_inc_lf might be an overkill update (write) stream. In such cases, atomic_inc_not_zero might be an
since the spinlock serialising list updates are held. rcuref_inc overkill since the spinlock serialising list updates are held. atomic_inc
is to be used in such cases. is to be used in such cases.
For arches which do not have cmpxchg rcuref_inc_lf
api uses a hashed spinlock implementation and the same hashed spinlock
is acquired in all rcuref_xxx primitives to preserve atomicity.
Note: Use rcuref_inc api only if you need to use rcuref_inc_lf on the
refcounter atleast at one place. Mixing rcuref_inc and atomic_xxx api
might lead to races. rcuref_inc_lf() must be used in lockfree
RCU critical sections only.
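The "increment unless already zero" behaviour that the updated text relies on can be sketched in userspace with C11 atomics. This is only an illustration under stated assumptions: ref_inc_not_zero() and ref_dec_and_test() are hypothetical stand-ins, not the kernel's atomic_inc_not_zero() or atomic_dec_and_test(), and no RCU is involved. In this sketch the increment helper returns true when it took a reference, so a lock-free lookup would give up on the element when it returns false.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for an "increment unless zero" primitive.
 * Returns true if a reference was taken, false if the count had
 * already dropped to zero (the element is on its way to being freed).
 */
static bool ref_inc_not_zero(atomic_int *rc)
{
	int old = atomic_load(rc);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(rc, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop one reference; returns true if the caller dropped the last one. */
static bool ref_dec_and_test(atomic_int *rc)
{
	return atomic_fetch_sub(rc, 1) == 1;
}
```

A plain increment would be enough on the update side, where the list spinlock is held and the element cannot reach zero concurrently; the compare-and-swap loop is only needed on the lock-free read side.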
+16 -8
@@ -27,18 +27,17 @@ Who To Submit Drivers To
------------------------ ------------------------
Linux 2.0: Linux 2.0:
No new drivers are accepted for this kernel tree No new drivers are accepted for this kernel tree.
Linux 2.2: Linux 2.2:
No new drivers are accepted for this kernel tree.
Linux 2.4:
If the code area has a general maintainer then please submit it to If the code area has a general maintainer then please submit it to
the maintainer listed in MAINTAINERS in the kernel file. If the the maintainer listed in MAINTAINERS in the kernel file. If the
maintainer does not respond or you cannot find the appropriate maintainer does not respond or you cannot find the appropriate
maintainer then please contact the 2.2 kernel maintainer: maintainer then please contact Marcelo Tosatti
Marc-Christian Petersen <m.c.p@wolk-project.de>. <marcelo.tosatti@cyclades.com>.
Linux 2.4:
The same rules apply as 2.2. The final contact point for Linux 2.4
submissions is Marcelo Tosatti <marcelo.tosatti@cyclades.com>.
Linux 2.6: Linux 2.6:
The same rules apply as 2.4 except that you should follow linux-kernel The same rules apply as 2.4 except that you should follow linux-kernel
@@ -53,6 +52,7 @@ Licensing: The code must be released to us under the
of exclusive GPL licensing, and if you wish the driver of exclusive GPL licensing, and if you wish the driver
to be useful to other communities such as BSD you may well to be useful to other communities such as BSD you may well
wish to release under multiple licenses. wish to release under multiple licenses.
See accepted licenses at include/linux/module.h
Copyright: The copyright owner must agree to use of GPL. Copyright: The copyright owner must agree to use of GPL.
It's best if the submitter and copyright owner It's best if the submitter and copyright owner
@@ -143,5 +143,13 @@ KernelNewbies:
http://kernelnewbies.org/ http://kernelnewbies.org/
Linux USB project: Linux USB project:
http://sourceforge.net/projects/linux-usb/ http://www.linux-usb.org/
How to NOT write kernel driver by arjanv@redhat.com
http://people.redhat.com/arjanv/olspaper.pdf
Kernel Janitor:
http://janitor.kernelnewbies.org/
--
Last updated on 17 Nov 2005.
+48 -20
@@ -78,7 +78,9 @@ Randy Dunlap's patch scripts:
http://www.xenotime.net/linux/scripts/patching-scripts-002.tar.gz http://www.xenotime.net/linux/scripts/patching-scripts-002.tar.gz
Andrew Morton's patch scripts: Andrew Morton's patch scripts:
http://www.zip.com.au/~akpm/linux/patches/patch-scripts-0.20 http://www.zip.com.au/~akpm/linux/patches/
Instead of these scripts, quilt is the recommended patch management
tool (see above).
@@ -97,7 +99,7 @@ need to split up your patch. See #3, next.
3) Separate your changes. 3) Separate your changes.
Separate each logical change into its own patch. Separate _logical changes_ into a single patch file.
For example, if your changes include both bug fixes and performance For example, if your changes include both bug fixes and performance
enhancements for a single driver, separate those changes into two enhancements for a single driver, separate those changes into two
@@ -112,6 +114,10 @@ If one patch depends on another patch in order for a change to be
complete, that is OK. Simply note "this patch depends on patch X" complete, that is OK. Simply note "this patch depends on patch X"
in your patch description. in your patch description.
If you cannot condense your patch set into a smaller set of patches,
then only post say 15 or so at a time and wait for review and integration.
4) Select e-mail destination. 4) Select e-mail destination.
@@ -124,6 +130,10 @@ your patch to the primary Linux kernel developer's mailing list,
linux-kernel@vger.kernel.org. Most kernel developers monitor this linux-kernel@vger.kernel.org. Most kernel developers monitor this
e-mail list, and can comment on your changes. e-mail list, and can comment on your changes.
Do not send more than 15 patches at once to the vger mailing lists!!!
Linus Torvalds is the final arbiter of all changes accepted into the Linus Torvalds is the final arbiter of all changes accepted into the
Linux kernel. His e-mail address is <torvalds@osdl.org>. He gets Linux kernel. His e-mail address is <torvalds@osdl.org>. He gets
a lot of e-mail, so typically you should do your best to -avoid- sending a lot of e-mail, so typically you should do your best to -avoid- sending
@@ -149,6 +159,9 @@ USB, framebuffer devices, the VFS, the SCSI subsystem, etc. See the
MAINTAINERS file for a mailing list that relates specifically to MAINTAINERS file for a mailing list that relates specifically to
your change. your change.
Majordomo lists of VGER.KERNEL.ORG at:
<http://vger.kernel.org/vger-lists.html>
If changes affect userland-kernel interfaces, please send If changes affect userland-kernel interfaces, please send
the MAN-PAGES maintainer (as listed in the MAINTAINERS file) the MAN-PAGES maintainer (as listed in the MAINTAINERS file)
a man-pages patch, or at least a notification of the change, a man-pages patch, or at least a notification of the change,
@@ -158,7 +171,7 @@ Even if the maintainer did not respond in step #4, make sure to ALWAYS
copy the maintainer when you change their code. copy the maintainer when you change their code.
For small patches you may want to CC the Trivial Patch Monkey For small patches you may want to CC the Trivial Patch Monkey
trivial@rustcorp.com.au set up by Rusty Russell; which collects "trivial" trivial@kernel.org managed by Adrian Bunk; which collects "trivial"
patches. Trivial patches must qualify for one of the following rules: patches. Trivial patches must qualify for one of the following rules:
Spelling fixes in documentation Spelling fixes in documentation
Spelling fixes which could break grep(1). Spelling fixes which could break grep(1).
@@ -171,7 +184,7 @@ patches. Trivial patches must qualify for one of the following rules:
since people copy, as long as it's trivial) since people copy, as long as it's trivial)
Any fix by the author/maintainer of the file. (ie. patch monkey Any fix by the author/maintainer of the file. (ie. patch monkey
in re-transmission mode) in re-transmission mode)
URL: <http://www.kernel.org/pub/linux/kernel/people/rusty/trivial/> URL: <http://www.kernel.org/pub/linux/kernel/people/bunk/trivial/>
@@ -373,27 +386,14 @@ a diffstat, to show what files have changed, and the number of inserted
and deleted lines per file. A diffstat is especially useful on bigger and deleted lines per file. A diffstat is especially useful on bigger
patches. Other comments relevant only to the moment or the maintainer, patches. Other comments relevant only to the moment or the maintainer,
not suitable for the permanent changelog, should also go here. not suitable for the permanent changelog, should also go here.
Use diffstat options "-p 1 -w 70" so that filenames are listed from the
top of the kernel source tree and don't use too much horizontal space
(easily fit in 80 columns, maybe with some indentation).
See more details on the proper patch format in the following See more details on the proper patch format in the following
references. references.
13) More references for submitting patches
Andrew Morton, "The perfect patch" (tpp).
<http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt>
Jeff Garzik, "Linux kernel patch submission format."
<http://linux.yyz.us/patch-format.html>
Greg KH, "How to piss off a kernel subsystem maintainer"
<http://www.kroah.com/log/2005/03/31/>
Kernel Documentation/CodingStyle
<http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle>
Linus Torvald's mail on the canonical patch format:
<http://lkml.org/lkml/2005/4/7/183>
----------------------------------- -----------------------------------
@@ -466,3 +466,31 @@ and 'extern __inline__'.
Don't try to anticipate nebulous future cases which may or may not Don't try to anticipate nebulous future cases which may or may not
be useful: "Make it as simple as you can, and no simpler." be useful: "Make it as simple as you can, and no simpler."
----------------------
SECTION 3 - REFERENCES
----------------------
Andrew Morton, "The perfect patch" (tpp).
<http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt>
Jeff Garzik, "Linux kernel patch submission format."
<http://linux.yyz.us/patch-format.html>
Greg Kroah-Hartman "How to piss off a kernel subsystem maintainer".
<http://www.kroah.com/log/2005/03/31/>
<http://www.kroah.com/log/2005/07/08/>
<http://www.kroah.com/log/2005/10/19/>
<http://www.kroah.com/log/2006/01/11/>
NO!!!! No more huge patch bombs to linux-kernel@vger.kernel.org people!.
<http://marc.theaimsgroup.com/?l=linux-kernel&m=112112749912944&w=2>
Kernel Documentation/CodingStyle
<http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle>
Linus Torvald's mail on the canonical patch format:
<http://lkml.org/lkml/2005/4/7/183>
--
Last updated on 17 Nov 2005.
+48 -33
@@ -2,8 +2,8 @@
Applying Patches To The Linux Kernel Applying Patches To The Linux Kernel
------------------------------------ ------------------------------------
(Written by Jesper Juhl, August 2005) Original by: Jesper Juhl, August 2005
Last update: 2006-01-05
A frequently asked question on the Linux Kernel Mailing List is how to apply A frequently asked question on the Linux Kernel Mailing List is how to apply
@@ -76,7 +76,7 @@ instead:
If you wish to uncompress the patch file by hand first before applying it If you wish to uncompress the patch file by hand first before applying it
(what I assume you've done in the examples below), then you simply run (what I assume you've done in the examples below), then you simply run
gunzip or bunzip2 on the file - like this: gunzip or bunzip2 on the file -- like this:
gunzip patch-x.y.z.gz gunzip patch-x.y.z.gz
bunzip2 patch-x.y.z.bz2 bunzip2 patch-x.y.z.bz2
@@ -94,7 +94,7 @@ Common errors when patching
--- ---
When patch applies a patch file it attempts to verify the sanity of the When patch applies a patch file it attempts to verify the sanity of the
file in different ways. file in different ways.
Checking that the file looks like a valid patch file, checking the code Checking that the file looks like a valid patch file & checking the code
around the bits being modified matches the context provided in the patch are around the bits being modified matches the context provided in the patch are
just two of the basic sanity checks patch does. just two of the basic sanity checks patch does.
@@ -118,16 +118,16 @@ wrong.
When patch encounters a change that it can't fix up with fuzz it rejects it When patch encounters a change that it can't fix up with fuzz it rejects it
outright and leaves a file with a .rej extension (a reject file). You can outright and leaves a file with a .rej extension (a reject file). You can
read this file to see exactely what change couldn't be applied, so you can read this file to see exactly what change couldn't be applied, so you can
go fix it up by hand if you wish. go fix it up by hand if you wish.
If you don't have any third party patches applied to your kernel source, but If you don't have any third-party patches applied to your kernel source, but
only patches from kernel.org and you apply the patches in the correct order, only patches from kernel.org and you apply the patches in the correct order,
and have made no modifications yourself to the source files, then you should and have made no modifications yourself to the source files, then you should
never see a fuzz or reject message from patch. If you do see such messages never see a fuzz or reject message from patch. If you do see such messages
anyway, then there's a high risk that either your local source tree or the anyway, then there's a high risk that either your local source tree or the
patch file is corrupted in some way. In that case you should probably try patch file is corrupted in some way. In that case you should probably try
redownloading the patch and if things are still not OK then you'd be advised re-downloading the patch and if things are still not OK then you'd be advised
to start with a fresh tree downloaded in full from kernel.org. to start with a fresh tree downloaded in full from kernel.org.
Let's look a bit more at some of the messages patch can produce. Let's look a bit more at some of the messages patch can produce.
@@ -136,7 +136,7 @@ If patch stops and presents a "File to patch:" prompt, then patch could not
find a file to be patched. Most likely you forgot to specify -p1 or you are find a file to be patched. Most likely you forgot to specify -p1 or you are
in the wrong directory. Less often, you'll find patches that need to be in the wrong directory. Less often, you'll find patches that need to be
applied with -p0 instead of -p1 (reading the patch file should reveal if applied with -p0 instead of -p1 (reading the patch file should reveal if
this is the case - if so, then this is an error by the person who created this is the case -- if so, then this is an error by the person who created
the patch but is not fatal). the patch but is not fatal).
If you get "Hunk #2 succeeded at 1887 with fuzz 2 (offset 7 lines)." or a If you get "Hunk #2 succeeded at 1887 with fuzz 2 (offset 7 lines)." or a
@@ -167,22 +167,28 @@ the patch will in fact apply it.
A message similar to "patch: **** unexpected end of file in patch" or "patch A message similar to "patch: **** unexpected end of file in patch" or "patch
unexpectedly ends in middle of line" means that patch could make no sense of unexpectedly ends in middle of line" means that patch could make no sense of
the file you fed to it. Either your download is broken or you tried to feed the file you fed to it. Either your download is broken, you tried to feed
patch a compressed patch file without uncompressing it first. patch a compressed patch file without uncompressing it first, or the patch
file that you are using has been mangled by a mail client or mail transfer
agent along the way somewhere, e.g., by splitting a long line into two lines.
Often these warnings can easily be fixed by joining (concatenating) the
two lines that had been split.
As I already mentioned above, these errors should never happen if you apply As I already mentioned above, these errors should never happen if you apply
a patch from kernel.org to the correct version of an unmodified source tree. a patch from kernel.org to the correct version of an unmodified source tree.
So if you get these errors with kernel.org patches then you should probably So if you get these errors with kernel.org patches then you should probably
assume that either your patch file or your tree is broken and I'd advice you assume that either your patch file or your tree is broken and I'd advise you
to start over with a fresh download of a full kernel tree and the patch you to start over with a fresh download of a full kernel tree and the patch you
wish to apply. wish to apply.
Are there any alternatives to `patch'? Are there any alternatives to `patch'?
--- ---
Yes there are alternatives. You can use the `interdiff' program Yes there are alternatives.
(http://cyberelk.net/tim/patchutils/) to generate a patch representing the
differences between two patches and then apply the result. You can use the `interdiff' program (http://cyberelk.net/tim/patchutils/) to
generate a patch representing the differences between two patches and then
apply the result.
This will let you move from something like 2.6.12.2 to 2.6.12.3 in a single This will let you move from something like 2.6.12.2 to 2.6.12.3 in a single
step. The -z flag to interdiff will even let you feed it patches in gzip or step. The -z flag to interdiff will even let you feed it patches in gzip or
bzip2 compressed form directly without the use of zcat or bzcat or manual bzip2 compressed form directly without the use of zcat or bzcat or manual
@@ -197,10 +203,10 @@ do the additional steps since interdiff can get things wrong in some cases.
Another alternative is `ketchup', which is a python script for automatic Another alternative is `ketchup', which is a python script for automatic
downloading and applying of patches (http://www.selenic.com/ketchup/). downloading and applying of patches (http://www.selenic.com/ketchup/).
Other nice tools are diffstat which shows a summary of changes made by a Other nice tools are diffstat, which shows a summary of changes made by a
patch, lsdiff which displays a short listing of affected files in a patch patch; lsdiff, which displays a short listing of affected files in a patch
file, along with (optionally) the line numbers of the start of each patch file, along with (optionally) the line numbers of the start of each patch;
and grepdiff which displays a list of the files modified by a patch where and grepdiff, which displays a list of the files modified by a patch where
the patch contains a given regular expression. the patch contains a given regular expression.
@@ -225,8 +231,8 @@ The -mm kernels live at
In place of ftp.kernel.org you can use ftp.cc.kernel.org, where cc is a In place of ftp.kernel.org you can use ftp.cc.kernel.org, where cc is a
country code. This way you'll be downloading from a mirror site that's most country code. This way you'll be downloading from a mirror site that's most
likely geographically closer to you, resulting in faster downloads for you, likely geographically closer to you, resulting in faster downloads for you,
less bandwidth used globally and less load on the main kernel.org servers - less bandwidth used globally and less load on the main kernel.org servers --
these are good things, do use mirrors when possible. these are good things, so do use mirrors when possible.
The 2.6.x kernels The 2.6.x kernels
@@ -234,14 +240,14 @@ The 2.6.x kernels
These are the base stable releases released by Linus. The highest numbered These are the base stable releases released by Linus. The highest numbered
release is the most recent. release is the most recent.
If regressions or other serious flaws are found then a -stable fix patch If regressions or other serious flaws are found, then a -stable fix patch
will be released (see below) on top of this base. Once a new 2.6.x base will be released (see below) on top of this base. Once a new 2.6.x base
kernel is released, a patch is made available that is a delta between the kernel is released, a patch is made available that is a delta between the
previous 2.6.x kernel and the new one. previous 2.6.x kernel and the new one.
To apply a patch moving from 2.6.11 to 2.6.12 you'd do the following (note To apply a patch moving from 2.6.11 to 2.6.12, you'd do the following (note
that such patches do *NOT* apply on top of 2.6.x.y kernels but on top of the that such patches do *NOT* apply on top of 2.6.x.y kernels but on top of the
base 2.6.x kernel - if you need to move from 2.6.x.y to 2.6.x+1 you need to base 2.6.x kernel -- if you need to move from 2.6.x.y to 2.6.x+1 you need to
first revert the 2.6.x.y patch). first revert the 2.6.x.y patch).
Here are some examples: Here are some examples:
@@ -258,12 +264,12 @@ $ patch -p1 -R < ../patch-2.6.11.1 # revert the 2.6.11.1 patch
# source dir is now 2.6.11 # source dir is now 2.6.11
$ patch -p1 < ../patch-2.6.12 # apply new 2.6.12 patch $ patch -p1 < ../patch-2.6.12 # apply new 2.6.12 patch
$ cd .. $ cd ..
$ mv linux-2.6.11.1 inux-2.6.12 # rename source dir $ mv linux-2.6.11.1 linux-2.6.12 # rename source dir
The 2.6.x.y kernels The 2.6.x.y kernels
--- ---
Kernels with 4 digit versions are -stable kernels. They contain small(ish) Kernels with 4-digit versions are -stable kernels. They contain small(ish)
critical fixes for security problems or significant regressions discovered critical fixes for security problems or significant regressions discovered
in a given 2.6.x kernel. in a given 2.6.x kernel.
@@ -274,9 +280,14 @@ versions.
If no 2.6.x.y kernel is available, then the highest numbered 2.6.x kernel is If no 2.6.x.y kernel is available, then the highest numbered 2.6.x kernel is
the current stable kernel. the current stable kernel.
note: the -stable team usually do make incremental patches available as well
as patches against the latest mainline release, but I only cover the
non-incremental ones below. The incremental ones can be found at
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/incr/
These patches are not incremental, meaning that for example the 2.6.12.3 These patches are not incremental, meaning that for example the 2.6.12.3
patch does not apply on top of the 2.6.12.2 kernel source, but rather on top patch does not apply on top of the 2.6.12.2 kernel source, but rather on top
of the base 2.6.12 kernel source. of the base 2.6.12 kernel source .
So, in order to apply the 2.6.12.3 patch to your existing 2.6.12.2 kernel So, in order to apply the 2.6.12.3 patch to your existing 2.6.12.2 kernel
source you have to first back out the 2.6.12.2 patch (so you are left with a source you have to first back out the 2.6.12.2 patch (so you are left with a
base 2.6.12 kernel source) and then apply the new 2.6.12.3 patch. base 2.6.12 kernel source) and then apply the new 2.6.12.3 patch.
@@ -342,12 +353,12 @@ The -git kernels
repository, hence the name). repository, hence the name).
These patches are usually released daily and represent the current state of These patches are usually released daily and represent the current state of
Linus' tree. They are more experimental than -rc kernels since they are Linus's tree. They are more experimental than -rc kernels since they are
generated automatically without even a cursory glance to see if they are generated automatically without even a cursory glance to see if they are
sane. sane.
-git patches are not incremental and apply either to a base 2.6.x kernel or -git patches are not incremental and apply either to a base 2.6.x kernel or
a base 2.6.x-rc kernel - you can see which from their name. a base 2.6.x-rc kernel -- you can see which from their name.
A patch named 2.6.12-git1 applies to the 2.6.12 kernel source and a patch A patch named 2.6.12-git1 applies to the 2.6.12 kernel source and a patch
named 2.6.13-rc3-git2 applies to the source of the 2.6.13-rc3 kernel. named 2.6.13-rc3-git2 applies to the source of the 2.6.13-rc3 kernel.
@@ -390,12 +401,12 @@ You should generally strive to get your patches into mainline via -mm to
ensure maximum testing. ensure maximum testing.
This branch is in constant flux and contains many experimental features, a This branch is in constant flux and contains many experimental features, a
lot of debugging patches not appropriate for mainline etc and is the most lot of debugging patches not appropriate for mainline etc., and is the most
experimental of the branches described in this document. experimental of the branches described in this document.
These kernels are not appropriate for use on systems that are supposed to be These kernels are not appropriate for use on systems that are supposed to be
stable and they are more risky to run than any of the other branches (make stable and they are more risky to run than any of the other branches (make
sure you have up-to-date backups - that goes for any experimental kernel but sure you have up-to-date backups -- that goes for any experimental kernel but
even more so for -mm kernels). even more so for -mm kernels).
These kernels in addition to all the other experimental patches they contain These kernels in addition to all the other experimental patches they contain
@@ -433,7 +444,11 @@ $ cd ..
$ mv linux-2.6.12-mm1 linux-2.6.13-rc3-mm3 # rename the source dir $ mv linux-2.6.12-mm1 linux-2.6.13-rc3-mm3 # rename the source dir
This concludes this list of explanations of the various kernel trees and I This concludes this list of explanations of the various kernel trees.
hope you are now crystal clear on how to apply the various patches and help I hope you are now clear on how to apply the various patches and help testing
testing the kernel. the kernel.
Thank you's to Randy Dunlap, Rolf Eike Beer, Linus Torvalds, Bodo Eggert,
Johannes Stezenbach, Grant Coady, Pavel Machek and others that I may have
forgotten for their reviews and contributions to this document.
+271
@@ -0,0 +1,271 @@
I/O Barriers
============
Tejun Heo <htejun@gmail.com>, July 22 2005
I/O barrier requests are used to guarantee ordering around the barrier
requests. Unless you're crazy enough to use disk drives for
implementing synchronization constructs (wow, sounds interesting...),
the ordering is meaningful only for write requests for things like
journal checkpoints. All requests queued before a barrier request
must be finished (made it to the physical medium) before the barrier
request is started, and all requests queued after the barrier request
must be started only after the barrier request is finished (again,
made it to the physical medium).
In other words, I/O barrier requests have the following two properties.
1. Request ordering
Requests cannot pass the barrier request. Preceding requests are
processed before the barrier and following requests after.
Depending on what features a drive supports, this can be done in one
of the following three ways.
i. For devices which have queue depth greater than 1 (TCQ devices) and
support ordered tags, block layer can just issue the barrier as an
ordered request and the lower level driver, controller and drive
itself are responsible for making sure that the ordering constraint is
met. Most modern SCSI controllers/drives should support this.
NOTE: SCSI ordered tag isn't currently used due to limitation in the
SCSI midlayer, see the following random notes section.
ii. For devices which have queue depth greater than 1 but don't
support ordered tags, block layer ensures that the requests preceding
a barrier request finish before issuing the barrier request. Also,
it defers requests following the barrier until the barrier request is
finished. Older SCSI controllers/drives and SATA drives fall in this
category.
iii. Devices which have queue depth of 1. This is a degenerate case
of ii. Just keeping issue order suffices. Ancient SCSI
controllers/drives and IDE drives are in this category.
2. Forced flushing to physical medium
Again, if you're not gonna do synchronization with disk drives (dang,
it sounds even more appealing now!), the reason you use I/O barriers
is mainly to protect filesystem integrity when power failure or some
other events abruptly stop the drive from operating and possibly make
the drive lose data in its cache. So, I/O barriers need to guarantee
that requests actually get written to non-volatile medium in order.
There are four cases:
i. No write-back cache. Keeping requests ordered is enough.
ii. Write-back cache but no flush operation. There's no way to
guarantee physical-medium commit order. This kind of device can't do
I/O barriers.
iii. Write-back cache and flush operation but no FUA (forced unit
access). We need two cache flushes - before and after the barrier
request.
iv. Write-back cache, flush operation and FUA. We still need one
flush to make sure requests preceding a barrier are written to medium,
but post-barrier flush can be avoided by using FUA write on the
barrier itself.
How to support barrier requests in drivers
------------------------------------------
All barrier handling is done inside block layer proper. All low level
drivers have to do is implement their prepare_flush_fn and use one of
the following two functions to indicate what barrier type they support
and how to prepare flush requests. Note that the term 'ordered' is
used to indicate the whole sequence of performing barrier requests
including draining and flushing.
typedef void (prepare_flush_fn)(request_queue_t *q, struct request *rq);

int blk_queue_ordered(request_queue_t *q, unsigned ordered,
		      prepare_flush_fn *prepare_flush_fn,
		      unsigned gfp_mask);

int blk_queue_ordered_locked(request_queue_t *q, unsigned ordered,
			     prepare_flush_fn *prepare_flush_fn,
			     unsigned gfp_mask);
The only difference between the two functions is whether or not the
caller is holding q->queue_lock on entry. The latter expects the
caller to be holding the lock.
@q                : the queue in question
@ordered          : the ordered mode the driver/device supports
@prepare_flush_fn : this function should prepare @rq such that it
                    flushes cache to physical medium when executed
@gfp_mask         : gfp_mask used when allocating data structures
                    for ordered processing
For example, SCSI disk driver's prepare_flush_fn looks like the
following.
static void sd_prepare_flush(request_queue_t *q, struct request *rq)
{
	memset(rq->cmd, 0, sizeof(rq->cmd));
	rq->flags |= REQ_BLOCK_PC;
	rq->timeout = SD_TIMEOUT;
	rq->cmd[0] = SYNCHRONIZE_CACHE;
}
The following seven ordered modes are supported. The following table
shows which mode should be used depending on what features a
device/driver supports. In the leftmost column of the table, the
QUEUE_ORDERED_ prefix is omitted from the mode names to save space.
The table is followed by a description of each mode. Note that in the
descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
used for QUEUE_ORDERED_TAG* descriptions. '=>' indicates that the
preceding step must be complete before proceeding to the next step.
'->' indicates that the next step can start as soon as the previous
step is issued.
              write-back cache    ordered tag    flush    FUA
-----------------------------------------------------------------------
NONE          yes/no              N/A            no       N/A
DRAIN         no                  no             N/A      N/A
DRAIN_FLUSH   yes                 no             yes      no
DRAIN_FUA     yes                 no             yes      yes
TAG           no                  yes            N/A      N/A
TAG_FLUSH     yes                 yes            yes      no
TAG_FUA       yes                 yes            yes      yes
QUEUE_ORDERED_NONE
I/O barriers are not needed and/or supported.
Sequence: N/A
QUEUE_ORDERED_DRAIN
Requests are ordered by draining the request queue and cache
flushing isn't needed.
Sequence: drain => barrier
QUEUE_ORDERED_DRAIN_FLUSH
Requests are ordered by draining the request queue and both
pre-barrier and post-barrier cache flushings are needed.
Sequence: drain => preflush => barrier => postflush
QUEUE_ORDERED_DRAIN_FUA
Requests are ordered by draining the request queue and
pre-barrier cache flushing is needed. By using FUA on barrier
request, post-barrier flushing can be skipped.
Sequence: drain => preflush => barrier
QUEUE_ORDERED_TAG
Requests are ordered by ordered tag and cache flushing isn't
needed.
Sequence: barrier
QUEUE_ORDERED_TAG_FLUSH
Requests are ordered by ordered tag and both pre-barrier and
post-barrier cache flushings are needed.
Sequence: preflush -> barrier -> postflush
QUEUE_ORDERED_TAG_FUA
Requests are ordered by ordered tag and pre-barrier cache
flushing is needed. By using FUA on barrier request,
post-barrier flushing can be skipped.
Sequence: preflush -> barrier
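As a rough illustration of how the table and the mode descriptions
above fit together, the following user-space C sketch maps a set of
device features to a mode and to its sequence. The enum, the
dev_features struct and both function names are hypothetical and exist
only for this example; they are not the block layer's QUEUE_ORDERED_*
definitions. A driver that does not support barriers at all would
simply use NONE regardless of what this function returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Illustrative stand-ins for the QUEUE_ORDERED_* modes. The numeric
 * values are arbitrary and are NOT the kernel's definitions.
 */
enum ordered_mode {
	ORDERED_NONE,
	ORDERED_DRAIN,
	ORDERED_DRAIN_FLUSH,
	ORDERED_DRAIN_FUA,
	ORDERED_TAG,
	ORDERED_TAG_FLUSH,
	ORDERED_TAG_FUA,
};

/* Hypothetical summary of what a device/driver pair can do. */
struct dev_features {
	bool write_back_cache;
	bool ordered_tag;
	bool flush;
	bool fua;
};

/* Pick a mode by following the rows of the table above. */
enum ordered_mode pick_ordered_mode(const struct dev_features *f)
{
	assert(f != NULL);

	/* No write-back cache: keeping requests ordered is enough. */
	if (!f->write_back_cache)
		return f->ordered_tag ? ORDERED_TAG : ORDERED_DRAIN;

	/* Write-back cache without a flush operation: no barriers. */
	if (!f->flush)
		return ORDERED_NONE;

	if (f->ordered_tag)
		return f->fua ? ORDERED_TAG_FUA : ORDERED_TAG_FLUSH;
	return f->fua ? ORDERED_DRAIN_FUA : ORDERED_DRAIN_FLUSH;
}

/* Sequence of each mode, in the '=>' / '->' notation used above. */
const char *ordered_sequence(enum ordered_mode mode)
{
	switch (mode) {
	case ORDERED_DRAIN:
		return "drain => barrier";
	case ORDERED_DRAIN_FLUSH:
		return "drain => preflush => barrier => postflush";
	case ORDERED_DRAIN_FUA:
		return "drain => preflush => barrier";
	case ORDERED_TAG:
		return "barrier";
	case ORDERED_TAG_FLUSH:
		return "preflush -> barrier -> postflush";
	case ORDERED_TAG_FUA:
		return "preflush -> barrier";
	case ORDERED_NONE:
		break;
	}
	return "N/A";
}
```

For example, a drive with a write-back cache and a flush command but
neither ordered tags nor FUA ends up in the DRAIN_FLUSH row.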
Random notes/caveats
--------------------
* SCSI layer currently can't use TAG ordering even if the drive,
controller and driver support it. The problem is that SCSI midlayer
request dispatch function is not atomic. It releases queue lock and
switches to SCSI host lock during issue, and it's possible and likely
that requests change their relative positions in the meantime. Once
this problem is solved, TAG ordering can be enabled.
* Currently, no matter which ordered mode is used, there can be only
one barrier request in progress. All I/O barriers are held off by
block layer until the previous I/O barrier is complete. This doesn't
make any difference for DRAIN ordered devices, but, for TAG ordered
devices with very high command latency, passing multiple I/O barriers
to low level *might* be helpful if they are very frequent. Well, this
certainly is a non-issue. I'm writing this just to make clear that no
two I/O barriers are ever passed to a low-level driver at the same time.
* Completion order. Requests in ordered sequence are issued in order
but not required to finish in order. Barrier implementation can
handle out-of-order completion of ordered sequence. IOW, the requests
MUST be processed in order but the hardware/software completion paths
are allowed to reorder completion notifications - e.g. current SCSI
midlayer doesn't preserve completion order during error handling.
* Requeueing order. Low-level drivers are free to requeue any request
after they removed it from the request queue with
blkdev_dequeue_request(). As barrier sequence should be kept in order
when requeued, generic elevator code takes care of putting requests in
order around barrier. See blk_ordered_req_seq() and
ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.
Note that block drivers must not requeue preceding requests while
completing latter requests in an ordered sequence. Currently, no
error checking is done against this.
* Error handling. Currently, block layer will report error to upper
layer if any of requests in an ordered sequence fails. Unfortunately,
this doesn't seem to be enough. Look at the following request flow.
QUEUE_ORDERED_TAG_FLUSH is in use.
[0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
                                          still in elevator
Let's say request [2], [3] are write requests to update file system
metadata (journal or whatever) and [barrier] is used to mark that
those updates are valid. Consider the following sequence.
i. Requests [0] ~ [post] leave the request queue and enter the
low-level driver.
ii. After a while, unfortunately, something goes wrong and the
drive fails [2]. Note that any of [0], [1] and [3] could have
completed by this time, but [pre] couldn't have been finished
as the drive must process it in order and it failed before
processing that command.
iii. Error handling kicks in and determines that the error is
unrecoverable and fails [2], and resumes operation.
iv. [pre] [barrier] [post] get processed.
v. *BOOM* power fails
The problem here is that the barrier request is *supposed* to indicate
that filesystem update requests [2] and [3] made it safely to the
physical medium and, if the machine crashes after the barrier is
written, filesystem recovery code can depend on that. Sadly, that
isn't true in this case anymore. IOW, the success of an I/O barrier
should also be dependent on success of some of the preceding requests,
where only upper layer (filesystem) knows what 'some' is.
This can be solved by implementing a way to tell the block layer which
requests affect the success of the following barrier request and
making lower level drivers resume operation on error only after
block layer tells them to do so.
As the probability of this happening is very low and the drive should
be faulty, implementing the fix is probably overkill. But, still,
it's there.
* In previous drafts of the barrier implementation, there was a fallback
mechanism such that, if FUA or ordered TAG fails, a less fancy ordered
mode can be selected and the failed barrier request is retried
automatically. The rationale for this feature was that as FUA is
pretty new in the ATA world and ordered tag was never used widely, there
could be devices which claim to support those features but choke when
actually given such requests.
This was removed for two reasons: 1. it's overkill, and 2. it's
impossible to implement properly when TAG ordering is used, as low
level drivers resume after an error automatically. If it's ever
needed, adding it back and modifying low level drivers accordingly
shouldn't be difficult.
+3 -9
@@ -31,7 +31,7 @@ The following people helped with review comments and inputs for this
document: document:
Christoph Hellwig <hch@infradead.org> Christoph Hellwig <hch@infradead.org>
Arjan van de Ven <arjanv@redhat.com> Arjan van de Ven <arjanv@redhat.com>
Randy Dunlap <rddunlap@osdl.org> Randy Dunlap <rdunlap@xenotime.net>
Andre Hedrick <andre@linux-ide.org> Andre Hedrick <andre@linux-ide.org>
The following people helped with fixes/contributions to the bio patches The following people helped with fixes/contributions to the bio patches
@@ -263,14 +263,8 @@ A flag in the bio structure, BIO_BARRIER is used to identify a barrier i/o.
The generic i/o scheduler would make sure that it places the barrier request and The generic i/o scheduler would make sure that it places the barrier request and
all other requests coming after it after all the previous requests in the all other requests coming after it after all the previous requests in the
queue. Barriers may be implemented in different ways depending on the queue. Barriers may be implemented in different ways depending on the
driver. A SCSI driver for example could make use of ordered tags to driver. For more details regarding I/O barriers, please read barrier.txt
preserve the necessary ordering with a lower impact on throughput. For IDE in this directory.
this might be two sync cache flush: a pre and post flush when encountering
a barrier write.
There is a provision for queues to indicate what kind of barriers they
can provide. This is as of yet unmerged, details will be added here once it
is in the kernel.
1.2.2 Request Priority/Latency 1.2.2 Request Priority/Latency
+82
@@ -0,0 +1,82 @@
Block layer statistics in /sys/block/<dev>/stat
===============================================
This file documents the contents of the /sys/block/<dev>/stat file.
The stat file provides several statistics about the state of block
device <dev>.
Q. Why are there multiple statistics in a single file? Doesn't sysfs
normally contain a single value per file?
A. By having a single file, the kernel can guarantee that the statistics
represent a consistent snapshot of the state of the device. If the
statistics were exported as multiple files containing one statistic
each, it would be impossible to guarantee that a set of readings
represent a single point in time.
The stat file consists of a single line of text containing 11 decimal
values separated by whitespace. The fields are summarized in the
following table, and described in more detail below.
Name            units         description
----            -----         -----------
read I/Os       requests      number of read I/Os processed
read merges     requests      number of read I/Os merged with in-queue I/O
read sectors    sectors       number of sectors read
read ticks      milliseconds  total wait time for read requests
write I/Os      requests      number of write I/Os processed
write merges    requests      number of write I/Os merged with in-queue I/O
write sectors   sectors       number of sectors written
write ticks     milliseconds  total wait time for write requests
in_flight       requests      number of I/Os currently in flight
io_ticks        milliseconds  total time this block device has been active
time_in_queue   milliseconds  total wait time for all requests
read I/Os, write I/Os
=====================
These values increment when an I/O request completes.
read merges, write merges
=========================
These values increment when an I/O request is merged with an
already-queued I/O request.
read sectors, write sectors
===========================
These values count the number of sectors read from or written to this
block device. The "sectors" in question are the standard UNIX 512-byte
sectors, not any device- or filesystem-specific block size. The
counters are incremented when the I/O completes.
read ticks, write ticks
=======================
These values count the number of milliseconds that I/O requests have
waited on this block device. If there are multiple I/O requests waiting,
these values will increase at a rate greater than 1000/second; for
example, if 60 read requests wait for an average of 30 ms, the read_ticks
field will increase by 60*30 = 1800.
in_flight
=========
This value counts the number of I/O requests that have been issued to
the device driver but have not yet completed. It does not include I/O
requests that are in the queue but not yet issued to the device driver.
io_ticks
========
This value counts the number of milliseconds during which the device has
had I/O requests queued.
time_in_queue
=============
This value counts the number of milliseconds that I/O requests have waited
on this block device. If there are multiple I/O requests waiting, this
value will increase as the product of the number of milliseconds times the
number of requests waiting (see "read ticks" above for an example).
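As a minimal sketch of how a user-space tool might consume this file,
the following C code parses the single line of 11 values into a
struct, in the order given in the table above. The struct and function
names (blk_stat, blk_stat_parse, blk_stat_read, blk_stat_bytes_read)
are made up for this example and are not part of any kernel or library
interface.

```c
#include <assert.h>
#include <stdio.h>

/* The 11 fields of /sys/block/<dev>/stat, in file order. */
struct blk_stat {
	unsigned long long read_ios, read_merges, read_sectors, read_ticks;
	unsigned long long write_ios, write_merges, write_sectors, write_ticks;
	unsigned long long in_flight, io_ticks, time_in_queue;
};

/*
 * Parse one line of stat text.
 * Returns 0 on success, -1 if fewer than 11 values were found.
 */
int blk_stat_parse(const char *line, struct blk_stat *st)
{
	int n;

	assert(line != NULL && st != NULL);
	n = sscanf(line,
		   "%llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &st->read_ios, &st->read_merges,
		   &st->read_sectors, &st->read_ticks,
		   &st->write_ios, &st->write_merges,
		   &st->write_sectors, &st->write_ticks,
		   &st->in_flight, &st->io_ticks, &st->time_in_queue);
	return n == 11 ? 0 : -1;
}

/* Read and parse /sys/block/<dev>/stat for the given device name. */
int blk_stat_read(const char *dev, struct blk_stat *st)
{
	char path[256], line[512];
	FILE *fp;
	int n;

	n = snprintf(path, sizeof(path), "/sys/block/%s/stat", dev);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	fp = fopen(path, "r");
	if (!fp)
		return -1;
	if (!fgets(line, sizeof(line), fp)) {
		fclose(fp);
		return -1;
	}
	fclose(fp);
	/* All values come from one read, so they form one snapshot. */
	return blk_stat_parse(line, st);
}

/* Sectors are 512 bytes, regardless of the device block size. */
unsigned long long blk_stat_bytes_read(const struct blk_stat *st)
{
	return st->read_sectors * 512ULL;
}
```

Taking two snapshots and subtracting the counters gives rates over the
interval; for instance, the change in read ticks divided by the change
in read I/Os approximates the average read wait in milliseconds.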
+1 -1
@@ -136,7 +136,7 @@ changes occur:
8) void lazy_mmu_prot_update(pte_t pte) 8) void lazy_mmu_prot_update(pte_t pte)
This interface is called whenever the protection on This interface is called whenever the protection on
any user PTEs change. This interface provides a notification any user PTEs change. This interface provides a notification
to architecture specific code to take appropiate action. to architecture specific code to take appropriate action.
Next, we have the cache flushing interfaces. In general, when Linux Next, we have the cache flushing interfaces. In general, when Linux
+57 -1
@@ -27,6 +27,7 @@ Contents:
2.2 Powersave 2.2 Powersave
2.3 Userspace 2.3 Userspace
2.4 Ondemand 2.4 Ondemand
2.5 Conservative
3. The Governor Interface in the CPUfreq Core 3. The Governor Interface in the CPUfreq Core
@@ -110,9 +111,64 @@ directory.
The CPUfreq govenor "ondemand" sets the CPU depending on the The CPUfreq govenor "ondemand" sets the CPU depending on the
current usage. To do this the CPU must have the capability to current usage. To do this the CPU must have the capability to
switch the frequency very fast. switch the frequency very quickly. There are a number of sysfs file
accessible parameters:
sampling_rate: measured in uS (10^-6 seconds), this is how often you
want the kernel to look at the CPU usage and to make decisions on
what to do about the frequency. Typically this is set to values of
around '10000' or more.
show_sampling_rate_(min|max): the minimum and maximum sampling rates
available that you may set 'sampling_rate' to.
up_threshold: defines what the average CPU usage between the samplings
of 'sampling_rate' needs to be for the kernel to make a decision on
whether it should increase the frequency. For example when it is set
to its default value of '80' it means that between the checking
intervals the CPU needs to be on average more than 80% in use to then
decide that the CPU frequency needs to be increased.
sampling_down_factor: this parameter controls the rate that the CPU
makes a decision on when to decrease the frequency. When set to its
default value of '5' it means that at 1/5 the sampling_rate the kernel
makes a decision to lower the frequency. Five "lower rate" decisions
have to be made in a row before the CPU frequency is actually lower.
If set to '1' then the frequency decreases as quickly as it increases,
if set to '2' it decreases at half the rate of the increase.
ignore_nice_load: this parameter takes a value of '0' or '1'. When set
to '0' (its default), all processes are counted towards the
'cpu utilisation' value. When set to '1', processes that are
run with a 'nice' value will not count (and thus be ignored) in the
overall usage calculation. This is useful if you are running a CPU
intensive calculation on your laptop that you do not care how long it
takes to complete as you can 'nice' it and prevent it from taking part
in the deciding process of whether to increase your CPU frequency.
2.5 Conservative
----------------
The CPUfreq governor "conservative", much like the "ondemand"
governor, sets the CPU depending on the current usage. It differs in
behaviour in that it gracefully increases and decreases the CPU speed
rather than jumping to max speed the moment there is any load on the
CPU. This behaviour is more suitable in a battery powered environment.
The governor is tweaked in the same manner as the "ondemand" governor
through sysfs with the addition of:
freq_step: this describes what percentage steps the cpu freq should be
increased and decreased smoothly by. By default the cpu frequency will
increase in 5% chunks of your maximum cpu frequency. You can change this
value to anywhere between 0 and 100 where '0' will effectively lock your
CPU at a speed regardless of its load whilst '100' will, in theory, make
it behave identically to the "ondemand" governor.
down_threshold: same as the 'up_threshold' found for the "ondemand"
governor but for the opposite direction. For example when set to its
default value of '20' it means that the CPU usage needs to be below
20% between samples to have the frequency decreased.
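To make the interaction of up_threshold, down_threshold and freq_step
more concrete, here is a small user-space C sketch of a single
decision step as described above. It is an illustration only, not the
governor's actual code: the struct and function names are invented for
this example, frequencies are assumed to be in kHz, and
sampling_down_factor is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tunables, named after the sysfs parameters above. */
struct conservative_tunables {
	unsigned int up_threshold;	/* percent, e.g. 80 */
	unsigned int down_threshold;	/* percent, e.g. 20 */
	unsigned int freq_step;		/* percent of max frequency, e.g. 5 */
};

/*
 * Return the frequency to use after one sample with the given load.
 * The step size is freq_step percent of the maximum frequency.
 */
unsigned int conservative_next_freq(const struct conservative_tunables *t,
				    unsigned int cur_khz,
				    unsigned int min_khz,
				    unsigned int max_khz,
				    unsigned int load_pct)
{
	unsigned int step;

	assert(t != NULL);
	assert(min_khz <= cur_khz && cur_khz <= max_khz);

	step = (unsigned int)((unsigned long long)max_khz *
			      t->freq_step / 100);

	if (load_pct > t->up_threshold) {
		/* Step up, but never beyond the maximum. */
		if (max_khz - cur_khz < step)
			return max_khz;
		return cur_khz + step;
	}
	if (load_pct < t->down_threshold) {
		/* Step down, but never below the minimum. */
		if (cur_khz - min_khz < step)
			return min_khz;
		return cur_khz - step;
	}
	/* Load is between the thresholds: leave the frequency alone. */
	return cur_khz;
}
```

With freq_step set to 0 the step is zero and the frequency never
moves; with 100 a single busy sample jumps straight to the maximum,
which matches the two extremes described for freq_step.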
3. The Governor Interface in the CPUfreq Core 3. The Governor Interface in the CPUfreq Core
============================================= =============================================
+357
@@ -0,0 +1,357 @@
CPU hotplug Support in Linux(tm) Kernel
Maintainers:
CPU Hotplug Core:
Rusty Russell <rusty@rustycorp.com.au>
Srivatsa Vaddagiri <vatsa@in.ibm.com>
i386:
Zwane Mwaikambo <zwane@arm.linux.org.uk>
ppc64:
Nathan Lynch <nathanl@austin.ibm.com>
Joel Schopp <jschopp@austin.ibm.com>
ia64/x86_64:
Ashok Raj <ashok.raj@intel.com>
Authors: Ashok Raj <ashok.raj@intel.com>
Lots of feedback: Nathan Lynch <nathanl@austin.ibm.com>,
Joel Schopp <jschopp@austin.ibm.com>
Introduction
Modern advances in system architectures have introduced advanced error
reporting and correction capabilities in processors. CPU architectures permit
partitioning support, where compute resources of a single CPU could be made
available to virtual machine environments. There are a couple of OEMs that
support NUMA hardware which is hot pluggable as well, where physical
node insertion and removal require support for CPU hotplug.
Such advances require CPUs available to a kernel to be removed either for
provisioning reasons, or for RAS purposes to keep an offending CPU off
system execution path. Hence the need for CPU hotplug support in the
Linux kernel.
A more novel use of CPU-hotplug support is its use today in suspend/resume
support for SMP. Dual-core and HT support makes even a laptop run SMP
kernels, which didn't support these methods. SMP support
for suspend/resume is a work in progress.
General Stuff about CPU Hotplug
--------------------------------
Command Line Switches
---------------------
maxcpus=n          Restrict boot time cpus to n. Say if you have 4 cpus, using
                   maxcpus=2 will only boot 2. You can choose to bring the
                   other cpus online later; read the FAQ for more info.
additional_cpus=n  [x86_64 only] use this to limit hotpluggable cpus.
                   This option sets
                   cpu_possible_map = cpu_present_map + additional_cpus
CPU maps and such
-----------------
[For more on cpumaps and the primitives to manipulate them, please check
include/linux/cpumask.h, which has more descriptive text.]
cpu_possible_map: Bitmap of possible CPUs that can ever be available in the
system. This is used to allocate some boot time memory for per_cpu variables
that aren't designed to grow/shrink as CPUs are made available or removed.
Once set during boot time discovery phase, the map is static, i.e. no bits
are added or removed at any time. Trimming it accurately for your system needs
upfront can save some boot time memory. See below for how we use heuristics
in the x86_64 case to keep this in check.
cpu_online_map: Bitmap of all CPUs currently online. It's set in __cpu_up()
after a cpu is available for kernel scheduling and ready to receive
interrupts from devices. It's cleared when a cpu is brought down using
__cpu_disable(), before which all OS services including interrupts are
migrated to another target CPU.
cpu_present_map: Bitmap of CPUs currently present in the system. Not all
of them may be online. When physical hotplug is processed by the relevant
subsystem (e.g. ACPI), the map can change and a bit is either added or
removed depending on whether the event is hot-add or hot-remove. There are
currently no locking rules. Typical usage is to init topology during boot,
at which time hotplug is disabled.
You really don't need to manipulate any of the system cpu maps. They should
be read-only for most uses. When setting up per-cpu resources almost always use
cpu_possible_map/for_each_cpu() to iterate.
Never use anything other than cpumask_t to represent bitmap of CPUs.
#include <linux/cpumask.h>
for_each_cpu              - Iterate over cpu_possible_map
for_each_online_cpu       - Iterate over cpu_online_map
for_each_present_cpu      - Iterate over cpu_present_map
for_each_cpu_mask(x,mask) - Iterate over some random collection of cpu mask.
#include <linux/cpu.h>
lock_cpu_hotplug() and unlock_cpu_hotplug():
The above calls are used to inhibit cpu hotplug operations. While holding the
cpucontrol mutex, cpu_online_map will not change. If you merely need to avoid
cpus going away, you could also use preempt_disable() and preempt_enable()
for those sections. Just remember the critical section cannot call any
function that can sleep or schedule this process away. The preempt_disable()
will work as long as stop_machine_run() is used to take a cpu down.
CPU Hotplug - Frequently Asked Questions.
Q: How do I enable my kernel to support CPU hotplug?
A: When doing make defconfig, enable CPU hotplug support:
"Processor type and Features" -> Support for Hotpluggable CPUs
Make sure that you have CONFIG_HOTPLUG and CONFIG_SMP turned on as well.
You also need to enable CONFIG_HOTPLUG_CPU for SMP suspend/resume support.
Q: What architectures support CPU hotplug?
A: As of 2.6.14, the following architectures support CPU hotplug.
i386 (Intel), ppc, ppc64, parisc, s390, ia64 and x86_64
Q: How to test if hotplug is supported on the newly built kernel?
A: You should now notice an entry in sysfs.
Check if sysfs is mounted, using the "mount" command. You should notice
an entry as shown below in the output.
....
none on /sys type sysfs (rw)
....
If it is not mounted, do the following.
#mkdir /sys
#mount -t sysfs sys /sys
Now you should see entries for all present cpus; the following is an example
on an 8-way system.
#pwd
/sys/devices/system/cpu
#ls -l
total 0
drwxr-xr-x 10 root root 0 Sep 19 07:44 .
drwxr-xr-x 13 root root 0 Sep 19 07:45 ..
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu0
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu1
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu2
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu3
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu4
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu5
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu6
drwxr-xr-x 3 root root 0 Sep 19 07:48 cpu7
Under each directory you would find an "online" file which is the control
file to logically online/offline a processor.
Q: Does hot-add/hot-remove refer to physical add/remove of cpus?
A: The terms hot-add/hot-remove are not used very consistently in the code.
CONFIG_HOTPLUG_CPU enables logical online/offline capability in the kernel.
To support physical addition/removal, one would need some BIOS hooks, and
the platform should have something like the attention button in PCI hotplug.
CONFIG_ACPI_HOTPLUG_CPU enables ACPI support for physical add/remove of CPUs.
Q: How do I logically offline a CPU?
A: Do the following.
#echo 0 > /sys/devices/system/cpu/cpuX/online
Once the logical offline is successful, check
#cat /proc/interrupts
You should no longer see the CPU that you removed. Also, the online file will
report the state as 0 when a cpu is offline and 1 when it is online.
#To display the current cpu state.
#cat /sys/devices/system/cpu/cpuX/online
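The same check can be made from a program. The following is a minimal
user-space sketch, assuming the "online" file holds a single 0 or 1 as
described above; read_cpu_online() is a hypothetical helper name, not an
existing interface, and the caller supplies the full path.

```c
#include <stdio.h>

/*
 * Read a sysfs "online" control file.
 * Returns 1 if the cpu is online, 0 if it is offline, and -1 if the
 * file is missing, unreadable, or holds anything other than 0 or 1.
 */
static int read_cpu_online(const char *path)
{
	FILE *f = fopen(path, "r");
	int state = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &state) != 1)
		state = -1;
	fclose(f);

	return (state == 0 || state == 1) ? state : -1;
}
```

A return of -1 for a cpu directory that exists may simply mean the
architecture did not create an "online" file for that cpu.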
Q: Why can't I remove CPU0 on some systems?
A: Some architectures may have a special dependency on a certain CPU.
For example, IA64 platforms have the ability to send platform interrupts to
the OS, a.k.a. Corrected Platform Error Interrupts (CPEI). Current ACPI
specifications do not provide a way to change the target CPU. Hence if the
current ACPI version doesn't support such redirection, we disable hot-removal
of that CPU by marking it not removable.
In such cases you will also notice that the online file is missing under cpu0.
Q: How do I find out if a particular CPU is not removable?
A: Depending on the implementation, some architectures may show this by the
absence of the "online" file. This is done if it can be determined ahead of
time that this CPU cannot be removed.
In some situations, this can be a run-time check, i.e. if you try to remove
the last CPU, this will not be permitted. You can find such failures by
checking the return value of the "echo" command.
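A program can observe the same failure through the result of the write.
The sketch below assumes the control file accepts a single "0" or "1"
character; set_cpu_online() is a hypothetical helper name, and which errno
the kernel reports for a refused request is not specified here.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write "1" or "0" to an "online" control file.
 * Returns 0 on success or a negative errno value, so the caller can
 * tell a refused request from a missing control file (-ENOENT).
 */
static int set_cpu_online(const char *path, int online)
{
	const char *val = online ? "1" : "0";
	ssize_t n;
	int fd, err = 0;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, val, 1);
	if (n < 0)
		err = -errno;
	else if (n != 1)
		err = -EIO;	/* short write: treat as an I/O error */

	close(fd);
	return err;
}
```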
Q: What happens when a CPU is being logically offlined?
A: The following happen, listed in no particular order :-)
- A notification is sent to in-kernel registered modules by sending the event
CPU_DOWN_PREPARE
- All processes are migrated away from this outgoing CPU to a new CPU
- All interrupts targeted to this CPU are migrated to a new CPU
- timers/bottom halves/tasklets are also migrated to a new CPU
- Once all services are migrated, the kernel calls an arch specific routine
__cpu_disable() to perform arch specific cleanup.
- Once this is successful, successful cleanup is signalled by the event
CPU_DEAD.
"It is expected that each service cleans up when the CPU_DOWN_PREPARE
notifier is called; when CPU_DEAD is called it is expected that there is
nothing running on behalf of the CPU that was offlined"
Q: If I have some kernel code that needs to be aware of CPU arrival and
departure, how do I arrange for proper notification?
A: This is what you would need in your kernel code to receive notifications.
#include <linux/cpu.h>
	static int __cpuinit foobar_cpu_callback(struct notifier_block *nfb,
					    unsigned long action, void *hcpu)
	{
		/* hcpu carries the cpu number, not a real pointer */
		unsigned int cpu = (unsigned long)hcpu;

		switch (action) {
		case CPU_ONLINE:
			foobar_online_action(cpu);
			break;
		case CPU_DEAD:
			foobar_dead_action(cpu);
			break;
		}
		return NOTIFY_OK;
	}

	static struct notifier_block foobar_cpu_notifier =
	{
		.notifier_call = foobar_cpu_callback,
	};

In your init function,

	register_cpu_notifier(&foobar_cpu_notifier);
You can fail PREPARE notifiers if something doesn't work while preparing
resources. This will stop the activity and send a subsequent CANCELED event
back. CPU_DEAD should not be failed; it is just a goodness indication, but bad
things will happen if a notifier in the path sends a BAD notify code.
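The veto-and-cancel behaviour can be illustrated with a user-space model of
a notifier chain. This is a sketch under stated assumptions, not kernel code:
every model_* name is hypothetical, and the choice to deliver the cancel
event to every callback in the chain is an assumption of the model.

```c
#include <stddef.h>

enum model_event { MODEL_UP_PREPARE, MODEL_UP_CANCELED, MODEL_ONLINE };
enum model_ret { MODEL_NOTIFY_OK, MODEL_NOTIFY_BAD };

typedef enum model_ret (*model_notifier_fn)(enum model_event ev, int cpu);

/* Counters so the effect of each event can be observed. */
static int model_canceled_seen;
static int model_online_seen;

/* A well-behaved callback: records what it is told and never vetoes. */
static enum model_ret model_ok_cb(enum model_event ev, int cpu)
{
	(void)cpu;
	if (ev == MODEL_UP_CANCELED)
		model_canceled_seen++;
	if (ev == MODEL_ONLINE)
		model_online_seen++;
	return MODEL_NOTIFY_OK;
}

/* A callback that fails to prepare its resources and vetoes the bring-up. */
static enum model_ret model_veto_cb(enum model_event ev, int cpu)
{
	(void)cpu;
	return ev == MODEL_UP_PREPARE ? MODEL_NOTIFY_BAD : MODEL_NOTIFY_OK;
}

/*
 * Model of one CPU bring-up. Returns 0 on success, -1 if any PREPARE
 * callback vetoed; in that case the chain gets a CANCELED event and
 * no ONLINE event is sent.
 */
static int model_cpu_up(model_notifier_fn *chain, size_t n, int cpu)
{
	size_t i, j;

	for (i = 0; i < n; i++) {
		if (chain[i](MODEL_UP_PREPARE, cpu) == MODEL_NOTIFY_BAD) {
			for (j = 0; j < n; j++)
				chain[j](MODEL_UP_CANCELED, cpu);
			return -1;
		}
	}
	for (i = 0; i < n; i++)
		chain[i](MODEL_ONLINE, cpu);
	return 0;
}
```

A callback that allocated something in its PREPARE step is expected to
release it when it sees the cancel event.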
Q: I don't see my action being called for all CPUs already up and running?
A: Yes, CPU notifiers are called only when new CPUs are on-lined or offlined.
If you need to perform some action for each cpu already in the system, then
call your callback directly for each online cpu:

	for_each_online_cpu(i) {
		foobar_cpu_callback(&foobar_cpu_notifier, CPU_UP_PREPARE,
				    (void *)(long)i);
		foobar_cpu_callback(&foobar_cpu_notifier, CPU_ONLINE,
				    (void *)(long)i);
	}
Q: If I would like to develop cpu hotplug support for a new architecture,
what do I need at a minimum?
A: The following are required for the CPU hotplug infrastructure to work
correctly.
- Make sure you have an entry in Kconfig to enable CONFIG_HOTPLUG_CPU
- __cpu_up()      - Arch interface to bring up a CPU
- __cpu_disable() - Arch interface to shut down a CPU; no more interrupts
                    can be handled by the kernel after the routine
                    returns. This includes shutting down local APIC
                    timers etc.
- __cpu_die()     - This is supposed to ensure the death of the CPU.
                    Look at example code in other archs that implement
                    CPU hotplug. The processor is taken down from the
                    idle() loop for that specific architecture.
                    __cpu_die() typically waits for some per_cpu state
                    to be set, to positively confirm that the processor's
                    dead routine has been called.
Q: I need to ensure that a particular cpu is not removed while some
work specific to this cpu is in progress.
A: First switch the current thread context to the preferred cpu

	int my_func_on_cpu(int cpu)
	{
		cpumask_t saved_mask, new_mask = CPU_MASK_NONE;
		int curr_cpu, err = 0;

		saved_mask = current->cpus_allowed;
		cpu_set(cpu, new_mask);
		err = set_cpus_allowed(current, new_mask);

		if (err)
			return err;

		/*
		 * If we got scheduled out just after the return from
		 * set_cpus_allowed() before running the work, this ensures
		 * we stay locked.
		 */
		curr_cpu = get_cpu();

		if (curr_cpu != cpu) {
			err = -EAGAIN;
			goto ret;
		} else {
			/*
			 * Do work : But can't sleep, since get_cpu() disables preempt
			 */
		}
	ret:
		put_cpu();
		set_cpus_allowed(current, saved_mask);
		return err;
	}
Q: How do we determine how many CPUs are available for hotplug?
A: There is no clear spec-defined way in ACPI that can give us that
information today. Based on some input from Natalie of Unisys,
the ACPI MADT (Multiple APIC Description Tables) marks those possible
CPUs in a system with disabled status.
Andi implemented some simple heuristics that count the number of disabled
CPUs in the MADT as hotpluggable CPUs. In case there are no disabled CPUs,
we assume 1/2 the number of CPUs currently present can be hotplugged.
Caveat: Today's ACPI MADT can only provide 256 entries since the apicid field
in MADT is only 8 bits.
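The heuristic just described amounts to a two-line rule. A minimal sketch,
assuming the caller has already counted the present CPUs and the MADT entries
marked disabled; model_hotpluggable_cpus() is a hypothetical name and not the
function the kernel uses.

```c
/*
 * Estimate how many CPUs could be hot-added later.
 * present:       number of CPUs currently present
 * madt_disabled: number of MADT entries marked disabled
 */
static unsigned int model_hotpluggable_cpus(unsigned int present,
					    unsigned int madt_disabled)
{
	/* Firmware told us something: trust the disabled count. */
	if (madt_disabled > 0)
		return madt_disabled;

	/* Otherwise fall back to half of the CPUs currently present. */
	return present / 2;
}
```

For example, 8 present CPUs with no disabled entries gives an estimate of 4,
while 8 present CPUs with 2 disabled entries gives 2.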
User Space Notification
Hotplug support for devices is common in Linux today. It is being used to
support automatic configuration of network, usb and pci devices. A hotplug
event can be used to invoke an agent script to perform the configuration task.
You can add /etc/hotplug/cpu.agent to handle hotplug notifications with user
space scripts.
#!/bin/bash
# $Id: cpu.agent
# Kernel hotplug params include:
#ACTION=%s [online or offline]
#DEVPATH=%s
#
cd /etc/hotplug
. ./hotplug.functions
case $ACTION in
	online)
		echo `date` ":cpu.agent" add cpu >> /tmp/hotplug.txt
		;;
	offline)
		echo `date` ":cpu.agent" remove cpu >>/tmp/hotplug.txt
		;;
	*)
		debug_mesg CPU $ACTION event not supported
		exit 1
		;;
esac
+138 -27
@@ -14,7 +14,10 @@ CONTENTS:
1.1 What are cpusets ? 1.1 What are cpusets ?
1.2 Why are cpusets needed ? 1.2 Why are cpusets needed ?
1.3 How are cpusets implemented ? 1.3 How are cpusets implemented ?
1.4 How do I use cpusets ? 1.4 What are exclusive cpusets ?
1.5 What does notify_on_release do ?
1.6 What is memory_pressure ?
1.7 How do I use cpusets ?
2. Usage Examples and Syntax 2. Usage Examples and Syntax
2.1 Basic Usage 2.1 Basic Usage
2.2 Adding/removing cpus 2.2 Adding/removing cpus
@@ -49,29 +52,6 @@ its cpus_allowed vector, and the kernel page allocator will not
allocate a page on a node that is not allowed in the requesting tasks allocate a page on a node that is not allowed in the requesting tasks
mems_allowed vector. mems_allowed vector.
If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct
ancestor or descendent, may share any of the same CPUs or Memory Nodes.
A cpuset that is cpu exclusive has a sched domain associated with it.
The sched domain consists of all cpus in the current cpuset that are not
part of any exclusive child cpusets.
This ensures that the scheduler load balacing code only balances
against the cpus that are in the sched domain as defined above and not
all of the cpus in the system. This removes any overhead due to
load balancing code trying to pull tasks outside of the cpu exclusive
cpuset only to be prevented by the tasks' cpus_allowed mask.
A cpuset that is mem_exclusive restricts kernel allocations for
page, buffer and other data commonly shared by the kernel across
multiple users. All cpusets, whether mem_exclusive or not, restrict
allocations of memory for user space. This enables configuring a
system so that several independent jobs can share common kernel
data, such as file system pages, while isolating each jobs user
allocation in its own cpuset. To do this, construct a large
mem_exclusive cpuset to hold all the jobs, and construct child,
non-mem_exclusive cpusets for each individual job. Only a small
amount of typical kernel memory, such as requests from interrupt
handlers, is allowed to be taken outside even a mem_exclusive cpuset.
User level code may create and destroy cpusets by name in the cpuset User level code may create and destroy cpusets by name in the cpuset
virtual file system, manage the attributes and permissions of these virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset, cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
@@ -155,7 +135,7 @@ Cpusets extends these two mechanisms as follows:
The implementation of cpusets requires a few, simple hooks The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths: into the rest of the kernel, none in performance critical paths:
- in main/init.c, to initialize the root cpuset at system boot. - in init/main.c, to initialize the root cpuset at system boot.
- in fork and exit, to attach and detach a task from its cpuset. - in fork and exit, to attach and detach a task from its cpuset.
- in sched_setaffinity, to mask the requested CPUs by what's - in sched_setaffinity, to mask the requested CPUs by what's
allowed in that tasks cpuset. allowed in that tasks cpuset.
@@ -166,7 +146,7 @@ into the rest of the kernel, none in performance critical paths:
and related changes in both sched.c and arch/ia64/kernel/domain.c and related changes in both sched.c and arch/ia64/kernel/domain.c
- in the mbind and set_mempolicy system calls, to mask the requested - in the mbind and set_mempolicy system calls, to mask the requested
Memory Nodes by what's allowed in that tasks cpuset. Memory Nodes by what's allowed in that tasks cpuset.
- in page_alloc, to restrict memory to allowed nodes. - in page_alloc.c, to restrict memory to allowed nodes.
- in vmscan.c, to restrict page recovery to the current cpuset. - in vmscan.c, to restrict page recovery to the current cpuset.
In addition a new file system, of type "cpuset" may be mounted, In addition a new file system, of type "cpuset" may be mounted,
@@ -192,9 +172,15 @@ containing the following files describing that cpuset:
- cpus: list of CPUs in that cpuset - cpus: list of CPUs in that cpuset
- mems: list of Memory Nodes in that cpuset - mems: list of Memory Nodes in that cpuset
- memory_migrate flag: if set, move pages to cpusets nodes
- cpu_exclusive flag: is cpu placement exclusive? - cpu_exclusive flag: is cpu placement exclusive?
- mem_exclusive flag: is memory placement exclusive? - mem_exclusive flag: is memory placement exclusive?
- tasks: list of tasks (by pid) attached to that cpuset - tasks: list of tasks (by pid) attached to that cpuset
- notify_on_release flag: run /sbin/cpuset_release_agent on exit?
- memory_pressure: measure of how much paging pressure in cpuset
In addition, the root cpuset only has the following file:
- memory_pressure_enabled flag: compute memory_pressure?
New cpusets are created using the mkdir system call or shell New cpusets are created using the mkdir system call or shell
command. The properties of a cpuset, such as its flags, allowed command. The properties of a cpuset, such as its flags, allowed
@@ -228,7 +214,108 @@ exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code. and name space for cpusets, with a minimum of additional kernel code.
1.4 How do I use cpusets ?
1.4 What are exclusive cpusets ?
--------------------------------
If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendent, may share any of the same CPUs or
Memory Nodes.
A cpuset that is cpu_exclusive has a scheduler (sched) domain
associated with it. The sched domain consists of all CPUs in the
current cpuset that are not part of any exclusive child cpusets.
This ensures that the scheduler load balancing code only balances
against the CPUs that are in the sched domain as defined above and
not all of the CPUs in the system. This removes any overhead due to
load balancing code trying to pull tasks outside of the cpu_exclusive
cpuset only to be prevented by the tasks' cpus_allowed mask.
A cpuset that is mem_exclusive restricts kernel allocations for
page, buffer and other data commonly shared by the kernel across
multiple users. All cpusets, whether mem_exclusive or not, restrict
allocations of memory for user space. This enables configuring a
system so that several independent jobs can share common kernel data,
such as file system pages, while isolating each job's user allocation in
its own cpuset. To do this, construct a large mem_exclusive cpuset to
hold all the jobs, and construct child, non-mem_exclusive cpusets for
each individual job. Only a small amount of typical kernel memory,
such as requests from interrupt handlers, is allowed to be taken
outside even a mem_exclusive cpuset.
1.5 What does notify_on_release do ?
------------------------------------
If the notify_on_release flag is enabled (1) in a cpuset, then whenever
the last task in the cpuset leaves (exits or attaches to some other
cpuset) and the last child cpuset of that cpuset is removed, then
the kernel runs the command /sbin/cpuset_release_agent, supplying the
pathname (relative to the mount point of the cpuset file system) of the
abandoned cpuset. This enables automatic removal of abandoned cpusets.
The default value of notify_on_release in the root cpuset at system
boot is disabled (0). The default value of other cpusets at creation
is the current value of their parent's notify_on_release setting.
1.6 What is memory_pressure ?
-----------------------------
The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate at which the tasks in a cpuset are attempting to free up in-use
memory on the nodes of the cpuset to satisfy additional memory
requests.
This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.
This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.
This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure. It's up to the
batch manager or other user code to decide what to do about it and
take action.
==> Unless this feature is enabled by writing "1" to the special file
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance
code of __alloc_pages() for this metric reduces to simply noticing
that the cpuset_memory_pressure_enabled flag is zero. So only
systems that enable this feature will compute the metric.
Why a per-cpuset, running average:
Because this meter is per-cpuset, rather than per-task or mm,
the system load imposed by a batch scheduler monitoring this
metric is sharply reduced on large systems, because a scan of
the tasklist can be avoided on each set of queries.
Because this meter is a running average, instead of an accumulating
counter, a batch scheduler can detect memory pressure with a
single read, instead of having to read and accumulate results
for a period of time.
Because this meter is per-cpuset rather than per-task or mm,
the batch scheduler can obtain the key information, memory
pressure in a cpuset, with a single read, rather than having to
query and accumulate results over all the (dynamically changing)
set of tasks in the cpuset.
A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per-cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.
A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.
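A decaying-average filter of the kind described above can be sketched in a
few lines of integer arithmetic. Everything here is an illustration under
stated assumptions: the model_* names are hypothetical, time is passed in as
whole seconds, no locking is shown, and the per-second decay factor 933/1000
is chosen only because ten such steps roughly halve the value
(0.933^10 is about 0.5), which matches the 10 second half-life.

```c
#define MODEL_COEF	933	/* per-second decay factor, times MODEL_SCALE */
#define MODEL_SCALE	1000	/* fixed-point scale: rate is reported * 1000 */

struct model_fmeter {
	long val;	/* filtered events per second, times MODEL_SCALE */
	long pending;	/* events recorded since the last update */
	long last;	/* time of the last update, in seconds */
};

/* Record one direct-reclaim event. */
static void model_fmeter_mark(struct model_fmeter *m)
{
	m->pending++;
}

/* Age the filter up to 'now' (seconds) and fold in pending events. */
static void model_fmeter_update(struct model_fmeter *m, long now)
{
	long ticks = now - m->last;

	m->last = now;

	/* One decay step per elapsed second. */
	while (ticks-- > 0)
		m->val = m->val * MODEL_COEF / MODEL_SCALE;

	/* Each event contributes (1 - decay) so that a steady rate of
	 * one event per second settles at MODEL_SCALE. */
	m->val += (MODEL_SCALE - MODEL_COEF) * m->pending;
	m->pending = 0;
}
```

Because the reader gets a decayed average rather than a raw counter, a
single read of val is enough to judge recent pressure.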
1.7 How do I use cpusets ?
-------------------------- --------------------------
In order to minimize the impact of cpusets on critical kernel In order to minimize the impact of cpusets on critical kernel
@@ -277,6 +364,30 @@ rewritten to the 'tasks' file of its cpuset. This is done to avoid
impacting the scheduler code in the kernel with a check for changes impacting the scheduler code in the kernel with a check for changes
in a tasks processor placement. in a tasks processor placement.
Normally, once a page is allocated (given a physical page
of main memory) then that page stays on whatever node it
was allocated, so long as it remains allocated, even if the
cpusets memory placement policy 'mems' subsequently changes.
If the cpuset flag file 'memory_migrate' is set true, then when
tasks are attached to that cpuset, any pages that task had
allocated to it on nodes in its previous cpuset are migrated
to the tasks new cpuset. Depending on the implementation,
this migration may either be done by swapping the page out,
so that the next time the page is referenced, it will be paged
into the tasks new cpuset, usually on the node where it was
referenced, or this migration may be done by directly copying
the pages from the tasks previous cpuset to the new cpuset,
where possible to the same node, relative to the new cpuset,
as the node that held the page, relative to the old cpuset.
Also if 'memory_migrate' is set true, then if that cpusets
'mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'mems',
will be moved to nodes in the new setting of 'mems.' Again,
depending on the implementation, this might be done by swapping,
or by direct copying. In either case, pages that were not in
the tasks prior cpuset, or in the cpusets prior 'mems' setting,
will not be moved.
There is an exception to the above. If hotplug functionality is used There is an exception to the above. If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset, to remove all the CPUs that are currently assigned to a cpuset,
then the kernel will automatically update the cpus_allowed of all then the kernel will automatically update the cpus_allowed of all
