mirror of
https://github.com/linux-apfs/linux-apfs.git
synced 2026-05-01 15:00:59 -07:00
Merge branch 'upstream'
@@ -10,6 +10,7 @@
|
|||||||
*.a
|
*.a
|
||||||
*.s
|
*.s
|
||||||
*.ko
|
*.ko
|
||||||
|
*.so
|
||||||
*.mod.c
|
*.mod.c
|
||||||
|
|
||||||
#
|
#
|
||||||
@@ -23,6 +24,7 @@ Module.symvers
|
|||||||
# Generated include files
|
# Generated include files
|
||||||
#
|
#
|
||||||
include/asm
|
include/asm
|
||||||
|
include/asm-*/asm-offsets.h
|
||||||
include/config
|
include/config
|
||||||
include/linux/autoconf.h
|
include/linux/autoconf.h
|
||||||
include/linux/compile.h
|
include/linux/compile.h
|
||||||
|
|||||||
@@ -1883,6 +1883,7 @@ N: Jaya Kumar
|
|||||||
E: jayalk@intworks.biz
|
E: jayalk@intworks.biz
|
||||||
W: http://www.intworks.biz
|
W: http://www.intworks.biz
|
||||||
D: Arc monochrome LCD framebuffer driver, x86 reboot fixups
|
D: Arc monochrome LCD framebuffer driver, x86 reboot fixups
|
||||||
|
D: pirq addr, CS5535 alsa audio driver
|
||||||
S: Gurgaon, India
|
S: Gurgaon, India
|
||||||
S: Kuala Lumpur, Malaysia
|
S: Kuala Lumpur, Malaysia
|
||||||
|
|
||||||
@@ -3202,7 +3203,7 @@ N: Eugene Surovegin
|
|||||||
E: ebs@ebshome.net
|
E: ebs@ebshome.net
|
||||||
W: http://kernel.ebshome.net/
|
W: http://kernel.ebshome.net/
|
||||||
P: 1024D/AE5467F1 FF22 39F1 6728 89F6 6E6C 2365 7602 F33D AE54 67F1
|
P: 1024D/AE5467F1 FF22 39F1 6728 89F6 6E6C 2365 7602 F33D AE54 67F1
|
||||||
D: Embedded PowerPC 4xx: I2C, PIC and random hacks/fixes
|
D: Embedded PowerPC 4xx: EMAC, I2C, PIC and random hacks/fixes
|
||||||
S: Sunnyvale, California 94085
|
S: Sunnyvale, California 94085
|
||||||
S: USA
|
S: USA
|
||||||
|
|
||||||
|
|||||||
+5
-26
@@ -31,8 +31,6 @@ al espa
|
|||||||
Eine deutsche Version dieser Datei finden Sie unter
|
Eine deutsche Version dieser Datei finden Sie unter
|
||||||
<http://www.stefan-winter.de/Changes-2.4.0.txt>.
|
<http://www.stefan-winter.de/Changes-2.4.0.txt>.
|
||||||
|
|
||||||
Last updated: October 29th, 2002
|
|
||||||
|
|
||||||
Chris Ricker (kaboom@gatech.edu or chris.ricker@genetics.utah.edu).
|
Chris Ricker (kaboom@gatech.edu or chris.ricker@genetics.utah.edu).
|
||||||
|
|
||||||
Current Minimal Requirements
|
Current Minimal Requirements
|
||||||
@@ -48,7 +46,7 @@ necessary on all systems; obviously, if you don't have any ISDN
|
|||||||
hardware, for example, you probably needn't concern yourself with
|
hardware, for example, you probably needn't concern yourself with
|
||||||
isdn4k-utils.
|
isdn4k-utils.
|
||||||
|
|
||||||
o Gnu C 2.95.3 # gcc --version
|
o Gnu C 3.2 # gcc --version
|
||||||
o Gnu make 3.79.1 # make --version
|
o Gnu make 3.79.1 # make --version
|
||||||
o binutils 2.12 # ld -v
|
o binutils 2.12 # ld -v
|
||||||
o util-linux 2.10o # fdformat --version
|
o util-linux 2.10o # fdformat --version
|
||||||
@@ -74,26 +72,7 @@ GCC
|
|||||||
---
|
---
|
||||||
|
|
||||||
The gcc version requirements may vary depending on the type of CPU in your
|
The gcc version requirements may vary depending on the type of CPU in your
|
||||||
computer. The next paragraph applies to users of x86 CPUs, but not
|
computer.
|
||||||
necessarily to users of other CPUs. Users of other CPUs should obtain
|
|
||||||
information about their gcc version requirements from another source.
|
|
||||||
|
|
||||||
The recommended compiler for the kernel is gcc 2.95.x (x >= 3), and it
|
|
||||||
should be used when you need absolute stability. You may use gcc 3.0.x
|
|
||||||
instead if you wish, although it may cause problems. Later versions of gcc
|
|
||||||
have not received much testing for Linux kernel compilation, and there are
|
|
||||||
almost certainly bugs (mainly, but not exclusively, in the kernel) that
|
|
||||||
will need to be fixed in order to use these compilers. In any case, using
|
|
||||||
pgcc instead of plain gcc is just asking for trouble.
|
|
||||||
|
|
||||||
The Red Hat gcc 2.96 compiler subtree can also be used to build this tree.
|
|
||||||
You should ensure you use gcc-2.96-74 or later. gcc-2.96-54 will not build
|
|
||||||
the kernel correctly.
|
|
||||||
|
|
||||||
In addition, please pay attention to compiler optimization. Anything
|
|
||||||
greater than -O2 may not be wise. Similarly, if you choose to use gcc-2.95.x
|
|
||||||
or derivatives, be sure not to use -fstrict-aliasing (which, depending on
|
|
||||||
your version of gcc 2.95.x, may necessitate using -fno-strict-aliasing).
|
|
||||||
|
|
||||||
Make
|
Make
|
||||||
----
|
----
|
||||||
@@ -322,9 +301,9 @@ Getting updated software
|
|||||||
Kernel compilation
|
Kernel compilation
|
||||||
******************
|
******************
|
||||||
|
|
||||||
gcc 2.95.3
|
gcc
|
||||||
----------
|
---
|
||||||
o <ftp://ftp.gnu.org/gnu/gcc/gcc-2.95.3.tar.gz>
|
o <ftp://ftp.gnu.org/gnu/gcc/>
|
||||||
|
|
||||||
Make
|
Make
|
||||||
----
|
----
|
||||||
|
|||||||
@@ -199,7 +199,7 @@ The rationale is:
|
|||||||
modifications are prevented
|
modifications are prevented
|
||||||
- saves the compiler work to optimize redundant code away ;)
|
- saves the compiler work to optimize redundant code away ;)
|
||||||
|
|
||||||
int fun(int )
|
int fun(int a)
|
||||||
{
|
{
|
||||||
int result = 0;
|
int result = 0;
|
||||||
char *buffer = kmalloc(SIZE);
|
char *buffer = kmalloc(SIZE);
|
||||||
@@ -344,7 +344,7 @@ Remember: if another thread can find your data structure, and you don't
|
|||||||
have a reference count on it, you almost certainly have a bug.
|
have a reference count on it, you almost certainly have a bug.
|
||||||
|
|
||||||
|
|
||||||
Chapter 11: Macros, Enums, Inline functions and RTL
|
Chapter 11: Macros, Enums and RTL
|
||||||
|
|
||||||
Names of macros defining constants and labels in enums are capitalized.
|
Names of macros defining constants and labels in enums are capitalized.
|
||||||
|
|
||||||
@@ -429,7 +429,35 @@ from void pointer to any other pointer type is guaranteed by the C programming
|
|||||||
language.
|
language.
|
||||||
|
|
||||||
|
|
||||||
Chapter 14: References
|
Chapter 14: The inline disease
|
||||||
|
|
||||||
|
There appears to be a common misperception that gcc has a magic "make me
|
||||||
|
faster" speedup option called "inline". While the use of inlines can be
|
||||||
|
appropriate (for example as a means of replacing macros, see Chapter 11), it
|
||||||
|
very often is not. Abundant use of the inline keyword leads to a much bigger
|
||||||
|
kernel, which in turn slows the system as a whole down, due to a bigger
|
||||||
|
icache footprint for the CPU and simply because there is less memory
|
||||||
|
available for the pagecache. Just think about it; a pagecache miss causes a
|
||||||
|
disk seek, which easily takes 5 milliseconds. There are a LOT of cpu cycles
|
||||||
|
that can go into these 5 milliseconds.
|
||||||
|
|
||||||
|
A reasonable rule of thumb is to not put inline at functions that have more
|
||||||
|
than 3 lines of code in them. An exception to this rule are the cases where
|
||||||
|
a parameter is known to be a compiletime constant, and as a result of this
|
||||||
|
constantness you *know* the compiler will be able to optimize most of your
|
||||||
|
function away at compile time. For a good example of this latter case, see
|
||||||
|
the kmalloc() inline function.
|
||||||
|
|
||||||
|
Often people argue that adding inline to functions that are static and used
|
||||||
|
only once is always a win since there is no space tradeoff. While this is
|
||||||
|
technically correct, gcc is capable of inlining these automatically without
|
||||||
|
help, and the maintenance issue of removing the inline when a second user
|
||||||
|
appears outweighs the potential value of the hint that tells gcc to do
|
||||||
|
something it would have done anyway.
|
||||||
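The rule of thumb in the chapter above can be illustrated with a small userspace sketch. All names here (square, size_class) are hypothetical and are not taken from the kernel tree; size_class only imitates the idea behind the constant-argument case and makes no claim about how kmalloc() is actually written.

```c
#include <assert.h>
#include <stddef.h>

/*
 * One-line helper: a reasonable inline candidate, and a type-checked
 * replacement for a macro such as #define SQUARE(x) ((x) * (x)),
 * which would evaluate its argument twice.
 */
static inline int square(int x)
{
	return x * x;
}

/*
 * Longer than three lines, so normally not an inline candidate.
 * The exception: when callers pass a compile-time constant, the
 * compiler can usually fold the whole chain to a single number.
 * (Hypothetical size classes, chosen only for illustration.)
 */
static inline int size_class(size_t size)
{
	if (size <= 32)
		return 0;
	if (size <= 64)
		return 1;
	if (size <= 128)
		return 2;
	return 3;
}
```

A call such as size_class(64) gives an optimizing compiler everything it needs to replace the call with a constant, whereas a call with a runtime value pays for the full comparison chain at every inlined call site.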
|
|
||||||
|
|
||||||
|
|
||||||
|
Chapter 15: References
|
||||||
|
|
||||||
The C Programming Language, Second Edition
|
The C Programming Language, Second Edition
|
||||||
by Brian W. Kernighan and Dennis M. Ritchie.
|
by Brian W. Kernighan and Dennis M. Ritchie.
|
||||||
@@ -444,10 +472,13 @@ ISBN 0-201-61586-X.
|
|||||||
URL: http://cm.bell-labs.com/cm/cs/tpop/
|
URL: http://cm.bell-labs.com/cm/cs/tpop/
|
||||||
|
|
||||||
GNU manuals - where in compliance with K&R and this text - for cpp, gcc,
|
GNU manuals - where in compliance with K&R and this text - for cpp, gcc,
|
||||||
gcc internals and indent, all available from http://www.gnu.org
|
gcc internals and indent, all available from http://www.gnu.org/manual/
|
||||||
|
|
||||||
WG14 is the international standardization working group for the programming
|
WG14 is the international standardization working group for the programming
|
||||||
language C, URL: http://std.dkuug.dk/JTC1/SC22/WG14/
|
language C, URL: http://www.open-std.org/JTC1/SC22/WG14/
|
||||||
|
|
||||||
|
Kernel CodingStyle, by greg@kroah.com at OLS 2002:
|
||||||
|
http://www.kroah.com/linux/talks/ols_2002_kernel_codingstyle_talk/html/
|
||||||
|
|
||||||
--
|
--
|
||||||
Last updated on 16 February 2004 by a community effort on LKML.
|
Last updated on 30 December 2005 by a community effort on LKML.
|
||||||
|
|||||||
@@ -0,0 +1,6 @@
|
|||||||
|
*.xml
|
||||||
|
*.ps
|
||||||
|
*.pdf
|
||||||
|
*.html
|
||||||
|
*.9.gz
|
||||||
|
*.9
|
||||||
@@ -53,6 +53,11 @@
|
|||||||
!Iinclude/linux/sched.h
|
!Iinclude/linux/sched.h
|
||||||
!Ekernel/sched.c
|
!Ekernel/sched.c
|
||||||
!Ekernel/timer.c
|
!Ekernel/timer.c
|
||||||
|
</sect1>
|
||||||
|
<sect1><title>High-resolution timers</title>
|
||||||
|
!Iinclude/linux/ktime.h
|
||||||
|
!Iinclude/linux/hrtimer.h
|
||||||
|
!Ekernel/hrtimer.c
|
||||||
</sect1>
|
</sect1>
|
||||||
<sect1><title>Internal Functions</title>
|
<sect1><title>Internal Functions</title>
|
||||||
!Ikernel/exit.c
|
!Ikernel/exit.c
|
||||||
@@ -369,6 +374,7 @@ X!Edrivers/acpi/motherboard.c
|
|||||||
X!Edrivers/acpi/bus.c
|
X!Edrivers/acpi/bus.c
|
||||||
-->
|
-->
|
||||||
!Edrivers/acpi/scan.c
|
!Edrivers/acpi/scan.c
|
||||||
|
!Idrivers/acpi/scan.c
|
||||||
<!-- No correct structured comments
|
<!-- No correct structured comments
|
||||||
X!Edrivers/acpi/pci_bind.c
|
X!Edrivers/acpi/pci_bind.c
|
||||||
-->
|
-->
|
||||||
|
|||||||
@@ -222,7 +222,7 @@
|
|||||||
<title>Two Main Types of Kernel Locks: Spinlocks and Semaphores</title>
|
<title>Two Main Types of Kernel Locks: Spinlocks and Semaphores</title>
|
||||||
|
|
||||||
<para>
|
<para>
|
||||||
There are two main types of kernel locks. The fundamental type
|
There are three main types of kernel locks. The fundamental type
|
||||||
is the spinlock
|
is the spinlock
|
||||||
(<filename class="headerfile">include/asm/spinlock.h</filename>),
|
(<filename class="headerfile">include/asm/spinlock.h</filename>),
|
||||||
which is a very simple single-holder lock: if you can't get the
|
which is a very simple single-holder lock: if you can't get the
|
||||||
@@ -230,16 +230,22 @@
|
|||||||
very small and fast, and can be used anywhere.
|
very small and fast, and can be used anywhere.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
The second type is a semaphore
|
The second type is a mutex
|
||||||
|
(<filename class="headerfile">include/linux/mutex.h</filename>): it
|
||||||
|
is like a spinlock, but you may block holding a mutex.
|
||||||
|
If you can't lock a mutex, your task will suspend itself, and be woken
|
||||||
|
up when the mutex is released. This means the CPU can do something
|
||||||
|
else while you are waiting. There are many cases when you simply
|
||||||
|
can't sleep (see <xref linkend="sleeping-things"/>), and so have to
|
||||||
|
use a spinlock instead.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
The third type is a semaphore
|
||||||
(<filename class="headerfile">include/asm/semaphore.h</filename>): it
|
(<filename class="headerfile">include/asm/semaphore.h</filename>): it
|
||||||
can have more than one holder at any time (the number decided at
|
can have more than one holder at any time (the number decided at
|
||||||
initialization time), although it is most commonly used as a
|
initialization time), although it is most commonly used as a
|
||||||
single-holder lock (a mutex). If you can't get a semaphore,
|
single-holder lock (a mutex). If you can't get a semaphore, your
|
||||||
your task will put itself on the queue, and be woken up when the
|
task will be suspended and later on woken up - just like for mutexes.
|
||||||
semaphore is released. This means the CPU will do something
|
|
||||||
else while you are waiting, but there are many cases when you
|
|
||||||
simply can't sleep (see <xref linkend="sleeping-things"/>), and so
|
|
||||||
have to use a spinlock instead.
|
|
||||||
</para>
|
</para>
|
||||||
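The "more than one holder" property of a semaphore, and the single-holder case that behaves like a mutex, can be sketched in userspace C. This is a toy under stated assumptions: the toy_sem names are hypothetical and not a kernel API, and the acquire path fails immediately instead of suspending the task the way a real mutex or semaphore would.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy, non-blocking counting lock; hypothetical, not a kernel type. */
struct toy_sem {
	atomic_int free_slots;	/* how many more holders may enter */
};

/* The number of simultaneous holders is fixed at initialization time. */
static void toy_sem_init(struct toy_sem *s, int max_holders)
{
	atomic_init(&s->free_slots, max_holders);
}

/*
 * Try to become a holder. Returns true on success.
 * A real semaphore or mutex would put the task to sleep here
 * rather than returning false.
 */
static bool toy_sem_try_down(struct toy_sem *s)
{
	int old = atomic_load(&s->free_slots);

	while (old > 0) {
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&s->free_slots, &old, old - 1))
			return true;
	}
	return false;
}

/* Give the slot back so another holder can enter. */
static void toy_sem_up(struct toy_sem *s)
{
	atomic_fetch_add(&s->free_slots, 1);
}
```

Initialized with 2, the toy admits two holders and refuses a third until one calls toy_sem_up(); initialized with 1, it admits exactly one holder at a time, which is the single-holder usage the text describes.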
<para>
|
<para>
|
||||||
Neither type of lock is recursive: see
|
Neither type of lock is recursive: see
|
||||||
|
|||||||
@@ -253,6 +253,7 @@
|
|||||||
!Edrivers/usb/core/urb.c
|
!Edrivers/usb/core/urb.c
|
||||||
!Edrivers/usb/core/message.c
|
!Edrivers/usb/core/message.c
|
||||||
!Edrivers/usb/core/file.c
|
!Edrivers/usb/core/file.c
|
||||||
|
!Edrivers/usb/core/driver.c
|
||||||
!Edrivers/usb/core/usb.c
|
!Edrivers/usb/core/usb.c
|
||||||
!Edrivers/usb/core/hub.c
|
!Edrivers/usb/core/hub.c
|
||||||
</chapter>
|
</chapter>
|
||||||
|
|||||||
@@ -229,7 +229,7 @@ int __init myradio_init(struct video_init *v)
|
|||||||
|
|
||||||
static int users = 0;
|
static int users = 0;
|
||||||
|
|
||||||
static int radio_open(stuct video_device *dev, int flags)
|
static int radio_open(struct video_device *dev, int flags)
|
||||||
{
|
{
|
||||||
if(users)
|
if(users)
|
||||||
return -EBUSY;
|
return -EBUSY;
|
||||||
@@ -949,7 +949,7 @@ int __init mycamera_init(struct video_init *v)
|
|||||||
|
|
||||||
static int users = 0;
|
static int users = 0;
|
||||||
|
|
||||||
static int camera_open(stuct video_device *dev, int flags)
|
static int camera_open(struct video_device *dev, int flags)
|
||||||
{
|
{
|
||||||
if(users)
|
if(users)
|
||||||
return -EBUSY;
|
return -EBUSY;
|
||||||
|
|||||||
@@ -1,74 +1,67 @@
|
|||||||
Refcounter framework for elements of lists/arrays protected by
|
Refcounter design for elements of lists/arrays protected by RCU.
|
||||||
RCU.
|
|
||||||
|
|
||||||
Refcounting on elements of lists which are protected by traditional
|
Refcounting on elements of lists which are protected by traditional
|
||||||
reader/writer spinlocks or semaphores are straight forward as in:
|
reader/writer spinlocks or semaphores are straight forward as in:
|
||||||
|
|
||||||
1. 2.
|
1. 2.
|
||||||
add() search_and_reference()
|
add() search_and_reference()
|
||||||
{ {
|
{ {
|
||||||
alloc_object read_lock(&list_lock);
|
alloc_object read_lock(&list_lock);
|
||||||
... search_for_element
|
... search_for_element
|
||||||
atomic_set(&el->rc, 1); atomic_inc(&el->rc);
|
atomic_set(&el->rc, 1); atomic_inc(&el->rc);
|
||||||
write_lock(&list_lock); ...
|
write_lock(&list_lock); ...
|
||||||
add_element read_unlock(&list_lock);
|
add_element read_unlock(&list_lock);
|
||||||
... ...
|
... ...
|
||||||
write_unlock(&list_lock); }
|
write_unlock(&list_lock); }
|
||||||
}
|
}
|
||||||
|
|
||||||
3. 4.
|
3. 4.
|
||||||
release_referenced() delete()
|
release_referenced() delete()
|
||||||
{ {
|
{ {
|
||||||
... write_lock(&list_lock);
|
... write_lock(&list_lock);
|
||||||
atomic_dec(&el->rc, relfunc) ...
|
atomic_dec(&el->rc, relfunc) ...
|
||||||
... delete_element
|
... delete_element
|
||||||
} write_unlock(&list_lock);
|
} write_unlock(&list_lock);
|
||||||
...
|
...
|
||||||
if (atomic_dec_and_test(&el->rc))
|
if (atomic_dec_and_test(&el->rc))
|
||||||
kfree(el);
|
kfree(el);
|
||||||
...
|
...
|
||||||
}
|
}
|
||||||
|
|
||||||
If this list/array is made lock free using rcu as in changing the
|
If this list/array is made lock free using rcu as in changing the
|
||||||
write_lock in add() and delete() to spin_lock and changing read_lock
|
write_lock in add() and delete() to spin_lock and changing read_lock
|
||||||
in search_and_reference to rcu_read_lock(), the rcuref_get in
|
in search_and_reference to rcu_read_lock(), the atomic_get in
|
||||||
search_and_reference could potentially hold reference to an element which
|
search_and_reference could potentially hold reference to an element which
|
||||||
has already been deleted from the list/array. rcuref_lf_get_rcu takes
|
has already been deleted from the list/array. atomic_inc_not_zero takes
|
||||||
care of this scenario. search_and_reference should look as;
|
care of this scenario. search_and_reference should look as;
|
||||||
|
|
||||||
1. 2.
|
1. 2.
|
||||||
add() search_and_reference()
|
add() search_and_reference()
|
||||||
{ {
|
{ {
|
||||||
alloc_object rcu_read_lock();
|
alloc_object rcu_read_lock();
|
||||||
... search_for_element
|
... search_for_element
|
||||||
atomic_set(&el->rc, 1); if (rcuref_inc_lf(&el->rc)) {
|
atomic_set(&el->rc, 1); if (atomic_inc_not_zero(&el->rc)) {
|
||||||
write_lock(&list_lock); rcu_read_unlock();
|
write_lock(&list_lock); rcu_read_unlock();
|
||||||
return FAIL;
|
return FAIL;
|
||||||
add_element }
|
add_element }
|
||||||
... ...
|
... ...
|
||||||
write_unlock(&list_lock); rcu_read_unlock();
|
write_unlock(&list_lock); rcu_read_unlock();
|
||||||
} }
|
} }
|
||||||
3. 4.
|
3. 4.
|
||||||
release_referenced() delete()
|
release_referenced() delete()
|
||||||
{ {
|
{ {
|
||||||
... write_lock(&list_lock);
|
... write_lock(&list_lock);
|
||||||
rcuref_dec(&el->rc, relfunc) ...
|
atomic_dec(&el->rc, relfunc) ...
|
||||||
... delete_element
|
... delete_element
|
||||||
} write_unlock(&list_lock);
|
} write_unlock(&list_lock);
|
||||||
...
|
...
|
||||||
if (rcuref_dec_and_test(&el->rc))
|
if (atomic_dec_and_test(&el->rc))
|
||||||
call_rcu(&el->head, el_free);
|
call_rcu(&el->head, el_free);
|
||||||
...
|
...
|
||||||
}
|
}
|
||||||
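The "take a reference only if the count is not already zero" step that search_and_reference() relies on can be sketched in userspace with C11 atomics. This is an illustration under assumptions, not the kernel implementation: struct element, ref_get_not_zero() and ref_put() are hypothetical names. The return convention mirrors the kernel's atomic_inc_not_zero(), which returns non-zero when the increment was performed, so the failure path is the one where it returns zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an RCU-protected list element. */
struct element {
	atomic_int rc;
};

/*
 * Take a reference only while the count is still non-zero.
 * Returns true on success, false if the last reference is already
 * gone and the element is waiting to be freed.
 */
static bool ref_get_not_zero(struct element *el)
{
	int old = atomic_load(&el->rc);

	while (old != 0) {
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&el->rc, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if the caller dropped the last one. */
static bool ref_put(struct element *el)
{
	return atomic_fetch_sub(&el->rc, 1) == 1;
}
```

A reader that finds an element under rcu_read_lock() would bail out when ref_get_not_zero() returns false, because a concurrent delete() has already dropped the count to zero and queued the element for freeing.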
|
|
||||||
Sometimes, reference to the element need to be obtained in the
|
Sometimes, reference to the element need to be obtained in the
|
||||||
update (write) stream. In such cases, rcuref_inc_lf might be an overkill
|
update (write) stream. In such cases, atomic_inc_not_zero might be an
|
||||||
since the spinlock serialising list updates are held. rcuref_inc
|
overkill since the spinlock serialising list updates are held. atomic_inc
|
||||||
is to be used in such cases.
|
is to be used in such cases.
|
||||||
For arches which do not have cmpxchg rcuref_inc_lf
|
|
||||||
api uses a hashed spinlock implementation and the same hashed spinlock
|
|
||||||
is acquired in all rcuref_xxx primitives to preserve atomicity.
|
|
||||||
Note: Use rcuref_inc api only if you need to use rcuref_inc_lf on the
|
|
||||||
refcounter atleast at one place. Mixing rcuref_inc and atomic_xxx api
|
|
||||||
might lead to races. rcuref_inc_lf() must be used in lockfree
|
|
||||||
RCU critical sections only.
|
|
||||||
|
|||||||
@@ -27,18 +27,17 @@ Who To Submit Drivers To
|
|||||||
------------------------
|
------------------------
|
||||||
|
|
||||||
Linux 2.0:
|
Linux 2.0:
|
||||||
No new drivers are accepted for this kernel tree
|
No new drivers are accepted for this kernel tree.
|
||||||
|
|
||||||
Linux 2.2:
|
Linux 2.2:
|
||||||
|
No new drivers are accepted for this kernel tree.
|
||||||
|
|
||||||
|
Linux 2.4:
|
||||||
If the code area has a general maintainer then please submit it to
|
If the code area has a general maintainer then please submit it to
|
||||||
the maintainer listed in MAINTAINERS in the kernel file. If the
|
the maintainer listed in MAINTAINERS in the kernel file. If the
|
||||||
maintainer does not respond or you cannot find the appropriate
|
maintainer does not respond or you cannot find the appropriate
|
||||||
maintainer then please contact the 2.2 kernel maintainer:
|
maintainer then please contact Marcelo Tosatti
|
||||||
Marc-Christian Petersen <m.c.p@wolk-project.de>.
|
<marcelo.tosatti@cyclades.com>.
|
||||||
|
|
||||||
Linux 2.4:
|
|
||||||
The same rules apply as 2.2. The final contact point for Linux 2.4
|
|
||||||
submissions is Marcelo Tosatti <marcelo.tosatti@cyclades.com>.
|
|
||||||
|
|
||||||
Linux 2.6:
|
Linux 2.6:
|
||||||
The same rules apply as 2.4 except that you should follow linux-kernel
|
The same rules apply as 2.4 except that you should follow linux-kernel
|
||||||
@@ -53,6 +52,7 @@ Licensing: The code must be released to us under the
|
|||||||
of exclusive GPL licensing, and if you wish the driver
|
of exclusive GPL licensing, and if you wish the driver
|
||||||
to be useful to other communities such as BSD you may well
|
to be useful to other communities such as BSD you may well
|
||||||
wish to release under multiple licenses.
|
wish to release under multiple licenses.
|
||||||
|
See accepted licenses at include/linux/module.h
|
||||||
|
|
||||||
Copyright: The copyright owner must agree to use of GPL.
|
Copyright: The copyright owner must agree to use of GPL.
|
||||||
It's best if the submitter and copyright owner
|
It's best if the submitter and copyright owner
|
||||||
@@ -143,5 +143,13 @@ KernelNewbies:
|
|||||||
http://kernelnewbies.org/
|
http://kernelnewbies.org/
|
||||||
|
|
||||||
Linux USB project:
|
Linux USB project:
|
||||||
http://sourceforge.net/projects/linux-usb/
|
http://www.linux-usb.org/
|
||||||
|
|
||||||
|
How to NOT write kernel driver by arjanv@redhat.com
|
||||||
|
http://people.redhat.com/arjanv/olspaper.pdf
|
||||||
|
|
||||||
|
Kernel Janitor:
|
||||||
|
http://janitor.kernelnewbies.org/
|
||||||
|
|
||||||
|
--
|
||||||
|
Last updated on 17 Nov 2005.
|
||||||
|
|||||||
@@ -78,7 +78,9 @@ Randy Dunlap's patch scripts:
|
|||||||
http://www.xenotime.net/linux/scripts/patching-scripts-002.tar.gz
|
http://www.xenotime.net/linux/scripts/patching-scripts-002.tar.gz
|
||||||
|
|
||||||
Andrew Morton's patch scripts:
|
Andrew Morton's patch scripts:
|
||||||
http://www.zip.com.au/~akpm/linux/patches/patch-scripts-0.20
|
http://www.zip.com.au/~akpm/linux/patches/
|
||||||
|
Instead of these scripts, quilt is the recommended patch management
|
||||||
|
tool (see above).
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@@ -97,7 +99,7 @@ need to split up your patch. See #3, next.
|
|||||||
|
|
||||||
3) Separate your changes.
|
3) Separate your changes.
|
||||||
|
|
||||||
Separate each logical change into its own patch.
|
Separate _logical changes_ into a single patch file.
|
||||||
|
|
||||||
For example, if your changes include both bug fixes and performance
|
For example, if your changes include both bug fixes and performance
|
||||||
enhancements for a single driver, separate those changes into two
|
enhancements for a single driver, separate those changes into two
|
||||||
@@ -112,6 +114,10 @@ If one patch depends on another patch in order for a change to be
|
|||||||
complete, that is OK. Simply note "this patch depends on patch X"
|
complete, that is OK. Simply note "this patch depends on patch X"
|
||||||
in your patch description.
|
in your patch description.
|
||||||
|
|
||||||
|
If you cannot condense your patch set into a smaller set of patches,
|
||||||
|
then only post say 15 or so at a time and wait for review and integration.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
4) Select e-mail destination.
|
4) Select e-mail destination.
|
||||||
|
|
||||||
@@ -124,6 +130,10 @@ your patch to the primary Linux kernel developer's mailing list,
|
|||||||
linux-kernel@vger.kernel.org. Most kernel developers monitor this
|
linux-kernel@vger.kernel.org. Most kernel developers monitor this
|
||||||
e-mail list, and can comment on your changes.
|
e-mail list, and can comment on your changes.
|
||||||
|
|
||||||
|
|
||||||
|
Do not send more than 15 patches at once to the vger mailing lists!!!
|
||||||
|
|
||||||
|
|
||||||
Linus Torvalds is the final arbiter of all changes accepted into the
|
Linus Torvalds is the final arbiter of all changes accepted into the
|
||||||
Linux kernel. His e-mail address is <torvalds@osdl.org>. He gets
|
Linux kernel. His e-mail address is <torvalds@osdl.org>. He gets
|
||||||
a lot of e-mail, so typically you should do your best to -avoid- sending
|
a lot of e-mail, so typically you should do your best to -avoid- sending
|
||||||
@@ -149,6 +159,9 @@ USB, framebuffer devices, the VFS, the SCSI subsystem, etc. See the
|
|||||||
MAINTAINERS file for a mailing list that relates specifically to
|
MAINTAINERS file for a mailing list that relates specifically to
|
||||||
your change.
|
your change.
|
||||||
|
|
||||||
|
Majordomo lists of VGER.KERNEL.ORG at:
|
||||||
|
<http://vger.kernel.org/vger-lists.html>
|
||||||
|
|
||||||
If changes affect userland-kernel interfaces, please send
|
If changes affect userland-kernel interfaces, please send
|
||||||
the MAN-PAGES maintainer (as listed in the MAINTAINERS file)
|
the MAN-PAGES maintainer (as listed in the MAINTAINERS file)
|
||||||
a man-pages patch, or at least a notification of the change,
|
a man-pages patch, or at least a notification of the change,
|
||||||
@@ -158,7 +171,7 @@ Even if the maintainer did not respond in step #4, make sure to ALWAYS
|
|||||||
copy the maintainer when you change their code.
|
copy the maintainer when you change their code.
|
||||||
|
|
||||||
For small patches you may want to CC the Trivial Patch Monkey
|
For small patches you may want to CC the Trivial Patch Monkey
|
||||||
trivial@rustcorp.com.au set up by Rusty Russell; which collects "trivial"
|
trivial@kernel.org managed by Adrian Bunk; which collects "trivial"
|
||||||
patches. Trivial patches must qualify for one of the following rules:
|
patches. Trivial patches must qualify for one of the following rules:
|
||||||
Spelling fixes in documentation
|
Spelling fixes in documentation
|
||||||
Spelling fixes which could break grep(1).
|
Spelling fixes which could break grep(1).
|
||||||
@@ -171,7 +184,7 @@ patches. Trivial patches must qualify for one of the following rules:
|
|||||||
since people copy, as long as it's trivial)
|
since people copy, as long as it's trivial)
|
||||||
Any fix by the author/maintainer of the file. (ie. patch monkey
|
Any fix by the author/maintainer of the file. (ie. patch monkey
|
||||||
in re-transmission mode)
|
in re-transmission mode)
|
||||||
URL: <http://www.kernel.org/pub/linux/kernel/people/rusty/trivial/>
|
URL: <http://www.kernel.org/pub/linux/kernel/people/bunk/trivial/>
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@@ -373,27 +386,14 @@ a diffstat, to show what files have changed, and the number of inserted
|
|||||||
and deleted lines per file. A diffstat is especially useful on bigger
|
and deleted lines per file. A diffstat is especially useful on bigger
|
||||||
patches. Other comments relevant only to the moment or the maintainer,
|
patches. Other comments relevant only to the moment or the maintainer,
|
||||||
not suitable for the permanent changelog, should also go here.
|
not suitable for the permanent changelog, should also go here.
|
||||||
|
Use diffstat options "-p 1 -w 70" so that filenames are listed from the
|
||||||
|
top of the kernel source tree and don't use too much horizontal space
|
||||||
|
(easily fit in 80 columns, maybe with some indentation).
|
||||||
|
|
||||||
See more details on the proper patch format in the following
|
See more details on the proper patch format in the following
|
||||||
references.
|
references.
|
||||||
|
|
||||||
|
|
||||||
13) More references for submitting patches
|
|
||||||
|
|
||||||
Andrew Morton, "The perfect patch" (tpp).
|
|
||||||
<http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt>
|
|
||||||
|
|
||||||
Jeff Garzik, "Linux kernel patch submission format."
|
|
||||||
<http://linux.yyz.us/patch-format.html>
|
|
||||||
|
|
||||||
Greg KH, "How to piss off a kernel subsystem maintainer"
|
|
||||||
<http://www.kroah.com/log/2005/03/31/>
|
|
||||||
|
|
||||||
Kernel Documentation/CodingStyle
|
|
||||||
<http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle>
|
|
||||||
|
|
||||||
Linus Torvald's mail on the canonical patch format:
|
|
||||||
<http://lkml.org/lkml/2005/4/7/183>
|
|
||||||
|
|
||||||
|
|
||||||
-----------------------------------
|
-----------------------------------
|
||||||
@@ -466,3 +466,31 @@ and 'extern __inline__'.
|
|||||||
Don't try to anticipate nebulous future cases which may or may not
|
Don't try to anticipate nebulous future cases which may or may not
|
||||||
be useful: "Make it as simple as you can, and no simpler."
|
be useful: "Make it as simple as you can, and no simpler."
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
----------------------
|
||||||
|
SECTION 3 - REFERENCES
|
||||||
|
----------------------
|
||||||
|
|
||||||
|
Andrew Morton, "The perfect patch" (tpp).
|
||||||
|
<http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt>
|
||||||
|
|
||||||
|
Jeff Garzik, "Linux kernel patch submission format."
|
||||||
|
<http://linux.yyz.us/patch-format.html>
|
||||||
|
|
||||||
|
Greg Kroah-Hartman "How to piss off a kernel subsystem maintainer".
|
||||||
|
<http://www.kroah.com/log/2005/03/31/>
|
||||||
|
<http://www.kroah.com/log/2005/07/08/>
|
||||||
|
<http://www.kroah.com/log/2005/10/19/>
|
||||||
|
<http://www.kroah.com/log/2006/01/11/>
|
||||||
|
|
||||||
|
NO!!!! No more huge patch bombs to linux-kernel@vger.kernel.org people!.
|
||||||
|
<http://marc.theaimsgroup.com/?l=linux-kernel&m=112112749912944&w=2>
|
||||||
|
|
||||||
|
Kernel Documentation/CodingStyle
|
||||||
|
<http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle>
|
||||||
|
|
||||||
|
Linus Torvalds's mail on the canonical patch format:
|
||||||
|
<http://lkml.org/lkml/2005/4/7/183>
|
||||||
|
--
|
||||||
|
Last updated on 17 Nov 2005.
|
||||||
|
|||||||
@@ -2,8 +2,8 @@
|
|||||||
Applying Patches To The Linux Kernel
|
Applying Patches To The Linux Kernel
|
||||||
------------------------------------
|
------------------------------------
|
||||||
|
|
||||||
(Written by Jesper Juhl, August 2005)
|
Original by: Jesper Juhl, August 2005
|
||||||
|
Last update: 2006-01-05
|
||||||
|
|
||||||
|
|
||||||
A frequently asked question on the Linux Kernel Mailing List is how to apply
|
A frequently asked question on the Linux Kernel Mailing List is how to apply
|
||||||
@@ -76,7 +76,7 @@ instead:
|
|||||||
|
|
||||||
If you wish to uncompress the patch file by hand first before applying it
|
If you wish to uncompress the patch file by hand first before applying it
|
||||||
(what I assume you've done in the examples below), then you simply run
|
(what I assume you've done in the examples below), then you simply run
|
||||||
gunzip or bunzip2 on the file - like this:
|
gunzip or bunzip2 on the file -- like this:
|
||||||
gunzip patch-x.y.z.gz
|
gunzip patch-x.y.z.gz
|
||||||
bunzip2 patch-x.y.z.bz2
|
bunzip2 patch-x.y.z.bz2
|
||||||
|
|
||||||
@@ -94,7 +94,7 @@ Common errors when patching
|
|||||||
---
|
---
|
||||||
When patch applies a patch file it attempts to verify the sanity of the
|
When patch applies a patch file it attempts to verify the sanity of the
|
||||||
file in different ways.
|
file in different ways.
|
||||||
Checking that the file looks like a valid patch file, checking the code
|
Checking that the file looks like a valid patch file & checking the code
|
||||||
around the bits being modified matches the context provided in the patch are
|
around the bits being modified matches the context provided in the patch are
|
||||||
just two of the basic sanity checks patch does.
|
just two of the basic sanity checks patch does.
|
||||||
|
|
||||||
@@ -118,16 +118,16 @@ wrong.
|
|||||||
|
|
||||||
When patch encounters a change that it can't fix up with fuzz it rejects it
|
When patch encounters a change that it can't fix up with fuzz it rejects it
|
||||||
outright and leaves a file with a .rej extension (a reject file). You can
|
outright and leaves a file with a .rej extension (a reject file). You can
|
||||||
read this file to see exactely what change couldn't be applied, so you can
|
read this file to see exactly what change couldn't be applied, so you can
|
||||||
go fix it up by hand if you wish.
|
go fix it up by hand if you wish.
|
||||||
|
|
||||||
If you don't have any third party patches applied to your kernel source, but
|
If you don't have any third-party patches applied to your kernel source, but
|
||||||
only patches from kernel.org and you apply the patches in the correct order,
|
only patches from kernel.org and you apply the patches in the correct order,
|
||||||
and have made no modifications yourself to the source files, then you should
|
and have made no modifications yourself to the source files, then you should
|
||||||
never see a fuzz or reject message from patch. If you do see such messages
|
never see a fuzz or reject message from patch. If you do see such messages
|
||||||
anyway, then there's a high risk that either your local source tree or the
|
anyway, then there's a high risk that either your local source tree or the
|
||||||
patch file is corrupted in some way. In that case you should probably try
|
patch file is corrupted in some way. In that case you should probably try
|
||||||
redownloading the patch and if things are still not OK then you'd be advised
|
re-downloading the patch and if things are still not OK then you'd be advised
|
||||||
to start with a fresh tree downloaded in full from kernel.org.
|
to start with a fresh tree downloaded in full from kernel.org.
|
||||||
|
|
||||||
Let's look a bit more at some of the messages patch can produce.
|
Let's look a bit more at some of the messages patch can produce.
|
||||||
@@ -136,7 +136,7 @@ If patch stops and presents a "File to patch:" prompt, then patch could not
|
|||||||
find a file to be patched. Most likely you forgot to specify -p1 or you are
|
find a file to be patched. Most likely you forgot to specify -p1 or you are
|
||||||
in the wrong directory. Less often, you'll find patches that need to be
|
in the wrong directory. Less often, you'll find patches that need to be
|
||||||
applied with -p0 instead of -p1 (reading the patch file should reveal if
|
applied with -p0 instead of -p1 (reading the patch file should reveal if
|
||||||
this is the case - if so, then this is an error by the person who created
|
this is the case -- if so, then this is an error by the person who created
|
||||||
the patch but is not fatal).
|
the patch but is not fatal).
|
||||||
|
|
||||||
If you get "Hunk #2 succeeded at 1887 with fuzz 2 (offset 7 lines)." or a
|
If you get "Hunk #2 succeeded at 1887 with fuzz 2 (offset 7 lines)." or a
|
||||||
@@ -167,22 +167,28 @@ the patch will in fact apply it.
|
|||||||
|
|
||||||
A message similar to "patch: **** unexpected end of file in patch" or "patch
|
A message similar to "patch: **** unexpected end of file in patch" or "patch
|
||||||
unexpectedly ends in middle of line" means that patch could make no sense of
|
unexpectedly ends in middle of line" means that patch could make no sense of
|
||||||
the file you fed to it. Either your download is broken or you tried to feed
|
the file you fed to it. Either your download is broken, you tried to feed
|
||||||
patch a compressed patch file without uncompressing it first.
|
patch a compressed patch file without uncompressing it first, or the patch
|
||||||
|
file that you are using has been mangled by a mail client or mail transfer
|
||||||
|
agent along the way somewhere, e.g., by splitting a long line into two lines.
|
||||||
|
Often these warnings can easily be fixed by joining (concatenating) the
|
||||||
|
two lines that had been split.
|
||||||
|
|
||||||
As I already mentioned above, these errors should never happen if you apply
|
As I already mentioned above, these errors should never happen if you apply
|
||||||
a patch from kernel.org to the correct version of an unmodified source tree.
|
a patch from kernel.org to the correct version of an unmodified source tree.
|
||||||
So if you get these errors with kernel.org patches then you should probably
|
So if you get these errors with kernel.org patches then you should probably
|
||||||
assume that either your patch file or your tree is broken and I'd advice you
|
assume that either your patch file or your tree is broken and I'd advise you
|
||||||
to start over with a fresh download of a full kernel tree and the patch you
|
to start over with a fresh download of a full kernel tree and the patch you
|
||||||
wish to apply.
|
wish to apply.
|
||||||
|
|
||||||
|
|
||||||
Are there any alternatives to `patch'?
|
Are there any alternatives to `patch'?
|
||||||
---
|
---
|
||||||
Yes there are alternatives. You can use the `interdiff' program
|
Yes there are alternatives.
|
||||||
(http://cyberelk.net/tim/patchutils/) to generate a patch representing the
|
|
||||||
differences between two patches and then apply the result.
|
You can use the `interdiff' program (http://cyberelk.net/tim/patchutils/) to
|
||||||
|
generate a patch representing the differences between two patches and then
|
||||||
|
apply the result.
|
||||||
This will let you move from something like 2.6.12.2 to 2.6.12.3 in a single
|
This will let you move from something like 2.6.12.2 to 2.6.12.3 in a single
|
||||||
step. The -z flag to interdiff will even let you feed it patches in gzip or
|
step. The -z flag to interdiff will even let you feed it patches in gzip or
|
||||||
bzip2 compressed form directly without the use of zcat or bzcat or manual
|
bzip2 compressed form directly without the use of zcat or bzcat or manual
|
||||||
@@ -197,10 +203,10 @@ do the additional steps since interdiff can get things wrong in some cases.
|
|||||||
Another alternative is `ketchup', which is a python script for automatic
|
Another alternative is `ketchup', which is a python script for automatic
|
||||||
downloading and applying of patches (http://www.selenic.com/ketchup/).
|
downloading and applying of patches (http://www.selenic.com/ketchup/).
|
||||||
|
|
||||||
Other nice tools are diffstat which shows a summary of changes made by a
|
Other nice tools are diffstat, which shows a summary of changes made by a
|
||||||
patch, lsdiff which displays a short listing of affected files in a patch
|
patch; lsdiff, which displays a short listing of affected files in a patch
|
||||||
file, along with (optionally) the line numbers of the start of each patch
|
file, along with (optionally) the line numbers of the start of each patch;
|
||||||
and grepdiff which displays a list of the files modified by a patch where
|
and grepdiff, which displays a list of the files modified by a patch where
|
||||||
the patch contains a given regular expression.
|
the patch contains a given regular expression.
|
||||||
|
|
||||||
|
|
||||||
@@ -225,8 +231,8 @@ The -mm kernels live at
|
|||||||
In place of ftp.kernel.org you can use ftp.cc.kernel.org, where cc is a
|
In place of ftp.kernel.org you can use ftp.cc.kernel.org, where cc is a
|
||||||
country code. This way you'll be downloading from a mirror site that's most
|
country code. This way you'll be downloading from a mirror site that's most
|
||||||
likely geographically closer to you, resulting in faster downloads for you,
|
likely geographically closer to you, resulting in faster downloads for you,
|
||||||
less bandwidth used globally and less load on the main kernel.org servers -
|
less bandwidth used globally and less load on the main kernel.org servers --
|
||||||
these are good things, do use mirrors when possible.
|
these are good things, so do use mirrors when possible.
|
||||||
|
|
||||||
|
|
||||||
The 2.6.x kernels
|
The 2.6.x kernels
|
||||||
@@ -234,14 +240,14 @@ The 2.6.x kernels
|
|||||||
These are the base stable releases released by Linus. The highest numbered
|
These are the base stable releases released by Linus. The highest numbered
|
||||||
release is the most recent.
|
release is the most recent.
|
||||||
|
|
||||||
If regressions or other serious flaws are found then a -stable fix patch
|
If regressions or other serious flaws are found, then a -stable fix patch
|
||||||
will be released (see below) on top of this base. Once a new 2.6.x base
|
will be released (see below) on top of this base. Once a new 2.6.x base
|
||||||
kernel is released, a patch is made available that is a delta between the
|
kernel is released, a patch is made available that is a delta between the
|
||||||
previous 2.6.x kernel and the new one.
|
previous 2.6.x kernel and the new one.
|
||||||
|
|
||||||
To apply a patch moving from 2.6.11 to 2.6.12 you'd do the following (note
|
To apply a patch moving from 2.6.11 to 2.6.12, you'd do the following (note
|
||||||
that such patches do *NOT* apply on top of 2.6.x.y kernels but on top of the
|
that such patches do *NOT* apply on top of 2.6.x.y kernels but on top of the
|
||||||
base 2.6.x kernel - if you need to move from 2.6.x.y to 2.6.x+1 you need to
|
base 2.6.x kernel -- if you need to move from 2.6.x.y to 2.6.x+1 you need to
|
||||||
first revert the 2.6.x.y patch).
|
first revert the 2.6.x.y patch).
|
||||||
|
|
||||||
Here are some examples:
|
Here are some examples:
|
||||||
@@ -258,12 +264,12 @@ $ patch -p1 -R < ../patch-2.6.11.1 # revert the 2.6.11.1 patch
|
|||||||
# source dir is now 2.6.11
|
# source dir is now 2.6.11
|
||||||
$ patch -p1 < ../patch-2.6.12 # apply new 2.6.12 patch
|
$ patch -p1 < ../patch-2.6.12 # apply new 2.6.12 patch
|
||||||
$ cd ..
|
$ cd ..
|
||||||
$ mv linux-2.6.11.1 inux-2.6.12 # rename source dir
|
$ mv linux-2.6.11.1 linux-2.6.12 # rename source dir
|
||||||
|
|
||||||
|
|
||||||
The 2.6.x.y kernels
|
The 2.6.x.y kernels
|
||||||
---
|
---
|
||||||
Kernels with 4 digit versions are -stable kernels. They contain small(ish)
|
Kernels with 4-digit versions are -stable kernels. They contain small(ish)
|
||||||
critical fixes for security problems or significant regressions discovered
|
critical fixes for security problems or significant regressions discovered
|
||||||
in a given 2.6.x kernel.
|
in a given 2.6.x kernel.
|
||||||
|
|
||||||
@@ -274,9 +280,14 @@ versions.
|
|||||||
If no 2.6.x.y kernel is available, then the highest numbered 2.6.x kernel is
|
If no 2.6.x.y kernel is available, then the highest numbered 2.6.x kernel is
|
||||||
the current stable kernel.
|
the current stable kernel.
|
||||||
|
|
||||||
|
note: the -stable team usually do make incremental patches available as well
|
||||||
|
as patches against the latest mainline release, but I only cover the
|
||||||
|
non-incremental ones below. The incremental ones can be found at
|
||||||
|
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/incr/
|
||||||
|
|
||||||
These patches are not incremental, meaning that for example the 2.6.12.3
|
These patches are not incremental, meaning that for example the 2.6.12.3
|
||||||
patch does not apply on top of the 2.6.12.2 kernel source, but rather on top
|
patch does not apply on top of the 2.6.12.2 kernel source, but rather on top
|
||||||
of the base 2.6.12 kernel source.
|
of the base 2.6.12 kernel source.
|
||||||
So, in order to apply the 2.6.12.3 patch to your existing 2.6.12.2 kernel
|
So, in order to apply the 2.6.12.3 patch to your existing 2.6.12.2 kernel
|
||||||
source you have to first back out the 2.6.12.2 patch (so you are left with a
|
source you have to first back out the 2.6.12.2 patch (so you are left with a
|
||||||
base 2.6.12 kernel source) and then apply the new 2.6.12.3 patch.
|
base 2.6.12 kernel source) and then apply the new 2.6.12.3 patch.
|
||||||
@@ -342,12 +353,12 @@ The -git kernels
|
|||||||
repository, hence the name).
|
repository, hence the name).
|
||||||
|
|
||||||
These patches are usually released daily and represent the current state of
|
These patches are usually released daily and represent the current state of
|
||||||
Linus' tree. They are more experimental than -rc kernels since they are
|
Linus's tree. They are more experimental than -rc kernels since they are
|
||||||
generated automatically without even a cursory glance to see if they are
|
generated automatically without even a cursory glance to see if they are
|
||||||
sane.
|
sane.
|
||||||
|
|
||||||
-git patches are not incremental and apply either to a base 2.6.x kernel or
|
-git patches are not incremental and apply either to a base 2.6.x kernel or
|
||||||
a base 2.6.x-rc kernel - you can see which from their name.
|
a base 2.6.x-rc kernel -- you can see which from their name.
|
||||||
A patch named 2.6.12-git1 applies to the 2.6.12 kernel source and a patch
|
A patch named 2.6.12-git1 applies to the 2.6.12 kernel source and a patch
|
||||||
named 2.6.13-rc3-git2 applies to the source of the 2.6.13-rc3 kernel.
|
named 2.6.13-rc3-git2 applies to the source of the 2.6.13-rc3 kernel.
|
||||||
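The naming rule above can be sketched in C. This is only an illustration: the helper name git_patch_base is hypothetical (it is not part of any kernel tool), and it assumes patch names of the form <base>-gitN, as in the two examples given.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the kernel version that a -git patch applies
 * to into 'out'. Assumes names shaped like "2.6.12-git1" or
 * "2.6.13-rc3-git2". Returns 0 on success, -1 if 'name' carries no
 * "-git" suffix or 'out' is too small.
 */
static int git_patch_base(const char *name, char *out, size_t outlen)
{
	const char *p = name;
	const char *last = NULL;
	size_t len;

	/* Locate the last "-git" so that any earlier "-rc" part is kept. */
	while ((p = strstr(p, "-git")) != NULL) {
		last = p;
		p += 4;
	}
	if (last == NULL || last == name)
		return -1;

	len = (size_t)(last - name);
	if (len >= outlen)
		return -1;

	memcpy(out, name, len);
	out[len] = '\0';
	return 0;
}
```

Given "2.6.13-rc3-git2" the helper yields "2.6.13-rc3", i.e. the tree the patch has to be applied to.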
|
|
||||||
@@ -390,12 +401,12 @@ You should generally strive to get your patches into mainline via -mm to
|
|||||||
ensure maximum testing.
|
ensure maximum testing.
|
||||||
|
|
||||||
This branch is in constant flux and contains many experimental features, a
|
This branch is in constant flux and contains many experimental features, a
|
||||||
lot of debugging patches not appropriate for mainline etc and is the most
|
lot of debugging patches not appropriate for mainline etc., and is the most
|
||||||
experimental of the branches described in this document.
|
experimental of the branches described in this document.
|
||||||
|
|
||||||
These kernels are not appropriate for use on systems that are supposed to be
|
These kernels are not appropriate for use on systems that are supposed to be
|
||||||
stable and they are more risky to run than any of the other branches (make
|
stable and they are more risky to run than any of the other branches (make
|
||||||
sure you have up-to-date backups - that goes for any experimental kernel but
|
sure you have up-to-date backups -- that goes for any experimental kernel but
|
||||||
even more so for -mm kernels).
|
even more so for -mm kernels).
|
||||||
|
|
||||||
These kernels in addition to all the other experimental patches they contain
|
These kernels in addition to all the other experimental patches they contain
|
||||||
@@ -433,7 +444,11 @@ $ cd ..
|
|||||||
$ mv linux-2.6.12-mm1 linux-2.6.13-rc3-mm3 # rename the source dir
|
$ mv linux-2.6.12-mm1 linux-2.6.13-rc3-mm3 # rename the source dir
|
||||||
|
|
||||||
|
|
||||||
This concludes this list of explanations of the various kernel trees and I
|
This concludes this list of explanations of the various kernel trees.
|
||||||
hope you are now crystal clear on how to apply the various patches and help
|
I hope you are now clear on how to apply the various patches and help testing
|
||||||
testing the kernel.
|
the kernel.
|
||||||
|
|
||||||
|
Thank you's to Randy Dunlap, Rolf Eike Beer, Linus Torvalds, Bodo Eggert,
|
||||||
|
Johannes Stezenbach, Grant Coady, Pavel Machek and others that I may have
|
||||||
|
forgotten for their reviews and contributions to this document.
|
||||||
|
|
||||||
|
|||||||
@@ -0,0 +1,271 @@
|
|||||||
|
I/O Barriers
|
||||||
|
============
|
||||||
|
Tejun Heo <htejun@gmail.com>, July 22 2005
|
||||||
|
|
||||||
|
I/O barrier requests are used to guarantee ordering around the barrier
|
||||||
|
requests. Unless you're crazy enough to use disk drives for
|
||||||
|
implementing synchronization constructs (wow, sounds interesting...),
|
||||||
|
the ordering is meaningful only for write requests for things like
|
||||||
|
journal checkpoints. All requests queued before a barrier request
|
||||||
|
must be finished (made it to the physical medium) before the barrier
|
||||||
|
request is started, and all requests queued after the barrier request
|
||||||
|
must be started only after the barrier request is finished (again,
|
||||||
|
made it to the physical medium).
|
||||||
|
|
||||||
|
In other words, I/O barrier requests have the following two properties.
|
||||||
|
|
||||||
|
1. Request ordering
|
||||||
|
|
||||||
|
Requests cannot pass the barrier request. Preceding requests are
|
||||||
|
processed before the barrier and following requests after.
|
||||||
|
|
||||||
|
Depending on what features a drive supports, this can be done in one
|
||||||
|
of the following three ways.
|
||||||
|
|
||||||
|
i. For devices which have queue depth greater than 1 (TCQ devices) and
|
||||||
|
support ordered tags, block layer can just issue the barrier as an
|
||||||
|
ordered request and the lower level driver, controller and drive
|
||||||
|
itself are responsible for making sure that the ordering constraint is
|
||||||
|
met. Most modern SCSI controllers/drives should support this.
|
||||||
|
|
||||||
|
NOTE: SCSI ordered tag isn't currently used due to limitation in the
|
||||||
|
SCSI midlayer, see the following random notes section.
|
||||||
|
|
||||||
|
ii. For devices which have queue depth greater than 1 but don't
|
||||||
|
support ordered tags, block layer ensures that the requests preceding
|
||||||
|
a barrier request finish before issuing the barrier request. Also,
|
||||||
|
it defers requests following the barrier until the barrier request is
|
||||||
|
finished. Older SCSI controllers/drives and SATA drives fall in this
|
||||||
|
category.
|
||||||
|
|
||||||
|
iii. Devices which have queue depth of 1. This is a degenerate case
|
||||||
|
of ii. Just keeping issue order suffices. Ancient SCSI
|
||||||
|
controllers/drives and IDE drives are in this category.
|
||||||
|
|
||||||
|
2. Forced flushing to physical medium
|
||||||
|
|
||||||
|
Again, if you're not gonna do synchronization with disk drives (dang,
|
||||||
|
it sounds even more appealing now!), the reason you use I/O barriers
|
||||||
|
is mainly to protect filesystem integrity when power failure or some
|
||||||
|
other events abruptly stop the drive from operating and possibly make
|
||||||
|
the drive lose data in its cache. So, I/O barriers need to guarantee
|
||||||
|
that requests actually get written to non-volatile medium in order.
|
||||||
|
|
||||||
|
There are four cases:
|
||||||
|
|
||||||
|
i. No write-back cache. Keeping requests ordered is enough.
|
||||||
|
|
||||||
|
ii. Write-back cache but no flush operation. There's no way to
|
||||||
|
guarantee physical-medium commit order. This kind of device can't do
|
||||||
|
I/O barriers.
|
||||||
|
|
||||||
|
iii. Write-back cache and flush operation but no FUA (forced unit
|
||||||
|
access). We need two cache flushes - before and after the barrier
|
||||||
|
request.
|
||||||
|
|
||||||
|
iv. Write-back cache, flush operation and FUA. We still need one
|
||||||
|
flush to make sure requests preceding a barrier are written to medium,
|
||||||
|
but post-barrier flush can be avoided by using FUA write on the
|
||||||
|
barrier itself.
|
||||||
|
|
||||||
|
|
||||||
|
How to support barrier requests in drivers
|
||||||
|
------------------------------------------
|
||||||
|
|
||||||
|
All barrier handling is done inside block layer proper. All a low level
|
||||||
|
driver has to do is implement its prepare_flush_fn and use one of
|
||||||
|
the following two functions to indicate what barrier type it supports
|
||||||
|
and how to prepare flush requests. Note that the term 'ordered' is
|
||||||
|
used to indicate the whole sequence of performing barrier requests
|
||||||
|
including draining and flushing.
|
||||||
|
|
||||||
|
typedef void (prepare_flush_fn)(request_queue_t *q, struct request *rq);
|
||||||
|
|
||||||
|
int blk_queue_ordered(request_queue_t *q, unsigned ordered,
|
||||||
|
prepare_flush_fn *prepare_flush_fn,
|
||||||
|
unsigned gfp_mask);
|
||||||
|
|
||||||
|
int blk_queue_ordered_locked(request_queue_t *q, unsigned ordered,
|
||||||
|
prepare_flush_fn *prepare_flush_fn,
|
||||||
|
unsigned gfp_mask);
|
||||||
|
|
||||||
|
The only difference between the two functions is whether or not the
|
||||||
|
caller is holding q->queue_lock on entry. The latter expects the
|
||||||
|
caller to be holding the lock.
|
||||||
|
|
||||||
|
@q : the queue in question
|
||||||
|
@ordered : the ordered mode the driver/device supports
|
||||||
|
@prepare_flush_fn : this function should prepare @rq such that it
|
||||||
|
flushes cache to physical medium when executed
|
||||||
|
@gfp_mask : gfp_mask used when allocating data structures
|
||||||
|
for ordered processing
|
||||||
|
|
||||||
|
For example, SCSI disk driver's prepare_flush_fn looks like the
|
||||||
|
following.
|
||||||
|
|
||||||
|
static void sd_prepare_flush(request_queue_t *q, struct request *rq)
|
||||||
|
{
|
||||||
|
memset(rq->cmd, 0, sizeof(rq->cmd));
|
||||||
|
rq->flags |= REQ_BLOCK_PC;
|
||||||
|
rq->timeout = SD_TIMEOUT;
|
||||||
|
rq->cmd[0] = SYNCHRONIZE_CACHE;
|
||||||
|
}
|
||||||
|
|
||||||
|
The following seven ordered modes are supported. The following table
|
||||||
|
shows which mode should be used depending on what features a
|
||||||
|
device/driver supports. In the leftmost column of table,
|
||||||
|
QUEUE_ORDERED_ prefix is omitted from the mode names to save space.
|
||||||
|
|
||||||
|
The table is followed by description of each mode. Note that in the
|
||||||
|
descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
|
||||||
|
used for QUEUE_ORDERED_TAG* descriptions. '=>' indicates that the
|
||||||
|
preceding step must be complete before proceeding to the next step.
|
||||||
|
'->' indicates that the next step can start as soon as the previous
|
||||||
|
step is issued.
|
||||||
|
|
||||||
|
write-back cache ordered tag flush FUA
|
||||||
|
-----------------------------------------------------------------------
|
||||||
|
NONE yes/no N/A no N/A
|
||||||
|
DRAIN no no N/A N/A
|
||||||
|
DRAIN_FLUSH yes no yes no
|
||||||
|
DRAIN_FUA yes no yes yes
|
||||||
|
TAG no yes N/A N/A
|
||||||
|
TAG_FLUSH yes yes yes no
|
||||||
|
TAG_FUA yes yes yes yes
|
||||||
|
|
||||||
|
|
||||||
|
QUEUE_ORDERED_NONE
|
||||||
|
I/O barriers are not needed and/or supported.
|
||||||
|
|
||||||
|
Sequence: N/A
|
||||||
|
|
||||||
|
QUEUE_ORDERED_DRAIN
|
||||||
|
Requests are ordered by draining the request queue and cache
|
||||||
|
flushing isn't needed.
|
||||||
|
|
||||||
|
Sequence: drain => barrier
|
||||||
|
|
||||||
|
QUEUE_ORDERED_DRAIN_FLUSH
|
||||||
|
Requests are ordered by draining the request queue and both
|
||||||
|
pre-barrier and post-barrier cache flushings are needed.
|
||||||
|
|
||||||
|
Sequence: drain => preflush => barrier => postflush
|
||||||
|
|
||||||
|
QUEUE_ORDERED_DRAIN_FUA
|
||||||
|
Requests are ordered by draining the request queue and
|
||||||
|
pre-barrier cache flushing is needed. By using FUA on barrier
|
||||||
|
request, post-barrier flushing can be skipped.
|
||||||
|
|
||||||
|
Sequence: drain => preflush => barrier
|
||||||
|
|
||||||
|
QUEUE_ORDERED_TAG
|
||||||
|
Requests are ordered by ordered tag and cache flushing isn't
|
||||||
|
needed.
|
||||||
|
|
||||||
|
Sequence: barrier
|
||||||
|
|
||||||
|
QUEUE_ORDERED_TAG_FLUSH
|
||||||
|
Requests are ordered by ordered tag and both pre-barrier and
|
||||||
|
post-barrier cache flushings are needed.
|
||||||
|
|
||||||
|
Sequence: preflush -> barrier -> postflush
|
||||||
|
|
||||||
|
QUEUE_ORDERED_TAG_FUA
|
||||||
|
Requests are ordered by ordered tag and pre-barrier cache
|
||||||
|
flushing is needed. By using FUA on barrier request,
|
||||||
|
post-barrier flushing can be skipped.
|
||||||
|
|
||||||
|
Sequence: preflush -> barrier
|
||||||
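The feature table and the mode descriptions above can be read as a small decision function. The following is a minimal sketch, not the block layer's code: the enum and the name pick_ordered_mode are hypothetical stand-ins for the QUEUE_ORDERED_* constants, and it assumes NONE is chosen only when a write-back cache exists without a flush operation (a driver may also pick NONE simply because barriers are not needed).

```c
#include <stdbool.h>

/* Local stand-ins for the QUEUE_ORDERED_* modes; the values are arbitrary. */
enum ordered_mode {
	ORDERED_NONE,
	ORDERED_DRAIN,
	ORDERED_DRAIN_FLUSH,
	ORDERED_DRAIN_FUA,
	ORDERED_TAG,
	ORDERED_TAG_FLUSH,
	ORDERED_TAG_FUA,
};

/* Map device/driver features to an ordered mode, following the table. */
static enum ordered_mode pick_ordered_mode(bool wb_cache, bool ordered_tag,
					   bool flush, bool fua)
{
	/* No write-back cache: keeping requests ordered is enough. */
	if (!wb_cache)
		return ordered_tag ? ORDERED_TAG : ORDERED_DRAIN;

	/* Write-back cache without flush: commit order can't be guaranteed. */
	if (!flush)
		return ORDERED_NONE;

	/* FUA on the barrier itself replaces the post-barrier flush. */
	if (fua)
		return ordered_tag ? ORDERED_TAG_FUA : ORDERED_DRAIN_FUA;

	/* Otherwise both a pre-barrier and a post-barrier flush are needed. */
	return ordered_tag ? ORDERED_TAG_FLUSH : ORDERED_DRAIN_FLUSH;
}
```

For a drive with a write-back cache and a flush command but neither ordered tags nor FUA, this returns ORDERED_DRAIN_FLUSH. As the notes in this document point out, the SCSI layer currently can't use the TAG modes even when the hardware supports ordered tags.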
|
|
||||||
|
|
||||||
|
Random notes/caveats
|
||||||
|
--------------------
|
||||||
|
|
||||||
|
* SCSI layer currently can't use TAG ordering even if the drive,
|
||||||
|
controller and driver support it. The problem is that SCSI midlayer
|
||||||
|
request dispatch function is not atomic. It releases queue lock and
|
||||||
|
switches to SCSI host lock during issue and it's possible and likely to
|
||||||
|
happen in time that requests change their relative positions. Once
|
||||||
|
this problem is solved, TAG ordering can be enabled.
|
||||||
|
|
||||||
|
* Currently, no matter which ordered mode is used, there can be only
|
||||||
|
one barrier request in progress. All I/O barriers are held off by
|
||||||
|
block layer until the previous I/O barrier is complete. This doesn't
|
||||||
|
make any difference for DRAIN ordered devices, but, for TAG ordered
|
||||||
|
devices with very high command latency, passing multiple I/O barriers
|
||||||
|
to low level *might* be helpful if they are very frequent. Well, this
|
||||||
|
certainly is a non-issue. I'm writing this just to make clear that no
|
||||||
|
two I/O barriers are ever passed to the low-level driver at the same time.
|
||||||
|
|
||||||
|
* Completion order. Requests in ordered sequence are issued in order
|
||||||
|
but not required to finish in order. Barrier implementation can
|
||||||
|
handle out-of-order completion of ordered sequence. IOW, the requests
|
||||||
|
MUST be processed in order but the hardware/software completion paths
|
||||||
|
are allowed to reorder completion notifications - eg. current SCSI
|
||||||
|
midlayer doesn't preserve completion order during error handling.
|
||||||
|
|
||||||
|
* Requeueing order. Low-level drivers are free to requeue any request
|
||||||
|
after they removed it from the request queue with
|
||||||
|
blkdev_dequeue_request(). As barrier sequence should be kept in order
|
||||||
|
when requeued, generic elevator code takes care of putting requests in
|
||||||
|
order around barrier. See blk_ordered_req_seq() and
|
||||||
|
ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.
|
||||||
|
|
||||||
|
Note that block drivers must not requeue preceding requests while
|
||||||
|
completing latter requests in an ordered sequence. Currently, no
|
||||||
|
error checking is done against this.
|
||||||
|
|
||||||
|
* Error handling. Currently, block layer will report error to upper
|
||||||
|
layer if any of requests in an ordered sequence fails. Unfortunately,
|
||||||
|
this doesn't seem to be enough. Look at the following request flow.
|
||||||
|
QUEUE_ORDERED_TAG_FLUSH is in use.
|
||||||
|
|
||||||
|
[0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
|
||||||
|
still in elevator
|
||||||
|
|
||||||
|
Let's say request [2], [3] are write requests to update file system
|
||||||
|
metadata (journal or whatever) and [barrier] is used to mark that
|
||||||
|
those updates are valid. Consider the following sequence.
|
||||||
|
|
||||||
|
i. Requests [0] ~ [post] leave the request queue and enter
|
||||||
|
low-level driver.
|
||||||
|
ii. After a while, unfortunately, something goes wrong and the
|
||||||
|
drive fails [2]. Note that any of [0], [1] and [3] could have
|
||||||
|
completed by this time, but [pre] couldn't have been finished
|
||||||
|
as the drive must process it in order and it failed before
|
||||||
|
processing that command.
|
||||||
|
iii. Error handling kicks in and determines that the error is
|
||||||
|
unrecoverable and fails [2], and resumes operation.
|
||||||
|
iv. [pre] [barrier] [post] get processed.
|
||||||
|
v. *BOOM* power fails
|
||||||
|
|
||||||
|
The problem here is that the barrier request is *supposed* to indicate
|
||||||
|
that filesystem update requests [2] and [3] made it safely to the
|
||||||
|
physical medium and, if the machine crashes after the barrier is
|
||||||
|
written, filesystem recovery code can depend on that. Sadly, that
|
||||||
|
isn't true in this case anymore. IOW, the success of an I/O barrier
|
||||||
|
should also be dependent on success of some of the preceding requests,
|
||||||
|
where only upper layer (filesystem) knows what 'some' is.
|
||||||
|
|
||||||
|
This can be solved by implementing a way to tell the block layer which
|
||||||
|
requests affect the success of the following barrier request and
|
||||||
|
making lower level drivers resume operation on error only after
|
||||||
|
block layer tells them to do so.
|
||||||
|
|
||||||
|
As the probability of this happening is very low and the drive should
|
||||||
|
be faulty, implementing the fix is probably an overkill. But, still,
|
||||||
|
it's there.
|
||||||
|
|
||||||
|
* In previous drafts of barrier implementation, there was fallback
|
||||||
|
mechanism such that, if FUA or ordered TAG fails, less fancy ordered
|
||||||
|
mode can be selected and the failed barrier request is retried
|
||||||
|
automatically. The rationale for this feature was that as FUA is
|
||||||
|
pretty new in ATA world and ordered tag was never used widely, there
|
||||||
|
could be devices which report to support those features but choke when
|
||||||
|
actually given such requests.
|
||||||
|
|
||||||
|
This was removed for two reasons 1. it's an overkill 2. it's
|
||||||
|
impossible to implement properly when TAG ordering is used as low
|
||||||
|
level drivers resume after an error automatically. If it's ever
|
||||||
|
needed adding it back and modifying low level drivers accordingly
|
||||||
|
shouldn't be difficult.
|
||||||
@@ -31,7 +31,7 @@ The following people helped with review comments and inputs for this
|
|||||||
document:
|
document:
|
||||||
Christoph Hellwig <hch@infradead.org>
|
Christoph Hellwig <hch@infradead.org>
|
||||||
Arjan van de Ven <arjanv@redhat.com>
|
Arjan van de Ven <arjanv@redhat.com>
|
||||||
Randy Dunlap <rddunlap@osdl.org>
|
Randy Dunlap <rdunlap@xenotime.net>
|
||||||
Andre Hedrick <andre@linux-ide.org>
|
Andre Hedrick <andre@linux-ide.org>
|
||||||
|
|
||||||
The following people helped with fixes/contributions to the bio patches
|
The following people helped with fixes/contributions to the bio patches
|
||||||
@@ -263,14 +263,8 @@ A flag in the bio structure, BIO_BARRIER is used to identify a barrier i/o.
|
|||||||
The generic i/o scheduler would make sure that it places the barrier request and
|
The generic i/o scheduler would make sure that it places the barrier request and
|
||||||
all other requests coming after it after all the previous requests in the
|
all other requests coming after it after all the previous requests in the
|
||||||
queue. Barriers may be implemented in different ways depending on the
|
queue. Barriers may be implemented in different ways depending on the
|
||||||
driver. A SCSI driver for example could make use of ordered tags to
|
driver. For more details regarding I/O barriers, please read barrier.txt
|
||||||
preserve the necessary ordering with a lower impact on throughput. For IDE
|
in this directory.
|
||||||
this might be two sync cache flush: a pre and post flush when encountering
|
|
||||||
a barrier write.
|
|
||||||
|
|
||||||
There is a provision for queues to indicate what kind of barriers they
|
|
||||||
can provide. This is as of yet unmerged, details will be added here once it
|
|
||||||
is in the kernel.
|
|
||||||
|
|
||||||
1.2.2 Request Priority/Latency
|
1.2.2 Request Priority/Latency
|
||||||
|
|
||||||
|
|||||||
@@ -0,0 +1,82 @@
|
|||||||
|
Block layer statistics in /sys/block/<dev>/stat
|
||||||
|
===============================================
|
||||||
|
|
||||||
|
This file documents the contents of the /sys/block/<dev>/stat file.
|
||||||
|
|
||||||
|
The stat file provides several statistics about the state of block
|
||||||
|
device <dev>.
|
||||||
|
|
||||||
|
Q. Why are there multiple statistics in a single file? Doesn't sysfs
|
||||||
|
normally contain a single value per file?
|
||||||
|
A. By having a single file, the kernel can guarantee that the statistics
|
||||||
|
represent a consistent snapshot of the state of the device. If the
|
||||||
|
statistics were exported as multiple files containing one statistic
|
||||||
|
each, it would be impossible to guarantee that a set of readings
|
||||||
|
represent a single point in time.
|
||||||
|
|
||||||
|
The stat file consists of a single line of text containing 11 decimal
|
||||||
|
values separated by whitespace. The fields are summarized in the
|
||||||
|
following table, and described in more detail below.
|
||||||
|
|
||||||
|
Name            units         description
|
||||||
|
----            -----         -----------
|
||||||
|
read I/Os       requests      number of read I/Os processed
|
||||||
|
read merges     requests      number of read I/Os merged with in-queue I/O
|
||||||
|
read sectors    sectors       number of sectors read
|
||||||
|
read ticks      milliseconds  total wait time for read requests
|
||||||
|
write I/Os      requests      number of write I/Os processed
|
||||||
|
write merges    requests      number of write I/Os merged with in-queue I/O
|
||||||
|
write sectors   sectors       number of sectors written
|
||||||
|
write ticks     milliseconds  total wait time for write requests
|
||||||
|
in_flight       requests      number of I/Os currently in flight
|
||||||
|
io_ticks        milliseconds  total time this block device has been active
|
||||||
|
time_in_queue   milliseconds  total wait time for all requests
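As an illustration of the layout described above, here is a minimal user-space sketch that splits the single stat line into the 11 named fields. The Python identifiers (STAT_FIELDS, parse_block_stat, read_block_stat) are names invented for this example; only the field order and the /sys/block/<dev>/stat path come from the text.

```python
from typing import Dict

# Field order follows the table above. The identifiers are illustrative
# names chosen for this sketch, not names exported by the kernel.
STAT_FIELDS = (
    "read_ios", "read_merges", "read_sectors", "read_ticks",
    "write_ios", "write_merges", "write_sectors", "write_ticks",
    "in_flight", "io_ticks", "time_in_queue",
)


def parse_block_stat(line: str) -> Dict[str, int]:
    """Split one stat line into a dict keyed by the 11 documented fields."""
    values = line.split()
    if len(values) < len(STAT_FIELDS):
        raise ValueError("expected at least %d values, got %d"
                         % (len(STAT_FIELDS), len(values)))
    # zip() stops after the 11 documented fields and ignores any extras.
    return dict(zip(STAT_FIELDS, (int(v) for v in values)))


def read_block_stat(dev: str, sysfs_root: str = "/sys") -> Dict[str, int]:
    """Read and parse <sysfs_root>/block/<dev>/stat in a single read."""
    with open("%s/block/%s/stat" % (sysfs_root, dev)) as f:
        return parse_block_stat(f.read())
```

Reading the whole file in one call matters here: the point of the single-file format is that all values come from the same snapshot.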
|
||||||
|
|
||||||
|
read I/Os, write I/Os
|
||||||
|
=====================
|
||||||
|
|
||||||
|
These values increment when an I/O request completes.
|
||||||
|
|
||||||
|
read merges, write merges
|
||||||
|
=========================
|
||||||
|
|
||||||
|
These values increment when an I/O request is merged with an
|
||||||
|
already-queued I/O request.
|
||||||
|
|
||||||
|
read sectors, write sectors
|
||||||
|
===========================
|
||||||
|
|
||||||
|
These values count the number of sectors read from or written to this
|
||||||
|
block device. The "sectors" in question are the standard UNIX 512-byte
|
||||||
|
sectors, not any device- or filesystem-specific block size. The
|
||||||
|
counters are incremented when the I/O completes.
|
||||||
|
|
||||||
|
read ticks, write ticks
|
||||||
|
=======================
|
||||||
|
|
||||||
|
These values count the number of milliseconds that I/O requests have
|
||||||
|
waited on this block device. If there are multiple I/O requests waiting,
|
||||||
|
these values will increase at a rate greater than 1000/second; for
|
||||||
|
example, if 60 read requests wait for an average of 30 ms, the read_ticks
|
||||||
|
field will increase by 60*30 = 1800.
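The arithmetic in this example can be run in reverse to estimate an average wait per request from two samples of the stat file. The sketch below assumes the caller has already taken two readings; the function name and argument names are hypothetical.

```python
def average_wait_ms(ticks_before: int, ticks_after: int,
                    ios_before: int, ios_after: int) -> float:
    """Average wait per completed request between two samples.

    ticks_* are readings of "read ticks" (or "write ticks") and ios_* are
    the matching "read I/Os" (or "write I/Os") readings.
    """
    completed = ios_after - ios_before
    if completed <= 0:
        # No request completed in the interval, so no average is defined.
        return 0.0
    return (ticks_after - ticks_before) / completed
```

With the figures from the text, a ticks increase of 1800 over 60 completed requests gives back the 30 ms average.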
|
||||||
|
|
||||||
|
in_flight
|
||||||
|
=========
|
||||||
|
|
||||||
|
This value counts the number of I/O requests that have been issued to
|
||||||
|
the device driver but have not yet completed. It does not include I/O
|
||||||
|
requests that are in the queue but not yet issued to the device driver.
|
||||||
|
|
||||||
|
io_ticks
|
||||||
|
========
|
||||||
|
|
||||||
|
This value counts the number of milliseconds during which the device has
|
||||||
|
had I/O requests queued.
|
||||||
|
|
||||||
|
time_in_queue
|
||||||
|
=============
|
||||||
|
|
||||||
|
This value counts the number of milliseconds that I/O requests have waited
|
||||||
|
on this block device. If there are multiple I/O requests waiting, this
|
||||||
|
value will increase as the product of the number of milliseconds times the
|
||||||
|
number of requests waiting (see "read ticks" above for an example).
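Two derived figures are commonly computed from io_ticks and time_in_queue by sampling the stat file twice. This is a sketch under the assumption that both counters are in milliseconds as documented above and that the caller measures the elapsed wall-clock time itself; the function names are illustrative.

```python
def utilisation(io_ticks_delta_ms: int, elapsed_ms: int) -> float:
    """Fraction of the interval during which the device had I/O queued."""
    return io_ticks_delta_ms / elapsed_ms


def average_queue_depth(time_in_queue_delta_ms: int, elapsed_ms: int) -> float:
    """Average number of requests waiting over the interval.

    time_in_queue grows by (milliseconds x requests waiting), so dividing
    its increase by the elapsed milliseconds yields a request count.
    """
    return time_in_queue_delta_ms / elapsed_ms
```

For example, if io_ticks grew by 500 over a 1000 ms interval, the device was busy for half of that interval.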
|
||||||
@@ -136,7 +136,7 @@ changes occur:
|
|||||||
8) void lazy_mmu_prot_update(pte_t pte)
|
8) void lazy_mmu_prot_update(pte_t pte)
|
||||||
This interface is called whenever the protection on
|
This interface is called whenever the protection on
|
||||||
any user PTEs change. This interface provides a notification
|
any user PTEs change. This interface provides a notification
|
||||||
to architecture specific code to take appropiate action.
|
to architecture specific code to take appropriate action.
|
||||||
|
|
||||||
|
|
||||||
Next, we have the cache flushing interfaces. In general, when Linux
|
Next, we have the cache flushing interfaces. In general, when Linux
|
||||||
|
|||||||
@@ -27,6 +27,7 @@ Contents:
|
|||||||
2.2 Powersave
|
2.2 Powersave
|
||||||
2.3 Userspace
|
2.3 Userspace
|
||||||
2.4 Ondemand
|
2.4 Ondemand
|
||||||
|
2.5 Conservative
|
||||||
|
|
||||||
3. The Governor Interface in the CPUfreq Core
|
3. The Governor Interface in the CPUfreq Core
|
||||||
|
|
||||||
@@ -110,9 +111,64 @@ directory.
|
|||||||
|
|
||||||
The CPUfreq govenor "ondemand" sets the CPU depending on the
|
The CPUfreq govenor "ondemand" sets the CPU depending on the
|
||||||
current usage. To do this the CPU must have the capability to
|
current usage. To do this the CPU must have the capability to
|
||||||
switch the frequency very fast.
|
switch the frequency very quickly. There are a number of sysfs file
|
||||||
|
accessible parameters:
|
||||||
|
|
||||||
|
sampling_rate: measured in uS (10^-6 seconds), this is how often you
|
||||||
|
want the kernel to look at the CPU usage and to make decisions on
|
||||||
|
what to do about the frequency. Typically this is set to values of
|
||||||
|
around '10000' or more.
|
||||||
|
|
||||||
|
show_sampling_rate_(min|max): the minimum and maximum sampling rates
|
||||||
|
available that you may set 'sampling_rate' to.
|
||||||
|
|
||||||
|
up_threshold: defines what the average CPU usage between the samplings
|
||||||
|
of 'sampling_rate' needs to be for the kernel to make a decision on
|
||||||
|
whether it should increase the frequency. For example when it is set
|
||||||
|
to its default value of '80' it means that between the checking
|
||||||
|
intervals the CPU needs to be on average more than 80% in use to then
|
||||||
|
decide that the CPU frequency needs to be increased.
|
||||||
|
|
||||||
|
sampling_down_factor: this parameter controls the rate that the CPU
|
||||||
|
makes a decision on when to decrease the frequency. When set to its
|
||||||
|
default value of '5' it means that at 1/5 the sampling_rate the kernel
|
||||||
|
makes a decision to lower the frequency. Five "lower rate" decisions
|
||||||
|
have to be made in a row before the CPU frequency is actually lower.
|
||||||
|
If set to '1' then the frequency decreases as quickly as it increases,
|
||||||
|
if set to '2' it decreases at half the rate of the increase.
|
||||||
|
|
||||||
|
ignore_nice_load: this parameter takes a value of '0' or '1', when set
|
||||||
|
to '0' (its default) then all processes are counted towards the
|
||||||
|
'cpu utilisation' value. When set to '1' then processes that are
|
||||||
|
run with a 'nice' value will not count (and thus be ignored) in the
|
||||||
|
overall usage calculation. This is useful if you are running a CPU
|
||||||
|
intensive calculation on your laptop that you do not care how long it
|
||||||
|
takes to complete as you can 'nice' it and prevent it from taking part
|
||||||
|
in the deciding process of whether to increase your CPU frequency.
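To make the interaction between up_threshold and ignore_nice_load concrete, here is a small model of the usage calculation. It is a simplified sketch, not the governor's code: the four time buckets and both function names are assumptions made for illustration, and sampling_down_factor is left out.

```python
def cpu_usage_percent(user: int, nice: int, system: int, idle: int,
                      ignore_nice_load: int = 0) -> float:
    """Percentage of the sampling interval counted as busy.

    With ignore_nice_load=1, time spent in niced processes is not counted
    as load (assumption: it is treated like idle time).
    """
    busy = user + system
    if not ignore_nice_load:
        busy += nice
    total = user + nice + system + idle
    return 100.0 * busy / total if total else 0.0


def wants_increase(usage_percent: float, up_threshold: int = 80) -> bool:
    """True when average usage is more than up_threshold percent."""
    return usage_percent > up_threshold
```

A niced, CPU-bound job that fills 80% of the interval pushes usage to 95% by default, but only to 15% with ignore_nice_load set to '1', so it no longer triggers a frequency increase.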
|
||||||
|
|
||||||
|
|
||||||
|
2.5 Conservative
|
||||||
|
----------------
|
||||||
|
|
||||||
|
The CPUfreq governor "conservative", much like the "ondemand"
|
||||||
|
governor, sets the CPU depending on the current usage. It differs in
|
||||||
|
behaviour in that it gracefully increases and decreases the CPU speed
|
||||||
|
rather than jumping to max speed the moment there is any load on the
|
||||||
|
CPU. This behaviour is more suitable in a battery powered environment.
|
||||||
|
The governor is tweaked in the same manner as the "ondemand" governor
|
||||||
|
through sysfs with the addition of:
|
||||||
|
|
||||||
|
freq_step: this describes what percentage steps the cpu freq should be
|
||||||
|
increased and decreased smoothly by. By default the cpu frequency will
|
||||||
|
increase in 5% chunks of your maximum cpu frequency. You can change this
|
||||||
|
value to anywhere between 0 and 100 where '0' will effectively lock your
|
||||||
|
CPU at a speed regardless of its load whilst '100' will, in theory, make
|
||||||
|
it behave identically to the "ondemand" governor.
|
||||||
|
|
||||||
|
down_threshold: same as the 'up_threshold' found for the "ondemand"
|
||||||
|
governor but for the opposite direction. For example when set to its
|
||||||
|
default value of '20' it means that the CPU usage needs to be below
|
||||||
|
20% between samples to have the frequency decreased.
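The stepping behaviour described for the "conservative" governor can be modelled as below. This is a sketch under stated assumptions: frequencies are in kHz, one call corresponds to one sampling decision, and sampling_down_factor is ignored. The function and parameter names other than up_threshold, down_threshold and freq_step are invented for the example.

```python
def conservative_next_freq(cur_khz: int, min_khz: int, max_khz: int,
                           usage_percent: float,
                           up_threshold: int = 80,
                           down_threshold: int = 20,
                           freq_step: int = 5) -> int:
    """Return the frequency after one sampling decision."""
    # freq_step is a percentage of the maximum frequency.
    step = max_khz * freq_step // 100
    if usage_percent > up_threshold:
        return min(cur_khz + step, max_khz)
    if usage_percent < down_threshold:
        return max(cur_khz - step, min_khz)
    return cur_khz
```

With freq_step=0 the step is zero and the frequency never moves. With freq_step=100 a single decision above up_threshold jumps straight to the maximum.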
|
||||||
|
|
||||||
3. The Governor Interface in the CPUfreq Core
|
3. The Governor Interface in the CPUfreq Core
|
||||||
=============================================
|
=============================================
|
||||||
|
|||||||
@@ -0,0 +1,357 @@
|
|||||||
|
CPU hotplug Support in Linux(tm) Kernel
|
||||||
|
|
||||||
|
Maintainers:
|
||||||
|
CPU Hotplug Core:
|
||||||
|
Rusty Russell <rusty@rustycorp.com.au>
|
||||||
|
Srivatsa Vaddagiri <vatsa@in.ibm.com>
|
||||||
|
i386:
|
||||||
|
Zwane Mwaikambo <zwane@arm.linux.org.uk>
|
||||||
|
ppc64:
|
||||||
|
Nathan Lynch <nathanl@austin.ibm.com>
|
||||||
|
Joel Schopp <jschopp@austin.ibm.com>
|
||||||
|
ia64/x86_64:
|
||||||
|
Ashok Raj <ashok.raj@intel.com>
|
||||||
|
|
||||||
|
Authors: Ashok Raj <ashok.raj@intel.com>
|
||||||
|
Lots of feedback: Nathan Lynch <nathanl@austin.ibm.com>,
|
||||||
|
Joel Schopp <jschopp@austin.ibm.com>
|
||||||
|
|
||||||
|
Introduction
|
||||||
|
|
||||||
|
Modern advances in system architectures have introduced advanced error
|
||||||
|
reporting and correction capabilities in processors. CPU architectures permit
|
||||||
|
partitioning support, where compute resources of a single CPU could be made
|
||||||
|
available to virtual machine environments. There are a couple of OEMs that
|
||||||
|
support NUMA hardware which are hot pluggable as well, where physical
|
||||||
|
node insertion and removal require support for CPU hotplug.
|
||||||
|
|
||||||
|
Such advances require CPUs available to a kernel to be removed either for
|
||||||
|
provisioning reasons, or for RAS purposes to keep an offending CPU off
|
||||||
|
system execution path. Hence the need for CPU hotplug support in the
|
||||||
|
Linux kernel.
|
||||||
|
|
||||||
|
A more novel use of CPU-hotplug support is its use today in suspend
|
||||||
|
resume support for SMP. Dual-core and HT support makes even
|
||||||
|
a laptop run SMP kernels which didn't support these methods. SMP support
|
||||||
|
for suspend/resume is a work in progress.
|
||||||
|
|
||||||
|
General Stuff about CPU Hotplug
|
||||||
|
--------------------------------
|
||||||
|
|
||||||
|
Command Line Switches
|
||||||
|
---------------------
|
||||||
|
maxcpus=n Restrict boot time cpus to n. Say if you have 4 cpus, using
|
||||||
|
maxcpus=2 will only boot 2. You can choose to bring the
|
||||||
|
other cpus online later; read the FAQs for more info.
|
||||||
|
|
||||||
|
additional_cpus=n [x86_64 only] use this to limit hotpluggable cpus.
|
||||||
|
This option sets
|
||||||
|
cpu_possible_map = cpu_present_map + additional_cpus
|
||||||
|
|
||||||
|
CPU maps and such
|
||||||
|
-----------------
|
||||||
|
[More on cpumaps and primitive to manipulate, please check
|
||||||
|
include/linux/cpumask.h that has more descriptive text.]
|
||||||
|
|
||||||
|
cpu_possible_map: Bitmap of possible CPUs that can ever be available in the
|
||||||
|
system. This is used to allocate some boot time memory for per_cpu variables
|
||||||
|
that aren't designed to grow/shrink as CPUs are made available or removed.
|
||||||
|
Once set during boot time discovery phase, the map is static, i.e no bits
|
||||||
|
are added or removed anytime. Trimming it accurately for your system needs
|
||||||
|
upfront can save some boot time memory. See below for how we use heuristics
|
||||||
|
in x86_64 case to keep this under check.
|
||||||
|
|
||||||
|
cpu_online_map: Bitmap of all CPUs currently online. It's set in __cpu_up()
|
||||||
|
after a cpu is available for kernel scheduling and ready to receive
|
||||||
|
interrupts from devices. It's cleared when a cpu is brought down using
|
||||||
|
__cpu_disable(), before which all OS services including interrupts are
|
||||||
|
migrated to another target CPU.
|
||||||
|
|
||||||
|
cpu_present_map: Bitmap of CPUs currently present in the system. Not all
|
||||||
|
of them may be online. When physical hotplug is processed by the relevant
|
||||||
|
subsystem (e.g. ACPI), the map can change and a new bit is either added or removed
|
||||||
|
from the map depending on whether the event is hot-add or hot-remove. There are currently
|
||||||
|
no locking rules as of now. Typical usage is to init topology during boot,
|
||||||
|
at which time hotplug is disabled.
|
||||||
|
|
||||||
|
You really don't need to manipulate any of the system cpu maps. They should
|
||||||
|
be read-only for most use. When setting up per-cpu resources almost always use
|
||||||
|
cpu_possible_map/for_each_cpu() to iterate.
|
||||||
|
|
||||||
|
Never use anything other than cpumask_t to represent bitmap of CPUs.
|
||||||
|
|
||||||
|
#include <linux/cpumask.h>
|
||||||
|
|
||||||
|
for_each_cpu - Iterate over cpu_possible_map
|
||||||
|
for_each_online_cpu - Iterate over cpu_online_map
|
||||||
|
for_each_present_cpu - Iterate over cpu_present_map
|
||||||
|
for_each_cpu_mask(x,mask) - Iterate over some random collection of cpu mask.
|
||||||
|
|
||||||
|
#include <linux/cpu.h>
|
||||||
|
lock_cpu_hotplug() and unlock_cpu_hotplug():
|
||||||
|
|
||||||
|
The above calls are used to inhibit cpu hotplug operations. While holding the
|
||||||
|
cpucontrol mutex, cpu_online_map will not change. If you merely need to avoid
|
||||||
|
cpus going away, you could also use preempt_disable() and preempt_enable()
|
||||||
|
for those sections. Just remember the critical section cannot call any
|
||||||
|
function that can sleep or schedule this process away. The preempt_disable()
|
||||||
|
will work as long as stop_machine_run() is used to take a cpu down.
|
||||||
|
|
||||||
|
CPU Hotplug - Frequently Asked Questions.
|
||||||
|
|
||||||
|
Q: How do I enable my kernel to support CPU hotplug?
|
||||||
|
A: When doing make defconfig, Enable CPU hotplug support
|
||||||
|
|
||||||
|
"Processor type and Features" -> Support for Hotpluggable CPUs
|
||||||
|
|
||||||
|
Make sure that you have CONFIG_HOTPLUG, and CONFIG_SMP turned on as well.
|
||||||
|
|
||||||
|
You would need to enable CONFIG_HOTPLUG_CPU for SMP suspend/resume support
|
||||||
|
as well.
|
||||||
|
|
||||||
|
Q: What architectures support CPU hotplug?
|
||||||
|
A: As of 2.6.14, the following architectures support CPU hotplug.
|
||||||
|
|
||||||
|
i386 (Intel), ppc, ppc64, parisc, s390, ia64 and x86_64
|
||||||
|
|
||||||
|
Q: How to test if hotplug is supported on the newly built kernel?
|
||||||
|
A: You should now notice an entry in sysfs.
|
||||||
|
|
||||||
|
Check if sysfs is mounted, using the "mount" command. You should notice
|
||||||
|
an entry as shown below in the output.
|
||||||
|
|
||||||
|
....
|
||||||
|
none on /sys type sysfs (rw)
|
||||||
|
....
|
||||||
|
|
||||||
|
if this is not mounted, do the following.
|
||||||
|
|
||||||
|
#mkdir /sys
|
||||||
|
#mount -t sysfs sys /sys
|
||||||
|
|
||||||
|
now you should see entries for all present cpu, the following is an example
|
||||||
|
in an 8-way system.
|
||||||
|
|
||||||
|
#pwd
|
||||||
|
#/sys/devices/system/cpu
|
||||||
|
#ls -l
|
||||||
|
total 0
|
||||||
|
drwxr-xr-x 10 root root 0 Sep 19 07:44 .
|
||||||
|
drwxr-xr-x 13 root root 0 Sep 19 07:45 ..
|
||||||
|
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu0
|
||||||
|
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu1
|
||||||
|
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu2
|
||||||
|
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu3
|
||||||
|
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu4
|
||||||
|
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu5
|
||||||
|
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu6
|
||||||
|
drwxr-xr-x 3 root root 0 Sep 19 07:48 cpu7
|
||||||
|
|
||||||
|
Under each directory you would find an "online" file which is the control
|
||||||
|
file to logically online/offline a processor.
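The directory listing above can also be inspected programmatically. The following sketch walks the cpuN directories and reports the content of each "online" control file; the function name and the use of None for a missing file are choices made for this example.

```python
import os
import re
from typing import Dict, Optional


def list_cpu_states(cpu_dir: str = "/sys/devices/system/cpu") -> Dict[int, Optional[bool]]:
    """Map CPU number to True (online), False (offline) or None.

    None means the cpuN directory has no "online" file.
    """
    states: Dict[int, Optional[bool]] = {}
    for name in os.listdir(cpu_dir):
        match = re.fullmatch(r"cpu(\d+)", name)
        if not match:
            continue  # skip entries that are not cpuN directories
        try:
            with open(os.path.join(cpu_dir, name, "online")) as f:
                states[int(match.group(1))] = f.read().strip() == "1"
        except FileNotFoundError:
            states[int(match.group(1))] = None
    return states
```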
|
||||||
|
|
||||||
|
Q: Does hot-add/hot-remove refer to physical add/remove of cpus?
|
||||||
|
A: The terms hot-add/remove may not be used very consistently in the code.
|
||||||
|
CONFIG_HOTPLUG_CPU enables logical online/offline capability in the kernel.
|
||||||
|
To support physical addition/removal, one would need some BIOS hooks and
|
||||||
|
the platform should have something like an attention button in PCI hotplug.
|
||||||
|
CONFIG_ACPI_HOTPLUG_CPU enables ACPI support for physical add/remove of CPUs.
|
||||||
|
|
||||||
|
Q: How do i logically offline a CPU?
|
||||||
|
A: Do the following.
|
||||||
|
|
||||||
|
#echo 0 > /sys/devices/system/cpu/cpuX/online
|
||||||
|
|
||||||
|
once the logical offline is successful, check
|
||||||
|
|
||||||
|
#cat /proc/interrupts
|
||||||
|
|
||||||
|
you should now not see the CPU that you removed. Also online file will report
|
||||||
|
the state as 0 when a cpu is offline and 1 when it's online.
|
||||||
|
|
||||||
|
#To display the current cpu state.
|
||||||
|
#cat /sys/devices/system/cpu/cpuX/online
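The same echo/cat sequence can be written as a small helper. This is a sketch: the function name is hypothetical, the cpu_dir parameter exists only so the helper can be pointed at a scratch directory, and on a real system the write needs root privileges.

```python
import os


def set_cpu_online(cpu: int, online: bool,
                   cpu_dir: str = "/sys/devices/system/cpu") -> bool:
    """Write 0 or 1 to cpuN/online, then read the state back."""
    path = os.path.join(cpu_dir, "cpu%d" % cpu, "online")
    with open(path, "w") as f:
        f.write("1" if online else "0")
    # Read back what the file now reports, like the "cat" step above.
    with open(path) as f:
        return f.read().strip() == "1"
```

The return value is the state reported after the write, so `set_cpu_online(3, False)` returns False once cpu3 is offline.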
|
||||||
|
|
||||||
|
Q: Why can't I remove CPU0 on some systems?
|
||||||
|
A: Some architectures may have some special dependency on a certain CPU.
|
||||||
|
|
||||||
|
For example, on IA64 platforms we have the ability to send platform interrupts to the
|
||||||
|
OS, a.k.a. Corrected Platform Error Interrupts (CPEI). In current ACPI
|
||||||
|
specifications, we didn't have a way to change the target CPU. Hence if the
|
||||||
|
current ACPI version doesn't support such re-direction, we disable that CPU
|
||||||
|
by making it not-removable.
|
||||||
|
|
||||||
|
In such cases you will also notice that the online file is missing under cpu0.
|
||||||
|
|
||||||
|
Q: How do i find out if a particular CPU is not removable?
|
||||||
|
A: Depending on the implementation, some architectures may show this by the
|
||||||
|
absence of the "online" file. This is done if it can be determined ahead of
|
||||||
|
time that this CPU cannot be removed.
|
||||||
|
|
||||||
|
In some situations, this can be a run time check, i.e if you try to remove the
|
||||||
|
last CPU, this will not be permitted. You can find such failures by
|
||||||
|
investigating the return value of the "echo" command.
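Both checks described in this answer, the missing "online" file and the failing write, can be combined in one helper. The sketch below is illustrative only: the function name and the returned status strings are invented, and the exact error a kernel reports for a refused offline is not specified here.

```python
import os
from typing import Optional


def try_offline(cpu: int, cpu_dir: str = "/sys/devices/system/cpu") -> Optional[str]:
    """Try to offline a CPU. Return None on success, else a reason string."""
    path = os.path.join(cpu_dir, "cpu%d" % cpu, "online")
    if not os.path.exists(path):
        # Ahead-of-time case: the control file is absent.
        return "not-removable"
    try:
        with open(path, "w") as f:
            f.write("0")
    except OSError as exc:
        # Run-time case: the write failed, as a failing "echo" would.
        return "refused: %s" % os.strerror(exc.errno)
    return None
```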
|
||||||
|
|
||||||
|
Q: What happens when a CPU is being logically offlined?
|
||||||
|
A: The following happen, listed in no particular order :-)
|
||||||
|
|
||||||
|
- A notification is sent to in-kernel registered modules by sending an event
|
||||||
|
CPU_DOWN_PREPARE
|
||||||
|
- All processes are migrated away from this outgoing CPU to a new CPU
|
||||||
|
- All interrupts targeted to this CPU are migrated to a new CPU
|
||||||
|
- timers, bottom halves and tasklets are also migrated to a new CPU
|
||||||
|
- Once all services are migrated, kernel calls an arch specific routine
|
||||||
|
__cpu_disable() to perform arch specific cleanup.
|
||||||
|
- Once this is successful, an event for successful cleanup is sent by an event
|
||||||
|
CPU_DEAD.
|
||||||
|
|
||||||
|
"It is expected that each service cleans up when the CPU_DOWN_PREPARE
|
||||||
|
notifier is called; when CPU_DEAD is called it's expected there is nothing
|
||||||
|
running on behalf of this CPU that was offlined"
|
||||||
|
|
||||||
|
Q: If i have some kernel code that needs to be aware of CPU arrival and
|
||||||
|
departure, how do I arrange for proper notification?
|
||||||
|
A: This is what you would need in your kernel code to receive notifications.
|
||||||
|
|
||||||
|
#include <linux/cpu.h>
|
||||||
|
static int __cpuinit foobar_cpu_callback(struct notifier_block *nfb,
|
||||||
|
unsigned long action, void *hcpu)
|
||||||
|
{
|
||||||
|
unsigned int cpu = (unsigned long)hcpu;
|
||||||
|
|
||||||
|
switch (action) {
|
||||||
|
case CPU_ONLINE:
|
||||||
|
foobar_online_action(cpu);
|
||||||
|
break;
|
||||||
|
case CPU_DEAD:
|
||||||
|
foobar_dead_action(cpu);
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
return NOTIFY_OK;
|
||||||
|
}
|
||||||
|
|
||||||
|
static struct notifier_block foobar_cpu_notifier =
|
||||||
|
{
|
||||||
|
.notifier_call = foobar_cpu_callback,
|
||||||
|
};
|
||||||
|
|
||||||
|
|
||||||
|
In your init function,
|
||||||
|
|
||||||
|
register_cpu_notifier(&foobar_cpu_notifier);
|
||||||
|
|
||||||
|
You can fail PREPARE notifiers if something doesn't work to prepare resources.
|
||||||
|
This will stop the activity and send a following CANCELED event back.
|
||||||
|
|
||||||
|
CPU_DEAD should not be failed, it's just a goodness indication, but bad
|
||||||
|
things will happen if a notifier in path sent a BAD notify code.
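The PREPARE/CANCELED behaviour can be pictured with a toy notifier chain. This is a user-space model written for illustration, not kernel code: the class and function names are invented, the event names are plain strings, and "CPU_UP_CANCELED" is an assumed spelling of the CANCELED event mentioned above.

```python
NOTIFY_OK = "NOTIFY_OK"
NOTIFY_BAD = "NOTIFY_BAD"


class NotifierChain:
    """Toy model: an ordered list of callbacks taking (action, cpu)."""

    def __init__(self):
        self.callbacks = []

    def register(self, callback):
        self.callbacks.append(callback)

    def call(self, action, cpu):
        # Stop at the first callback that reports failure.
        for callback in self.callbacks:
            if callback(action, cpu) == NOTIFY_BAD:
                return NOTIFY_BAD
        return NOTIFY_OK


def bring_up(chain, cpu):
    """Model of a CPU bring-up: PREPARE may fail, which sends CANCELED."""
    if chain.call("CPU_UP_PREPARE", cpu) == NOTIFY_BAD:
        chain.call("CPU_UP_CANCELED", cpu)
        return False
    chain.call("CPU_ONLINE", cpu)
    return True
```

In this model a callback that returns NOTIFY_BAD for the prepare step stops the bring-up, and every registered callback then sees the canceled event so it can release what it prepared.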
|
||||||
|
|
||||||
|
Q: I don't see my action being called for all CPUs already up and running?
|
||||||
|
A: Yes, CPU notifiers are called only when new CPUs are on-lined or offlined.
|
||||||
|
If you need to perform some action for each cpu already in the system, then
|
||||||
|
|
||||||
|
for_each_online_cpu(i) {
|
||||||
|
foobar_cpu_callback(&foobar_cpu_notifier, CPU_UP_PREPARE, (void *)(long)i);
|
||||||
|
foobar_cpu_callback(&foobar_cpu_notifier, CPU_ONLINE, (void *)(long)i);
|
||||||
|
}
|
||||||
|
|
||||||
|
Q: If i would like to develop cpu hotplug support for a new architecture,
|
||||||
|
what do i need at a minimum?
|
||||||
|
A: The following are required for CPU hotplug infrastructure to work
|
||||||
|
correctly.
|
||||||
|
|
||||||
|
- Make sure you have an entry in Kconfig to enable CONFIG_HOTPLUG_CPU
|
||||||
|
- __cpu_up() - Arch interface to bring up a CPU
|
||||||
|
- __cpu_disable() - Arch interface to shutdown a CPU, no more interrupts
|
||||||
|
can be handled by the kernel after the routine
|
||||||
|
returns. Including local APIC timers etc are
|
||||||
|
shutdown.
|
||||||
|
- __cpu_die() - This is actually supposed to ensure death of the CPU.
|
||||||
|
Actually look at some example code in other arch
|
||||||
|
that implement CPU hotplug. The processor is taken
|
||||||
|
down from the idle() loop for that specific
|
||||||
|
architecture. __cpu_die() typically waits for some
|
||||||
|
per_cpu state to be set, to ensure the processor
|
||||||
|
dead routine is called to be sure positively.
|
||||||
|
|
||||||
|
Q: I need to ensure that a particular cpu is not removed when there is some
|
||||||
|
work specific to this cpu in progress.
|
||||||
|
A: First switch the current thread context to preferred cpu
|
||||||
|
|
||||||
|
int my_func_on_cpu(int cpu)
|
||||||
|
{
|
||||||
|
cpumask_t saved_mask, new_mask = CPU_MASK_NONE;
|
||||||
|
int curr_cpu, err = 0;
|
||||||
|
|
||||||
|
saved_mask = current->cpus_allowed;
|
||||||
|
cpu_set(cpu, new_mask);
|
||||||
|
err = set_cpus_allowed(current, new_mask);
|
||||||
|
|
||||||
|
if (err)
|
||||||
|
return err;
|
||||||
|
|
||||||
|
/*
|
||||||
|
* If we got scheduled out just after the return from
|
||||||
|
* set_cpus_allowed() before running the work, this ensures
|
||||||
|
* we stay locked.
|
||||||
|
*/
|
||||||
|
curr_cpu = get_cpu();
|
||||||
|
|
||||||
|
if (curr_cpu != cpu) {
|
||||||
|
err = -EAGAIN;
|
||||||
|
goto ret;
|
||||||
|
} else {
|
||||||
|
/*
|
||||||
|
* Do work : But can't sleep, since get_cpu() disables preempt
|
||||||
|
*/
|
||||||
|
}
|
||||||
|
ret:
|
||||||
|
put_cpu();
|
||||||
|
set_cpus_allowed(current, saved_mask);
|
||||||
|
return err;
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
Q: How do we determine how many CPUs are available for hotplug?
|
||||||
|
A: There is no clear spec defined way from ACPI that can give us that
|
||||||
|
information today. Based on some input from Natalie of Unisys,
|
||||||
|
the ACPI MADT (Multiple APIC Description Table) marks those possible
|
||||||
|
CPUs in a system with disabled status.
|
||||||
|
|
||||||
|
Andi implemented some simple heuristics that count the number of disabled
|
||||||
|
CPUs in MADT as hotpluggable CPUS. In the case there are no disabled CPUS
|
||||||
|
we assume 1/2 the number of CPUs currently present can be hotplugged.
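The heuristic in the paragraph above fits in a few lines. This is a sketch of the stated rule only: the function names are invented, rounding down for an odd number of present CPUs is an assumption, and the additional_cpus argument mirrors the boot option described earlier in this file.

```python
from typing import Optional


def estimate_hotpluggable(present: int, disabled_in_madt: int) -> int:
    """Number of CPUs assumed to be hotpluggable."""
    if disabled_in_madt > 0:
        return disabled_in_madt
    # No disabled entries: assume half of the present CPUs (rounded down).
    return present // 2


def possible_cpus(present: int, disabled_in_madt: int,
                  additional_cpus: Optional[int] = None) -> int:
    """Size of cpu_possible_map: present CPUs plus hotpluggable ones."""
    if additional_cpus is not None:
        extra = additional_cpus  # explicit override, as with additional_cpus=n
    else:
        extra = estimate_hotpluggable(present, disabled_in_madt)
    return present + extra
```

A 4-CPU system with no disabled MADT entries would therefore size the possible map for 6 CPUs unless additional_cpus says otherwise.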
|
||||||
|
|
||||||
|
Caveat: Today's ACPI MADT can only provide 256 entries since the apicid field
|
||||||
|
in MADT is only 8 bits.
|
||||||
|
|
||||||
|
User Space Notification
|
||||||
|
|
||||||
|
Hotplug support for devices is common in Linux today. It's being used today to
|
||||||
|
support automatic configuration of network, usb and pci devices. A hotplug
|
||||||
|
event can be used to invoke an agent script to perform the configuration task.
|
||||||
|
|
||||||
|
You can add /etc/hotplug/cpu.agent to handle hotplug notification user space
|
||||||
|
scripts.
|
||||||
|
|
||||||
|
#!/bin/bash
|
||||||
|
# $Id: cpu.agent
|
||||||
|
# Kernel hotplug params include:
|
||||||
|
#ACTION=%s [online or offline]
|
||||||
|
#DEVPATH=%s
|
||||||
|
#
|
||||||
|
cd /etc/hotplug
|
||||||
|
. ./hotplug.functions
|
||||||
|
|
||||||
|
case $ACTION in
|
||||||
|
online)
|
||||||
|
echo `date` ":cpu.agent" add cpu >> /tmp/hotplug.txt
|
||||||
|
;;
|
||||||
|
offline)
|
||||||
|
echo `date` ":cpu.agent" remove cpu >>/tmp/hotplug.txt
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
debug_mesg CPU $ACTION event not supported
|
||||||
|
exit 1
|
||||||
|
;;
|
||||||
|
esac
|
||||||
+138
-27
@@ -14,7 +14,10 @@ CONTENTS:
|
|||||||
1.1 What are cpusets ?
|
1.1 What are cpusets ?
|
||||||
1.2 Why are cpusets needed ?
|
1.2 Why are cpusets needed ?
|
||||||
1.3 How are cpusets implemented ?
|
1.3 How are cpusets implemented ?
|
||||||
1.4 How do I use cpusets ?
|
1.4 What are exclusive cpusets ?
|
||||||
|
1.5 What does notify_on_release do ?
|
||||||
|
1.6 What is memory_pressure ?
|
||||||
|
1.7 How do I use cpusets ?
|
||||||
2. Usage Examples and Syntax
|
2. Usage Examples and Syntax
|
||||||
2.1 Basic Usage
|
2.1 Basic Usage
|
||||||
2.2 Adding/removing cpus
|
2.2 Adding/removing cpus
|
||||||
@@ -49,29 +52,6 @@ its cpus_allowed vector, and the kernel page allocator will not
|
|||||||
allocate a page on a node that is not allowed in the requesting tasks
|
allocate a page on a node that is not allowed in the requesting tasks
|
||||||
mems_allowed vector.
|
mems_allowed vector.
|
||||||
|
|
||||||
If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct
|
|
||||||
ancestor or descendent, may share any of the same CPUs or Memory Nodes.
|
|
||||||
A cpuset that is cpu exclusive has a sched domain associated with it.
|
|
||||||
The sched domain consists of all cpus in the current cpuset that are not
|
|
||||||
part of any exclusive child cpusets.
|
|
||||||
This ensures that the scheduler load balacing code only balances
|
|
||||||
against the cpus that are in the sched domain as defined above and not
|
|
||||||
all of the cpus in the system. This removes any overhead due to
|
|
||||||
load balancing code trying to pull tasks outside of the cpu exclusive
|
|
||||||
cpuset only to be prevented by the tasks' cpus_allowed mask.
|
|
||||||
|
|
||||||
A cpuset that is mem_exclusive restricts kernel allocations for
|
|
||||||
page, buffer and other data commonly shared by the kernel across
|
|
||||||
multiple users. All cpusets, whether mem_exclusive or not, restrict
|
|
||||||
allocations of memory for user space. This enables configuring a
|
|
||||||
system so that several independent jobs can share common kernel
|
|
||||||
data, such as file system pages, while isolating each jobs user
|
|
||||||
allocation in its own cpuset. To do this, construct a large
|
|
||||||
mem_exclusive cpuset to hold all the jobs, and construct child,
|
|
||||||
non-mem_exclusive cpusets for each individual job. Only a small
|
|
||||||
amount of typical kernel memory, such as requests from interrupt
|
|
||||||
handlers, is allowed to be taken outside even a mem_exclusive cpuset.
|
|
||||||
|
|
||||||
User level code may create and destroy cpusets by name in the cpuset
|
User level code may create and destroy cpusets by name in the cpuset
|
||||||
virtual file system, manage the attributes and permissions of these
|
virtual file system, manage the attributes and permissions of these
|
||||||
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
|
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
|
||||||
@@ -155,7 +135,7 @@ Cpusets extends these two mechanisms as follows:
|
|||||||
The implementation of cpusets requires a few, simple hooks
|
The implementation of cpusets requires a few, simple hooks
|
||||||
into the rest of the kernel, none in performance critical paths:
|
into the rest of the kernel, none in performance critical paths:
|
||||||
|
|
||||||
- in main/init.c, to initialize the root cpuset at system boot.
|
- in init/main.c, to initialize the root cpuset at system boot.
|
||||||
- in fork and exit, to attach and detach a task from its cpuset.
|
- in fork and exit, to attach and detach a task from its cpuset.
|
||||||
- in sched_setaffinity, to mask the requested CPUs by what's
|
- in sched_setaffinity, to mask the requested CPUs by what's
|
||||||
allowed in that tasks cpuset.
|
allowed in that tasks cpuset.
|
||||||
@@ -166,7 +146,7 @@ into the rest of the kernel, none in performance critical paths:
|
|||||||
and related changes in both sched.c and arch/ia64/kernel/domain.c
|
and related changes in both sched.c and arch/ia64/kernel/domain.c
|
||||||
- in the mbind and set_mempolicy system calls, to mask the requested
|
- in the mbind and set_mempolicy system calls, to mask the requested
|
||||||
Memory Nodes by what's allowed in that tasks cpuset.
|
Memory Nodes by what's allowed in that tasks cpuset.
|
||||||
- in page_alloc, to restrict memory to allowed nodes.
|
- in page_alloc.c, to restrict memory to allowed nodes.
|
||||||
- in vmscan.c, to restrict page recovery to the current cpuset.
|
- in vmscan.c, to restrict page recovery to the current cpuset.
|
||||||
|
|
||||||
In addition a new file system, of type "cpuset" may be mounted,
|
In addition a new file system, of type "cpuset" may be mounted,
|
||||||
@@ -192,9 +172,15 @@ containing the following files describing that cpuset:
|
|||||||
|
|
||||||
- cpus: list of CPUs in that cpuset
|
- cpus: list of CPUs in that cpuset
|
||||||
- mems: list of Memory Nodes in that cpuset
|
- mems: list of Memory Nodes in that cpuset
|
||||||
|
- memory_migrate flag: if set, move pages to cpusets nodes
|
||||||
- cpu_exclusive flag: is cpu placement exclusive?
|
- cpu_exclusive flag: is cpu placement exclusive?
|
||||||
- mem_exclusive flag: is memory placement exclusive?
|
- mem_exclusive flag: is memory placement exclusive?
|
||||||
- tasks: list of tasks (by pid) attached to that cpuset
|
- tasks: list of tasks (by pid) attached to that cpuset
|
||||||
|
- notify_on_release flag: run /sbin/cpuset_release_agent on exit?
|
||||||
|
- memory_pressure: measure of how much paging pressure in cpuset
|
||||||
|
|
||||||
|
In addition, the root cpuset only has the following file:
|
||||||
|
- memory_pressure_enabled flag: compute memory_pressure?
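Since each cpuset is a directory holding the files listed above, its state can be collected with ordinary file reads. The sketch below is illustrative: the function name is hypothetical, values are returned as raw strings without interpreting their format, and the caller supplies the cpuset directory because the mount point is not assumed here.

```python
import os
from typing import Dict, Optional

# File names taken from the list above.
CPUSET_FILES = (
    "cpus", "mems", "memory_migrate", "cpu_exclusive", "mem_exclusive",
    "tasks", "notify_on_release", "memory_pressure",
)


def read_cpuset(cpuset_dir: str) -> Dict[str, Optional[str]]:
    """Return the stripped content of each per-cpuset file, or None if absent."""
    info: Dict[str, Optional[str]] = {}
    for name in CPUSET_FILES:
        try:
            with open(os.path.join(cpuset_dir, name)) as f:
                info[name] = f.read().strip()
        except FileNotFoundError:
            info[name] = None
    return info
```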
|
||||||
|
|
||||||
New cpusets are created using the mkdir system call or shell
|
New cpusets are created using the mkdir system call or shell
|
||||||
command. The properties of a cpuset, such as its flags, allowed
|
command. The properties of a cpuset, such as its flags, allowed
|
||||||
@@ -228,7 +214,108 @@ exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
|
|||||||
to represent the cpuset hierarchy provides for a familiar permission
|
to represent the cpuset hierarchy provides for a familiar permission
|
||||||
and name space for cpusets, with a minimum of additional kernel code.
|
and name space for cpusets, with a minimum of additional kernel code.
|
||||||
|
|
||||||
1.4 How do I use cpusets ?
|
|
||||||
|
1.4 What are exclusive cpusets ?
|
||||||
|
--------------------------------
|
||||||
|
|
||||||
|
If a cpuset is cpu or mem exclusive, no other cpuset, other than
|
||||||
|
a direct ancestor or descendent, may share any of the same CPUs or
|
||||||
|
Memory Nodes.
|
||||||
|
|
||||||
|
A cpuset that is cpu_exclusive has a scheduler (sched) domain
|
||||||
|
associated with it. The sched domain consists of all CPUs in the
|
||||||
|
current cpuset that are not part of any exclusive child cpusets.
|
||||||
|
This ensures that the scheduler load balancing code only balances
|
||||||
|
against the CPUs that are in the sched domain as defined above and
|
||||||
|
not all of the CPUs in the system. This removes any overhead due to
|
||||||
|
load balancing code trying to pull tasks outside of the cpu_exclusive
|
||||||
|
cpuset only to be prevented by the tasks' cpus_allowed mask.
|
||||||
|
|
||||||
|
A cpuset that is mem_exclusive restricts kernel allocations for
|
||||||
|
page, buffer and other data commonly shared by the kernel across
|
||||||
|
multiple users. All cpusets, whether mem_exclusive or not, restrict
|
||||||
|
allocations of memory for user space. This enables configuring a
|
||||||
|
system so that several independent jobs can share common kernel data,
|
||||||
|
such as file system pages, while isolating each job's user allocation in
|
||||||
|
its own cpuset. To do this, construct a large mem_exclusive cpuset to
|
||||||
|
hold all the jobs, and construct child, non-mem_exclusive cpusets for
|
||||||
|
each individual job. Only a small amount of typical kernel memory,
|
||||||
|
such as requests from interrupt handlers, is allowed to be taken
|
||||||
|
outside even a mem_exclusive cpuset.
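
The parent/child layout described above can be sketched as follows.
This is a minimal sketch, not a tested recipe: the function names, the
"batch" and job names, and the "0-3" node list are hypothetical, and it
assumes the cpuset file system is mounted at the directory passed as
'root'. On a real system the kernel validates each write and may
reject it.

```python
import os

def write_flag(cpuset_dir, name, value):
    """Write one value to a control file inside a cpuset directory."""
    with open(os.path.join(cpuset_dir, name), "w") as f:
        f.write(str(value))

def make_job_cpusets(root, parent_name, mems, job_names):
    """Create one mem_exclusive parent cpuset holding all jobs, plus one
    non-mem_exclusive child cpuset per job (hypothetical helper)."""
    parent = os.path.join(root, parent_name)
    os.mkdir(parent)                       # mkdir creates a new cpuset
    write_flag(parent, "mems", mems)
    write_flag(parent, "mem_exclusive", 1)
    children = []
    for job in job_names:
        child = os.path.join(parent, job)
        os.mkdir(child)
        write_flag(child, "mems", mems)    # children share the parent's nodes
        write_flag(child, "mem_exclusive", 0)
        children.append(child)
    return parent, children
```

For example, make_job_cpusets("/dev/cpuset", "batch", "0-3",
["job1", "job2"]) would create /dev/cpuset/batch and two child
cpusets under it.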
|
||||||
|
|
||||||
|
|
||||||
|
1.5 What does notify_on_release do ?
|
||||||
|
------------------------------------
|
||||||
|
|
||||||
|
If the notify_on_release flag is enabled (1) in a cpuset, then whenever
|
||||||
|
the last task in the cpuset leaves (exits or attaches to some other
|
||||||
|
cpuset) and the last child cpuset of that cpuset is removed, then
|
||||||
|
the kernel runs the command /sbin/cpuset_release_agent, supplying the
|
||||||
|
pathname (relative to the mount point of the cpuset file system) of the
|
||||||
|
abandoned cpuset. This enables automatic removal of abandoned cpusets.
|
||||||
|
The default value of notify_on_release in the root cpuset at system
|
||||||
|
boot is disabled (0). The default value of other cpusets at creation
|
||||||
|
is the current value of their parent's notify_on_release setting.
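
A release agent that removes abandoned cpusets might look like the
sketch below. It is only an illustration: the kernel just passes the
relative pathname as the first argument, and everything else here (the
function name, the assumed /dev/cpuset mount point, ignoring failures)
is an assumption.

```python
import os
import sys

CPUSET_MOUNT = "/dev/cpuset"   # assumed mount point of the cpuset fs

def remove_abandoned(mount, rel_path):
    """Remove the cpuset named by rel_path (relative to the mount point).
    Returns True on success, False if the directory could not be removed,
    for instance because a task or child cpuset reappeared."""
    target = os.path.join(mount, rel_path.lstrip("/"))
    try:
        os.rmdir(target)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    # The kernel supplies the abandoned cpuset's pathname as argv[1].
    if len(sys.argv) == 2:
        remove_abandoned(CPUSET_MOUNT, sys.argv[1])
```

Installed as /sbin/cpuset_release_agent, such a script would be run
once for each cpuset that has notify_on_release enabled and becomes
empty.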
|
||||||
|
|
||||||
|
|
||||||
|
1.6 What is memory_pressure ?
|
||||||
|
-----------------------------
|
||||||
|
The memory_pressure of a cpuset provides a simple per-cpuset metric
|
||||||
|
of the rate at which the tasks in a cpuset are attempting to free up
|
||||||
|
in-use memory on the nodes of the cpuset to satisfy additional memory
|
||||||
|
requests.
|
||||||
|
|
||||||
|
This enables batch managers monitoring jobs running in dedicated
|
||||||
|
cpusets to efficiently detect what level of memory pressure that job
|
||||||
|
is causing.
|
||||||
|
|
||||||
|
This is useful both on tightly managed systems running a wide mix of
|
||||||
|
submitted jobs, which may choose to terminate or re-prioritize jobs that
|
||||||
|
are trying to use more memory than allowed on the nodes assigned them,
|
||||||
|
and with tightly coupled, long running, massively parallel scientific
|
||||||
|
computing jobs that will dramatically fail to meet required performance
|
||||||
|
goals if they start to use more memory than allowed to them.
|
||||||
|
|
||||||
|
This mechanism provides a very economical way for the batch manager
|
||||||
|
to monitor a cpuset for signs of memory pressure. It's up to the
|
||||||
|
batch manager or other user code to decide what to do about it and
|
||||||
|
take action.
|
||||||
|
|
||||||
|
==> Unless this feature is enabled by writing "1" to the special file
|
||||||
|
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance
|
||||||
|
code of __alloc_pages() for this metric reduces to simply noticing
|
||||||
|
that the cpuset_memory_pressure_enabled flag is zero. So only
|
||||||
|
systems that enable this feature will compute the metric.
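
A batch manager could enable and poll the metric roughly as follows.
This is a sketch under assumptions: the helper names are hypothetical,
'root' stands for the cpuset mount point (/dev/cpuset above), and
error handling is omitted.

```python
import os

def enable_memory_pressure(root):
    """Turn on computation of memory_pressure by writing "1" to the
    memory_pressure_enabled file in the root cpuset."""
    with open(os.path.join(root, "memory_pressure_enabled"), "w") as f:
        f.write("1")

def read_memory_pressure(cpuset_dir):
    """Return the integer currently reported by a cpuset's
    memory_pressure file."""
    with open(os.path.join(cpuset_dir, "memory_pressure")) as f:
        return int(f.read().strip())
```

A monitor would call enable_memory_pressure("/dev/cpuset") once, then
call read_memory_pressure() on each job's cpuset at whatever interval
suits it.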
|
||||||
|
|
||||||
|
Why a per-cpuset, running average:
|
||||||
|
|
||||||
|
Because this meter is per-cpuset, rather than per-task or mm,
|
||||||
|
the system load imposed by a batch scheduler monitoring this
|
||||||
|
metric is sharply reduced on large systems, because a scan of
|
||||||
|
the tasklist can be avoided on each set of queries.
|
||||||
|
|
||||||
|
Because this meter is a running average, instead of an accumulating
|
||||||
|
counter, a batch scheduler can detect memory pressure with a
|
||||||
|
single read, instead of having to read and accumulate results
|
||||||
|
for a period of time.
|
||||||
|
|
||||||
|
Because this meter is per-cpuset rather than per-task or mm,
|
||||||
|
the batch scheduler can obtain the key information, memory
|
||||||
|
pressure in a cpuset, with a single read, rather than having to
|
||||||
|
query and accumulate results over all the (dynamically changing)
|
||||||
|
set of tasks in the cpuset.
|
||||||
|
|
||||||
|
A per-cpuset simple digital filter (requires a spinlock and 3 words
|
||||||
|
of data per-cpuset) is kept, and updated by any task attached to that
|
||||||
|
cpuset, if it enters the synchronous (direct) page reclaim code.
|
||||||
|
|
||||||
|
A per-cpuset file provides an integer number representing the recent
|
||||||
|
(half-life of 10 seconds) rate of direct page reclaims caused by
|
||||||
|
the tasks in the cpuset, in units of reclaims attempted per second,
|
||||||
|
times 1000.
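
The behaviour of such a filter can be modelled in a few lines. The
class below is an illustrative floating-point model with a 10 second
half-life and three fields of state; it is not the kernel's code, and
the class and method names are made up for this example.

```python
class PressureMeter:
    """Toy model of a decaying event-rate meter (not the kernel code)."""
    HALF_LIFE = 10.0               # seconds, as stated above

    def __init__(self):
        self.value = 0.0           # smoothed events per second
        self.pending = 0           # events seen since the last update
        self.last = 0.0            # time of the last update, in seconds

    def mark_event(self):
        """Record one direct page reclaim attempt."""
        self.pending += 1

    def update(self, now):
        """Decay the old value and fold in the pending events."""
        elapsed = now - self.last
        if elapsed <= 0:
            return
        keep = 0.5 ** (elapsed / self.HALF_LIFE)   # weight of old value
        rate = self.pending / elapsed              # recent events/second
        self.value = keep * self.value + (1.0 - keep) * rate
        self.pending = 0
        self.last = now

    def read(self, now):
        """Return the rate in events per second, times 1000."""
        self.update(now)
        return int(self.value * 1000)
```

With a steady 5 reclaims per second the reading settles just under
5000, and 10 seconds after the reclaims stop it has dropped to about
half of that.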
|
||||||
|
|
||||||
|
|
||||||
|
1.7 How do I use cpusets ?
|
||||||
--------------------------
|
--------------------------
|
||||||
|
|
||||||
In order to minimize the impact of cpusets on critical kernel
|
In order to minimize the impact of cpusets on critical kernel
|
||||||
@@ -277,6 +364,30 @@ rewritten to the 'tasks' file of its cpuset. This is done to avoid
|
|||||||
impacting the scheduler code in the kernel with a check for changes
|
impacting the scheduler code in the kernel with a check for changes
|
||||||
in a tasks processor placement.
|
in a tasks processor placement.
|
||||||
|
|
||||||
|
Normally, once a page is allocated (given a physical page
|
||||||
|
of main memory) then that page stays on whatever node it
|
||||||
|
was allocated on, so long as it remains allocated, even if the
|
||||||
|
cpuset's memory placement policy 'mems' subsequently changes.
|
||||||
|
If the cpuset flag file 'memory_migrate' is set true, then when
|
||||||
|
tasks are attached to that cpuset, any pages that task had
|
||||||
|
allocated to it on nodes in its previous cpuset are migrated
|
||||||
|
to the task's new cpuset. Depending on the implementation,
|
||||||
|
this migration may either be done by swapping the page out,
|
||||||
|
so that the next time the page is referenced, it will be paged
|
||||||
|
into the task's new cpuset, usually on the node where it was
|
||||||
|
referenced, or this migration may be done by directly copying
|
||||||
|
the pages from the task's previous cpuset to the new cpuset,
|
||||||
|
where possible to the same node, relative to the new cpuset,
|
||||||
|
as the node that held the page, relative to the old cpuset.
|
||||||
|
Also if 'memory_migrate' is set true, then if that cpuset's
|
||||||
|
'mems' file is modified, pages allocated to tasks in that
|
||||||
|
cpuset, that were on nodes in the previous setting of 'mems',
|
||||||
|
will be moved to nodes in the new setting of 'mems.' Again,
|
||||||
|
depending on the implementation, this might be done by swapping,
|
||||||
|
or by direct copying. In either case, pages that were not in
|
||||||
|
the task's prior cpuset, or in the cpuset's prior 'mems' setting,
|
||||||
|
will not be moved.
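
One way to picture "the same node, relative to the new cpuset" is the
mapping sketched below. It is an assumption made for illustration, not
a statement of how any particular kernel version chooses the target
node; the function name and the wrap-around rule are hypothetical.

```python
def remap_node(node, old_mems, new_mems):
    """Map a node to the node at the same relative position in new_mems.

    A page on a node outside old_mems is not moved, so that node is
    returned unchanged. If new_mems has fewer nodes than old_mems, the
    position wraps around (an assumption of this sketch).
    """
    old = sorted(old_mems)
    new = sorted(new_mems)
    if node not in old:
        return node                      # not in prior placement: stays
    return new[old.index(node) % len(new)]
```

For instance, with old 'mems' of 4-7 and new 'mems' of 0-3, a page on
node 5 (the second node of the old set) maps to node 1 (the second
node of the new set).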
|
||||||
|
|
||||||
There is an exception to the above. If hotplug functionality is used
|
There is an exception to the above. If hotplug functionality is used
|
||||||
to remove all the CPUs that are currently assigned to a cpuset,
|
to remove all the CPUs that are currently assigned to a cpuset,
|
||||||
then the kernel will automatically update the cpus_allowed of all
|
then the kernel will automatically update the cpus_allowed of all
|
||||||
|
|||||||