Merge branch 'upstream'

2026-05-01 15:00:59 -07:00 · 2006-01-17 10:29:06 -05:00
parent 61420e147a 1bc4ccfff8
commit ea9b395fe2
5476 changed files with 318127 additions and 155285 deletions
@@ -10,6 +10,7 @@
 *.a
 *.s
 *.ko
 *.so
 *.mod.c
 #
@@ -23,6 +24,7 @@ Module.symvers
 # Generated include files
 #
 include/asm
 include/asm-*/asm-offsets.h
 include/config
 include/linux/autoconf.h
 include/linux/compile.h
@@ -1883,6 +1883,7 @@ N: Jaya Kumar
 E: jayalk@intworks.biz
 W: http://www.intworks.biz
 D: Arc monochrome LCD framebuffer driver, x86 reboot fixups
 D: pirq addr, CS5535 alsa audio driver
 S: Gurgaon, India
 S: Kuala Lumpur, Malaysia
@@ -3202,7 +3203,7 @@ N: Eugene Surovegin
 E: ebs@ebshome.net
 W: http://kernel.ebshome.net/
 P: 1024D/AE5467F1 FF22 39F1 6728 89F6 6E6C  2365 7602 F33D AE54 67F1
-D: Embedded PowerPC 4xx: I2C, PIC and random hacks/fixes
+D: Embedded PowerPC 4xx: EMAC, I2C, PIC and random hacks/fixes
 S: Sunnyvale, California 94085
 S: USA
@@ -31,8 +31,6 @@ al espa
 Eine deutsche Version dieser Datei finden Sie unter
 <http://www.stefan-winter.de/Changes-2.4.0.txt>.
 Last updated: October 29th, 2002
 Chris Ricker (kaboom@gatech.edu or chris.ricker@genetics.utah.edu).
 Current Minimal Requirements
@@ -48,7 +46,7 @@ necessary on all systems; obviously, if you don't have any ISDN
 hardware, for example, you probably needn't concern yourself with
 isdn4k-utils.
-o  Gnu C                  2.95.3                  # gcc --version
+o  Gnu C                  3.2                     # gcc --version
 o  Gnu make               3.79.1                  # make --version
 o  binutils               2.12                    # ld -v
 o  util-linux             2.10o                   # fdformat --version
@@ -74,26 +72,7 @@ GCC
 ---
 The gcc version requirements may vary depending on the type of CPU in your
-computer. The next paragraph applies to users of x86 CPUs, but not
+computer.
 necessarily to users of other CPUs. Users of other CPUs should obtain
 information about their gcc version requirements from another source.
 The recommended compiler for the kernel is gcc 2.95.x (x >= 3), and it
 should be used when you need absolute stability. You may use gcc 3.0.x
 instead if you wish, although it may cause problems. Later versions of gcc 
 have not received much testing for Linux kernel compilation, and there are 
 almost certainly bugs (mainly, but not exclusively, in the kernel) that
 will need to be fixed in order to use these compilers. In any case, using
 pgcc instead of plain gcc is just asking for trouble.
 The Red Hat gcc 2.96 compiler subtree can also be used to build this tree.
 You should ensure you use gcc-2.96-74 or later. gcc-2.96-54 will not build
 the kernel correctly.
 In addition, please pay attention to compiler optimization.  Anything
 greater than -O2 may not be wise.  Similarly, if you choose to use gcc-2.95.x
 or derivatives, be sure not to use -fstrict-aliasing (which, depending on
 your version of gcc 2.95.x, may necessitate using -fno-strict-aliasing).
 Make
 ----
@@ -322,9 +301,9 @@ Getting updated software
 Kernel compilation
 ******************
-gcc 2.95.3
+gcc
----------
+---
-o  <ftp://ftp.gnu.org/gnu/gcc/gcc-2.95.3.tar.gz>
+o  <ftp://ftp.gnu.org/gnu/gcc/>
 Make
 ----
@@ -199,7 +199,7 @@ The rationale is:
    modifications are prevented
 - saves the compiler work to optimize redundant code away ;)
-int fun(int )
+int fun(int a)
 {
 	int result = 0;
 	char *buffer = kmalloc(SIZE);
@@ -344,7 +344,7 @@ Remember: if another thread can find your data structure, and you don't
 have a reference count on it, you almost certainly have a bug.
-		Chapter 11: Macros, Enums, Inline functions and RTL
+		Chapter 11: Macros, Enums and RTL
 Names of macros defining constants and labels in enums are capitalized.
@@ -429,7 +429,35 @@ from void pointer to any other pointer type is guaranteed by the C programming
 language.
-		Chapter 14: References
+		Chapter 14: The inline disease
 There appears to be a common misperception that gcc has a magic "make me
 faster" speedup option called "inline". While the use of inlines can be
 appropriate (for example as a means of replacing macros, see Chapter 11), it
 very often is not. Abundant use of the inline keyword leads to a much bigger
 kernel, which in turn slows the system as a whole down, due to a bigger
 icache footprint for the CPU and simply because there is less memory
 available for the pagecache. Just think about it; a pagecache miss causes a
 disk seek, which easily takes 5 milliseconds. There are a LOT of cpu cycles
 that can go into these 5 milliseconds.
 A reasonable rule of thumb is to not put inline at functions that have more
 than 3 lines of code in them. An exception to this rule are the cases where
 a parameter is known to be a compiletime constant, and as a result of this
 constantness you *know* the compiler will be able to optimize most of your
 function away at compile time. For a good example of this latter case, see
 the kmalloc() inline function.
 Often people argue that adding inline to functions that are static and used
 only once is always a win since there is no space tradeoff. While this is
 technically correct, gcc is capable of inlining these automatically without
 help, and the maintenance issue of removing the inline when a second user
 appears outweighs the potential value of the hint that tells gcc to do
 something it would have done anyway.
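The rule of thumb above can be illustrated with a minimal userspace sketch.
Both helpers are hypothetical and are not taken from the kernel tree. The
first is short enough for an inline hint to be reasonable; the second is
longer than three lines, but when its argument is a compile-time constant
the compiler can reduce the whole comparison chain to a single constant,
which is the kind of exception described for kmalloc().

```c
#include <stddef.h>

/* One statement: small enough that an inline hint is reasonable. */
static inline size_t round_up_to(size_t n, size_t align)
{
	/* 'align' is assumed to be a power of two. */
	return (n + align - 1) & ~(align - 1);
}

/*
 * Longer than three lines, but with a constant 'size' the compiler can
 * fold the comparisons away and leave only the resulting index.
 * The size classes are illustrative, not the kernel's real ones.
 */
static inline int size_class_index(size_t size)
{
	if (size <= 32)
		return 0;
	if (size <= 64)
		return 1;
	if (size <= 128)
		return 2;
	if (size <= 256)
		return 3;
	return -1;	/* too large for any class */
}
```

A call such as size_class_index(100) can be folded to the constant 2 at
compile time, whereas a call with a runtime value keeps every comparison at
each call site if the function is inlined.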
 		Chapter 15: References
 The C Programming Language, Second Edition
 by Brian W. Kernighan and Dennis M. Ritchie.
@@ -444,10 +472,13 @@ ISBN 0-201-61586-X.
 URL: http://cm.bell-labs.com/cm/cs/tpop/
 GNU manuals - where in compliance with K&R and this text - for cpp, gcc,
-gcc internals and indent, all available from http://www.gnu.org
+gcc internals and indent, all available from http://www.gnu.org/manual/
 WG14 is the international standardization working group for the programming
-language C, URL: http://std.dkuug.dk/JTC1/SC22/WG14/
+language C, URL: http://www.open-std.org/JTC1/SC22/WG14/
 Kernel CodingStyle, by greg@kroah.com at OLS 2002:
 http://www.kroah.com/linux/talks/ols_2002_kernel_codingstyle_talk/html/
 --
-Last updated on 16 February 2004 by a community effort on LKML.
+Last updated on 30 December 2005 by a community effort on LKML.
@@ -0,0 +1,6 @@
 *.xml
 *.ps
 *.pdf
 *.html
 *.9.gz
 *.9
@@ -53,6 +53,11 @@
 !Iinclude/linux/sched.h
 !Ekernel/sched.c
 !Ekernel/timer.c
     </sect1>
     <sect1><title>High-resolution timers</title>
 !Iinclude/linux/ktime.h
 !Iinclude/linux/hrtimer.h
 !Ekernel/hrtimer.c
     </sect1>
     <sect1><title>Internal Functions</title>
 !Ikernel/exit.c
@@ -369,6 +374,7 @@ X!Edrivers/acpi/motherboard.c
 X!Edrivers/acpi/bus.c
 -->
 !Edrivers/acpi/scan.c
 !Idrivers/acpi/scan.c
 <!-- No correct structured comments
 X!Edrivers/acpi/pci_bind.c
 -->
@@ -222,7 +222,7 @@
   <title>Two Main Types of Kernel Locks: Spinlocks and Semaphores</title>
   <para>
-     There are two main types of kernel locks.  The fundamental type
+     There are three main types of kernel locks.  The fundamental type
     is the spinlock 
     (<filename class="headerfile">include/asm/spinlock.h</filename>),
     which is a very simple single-holder lock: if you can't get the 
@@ -230,16 +230,22 @@
     very small and fast, and can be used anywhere.
   </para>
   <para>
-     The second type is a semaphore
+     The second type is a mutex
     (<filename class="headerfile">include/linux/mutex.h</filename>): it
     is like a spinlock, but you may block holding a mutex.
     If you can't lock a mutex, your task will suspend itself, and be woken
     up when the mutex is released.  This means the CPU can do something
     else while you are waiting.  There are many cases when you simply
     can't sleep (see <xref linkend="sleeping-things"/>), and so have to
     use a spinlock instead.
   </para>
   <para>
     The third type is a semaphore
     (<filename class="headerfile">include/asm/semaphore.h</filename>): it
     can have more than one holder at any time (the number decided at
     initialization time), although it is most commonly used as a
-     single-holder lock (a mutex).  If you can't get a semaphore,
+     single-holder lock (a mutex).  If you can't get a semaphore, your
-     your task will put itself on the queue, and be woken up when the
+     task will be suspended and later on woken up - just like for mutexes.
     semaphore is released.  This means the CPU will do something
     else while you are waiting, but there are many cases when you
     simply can't sleep (see <xref linkend="sleeping-things"/>), and so
     have to use a spinlock instead.
   </para>
   <para>
     Neither type of lock is recursive: see
@@ -253,6 +253,7 @@
 !Edrivers/usb/core/urb.c
 !Edrivers/usb/core/message.c
 !Edrivers/usb/core/file.c
 !Edrivers/usb/core/driver.c
 !Edrivers/usb/core/usb.c
 !Edrivers/usb/core/hub.c
    </chapter>
@@ -229,7 +229,7 @@ int __init myradio_init(struct video_init *v)
 static int users = 0;
-static int radio_open(stuct video_device *dev, int flags)
+static int radio_open(struct video_device *dev, int flags)
 {
        if(users)
                return -EBUSY;
@@ -949,7 +949,7 @@ int __init mycamera_init(struct video_init *v)
 static int users = 0;
-static int camera_open(stuct video_device *dev, int flags)
+static int camera_open(struct video_device *dev, int flags)
 {
        if(users)
                return -EBUSY;
@@ -1,74 +1,67 @@
-Refcounter framework for elements of lists/arrays protected by
+Refcounter design for elements of lists/arrays protected by RCU.
 RCU.
 Refcounting on elements of  lists which are protected by traditional
 reader/writer spinlocks or semaphores is straightforward as in:
-1.					2.
+1.				2.
-add()					search_and_reference()
+add()				search_and_reference()
-{					{
+{				{
-	alloc_object				read_lock(&list_lock);
+    alloc_object		    read_lock(&list_lock);
-	...					search_for_element
+    ...				    search_for_element
-	atomic_set(&el->rc, 1);			atomic_inc(&el->rc);
+    atomic_set(&el->rc, 1);	    atomic_inc(&el->rc);
-	write_lock(&list_lock);			...
+    write_lock(&list_lock);	     ...
-	add_element				read_unlock(&list_lock);
+    add_element			    read_unlock(&list_lock);
-	...					...
+    ...				    ...
-	write_unlock(&list_lock);	}
+    write_unlock(&list_lock);	}
 }
 3.					4.
 release_referenced()			delete()
 {					{
-	...				write_lock(&list_lock);
+    ...					    write_lock(&list_lock);
-	atomic_dec(&el->rc, relfunc)	...
+    atomic_dec(&el->rc, relfunc)	    ...
-	...				delete_element
+    ...					    delete_element
-}					write_unlock(&list_lock);
+}					    write_unlock(&list_lock);
- 					...
+ 					    ...
-					if (atomic_dec_and_test(&el->rc))
+					    if (atomic_dec_and_test(&el->rc))
-						kfree(el);
+					        kfree(el);
-					...
+					    ...
 					}
 If this list/array is made lock free using rcu as in changing the
 write_lock in add() and delete() to spin_lock and changing read_lock
-in search_and_reference to rcu_read_lock(), the rcuref_get in
+in search_and_reference to rcu_read_lock(), the atomic_get in
 search_and_reference could potentially hold reference to an element which
-has already been deleted from the list/array.  rcuref_lf_get_rcu takes
+has already been deleted from the list/array.  atomic_inc_not_zero takes
 care of this scenario. search_and_reference should look as;
 1.					2.
 add()					search_and_reference()
 {					{
- 	alloc_object				rcu_read_lock();
+    alloc_object			    rcu_read_lock();
-	...					search_for_element
+    ...					    search_for_element
-	atomic_set(&el->rc, 1);			if (rcuref_inc_lf(&el->rc)) {
+    atomic_set(&el->rc, 1);		    if (!atomic_inc_not_zero(&el->rc)) {
-	write_lock(&list_lock);				rcu_read_unlock();
+    write_lock(&list_lock);		        rcu_read_unlock();
-							return FAIL;
+					        return FAIL;
-	add_element				}
+    add_element				    }
-	...					...
+    ...					    ...
-	write_unlock(&list_lock);		rcu_read_unlock();
+    write_unlock(&list_lock);		    rcu_read_unlock();
 }					}
 3.					4.
 release_referenced()			delete()
 {					{
-	...				write_lock(&list_lock);
+    ...					    write_lock(&list_lock);
-	rcuref_dec(&el->rc, relfunc)	...
+    atomic_dec(&el->rc, relfunc)	    ...
-	...				delete_element
+    ...					    delete_element
-}					write_unlock(&list_lock);
+}					    write_unlock(&list_lock);
- 					...
+ 					    ...
-					if (rcuref_dec_and_test(&el->rc))
+					    if (atomic_dec_and_test(&el->rc))
-						call_rcu(&el->head, el_free);
+					        call_rcu(&el->head, el_free);
-					...
+					    ...
 					}
 Sometimes, reference to the element need to be obtained in the
-update (write) stream.  In such cases, rcuref_inc_lf might be an overkill
+update (write) stream.  In such cases, atomic_inc_not_zero might be an
-since the spinlock serialising list updates are held. rcuref_inc
+overkill since the spinlock serialising list updates are held. atomic_inc
 is to be used in such cases.
-For arches which do not have cmpxchg rcuref_inc_lf
+
 api uses a hashed spinlock implementation and the same hashed spinlock
 is acquired in all rcuref_xxx primitives to preserve atomicity.
 Note: Use rcuref_inc api only if you need to use rcuref_inc_lf on the
 refcounter atleast at one place.  Mixing rcuref_inc and atomic_xxx api
 might lead to races. rcuref_inc_lf() must be used in lockfree
 RCU critical sections only.
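The lock-free lookup above depends on an increment that refuses to revive a
counter that has already dropped to zero. A minimal userspace sketch of that
behaviour, using C11 atomics rather than the kernel's atomic_t API, might
look as follows. The names ref_inc_not_zero and ref_dec_and_test are
hypothetical stand-ins for atomic_inc_not_zero and atomic_dec_and_test, and
the sketch leaves out the RCU grace period entirely.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct element {
	atomic_int rc;		/* reference count; 0 means "being freed" */
};

/* Take a reference unless the count is already zero. */
static bool ref_inc_not_zero(atomic_int *rc)
{
	int old = atomic_load(rc);

	while (old != 0) {
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(rc, &old, old + 1))
			return true;
	}
	return false;		/* element is already on its way out */
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool ref_dec_and_test(atomic_int *rc)
{
	return atomic_fetch_sub(rc, 1) == 1;
}
```

A reader that gets false back has to treat the element as not found, since
the deleter may already have queued it for freeing.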
@@ -27,18 +27,17 @@ Who To Submit Drivers To
 ------------------------
 Linux 2.0:
-	No new drivers are accepted for this kernel tree
+	No new drivers are accepted for this kernel tree.
 Linux 2.2:
 	No new drivers are accepted for this kernel tree.
 Linux 2.4:
 	If the code area has a general maintainer then please submit it to
 	the maintainer listed in MAINTAINERS in the kernel file. If the
 	maintainer does not respond or you cannot find the appropriate
-	maintainer then please contact the 2.2 kernel maintainer:
+	maintainer then please contact Marcelo Tosatti
-	Marc-Christian Petersen <m.c.p@wolk-project.de>.
+	<marcelo.tosatti@cyclades.com>.
 Linux 2.4:
 	The same rules apply as 2.2. The final contact point for Linux 2.4
 	submissions is Marcelo Tosatti <marcelo.tosatti@cyclades.com>.
 Linux 2.6:
 	The same rules apply as 2.4 except that you should follow linux-kernel
@@ -53,6 +52,7 @@ Licensing:	The code must be released to us under the
 		of exclusive GPL licensing, and if you wish the driver
 		to be useful to other communities such as BSD you may well
 		wish to release under multiple licenses.
 		See accepted licenses at include/linux/module.h
 Copyright:	The copyright owner must agree to use of GPL.
 		It's best if the submitter and copyright owner
@@ -143,5 +143,13 @@ KernelNewbies:
 	http://kernelnewbies.org/
 Linux USB project:
-	http://sourceforge.net/projects/linux-usb/
+	http://www.linux-usb.org/
 How to NOT write kernel driver by arjanv@redhat.com
 	http://people.redhat.com/arjanv/olspaper.pdf
 Kernel Janitor:
 	http://janitor.kernelnewbies.org/
 --
 Last updated on 17 Nov 2005.
@@ -78,7 +78,9 @@ Randy Dunlap's patch scripts:
 http://www.xenotime.net/linux/scripts/patching-scripts-002.tar.gz
 Andrew Morton's patch scripts:
-http://www.zip.com.au/~akpm/linux/patches/patch-scripts-0.20
+http://www.zip.com.au/~akpm/linux/patches/
 Instead of these scripts, quilt is the recommended patch management
 tool (see above).
@@ -97,7 +99,7 @@ need to split up your patch.  See #3, next.
 3) Separate your changes.
-Separate each logical change into its own patch.
+Separate _logical changes_ into a single patch file.
 For example, if your changes include both bug fixes and performance
 enhancements for a single driver, separate those changes into two
@@ -112,6 +114,10 @@ If one patch depends on another patch in order for a change to be
 complete, that is OK.  Simply note "this patch depends on patch X"
 in your patch description.
 If you cannot condense your patch set into a smaller set of patches,
 then only post, say, 15 or so at a time and wait for review and integration.
 4) Select e-mail destination.
@@ -124,6 +130,10 @@ your patch to the primary Linux kernel developer's mailing list,
 linux-kernel@vger.kernel.org.  Most kernel developers monitor this
 e-mail list, and can comment on your changes.
 Do not send more than 15 patches at once to the vger mailing lists!!!
 Linus Torvalds is the final arbiter of all changes accepted into the
 Linux kernel.  His e-mail address is <torvalds@osdl.org>.  He gets
 a lot of e-mail, so typically you should do your best to -avoid- sending
@@ -149,6 +159,9 @@ USB, framebuffer devices, the VFS, the SCSI subsystem, etc.  See the
 MAINTAINERS file for a mailing list that relates specifically to
 your change.
 Majordomo lists of VGER.KERNEL.ORG at:
 	<http://vger.kernel.org/vger-lists.html>
 If changes affect userland-kernel interfaces, please send
 the MAN-PAGES maintainer (as listed in the MAINTAINERS file)
 a man-pages patch, or at least a notification of the change,
@@ -158,7 +171,7 @@ Even if the maintainer did not respond in step #4, make sure to ALWAYS
 copy the maintainer when you change their code.
 For small patches you may want to CC the Trivial Patch Monkey
-trivial@rustcorp.com.au set up by Rusty Russell; which collects "trivial"
+trivial@kernel.org managed by Adrian Bunk; which collects "trivial"
 patches. Trivial patches must qualify for one of the following rules:
 Spelling fixes in documentation
 Spelling fixes which could break grep(1).
@@ -171,7 +184,7 @@ patches. Trivial patches must qualify for one of the following rules:
 since people copy, as long as it's trivial)
 Any fix by the author/maintainer of the file. (ie. patch monkey
 in re-transmission mode)
-URL: <http://www.kernel.org/pub/linux/kernel/people/rusty/trivial/>
+URL: <http://www.kernel.org/pub/linux/kernel/people/bunk/trivial/>
@@ -373,27 +386,14 @@ a diffstat, to show what files have changed, and the number of inserted
 and deleted lines per file.  A diffstat is especially useful on bigger
 patches.  Other comments relevant only to the moment or the maintainer,
 not suitable for the permanent changelog, should also go here.
 Use diffstat options "-p 1 -w 70" so that filenames are listed from the
 top of the kernel source tree and don't use too much horizontal space
 (easily fit in 80 columns, maybe with some indentation).
 See more details on the proper patch format in the following
 references.
 13) More references for submitting patches
 Andrew Morton, "The perfect patch" (tpp).
  <http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt>
 Jeff Garzik, "Linux kernel patch submission format."
  <http://linux.yyz.us/patch-format.html>
 Greg KH, "How to piss off a kernel subsystem maintainer"
  <http://www.kroah.com/log/2005/03/31/>
 Kernel Documentation/CodingStyle
  <http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle>
 Linus Torvalds's mail on the canonical patch format:
  <http://lkml.org/lkml/2005/4/7/183>
 -----------------------------------
@@ -466,3 +466,31 @@ and 'extern __inline__'.
 Don't try to anticipate nebulous future cases which may or may not
 be useful:  "Make it as simple as you can, and no simpler."
 ----------------------
 SECTION 3 - REFERENCES
 ----------------------
 Andrew Morton, "The perfect patch" (tpp).
  <http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt>
 Jeff Garzik, "Linux kernel patch submission format."
  <http://linux.yyz.us/patch-format.html>
 Greg Kroah-Hartman "How to piss off a kernel subsystem maintainer".
  <http://www.kroah.com/log/2005/03/31/>
  <http://www.kroah.com/log/2005/07/08/>
  <http://www.kroah.com/log/2005/10/19/>
  <http://www.kroah.com/log/2006/01/11/>
 NO!!!! No more huge patch bombs to linux-kernel@vger.kernel.org people!.
  <http://marc.theaimsgroup.com/?l=linux-kernel&m=112112749912944&w=2>
 Kernel Documentation/CodingStyle
  <http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle>
 Linus Torvalds's mail on the canonical patch format:
  <http://lkml.org/lkml/2005/4/7/183>
 --
 Last updated on 17 Nov 2005.
@@ -2,8 +2,8 @@
 	Applying Patches To The Linux Kernel
 	------------------------------------
-	(Written by Jesper Juhl, August 2005)
+	Original by: Jesper Juhl, August 2005
-
+	Last update: 2006-01-05
 A frequently asked question on the Linux Kernel Mailing List is how to apply
@@ -76,7 +76,7 @@ instead:
 If you wish to uncompress the patch file by hand first before applying it
 (what I assume you've done in the examples below), then you simply run
-gunzip or bunzip2 on the file - like this:
+gunzip or bunzip2 on the file -- like this:
 	gunzip patch-x.y.z.gz
 	bunzip2 patch-x.y.z.bz2
@@ -94,7 +94,7 @@ Common errors when patching
 ---
 When patch applies a patch file it attempts to verify the sanity of the
 file in different ways.
-Checking that the file looks like a valid patch file, checking the code
+Checking that the file looks like a valid patch file & checking the code
 around the bits being modified matches the context provided in the patch are
 just two of the basic sanity checks patch does.
@@ -118,16 +118,16 @@ wrong.
 When patch encounters a change that it can't fix up with fuzz it rejects it
 outright and leaves a file with a .rej extension (a reject file). You can
-read this file to see exactely what change couldn't be applied, so you can
+read this file to see exactly what change couldn't be applied, so you can
 go fix it up by hand if you wish.
-If you don't have any third party patches applied to your kernel source, but
+If you don't have any third-party patches applied to your kernel source, but
 only patches from kernel.org and you apply the patches in the correct order,
 and have made no modifications yourself to the source files, then you should
 never see a fuzz or reject message from patch. If you do see such messages
 anyway, then there's a high risk that either your local source tree or the
 patch file is corrupted in some way. In that case you should probably try
-redownloading the patch and if things are still not OK then you'd be advised
+re-downloading the patch and if things are still not OK then you'd be advised
 to start with a fresh tree downloaded in full from kernel.org.
 Let's look a bit more at some of the messages patch can produce.
@@ -136,7 +136,7 @@ If patch stops and presents a "File to patch:" prompt, then patch could not
 find a file to be patched. Most likely you forgot to specify -p1 or you are
 in the wrong directory. Less often, you'll find patches that need to be
 applied with -p0 instead of -p1 (reading the patch file should reveal if
-this is the case - if so, then this is an error by the person who created
+this is the case -- if so, then this is an error by the person who created
 the patch but is not fatal).
 If you get "Hunk #2 succeeded at 1887 with fuzz 2 (offset 7 lines)." or a
@@ -167,22 +167,28 @@ the patch will in fact apply it.
 A message similar to "patch: **** unexpected end of file in patch" or "patch
 unexpectedly ends in middle of line" means that patch could make no sense of
-the file you fed to it. Either your download is broken or you tried to feed
+the file you fed to it. Either your download is broken, you tried to feed
-patch a compressed patch file without uncompressing it first.
+patch a compressed patch file without uncompressing it first, or the patch
 file that you are using has been mangled by a mail client or mail transfer
 agent along the way somewhere, e.g., by splitting a long line into two lines.
 Often these warnings can easily be fixed by joining (concatenating) the
 two lines that had been split.
 As I already mentioned above, these errors should never happen if you apply
 a patch from kernel.org to the correct version of an unmodified source tree.
 So if you get these errors with kernel.org patches then you should probably
-assume that either your patch file or your tree is broken and I'd advice you
+assume that either your patch file or your tree is broken and I'd advise you
 to start over with a fresh download of a full kernel tree and the patch you
 wish to apply.
 Are there any alternatives to `patch'?
 ---
- Yes there are alternatives. You can use the `interdiff' program
+ Yes there are alternatives.
-(http://cyberelk.net/tim/patchutils/) to generate a patch representing the
+
-differences between two patches and then apply the result.
+ You can use the `interdiff' program (http://cyberelk.net/tim/patchutils/) to
 generate a patch representing the differences between two patches and then
 apply the result.
 This will let you move from something like 2.6.12.2 to 2.6.12.3 in a single
 step. The -z flag to interdiff will even let you feed it patches in gzip or
 bzip2 compressed form directly without the use of zcat or bzcat or manual
@@ -197,10 +203,10 @@ do the additional steps since interdiff can get things wrong in some cases.
 Another alternative is `ketchup', which is a python script for automatic
 downloading and applying of patches (http://www.selenic.com/ketchup/).
-Other nice tools are diffstat which shows a summary of changes made by a
+ Other nice tools are diffstat, which shows a summary of changes made by a
-patch, lsdiff which displays a short listing of affected files in a patch
+patch; lsdiff, which displays a short listing of affected files in a patch
-file, along with (optionally) the line numbers of the start of each patch
+file, along with (optionally) the line numbers of the start of each patch;
-and grepdiff which displays a list of the files modified by a patch where
+and grepdiff, which displays a list of the files modified by a patch where
 the patch contains a given regular expression.
@@ -225,8 +231,8 @@ The -mm kernels live at
 In place of ftp.kernel.org you can use ftp.cc.kernel.org, where cc is a
 country code. This way you'll be downloading from a mirror site that's most
 likely geographically closer to you, resulting in faster downloads for you,
-less bandwidth used globally and less load on the main kernel.org servers -
+less bandwidth used globally and less load on the main kernel.org servers --
-these are good things, do use mirrors when possible.
+these are good things, so do use mirrors when possible.
 The 2.6.x kernels
@@ -234,14 +240,14 @@ The 2.6.x kernels
 These are the base stable releases released by Linus. The highest numbered
 release is the most recent.
-If regressions or other serious flaws are found then a -stable fix patch
+If regressions or other serious flaws are found, then a -stable fix patch
 will be released (see below) on top of this base. Once a new 2.6.x base
 kernel is released, a patch is made available that is a delta between the
 previous 2.6.x kernel and the new one.
-To apply a patch moving from 2.6.11 to 2.6.12 you'd do the following (note
+To apply a patch moving from 2.6.11 to 2.6.12, you'd do the following (note
 that such patches do *NOT* apply on top of 2.6.x.y kernels but on top of the
-base 2.6.x kernel - if you need to move from 2.6.x.y to 2.6.x+1 you need to
+base 2.6.x kernel -- if you need to move from 2.6.x.y to 2.6.x+1 you need to
 first revert the 2.6.x.y patch).
 Here are some examples:
@@ -258,12 +264,12 @@ $ patch -p1 -R < ../patch-2.6.11.1	# revert the 2.6.11.1 patch
 					# source dir is now 2.6.11
 $ patch -p1 < ../patch-2.6.12		# apply new 2.6.12 patch
 $ cd ..
-$ mv linux-2.6.11.1 inux-2.6.12		# rename source dir
+$ mv linux-2.6.11.1 linux-2.6.12		# rename source dir
 The 2.6.x.y kernels
 ---
- Kernels with 4 digit versions are -stable kernels. They contain small(ish)
+ Kernels with 4-digit versions are -stable kernels. They contain small(ish)
 critical fixes for security problems or significant regressions discovered
 in a given 2.6.x kernel.
@@ -274,9 +280,14 @@ versions.
 If no 2.6.x.y kernel is available, then the highest numbered 2.6.x kernel is
 the current stable kernel.
 note: the -stable team usually do make incremental patches available as well
 as patches against the latest mainline release, but I only cover the
 non-incremental ones below. The incremental ones can be found at
 ftp://ftp.kernel.org/pub/linux/kernel/v2.6/incr/
 These patches are not incremental, meaning that for example the 2.6.12.3
 patch does not apply on top of the 2.6.12.2 kernel source, but rather on top
-of the base 2.6.12 kernel source.
+of the base 2.6.12 kernel source .
 So, in order to apply the 2.6.12.3 patch to your existing 2.6.12.2 kernel
 source you have to first back out the 2.6.12.2 patch (so you are left with a
 base 2.6.12 kernel source) and then apply the new 2.6.12.3 patch.
@@ -342,12 +353,12 @@ The -git kernels
 repository, hence the name).
 These patches are usually released daily and represent the current state of
-Linus' tree. They are more experimental than -rc kernels since they are
+Linus's tree. They are more experimental than -rc kernels since they are
 generated automatically without even a cursory glance to see if they are
 sane.
 -git patches are not incremental and apply either to a base 2.6.x kernel or
-a base 2.6.x-rc kernel - you can see which from their name.
+a base 2.6.x-rc kernel -- you can see which from their name.
 A patch named 2.6.12-git1 applies to the 2.6.12 kernel source and a patch
 named 2.6.13-rc3-git2 applies to the source of the 2.6.13-rc3 kernel.
@@ -390,12 +401,12 @@ You should generally strive to get your patches into mainline via -mm to
 ensure maximum testing.
 This branch is in constant flux and contains many experimental features, a
-lot of debugging patches not appropriate for mainline etc and is the most
+lot of debugging patches not appropriate for mainline etc., and is the most
 experimental of the branches described in this document.
 These kernels are not appropriate for use on systems that are supposed to be
 stable and they are more risky to run than any of the other branches (make
-sure you have up-to-date backups - that goes for any experimental kernel but
+sure you have up-to-date backups -- that goes for any experimental kernel but
 even more so for -mm kernels).
 These kernels in addition to all the other experimental patches they contain
@@ -433,7 +444,11 @@ $ cd ..
 $ mv linux-2.6.12-mm1 linux-2.6.13-rc3-mm3	# rename the source dir
-This concludes this list of explanations of the various kernel trees and I
+This concludes this list of explanations of the various kernel trees.
-hope you are now crystal clear on how to apply the various patches and help
+I hope you are now clear on how to apply the various patches and help test
-testing the kernel.
+the kernel.
 Thank you's to Randy Dunlap, Rolf Eike Beer, Linus Torvalds, Bodo Eggert,
 Johannes Stezenbach, Grant Coady, Pavel Machek and others that I may have
 forgotten for their reviews and contributions to this document.
@@ -0,0 +1,271 @@
 I/O Barriers
 ============
 Tejun Heo <htejun@gmail.com>, July 22 2005
 I/O barrier requests are used to guarantee ordering around the barrier
 requests.  Unless you're crazy enough to use disk drives for
 implementing synchronization constructs (wow, sounds interesting...),
 the ordering is meaningful only for write requests for things like
 journal checkpoints.  All requests queued before a barrier request
 must be finished (made it to the physical medium) before the barrier
 request is started, and all requests queued after the barrier request
 must be started only after the barrier request is finished (again,
 made it to the physical medium).
 In other words, I/O barrier requests have the following two properties.
 1. Request ordering
 Requests cannot pass the barrier request.  Preceding requests are
 processed before the barrier and following requests after.
 Depending on what features a drive supports, this can be done in one
 of the following three ways.
 i. For devices which have queue depth greater than 1 (TCQ devices) and
 support ordered tags, block layer can just issue the barrier as an
 ordered request and the lower level driver, controller and drive
 itself are responsible for making sure that the ordering constraint is
 met.  Most modern SCSI controllers/drives should support this.
 NOTE: SCSI ordered tag isn't currently used due to limitation in the
      SCSI midlayer, see the following random notes section.
 ii. For devices which have queue depth greater than 1 but don't
 support ordered tags, block layer ensures that the requests preceding
 a barrier request finish before issuing the barrier request.  Also,
 it defers requests following the barrier until the barrier request is
 finished.  Older SCSI controllers/drives and SATA drives fall in this
 category.
 iii. Devices which have queue depth of 1.  This is a degenerate case
 of ii.  Just keeping issue order suffices.  Ancient SCSI
 controllers/drives and IDE drives are in this category.
 2. Forced flushing to physical medium
 Again, if you're not gonna do synchronization with disk drives (dang,
 it sounds even more appealing now!), the reason you use I/O barriers
 is mainly to protect filesystem integrity when power failure or some
 other events abruptly stop the drive from operating and possibly make
 the drive lose data in its cache.  So, I/O barriers need to guarantee
 that requests actually get written to non-volatile medium in order.
 There are four cases,
 i. No write-back cache.  Keeping requests ordered is enough.
 ii. Write-back cache but no flush operation.  There's no way to
 guarantee physical-medium commit order.  This kind of device can't do
 I/O barriers.
 iii. Write-back cache and flush operation but no FUA (forced unit
 access).  We need two cache flushes - before and after the barrier
 request.
 iv. Write-back cache, flush operation and FUA.  We still need one
 flush to make sure requests preceding a barrier are written to medium,
 but post-barrier flush can be avoided by using FUA write on the
 barrier itself.
 How to support barrier requests in drivers
 ------------------------------------------
 All barrier handling is done inside block layer proper.  All low level
 drivers have to do is implement their prepare_flush_fn and use one of
 the following two functions to indicate what barrier type they support
 and how to prepare flush requests.  Note that the term 'ordered' is
 used to indicate the whole sequence of performing barrier requests
 including draining and flushing.
 typedef void (prepare_flush_fn)(request_queue_t *q, struct request *rq);
 int blk_queue_ordered(request_queue_t *q, unsigned ordered,
 		      prepare_flush_fn *prepare_flush_fn,
 		      unsigned gfp_mask);
 int blk_queue_ordered_locked(request_queue_t *q, unsigned ordered,
 			     prepare_flush_fn *prepare_flush_fn,
 			     unsigned gfp_mask);
 The only difference between the two functions is whether or not the
 caller is holding q->queue_lock on entry.  The latter expects the
 caller to be holding the lock.
@q			: the queue in question
@ordered		: the ordered mode the driver/device supports
@prepare_flush_fn	: this function should prepare @rq such that it
 			  flushes cache to physical medium when executed
@gfp_mask		: gfp_mask used when allocating data structures
 			  for ordered processing
 For example, SCSI disk driver's prepare_flush_fn looks like the
 following.
 static void sd_prepare_flush(request_queue_t *q, struct request *rq)
 {
 	memset(rq->cmd, 0, sizeof(rq->cmd));
 	rq->flags |= REQ_BLOCK_PC;
 	rq->timeout = SD_TIMEOUT;
 	rq->cmd[0] = SYNCHRONIZE_CACHE;
 }
 The following seven ordered modes are supported.  The following table
 shows which mode should be used depending on what features a
 device/driver supports.  In the leftmost column of table,
 QUEUE_ORDERED_ prefix is omitted from the mode names to save space.
 The table is followed by description of each mode.  Note that in the
 descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
 used for QUEUE_ORDERED_TAG* descriptions.  '=>' indicates that the
 preceding step must be complete before proceeding to the next step.
 '->' indicates that the next step can start as soon as the previous
 step is issued.
 	    write-back cache	ordered tag	flush		FUA
 -----------------------------------------------------------------------
 NONE		yes/no		N/A		no		N/A
 DRAIN		no		no		N/A		N/A
 DRAIN_FLUSH	yes		no		yes		no
 DRAIN_FUA	yes		no		yes		yes
 TAG		no		yes		N/A		N/A
 TAG_FLUSH	yes		yes		yes		no
 TAG_FUA		yes		yes		yes		yes
 QUEUE_ORDERED_NONE
 	I/O barriers are not needed and/or supported.
 	Sequence: N/A
 QUEUE_ORDERED_DRAIN
 	Requests are ordered by draining the request queue and cache
 	flushing isn't needed.
 	Sequence: drain => barrier
 QUEUE_ORDERED_DRAIN_FLUSH
 	Requests are ordered by draining the request queue and both
 	pre-barrier and post-barrier cache flushings are needed.
 	Sequence: drain => preflush => barrier => postflush
 QUEUE_ORDERED_DRAIN_FUA
 	Requests are ordered by draining the request queue and
 	pre-barrier cache flushing is needed.  By using FUA on barrier
 	request, post-barrier flushing can be skipped.
 	Sequence: drain => preflush => barrier
 QUEUE_ORDERED_TAG
 	Requests are ordered by ordered tag and cache flushing isn't
 	needed.
 	Sequence: barrier
 QUEUE_ORDERED_TAG_FLUSH
 	Requests are ordered by ordered tag and both pre-barrier and
 	post-barrier cache flushings are needed.
 	Sequence: preflush -> barrier -> postflush
 QUEUE_ORDERED_TAG_FUA
 	Requests are ordered by ordered tag and pre-barrier cache
 	flushing is needed.  By using FUA on barrier request,
 	post-barrier flushing can be skipped.
 	Sequence: preflush -> barrier
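The table and mode descriptions above can be read as a decision procedure.
The following is a minimal user-space sketch of that reading, not the block
layer's code: the enum, its values and the function names
(pick_ordered_mode, ordered_sequence) are hypothetical and only mirror the
QUEUE_ORDERED_* names used in this document.

```c
#include <stdbool.h>

/* Illustrative mirror of the QUEUE_ORDERED_* names; values are arbitrary. */
enum ordered_mode {
	ORDERED_NONE,
	ORDERED_DRAIN,
	ORDERED_DRAIN_FLUSH,
	ORDERED_DRAIN_FUA,
	ORDERED_TAG,
	ORDERED_TAG_FLUSH,
	ORDERED_TAG_FUA,
};

/* Pick a mode from the device features, following the table row by row. */
enum ordered_mode pick_ordered_mode(bool wb_cache, bool ordered_tag,
				    bool flush, bool fua)
{
	/* No write-back cache: keeping requests ordered is enough. */
	if (!wb_cache)
		return ordered_tag ? ORDERED_TAG : ORDERED_DRAIN;
	/* Write-back cache without flush: barriers cannot be supported. */
	if (!flush)
		return ORDERED_NONE;
	/* FUA on the barrier write replaces the post-barrier flush. */
	if (fua)
		return ordered_tag ? ORDERED_TAG_FUA : ORDERED_DRAIN_FUA;
	return ordered_tag ? ORDERED_TAG_FLUSH : ORDERED_DRAIN_FLUSH;
}

/* Return the sequence string given in the mode descriptions. */
const char *ordered_sequence(enum ordered_mode mode)
{
	switch (mode) {
	case ORDERED_DRAIN:	  return "drain => barrier";
	case ORDERED_DRAIN_FLUSH: return "drain => preflush => barrier => postflush";
	case ORDERED_DRAIN_FUA:	  return "drain => preflush => barrier";
	case ORDERED_TAG:	  return "barrier";
	case ORDERED_TAG_FLUSH:	  return "preflush -> barrier -> postflush";
	case ORDERED_TAG_FUA:	  return "preflush -> barrier";
	default:		  return "N/A";
	}
}
```

Note that a SCSI driver would currently have to pass ordered_tag as false
even when the hardware supports ordered tags, for the reason given in the
notes below.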
 Random notes/caveats
 --------------------
 * SCSI layer currently can't use TAG ordering even if the drive,
 controller and driver support it.  The problem is that SCSI midlayer
 request dispatch function is not atomic.  It releases queue lock and
 switches to SCSI host lock during issue, and it's possible and likely
 that requests change their relative positions in the meantime.  Once
 this problem is solved, TAG ordering can be enabled.
 * Currently, no matter which ordered mode is used, there can be only
 one barrier request in progress.  All I/O barriers are held off by
 block layer until the previous I/O barrier is complete.  This doesn't
 make any difference for DRAIN ordered devices, but, for TAG ordered
 devices with very high command latency, passing multiple I/O barriers
 to low level *might* be helpful if they are very frequent.  Well, this
 certainly is a non-issue.  I'm writing this just to make clear that no
 two I/O barriers are ever passed to the low-level driver at the same time.
 * Completion order.  Requests in ordered sequence are issued in order
 but not required to finish in order.  Barrier implementation can
 handle out-of-order completion of ordered sequence.  IOW, the requests
 MUST be processed in order but the hardware/software completion paths
 are allowed to reorder completion notifications - eg. current SCSI
 midlayer doesn't preserve completion order during error handling.
 * Requeueing order.  Low-level drivers are free to requeue any request
 after they removed it from the request queue with
 blkdev_dequeue_request().  As barrier sequence should be kept in order
 when requeued, generic elevator code takes care of putting requests in
 order around barrier.  See blk_ordered_req_seq() and
 ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.
 Note that block drivers must not requeue preceding requests while
 completing latter requests in an ordered sequence.  Currently, no
 error checking is done against this.
 * Error handling.  Currently, block layer will report error to upper
 layer if any of requests in an ordered sequence fails.  Unfortunately,
 this doesn't seem to be enough.  Look at the following request flow.
 QUEUE_ORDERED_TAG_FLUSH is in use.
 [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
 					  still in elevator
 Let's say request [2], [3] are write requests to update file system
 metadata (journal or whatever) and [barrier] is used to mark that
 those updates are valid.  Consider the following sequence.
 i.	Requests [0] ~ [post] leave the request queue and enter
 	low-level driver.
 ii.	After a while, unfortunately, something goes wrong and the
 	drive fails [2].  Note that any of [0], [1] and [3] could have
 	completed by this time, but [pre] couldn't have been finished
 	as the drive must process it in order and it failed before
 	processing that command.
 iii.	Error handling kicks in and determines that the error is
 	unrecoverable and fails [2], and resumes operation.
 iv.	[pre] [barrier] [post] gets processed.
 v.	*BOOM* power fails
 The problem here is that the barrier request is *supposed* to indicate
 that filesystem update requests [2] and [3] made it safely to the
 physical medium and, if the machine crashes after the barrier is
 written, filesystem recovery code can depend on that.  Sadly, that
 isn't true in this case anymore.  IOW, the success of an I/O barrier
 should also be dependent on success of some of the preceding requests,
 where only upper layer (filesystem) knows what 'some' is.
 This can be solved by implementing a way to tell the block layer which
 requests affect the success of the following barrier request and
 making lower level drivers resume operation on error only after
 block layer tells it to do so.
 As the probability of this happening is very low and the drive should
 be faulty, implementing the fix is probably an overkill.  But, still,
 it's there.
 * In previous drafts of barrier implementation, there was fallback
 mechanism such that, if FUA or ordered TAG fails, less fancy ordered
 mode can be selected and the failed barrier request is retried
 automatically.  The rationale for this feature was that as FUA is
 pretty new in ATA world and ordered tag was never used widely, there
 could be devices which report to support those features but choke when
 actually given such requests.
 This was removed for two reasons: 1. it's an overkill 2. it's
 impossible to implement properly when TAG ordering is used as low
 level drivers resume after an error automatically.  If it's ever
 needed, adding it back and modifying low level drivers accordingly
 shouldn't be difficult.
@@ -31,7 +31,7 @@ The following people helped with review comments and inputs for this
 document:
 	Christoph Hellwig <hch@infradead.org>
 	Arjan van de Ven <arjanv@redhat.com>
-	Randy Dunlap <rddunlap@osdl.org>
+	Randy Dunlap <rdunlap@xenotime.net>
 	Andre Hedrick <andre@linux-ide.org>
 The following people helped with fixes/contributions to the bio patches
@@ -263,14 +263,8 @@ A flag in the bio structure, BIO_BARRIER is used to identify a barrier i/o.
 The generic i/o scheduler would make sure that it places the barrier request and
 all other requests coming after it after all the previous requests in the
 queue. Barriers may be implemented in different ways depending on the
-driver. A SCSI driver for example could make use of ordered tags to
+driver. For more details regarding I/O barriers, please read barrier.txt
-preserve the necessary ordering with a lower impact on throughput. For IDE
+in this directory.
 this might be two sync cache flush: a pre and post flush when encountering
 a barrier write.
 There is a provision for queues to indicate what kind of barriers they
 can provide. This is as of yet unmerged, details will be added here once it
 is in the kernel.
 1.2.2 Request Priority/Latency
@@ -0,0 +1,82 @@
 Block layer statistics in /sys/block/<dev>/stat
 ===============================================
 This file documents the contents of the /sys/block/<dev>/stat file.
 The stat file provides several statistics about the state of block
 device <dev>.
 Q. Why are there multiple statistics in a single file?  Doesn't sysfs
   normally contain a single value per file?
 A. By having a single file, the kernel can guarantee that the statistics
   represent a consistent snapshot of the state of the device.  If the
   statistics were exported as multiple files containing one statistic
   each, it would be impossible to guarantee that a set of readings
   represent a single point in time.
 The stat file consists of a single line of text containing 11 decimal
 values separated by whitespace.  The fields are summarized in the
 following table, and described in more detail below.
 Name            units         description
 ----            -----         -----------
 read I/Os       requests      number of read I/Os processed
 read merges     requests      number of read I/Os merged with in-queue I/O
 read sectors    sectors       number of sectors read
 read ticks      milliseconds  total wait time for read requests
 write I/Os      requests      number of write I/Os processed
 write merges    requests      number of write I/Os merged with in-queue I/O
 write sectors   sectors       number of sectors written
 write ticks     milliseconds  total wait time for write requests
 in_flight       requests      number of I/Os currently in flight
 io_ticks        milliseconds  total time this block device has been active
 time_in_queue   milliseconds  total wait time for all requests
 read I/Os, write I/Os
 =====================
 These values increment when an I/O request completes.
 read merges, write merges
 =========================
 These values increment when an I/O request is merged with an
 already-queued I/O request.
 read sectors, write sectors
 ===========================
 These values count the number of sectors read from or written to this
 block device.  The "sectors" in question are the standard UNIX 512-byte
 sectors, not any device- or filesystem-specific block size.  The
 counters are incremented when the I/O completes.
 read ticks, write ticks
 =======================
 These values count the number of milliseconds that I/O requests have
 waited on this block device.  If there are multiple I/O requests waiting,
 these values will increase at a rate greater than 1000/second; for
 example, if 60 read requests wait for an average of 30 ms, the read_ticks
 field will increase by 60*30 = 1800.
 in_flight
 =========
 This value counts the number of I/O requests that have been issued to
 the device driver but have not yet completed.  It does not include I/O
 requests that are in the queue but not yet issued to the device driver.
 io_ticks
 ========
 This value counts the number of milliseconds during which the device has
 had I/O requests queued.
 time_in_queue
 =============
 This value counts the number of milliseconds that I/O requests have waited
 on this block device.  If there are multiple I/O requests waiting, this
 value will increase as the product of the number of milliseconds times the
 number of requests waiting (see "read ticks" above for an example).
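As an illustration of the format described above, here is a minimal
user-space sketch that parses one stat line into its 11 fields.  The
struct and function names (blk_stat, parse_blk_stat, blk_stat_read_bytes)
are hypothetical and chosen only for this example; the caller is assumed
to have already read the line from /sys/block/<dev>/stat.

```c
#include <stdio.h>

/* The 11 fields of the stat file, in the order listed in the table. */
struct blk_stat {
	unsigned long long read_ios, read_merges, read_sectors, read_ticks;
	unsigned long long write_ios, write_merges, write_sectors, write_ticks;
	unsigned long long in_flight, io_ticks, time_in_queue;
};

/* Parse one stat line.  Returns 0 on success, -1 if fields are missing. */
int parse_blk_stat(const char *line, struct blk_stat *s)
{
	int n = sscanf(line,
		       "%llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
		       &s->read_ios, &s->read_merges, &s->read_sectors,
		       &s->read_ticks, &s->write_ios, &s->write_merges,
		       &s->write_sectors, &s->write_ticks, &s->in_flight,
		       &s->io_ticks, &s->time_in_queue);

	return n == 11 ? 0 : -1;
}

/* Bytes read, using the fixed 512-byte sector unit described above. */
unsigned long long blk_stat_read_bytes(const struct blk_stat *s)
{
	return s->read_sectors * 512ULL;
}
```

Because all values come from a single line, two successive parses can be
subtracted field by field to obtain rates over the sampling interval.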
@@ -136,7 +136,7 @@ changes occur:
 8) void lazy_mmu_prot_update(pte_t pte)
 	This interface is called whenever the protection on
 	any user PTEs change.  This interface provides a notification
-	to architecture specific code to take appropiate action.
+	to architecture specific code to take appropriate action.
 Next, we have the cache flushing interfaces.  In general, when Linux
@@ -27,6 +27,7 @@ Contents:
 2.2  Powersave
 2.3  Userspace
 2.4  Ondemand
 2.5  Conservative
 3.   The Governor Interface in the CPUfreq Core
@@ -110,9 +111,64 @@ directory.
 The CPUfreq governor "ondemand" sets the CPU depending on the
 current usage. To do this the CPU must have the capability to
-switch the frequency very fast.
+switch the frequency very quickly.  There are a number of sysfs file
 accessible parameters:
 sampling_rate: measured in uS (10^-6 seconds), this is how often you
 want the kernel to look at the CPU usage and to make decisions on
 what to do about the frequency.  Typically this is set to values of
 around '10000' or more.
 show_sampling_rate_(min|max): the minimum and maximum sampling rates
 available that you may set 'sampling_rate' to.
 up_threshold: defines what the average CPU usage between the samplings
 of 'sampling_rate' needs to be for the kernel to make a decision on
 whether it should increase the frequency.  For example when it is set
 to its default value of '80' it means that between the checking
 intervals the CPU needs to be on average more than 80% in use to then
 decide that the CPU frequency needs to be increased.  
 sampling_down_factor: this parameter controls the rate that the CPU
 makes a decision on when to decrease the frequency.  When set to its
 default value of '5' it means that at 1/5 the sampling_rate the kernel
 makes a decision to lower the frequency.  Five "lower rate" decisions
 have to be made in a row before the CPU frequency is actually lowered.
 If set to '1' then the frequency decreases as quickly as it increases,
 if set to '2' it decreases at half the rate of the increase.
 ignore_nice_load: this parameter takes a value of '0' or '1', when set
 to '0' (its default) then all processes are counted towards the
 'cpu utilisation' value.   When set to '1' then processes that are
 run with a 'nice' value will not count (and thus be ignored) in the
 overall usage calculation.  This is useful if you are running a CPU
 intensive calculation on your laptop that you do not care how long it
 takes to complete as you can 'nice' it and prevent it from taking part
 in the deciding process of whether to increase your CPU frequency.
 2.5 Conservative
 ----------------
 The CPUfreq governor "conservative", much like the "ondemand"
 governor, sets the CPU depending on the current usage.  It differs in
 behaviour in that it gracefully increases and decreases the CPU speed
 rather than jumping to max speed the moment there is any load on the
 CPU.  This behaviour is more suitable in a battery powered environment.
 The governor is tweaked in the same manner as the "ondemand" governor
 through sysfs with the addition of:
 freq_step: this describes what percentage steps the cpu freq should be
 increased and decreased smoothly by.  By default the cpu frequency will
 increase in 5% chunks of your maximum cpu frequency.  You can change this
 value to anywhere between 0 and 100 where '0' will effectively lock your
 CPU at a speed regardless of its load whilst '100' will, in theory, make
 it behave identically to the "ondemand" governor.
 down_threshold: same as the 'up_threshold' found for the "ondemand"
 governor but for the opposite direction.  For example when set to its
 default value of '20' it means that the CPU usage needs to be below
 20% between samples to have the frequency decreased.
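To make the interaction of up_threshold, down_threshold and freq_step
concrete, here is a simplified sketch of a single sampling decision.  It
is an assumption-laden model of the behaviour described above, not the
governor's actual code: the function name conservative_next_freq is
hypothetical, and sampling_down_factor and ignore_nice_load are ignored.

```c
/*
 * One sampling decision.  Frequencies are in kHz, load and thresholds
 * in percent, freq_step in percent of the maximum frequency.
 */
unsigned int conservative_next_freq(unsigned int cur, unsigned int min,
				    unsigned int max, unsigned int load,
				    unsigned int up_threshold,
				    unsigned int down_threshold,
				    unsigned int freq_step)
{
	/* Step size is a percentage of the maximum frequency. */
	unsigned int step =
		(unsigned int)((unsigned long long)max * freq_step / 100);

	if (load > up_threshold) {
		/* Step up, clamping at the maximum. */
		if (cur >= max || max - cur <= step)
			return max;
		return cur + step;
	}
	if (load < down_threshold) {
		/* Step down, clamping at the minimum. */
		if (cur <= min || cur - min <= step)
			return min;
		return cur - step;
	}
	/* Load is between the two thresholds: keep the current frequency. */
	return cur;
}
```

With freq_step set to 0 the step is zero and the frequency never moves;
with 100 a single busy sample jumps straight to the maximum, which matches
the two extremes described for freq_step.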
 3. The Governor Interface in the CPUfreq Core
 =============================================
@@ -0,0 +1,357 @@
 		CPU hotplug Support in Linux(tm) Kernel
 		Maintainers:
 		CPU Hotplug Core:
 			Rusty Russell <rusty@rustycorp.com.au>
 			Srivatsa Vaddagiri <vatsa@in.ibm.com>
 		i386:
 			Zwane Mwaikambo <zwane@arm.linux.org.uk>
 		ppc64:
 			Nathan Lynch <nathanl@austin.ibm.com>
 			Joel Schopp <jschopp@austin.ibm.com>
 		ia64/x86_64:
 			Ashok Raj <ashok.raj@intel.com>
 Authors: Ashok Raj <ashok.raj@intel.com>
 Lots of feedback: Nathan Lynch <nathanl@austin.ibm.com>,
 	     Joel Schopp <jschopp@austin.ibm.com>
 Introduction
 Modern advances in system architectures have introduced advanced error
 reporting and correction capabilities in processors. CPU architectures permit
 partitioning support, where compute resources of a single CPU could be made
 available to virtual machine environments. There are a couple of OEMs that
 support NUMA hardware which is hot pluggable as well, where physical
 node insertion and removal require support for CPU hotplug.
 Such advances require CPUs available to a kernel to be removed either for
 provisioning reasons, or for RAS purposes to keep an offending CPU off
 system execution path. Hence the need for CPU hotplug support in the
 Linux kernel.
 A more novel use of CPU-hotplug support is its use today in suspend
 resume support for SMP. Dual-core and HT support makes even
 a laptop run SMP kernels which didn't support these methods. SMP support
 for suspend/resume is a work in progress.
 General Stuff about CPU Hotplug
 --------------------------------
 Command Line Switches
 ---------------------
 maxcpus=n    Restrict boot time cpus to n. Say if you have 4 cpus, using
             maxcpus=2 will only boot 2. You can choose to bring the
             other cpus later online, read FAQ's for more info.
 additional_cpus=n	[x86_64 only] use this to limit hotpluggable cpus.
                        This option sets
 			cpu_possible_map = cpu_present_map + additional_cpus
 CPU maps and such
 -----------------
 [For more on cpumaps and the primitives to manipulate them, please check
 include/linux/cpumask.h, which has more descriptive text.]
 cpu_possible_map: Bitmap of possible CPUs that can ever be available in the
 system. This is used to allocate some boot time memory for per_cpu variables
 that aren't designed to grow/shrink as CPUs are made available or removed.
 Once set during boot time discovery phase, the map is static, i.e no bits
 are added or removed anytime.  Trimming it accurately for your system needs
 upfront can save some boot time memory. See below for how we use heuristics
 in x86_64 case to keep this under check.
 cpu_online_map: Bitmap of all CPUs currently online. It's set in __cpu_up()
 after a cpu is available for kernel scheduling and ready to receive
 interrupts from devices. It's cleared when a cpu is brought down using
 __cpu_disable(), before which all OS services including interrupts are
 migrated to another target CPU.
 cpu_present_map: Bitmap of CPUs currently present in the system. Not all
 of them may be online. When physical hotplug is processed by the relevant
 subsystem (e.g. ACPI), the map can change: a bit is either added to or removed
 from the map depending on whether the event is hot-add or hot-remove. There
 are currently no locking rules. Typical usage is to init topology during boot,
 at which time hotplug is disabled.
 You really don't need to manipulate any of the system cpu maps. They should
 be read-only for most use. When setting up per-cpu resources almost always use
 cpu_possible_map/for_each_cpu() to iterate.
 Never use anything other than cpumask_t to represent bitmap of CPUs.
 #include <linux/cpumask.h>
 for_each_cpu              - Iterate over cpu_possible_map
 for_each_online_cpu       - Iterate over cpu_online_map
 for_each_present_cpu      - Iterate over cpu_present_map
 for_each_cpu_mask(x,mask) - Iterate over some random collection of cpu mask.
 #include <linux/cpu.h>
 lock_cpu_hotplug() and unlock_cpu_hotplug():
 The above calls are used to inhibit cpu hotplug operations. While holding the
 cpucontrol mutex, cpu_online_map will not change. If you merely need to avoid
 cpus going away, you could also use preempt_disable() and preempt_enable()
 for those sections. Just remember the critical section cannot call any
 function that can sleep or schedule this process away. The preempt_disable()
 will work as long as stop_machine_run() is used to take a cpu down.
 CPU Hotplug - Frequently Asked Questions.
 Q: How do I enable my kernel to support CPU hotplug?
 A: When doing make defconfig, Enable CPU hotplug support
   "Processor type and Features" -> Support for Hotpluggable CPUs
 Make sure that you have CONFIG_HOTPLUG, and CONFIG_SMP turned on as well.
 You would need to enable CONFIG_HOTPLUG_CPU for SMP suspend/resume support
 as well.
 Q: What architectures support CPU hotplug?
 A: As of 2.6.14, the following architectures support CPU hotplug.
 i386 (Intel), ppc, ppc64, parisc, s390, ia64 and x86_64
 Q: How to test if hotplug is supported on the newly built kernel?
 A: You should now notice an entry in sysfs.
 Check if sysfs is mounted, using the "mount" command. You should notice
 an entry as shown below in the output.
 ....
 none on /sys type sysfs (rw)
 ....
 if this is not mounted, do the following.
 #mkdir /sys
 #mount -t sysfs sys /sys
 now you should see entries for all present cpu, the following is an example
 in a 8-way system.
 #pwd
 #/sys/devices/system/cpu
 #ls -l
 total 0
 drwxr-xr-x  10 root root 0 Sep 19 07:44 .
 drwxr-xr-x  13 root root 0 Sep 19 07:45 ..
 drwxr-xr-x   3 root root 0 Sep 19 07:44 cpu0
 drwxr-xr-x   3 root root 0 Sep 19 07:44 cpu1
 drwxr-xr-x   3 root root 0 Sep 19 07:44 cpu2
 drwxr-xr-x   3 root root 0 Sep 19 07:44 cpu3
 drwxr-xr-x   3 root root 0 Sep 19 07:44 cpu4
 drwxr-xr-x   3 root root 0 Sep 19 07:44 cpu5
 drwxr-xr-x   3 root root 0 Sep 19 07:44 cpu6
 drwxr-xr-x   3 root root 0 Sep 19 07:48 cpu7
 Under each directory you would find an "online" file which is the control
 file to logically online/offline a processor.
 Q: Does hot-add/hot-remove refer to physical add/remove of cpus?
 A: The terms hot-add/remove are not used very consistently in the code.
 CONFIG_HOTPLUG_CPU enables logical online/offline capability in the kernel.
 To support physical addition/removal, one would need some BIOS hooks and
 the platform should have something like an attention button in PCI hotplug.
 CONFIG_ACPI_HOTPLUG_CPU enables ACPI support for physical add/remove of CPUs.
 Q: How do I logically offline a CPU?
 A: Do the following.
 #echo 0 > /sys/devices/system/cpu/cpuX/online
 once the logical offline is successful, check
 #cat /proc/interrupts
 you should now not see the CPU that you removed. Also the online file will
 report the state as 0 when a cpu is offline and 1 when it's online.
 #To display the current cpu state.
 #cat /sys/devices/system/cpu/cpuX/online
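The same control file can be driven from a program.  Below is a minimal
user-space sketch of the echo/cat commands above; the helper names
(cpu_online_path, write_online_state, read_online_state) are hypothetical
and exist only for this example.

```c
#include <stdio.h>

/* Build the sysfs control file path for a CPU.  Returns 0 on success. */
int cpu_online_path(unsigned int cpu, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%u/online", cpu);

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Write "0" or "1" to a control file.  Returns 0 on success, -1 on error. */
int write_online_state(const char *path, int online)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;	/* e.g. no "online" file for this CPU */
	ok = fputs(online ? "1\n" : "0\n", f) >= 0;
	/* A refused request is reported when the buffered data is flushed. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Read the state back: 1 online, 0 offline, -1 if missing or unreadable. */
int read_online_state(const char *path)
{
	FILE *f = fopen(path, "r");
	int state = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &state) != 1)
		state = -1;
	fclose(f);
	return state;
}
```

Checking the return value of write_online_state() plays the same role as
checking the exit status of the echo command.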
 Q: Why can't I remove CPU0 on some systems?
 A: Some architectures may have some special dependency on a certain CPU.
 For example, IA64 platforms have the ability to send platform interrupts,
 a.k.a. Corrected Platform Error Interrupts (CPEI), to the OS. In current ACPI
 specifications, we didn't have a way to change the target CPU. Hence if the
 current ACPI version doesn't support such re-direction, we disable that CPU
 by making it not-removable.
 In such cases you will also notice that the online file is missing under cpu0.
 Q: How do I find out if a particular CPU is not removable?
 A: Depending on the implementation, some architectures may show this by the
 absence of the "online" file. This is done if it can be determined ahead of
 time that this CPU cannot be removed.
 In some situations, this can be a run time check, i.e if you try to remove the
 last CPU, this will not be permitted. You can find such failures by
 investigating the return value of the "echo" command.
 Q: What happens when a CPU is being logically offlined?
 A: The following happen, listed in no particular order :-)
 - A notification is sent to in-kernel registered modules by sending an event
  CPU_DOWN_PREPARE
 - All processes are migrated away from this outgoing CPU to a new CPU
 - All interrupts targeted to this CPU are migrated to a new CPU
 - timers/bottom halves/tasklets are also migrated to a new CPU
 - Once all services are migrated, kernel calls an arch specific routine
  __cpu_disable() to perform arch specific cleanup.
 - Once this is successful, an event for successful cleanup is sent by an event
  CPU_DEAD.
  "It is expected that each service cleans up when the CPU_DOWN_PREPARE
  notifier is called; when CPU_DEAD is called it's expected there is nothing
  running on behalf of this CPU that was offlined"
 Q: If I have some kernel code that needs to be aware of CPU arrival and
   departure, how do I arrange for proper notification?
 A: This is what you would need in your kernel code to receive notifications.
    #include <linux/cpu.h>
    static int __cpuinit foobar_cpu_callback(struct notifier_block *nfb,
 					    unsigned long action, void *hcpu)
 	{
 		unsigned int cpu = (unsigned long)hcpu;
 		switch (action) {
 		case CPU_ONLINE:
 			foobar_online_action(cpu);
 			break;
 		case CPU_DEAD:
 			foobar_dead_action(cpu);
 			break;
 		}
 		return NOTIFY_OK;
 	}
 	static struct notifier_block foobar_cpu_notifier =
 	{
 	   .notifier_call = foobar_cpu_callback,
 	};
 In your init function,
 	register_cpu_notifier(&foobar_cpu_notifier);
 You can fail PREPARE notifiers if something doesn't work to prepare resources.
 This will stop the activity and send a following CANCELED event back.
 CPU_DEAD should not be failed, it's just a goodness indication, but bad
 things will happen if a notifier in path sent a BAD notify code.
 Q: I don't see my action being called for all CPUs already up and running?
 A: Yes, CPU notifiers are called only when new CPUs are on-lined or offlined.
   If you need to perform some action for each cpu already in the system, then
  for_each_online_cpu(i) {
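 		/* hcpu is a void *, so pass the cpu number cast accordingly */
 		foobar_cpu_callback(&foobar_cpu_notifier, CPU_UP_PREPARE,
 				    (void *)(unsigned long)i);
 		foobar_cpu_callback(&foobar_cpu_notifier, CPU_ONLINE,
 				    (void *)(unsigned long)i);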
  }
 Q: If I would like to develop cpu hotplug support for a new architecture,
   what do I need at a minimum?
 A: The following are what is required for CPU hotplug infrastructure to work
   correctly.
    - Make sure you have an entry in Kconfig to enable CONFIG_HOTPLUG_CPU
    - __cpu_up()        - Arch interface to bring up a CPU
    - __cpu_disable()   - Arch interface to shutdown a CPU, no more interrupts
                          can be handled by the kernel after the routine
                          returns. Including local APIC timers etc are
                          shutdown.
     - __cpu_die()      - This is actually supposed to ensure death of the CPU.
                          Actually look at some example code in other arch
                          that implement CPU hotplug. The processor is taken
                          down from the idle() loop for that specific
                          architecture. __cpu_die() typically waits for some
                          per_cpu state to be set, to ensure the processor
                          dead routine is called to be sure positively.
 Q: I need to ensure that a particular cpu is not removed while some
   work specific to this cpu is in progress.
 A: First switch the current thread context to preferred cpu
   int my_func_on_cpu(int cpu)
   {
       cpumask_t saved_mask, new_mask = CPU_MASK_NONE;
       int curr_cpu, err = 0;
       saved_mask = current->cpus_allowed;
       cpu_set(cpu, new_mask);
       err = set_cpus_allowed(current, new_mask);
       if (err)
           return err;
       /*
        * If we got scheduled out just after the return from
        * set_cpus_allowed() before running the work, this ensures
        * we stay locked.
        */
       curr_cpu = get_cpu();
       if (curr_cpu != cpu) {
 	   err = -EAGAIN;
           goto ret;
       } else {
       	   /*
 	    * Do work : But can't sleep, since get_cpu() disables preempt
 	    */
       }
    ret:
    	put_cpu();
 	set_cpus_allowed(current, saved_mask);
 	return err;
    }
 Q: How do we determine how many CPUs are available for hotplug?
 A: There is no clear spec defined way from ACPI that can give us that
   information today. Based on some input from Natalie of Unisys,
   the ACPI MADT (Multiple APIC Description Table) marks those possible
   CPUs in a system with disabled status.
   Andi implemented some simple heuristics that count the number of disabled
   CPUs in MADT as hotpluggable CPUS.  In the case there are no disabled CPUS
   we assume 1/2 the number of CPUs currently present can be hotplugged.
   Caveat: Today's ACPI MADT can only provide 256 entries since the apicid field
   in MADT is only 8 bits.
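The heuristic described in this answer, together with the additional_cpus
rule given under the command line switches, can be sketched as follows.
This is an illustration only, not the x86_64 implementation; the function
names hotpluggable_cpus and possible_cpus are hypothetical.

```c
/*
 * Estimate the number of hotpluggable CPUs: the count of disabled MADT
 * entries if there are any, otherwise half of the CPUs present.
 */
unsigned int hotpluggable_cpus(unsigned int present, unsigned int disabled)
{
	return disabled ? disabled : present / 2;
}

/* Size of cpu_possible_map: present CPUs plus the additional ones. */
unsigned int possible_cpus(unsigned int present, unsigned int additional)
{
	return present + additional;
}
```

For example, a system with 4 present CPUs and no disabled MADT entries
would be sized for 4 + 2 = 6 possible CPUs under this reading.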
 User Space Notification
 Hotplug support for devices is common in Linux today. It's being used today to
 support automatic configuration of network, usb and pci devices. A hotplug
 event can be used to invoke an agent script to perform the configuration task.
 You can add /etc/hotplug/cpu.agent to handle hotplug notification user space
 scripts.
 	#!/bin/bash
 	# $Id: cpu.agent
 	# Kernel hotplug params include:
 	#ACTION=%s [online or offline]
 	#DEVPATH=%s
 	#
 	cd /etc/hotplug
 	. ./hotplug.functions
 	case $ACTION in
 		online)
 			echo `date` ":cpu.agent" add cpu >> /tmp/hotplug.txt
 			;;
 		offline)
 			echo `date` ":cpu.agent" remove cpu >>/tmp/hotplug.txt
 			;;
 		*)
 			debug_mesg CPU $ACTION event not supported
 			exit 1
 			;;
 	esac
@@ -14,7 +14,10 @@ CONTENTS:
  1.1 What are cpusets ?
  1.2 Why are cpusets needed ?
  1.3 How are cpusets implemented ?
-  1.4 How do I use cpusets ?
+  1.4 What are exclusive cpusets ?
  1.5 What does notify_on_release do ?
  1.6 What is memory_pressure ?
  1.7 How do I use cpusets ?
 2. Usage Examples and Syntax
  2.1 Basic Usage
  2.2 Adding/removing cpus
@@ -49,29 +52,6 @@ its cpus_allowed vector, and the kernel page allocator will not
 allocate a page on a node that is not allowed in the requesting tasks
 mems_allowed vector.
 If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct
 ancestor or descendent, may share any of the same CPUs or Memory Nodes.
 A cpuset that is cpu exclusive has a sched domain associated with it.
 The sched domain consists of all cpus in the current cpuset that are not
 part of any exclusive child cpusets.
 This ensures that the scheduler load balacing code only balances
 against the cpus that are in the sched domain as defined above and not
 all of the cpus in the system. This removes any overhead due to
 load balancing code trying to pull tasks outside of the cpu exclusive
 cpuset only to be prevented by the tasks' cpus_allowed mask.
 A cpuset that is mem_exclusive restricts kernel allocations for
 page, buffer and other data commonly shared by the kernel across
 multiple users.  All cpusets, whether mem_exclusive or not, restrict
 allocations of memory for user space.  This enables configuring a
 system so that several independent jobs can share common kernel
 data, such as file system pages, while isolating each jobs user
 allocation in its own cpuset.  To do this, construct a large
 mem_exclusive cpuset to hold all the jobs, and construct child,
 non-mem_exclusive cpusets for each individual job.  Only a small
 amount of typical kernel memory, such as requests from interrupt
 handlers, is allowed to be taken outside even a mem_exclusive cpuset.
 User level code may create and destroy cpusets by name in the cpuset
 virtual file system, manage the attributes and permissions of these
 cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
@@ -155,7 +135,7 @@ Cpusets extends these two mechanisms as follows:
 The implementation of cpusets requires a few, simple hooks
 into the rest of the kernel, none in performance critical paths:
- - in main/init.c, to initialize the root cpuset at system boot.
+ - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that tasks cpuset.
@@ -166,7 +146,7 @@ into the rest of the kernel, none in performance critical paths:
   and related changes in both sched.c and arch/ia64/kernel/domain.c
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that tasks cpuset.
- - in page_alloc, to restrict memory to allowed nodes.
+ - in page_alloc.c, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page recovery to the current cpuset.
 In addition a new file system, of type "cpuset" may be mounted,
@@ -192,9 +172,15 @@ containing the following files describing that cpuset:
 - cpus: list of CPUs in that cpuset
 - mems: list of Memory Nodes in that cpuset
 - memory_migrate flag: if set, move pages to cpusets nodes
 - cpu_exclusive flag: is cpu placement exclusive?
 - mem_exclusive flag: is memory placement exclusive?
 - tasks: list of tasks (by pid) attached to that cpuset
 - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
 - memory_pressure: measure of how much paging pressure in cpuset
 In addition, the root cpuset only has the following file:
 - memory_pressure_enabled flag: compute memory_pressure?
 New cpusets are created using the mkdir system call or shell
 command.  The properties of a cpuset, such as its flags, allowed
@@ -228,7 +214,108 @@ exclusive cpuset.  Also, the use of a Linux virtual file system (vfs)
 to represent the cpuset hierarchy provides for a familiar permission
 and name space for cpusets, with a minimum of additional kernel code.
-1.4 How do I use cpusets ?
+
 1.4 What are exclusive cpusets ?
 --------------------------------
 If a cpuset is cpu or mem exclusive, no other cpuset, other than
 a direct ancestor or descendent, may share any of the same CPUs or
 Memory Nodes.
 A cpuset that is cpu_exclusive has a scheduler (sched) domain
 associated with it.  The sched domain consists of all CPUs in the
 current cpuset that are not part of any exclusive child cpusets.
 This ensures that the scheduler load balancing code only balances
 against the CPUs that are in the sched domain as defined above and
 not all of the CPUs in the system. This removes any overhead due to
 load balancing code trying to pull tasks outside of the cpu_exclusive
 cpuset only to be prevented by the tasks' cpus_allowed mask.
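The rule for which CPUs end up in a cpu_exclusive cpuset's sched domain can be sketched with plain bitmasks. This is a simplified userspace model, not kernel code: the function name is hypothetical, and representing a CPU set as a 64-bit mask (bit n stands for CPU n) is an assumption made for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: return the CPUs of a cpu_exclusive cpuset that
 * belong to its sched domain, i.e. the cpuset's own CPUs minus the
 * CPUs of every exclusive child cpuset.
 */
static uint64_t sched_domain_cpus(uint64_t cpuset_cpus,
                                  const uint64_t *exclusive_children,
                                  size_t nr_children)
{
    uint64_t domain = cpuset_cpus;
    size_t i;

    /* Remove each exclusive child's CPUs from the parent's domain. */
    for (i = 0; i < nr_children; i++)
        domain &= ~exclusive_children[i];

    return domain;
}
```

With CPUs 0-7 in the parent and two exclusive children holding CPUs 0-1 and 4-5, the parent's sched domain in this model is CPUs 2-3 and 6-7.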
 A cpuset that is mem_exclusive restricts kernel allocations for
 page, buffer and other data commonly shared by the kernel across
 multiple users.  All cpusets, whether mem_exclusive or not, restrict
 allocations of memory for user space.  This enables configuring a
 system so that several independent jobs can share common kernel data,
 such as file system pages, while isolating each job's user allocation in
 its own cpuset.  To do this, construct a large mem_exclusive cpuset to
 hold all the jobs, and construct child, non-mem_exclusive cpusets for
 each individual job.  Only a small amount of typical kernel memory,
 such as requests from interrupt handlers, is allowed to be taken
 outside even a mem_exclusive cpuset.
 1.5 What does notify_on_release do ?
 ------------------------------------
 If the notify_on_release flag is enabled (1) in a cpuset, then whenever
 the last task in the cpuset leaves (exits or attaches to some other
 cpuset) and the last child cpuset of that cpuset is removed, then
 the kernel runs the command /sbin/cpuset_release_agent, supplying the
 pathname (relative to the mount point of the cpuset file system) of the
 abandoned cpuset.  This enables automatic removal of abandoned cpusets.
 The default value of notify_on_release in the root cpuset at system
 boot is disabled (0).  The default value of other cpusets at creation
 is the current value of their parent's notify_on_release setting.
 1.6 What is memory_pressure ?
 -----------------------------
 The memory_pressure of a cpuset provides a simple per-cpuset metric
 of the rate at which the tasks in a cpuset are attempting to free up
 in-use memory on the nodes of the cpuset to satisfy additional memory
 requests.
 This enables batch managers monitoring jobs running in dedicated
 cpusets to efficiently detect what level of memory pressure that job
 is causing.
 This is useful both on tightly managed systems running a wide mix of
 submitted jobs, which may choose to terminate or re-prioritize jobs that
 are trying to use more memory than allowed on the nodes assigned them,
 and with tightly coupled, long running, massively parallel scientific
 computing jobs that will dramatically fail to meet required performance
 goals if they start to use more memory than allowed to them.
 This mechanism provides a very economical way for the batch manager
 to monitor a cpuset for signs of memory pressure.  It's up to the
 batch manager or other user code to decide what to do about it and
 take action.
 ==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero.  So only
    systems that enable this feature will compute the metric.
 Why a per-cpuset, running average:
    Because this meter is per-cpuset, rather than per-task or mm,
    the system load imposed by a batch scheduler monitoring this
    metric is sharply reduced on large systems, because a scan of
    the tasklist can be avoided on each set of queries.
    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a
    single read, instead of having to read and accumulate results
    for a period of time.
    Because this meter is per-cpuset rather than per-task or mm,
    the batch scheduler can obtain the key information, memory
    pressure in a cpuset, with a single read, rather than having to
    query and accumulate results over all the (dynamically changing)
    set of tasks in the cpuset.
 A per-cpuset simple digital filter (requires a spinlock and 3 words
 of data per-cpuset) is kept, and updated by any task attached to that
 cpuset, if it enters the synchronous (direct) page reclaim code.
 A per-cpuset file provides an integer number representing the recent
 (half-life of 10 seconds) rate of direct page reclaims caused by
 the tasks in the cpuset, in units of reclaims attempted per second,
 times 1000.
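A decaying rate meter of the kind described above can be modelled in userspace with integer arithmetic. The sketch below is an illustration under stated assumptions, not the kernel's implementation: all names are hypothetical, time is counted in whole seconds, locking is omitted, and the per-second decay coefficient 933/1000 is chosen only because 0.933^10 is roughly 0.5, which gives the 10 second half-life.

```c
#include <assert.h>

#define METER_SCALE 1000  /* fixed-point scale, also the "times 1000" factor */
#define METER_COEF   933  /* per-second decay; 0.933^10 is roughly 0.5 */

/* Hypothetical meter state: three words of data, as the text describes. */
struct pressure_meter {
    long val;   /* decayed rate, in events per second times 1000 */
    long cnt;   /* events since the last update, pre-scaled by METER_SCALE */
    long last;  /* time of the last update, in whole seconds */
};

/* Decay the meter for the elapsed seconds, then fold in pending events. */
static void meter_update(struct pressure_meter *m, long now)
{
    long ticks = now - m->last;

    assert(ticks >= 0);
    if (ticks == 0)
        return;
    m->last = now;

    /* One multiply per elapsed second; a real meter would cap this loop. */
    while (ticks-- > 0)
        m->val = (METER_COEF * m->val) / METER_SCALE;

    m->val += ((METER_SCALE - METER_COEF) * m->cnt) / METER_SCALE;
    m->cnt = 0;
}

/* Record one event, for example one direct page reclaim attempt. */
static void meter_event(struct pressure_meter *m, long now)
{
    meter_update(m, now);
    m->cnt += METER_SCALE;
}

/* Return the current decayed rate, in events per second times 1000. */
static long meter_read(struct pressure_meter *m, long now)
{
    meter_update(m, now);
    return m->val;
}
```

Because the value is a running average, a monitor needs only one call to meter_read() to see recent pressure; with no new events, the reading falls to about half after 10 seconds.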
 1.7 How do I use cpusets ?
 --------------------------
 In order to minimize the impact of cpusets on critical kernel
@@ -277,6 +364,30 @@ rewritten to the 'tasks' file of its cpuset.  This is done to avoid
 impacting the scheduler code in the kernel with a check for changes
 in a tasks processor placement.
 Normally, once a page is allocated (given a physical page
 of main memory) then that page stays on whatever node it
 was allocated on, so long as it remains allocated, even if the
 cpuset's memory placement policy 'mems' subsequently changes.
 If the cpuset flag file 'memory_migrate' is set true, then when
 tasks are attached to that cpuset, any pages that task had
 allocated to it on nodes in its previous cpuset are migrated
 to the task's new cpuset.  Depending on the implementation,
 this migration may either be done by swapping the page out,
 so that the next time the page is referenced, it will be paged
 into the task's new cpuset, usually on the node where it was
 referenced, or this migration may be done by directly copying
 the pages from the task's previous cpuset to the new cpuset,
 where possible to the same node, relative to the new cpuset,
 as the node that held the page, relative to the old cpuset.
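The idea of keeping a page on the same node relative to its cpuset can be sketched as a position-preserving mapping between two node masks. This is a hypothetical userspace model, not the kernel's migration code: the function names are invented, node sets are 64-bit masks (bit n stands for node n), and wrapping around when the new set has fewer nodes than the old one is an assumption.

```c
#include <stdint.h>

/* Count the set bits in a mask. */
static int mask_weight(uint64_t mask)
{
    int n = 0;

    while (mask) {
        mask &= mask - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Hypothetical helper: map old_node, a member of old_mems, to the node
 * at the same relative position in new_mems.  Returns -1 if old_node is
 * not in old_mems or if new_mems is empty.
 */
static int remap_node(int old_node, uint64_t old_mems, uint64_t new_mems)
{
    int pos, node;

    if (old_node < 0 || old_node > 63 || new_mems == 0)
        return -1;
    if (!(old_mems & (UINT64_C(1) << old_node)))
        return -1;

    /* Relative position: number of old_mems nodes below old_node. */
    pos = mask_weight(old_mems & ((UINT64_C(1) << old_node) - 1));

    /* Assumption: wrap around if the new set has fewer nodes. */
    pos %= mask_weight(new_mems);

    for (node = 0; node < 64; node++) {
        if (new_mems & (UINT64_C(1) << node)) {
            if (pos == 0)
                return node;
            pos--;
        }
    }
    return -1;  /* not reached: new_mems is non-empty */
}
```

In this model, a page on node 3 of an old cpuset with nodes 2-3 maps to node 5 of a new cpuset with nodes 4-5, since both are the second node of their set.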
 Also if 'memory_migrate' is set true, then if that cpuset's
 'mems' file is modified, pages allocated to tasks in that
 cpuset, that were on nodes in the previous setting of 'mems',
 will be moved to nodes in the new setting of 'mems'.  Again,
 depending on the implementation, this might be done by swapping,
 or by direct copying.  In either case, pages that were not in
 the task's prior cpuset, or in the cpuset's prior 'mems' setting,
 will not be moved.
 There is an exception to the above.  If hotplug functionality is used
 to remove all the CPUs that are currently assigned to a cpuset,
 then the kernel will automatically update the cpus_allowed of all