Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6

2026-05-01 15:00:59 -07:00 · 2006-06-26 16:35:44 +01:00
parent 17ffc7ba6d fcc18e83e1
commit 62ed948cb1
2601 changed files with 123934 additions and 101120 deletions
@@ -1573,12 +1573,8 @@ S: 160 00 Praha 6
 S: Czech Republic

 N: Niels Kristian Bech Jensen
-E: nkbj@image.dk
-W: http://www.image.dk/~nkbj
+E: nkbj1970@hotmail.com
 D: Miscellaneous kernel updates and fixes.
-S: Dr. Holsts Vej 34, lejl. 164
-S: DK-8230 Åbyhøj
-S: Denmark

 N: Michael K. Johnson
 E: johnsonm@redhat.com
@@ -0,0 +1,77 @@
+This directory attempts to document the ABI between the Linux kernel and
+userspace, and the relative stability of these interfaces.  Due to the
+ever-changing nature of Linux, and the differing maturity levels, these
+interfaces should be used by userspace programs in different ways.
+
+We have four different levels of ABI stability, as shown by the four
+different subdirectories in this location.  Interfaces may change levels
+of stability according to the rules described below.
+
+The different levels of stability are:
+
+  stable/
+	This directory documents the interfaces that the developer has
+	defined to be stable.  Userspace programs are free to use these
+	interfaces with no restrictions, and backward compatibility for
+	them will be guaranteed for at least 2 years.  Most interfaces
+	(like syscalls) are expected to never change and always be
+	available.
+
+  testing/
+	This directory documents interfaces that are felt to be stable,
+	as the main development of this interface has been completed.
+	The interface can be changed to add new features, but the
+	current interface will not break by doing this, unless grave
+	errors or security problems are found in them.  Userspace
+	programs can start to rely on these interfaces, but they must be
+	aware of changes that can occur before these interfaces move to
+	be marked stable.  Programs that use these interfaces are
+	strongly encouraged to add their name to the description of
+	these interfaces, so that the kernel developers can easily
+	notify them if any changes occur (see the description of the
+	layout of the files below for details on how to do this.)
+
+  obsolete/
+  	This directory documents interfaces that are still remaining in
+	the kernel, but are marked to be removed at some later point in
+	time.  The description of the interface will document the reason
+	why it is obsolete and when it can be expected to be removed.
+	The file Documentation/feature-removal-schedule.txt may describe
+	some of these interfaces, giving a schedule for when they will
+	be removed.
+
+  removed/
+	This directory contains a list of the old interfaces that have
+	been removed from the kernel.
+
+Every file in these directories will contain the following information:
+
+What:		Short description of the interface
+Date:		Date created
+KernelVersion:	Kernel version this feature first showed up in.
+Contact:	Primary contact for this interface (may be a mailing list)
+Description:	Long description of the interface and how to use it.
+Users:		All users of this interface who wish to be notified when
+		it changes.  This is very important for interfaces in
+		the "testing" stage, so that kernel developers can work
+		with userspace developers to ensure that things do not
+		break in ways that are unacceptable.  It is also
+		important to get feedback for these interfaces to make
+		sure they are working in a proper way and do not need to
+		be changed further.
+
+
+How things move between levels:
+
+Interfaces in stable may move to obsolete, as long as the proper
+notification is given.
+
+Interfaces may be removed from obsolete and the kernel as long as the
+documented amount of time has gone by.
+
+Interfaces in the testing state can move to the stable state when the
+developers feel they are finished.  They cannot be removed from the
+kernel tree without going through the obsolete state first.
+
+It's up to the developer to place their interfaces in the category they
+wish them to start out in.
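The field layout described above is regular enough to read mechanically. The following C sketch is not part of the kernel tree; the name abi_parse_field() and its buffer-based signature are assumptions made for illustration. It treats a line that starts in column 0 with a single word followed by a colon as a field tag, and leaves indented continuation lines to the caller.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Split one line of an ABI description file into a field name and value.
 * A field line starts in column 0 with a tag such as "What:" or "Users:";
 * indented lines continue the previous field and are not matched here.
 * Returns 1 and fills key/value on a field line, 0 otherwise.
 * (Hypothetical helper, for illustration only.)
 */
int abi_parse_field(const char *line, char *key, size_t keysz,
		    char *value, size_t valsz)
{
	const char *colon;
	size_t klen, vlen, i;

	assert(line && key && value);

	if (keysz == 0 || valsz == 0)
		return 0;
	if (!isalpha((unsigned char)line[0]))
		return 0;		/* blank or continuation line */

	colon = strchr(line, ':');
	if (!colon)
		return 0;
	klen = (size_t)(colon - line);
	if (klen >= keysz)
		return 0;
	/* The tag must be a single word: nothing but letters and digits. */
	for (i = 0; i < klen; i++)
		if (!isalnum((unsigned char)line[i]))
			return 0;
	memcpy(key, line, klen);
	key[klen] = '\0';

	/* Skip the tabs or spaces that separate the tag from its value. */
	colon++;
	while (*colon == ' ' || *colon == '\t')
		colon++;
	vlen = strcspn(colon, "\r\n");
	if (vlen >= valsz)
		vlen = valsz - 1;	/* truncate rather than overflow */
	memcpy(value, colon, vlen);
	value[vlen] = '\0';
	return 1;
}
```

A caller would feed the file to this function line by line and append any line that returns 0 to the most recently seen field, which is how the multi-line Description: and Users: bodies would be collected.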
@@ -0,0 +1,13 @@
+What:		devfs
+Date:		July 2005
+Contact:	Greg Kroah-Hartman <gregkh@suse.de>
+Description:
+	devfs has been unmaintained for a number of years, has unfixable
+	races, contains a naming policy within the kernel that is
+	against the LSB, and can be replaced by using udev.
+	The files fs/devfs/*, include/linux/devfs_fs*.h will be removed,
+	along with the assorted devfs function calls throughout the
+	kernel tree.
+
+Users:
+
@@ -0,0 +1,10 @@
+What:		The kernel syscall interface
+Description:
+	This interface matches much of the POSIX interface and is based
+	on it and other Unix based interfaces.  It will only be added to
+	over time, and not have things removed from it.
+
+	Note that this interface is different for every architecture
+	that Linux supports.  Please see the architecture-specific
+	documentation for details on the syscall numbers that are to be
+	mapped to each syscall.
@@ -0,0 +1,30 @@
+What:		/sys/module
+Description:
+	The /sys/module tree consists of the following structure:
+
+	/sys/module/MODULENAME
+		The name of the module that is in the kernel.  This
+		module name will show up either if the module is built
+		directly into the kernel, or if it is loaded as a
+		dynamic module.
+
+	/sys/module/MODULENAME/parameters
+		This directory contains individual files that are each
+		individual parameters of the module that are able to be
+		changed at runtime.  See the individual module
+		documentation as to the contents of these parameters and
+		what they accomplish.
+
+		Note: The individual parameter names and values are not
+		considered stable, only the fact that they will be
+		placed in this location within sysfs.  See the
+		individual driver documentation for details as to the
+		stability of the different parameters.
+
+	/sys/module/MODULENAME/refcnt
+		If the module is able to be unloaded from the kernel, this file
+		will contain the current reference count of the module.
+
+		Note: If the module is built into the kernel, or if the
+		CONFIG_MODULE_UNLOAD kernel configuration value is not enabled,
+		this file will not be present.
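Because the refcnt file may legitimately be absent, a userspace reader has to treat "no such file" as an expected outcome rather than an error. The sketch below shows one way to do that in C. The function name module_refcnt_read() and its return convention are assumptions for illustration, and the file is assumed to hold a single decimal number.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Read a module reference count from a sysfs-style file.
 * Returns 0 and stores the count in *count on success, -ENOENT when the
 * file does not exist (built-in module, or CONFIG_MODULE_UNLOAD disabled),
 * and another negative errno value on any other failure.
 * (Hypothetical helper, for illustration only.)
 */
int module_refcnt_read(const char *path, long *count)
{
	FILE *f;
	int matched;

	assert(path && count);

	f = fopen(path, "r");
	if (!f)
		return errno ? -errno : -EIO;

	/* Assumption: the file holds one decimal integer. */
	matched = fscanf(f, "%ld", count);
	fclose(f);
	return matched == 1 ? 0 : -EINVAL;
}
```

A caller would pass a path of the form /sys/module/MODULENAME/refcnt and treat a return value of -ENOENT as "this module cannot be unloaded".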
@@ -0,0 +1,16 @@
+What:		/sys/class/
+Date:		February 2006
+Contact:	Greg Kroah-Hartman <gregkh@suse.de>
+Description:
+		The /sys/class directory will consist of a group of
+		subdirectories describing individual classes of devices
+		in the kernel.  The individual directories will consist
+		of either subdirectories, or symlinks to other
+		directories.
+
+		All programs that use this directory tree must be able
+		to handle both subdirectories and symlinks in order to
+		work properly.
+
+Users:
+	udev <linux-hotplug-devel@lists.sourceforge.net>
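The requirement to cope with both subdirectories and symlinks can be met with lstat() and stat() from POSIX. The following C sketch is only an illustration: the enum and the function names class_entry_classify() and class_entry_target() are made up here and do not come from udev or the kernel.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

enum class_entry_kind {
	ENTRY_OTHER,	/* missing, or neither a directory nor a link to one */
	ENTRY_DIR,	/* a real subdirectory */
	ENTRY_LINK,	/* a symlink that resolves to a directory */
};

/* Decide how an entry below /sys/class has to be traversed. */
enum class_entry_kind class_entry_classify(const char *path)
{
	struct stat link_st, target_st;

	assert(path);

	if (lstat(path, &link_st) != 0)		/* does not follow symlinks */
		return ENTRY_OTHER;
	if (S_ISDIR(link_st.st_mode))
		return ENTRY_DIR;
	if (S_ISLNK(link_st.st_mode) &&
	    stat(path, &target_st) == 0 &&	/* follows the symlink */
	    S_ISDIR(target_st.st_mode))
		return ENTRY_LINK;
	return ENTRY_OTHER;
}

/*
 * Copy the target of a symlink into buf.
 * Returns 0 on success, -1 on failure or if the target may be truncated.
 */
int class_entry_target(const char *path, char *buf, size_t size)
{
	ssize_t len;

	assert(path && buf && size > 0);

	len = readlink(path, buf, size - 1);	/* no NUL terminator is written */
	if (len < 0 || (size_t)len >= size - 1)
		return -1;
	buf[len] = '\0';
	return 0;
}
```

Code written this way keeps working whether a given class entry is a directory or a symlink, since both cases end in something that can be opened as a directory.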
@@ -0,0 +1,25 @@
+What:		/sys/devices
+Date:		February 2006
+Contact:	Greg Kroah-Hartman <gregkh@suse.de>
+Description:
+		The /sys/devices tree contains a snapshot of the
+		internal state of the kernel device tree.  Devices will
+		be added and removed dynamically as the machine runs,
+		and between different kernel versions, the layout of the
+		devices within this tree will change.
+
+		Please do not rely on the format of this tree because of
+		this.  If a program wishes to find different things in
+		the tree, please use the /sys/class structure and rely
+		on the symlinks there to point to the proper location
+		within the /sys/devices tree of the individual devices.
+		Or rely on the uevent messages to notify programs of
+		devices being added and removed from this tree to find
+		the location of those devices.
+
+		Note that sometimes not all devices along the directory
+		chain will have emitted uevent messages, so userspace
+		programs must be able to handle such occurrences.
+
+Users:
+	udev <linux-hotplug-devel@lists.sourceforge.net>
@@ -155,7 +155,83 @@ problem, which is called the function-growth-hormone-imbalance syndrome.
 See next chapter.


-		Chapter 5: Functions
+		Chapter 5: Typedefs
+
+Please don't use things like "vps_t".
+
+It's a _mistake_ to use typedef for structures and pointers. When you see a
+
+	vps_t a;
+
+in the source, what does it mean?
+
+In contrast, if it says
+
+	struct virtual_container *a;
+
+you can actually tell what "a" is.
+
+Lots of people think that typedefs "help readability". Not so. They are
+useful only for:
+
+ (a) totally opaque objects (where the typedef is actively used to _hide_
+     what the object is).
+
+     Example: "pte_t" etc. opaque objects that you can only access using
+     the proper accessor functions.
+
+     NOTE! Opaqueness and "accessor functions" are not good in themselves.
+     The reason we have them for things like pte_t etc. is that there
+     really is absolutely _zero_ portably accessible information there.
+
+ (b) Clear integer types, where the abstraction _helps_ avoid confusion
+     whether it is "int" or "long".
+
+     u8/u16/u32 are perfectly fine typedefs, although they fit into
+     category (d) better than here.
+
+     NOTE! Again - there needs to be a _reason_ for this. If something is
+     "unsigned long", then there's no reason to do
+
+	typedef unsigned long myflags_t;
+
+     but if there is a clear reason for why it under certain circumstances
+     might be an "unsigned int" and under other configurations might be
+     "unsigned long", then by all means go ahead and use a typedef.
+
+ (c) when you use sparse to literally create a _new_ type for
+     type-checking.
+
+ (d) New types which are identical to standard C99 types, in certain
+     exceptional circumstances.
+
+     Although it would only take a short amount of time for the eyes and
+     brain to become accustomed to the standard types like 'uint32_t',
+     some people object to their use anyway.
+
+     Therefore, the Linux-specific 'u8/u16/u32/u64' types and their
+     signed equivalents which are identical to standard types are
+     permitted -- although they are not mandatory in new code of your
+     own.
+
+     When editing existing code which already uses one or the other set
+     of types, you should conform to the existing choices in that code.
+
+ (e) Types safe for use in userspace.
+
+     In certain structures which are visible to userspace, we cannot
+     require C99 types and cannot use the 'u32' form above. Thus, we
+     use __u32 and similar types in all structures which are shared
+     with userspace.
+
+Maybe there are other cases too, but the rule should basically be to NEVER
+EVER use a typedef unless you can clearly match one of those rules.
+
+In general, a pointer, or a struct that has elements that can reasonably
+be directly accessed should _never_ be a typedef.
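As a rough illustration of the rules above, the C sketch below contrasts a struct whose members are meant to be accessed directly with an opaque object of the kind case (a) permits. The type handle_t and its accessor functions are hypothetical names invented for this example; only struct virtual_container is borrowed from the text, and its members are made up.

```c
#include <assert.h>

/* Members are meant to be read directly, so this stays a plain struct. */
struct virtual_container {
	int id;
	int nr_tasks;
};

/*
 * Case (a): a totally opaque object.  The typedef is used on purpose to
 * hide the representation, which is only touched through the accessors
 * below.  (handle_t is a hypothetical type, for illustration only.)
 */
typedef struct {
	unsigned long raw;
} handle_t;

/* Pack an index and four flag bits into one opaque value. */
handle_t handle_make(unsigned int index, unsigned int flags)
{
	handle_t h;

	assert(index < (1u << 28));	/* keeps the value within 32 bits */
	assert(flags < 16);

	h.raw = ((unsigned long)index << 4) | flags;
	return h;
}

unsigned int handle_index(handle_t h)
{
	return (unsigned int)(h.raw >> 4);
}

unsigned int handle_flags(handle_t h)
{
	return (unsigned int)(h.raw & 0xfu);
}
```

A reader who sees "struct virtual_container *vc" knows at once that vc->nr_tasks is fair game, while a handle_t gives nothing away and forces callers through handle_index() and handle_flags().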
+
+
+		Chapter 6: Functions

 Functions should be short and sweet, and do just one thing.  They should
 fit on one or two screenfuls of text (the ISO/ANSI screen size is 80x24,
@@ -183,7 +259,7 @@ and it gets confused.  You know you're brilliant, but maybe you'd like
 to understand what you did 2 weeks from now.


-		Chapter 6: Centralized exiting of functions
+		Chapter 7: Centralized exiting of functions

 Albeit deprecated by some people, the equivalent of the goto statement is
 used frequently by compilers in form of the unconditional jump instruction.
@@ -220,7 +296,7 @@ out:
 	return result;
 }

-		Chapter 7: Commenting
+		Chapter 8: Commenting

 Comments are good, but there is also a danger of over-commenting.  NEVER
 try to explain HOW your code works in a comment: it's much better to
@@ -240,7 +316,7 @@ When commenting the kernel API functions, please use the kerneldoc format.
 See the files Documentation/kernel-doc-nano-HOWTO.txt and scripts/kernel-doc
 for details.

-		Chapter 8: You've made a mess of it
+		Chapter 9: You've made a mess of it

 That's OK, we all do.  You've probably been told by your long-time Unix
 user helper that "GNU emacs" automatically formats the C sources for
@@ -288,7 +364,7 @@ re-formatting you may want to take a look at the man page.  But
 remember: "indent" is not a fix for bad programming.


-		Chapter 9: Configuration-files
+		Chapter 10: Configuration-files

 For configuration options (arch/xxx/Kconfig, and all the Kconfig files),
 somewhat different indentation is used.
@@ -313,7 +389,7 @@ support for file-systems, for instance) should be denoted (DANGEROUS), other
 experimental options should be denoted (EXPERIMENTAL).


-		Chapter 10: Data structures
+		Chapter 11: Data structures

 Data structures that have visibility outside the single-threaded
 environment they are created and destroyed in should always have
@@ -344,7 +420,7 @@ Remember: if another thread can find your data structure, and you don't
 have a reference count on it, you almost certainly have a bug.


-		Chapter 11: Macros, Enums and RTL
+		Chapter 12: Macros, Enums and RTL

 Names of macros defining constants and labels in enums are capitalized.

@@ -399,7 +475,7 @@ The cpp manual deals with macros exhaustively. The gcc internals manual also
 covers RTL which is used frequently with assembly language in the kernel.


-		Chapter 12: Printing kernel messages
+		Chapter 13: Printing kernel messages

 Kernel developers like to be seen as literate. Do mind the spelling
 of kernel messages to make a good impression. Do not use crippled
@@ -410,7 +486,7 @@ Kernel messages do not have to be terminated with a period.
 Printing numbers in parentheses (%d) adds no value and should be avoided.


-		Chapter 13: Allocating memory
+		Chapter 14: Allocating memory

 The kernel provides the following general purpose memory allocators:
 kmalloc(), kzalloc(), kcalloc(), and vmalloc().  Please refer to the API
@@ -429,7 +505,7 @@ from void pointer to any other pointer type is guaranteed by the C programming
 language.


-		Chapter 14: The inline disease
+		Chapter 15: The inline disease

 There appears to be a common misperception that gcc has a magic "make me
 faster" speedup option called "inline". While the use of inlines can be
@@ -457,7 +533,7 @@ something it would have done anyway.



-		Chapter 15: References
+		Appendix I: References

 The C Programming Language, Second Edition
 by Brian W. Kernighan and Dennis M. Ritchie.
@@ -481,4 +557,4 @@ Kernel CodingStyle, by greg@kroah.com at OLS 2002:
 http://www.kroah.com/linux/talks/ols_2002_kernel_codingstyle_talk/html/

 --
-Last updated on 30 December 2005 by a community effort on LKML.
+Last updated on 30 April 2006.
@@ -62,6 +62,8 @@
     <sect1><title>Internal Functions</title>
 !Ikernel/exit.c
 !Ikernel/signal.c
+!Iinclude/linux/kthread.h
+!Ekernel/kthread.c
     </sect1>

     <sect1><title>Kernel objects manipulation</title>
@@ -114,9 +116,33 @@ X!Ilib/string.c
     </sect1>
  </chapter>

+  <chapter id="kernel-lib">
+     <title>Basic Kernel Library Functions</title>
+
+     <para>
+       The Linux kernel provides more basic utility functions.
+     </para>
+
+     <sect1><title>Bitmap Operations</title>
+!Elib/bitmap.c
+!Ilib/bitmap.c
+     </sect1>
+
+     <sect1><title>Command-line Parsing</title>
+!Elib/cmdline.c
+     </sect1>
+
+     <sect1><title>CRC Functions</title>
+!Elib/crc16.c
+!Elib/crc32.c
+!Elib/crc-ccitt.c
+     </sect1>
+  </chapter>
+
  <chapter id="mm">
     <title>Memory Management in Linux</title>
     <sect1><title>The Slab Cache</title>
+!Iinclude/linux/slab.h
 !Emm/slab.c
     </sect1>
     <sect1><title>User Space Memory Access</title>
@@ -280,12 +306,13 @@ X!Ekernel/module.c
     <sect1><title>MTRR Handling</title>
 !Earch/i386/kernel/cpu/mtrr/main.c
     </sect1>
+
     <sect1><title>PCI Support Library</title>
 !Edrivers/pci/pci.c
 !Edrivers/pci/pci-driver.c
 !Edrivers/pci/remove.c
 !Edrivers/pci/pci-acpi.c
-<!-- kerneldoc does not understand to __devinit
+<!-- kerneldoc does not understand __devinit
 X!Edrivers/pci/search.c
 -->
 !Edrivers/pci/msi.c
@@ -314,6 +341,13 @@ X!Earch/i386/kernel/mca.c
     </sect1>
  </chapter>

+  <chapter id="firmware">
+     <title>Firmware Interfaces</title>
+     <sect1><title>DMI Interfaces</title>
+!Edrivers/firmware/dmi_scan.c
+     </sect1>
+  </chapter>
+
  <chapter id="devfs">
     <title>The Device File System</title>
 !Efs/devfs/base.c
@@ -331,6 +365,18 @@ X!Earch/i386/kernel/mca.c
 !Esecurity/security.c
  </chapter>

+  <chapter id="audit">
+     <title>Audit Interfaces</title>
+!Ekernel/audit.c
+!Ikernel/auditsc.c
+!Ikernel/auditfilter.c
+  </chapter>
+
+  <chapter id="accounting">
+     <title>Accounting Framework</title>
+!Ikernel/acct.c
+  </chapter>
+
  <chapter id="pmfuncs">
     <title>Power Management</title>
 !Ekernel/power/pm.c
@@ -390,7 +436,6 @@ X!Edrivers/pnp/system.c
     </sect1>
  </chapter>

-
  <chapter id="blkdev">
     <title>Block Devices</title>
 !Eblock/ll_rw_blk.c
@@ -401,6 +446,14 @@ X!Edrivers/pnp/system.c
 !Edrivers/char/misc.c
  </chapter>

+  <chapter id="parportdev">
+     <title>Parallel Port Devices</title>
+!Iinclude/linux/parport.h
+!Edrivers/parport/ieee1284.c
+!Edrivers/parport/share.c
+!Idrivers/parport/daisy.c
+  </chapter>
+
  <chapter id="viddev">
     <title>Video4Linux</title>
 !Edrivers/media/video/videodev.c
@@ -169,6 +169,22 @@ void (*tf_read) (struct ata_port *ap, struct ata_taskfile *tf);

 	</sect2>

+	<sect2><title>PIO data read/write</title>
+	<programlisting>
+void (*data_xfer) (struct ata_device *, unsigned char *, unsigned int, int);
+	</programlisting>
+
+	<para>
+All bmdma-style drivers must implement this hook.  This is the low-level
+operation that actually copies the data bytes during a PIO data
+transfer.
+Typically the driver
+will choose one of ata_pio_data_xfer_noirq(), ata_pio_data_xfer(), or
+ata_mmio_data_xfer().
+	</para>
+
+	</sect2>
+
 	<sect2><title>ATA command execute</title>
 	<programlisting>
 void (*exec_command)(struct ata_port *ap, struct ata_taskfile *tf);
@@ -204,11 +220,10 @@ command.
 	<programlisting>
 u8   (*check_status)(struct ata_port *ap);
 u8   (*check_altstatus)(struct ata_port *ap);
-u8   (*check_err)(struct ata_port *ap);
 	</programlisting>

 	<para>
-	Reads the Status/AltStatus/Error ATA shadow register from
+	Reads the Status/AltStatus ATA shadow register from
 	hardware.  On some hardware, reading the Status register has
 	the side effect of clearing the interrupt condition.
 	Most drivers for taskfile-based hardware use
@@ -269,23 +284,6 @@ void (*set_mode) (struct ata_port *ap);

 	</sect2>

-	<sect2><title>Reset ATA bus</title>
-	<programlisting>
-void (*phy_reset) (struct ata_port *ap);
-	</programlisting>
-
-	<para>
-	The very first step in the probe phase.  Actions vary depending
-	on the bus type, typically.  After waking up the device and probing
-	for device presence (PATA and SATA), typically a soft reset
-	(SRST) will be performed.  Drivers typically use the helper
-	functions ata_bus_reset() or sata_phy_reset() for this hook.
-	Many SATA drivers use sata_phy_reset() or call it from within
-	their own phy_reset() functions.
-	</para>
-
-	</sect2>
-
 	<sect2><title>Control PCI IDE BMDMA engine</title>
 	<programlisting>
 void (*bmdma_setup) (struct ata_queued_cmd *qc);
@@ -354,16 +352,74 @@ int (*qc_issue) (struct ata_queued_cmd *qc);

 	</sect2>

-	<sect2><title>Timeout (error) handling</title>
+	<sect2><title>Exception and probe handling (EH)</title>
 	<programlisting>
 void (*eng_timeout) (struct ata_port *ap);
+void (*phy_reset) (struct ata_port *ap);
 	</programlisting>

 	<para>
-This is a high level error handling function, called from the
-error handling thread, when a command times out.  Most newer
-hardware will implement its own error handling code here.  IDE BMDMA
-drivers may use the helper function ata_eng_timeout().
+Deprecated.  Use ->error_handler() instead.
+	</para>
+
+	<programlisting>
+void (*freeze) (struct ata_port *ap);
+void (*thaw) (struct ata_port *ap);
+	</programlisting>
+
+	<para>
+ata_port_freeze() is called when HSM violations or some other
+condition disrupts normal operation of the port.  A frozen port
+is not allowed to perform any operation until the port is
+thawed, which usually follows a successful reset.
+	</para>
+
+	<para>
+The optional ->freeze() callback can be used for freezing the port
+hardware-wise (e.g. mask interrupt and stop DMA engine).  If a
+port cannot be frozen hardware-wise, the interrupt handler
+must ack and clear interrupts unconditionally while the port
+is frozen.
+	</para>
+	<para>
+The optional ->thaw() callback is called to perform the opposite of ->freeze():
+prepare the port for normal operation once again.  Unmask interrupts,
+start DMA engine, etc.
+	</para>
+
+	<programlisting>
+void (*error_handler) (struct ata_port *ap);
+	</programlisting>
+
+	<para>
+->error_handler() is a driver's hook into probe, hotplug, recovery,
+and other exceptional conditions.  The primary responsibility of an
+implementation is to call ata_do_eh() or ata_bmdma_drive_eh() with a set
+of EH hooks as arguments:
+	</para>
+
+	<para>
+'prereset' hook (may be NULL) is called during an EH reset, before any other actions
+are taken.
+	</para>
+
+	<para>
+'postreset' hook (may be NULL) is called after the EH reset is performed.
+	</para>
+
+	<para>
+Based on existing conditions, severity of the problem, and hardware
+capabilities, either 'softreset' (may be NULL) or 'hardreset' (may be NULL)
+will be called to perform the low-level EH reset.
+	</para>
+
+	<programlisting>
+void (*post_internal_cmd) (struct ata_queued_cmd *qc);
+	</programlisting>
+
+	<para>
+Perform any hardware-specific actions necessary to finish processing
+after executing a probe-time or EH-time command via ata_exec_internal().
 	</para>

 	</sect2>
@@ -144,9 +144,47 @@ over a rather long period of time, but improvements are always welcome!
 	whether the increased speed is worth it.

 8.	Although synchronize_rcu() is a bit slower than is call_rcu(),
-	it usually results in simpler code.  So, unless update performance
-	is important or the updaters cannot block, synchronize_rcu()
-	should be used in preference to call_rcu().
+	it usually results in simpler code.  So, unless update
+	performance is critically important or the updaters cannot block,
+	synchronize_rcu() should be used in preference to call_rcu().
+
+	An especially important property of the synchronize_rcu()
+	primitive is that it automatically self-limits: if grace periods
+	are delayed for whatever reason, then the synchronize_rcu()
+	primitive will correspondingly delay updates.  In contrast,
+	code using call_rcu() should explicitly limit update rate in
+	cases where grace periods are delayed, as failing to do so can
+	result in excessive realtime latencies or even OOM conditions.
+
+	Ways of gaining this self-limiting property when using call_rcu()
+	include:
+
+	a.	Keeping a count of the number of data-structure elements
+		used by the RCU-protected data structure, including those
+		waiting for a grace period to elapse.  Enforce a limit
+		on this number, stalling updates as needed to allow
+		previously deferred frees to complete.
+
+		Alternatively, limit only the number awaiting deferred
+		free rather than the total number of elements.
+
+	b.	Limiting update rate.  For example, if updates occur only
+		once per hour, then no explicit rate limiting is required,
+		unless your system is already badly broken.  The dcache
+		subsystem takes this approach -- updates are guarded
+		by a global lock, limiting their rate.
+
+	c.	Trusted update -- if updates can only be done manually by
+		superuser or some other trusted user, then it might not
+		be necessary to automatically limit them.  The theory
+		here is that superuser already has lots of ways to crash
+		the machine.
+
+	d.	Use call_rcu_bh() rather than call_rcu(), in order to take
+		advantage of call_rcu_bh()'s faster grace periods.
+
+	e.	Periodically invoke synchronize_rcu(), permitting a limited
+		number of updates per grace period.

 9.	All RCU list-traversal primitives, which include
 	list_for_each_rcu(), list_for_each_entry_rcu(),
@@ -184,7 +184,17 @@ synchronize_rcu()
 	blocking, it registers a function and argument which are invoked
 	after all ongoing RCU read-side critical sections have completed.
 	This callback variant is particularly useful in situations where
-	it is illegal to block.
+	it is illegal to block or where update-side performance is
+	critically important.
+
+	However, the call_rcu() API should not be used lightly, as use
+	of the synchronize_rcu() API generally results in simpler code.
+	In addition, the synchronize_rcu() API has the nice property
+	of automatically limiting update rate should grace periods
+	be delayed.  This property results in system resilience in face
+	of denial-of-service attacks.  Code using call_rcu() should limit
+	update rate in order to gain this same sort of resilience.  See
+	checklist.txt for some approaches to limiting the update rate.
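One way to limit the update rate is to cap the number of deferred frees in flight. The C sketch below is a single-threaded userspace model of that idea, not kernel code: it calls no real RCU primitive, and MAX_PENDING, struct deferred_frees and both functions are hypothetical names. The point where the model stalls stands in for the place where a kernel updater would block until a grace period has elapsed.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 4	/* hypothetical cap on deferred frees in flight */

struct deferred_frees {
	size_t pending;	/* callbacks queued, grace period not yet over */
	size_t stalls;	/* number of times an update had to wait */
};

/* Stand-in for the end of a grace period: every queued callback runs. */
void grace_period_elapsed(struct deferred_frees *df)
{
	assert(df);
	df->pending = 0;
}

/*
 * Queue one deferred free, as an updater would with call_rcu().
 * When the cap is reached, stall until a grace period has elapsed
 * before queueing, so that the backlog cannot grow without bound.
 */
void queue_deferred_free(struct deferred_frees *df)
{
	assert(df);

	if (df->pending >= MAX_PENDING) {
		df->stalls++;
		grace_period_elapsed(df);	/* models blocking here */
	}
	df->pending++;
}
```

With a cap of 4, ten back-to-back updates in this model stall twice and never hold more than four elements awaiting a deferred free.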

 rcu_assign_pointer()

@@ -790,7 +800,6 @@ RCU pointer update:

 RCU grace period:

-	synchronize_kernel (deprecated)
 	synchronize_net
 	synchronize_sched
 	synchronize_rcu
@@ -0,0 +1,57 @@
+Linux Kernel patch submittal checklist
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here are some basic things that developers should do if they
+want to see their kernel patch submittals accepted quicker.
+
+These are all above and beyond the documentation that is provided
+in Documentation/SubmittingPatches and elsewhere about submitting
+Linux kernel patches.
+
+
+
+- Builds cleanly with applicable or modified CONFIG options =y, =m, and =n.
+  No gcc warnings/errors, no linker warnings/errors.
+
+- Passes allnoconfig, allmodconfig
+
+- Builds on multiple CPU arch-es by using local cross-compile tools
+  or something like PLM at OSDL.
+
+- ppc64 is a good architecture for cross-compilation checking because it
+  tends to use `unsigned long' for 64-bit quantities.
+
+- Matches kernel coding style(!)
+
+- Any new or modified CONFIG options don't muck up the config menu.
+
+- All new Kconfig options have help text.
+
+- Has been carefully reviewed with respect to relevant Kconfig
+  combinations.  This is very hard to get right with testing --
+  brainpower pays off here.
+
+- Check cleanly with sparse.
+
+- Use 'make checkstack' and 'make namespacecheck' and fix any
+  problems that they find.  Note:  checkstack does not point out
+  problems explicitly, but any one function that uses more than
+  512 bytes on the stack is a candidate for change.
+
+- Include kernel-doc to document global kernel APIs.  (Not required
+  for static functions, but OK there also.)  Use 'make htmldocs'
+  or 'make mandocs' to check the kernel-doc and fix any issues.
+
+- Has been tested with CONFIG_PREEMPT, CONFIG_DEBUG_PREEMPT,
+  CONFIG_DEBUG_SLAB, CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_MUTEXES,
+  CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_SPINLOCK_SLEEP all simultaneously
+  enabled.
+
+- Has been build- and runtime tested with and without CONFIG_SMP and
+  CONFIG_PREEMPT.
+
+- If the patch affects IO/Disk, etc: has been tested with and without
+  CONFIG_LBD.
+
+
+2006-APR-27
@@ -3,7 +3,7 @@

 	     Maintained by Torben Mathiasen <device@lanana.org>

-		      Last revised: 25 January 2005
+		      Last revised: 15 May 2006

 This list is the Linux Device List, the official registry of allocated
 device numbers and /dev directory nodes for the Linux operating
@@ -94,7 +94,6 @@ Your cooperation is appreciated.
 		  9 = /dev/urandom	Faster, less secure random number gen.
 		 10 = /dev/aio		Asyncronous I/O notification interface
 		 11 = /dev/kmsg		Writes to this come out as printk's
-		 12 = /dev/oldmem	Access to crash dump from kexec kernel
  1 block	RAM disk
 		  0 = /dev/ram0		First RAM disk
 		  1 = /dev/ram1		Second RAM disk
@@ -262,13 +261,13 @@ Your cooperation is appreciated.
 		NOTE: These devices permit both read and write access.

  7 block	Loopback devices
-		  0 = /dev/loop0	First loopback device
-		  1 = /dev/loop1	Second loopback device
+		  0 = /dev/loop0	First loop device
+		  1 = /dev/loop1	Second loop device
 		    ...

-		The loopback devices are used to mount filesystems not
+		The loop devices are used to mount filesystems not
 		associated with block devices.	The binding to the
-		loopback devices is handled by mount(8) or losetup(8).
+		loop devices is handled by mount(8) or losetup(8).

  8 block	SCSI disk devices (0-15)
 		  0 = /dev/sda		First SCSI disk whole disk
@@ -943,7 +942,7 @@ Your cooperation is appreciated.
 		240 = /dev/ftlp		FTL on 16th Memory Technology Device 

 		Partitions are handled in the same way as for IDE
-		disks (see major number 3) expect that the partition
+		disks (see major number 3) except that the partition
 		limit is 15 rather than 63 per disk (same as SCSI.)

 45 char	isdn4linux ISDN BRI driver
@@ -1168,7 +1167,7 @@ Your cooperation is appreciated.
 		The filename of the encrypted container and the passwords
 		are sent via ioctls (using the sdmount tool) to the master
 		node which then activates them via one of the
-		/dev/scramdisk/x nodes for loopback mounting (all handled
+		/dev/scramdisk/x nodes for loop mounting (all handled
 		through the sdmount tool).

 		Requested by: andy@scramdisklinux.org
@@ -2538,18 +2537,32 @@ Your cooperation is appreciated.
 		  0 = /dev/usb/lp0	First USB printer
 		    ...
 		 15 = /dev/usb/lp15	16th USB printer
-		 16 = /dev/usb/mouse0	First USB mouse
-		    ...
-		 31 = /dev/usb/mouse15	16th USB mouse
-		 32 = /dev/usb/ez0	First USB firmware loader
-		    ...
-		 47 = /dev/usb/ez15	16th USB firmware loader
 		 48 = /dev/usb/scanner0	First USB scanner
 		    ...
 		 63 = /dev/usb/scanner15 16th USB scanner
 		 64 = /dev/usb/rio500	Diamond Rio 500
 		 65 = /dev/usb/usblcd	USBLCD Interface (info@usblcd.de)
 		 66 = /dev/usb/cpad0	Synaptics cPad (mouse/LCD)
+		 96 = /dev/usb/hiddev0	1st USB HID device
+		    ...
+		111 = /dev/usb/hiddev15	16th USB HID device
+		112 = /dev/usb/auer0	1st auerswald ISDN device
+		    ...
+		127 = /dev/usb/auer15	16th auerswald ISDN device
+		128 = /dev/usb/brlvgr0	First Braille Voyager device
+		    ...
+		131 = /dev/usb/brlvgr3	Fourth Braille Voyager device
+		132 = /dev/usb/idmouse	ID Mouse (fingerprint scanner) device
+		133 = /dev/usb/sisusbvga1	First SiSUSB VGA device
+		    ...
+		140 = /dev/usb/sisusbvga8	Eighth SiSUSB VGA device
+		144 = /dev/usb/lcd	USB LCD device
+		160 = /dev/usb/legousbtower0	1st USB Legotower device
+		    ...
+		175 = /dev/usb/legousbtower15	16th USB Legotower device
+		240 = /dev/usb/dabusb0	First dabusb device
+		    ...
+		243 = /dev/usb/dabusb3	Fourth dabusb device

 180 block	USB block devices
 		0 = /dev/uba		First USB block device
@@ -2710,6 +2723,17 @@ Your cooperation is appreciated.
 		  1 = /dev/cpu/1/msr		MSRs on CPU 1
 		    ...

+202 block	Xen Virtual Block Device
+		  0 = /dev/xvda       First Xen VBD whole disk
+		  16 = /dev/xvdb      Second Xen VBD whole disk
+		  32 = /dev/xvdc      Third Xen VBD whole disk
+		    ...
+		  240 = /dev/xvdp     Sixteenth Xen VBD whole disk
+
+                Partitions are handled in the same way as for IDE
+                disks (see major number 3) except that the limit on
+                partitions is 15.
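The Xen VBD numbering above gives each disk 16 consecutive minors: one for the whole disk and 15 for partitions. The C sketch below turns a minor number into a device name on that basis. The function name xvd_name() is made up for illustration, and the partition suffix (for example /dev/xvdb1) is an assumption based on the statement that partitions are handled as for IDE disks.

```c
#include <assert.h>
#include <stdio.h>

#define XVD_MINORS_PER_DISK	16	/* whole disk + 15 partitions */
#define XVD_MAX_DISKS		16	/* xvda .. xvdp */

/*
 * Format the /dev name for a minor number under block major 202.
 * Returns 0 on success, -1 if the minor is outside the listed range
 * or the buffer is too small.  Partition 0 is the whole disk, so the
 * name then carries no numeric suffix.
 * (Hypothetical helper, for illustration only.)
 */
int xvd_name(unsigned int minor, char *buf, size_t size)
{
	unsigned int disk = minor / XVD_MINORS_PER_DISK;
	unsigned int part = minor % XVD_MINORS_PER_DISK;
	int n;

	assert(buf && size > 0);

	if (disk >= XVD_MAX_DISKS)
		return -1;
	if (part == 0)
		n = snprintf(buf, size, "/dev/xvd%c", 'a' + (int)disk);
	else
		n = snprintf(buf, size, "/dev/xvd%c%u", 'a' + (int)disk, part);
	return (n > 0 && (size_t)n < size) ? 0 : -1;
}
```

Under these assumptions minor 16 maps to /dev/xvdb and minor 17 to its first partition, while any minor of 256 or above is rejected.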
+
 203 char	CPU CPUID information
 		  0 = /dev/cpu/0/cpuid		CPUID on CPU 0
 		  1 = /dev/cpu/1/cpuid		CPUID on CPU 1
@@ -2747,11 +2771,27 @@ Your cooperation is appreciated.
 		 46 = /dev/ttyCPM0		PPC CPM (SCC or SMC) - port 0
 		    ...
 		 47 = /dev/ttyCPM5		PPC CPM (SCC or SMC) - port 5
-		 50 = /dev/ttyIOC40		Altix serial card
+		 50 = /dev/ttyIOC0		Altix serial card
 		    ...
-		 81 = /dev/ttyIOC431		Altix serial card
-		 82 = /dev/ttyVR0               NEC VR4100 series SIU
-		 83 = /dev/ttyVR1               NEC VR4100 series DSIU
+		 81 = /dev/ttyIOC31		Altix serial card
+		 82 = /dev/ttyVR0		NEC VR4100 series SIU
+		 83 = /dev/ttyVR1		NEC VR4100 series DSIU
+		 84 = /dev/ttyIOC84		Altix ioc4 serial card
+		    ...
+		 115 = /dev/ttyIOC115		Altix ioc4 serial card
+		 116 = /dev/ttySIOC0		Altix ioc3 serial card
+		    ...
+		 147 = /dev/ttySIOC31		Altix ioc3 serial card
+		 148 = /dev/ttyPSC0		PPC PSC - port 0
+		    ...
+		 153 = /dev/ttyPSC5		PPC PSC - port 5
+		 154 = /dev/ttyAT0		ATMEL serial port 0
+		    ...
+		 169 = /dev/ttyAT15		ATMEL serial port 15
+		 170 = /dev/ttyNX0		Hilscher netX serial port 0
+		    ...
+		 185 = /dev/ttyNX15		Hilscher netX serial port 15
+		 186 = /dev/ttyJ0		JTAG1 DCC protocol based serial port emulation

 205 char	Low-density serial ports (alternate device)
 		  0 = /dev/culu0		Callout device for ttyLU0
@@ -2786,8 +2826,8 @@ Your cooperation is appreciated.
 		 50 = /dev/cuioc40		Callout device for ttyIOC40
 		    ...
 		 81 = /dev/cuioc431		Callout device for ttyIOC431
-		 82 = /dev/cuvr0                Callout device for ttyVR0
-		 83 = /dev/cuvr1                Callout device for ttyVR1
+		 82 = /dev/cuvr0		Callout device for ttyVR0
+		 83 = /dev/cuvr1		Callout device for ttyVR1


 206 char	OnStream SC-x0 tape devices
@@ -2897,7 +2937,6 @@ Your cooperation is appreciated.
 		    ...
 		196 = /dev/dvb/adapter3/video0    first video decoder of fourth card

-
 216 char	Bluetooth RFCOMM TTY devices
 		  0 = /dev/rfcomm0		First Bluetooth RFCOMM TTY device
 		  1 = /dev/rfcomm1		Second Bluetooth RFCOMM TTY device
@@ -3002,12 +3041,43 @@ Your cooperation is appreciated.
 		ioctl()'s can be used to rewind the tape regardless of
 		the device used to access it.

-231 char	InfiniBand MAD
+231 char	InfiniBand
 		0 = /dev/infiniband/umad0
 		1 = /dev/infiniband/umad1
-		 ...
+		  ...
+		63 = /dev/infiniband/umad63    64th InfiniBand MAD device
+		64 = /dev/infiniband/issm0     First InfiniBand IsSM device
+		65 = /dev/infiniband/issm1     Second InfiniBand IsSM device
+		  ...
+		127 = /dev/infiniband/issm63    64th InfiniBand IsSM device
+		128 = /dev/infiniband/uverbs0   First InfiniBand verbs device
+		129 = /dev/infiniband/uverbs1   Second InfiniBand verbs device
+		  ...
+		159 = /dev/infiniband/uverbs31  32nd InfiniBand verbs device

-232-239		UNASSIGNED
+232 char	Biometric Devices
+		0 = /dev/biometric/sensor0/fingerprint	first fingerprint sensor on first device
+		1 = /dev/biometric/sensor0/iris		first iris sensor on first device
+		2 = /dev/biometric/sensor0/retina	first retina sensor on first device
+		3 = /dev/biometric/sensor0/voiceprint	first voiceprint sensor on first device
+		4 = /dev/biometric/sensor0/facial	first facial sensor on first device
+		5 = /dev/biometric/sensor0/hand		first hand sensor on first device
+		  ...
+		10 = /dev/biometric/sensor1/fingerprint	first fingerprint sensor on second device
+		  ...
+		20 = /dev/biometric/sensor2/fingerprint	first fingerprint sensor on third device
+		  ...
+
+233 char	PathScale InfiniPath interconnect
+		0 = /dev/ipath        Primary device for programs (any unit)
+		1 = /dev/ipath0       Access specifically to unit 0
+		2 = /dev/ipath1       Access specifically to unit 1
+		  ...
+		4 = /dev/ipath3       Access specifically to unit 3
+		129 = /dev/ipath_sma    Device used by Subnet Management Agent
+		130 = /dev/ipath_diag   Device used by diagnostics programs
+
+234-239		UNASSIGNED

 240-254 char	LOCAL/EXPERIMENTAL USE
 240-254 block	LOCAL/EXPERIMENTAL USE
@@ -3021,6 +3091,28 @@ Your cooperation is appreciated.
 		This major is reserved to assist the expansion to a
 		larger number space.  No device nodes with this major
 		should ever be created on the filesystem.
+		(This is probably not true anymore, but I'll leave it
+		for now /Torben)
+
+---LARGE MAJORS!!!!!---
+
+256 char	Equinox SST multi-port serial boards
+		   0 = /dev/ttyEQ0	First serial port on first Equinox SST board
+		 127 = /dev/ttyEQ127	Last serial port on first Equinox SST board
+		 128 = /dev/ttyEQ128	First serial port on second Equinox SST board
+		  ...
+		1027 = /dev/ttyEQ1027	Last serial port on eighth Equinox SST board
+
+256 block	Resident Flash Disk Flash Translation Layer
+		  0 = /dev/rfda		First RFD FTL layer
+		 16 = /dev/rfdb		Second RFD FTL layer
+		  ...
+		240 = /dev/rfdp		16th RFD FTL layer
+
+257 char	Phoenix Technologies Cryptographic Services Driver
+		  0 = /dev/ptlsec	Crypto Services Driver
+
+

 ****	ADDITIONAL /dev DIRECTORY ENTRIES

@@ -33,21 +33,6 @@ Who:	Adrian Bunk <bunk@stusta.de>

 ---------------------------

-What:	RCU API moves to EXPORT_SYMBOL_GPL
-When:	April 2006
-Files:	include/linux/rcupdate.h, kernel/rcupdate.c
-Why:	Outside of Linux, the only implementations of anything even
-	vaguely resembling RCU that I am aware of are in DYNIX/ptx,
-	VM/XA, Tornado, and K42.  I do not expect anyone to port binary
-	drivers or kernel modules from any of these, since the first two
-	are owned by IBM and the last two are open-source research OSes.
-	So these will move to GPL after a grace period to allow
-	people, who might be using implementations that I am not aware
-	of, to adjust to this upcoming change.
-Who:	Paul E. McKenney <paulmck@us.ibm.com>
-
---------------------------
-
 What:	raw1394: requests of type RAW1394_REQ_ISO_SEND, RAW1394_REQ_ISO_LISTEN
 When:	November 2006
 Why:	Deprecated in favour of the new ioctl-based rawiso interface, which is
@@ -99,7 +99,7 @@ prototypes:
 	int (*sync_fs)(struct super_block *sb, int wait);
 	void (*write_super_lockfs) (struct super_block *);
 	void (*unlockfs) (struct super_block *);
-	int (*statfs) (struct super_block *, struct kstatfs *);
+	int (*statfs) (struct dentry *, struct kstatfs *);
 	int (*remount_fs) (struct super_block *, int *, char *);
 	void (*clear_inode) (struct inode *);
 	void (*umount_begin) (struct super_block *);
@@ -142,15 +142,16 @@ see also dquot_operations section.

 --------------------------- file_system_type ---------------------------
 prototypes:
-	struct super_block *(*get_sb) (struct file_system_type *, int,
-			const char *, void *);
+	int (*get_sb) (struct file_system_type *, int,
+			const char *, void *, struct vfsmount *);
 	void (*kill_sb) (struct super_block *);
 locking rules:
 		may block	BKL
 get_sb		yes		yes
 kill_sb		yes		yes

->get_sb() returns error or a locked superblock (exclusive on ->s_umount).
+->get_sb() returns error or 0 with locked superblock attached to the vfsmount
+(exclusive on ->s_umount).
 ->kill_sb() takes a write-locked superblock, does all shutdown work on it,
 unlocks and drops the reference.

@@ -19,7 +19,7 @@ following procedure:

 (2) Have the follow_link() op do the following steps:

-     (a) Call do_kern_mount() to call the appropriate filesystem to set up a
+     (a) Call vfs_kern_mount() to call the appropriate filesystem to set up a
         superblock and gain a vfsmount structure representing it.

     (b) Copy the nameidata provided as an argument and substitute the dentry
@@ -18,6 +18,14 @@ Non-privileged mount (or user mount):
  user.  NOTE: this is not the same as mounts allowed with the "user"
  option in /etc/fstab, which is not discussed here.

+Filesystem connection:
+
+  A connection between the filesystem daemon and the kernel.  The
+  connection exists until either the daemon dies, or the filesystem is
+  umounted.  Note that detaching (or lazy umounting) the filesystem
+  does _not_ break the connection; in this case it will exist until
+  the last reference to the filesystem is released.
+
 Mount owner:

  The user who does the mounting.
@@ -86,16 +94,20 @@ Mount options
  The default is infinite.  Note that the size of read requests is
  limited anyway to 32 pages (which is 128kbyte on i386).

-Sysfs
-~~~~~
+Control filesystem
+~~~~~~~~~~~~~~~~~~

-FUSE sets up the following hierarchy in sysfs:
+There's a control filesystem for FUSE, which can be mounted by:

-  /sys/fs/fuse/connections/N/
+  mount -t fusectl none /sys/fs/fuse/connections

-where N is an increasing number allocated to each new connection.
+Mounting it under the '/sys/fs/fuse/connections' directory makes it
+backwards compatible with earlier versions.

-For each connection the following attributes are defined:
+Under the fuse control filesystem each connection has a directory
+named by a unique number.
+
+For each connection the following files exist within this directory:

 'waiting'

@@ -110,7 +122,47 @@ For each connection the following attributes are defined:
  connection.  This means that all waiting requests will be aborted and an
  error returned for all aborted and new requests.

-Only a privileged user may read or write these attributes.
+Only the owner of the mount may read or write these files.
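As a rough sketch of how a program might use these files, the following builds
the path of a per-connection control file and writes to 'abort'.  It assumes
the control filesystem is mounted at the location suggested above; the helper
names fusectl_path() and fusectl_abort() are hypothetical, and the byte
written is arbitrary.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Mount point suggested in the text above (an assumption of this sketch). */
#define FUSECTL_DIR "/sys/fs/fuse/connections"

/*
 * Hypothetical helper: build the path of a control file ('waiting' or
 * 'abort') for connection number conn.  Returns 0 on success, -1 if the
 * file name contains a '/' or the result does not fit in buf.
 */
static int fusectl_path(unsigned int conn, const char *file,
			char *buf, size_t len)
{
	int n;

	assert(file != NULL && buf != NULL);
	if (strchr(file, '/') != NULL)
		return -1;
	n = snprintf(buf, len, "%s/%u/%s", FUSECTL_DIR, conn, file);
	return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/*
 * Hypothetical helper: abort connection conn by writing one byte to its
 * 'abort' file.  Returns 0 on success, -1 on error (errno is left as
 * set by open() or write()).
 */
static int fusectl_abort(unsigned int conn)
{
	char path[128];
	ssize_t written;
	int fd;

	if (fusectl_path(conn, "abort", path, sizeof(path)) != 0)
		return -1;
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	written = write(fd, "1", 1);
	close(fd);
	return written == 1 ? 0 : -1;
}
```

Since only the mount owner may access these files, fusectl_abort() is expected
to fail with a permission error when called by any other user.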
+
+Interrupting filesystem operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a process issuing a FUSE filesystem request is interrupted, the
+following will happen:
+
+  1) If the request is not yet sent to userspace AND the signal is
+     fatal (SIGKILL or unhandled fatal signal), then the request is
+     dequeued and returns immediately.
+
+  2) If the request is not yet sent to userspace AND the signal is not
+     fatal, then an 'interrupted' flag is set for the request.  When
+     the request has been successfully transferred to userspace and
+     this flag is set, an INTERRUPT request is queued.
+
+  3) If the request is already sent to userspace, then an INTERRUPT
+     request is queued.
+
+INTERRUPT requests take precedence over other requests, so the
+userspace filesystem will receive queued INTERRUPTs before any others.
+
+The userspace filesystem may ignore the INTERRUPT requests entirely,
+or may honor them by sending a reply to the _original_ request, with
+the error set to EINTR.
+
+It is also possible that there's a race between processing the
+original request and its INTERRUPT request.  There are two possibilities:
+
+  1) The INTERRUPT request is processed before the original request is
+     processed
+
+  2) The INTERRUPT request is processed after the original request has
+     been answered
+
+If the filesystem cannot find the original request, it should wait for
+some timeout and/or a number of new requests to arrive, after which it
+should reply to the INTERRUPT request with an EAGAIN error.  In case
+1) the INTERRUPT request will be requeued.  In case 2) the INTERRUPT
+reply will be ignored.
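The userspace side of the rules above can be sketched roughly as follows.
This is not the FUSE library's code: the pending table, the reply structure
and all function names are hypothetical, and the wait for a timeout or for
further requests before answering EAGAIN is left out for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64

/* Hypothetical bookkeeping: IDs of requests read but not yet answered. */
struct pending_table {
	uint64_t ids[MAX_PENDING];
	size_t count;
};

/* What the daemon should send in response to one INTERRUPT request. */
struct interrupt_reply {
	uint64_t reply_to;	/* ID of the request being answered */
	int error;		/* error value carried in the reply */
};

/* Record a request as pending.  Returns 0, or -1 if the table is full. */
static int pending_add(struct pending_table *t, uint64_t id)
{
	assert(t != NULL);
	if (t->count >= MAX_PENDING)
		return -1;
	t->ids[t->count++] = id;
	return 0;
}

/* Forget a pending request.  Returns 0 if it was found, -1 otherwise. */
static int pending_remove(struct pending_table *t, uint64_t id)
{
	size_t i;

	assert(t != NULL);
	for (i = 0; i < t->count; i++) {
		if (t->ids[i] == id) {
			t->ids[i] = t->ids[--t->count];
			return 0;
		}
	}
	return -1;
}

/*
 * Honor an INTERRUPT: if the original request is still pending, answer
 * the _original_ request with EINTR.  Otherwise answer the INTERRUPT
 * request itself with EAGAIN (a real daemon would first wait a while,
 * as described above).
 */
static struct interrupt_reply handle_interrupt(struct pending_table *t,
					       uint64_t interrupt_id,
					       uint64_t original_id)
{
	struct interrupt_reply r;

	if (pending_remove(t, original_id) == 0) {
		r.reply_to = original_id;
		r.error = EINTR;
	} else {
		r.reply_to = interrupt_id;
		r.error = EAGAIN;
	}
	return r;
}
```

A daemon that prefers to ignore INTERRUPT requests entirely, which the text
above also permits, would simply never call handle_interrupt().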

 Aborting a filesystem connection
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -139,8 +191,8 @@ the filesystem.  There are several ways to do this:
  - Use forced umount (umount -f).  Works in all cases but only if
    filesystem is still attached (it hasn't been lazy unmounted)

-  - Abort filesystem through the sysfs interface.  Most powerful
-    method, always works.
+  - Abort filesystem through the FUSE control filesystem.  Most
+    powerful method, always works.

 How do non-privileged mounts work?
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -304,25 +356,7 @@ Scenario 1 -  Simple deadlock
 |                                    |     for "file"]
 |                                    |    *DEADLOCK*

-The solution for this is to allow requests to be interrupted while
-they are in userspace:
-
- |      [interrupted by signal]       |
- |    <fuse_unlink()                  |
- |    [release semaphore]             |    [semaphore acquired]
- |  <sys_unlink()                     |
- |                                    |    >fuse_unlink()
- |                                    |      [queue req on fc->pending]
- |                                    |      [wake up fc->waitq]
- |                                    |      [sleep on req->waitq]
-
-If the filesystem daemon was single threaded, this will stop here,
-since there's no other thread to dequeue and execute the request.
-In this case the solution is to kill the FUSE daemon as well.  If
-there are multiple serving threads, you just have to kill them as
-long as any remain.
-
-Moral: a filesystem which deadlocks, can soon find itself dead.
+The solution for this is to allow the filesystem to be aborted.

 Scenario 2 - Tricky deadlock
 ----------------------------
@@ -355,24 +389,14 @@ but is caused by a pagefault.
 |                                    |           [lock page]
 |                                    |           * DEADLOCK *

-Solution is again to let the the request be interrupted (not
-elaborated further).
+Solution is basically the same as above.

-An additional problem is that while the write buffer is being
-copied to the request, the request must not be interrupted.  This
-is because the destination address of the copy may not be valid
-after the request is interrupted.
+An additional problem is that while the write buffer is being copied
+to the request, the request must not be interrupted/aborted.  This is
+because the destination address of the copy may not be valid after the
+request has returned.

-This is solved with doing the copy atomically, and allowing
-interruption while the page(s) belonging to the write buffer are
-faulted with get_user_pages().  The 'req->locked' flag indicates
-when the copy is taking place, and interruption is delayed until
-this flag is unset.
-
-Scenario 3 - Tricky deadlock with asynchronous read
---------------------------------------------------
-
-The same situation as above, except thread-1 will wait on page lock
-and hence it will be uninterruptible as well.  The solution is to
-abort the connection with forced umount (if mount is attached) or
-through the abort attribute in sysfs.
+This is solved by doing the copy atomically, and allowing abort
+while the page(s) belonging to the write buffer are faulted with
+get_user_pages().  The 'req->locked' flag indicates when the copy is
+taking place, and abort is delayed until this flag is unset.
@@ -50,10 +50,11 @@ Turn your foo_read_super() into a function that would return 0 in case of
 success and negative number in case of error (-EINVAL unless you have more
 informative error value to report).  Call it foo_fill_super().  Now declare

-struct super_block foo_get_sb(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+int foo_get_sb(struct file_system_type *fs_type,
+	int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
-	return get_sb_bdev(fs_type, flags, dev_name, data, ext2_fill_super);
+	return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super,
+			   mnt);
 }

 (or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of
@@ -70,11 +70,13 @@ tmpfs mounts.  See Documentation/filesystems/tmpfs.txt for more information.
 What is rootfs?
 ---------------

-Rootfs is a special instance of ramfs, which is always present in 2.6 systems.
-(It's used internally as the starting and stopping point for searches of the
-kernel's doubly-linked list of mount points.)
+Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is
+always present in 2.6 systems.  You can't unmount rootfs for approximately the
+same reason you can't kill the init process; rather than having special code
+to check for and handle an empty list, it's smaller and simpler for the kernel
+to just make sure certain lists can't become empty.

-Most systems just mount another filesystem over it and ignore it.  The
+Most systems just mount another filesystem over rootfs and ignore it.  The
 amount of space an empty instance of ramfs takes up is tiny.

 What is initramfs?
@@ -92,14 +94,16 @@ out of that.

 All this differs from the old initrd in several ways:

-  - The old initrd was a separate file, while the initramfs archive is linked
-    into the linux kernel image.  (The directory linux-*/usr is devoted to
-    generating this archive during the build.)
+  - The old initrd was always a separate file, while the initramfs archive is
+    linked into the linux kernel image.  (The directory linux-*/usr is devoted
+    to generating this archive during the build.)

  - The old initrd file was a gzipped filesystem image (in some file format,
-    such as ext2, that had to be built into the kernel), while the new
+    such as ext2, that needed a driver built into the kernel), while the new
    initramfs archive is a gzipped cpio archive (like tar only simpler,
-    see cpio(1) and Documentation/early-userspace/buffer-format.txt).
+    see cpio(1) and Documentation/early-userspace/buffer-format.txt).  The
+    kernel's cpio extraction code is not only extremely small, it's also
+    __init data that can be discarded during the boot process.

  - The program run by the old initrd (which was called /initrd, not /init) did
    some setup and then returned to the kernel, while the init program from
@@ -124,13 +128,14 @@ Populating initramfs:

 The 2.6 kernel build process always creates a gzipped cpio format initramfs
 archive and links it into the resulting kernel binary.  By default, this
-archive is empty (consuming 134 bytes on x86).  The config option
-CONFIG_INITRAMFS_SOURCE (for some reason buried under devices->block devices
-in menuconfig, and living in usr/Kconfig) can be used to specify a source for
-the initramfs archive, which will automatically be incorporated into the
-resulting binary.  This option can point to an existing gzipped cpio archive, a
-directory containing files to be archived, or a text file specification such
-as the following example:
+archive is empty (consuming 134 bytes on x86).
+
+The config option CONFIG_INITRAMFS_SOURCE (for some reason buried under
+devices->block devices in menuconfig, and living in usr/Kconfig) can be used
+to specify a source for the initramfs archive, which will automatically be
+incorporated into the resulting binary.  This option can point to an existing
+gzipped cpio archive, a directory containing files to be archived, or a text
+file specification such as the following example:

  dir /dev 755 0 0
  nod /dev/console 644 0 0 c 5 1
@@ -146,23 +151,84 @@ as the following example:
 Run "usr/gen_init_cpio" (after the kernel build) to get a usage message
 documenting the above file format.

-One advantage of the text file is that root access is not required to
+One advantage of the configuration file is that root access is not required to
 set permissions or create device nodes in the new archive.  (Note that those
 two example "file" entries expect to find files named "init.sh" and "busybox" in
 a directory called "initramfs", under the linux-2.6.* directory.  See
 Documentation/early-userspace/README for more details.)

-The kernel does not depend on external cpio tools, gen_init_cpio is created
-from usr/gen_init_cpio.c which is entirely self-contained, and the kernel's
-boot-time extractor is also (obviously) self-contained.  However, if you _do_
-happen to have cpio installed, the following command line can extract the
-generated cpio image back into its component files:
+The kernel does not depend on external cpio tools.  If you specify a
+directory instead of a configuration file, the kernel's build infrastructure
+creates a configuration file from that directory (usr/Makefile calls
+scripts/gen_initramfs_list.sh), and proceeds to package up that directory
+using the config file (by feeding it to usr/gen_init_cpio, which is created
+from usr/gen_init_cpio.c).  The kernel's build-time cpio creation code is
+entirely self-contained, and the kernel's boot-time extractor is also
+(obviously) self-contained.
+
+The one thing you might need external cpio utilities installed for is creating
+or extracting your own preprepared cpio files to feed to the kernel build
+(instead of a config file or directory).
+
+The following command line can extract a cpio image (created either by the
+script below or by the kernel build) back into its component files:

  cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames

+The following shell script can create a prebuilt cpio archive you can
+use in place of the above config file:
+
+  #!/bin/sh
+
+  # Copyright 2006 Rob Landley <rob@landley.net> and TimeSys Corporation.
+  # Licensed under GPL version 2
+
+  if [ $# -ne 2 ]
+  then
+    echo "usage: mkinitramfs directory imagename.cpio.gz"
+    exit 1
+  fi
+
+  if [ -d "$1" ]
+  then
+    echo "creating $2 from $1"
+    (cd "$1"; find . | cpio -o -H newc | gzip) > "$2"
+  else
+    echo "First argument must be a directory"
+    exit 1
+  fi
+
+Note: The cpio man page contains some bad advice that will break your initramfs
+archive if you follow it.  It says "A typical way to generate the list
+of filenames is with the find command; you should give find the -depth option
+to minimize problems with permissions on directories that are unwritable or not
+searchable."  Don't do this when creating initramfs.cpio.gz images, it won't
+work.  The Linux kernel cpio extractor won't create files in a directory that
+doesn't exist, so the directory entries must go before the files that go in
+those directories.  The above script gets them in the right order.
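The ordering requirement can be illustrated with a small check over a list of
find-style relative paths.  This is only a sketch: parents_come_first() is a
hypothetical helper, not part of the kernel or of cpio, and it assumes paths
of the form "./dir/file" as printed by "find .".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the first len bytes of name match one of paths[0..idx-1]. */
static int seen_before(const char *const *paths, size_t idx,
		       const char *name, size_t len)
{
	size_t i;

	for (i = 0; i < idx; i++)
		if (strlen(paths[i]) == len &&
		    memcmp(paths[i], name, len) == 0)
			return 1;
	return 0;
}

/*
 * Return 1 if every entry's parent directory is listed before the entry
 * itself, 0 otherwise.  The top-level directory "." is treated as always
 * present, so "./init" needs no earlier entry.
 */
static int parents_come_first(const char *const *paths, size_t n)
{
	size_t i;

	assert(paths != NULL);
	for (i = 0; i < n; i++) {
		const char *slash = strrchr(paths[i], '/');
		size_t parent_len;

		if (slash == NULL)
			continue;	/* "." or a bare name: no parent */
		parent_len = (size_t)(slash - paths[i]);
		if (parent_len == 1 && paths[i][0] == '.')
			continue;	/* parent is the top-level directory */
		if (!seen_before(paths, i, paths[i], parent_len))
			return 0;
	}
	return 1;
}
```

A listing in plain "find ." order (".", "./dev", "./dev/console") passes this
check, while the same entries in "find -depth" order ("./dev/console",
"./dev", ".") fail it, because the file is listed before its directory.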
+
+External initramfs images:
+--------------------------
+
+If the kernel has initrd support enabled, an external cpio.gz archive can also
+be passed into a 2.6 kernel in place of an initrd.  In this case, the kernel
+will autodetect the type (initramfs, not initrd) and extract the external cpio
+archive into rootfs before trying to run /init.
+
+This has the memory efficiency advantages of initramfs (no ramdisk block
+device) but the separate packaging of initrd (which is nice if you have
+non-GPL code you'd like to run from initramfs, without conflating it with
+the GPL licensed Linux kernel binary).
+
+It can also be used to supplement the kernel's built-in initramfs image.  The
+files in the external archive will overwrite any conflicting files in
+the built-in initramfs archive.  Some distributors also prefer to customize
+a single kernel image with task-specific initramfs images, without recompiling.
+
 Contents of initramfs:
 ----------------------

+An initramfs archive is a complete self-contained root filesystem for Linux.
 If you don't already understand what shared libraries, devices, and paths
 you need to get a minimal root filesystem up and running, here are some
 references:
@@ -176,13 +242,36 @@ code against, along with some related utilities.  It is BSD licensed.

 I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
 myself.  These are LGPL and GPL, respectively.  (A self-contained initramfs
-package is planned for the busybox 1.2 release.)
+package is planned for the busybox 1.3 release.)

 In theory you could use glibc, but that's not well suited for small embedded
 uses like this.  (A "hello world" program statically linked against glibc is
 over 400k.  With uClibc it's 7k.  Also note that glibc dlopens libnss to do
 name lookups, even when otherwise statically linked.)

+A good first step is to get initramfs to run a statically linked "hello world"
+program as init, and test it under an emulator like qemu (www.qemu.org) or
+User Mode Linux, like so:
+
+  cat > hello.c << EOF
+  #include <stdio.h>
+  #include <unistd.h>
+
+  int main(int argc, char *argv[])
+  {
+    printf("Hello world!\n");
+    sleep(999999999);
+  }
+  EOF
+  gcc -static hello.c -o init
+  echo init | cpio -o -H newc | gzip > test.cpio.gz
+  # Testing external initramfs using the initrd loading mechanism.
+  qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero
+
+When debugging a normal root filesystem, it's nice to be able to boot with
+"init=/bin/sh".  The initramfs equivalent is "rdinit=/bin/sh", and it's
+just as useful.
+
 Why cpio rather than tar?
 -------------------------

@@ -241,7 +330,7 @@ the above threads) is:
 Future directions:
 ------------------

-Today (2.6.14), initramfs is always compiled in, but not always used.  The
+Today (2.6.16), initramfs is always compiled in, but not always used.  The
 kernel falls back to legacy boot code that is reached only if initramfs does
 not contain an /init program.  The fallback is legacy code, there to ensure a
 smooth transition and allowing early boot functionality to gradually move to
@@ -258,8 +347,9 @@ and so on.

 This kind of complexity (which inevitably includes policy) is rightly handled
 in userspace.  Both klibc and busybox/uClibc are working on simple initramfs
-packages to drop into a kernel build, and when standard solutions are ready
-and widely deployed, the kernel's legacy early boot code will become obsolete
-and a candidate for the feature removal schedule.
+packages to drop into a kernel build.

-But that's a while off yet.
+The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree.
+The kernel's current early boot code (partition detection, etc) will probably
+be migrated into a default initramfs, automatically created and used by the
+kernel build.