You've already forked linux-apfs
mirror of
https://github.com/linux-apfs/linux-apfs.git
synced 2026-05-01 15:00:59 -07:00
Merge branch 'linus' into cpus4096
Conflicts: arch/x86/xen/smp.c kernel/sched_rt.c net/iucv/iucv.c Signed-off-by: Ingo Molnar <mingo@elte.hu>
This commit is contained in:
@@ -26,3 +26,37 @@ Description:
|
||||
I/O statistics of partition <part>. The format is the
|
||||
same as the above-written /sys/block/<disk>/stat
|
||||
format.
|
||||
|
||||
|
||||
What: /sys/block/<disk>/integrity/format
|
||||
Date: June 2008
|
||||
Contact: Martin K. Petersen <martin.petersen@oracle.com>
|
||||
Description:
|
||||
Metadata format for integrity capable block device.
|
||||
E.g. T10-DIF-TYPE1-CRC.
|
||||
|
||||
|
||||
What: /sys/block/<disk>/integrity/read_verify
|
||||
Date: June 2008
|
||||
Contact: Martin K. Petersen <martin.petersen@oracle.com>
|
||||
Description:
|
||||
Indicates whether the block layer should verify the
|
||||
integrity of read requests serviced by devices that
|
||||
support sending integrity metadata.
|
||||
|
||||
|
||||
What: /sys/block/<disk>/integrity/tag_size
|
||||
Date: June 2008
|
||||
Contact: Martin K. Petersen <martin.petersen@oracle.com>
|
||||
Description:
|
||||
Number of bytes of integrity tag space available per
|
||||
512 bytes of data.
|
||||
|
||||
|
||||
What: /sys/block/<disk>/integrity/write_generate
|
||||
Date: June 2008
|
||||
Contact: Martin K. Petersen <martin.petersen@oracle.com>
|
||||
Description:
|
||||
Indicates whether the block layer should automatically
|
||||
generate checksums for write requests bound for
|
||||
devices that support receiving integrity metadata.
|
||||
|
||||
@@ -0,0 +1,35 @@
|
||||
What: /sys/bus/css/devices/.../type
|
||||
Date: March 2008
|
||||
Contact: Cornelia Huck <cornelia.huck@de.ibm.com>
|
||||
linux-s390@vger.kernel.org
|
||||
Description: Contains the subchannel type, as reported by the hardware.
|
||||
This attribute is present for all subchannel types.
|
||||
|
||||
What: /sys/bus/css/devices/.../modalias
|
||||
Date: March 2008
|
||||
Contact: Cornelia Huck <cornelia.huck@de.ibm.com>
|
||||
linux-s390@vger.kernel.org
|
||||
Description: Contains the module alias as reported with uevents.
|
||||
It is of the format css:t<type> and present for all
|
||||
subchannel types.
|
||||
|
||||
What: /sys/bus/css/drivers/io_subchannel/.../chpids
|
||||
Date: December 2002
|
||||
Contact: Cornelia Huck <cornelia.huck@de.ibm.com>
|
||||
linux-s390@vger.kernel.org
|
||||
Description: Contains the ids of the channel paths used by this
|
||||
subchannel, as reported by the channel subsystem
|
||||
during subchannel recognition.
|
||||
Note: This is an I/O-subchannel specific attribute.
|
||||
Users: s390-tools, HAL
|
||||
|
||||
What: /sys/bus/css/drivers/io_subchannel/.../pimpampom
|
||||
Date: December 2002
|
||||
Contact: Cornelia Huck <cornelia.huck@de.ibm.com>
|
||||
linux-s390@vger.kernel.org
|
||||
Description: Contains the PIM/PAM/POM values, as reported by the
|
||||
channel subsystem when last queried by the common I/O
|
||||
layer (this implies that this attribute is not neccessarily
|
||||
in sync with the values current in the channel subsystem).
|
||||
Note: This is an I/O-subchannel specific attribute.
|
||||
Users: s390-tools, HAL
|
||||
@@ -0,0 +1,71 @@
|
||||
What: /sys/firmware/memmap/
|
||||
Date: June 2008
|
||||
Contact: Bernhard Walle <bwalle@suse.de>
|
||||
Description:
|
||||
On all platforms, the firmware provides a memory map which the
|
||||
kernel reads. The resources from that memory map are registered
|
||||
in the kernel resource tree and exposed to userspace via
|
||||
/proc/iomem (together with other resources).
|
||||
|
||||
However, on most architectures that firmware-provided memory
|
||||
map is modified afterwards by the kernel itself, either because
|
||||
the kernel merges that memory map with other information or
|
||||
just because the user overwrites that memory map via command
|
||||
line.
|
||||
|
||||
kexec needs the raw firmware-provided memory map to setup the
|
||||
parameter segment of the kernel that should be booted with
|
||||
kexec. Also, the raw memory map is useful for debugging. For
|
||||
that reason, /sys/firmware/memmap is an interface that provides
|
||||
the raw memory map to userspace.
|
||||
|
||||
The structure is as follows: Under /sys/firmware/memmap there
|
||||
are subdirectories with the number of the entry as their name:
|
||||
|
||||
/sys/firmware/memmap/0
|
||||
/sys/firmware/memmap/1
|
||||
/sys/firmware/memmap/2
|
||||
/sys/firmware/memmap/3
|
||||
...
|
||||
|
||||
The maximum depends on the number of memory map entries provided
|
||||
by the firmware. The order is just the order that the firmware
|
||||
provides.
|
||||
|
||||
Each directory contains three files:
|
||||
|
||||
start : The start address (as hexadecimal number with the
|
||||
'0x' prefix).
|
||||
end : The end address, inclusive (regardless whether the
|
||||
firmware provides inclusive or exclusive ranges).
|
||||
type : Type of the entry as string. See below for a list of
|
||||
valid types.
|
||||
|
||||
So, for example:
|
||||
|
||||
/sys/firmware/memmap/0/start
|
||||
/sys/firmware/memmap/0/end
|
||||
/sys/firmware/memmap/0/type
|
||||
/sys/firmware/memmap/1/start
|
||||
...
|
||||
|
||||
Currently following types exist:
|
||||
|
||||
- System RAM
|
||||
- ACPI Tables
|
||||
- ACPI Non-volatile Storage
|
||||
- reserved
|
||||
|
||||
Following shell snippet can be used to display that memory
|
||||
map in a human-readable format:
|
||||
|
||||
-------------------- 8< ----------------------------------------
|
||||
#!/bin/bash
|
||||
cd /sys/firmware/memmap
|
||||
for dir in * ; do
|
||||
start=$(cat $dir/start)
|
||||
end=$(cat $dir/end)
|
||||
type=$(cat $dir/type)
|
||||
printf "%016x-%016x (%s)\n" $start $[ $end +1] "$type"
|
||||
done
|
||||
-------------------- >8 ----------------------------------------
|
||||
+1
-1
@@ -377,7 +377,7 @@ Bug Reporting
|
||||
bugzilla.kernel.org is where the Linux kernel developers track kernel
|
||||
bugs. Users are encouraged to report all bugs that they find in this
|
||||
tool. For details on how to use the kernel bugzilla, please see:
|
||||
http://test.kernel.org/bugzilla/faq.html
|
||||
http://bugzilla.kernel.org/page.cgi?id=faq.html
|
||||
|
||||
The file REPORTING-BUGS in the main kernel source directory has a good
|
||||
template for how to report a possible kernel bug, and details what kind
|
||||
|
||||
@@ -1,17 +1,26 @@
|
||||
ChangeLog:
|
||||
Started by Ingo Molnar <mingo@redhat.com>
|
||||
Update by Max Krasnyansky <maxk@qualcomm.com>
|
||||
|
||||
SMP IRQ affinity, started by Ingo Molnar <mingo@redhat.com>
|
||||
|
||||
SMP IRQ affinity
|
||||
|
||||
/proc/irq/IRQ#/smp_affinity specifies which target CPUs are permitted
|
||||
for a given IRQ source. It's a bitmask of allowed CPUs. It's not allowed
|
||||
to turn off all CPUs, and if an IRQ controller does not support IRQ
|
||||
affinity then the value will not change from the default 0xffffffff.
|
||||
|
||||
Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting
|
||||
the IRQ to CPU4-7 (this is an 8-CPU SMP box):
|
||||
/proc/irq/default_smp_affinity specifies default affinity mask that applies
|
||||
to all non-active IRQs. Once IRQ is allocated/activated its affinity bitmask
|
||||
will be set to the default mask. It can then be changed as described above.
|
||||
Default mask is 0xffffffff.
|
||||
|
||||
Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting
|
||||
it to CPU4-7 (this is an 8-CPU SMP box):
|
||||
|
||||
[root@moon 44]# cd /proc/irq/44
|
||||
[root@moon 44]# cat smp_affinity
|
||||
ffffffff
|
||||
|
||||
[root@moon 44]# echo 0f > smp_affinity
|
||||
[root@moon 44]# cat smp_affinity
|
||||
0000000f
|
||||
@@ -21,17 +30,27 @@ PING hell (195.4.7.3): 56 data bytes
|
||||
--- hell ping statistics ---
|
||||
6029 packets transmitted, 6027 packets received, 0% packet loss
|
||||
round-trip min/avg/max = 0.1/0.1/0.4 ms
|
||||
[root@moon 44]# cat /proc/interrupts | grep 44:
|
||||
44: 0 1785 1785 1783 1783 1
|
||||
1 0 IO-APIC-level eth1
|
||||
[root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
|
||||
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
|
||||
44: 1068 1785 1785 1783 0 0 0 0 IO-APIC-level eth1
|
||||
|
||||
As can be seen from the line above IRQ44 was delivered only to the first four
|
||||
processors (0-3).
|
||||
Now lets restrict that IRQ to CPU(4-7).
|
||||
|
||||
[root@moon 44]# echo f0 > smp_affinity
|
||||
[root@moon 44]# cat smp_affinity
|
||||
000000f0
|
||||
[root@moon 44]# ping -f h
|
||||
PING hell (195.4.7.3): 56 data bytes
|
||||
..
|
||||
--- hell ping statistics ---
|
||||
2779 packets transmitted, 2777 packets received, 0% packet loss
|
||||
round-trip min/avg/max = 0.1/0.5/585.4 ms
|
||||
[root@moon 44]# cat /proc/interrupts | grep 44:
|
||||
44: 1068 1785 1785 1784 1784 1069 1070 1069 IO-APIC-level eth1
|
||||
[root@moon 44]#
|
||||
[root@moon 44]# cat /proc/interrupts | 'CPU\|44:'
|
||||
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
|
||||
44: 1068 1785 1785 1783 1784 1069 1070 1069 IO-APIC-level eth1
|
||||
|
||||
This time around IRQ44 was delivered only to the last four processors.
|
||||
i.e counters for the CPU0-3 did not change.
|
||||
|
||||
|
||||
@@ -93,6 +93,9 @@ Since NMI handlers disable preemption, synchronize_sched() is guaranteed
|
||||
not to return until all ongoing NMI handlers exit. It is therefore safe
|
||||
to free up the handler's data as soon as synchronize_sched() returns.
|
||||
|
||||
Important note: for this to work, the architecture in question must
|
||||
invoke irq_enter() and irq_exit() on NMI entry and exit, respectively.
|
||||
|
||||
|
||||
Answer to Quick Quiz
|
||||
|
||||
|
||||
@@ -52,6 +52,10 @@ of each iteration. Unfortunately, chaotic relaxation requires highly
|
||||
structured data, such as the matrices used in scientific programs, and
|
||||
is thus inapplicable to most data structures in operating-system kernels.
|
||||
|
||||
In 1992, Henry (now Alexia) Massalin completed a dissertation advising
|
||||
parallel programmers to defer processing when feasible to simplify
|
||||
synchronization. RCU makes extremely heavy use of this advice.
|
||||
|
||||
In 1993, Jacobson [Jacobson93] verbally described what is perhaps the
|
||||
simplest deferred-free technique: simply waiting a fixed amount of time
|
||||
before freeing blocks awaiting deferred free. Jacobson did not describe
|
||||
@@ -138,6 +142,13 @@ blocking in read-side critical sections appeared [PaulEMcKenney2006c],
|
||||
Robert Olsson described an RCU-protected trie-hash combination
|
||||
[RobertOlsson2006a].
|
||||
|
||||
2007 saw the journal version of the award-winning RCU paper from 2006
|
||||
[ThomasEHart2007a], as well as a paper demonstrating use of Promela
|
||||
and Spin to mechanically verify an optimization to Oleg Nesterov's
|
||||
QRCU [PaulEMcKenney2007QRCUspin], a design document describing
|
||||
preemptible RCU [PaulEMcKenney2007PreemptibleRCU], and the three-part
|
||||
LWN "What is RCU?" series [PaulEMcKenney2007WhatIsRCUFundamentally,
|
||||
PaulEMcKenney2008WhatIsRCUUsage, and PaulEMcKenney2008WhatIsRCUAPI].
|
||||
|
||||
Bibtex Entries
|
||||
|
||||
@@ -202,6 +213,20 @@ Bibtex Entries
|
||||
,Year="1991"
|
||||
}
|
||||
|
||||
@phdthesis{HMassalinPhD
|
||||
,author="H. Massalin"
|
||||
,title="Synthesis: An Efficient Implementation of Fundamental Operating
|
||||
System Services"
|
||||
,school="Columbia University"
|
||||
,address="New York, NY"
|
||||
,year="1992"
|
||||
,annotation="
|
||||
Mondo optimizing compiler.
|
||||
Wait-free stuff.
|
||||
Good advice: defer work to avoid synchronization.
|
||||
"
|
||||
}
|
||||
|
||||
@unpublished{Jacobson93
|
||||
,author="Van Jacobson"
|
||||
,title="Avoid Read-Side Locking Via Delayed Free"
|
||||
@@ -635,3 +660,86 @@ Revised:
|
||||
"
|
||||
}
|
||||
|
||||
@unpublished{PaulEMcKenney2007PreemptibleRCU
|
||||
,Author="Paul E. McKenney"
|
||||
,Title="The design of preemptible read-copy-update"
|
||||
,month="October"
|
||||
,day="8"
|
||||
,year="2007"
|
||||
,note="Available:
|
||||
\url{http://lwn.net/Articles/253651/}
|
||||
[Viewed October 25, 2007]"
|
||||
,annotation="
|
||||
LWN article describing the design of preemptible RCU.
|
||||
"
|
||||
}
|
||||
|
||||
########################################################################
|
||||
#
|
||||
# "What is RCU?" LWN series.
|
||||
#
|
||||
|
||||
@unpublished{PaulEMcKenney2007WhatIsRCUFundamentally
|
||||
,Author="Paul E. McKenney and Jonathan Walpole"
|
||||
,Title="What is {RCU}, Fundamentally?"
|
||||
,month="December"
|
||||
,day="17"
|
||||
,year="2007"
|
||||
,note="Available:
|
||||
\url{http://lwn.net/Articles/262464/}
|
||||
[Viewed December 27, 2007]"
|
||||
,annotation="
|
||||
Lays out the three basic components of RCU: (1) publish-subscribe,
|
||||
(2) wait for pre-existing readers to complete, and (2) maintain
|
||||
multiple versions.
|
||||
"
|
||||
}
|
||||
|
||||
@unpublished{PaulEMcKenney2008WhatIsRCUUsage
|
||||
,Author="Paul E. McKenney"
|
||||
,Title="What is {RCU}? Part 2: Usage"
|
||||
,month="January"
|
||||
,day="4"
|
||||
,year="2008"
|
||||
,note="Available:
|
||||
\url{http://lwn.net/Articles/263130/}
|
||||
[Viewed January 4, 2008]"
|
||||
,annotation="
|
||||
Lays out six uses of RCU:
|
||||
1. RCU is a Reader-Writer Lock Replacement
|
||||
2. RCU is a Restricted Reference-Counting Mechanism
|
||||
3. RCU is a Bulk Reference-Counting Mechanism
|
||||
4. RCU is a Poor Man's Garbage Collector
|
||||
5. RCU is a Way of Providing Existence Guarantees
|
||||
6. RCU is a Way of Waiting for Things to Finish
|
||||
"
|
||||
}
|
||||
|
||||
@unpublished{PaulEMcKenney2008WhatIsRCUAPI
|
||||
,Author="Paul E. McKenney"
|
||||
,Title="{RCU} part 3: the {RCU} {API}"
|
||||
,month="January"
|
||||
,day="17"
|
||||
,year="2008"
|
||||
,note="Available:
|
||||
\url{http://lwn.net/Articles/264090/}
|
||||
[Viewed January 10, 2008]"
|
||||
,annotation="
|
||||
Gives an overview of the Linux-kernel RCU API and a brief annotated RCU
|
||||
bibliography.
|
||||
"
|
||||
}
|
||||
|
||||
@article{DinakarGuniguntala2008IBMSysJ
|
||||
,author="D. Guniguntala and P. E. McKenney and J. Triplett and J. Walpole"
|
||||
,title="The read-copy-update mechanism for supporting real-time applications on shared-memory multiprocessor systems with {Linux}"
|
||||
,Year="2008"
|
||||
,Month="April"
|
||||
,journal="IBM Systems Journal"
|
||||
,volume="47"
|
||||
,number="2"
|
||||
,pages="@@-@@"
|
||||
,annotation="
|
||||
RCU, realtime RCU, sleepable RCU, performance.
|
||||
"
|
||||
}
|
||||
|
||||
@@ -13,10 +13,13 @@ over a rather long period of time, but improvements are always welcome!
|
||||
detailed performance measurements show that RCU is nonetheless
|
||||
the right tool for the job.
|
||||
|
||||
The other exception would be where performance is not an issue,
|
||||
and RCU provides a simpler implementation. An example of this
|
||||
situation is the dynamic NMI code in the Linux 2.6 kernel,
|
||||
at least on architectures where NMIs are rare.
|
||||
Another exception is where performance is not an issue, and RCU
|
||||
provides a simpler implementation. An example of this situation
|
||||
is the dynamic NMI code in the Linux 2.6 kernel, at least on
|
||||
architectures where NMIs are rare.
|
||||
|
||||
Yet another exception is where the low real-time latency of RCU's
|
||||
read-side primitives is critically important.
|
||||
|
||||
1. Does the update code have proper mutual exclusion?
|
||||
|
||||
@@ -39,9 +42,10 @@ over a rather long period of time, but improvements are always welcome!
|
||||
|
||||
2. Do the RCU read-side critical sections make proper use of
|
||||
rcu_read_lock() and friends? These primitives are needed
|
||||
to suppress preemption (or bottom halves, in the case of
|
||||
rcu_read_lock_bh()) in the read-side critical sections,
|
||||
and are also an excellent aid to readability.
|
||||
to prevent grace periods from ending prematurely, which
|
||||
could result in data being unceremoniously freed out from
|
||||
under your read-side code, which can greatly increase the
|
||||
actuarial risk of your kernel.
|
||||
|
||||
As a rough rule of thumb, any dereference of an RCU-protected
|
||||
pointer must be covered by rcu_read_lock() or rcu_read_lock_bh()
|
||||
@@ -54,15 +58,30 @@ over a rather long period of time, but improvements are always welcome!
|
||||
be running while updates are in progress. There are a number
|
||||
of ways to handle this concurrency, depending on the situation:
|
||||
|
||||
a. Make updates appear atomic to readers. For example,
|
||||
a. Use the RCU variants of the list and hlist update
|
||||
primitives to add, remove, and replace elements on an
|
||||
RCU-protected list. Alternatively, use the RCU-protected
|
||||
trees that have been added to the Linux kernel.
|
||||
|
||||
This is almost always the best approach.
|
||||
|
||||
b. Proceed as in (a) above, but also maintain per-element
|
||||
locks (that are acquired by both readers and writers)
|
||||
that guard per-element state. Of course, fields that
|
||||
the readers refrain from accessing can be guarded by the
|
||||
update-side lock.
|
||||
|
||||
This works quite well, also.
|
||||
|
||||
c. Make updates appear atomic to readers. For example,
|
||||
pointer updates to properly aligned fields will appear
|
||||
atomic, as will individual atomic primitives. Operations
|
||||
performed under a lock and sequences of multiple atomic
|
||||
primitives will -not- appear to be atomic.
|
||||
|
||||
This is almost always the best approach.
|
||||
This can work, but is starting to get a bit tricky.
|
||||
|
||||
b. Carefully order the updates and the reads so that
|
||||
d. Carefully order the updates and the reads so that
|
||||
readers see valid data at all phases of the update.
|
||||
This is often more difficult than it sounds, especially
|
||||
given modern CPUs' tendency to reorder memory references.
|
||||
@@ -123,18 +142,22 @@ over a rather long period of time, but improvements are always welcome!
|
||||
when publicizing a pointer to a structure that can
|
||||
be traversed by an RCU read-side critical section.
|
||||
|
||||
5. If call_rcu(), or a related primitive such as call_rcu_bh(),
|
||||
is used, the callback function must be written to be called
|
||||
from softirq context. In particular, it cannot block.
|
||||
5. If call_rcu(), or a related primitive such as call_rcu_bh() or
|
||||
call_rcu_sched(), is used, the callback function must be
|
||||
written to be called from softirq context. In particular,
|
||||
it cannot block.
|
||||
|
||||
6. Since synchronize_rcu() can block, it cannot be called from
|
||||
any sort of irq context.
|
||||
any sort of irq context. Ditto for synchronize_sched() and
|
||||
synchronize_srcu().
|
||||
|
||||
7. If the updater uses call_rcu(), then the corresponding readers
|
||||
must use rcu_read_lock() and rcu_read_unlock(). If the updater
|
||||
uses call_rcu_bh(), then the corresponding readers must use
|
||||
rcu_read_lock_bh() and rcu_read_unlock_bh(). Mixing things up
|
||||
will result in confusion and broken kernels.
|
||||
rcu_read_lock_bh() and rcu_read_unlock_bh(). If the updater
|
||||
uses call_rcu_sched(), then the corresponding readers must
|
||||
disable preemption. Mixing things up will result in confusion
|
||||
and broken kernels.
|
||||
|
||||
One exception to this rule: rcu_read_lock() and rcu_read_unlock()
|
||||
may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh()
|
||||
@@ -143,9 +166,9 @@ over a rather long period of time, but improvements are always welcome!
|
||||
such cases is a must, of course! And the jury is still out on
|
||||
whether the increased speed is worth it.
|
||||
|
||||
8. Although synchronize_rcu() is a bit slower than is call_rcu(),
|
||||
it usually results in simpler code. So, unless update
|
||||
performance is critically important or the updaters cannot block,
|
||||
8. Although synchronize_rcu() is slower than is call_rcu(), it
|
||||
usually results in simpler code. So, unless update performance
|
||||
is critically important or the updaters cannot block,
|
||||
synchronize_rcu() should be used in preference to call_rcu().
|
||||
|
||||
An especially important property of the synchronize_rcu()
|
||||
@@ -187,23 +210,23 @@ over a rather long period of time, but improvements are always welcome!
|
||||
number of updates per grace period.
|
||||
|
||||
9. All RCU list-traversal primitives, which include
|
||||
list_for_each_rcu(), list_for_each_entry_rcu(),
|
||||
rcu_dereference(), list_for_each_rcu(), list_for_each_entry_rcu(),
|
||||
list_for_each_continue_rcu(), and list_for_each_safe_rcu(),
|
||||
must be within an RCU read-side critical section. RCU
|
||||
must be either within an RCU read-side critical section or
|
||||
must be protected by appropriate update-side locks. RCU
|
||||
read-side critical sections are delimited by rcu_read_lock()
|
||||
and rcu_read_unlock(), or by similar primitives such as
|
||||
rcu_read_lock_bh() and rcu_read_unlock_bh().
|
||||
|
||||
Use of the _rcu() list-traversal primitives outside of an
|
||||
RCU read-side critical section causes no harm other than
|
||||
a slight performance degradation on Alpha CPUs. It can
|
||||
also be quite helpful in reducing code bloat when common
|
||||
code is shared between readers and updaters.
|
||||
The reason that it is permissible to use RCU list-traversal
|
||||
primitives when the update-side lock is held is that doing so
|
||||
can be quite helpful in reducing code bloat when common code is
|
||||
shared between readers and updaters.
|
||||
|
||||
10. Conversely, if you are in an RCU read-side critical section,
|
||||
you -must- use the "_rcu()" variants of the list macros.
|
||||
Failing to do so will break Alpha and confuse people reading
|
||||
your code.
|
||||
and you don't hold the appropriate update-side lock, you -must-
|
||||
use the "_rcu()" variants of the list macros. Failing to do so
|
||||
will break Alpha and confuse people reading your code.
|
||||
|
||||
11. Note that synchronize_rcu() -only- guarantees to wait until
|
||||
all currently executing rcu_read_lock()-protected RCU read-side
|
||||
@@ -230,6 +253,14 @@ over a rather long period of time, but improvements are always welcome!
|
||||
must use whatever locking or other synchronization is required
|
||||
to safely access and/or modify that data structure.
|
||||
|
||||
RCU callbacks are -usually- executed on the same CPU that executed
|
||||
the corresponding call_rcu(), call_rcu_bh(), or call_rcu_sched(),
|
||||
but are by -no- means guaranteed to be. For example, if a given
|
||||
CPU goes offline while having an RCU callback pending, then that
|
||||
RCU callback will execute on some surviving CPU. (If this was
|
||||
not the case, a self-spawning RCU callback would prevent the
|
||||
victim CPU from ever going offline.)
|
||||
|
||||
14. SRCU (srcu_read_lock(), srcu_read_unlock(), and synchronize_srcu())
|
||||
may only be invoked from process context. Unlike other forms of
|
||||
RCU, it -is- permissible to block in an SRCU read-side critical
|
||||
|
||||
@@ -10,23 +10,30 @@ status messages via printk(), which can be examined via the dmesg
|
||||
command (perhaps grepping for "torture"). The test is started
|
||||
when the module is loaded, and stops when the module is unloaded.
|
||||
|
||||
However, actually setting this config option to "y" results in the system
|
||||
running the test immediately upon boot, and ending only when the system
|
||||
is taken down. Normally, one will instead want to build the system
|
||||
with CONFIG_RCU_TORTURE_TEST=m and to use modprobe and rmmod to control
|
||||
the test, perhaps using a script similar to the one shown at the end of
|
||||
this document. Note that you will need CONFIG_MODULE_UNLOAD in order
|
||||
to be able to end the test.
|
||||
CONFIG_RCU_TORTURE_TEST_RUNNABLE
|
||||
|
||||
It is also possible to specify CONFIG_RCU_TORTURE_TEST=y, which will
|
||||
result in the tests being loaded into the base kernel. In this case,
|
||||
the CONFIG_RCU_TORTURE_TEST_RUNNABLE config option is used to specify
|
||||
whether the RCU torture tests are to be started immediately during
|
||||
boot or whether the /proc/sys/kernel/rcutorture_runnable file is used
|
||||
to enable them. This /proc file can be used to repeatedly pause and
|
||||
restart the tests, regardless of the initial state specified by the
|
||||
CONFIG_RCU_TORTURE_TEST_RUNNABLE config option.
|
||||
|
||||
You will normally -not- want to start the RCU torture tests during boot
|
||||
(and thus the default is CONFIG_RCU_TORTURE_TEST_RUNNABLE=n), but doing
|
||||
this can sometimes be useful in finding boot-time bugs.
|
||||
|
||||
|
||||
MODULE PARAMETERS
|
||||
|
||||
This module has the following parameters:
|
||||
|
||||
nreaders This is the number of RCU reading threads supported.
|
||||
The default is twice the number of CPUs. Why twice?
|
||||
To properly exercise RCU implementations with preemptible
|
||||
read-side critical sections.
|
||||
irqreaders Says to invoke RCU readers from irq level. This is currently
|
||||
done via timers. Defaults to "1" for variants of RCU that
|
||||
permit this. (Or, more accurately, variants of RCU that do
|
||||
-not- permit this know to ignore this variable.)
|
||||
|
||||
nfakewriters This is the number of RCU fake writer threads to run. Fake
|
||||
writer threads repeatedly use the synchronous "wait for
|
||||
@@ -37,6 +44,16 @@ nfakewriters This is the number of RCU fake writer threads to run. Fake
|
||||
to trigger special cases caused by multiple writers, such as
|
||||
the synchronize_srcu() early return optimization.
|
||||
|
||||
nreaders This is the number of RCU reading threads supported.
|
||||
The default is twice the number of CPUs. Why twice?
|
||||
To properly exercise RCU implementations with preemptible
|
||||
read-side critical sections.
|
||||
|
||||
shuffle_interval
|
||||
The number of seconds to keep the test threads affinitied
|
||||
to a particular subset of the CPUs, defaults to 3 seconds.
|
||||
Used in conjunction with test_no_idle_hz.
|
||||
|
||||
stat_interval The number of seconds between output of torture
|
||||
statistics (via printk()). Regardless of the interval,
|
||||
statistics are printed when the module is unloaded.
|
||||
@@ -44,10 +61,11 @@ stat_interval The number of seconds between output of torture
|
||||
be printed -only- when the module is unloaded, and this
|
||||
is the default.
|
||||
|
||||
shuffle_interval
|
||||
The number of seconds to keep the test threads affinitied
|
||||
to a particular subset of the CPUs, defaults to 5 seconds.
|
||||
Used in conjunction with test_no_idle_hz.
|
||||
stutter The length of time to run the test before pausing for this
|
||||
same period of time. Defaults to "stutter=5", so as
|
||||
to run and pause for (roughly) five-second intervals.
|
||||
Specifying "stutter=0" causes the test to run continuously
|
||||
without pausing, which is the old default behavior.
|
||||
|
||||
test_no_idle_hz Whether or not to test the ability of RCU to operate in
|
||||
a kernel that disables the scheduling-clock interrupt to
|
||||
|
||||
@@ -1,3 +1,11 @@
|
||||
Please note that the "What is RCU?" LWN series is an excellent place
|
||||
to start learning about RCU:
|
||||
|
||||
1. What is RCU, Fundamentally? http://lwn.net/Articles/262464/
|
||||
2. What is RCU? Part 2: Usage http://lwn.net/Articles/263130/
|
||||
3. RCU part 3: the RCU API http://lwn.net/Articles/264090/
|
||||
|
||||
|
||||
What is RCU?
|
||||
|
||||
RCU is a synchronization mechanism that was added to the Linux kernel
|
||||
@@ -772,26 +780,18 @@ Linux-kernel source code, but it helps to have a full list of the
|
||||
APIs, since there does not appear to be a way to categorize them
|
||||
in docbook. Here is the list, by category.
|
||||
|
||||
Markers for RCU read-side critical sections:
|
||||
|
||||
rcu_read_lock
|
||||
rcu_read_unlock
|
||||
rcu_read_lock_bh
|
||||
rcu_read_unlock_bh
|
||||
srcu_read_lock
|
||||
srcu_read_unlock
|
||||
|
||||
RCU pointer/list traversal:
|
||||
|
||||
rcu_dereference
|
||||
list_for_each_rcu (to be deprecated in favor of
|
||||
list_for_each_entry_rcu)
|
||||
list_for_each_entry_rcu
|
||||
list_for_each_continue_rcu (to be deprecated in favor of new
|
||||
list_for_each_entry_continue_rcu)
|
||||
hlist_for_each_entry_rcu
|
||||
|
||||
RCU pointer update:
|
||||
list_for_each_rcu (to be deprecated in favor of
|
||||
list_for_each_entry_rcu)
|
||||
list_for_each_continue_rcu (to be deprecated in favor of new
|
||||
list_for_each_entry_continue_rcu)
|
||||
|
||||
RCU pointer/list update:
|
||||
|
||||
rcu_assign_pointer
|
||||
list_add_rcu
|
||||
@@ -799,16 +799,36 @@ RCU pointer update:
|
||||
list_del_rcu
|
||||
list_replace_rcu
|
||||
hlist_del_rcu
|
||||
hlist_add_after_rcu
|
||||
hlist_add_before_rcu
|
||||
hlist_add_head_rcu
|
||||
hlist_replace_rcu
|
||||
list_splice_init_rcu()
|
||||
|
||||
RCU grace period:
|
||||
RCU: Critical sections Grace period Barrier
|
||||
|
||||
rcu_read_lock synchronize_net rcu_barrier
|
||||
rcu_read_unlock synchronize_rcu
|
||||
call_rcu
|
||||
|
||||
|
||||
bh: Critical sections Grace period Barrier
|
||||
|
||||
rcu_read_lock_bh call_rcu_bh rcu_barrier_bh
|
||||
rcu_read_unlock_bh
|
||||
|
||||
|
||||
sched: Critical sections Grace period Barrier
|
||||
|
||||
[preempt_disable] synchronize_sched rcu_barrier_sched
|
||||
[and friends] call_rcu_sched
|
||||
|
||||
|
||||
SRCU: Critical sections Grace period Barrier
|
||||
|
||||
srcu_read_lock synchronize_srcu N/A
|
||||
srcu_read_unlock
|
||||
|
||||
synchronize_net
|
||||
synchronize_sched
|
||||
synchronize_rcu
|
||||
synchronize_srcu
|
||||
call_rcu
|
||||
call_rcu_bh
|
||||
|
||||
See the comment headers in the source code (or the docbook generated
|
||||
from them) for more information.
|
||||
|
||||
@@ -0,0 +1,327 @@
|
||||
----------------------------------------------------------------------
|
||||
1. INTRODUCTION
|
||||
|
||||
Modern filesystems feature checksumming of data and metadata to
|
||||
protect against data corruption. However, the detection of the
|
||||
corruption is done at read time which could potentially be months
|
||||
after the data was written. At that point the original data that the
|
||||
application tried to write is most likely lost.
|
||||
|
||||
The solution is to ensure that the disk is actually storing what the
|
||||
application meant it to. Recent additions to both the SCSI family
|
||||
protocols (SBC Data Integrity Field, SCC protection proposal) as well
|
||||
as SATA/T13 (External Path Protection) try to remedy this by adding
|
||||
support for appending integrity metadata to an I/O. The integrity
|
||||
metadata (or protection information in SCSI terminology) includes a
|
||||
checksum for each sector as well as an incrementing counter that
|
||||
ensures the individual sectors are written in the right order. And
|
||||
for some protection schemes also that the I/O is written to the right
|
||||
place on disk.
|
||||
|
||||
Current storage controllers and devices implement various protective
|
||||
measures, for instance checksumming and scrubbing. But these
|
||||
technologies are working in their own isolated domains or at best
|
||||
between adjacent nodes in the I/O path. The interesting thing about
|
||||
DIF and the other integrity extensions is that the protection format
|
||||
is well defined and every node in the I/O path can verify the
|
||||
integrity of the I/O and reject it if corruption is detected. This
|
||||
allows not only corruption prevention but also isolation of the point
|
||||
of failure.
|
||||
|
||||
----------------------------------------------------------------------
|
||||
2. THE DATA INTEGRITY EXTENSIONS
|
||||
|
||||
As written, the protocol extensions only protect the path between
|
||||
controller and storage device. However, many controllers actually
|
||||
allow the operating system to interact with the integrity metadata
|
||||
(IMD). We have been working with several FC/SAS HBA vendors to enable
|
||||
the protection information to be transferred to and from their
|
||||
controllers.
|
||||
|
||||
The SCSI Data Integrity Field works by appending 8 bytes of protection
|
||||
information to each sector. The data + integrity metadata is stored
|
||||
in 520 byte sectors on disk. Data + IMD are interleaved when
|
||||
transferred between the controller and target. The T13 proposal is
|
||||
similar.
|
||||
|
||||
Because it is highly inconvenient for operating systems to deal with
|
||||
520 (and 4104) byte sectors, we approached several HBA vendors and
|
||||
encouraged them to allow separation of the data and integrity metadata
|
||||
scatter-gather lists.
|
||||
|
||||
The controller will interleave the buffers on write and split them on
|
||||
read. This means that the Linux can DMA the data buffers to and from
|
||||
host memory without changes to the page cache.
|
||||
|
||||
Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
|
||||
is somewhat heavy to compute in software. Benchmarks found that
|
||||
calculating this checksum had a significant impact on system
|
||||
performance for a number of workloads. Some controllers allow a
|
||||
lighter-weight checksum to be used when interfacing with the operating
|
||||
system. Emulex, for instance, supports the TCP/IP checksum instead.
|
||||
The IP checksum received from the OS is converted to the 16-bit CRC
|
||||
when writing and vice versa. This allows the integrity metadata to be
|
||||
generated by Linux or the application at very low cost (comparable to
|
||||
software RAID5).
|
||||
|
||||
The IP checksum is weaker than the CRC in terms of detecting bit
|
||||
errors. However, the strength is really in the separation of the data
|
||||
buffers and the integrity metadata. These two distinct buffers much
|
||||
match up for an I/O to complete.
|
||||
|
||||
The separation of the data and integrity metadata buffers as well as
|
||||
the choice in checksums is referred to as the Data Integrity
|
||||
Extensions. As these extensions are outside the scope of the protocol
|
||||
bodies (T10, T13), Oracle and its partners are trying to standardize
|
||||
them within the Storage Networking Industry Association.
|
||||
|
||||
----------------------------------------------------------------------
|
||||
3. KERNEL CHANGES
|
||||
|
||||
The data integrity framework in Linux enables protection information
|
||||
to be pinned to I/Os and sent to/received from controllers that
|
||||
support it.
|
||||
|
||||
The advantage to the integrity extensions in SCSI and SATA is that
|
||||
they enable us to protect the entire path from application to storage
|
||||
device. However, at the same time this is also the biggest
|
||||
disadvantage. It means that the protection information must be in a
|
||||
format that can be understood by the disk.
|
||||
|
||||
Generally Linux/POSIX applications are agnostic to the intricacies of
|
||||
the storage devices they are accessing. The virtual filesystem switch
|
||||
and the block layer make things like hardware sector size and
|
||||
transport protocols completely transparent to the application.
|
||||
|
||||
However, this level of detail is required when preparing the
|
||||
protection information to send to a disk. Consequently, the very
|
||||
concept of an end-to-end protection scheme is a layering violation.
|
||||
It is completely unreasonable for an application to be aware whether
|
||||
it is accessing a SCSI or SATA disk.
|
||||
|
||||
The data integrity support implemented in Linux attempts to hide this
|
||||
from the application. As far as the application (and to some extent
|
||||
the kernel) is concerned, the integrity metadata is opaque information
|
||||
that's attached to the I/O.
|
||||
|
||||
The current implementation allows the block layer to automatically
|
||||
generate the protection information for any I/O. Eventually the
|
||||
intent is to move the integrity metadata calculation to userspace for
|
||||
user data. Metadata and other I/O that originates within the kernel
|
||||
will still use the automatic generation interface.
|
||||
|
||||
Some storage devices allow each hardware sector to be tagged with a
|
||||
16-bit value. The owner of this tag space is the owner of the block
|
||||
device. I.e. the filesystem in most cases. The filesystem can use
|
||||
this extra space to tag sectors as they see fit. Because the tag
|
||||
space is limited, the block interface allows tagging bigger chunks by
|
||||
way of interleaving. This way, 8*16 bits of information can be
|
||||
attached to a typical 4KB filesystem block.
|
||||
|
||||
This also means that applications such as fsck and mkfs will need
|
||||
access to manipulate the tags from user space. A passthrough
|
||||
interface for this is being worked on.
|
||||
|
||||
|
||||
----------------------------------------------------------------------
|
||||
4. BLOCK LAYER IMPLEMENTATION DETAILS
|
||||
|
||||
4.1 BIO
|
||||
|
||||
The data integrity patches add a new field to struct bio when
|
||||
CONFIG_BLK_DEV_INTEGRITY is enabled. bio->bi_integrity is a pointer
|
||||
to a struct bip which contains the bio integrity payload. Essentially
|
||||
a bip is a trimmed down struct bio which holds a bio_vec containing
|
||||
the integrity metadata and the required housekeeping information (bvec
|
||||
pool, vector count, etc.)
|
||||
|
||||
A kernel subsystem can enable data integrity protection on a bio by
|
||||
calling bio_integrity_alloc(bio). This will allocate and attach the
|
||||
bip to the bio.
|
||||
|
||||
Individual pages containing integrity metadata can subsequently be
|
||||
attached using bio_integrity_add_page().
|
||||
|
||||
bio_free() will automatically free the bip.
|
||||
|
||||
|
||||
4.2 BLOCK DEVICE
|
||||
|
||||
Because the format of the protection data is tied to the physical
|
||||
disk, each block device has been extended with a block integrity
|
||||
profile (struct blk_integrity). This optional profile is registered
|
||||
with the block layer using blk_integrity_register().
|
||||
|
||||
The profile contains callback functions for generating and verifying
|
||||
the protection data, as well as getting and setting application tags.
|
||||
The profile also contains a few constants to aid in completing,
|
||||
merging and splitting the integrity metadata.
|
||||
|
||||
Layered block devices will need to pick a profile that's appropriate
|
||||
for all subdevices. blk_integrity_compare() can help with that. DM
|
||||
and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6
|
||||
will require extra work due to the application tag.
|
||||
|
||||
|
||||
----------------------------------------------------------------------
|
||||
5.0 BLOCK LAYER INTEGRITY API
|
||||
|
||||
5.1 NORMAL FILESYSTEM
|
||||
|
||||
The normal filesystem is unaware that the underlying block device
|
||||
is capable of sending/receiving integrity metadata. The IMD will
|
||||
be automatically generated by the block layer at submit_bio() time
|
||||
in case of a WRITE. A READ request will cause the I/O integrity
|
||||
to be verified upon completion.
|
||||
|
||||
IMD generation and verification can be toggled using the
|
||||
|
||||
/sys/block/<bdev>/integrity/write_generate
|
||||
|
||||
and
|
||||
|
||||
/sys/block/<bdev>/integrity/read_verify
|
||||
|
||||
flags.
|
||||
|
||||
|
||||
5.2 INTEGRITY-AWARE FILESYSTEM
|
||||
|
||||
A filesystem that is integrity-aware can prepare I/Os with IMD
|
||||
attached. It can also use the application tag space if this is
|
||||
supported by the block device.
|
||||
|
||||
|
||||
int bdev_integrity_enabled(block_device, int rw);
|
||||
|
||||
bdev_integrity_enabled() will return 1 if the block device
|
||||
supports integrity metadata transfer for the data direction
|
||||
specified in 'rw'.
|
||||
|
||||
bdev_integrity_enabled() honors the write_generate and
|
||||
read_verify flags in sysfs and will respond accordingly.
|
||||
|
||||
|
||||
int bio_integrity_prep(bio);
|
||||
|
||||
To generate IMD for WRITE and to set up buffers for READ, the
|
||||
filesystem must call bio_integrity_prep(bio).
|
||||
|
||||
Prior to calling this function, the bio data direction and start
|
||||
sector must be set, and the bio should have all data pages
|
||||
added. It is up to the caller to ensure that the bio does not
|
||||
change while I/O is in progress.
|
||||
|
||||
bio_integrity_prep() should only be called if
|
||||
bio_integrity_enabled() returned 1.
|
||||
|
||||
|
||||
int bio_integrity_tag_size(bio);
|
||||
|
||||
If the filesystem wants to use the application tag space it will
|
||||
first have to find out how much storage space is available.
|
||||
Because tag space is generally limited (usually 2 bytes per
|
||||
sector regardless of sector size), the integrity framework
|
||||
supports interleaving the information between the sectors in an
|
||||
I/O.
|
||||
|
||||
Filesystems can call bio_integrity_tag_size(bio) to find out how
|
||||
many bytes of storage are available for that particular bio.
|
||||
|
||||
Another option is bdev_get_tag_size(block_device) which will
|
||||
return the number of available bytes per hardware sector.
|
||||
|
||||
|
||||
int bio_integrity_set_tag(bio, void *tag_buf, len);
|
||||
|
||||
After a successful return from bio_integrity_prep(),
|
||||
bio_integrity_set_tag() can be used to attach an opaque tag
|
||||
buffer to a bio. Obviously this only makes sense if the I/O is
|
||||
a WRITE.
|
||||
|
||||
|
||||
int bio_integrity_get_tag(bio, void *tag_buf, len);
|
||||
|
||||
Similarly, at READ I/O completion time the filesystem can
|
||||
retrieve the tag buffer using bio_integrity_get_tag().
|
||||
|
||||
|
||||
6.3 PASSING EXISTING INTEGRITY METADATA
|
||||
|
||||
Filesystems that either generate their own integrity metadata or
|
||||
are capable of transferring IMD from user space can use the
|
||||
following calls:
|
||||
|
||||
|
||||
struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);
|
||||
|
||||
Allocates the bio integrity payload and hangs it off of the bio.
|
||||
nr_pages indicate how many pages of protection data need to be
|
||||
stored in the integrity bio_vec list (similar to bio_alloc()).
|
||||
|
||||
The integrity payload will be freed at bio_free() time.
|
||||
|
||||
|
||||
int bio_integrity_add_page(bio, page, len, offset);
|
||||
|
||||
Attaches a page containing integrity metadata to an existing
|
||||
bio. The bio must have an existing bip,
|
||||
i.e. bio_integrity_alloc() must have been called. For a WRITE,
|
||||
the integrity metadata in the pages must be in a format
|
||||
understood by the target device with the notable exception that
|
||||
the sector numbers will be remapped as the request traverses the
|
||||
I/O stack. This implies that the pages added using this call
|
||||
will be modified during I/O! The first reference tag in the
|
||||
integrity metadata must have a value of bip->bip_sector.
|
||||
|
||||
Pages can be added using bio_integrity_add_page() as long as
|
||||
there is room in the bip bio_vec array (nr_pages).
|
||||
|
||||
Upon completion of a READ operation, the attached pages will
|
||||
contain the integrity metadata received from the storage device.
|
||||
It is up to the receiver to process them and verify data
|
||||
integrity upon completion.
|
||||
|
||||
|
||||
6.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
|
||||
METADATA
|
||||
|
||||
To enable integrity exchange on a block device the gendisk must be
|
||||
registered as capable:
|
||||
|
||||
int blk_integrity_register(gendisk, blk_integrity);
|
||||
|
||||
The blk_integrity struct is a template and should contain the
|
||||
following:
|
||||
|
||||
static struct blk_integrity my_profile = {
|
||||
.name = "STANDARDSBODY-TYPE-VARIANT-CSUM",
|
||||
.generate_fn = my_generate_fn,
|
||||
.verify_fn = my_verify_fn,
|
||||
.get_tag_fn = my_get_tag_fn,
|
||||
.set_tag_fn = my_set_tag_fn,
|
||||
.tuple_size = sizeof(struct my_tuple_size),
|
||||
.tag_size = <tag bytes per hw sector>,
|
||||
};
|
||||
|
||||
'name' is a text string which will be visible in sysfs. This is
|
||||
part of the userland API so chose it carefully and never change
|
||||
it. The format is standards body-type-variant.
|
||||
E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.
|
||||
|
||||
'generate_fn' generates appropriate integrity metadata (for WRITE).
|
||||
|
||||
'verify_fn' verifies that the data buffer matches the integrity
|
||||
metadata.
|
||||
|
||||
'tuple_size' must be set to match the size of the integrity
|
||||
metadata per sector. I.e. 8 for DIF and EPP.
|
||||
|
||||
'tag_size' must be set to identify how many bytes of tag space
|
||||
are available per hardware sector. For DIF this is either 2 or
|
||||
0 depending on the value of the Control Mode Page ATO bit.
|
||||
|
||||
See 6.2 for a description of get_tag_fn and set_tag_fn.
|
||||
|
||||
----------------------------------------------------------------------
|
||||
2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>
|
||||
@@ -14,9 +14,8 @@ represent the thread siblings to cpu X in the same physical package;
|
||||
To implement it in an architecture-neutral way, a new source file,
|
||||
drivers/base/topology.c, is to export the 4 attributes.
|
||||
|
||||
If one architecture wants to support this feature, it just needs to
|
||||
implement 4 defines, typically in file include/asm-XXX/topology.h.
|
||||
The 4 defines are:
|
||||
For an architecture to support this feature, it must define some of
|
||||
these macros in include/asm-XXX/topology.h:
|
||||
#define topology_physical_package_id(cpu)
|
||||
#define topology_core_id(cpu)
|
||||
#define topology_thread_siblings(cpu)
|
||||
@@ -25,17 +24,10 @@ The 4 defines are:
|
||||
The type of **_id is int.
|
||||
The type of siblings is cpumask_t.
|
||||
|
||||
To be consistent on all architectures, the 4 attributes should have
|
||||
default values if their values are unavailable. Below is the rule.
|
||||
1) physical_package_id: If cpu has no physical package id, -1 is the
|
||||
default value.
|
||||
2) core_id: If cpu doesn't support multi-core, its core id is 0.
|
||||
3) thread_siblings: Just include itself, if the cpu doesn't support
|
||||
HT/multi-thread.
|
||||
4) core_siblings: Just include itself, if the cpu doesn't support
|
||||
multi-core and HT/Multi-thread.
|
||||
|
||||
So be careful when declaring the 4 defines in include/asm-XXX/topology.h.
|
||||
|
||||
If an attribute isn't defined on an architecture, it won't be exported.
|
||||
|
||||
To be consistent on all architectures, include/linux/topology.h
|
||||
provides default definitions for any of the above macros that are
|
||||
not defined by include/asm-XXX/topology.h:
|
||||
1) physical_package_id: -1
|
||||
2) core_id: 0
|
||||
3) thread_siblings: just the given CPU
|
||||
4) core_siblings: just the given CPU
|
||||
|
||||
@@ -222,13 +222,6 @@ Who: Thomas Gleixner <tglx@linutronix.de>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: i2c-i810, i2c-prosavage and i2c-savage4
|
||||
When: May 2008
|
||||
Why: These drivers are superseded by i810fb, intelfb and savagefb.
|
||||
Who: Jean Delvare <khali@linux-fr.org>
|
||||
|
||||
---------------------------
|
||||
|
||||
What (Why):
|
||||
- include/linux/netfilter_ipv4/ipt_TOS.h ipt_tos.h header files
|
||||
(superseded by xt_TOS/xt_tos target & match)
|
||||
|
||||
@@ -13,72 +13,93 @@ Mailing list: linux-ext4@vger.kernel.org
|
||||
1. Quick usage instructions:
|
||||
===========================
|
||||
|
||||
- Grab updated e2fsprogs from
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/
|
||||
This is a patchset on top of e2fsprogs-1.39, which can be found at
|
||||
- Compile and install the latest version of e2fsprogs (as of this
|
||||
writing version 1.41) from:
|
||||
|
||||
http://sourceforge.net/project/showfiles.php?group_id=2406
|
||||
|
||||
or
|
||||
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
|
||||
|
||||
- It's still mke2fs -j /dev/hda1
|
||||
or grab the latest git repository from:
|
||||
|
||||
- mount /dev/hda1 /wherever -t ext4dev
|
||||
git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
|
||||
|
||||
- To enable extents,
|
||||
- Create a new filesystem using the ext4dev filesystem type:
|
||||
|
||||
mount /dev/hda1 /wherever -t ext4dev -o extents
|
||||
# mke2fs -t ext4dev /dev/hda1
|
||||
|
||||
- The filesystem is compatible with the ext3 driver until you add a file
|
||||
which has extents (ie: `mount -o extents', then create a file).
|
||||
Or configure an existing ext3 filesystem to support extents and set
|
||||
the test_fs flag to indicate that it's ok for an in-development
|
||||
filesystem to touch this filesystem:
|
||||
|
||||
NOTE: The "extents" mount flag is temporary. It will soon go away and
|
||||
extents will be enabled by the "-o extents" flag to mke2fs or tune2fs
|
||||
# tune2fs -O extents -E test_fs /dev/hda1
|
||||
|
||||
If the filesystem was created with 128 byte inodes, it can be
|
||||
converted to use 256 byte for greater efficiency via:
|
||||
|
||||
# tune2fs -I 256 /dev/hda1
|
||||
|
||||
(Note: we currently do not have tools to convert an ext4dev
|
||||
filesystem back to ext3; so please do not do try this on production
|
||||
filesystems.)
|
||||
|
||||
- Mounting:
|
||||
|
||||
# mount -t ext4dev /dev/hda1 /wherever
|
||||
|
||||
- When comparing performance with other filesystems, remember that
|
||||
ext3/4 by default offers higher data integrity guarantees than most. So
|
||||
when comparing with a metadata-only journalling filesystem, use `mount -o
|
||||
data=writeback'. And you might as well use `mount -o nobh' too along
|
||||
with it. Making the journal larger than the mke2fs default often helps
|
||||
performance with metadata-intensive workloads.
|
||||
ext3/4 by default offers higher data integrity guarantees than most.
|
||||
So when comparing with a metadata-only journalling filesystem, such
|
||||
as ext3, use `mount -o data=writeback'. And you might as well use
|
||||
`mount -o nobh' too along with it. Making the journal larger than
|
||||
the mke2fs default often helps performance with metadata-intensive
|
||||
workloads.
|
||||
|
||||
2. Features
|
||||
===========
|
||||
|
||||
2.1 Currently available
|
||||
|
||||
* ability to use filesystems > 16TB
|
||||
* ability to use filesystems > 16TB (e2fsprogs support not available yet)
|
||||
* extent format reduces metadata overhead (RAM, IO for access, transactions)
|
||||
* extent format more robust in face of on-disk corruption due to magics,
|
||||
* internal redunancy in tree
|
||||
|
||||
2.1 Previously available, soon to be enabled by default by "mkefs.ext4":
|
||||
|
||||
* dir_index and resize inode will be on by default
|
||||
* large inodes will be used by default for fast EAs, nsec timestamps, etc
|
||||
* improved file allocation (multi-block alloc)
|
||||
* fix 32000 subdirectory limit
|
||||
* nsec timestamps for mtime, atime, ctime, create time
|
||||
* inode version field on disk (NFSv4, Lustre)
|
||||
* reduced e2fsck time via uninit_bg feature
|
||||
* journal checksumming for robustness, performance
|
||||
* persistent file preallocation (e.g for streaming media, databases)
|
||||
* ability to pack bitmaps and inode tables into larger virtual groups via the
|
||||
flex_bg feature
|
||||
* large file support
|
||||
* Inode allocation using large virtual block groups via flex_bg
|
||||
* delayed allocation
|
||||
* large block (up to pagesize) support
|
||||
* efficent new ordered mode in JBD2 and ext4(avoid using buffer head to force
|
||||
the ordering)
|
||||
|
||||
2.2 Candidate features for future inclusion
|
||||
|
||||
There are several under discussion, whether they all make it in is
|
||||
partly a function of how much time everyone has to work on them:
|
||||
* Online defrag (patches available but not well tested)
|
||||
* reduced mke2fs time via lazy itable initialization in conjuction with
|
||||
the uninit_bg feature (capability to do this is available in e2fsprogs
|
||||
but a kernel thread to do lazy zeroing of unused inode table blocks
|
||||
after filesystem is first mounted is required for safety)
|
||||
|
||||
* improved file allocation (multi-block alloc, delayed alloc; basically done)
|
||||
* fix 32000 subdirectory limit (patch exists, needs some e2fsck work)
|
||||
* nsec timestamps for mtime, atime, ctime, create time (patch exists,
|
||||
needs some e2fsck work)
|
||||
* inode version field on disk (NFSv4, Lustre; prototype exists)
|
||||
* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists)
|
||||
* journal checksumming for robustness, performance (prototype exists)
|
||||
* persistent file preallocation (e.g for streaming media, databases)
|
||||
There are several others under discussion, whether they all make it in is
|
||||
partly a function of how much time everyone has to work on them. Features like
|
||||
metadata checksumming have been discussed and planned for a bit but no patches
|
||||
exist yet so I'm not sure they're in the near-term roadmap.
|
||||
|
||||
Features like metadata checksumming have been discussed and planned for
|
||||
a bit but no patches exist yet so I'm not sure they're in the near-term
|
||||
roadmap.
|
||||
The big performance win will come with mballoc, delalloc and flex_bg
|
||||
grouping of bitmaps and inode tables. Some test results available here:
|
||||
|
||||
The big performance win will come with mballoc and delalloc. CFS has
|
||||
been using mballoc for a few years already with Lustre, and IBM + Bull
|
||||
did a lot of benchmarking on it. The reason it isn't in the first set of
|
||||
patches is partly a manageability issue, and partly because it doesn't
|
||||
directly affect the on-disk format (outside of much better allocation)
|
||||
so it isn't critical to get into the first round of changes. I believe
|
||||
Alex is working on a new set of patches right now.
|
||||
- http://www.bullopensource.org/ext4/20080530/ffsb-write-2.6.26-rc2.html
|
||||
- http://www.bullopensource.org/ext4/20080530/ffsb-readwrite-2.6.26-rc2.html
|
||||
|
||||
3. Options
|
||||
==========
|
||||
@@ -222,9 +243,11 @@ stripe=n Number of filesystem blocks that mballoc will try
|
||||
to use for allocation size and alignment. For RAID5/6
|
||||
systems this should be the number of data
|
||||
disks * RAID chunk size in file system blocks.
|
||||
|
||||
delalloc (*) Deferring block allocation until write-out time.
|
||||
nodelalloc Disable delayed allocation. Blocks are allocation
|
||||
when data is copied from user to page cache.
|
||||
Data Mode
|
||||
---------
|
||||
=========
|
||||
There are 3 different data modes:
|
||||
|
||||
* writeback mode
|
||||
@@ -236,10 +259,10 @@ typically provide the best ext4 performance.
|
||||
|
||||
* ordered mode
|
||||
In data=ordered mode, ext4 only officially journals metadata, but it logically
|
||||
groups metadata and data blocks into a single unit called a transaction. When
|
||||
it's time to write the new metadata out to disk, the associated data blocks
|
||||
are written first. In general, this mode performs slightly slower than
|
||||
writeback but significantly faster than journal mode.
|
||||
groups metadata information related to data changes with the data blocks into a
|
||||
single unit called a transaction. When it's time to write the new metadata
|
||||
out to disk, the associated data blocks are written first. In general,
|
||||
this mode performs slightly slower than writeback but significantly faster than journal mode.
|
||||
|
||||
* journal mode
|
||||
data=journal mode provides full data and metadata journaling. All new data is
|
||||
@@ -247,7 +270,8 @@ written to the journal first, and then to its final location.
|
||||
In the event of a crash, the journal can be replayed, bringing both data and
|
||||
metadata into a consistent state. This mode is the slowest except when data
|
||||
needs to be read from and written to disk at the same time where it
|
||||
outperforms all others modes.
|
||||
outperforms all others modes. Curently ext4 does not have delayed
|
||||
allocation support if this data journalling mode is selected.
|
||||
|
||||
References
|
||||
==========
|
||||
@@ -256,7 +280,8 @@ kernel source: <file:fs/ext4/>
|
||||
<file:fs/jbd2/>
|
||||
|
||||
programs: http://e2fsprogs.sourceforge.net/
|
||||
http://ext2resize.sourceforge.net
|
||||
|
||||
useful links: http://fedoraproject.org/wiki/ext3-devel
|
||||
http://www.bullopensource.org/ext4/
|
||||
http://ext4.wiki.kernel.org/index.php/Main_Page
|
||||
http://fedoraproject.org/wiki/Features/Ext4
|
||||
|
||||
@@ -0,0 +1,114 @@
|
||||
Glock internal locking rules
|
||||
------------------------------
|
||||
|
||||
This documents the basic principles of the glock state machine
|
||||
internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
|
||||
has two main (internal) locks:
|
||||
|
||||
1. A spinlock (gl_spin) which protects the internal state such
|
||||
as gl_state, gl_target and the list of holders (gl_holders)
|
||||
2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
|
||||
threads from making calls to the DLM, etc. at the same time. If a
|
||||
thread takes this lock, it must then call run_queue (usually via the
|
||||
workqueue) when it releases it in order to ensure any pending tasks
|
||||
are completed.
|
||||
|
||||
The gl_holders list contains all the queued lock requests (not
|
||||
just the holders) associated with the glock. If there are any
|
||||
held locks, then they will be contiguous entries at the head
|
||||
of the list. Locks are granted in strictly the order that they
|
||||
are queued, except for those marked LM_FLAG_PRIORITY which are
|
||||
used only during recovery, and even then only for journal locks.
|
||||
|
||||
There are three lock states that users of the glock layer can request,
|
||||
namely shared (SH), deferred (DF) and exclusive (EX). Those translate
|
||||
to the following DLM lock modes:
|
||||
|
||||
Glock mode | DLM lock mode
|
||||
------------------------------
|
||||
UN | IV/NL Unlocked (no DLM lock associated with glock) or NL
|
||||
SH | PR (Protected read)
|
||||
DF | CW (Concurrent write)
|
||||
EX | EX (Exclusive)
|
||||
|
||||
Thus DF is basically a shared mode which is incompatible with the "normal"
|
||||
shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
|
||||
operations. The glocks are basically a lock plus some routines which deal
|
||||
with cache management. The following rules apply for the cache:
|
||||
|
||||
Glock mode | Cache data | Cache Metadata | Dirty Data | Dirty Metadata
|
||||
--------------------------------------------------------------------------
|
||||
UN | No | No | No | No
|
||||
SH | Yes | Yes | No | No
|
||||
DF | No | Yes | No | No
|
||||
EX | Yes | Yes | Yes | Yes
|
||||
|
||||
These rules are implemented using the various glock operations which
|
||||
are defined for each type of glock. Not all types of glocks use
|
||||
all the modes. Only inode glocks use the DF mode for example.
|
||||
|
||||
Table of glock operations and per type constants:
|
||||
|
||||
Field | Purpose
|
||||
----------------------------------------------------------------------------
|
||||
go_xmote_th | Called before remote state change (e.g. to sync dirty data)
|
||||
go_xmote_bh | Called after remote state change (e.g. to refill cache)
|
||||
go_inval | Called if remote state change requires invalidating the cache
|
||||
go_demote_ok | Returns boolean value of whether its ok to demote a glock
|
||||
| (e.g. checks timeout, and that there is no cached data)
|
||||
go_lock | Called for the first local holder of a lock
|
||||
go_unlock | Called on the final local unlock of a lock
|
||||
go_dump | Called to print content of object for debugfs file, or on
|
||||
| error to dump glock to the log.
|
||||
go_type; | The type of the glock, LM_TYPE_.....
|
||||
go_min_hold_time | The minimum hold time
|
||||
|
||||
The minimum hold time for each lock is the time after a remote lock
|
||||
grant for which we ignore remote demote requests. This is in order to
|
||||
prevent a situation where locks are being bounced around the cluster
|
||||
from node to node with none of the nodes making any progress. This
|
||||
tends to show up most with shared mmaped files which are being written
|
||||
to by multiple nodes. By delaying the demotion in response to a
|
||||
remote callback, that gives the userspace program time to make
|
||||
some progress before the pages are unmapped.
|
||||
|
||||
There is a plan to try and remove the go_lock and go_unlock callbacks
|
||||
if possible, in order to try and speed up the fast path though the locking.
|
||||
Also, eventually we hope to make the glock "EX" mode locally shared
|
||||
such that any local locking will be done with the i_mutex as required
|
||||
rather than via the glock.
|
||||
|
||||
Locking rules for glock operations:
|
||||
|
||||
Operation | GLF_LOCK bit lock held | gl_spin spinlock held
|
||||
-----------------------------------------------------------------
|
||||
go_xmote_th | Yes | No
|
||||
go_xmote_bh | Yes | No
|
||||
go_inval | Yes | No
|
||||
go_demote_ok | Sometimes | Yes
|
||||
go_lock | Yes | No
|
||||
go_unlock | Yes | No
|
||||
go_dump | Sometimes | Yes
|
||||
|
||||
N.B. Operations must not drop either the bit lock or the spinlock
|
||||
if its held on entry. go_dump and do_demote_ok must never block.
|
||||
Note that go_dump will only be called if the glock's state
|
||||
indicates that it is caching uptodate data.
|
||||
|
||||
Glock locking order within GFS2:
|
||||
|
||||
1. i_mutex (if required)
|
||||
2. Rename glock (for rename only)
|
||||
3. Inode glock(s)
|
||||
(Parents before children, inodes at "same level" with same parent in
|
||||
lock number order)
|
||||
4. Rgrp glock(s) (for (de)allocation operations)
|
||||
5. Transaction glock (via gfs2_trans_begin) for non-read operations
|
||||
6. Page lock (always last, very important!)
|
||||
|
||||
There are two glocks per inode. One deals with access to the inode
|
||||
itself (locking order as above), and the other, known as the iopen
|
||||
glock is used in conjunction with the i_nlink field in the inode to
|
||||
determine the lifetime of the inode in question. Locking of inodes
|
||||
is on a per-inode basis. Locking of rgrps is on a per rgrp basis.
|
||||
|
||||
@@ -380,28 +380,35 @@ i386 and x86_64 platforms support the new IRQ vector displays.
|
||||
Of some interest is the introduction of the /proc/irq directory to 2.4.
|
||||
It could be used to set IRQ to CPU affinity, this means that you can "hook" an
|
||||
IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the
|
||||
irq subdir is one subdir for each IRQ, and one file; prof_cpu_mask
|
||||
irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and
|
||||
prof_cpu_mask.
|
||||
|
||||
For example
|
||||
> ls /proc/irq/
|
||||
0 10 12 14 16 18 2 4 6 8 prof_cpu_mask
|
||||
1 11 13 15 17 19 3 5 7 9
|
||||
1 11 13 15 17 19 3 5 7 9 default_smp_affinity
|
||||
> ls /proc/irq/0/
|
||||
smp_affinity
|
||||
|
||||
The contents of the prof_cpu_mask file and each smp_affinity file for each IRQ
|
||||
is the same by default:
|
||||
smp_affinity is a bitmask, in which you can specify which CPUs can handle the
|
||||
IRQ, you can set it by doing:
|
||||
|
||||
> cat /proc/irq/0/smp_affinity
|
||||
> echo 1 > /proc/irq/10/smp_affinity
|
||||
|
||||
This means that only the first CPU will handle the IRQ, but you can also echo
|
||||
5 which means that only the first and fourth CPU can handle the IRQ.
|
||||
|
||||
The contents of each smp_affinity file is the same by default:
|
||||
|
||||
> cat /proc/irq/0/smp_affinity
|
||||
ffffffff
|
||||
|
||||
It's a bitmask, in which you can specify which CPUs can handle the IRQ, you can
|
||||
set it by doing:
|
||||
The default_smp_affinity mask applies to all non-active IRQs, which are the
|
||||
IRQs which have not yet been allocated/activated, and hence which lack a
|
||||
/proc/irq/[0-9]* directory.
|
||||
|
||||
> echo 1 > /proc/irq/prof_cpu_mask
|
||||
|
||||
This means that only the first CPU will handle the IRQ, but you can also echo 5
|
||||
which means that only the first and fourth CPU can handle the IRQ.
|
||||
prof_cpu_mask specifies which CPUs are to be profiled by the system wide
|
||||
profiler. Default value is ffffffff (all cpus).
|
||||
|
||||
The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
|
||||
between all the CPUs which are allowed to handle it. As usual the kernel has
|
||||
|
||||
File diff suppressed because it is too large
Load Diff
@@ -1,47 +0,0 @@
|
||||
Kernel driver i2c-i810
|
||||
|
||||
Supported adapters:
|
||||
* Intel 82810, 82810-DC100, 82810E, and 82815 (GMCH)
|
||||
* Intel 82845G (GMCH)
|
||||
|
||||
Authors:
|
||||
Frodo Looijaard <frodol@dds.nl>,
|
||||
Philip Edelbrock <phil@netroedge.com>,
|
||||
Kyösti Mälkki <kmalkki@cc.hut.fi>,
|
||||
Ralph Metzler <rjkm@thp.uni-koeln.de>,
|
||||
Mark D. Studebaker <mdsxyz123@yahoo.com>
|
||||
|
||||
Main contact: Mark Studebaker <mdsxyz123@yahoo.com>
|
||||
|
||||
Description
|
||||
-----------
|
||||
|
||||
WARNING: If you have an '810' or '815' motherboard, your standard I2C
|
||||
temperature sensors are most likely on the 801's I2C bus. You want the
|
||||
i2c-i801 driver for those, not this driver.
|
||||
|
||||
Now for the i2c-i810...
|
||||
|
||||
The GMCH chip contains two I2C interfaces.
|
||||
|
||||
The first interface is used for DDC (Data Display Channel) which is a
|
||||
serial channel through the VGA monitor connector to a DDC-compliant
|
||||
monitor. This interface is defined by the Video Electronics Standards
|
||||
Association (VESA). The standards are available for purchase at
|
||||
http://www.vesa.org .
|
||||
|
||||
The second interface is a general-purpose I2C bus. It may be connected to a
|
||||
TV-out chip such as the BT869 or possibly to a digital flat-panel display.
|
||||
|
||||
Features
|
||||
--------
|
||||
|
||||
Both busses use the i2c-algo-bit driver for 'bit banging'
|
||||
and support for specific transactions is provided by i2c-algo-bit.
|
||||
|
||||
Issues
|
||||
------
|
||||
|
||||
If you enable bus testing in i2c-algo-bit (insmod i2c-algo-bit bit_test=1),
|
||||
the test may fail; if so, the i2c-i810 driver won't be inserted. However,
|
||||
we think this has been fixed.
|
||||
@@ -1,23 +0,0 @@
|
||||
Kernel driver i2c-prosavage
|
||||
|
||||
Supported adapters:
|
||||
|
||||
S3/VIA KM266/VT8375 aka ProSavage8
|
||||
S3/VIA KM133/VT8365 aka Savage4
|
||||
|
||||
Author: Henk Vergonet <henk@god.dyndns.org>
|
||||
|
||||
Description
|
||||
-----------
|
||||
|
||||
The Savage4 chips contain two I2C interfaces (aka a I2C 'master' or
|
||||
'host').
|
||||
|
||||
The first interface is used for DDC (Data Display Channel) which is a
|
||||
serial channel through the VGA monitor connector to a DDC-compliant
|
||||
monitor. This interface is defined by the Video Electronics Standards
|
||||
Association (VESA). The standards are available for purchase at
|
||||
http://www.vesa.org . The second interface is a general-purpose I2C bus.
|
||||
|
||||
Usefull for gaining access to the TV Encoder chips.
|
||||
|
||||
@@ -1,26 +0,0 @@
|
||||
Kernel driver i2c-savage4
|
||||
|
||||
Supported adapters:
|
||||
* Savage4
|
||||
* Savage2000
|
||||
|
||||
Authors:
|
||||
Alexander Wold <awold@bigfoot.com>,
|
||||
Mark D. Studebaker <mdsxyz123@yahoo.com>
|
||||
|
||||
Description
|
||||
-----------
|
||||
|
||||
The Savage4 chips contain two I2C interfaces (aka a I2C 'master'
|
||||
or 'host').
|
||||
|
||||
The first interface is used for DDC (Data Display Channel) which is a
|
||||
serial channel through the VGA monitor connector to a DDC-compliant
|
||||
monitor. This interface is defined by the Video Electronics Standards
|
||||
Association (VESA). The standards are available for purchase at
|
||||
http://www.vesa.org . The DDC bus is not yet supported because its register
|
||||
is not directly memory-mapped.
|
||||
|
||||
The second interface is a general-purpose I2C bus. This is the only
|
||||
interface supported by the driver at the moment.
|
||||
|
||||
Some files were not shown because too many files have changed in this diff Show More
Reference in New Issue
Block a user