Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6

Igor Mammedov
2008-04-28 23:08:21 +00:00
committed by Steve French
693 changed files with 18231 additions and 10114 deletions
+55 -1
@@ -119,7 +119,7 @@ X!Ilib/string.c
!Elib/string.c
</sect1>
<sect1><title>Bit Operations</title>
!Iinclude/asm-x86/bitops_32.h
!Iinclude/asm-x86/bitops.h
</sect1>
</chapter>
@@ -645,4 +645,58 @@ X!Idrivers/video/console/fonts.c
!Edrivers/i2c/i2c-core.c
</chapter>
<chapter id="clk">
<title>Clock Framework</title>
<para>
The clock framework defines programming interfaces to support
software management of the system clock tree.
This framework is widely used with System-On-Chip (SOC) platforms
to support power management and various devices which may need
custom clock rates.
Note that these "clocks" don't relate to timekeeping or real
time clocks (RTCs), each of which has a separate framework.
These <structname>struct clk</structname> instances may be used
to manage, for example, a 96 MHz signal that is used to shift bits
into and out of peripherals or busses, or otherwise trigger
synchronous state machine transitions in system hardware.
</para>
<para>
Power management is supported by explicit software clock gating:
unused clocks are disabled, so the system doesn't waste power
changing the state of transistors that aren't in active use.
On some systems this may be backed by hardware clock gating,
where clocks are gated without being disabled in software.
Sections of chips that are powered but not clocked may be able
to retain their last state.
This low power state is often called a <emphasis>retention
mode</emphasis>.
This mode still incurs leakage currents, especially with finer
circuit geometries, but for CMOS circuits power is mostly used
by clocked state changes.
</para>
<para>
Power-aware drivers only enable their clocks when the device
they manage is in active use. Also, system sleep states often
differ according to which clock domains are active: while a
"standby" state may allow wakeup from several active domains, a
"mem" (suspend-to-RAM) state may require a more wholesale shutdown
of clocks derived from higher speed PLLs and oscillators, limiting
the number of possible wakeup event sources. A driver's suspend
method may need to be aware of system-specific clock constraints
on the target sleep state.
</para>
<para>
Some platforms support programmable clock generators. These
can be used by external chips of various kinds, such as other
CPUs, multimedia codecs, and devices with strict requirements
for interface clocking.
</para>
!Iinclude/linux/clk.h
</chapter>
</book>
+52
@@ -0,0 +1,52 @@
[This file is cloned from VesaFB/aty128fb]
What is gxfb?
=================
This is a graphics framebuffer driver for AMD Geode GX2 based processors.
Advantages:
* No need to use AMD's VSA code (or other VESA emulation layer) in the
BIOS.
* It provides a nice large console (128 cols + 48 lines with 1024x768)
without using tiny, unreadable fonts.
* You can run XF68_FBDev on top of /dev/fb0
* Most important: boot logo :-)
Disadvantages:
* graphic mode is slower than text mode...
How to use it?
==============
Switching modes is done using the gxfb.mode_option=<resolution>... boot
parameter or using the `fbset' program.
See Documentation/fb/modedb.txt for more information on modedb
resolutions.
X11
===
XF68_FBDev should generally work fine, but it is non-accelerated.
Configuration
=============
You can pass kernel command line options to gxfb with gxfb.<option>.
For example, gxfb.mode_option=800x600@75.
Accepted options:
mode_option - specify the video mode. Of the form
<x>x<y>[-<bpp>][@<refresh>]
vram - size of video ram (normally auto-detected)
vt_switch - enable vt switching during suspend/resume. The vt
switch is slow, but harmless.
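The mode_option form listed above, <x>x<y>[-<bpp>][@<refresh>], can be
illustrated with a small userspace parser. This is only a sketch of the
syntax as documented here, not the kernel's modedb parser; struct
mode_request, parse_number() and parse_mode_option() are hypothetical names
made up for the example, and a field left at 0 means "not given".

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the parsed fields; 0 means "not given". */
struct mode_request {
	unsigned long xres, yres, bpp, refresh;
};

/* Parse a run of decimal digits at *s and advance *s past them. */
static int parse_number(const char **s, unsigned long *out)
{
	char *end;

	if (!isdigit((unsigned char)**s))
		return -1;
	*out = strtoul(*s, &end, 10);
	*s = end;
	return 0;
}

/* Returns 0 on success, -1 if the string does not match the form. */
int parse_mode_option(const char *s, struct mode_request *req)
{
	assert(s != NULL && req != NULL);
	memset(req, 0, sizeof *req);

	/* mandatory "<x>x<y>" part */
	if (parse_number(&s, &req->xres) || *s++ != 'x')
		return -1;
	if (parse_number(&s, &req->yres))
		return -1;

	/* optional "-<bpp>" */
	if (*s == '-') {
		s++;
		if (parse_number(&s, &req->bpp))
			return -1;
	}

	/* optional "@<refresh>" */
	if (*s == '@') {
		s++;
		if (parse_number(&s, &req->refresh))
			return -1;
	}

	/* anything left over is outside the documented form */
	return *s == '\0' ? 0 : -1;
}
```

With this sketch, "800x600@75" yields 800 by 600 at 75 Hz with no depth
requested, while a string such as "800x" is rejected.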
--
Andres Salomon <dilinger@debian.org>
+2
@@ -14,6 +14,8 @@ graphics devices. These would include:
Intel 915GM
Intel 945G
Intel 945GM
Intel 965G
Intel 965GM
B. List of available options
+52
@@ -0,0 +1,52 @@
[This file is cloned from VesaFB/aty128fb]
What is lxfb?
=================
This is a graphics framebuffer driver for AMD Geode LX based processors.
Advantages:
* No need to use AMD's VSA code (or other VESA emulation layer) in the
BIOS.
* It provides a nice large console (128 cols + 48 lines with 1024x768)
without using tiny, unreadable fonts.
* You can run XF68_FBDev on top of /dev/fb0
* Most important: boot logo :-)
Disadvantages:
* graphic mode is slower than text mode...
How to use it?
==============
Switching modes is done using the lxfb.mode_option=<resolution>... boot
parameter or using the `fbset' program.
See Documentation/fb/modedb.txt for more information on modedb
resolutions.
X11
===
XF68_FBDev should generally work fine, but it is non-accelerated.
Configuration
=============
You can pass kernel command line options to lxfb with lxfb.<option>.
For example, lxfb.mode_option=800x600@75.
Accepted options:
mode_option - specify the video mode. Of the form
<x>x<y>[-<bpp>][@<refresh>]
vram - size of video ram (normally auto-detected)
vt_switch - enable vt switching during suspend/resume. The vt
switch is slow, but harmless.
--
Andres Salomon <dilinger@debian.org>
+7 -9
@@ -1,7 +1,7 @@
Metronomefb
-----------
Maintained by Jaya Kumar <jayakumar.lkml.gmail.com>
Last revised: Nov 20, 2007
Last revised: Mar 10, 2008
Metronomefb is a driver for the Metronome display controller. The controller
is from E-Ink Corporation. It is intended to be used to drive the E-Ink
@@ -11,20 +11,18 @@ display media here http://www.e-ink.com/products/matrix/metronome.html .
Metronome is interfaced to the host CPU through the AMLCD interface. The
host CPU generates the control information and the image in a framebuffer
which is then delivered to the AMLCD interface by a host specific method.
Currently, that's implemented for the PXA's LCDC controller. The display and
error status are each pulled through individual GPIOs.
The display and error status are each pulled through individual GPIOs.
Metronomefb was written for the PXA255/gumstix/lyre combination and
therefore currently has board set specific code in it. If other boards based on
other architectures are available, then the host specific code can be separated
and abstracted out.
Metronomefb is platform independent and depends on a board specific driver
to do all physical IO work. Currently, an example is implemented for the
PXA board used in the AM-200 EPD devkit. This example is am200epd.c
Metronomefb requires waveform information which is delivered via the AMLCD
interface to the metronome controller. The waveform information is expected to
be delivered from userspace via the firmware class interface. The waveform file
can be compressed as long as your udev or hotplug script is aware of the need
to uncompress it before delivering it. metronomefb will ask for waveform.wbf
which would typically go into /lib/firmware/waveform.wbf depending on your
to uncompress it before delivering it. metronomefb will ask for metronome.wbf
which would typically go into /lib/firmware/metronome.wbf depending on your
udev/hotplug setup. I have only tested with a single waveform file which was
originally labeled 23P01201_60_WT0107_MTC. I do not know what it stands for.
Caution should be exercised when manipulating the waveform as there may be
+4
@@ -125,8 +125,12 @@ There may be more modes.
amifb - Amiga chipset frame buffer
aty128fb - ATI Rage128 / Pro frame buffer
atyfb - ATI Mach64 frame buffer
pm2fb - Permedia 2/2V frame buffer
pm3fb - Permedia 3 frame buffer
sstfb - Voodoo 1/2 (SST1) chipset frame buffer
tdfxfb - 3D Fx frame buffer
tridentfb - Trident (Cyber)blade chipset frame buffer
vt8623fb - VIA 8623 frame buffer
BTW, only a few drivers use this at the moment. Others are to follow
(feel free to send patches).
@@ -128,15 +128,6 @@ Who: Arjan van de Ven <arjan@linux.intel.com>
---------------------------
What: vm_ops.nopage
When: Soon, provided in-kernel callers have been converted
Why: This interface is replaced by vm_ops.fault, but it has been around
forever, is used by a lot of drivers, and doesn't cost much to
maintain.
Who: Nick Piggin <npiggin@suse.de>
---------------------------
What: PHYSDEVPATH, PHYSDEVBUS, PHYSDEVDRIVER in the uevent environment
When: October 2008
Why: The stacking of class devices makes these values misleading and
-3
@@ -511,7 +511,6 @@ prototypes:
void (*open)(struct vm_area_struct*);
void (*close)(struct vm_area_struct*);
int (*fault)(struct vm_area_struct*, struct vm_fault *);
struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *);
int (*page_mkwrite)(struct vm_area_struct *, struct page *);
locking rules:
@@ -519,7 +518,6 @@ locking rules:
open: no yes
close: no yes
fault: no yes
nopage: no yes
page_mkwrite: no yes no
->page_mkwrite() is called when a previously read-only page is
@@ -537,4 +535,3 @@ NULL.
ipc/shm.c::shm_delete() - may need BKL.
->read() and ->write() in many drivers are (probably) missing BKL.
drivers/sgi/char/graphics.c::sgi_graphics_nopage() - may need BKL.
+12
@@ -92,6 +92,18 @@ NodeList format is a comma-separated list of decimal numbers and ranges,
a range being two hyphen-separated decimal numbers, the smallest and
largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15
NUMA memory allocation policies have optional flags that can be used in
conjunction with their modes. These optional flags can be specified
when tmpfs is mounted by appending them to the mode before the NodeList.
See Documentation/vm/numa_memory_policy.txt for a list of all available
memory allocation policy mode flags.
=static is equivalent to MPOL_F_STATIC_NODES
=relative is equivalent to MPOL_F_RELATIVE_NODES
For example, mpol=bind=static:NodeList is the equivalent of an
allocation policy of MPOL_BIND | MPOL_F_STATIC_NODES.
Note that trying to mount a tmpfs with an mpol option will fail if the
running kernel does not support NUMA; and will fail if its nodelist
specifies a node which is not online. If your system relies on that
+15
@@ -17,6 +17,21 @@ dmask=### -- The permission mask for the directory.
fmask=### -- The permission mask for files.
The default is the umask of current process.
allow_utime=### -- This option controls the permission check of mtime/atime.
20 - If the current process is in the group of the file's group ID,
you can change the timestamp.
2 - Other users can change the timestamp.
The default is set from the `dmask' option. (If the directory is
writable, utime(2) is also allowed, i.e. ~dmask & 022.)
Normally utime(2) checks that the current process is the owner of
the file, or that it has the CAP_FOWNER capability. But the FAT
filesystem doesn't store uid/gid on disk, so the normal
check is too inflexible. With this option you can
relax it.
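The default described above (~dmask & 022) and the meaning of the two bits
can be worked through in a few lines of C. This is a userspace sketch, not
the filesystem's code: default_allow_utime() and may_change_times() are
hypothetical helpers, and the sketch assumes that a process in the file's
group is checked against the 20 bit while every other process is checked
against the 2 bit.

```c
#include <assert.h>

#define ALLOW_UTIME_GROUP 020	/* octal: group members may set times */
#define ALLOW_UTIME_OTHER 002	/* octal: other users may set times */

/* Default allow_utime derived from dmask, as stated in the text. */
unsigned int default_allow_utime(unsigned int dmask)
{
	return ~dmask & 022;
}

/*
 * Assumed relaxed check: in_group is non-zero when the calling process
 * belongs to the file's group. Owner and CAP_FOWNER checks are left out.
 */
int may_change_times(unsigned int allow_utime, int in_group)
{
	/* only the two documented bits are meaningful */
	assert((allow_utime & ~022u) == 0);

	if (in_group)
		return (allow_utime & ALLOW_UTIME_GROUP) != 0;
	return (allow_utime & ALLOW_UTIME_OTHER) != 0;
}
```

For instance, dmask=022 leaves allow_utime at 0 (only the normal check
applies), dmask=002 gives 20 (group members may change timestamps), and
dmask=000 gives 22 (anyone may).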
codepage=### -- Sets the codepage number for converting to shortname
characters on FAT filesystem.
By default, FAT_DEFAULT_CODEPAGE setting is used.
+10
@@ -107,6 +107,16 @@ type of GPIO controller, and on one particular board 80-95 with an FPGA.
The numbers need not be contiguous; either of those platforms could also
use numbers 2000-2063 to identify GPIOs in a bank of I2C GPIO expanders.
If you want to initialize a structure with an invalid GPIO number, use
some negative number (perhaps "-EINVAL"); that will never be valid. To
test if a number could reference a GPIO, you may use this predicate:
int gpio_is_valid(int number);
A number that's not valid will be rejected by calls which may request
or free GPIOs (see below). Other numbers may also be rejected; for
example, a number might be valid but unused on a given board.
Whether a platform supports multiple GPIO controllers is currently a
platform-specific implementation issue.
+47 -4
@@ -37,6 +37,11 @@ registration function such as register_kprobe() specifies where
the probe is to be inserted and what handler is to be called when
the probe is hit.
There are also register_/unregister_*probes() functions for batch
registration/unregistration of a group of *probes. These functions
can speed up the unregistration process when you have to unregister
a lot of probes at once.
The next three subsections explain how the different types of
probes work. They explain certain things that you'll need to
know in order to make the best use of Kprobes -- e.g., the
@@ -190,10 +195,11 @@ code mapping.
4. API Reference
The Kprobes API includes a "register" function and an "unregister"
function for each type of probe. Here are terse, mini-man-page
specifications for these functions and the associated probe handlers
that you'll write. See the files in the samples/kprobes/ sub-directory
for examples.
function for each type of probe. The API also includes "register_*probes"
and "unregister_*probes" functions for (un)registering arrays of probes.
Here are terse, mini-man-page specifications for these functions and
the associated probe handlers that you'll write. See the files in the
samples/kprobes/ sub-directory for examples.
4.1 register_kprobe
@@ -319,6 +325,43 @@ void unregister_kretprobe(struct kretprobe *rp);
Removes the specified probe. The unregister function can be called
at any time after the probe has been registered.
NOTE:
If the functions find an incorrect probe (e.g. an unregistered probe),
they clear the addr field of the probe.
4.5 register_*probes
#include <linux/kprobes.h>
int register_kprobes(struct kprobe **kps, int num);
int register_kretprobes(struct kretprobe **rps, int num);
int register_jprobes(struct jprobe **jps, int num);
Registers each of the num probes in the specified array. If any
error occurs during registration, all probes in the array, up to
the bad probe, are safely unregistered before the register_*probes
function returns.
- kps/rps/jps: an array of pointers to *probe data structures
- num: the number of array entries.
NOTE:
You have to allocate (or define) an array of pointers and set all
of the array entries before using these functions.
4.6 unregister_*probes
#include <linux/kprobes.h>
void unregister_kprobes(struct kprobe **kps, int num);
void unregister_kretprobes(struct kretprobe **rps, int num);
void unregister_jprobes(struct jprobe **jps, int num);
Removes each of the num probes in the specified array at once.
NOTE:
If the functions find some incorrect probes (e.g. unregistered
probes) in the specified array, they clear the addr field of those
incorrect probes. However, other probes in the array are
unregistered correctly.
5. Kprobes Features and Limitations
Kprobes allows multiple probes at the same address. Currently,
+6
@@ -450,3 +450,9 @@ These currently include
there are upper and lower limits (32768, 16). Default is 128.
strip_cache_active (currently raid5 only)
number of active entries in the stripe cache
preread_bypass_threshold (currently raid5 only)
number of times a stripe requiring preread will be bypassed by
a stripe that does not require preread. For fairness defaults
to 1. Setting this to 0 disables bypass accounting and
requires preread stripes to wait until all full-width stripe-
writes are complete. Valid values are 0 to stripe_cache_size.
@@ -2836,6 +2836,39 @@ platforms are moved over to use the flattened-device-tree model.
big-endian;
};
r) Freescale Display Interface Unit
The Freescale DIU is an LCD controller; with proper hardware, it can also
drive DVI monitors.
Required properties:
- compatible : should be "fsl-diu".
- reg : should contain at least address and length of the DIU register
set.
- interrupts : one DIU interrupt should be described here.
Example (MPC8610HPCD)
display@2c000 {
compatible = "fsl,diu";
reg = <0x2c000 100>;
interrupts = <72 2>;
interrupt-parent = <&mpic>;
};
s) Freescale on board FPGA
These are the memory-mapped registers for the on-board FPGA.
Required properties:
- compatible : should be "fsl,fpga-pixis".
- reg : should contain the address and the length of the FPGA register
set.
Example (MPC8610HPCD)
board-control@e8000000 {
compatible = "fsl,fpga-pixis";
reg = <0xe8000000 32>;
};
VII - Marvell Discovery mv64[345]6x System Controller chips
===========================================================
+2 -166
@@ -126,8 +126,8 @@ NOTES:
FULL DUPLEX CHARACTER DEVICE API
================================
See the sample program below for one example showing the use of the full
duplex programming interface. (Although it doesn't perform a full duplex
See the spidev_fdx.c sample program for one example showing the use of the
full duplex programming interface. (Although it doesn't perform a full duplex
transfer.) The model is the same as that used in the kernel spi_sync()
request; the individual transfers offer the same capabilities as are
available to kernel drivers (except that it's not asynchronous).
@@ -141,167 +141,3 @@ and bitrate for each transfer segment.)
To make a full duplex request, provide both rx_buf and tx_buf for the
same transfer. It's even OK if those are the same buffer.
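A full duplex request as described above can be sketched as a single
transfer whose tx_buf and rx_buf point at the same buffer. This is a
minimal illustration built on the spidev ioctl interface; fdx_prepare() and
fdx_transfer() are hypothetical helper names, and the caller is assumed to
have opened a /dev/spidevB.D node already.

```c
#include <assert.h>
#include <string.h>

#include <sys/ioctl.h>

#include <linux/types.h>
#include <linux/spi/spidev.h>

/*
 * Describe one full duplex transfer: the bytes in buf are shifted out
 * and are overwritten in place by the bytes shifted in.
 */
void fdx_prepare(struct spi_ioc_transfer *xfer, unsigned char *buf,
		 unsigned int len)
{
	assert(xfer != NULL && buf != NULL);

	memset(xfer, 0, sizeof *xfer);
	xfer->tx_buf = (unsigned long) buf;
	xfer->rx_buf = (unsigned long) buf;	/* same buffer is OK */
	xfer->len = len;
}

/* Submit the transfer as a one-segment message; negative on error. */
int fdx_transfer(int fd, unsigned char *buf, unsigned int len)
{
	struct spi_ioc_transfer xfer;

	fdx_prepare(&xfer, buf, len);
	return ioctl(fd, SPI_IOC_MESSAGE(1), &xfer);
}
```

A caller would fill buf with the bytes to send, call fdx_transfer(fd, buf,
len), and on success read the response out of the same buffer.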
SAMPLE PROGRAM
==============
-------------------------------- CUT HERE
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <linux/types.h>
#include <linux/spi/spidev.h>
static int verbose;
static void do_read(int fd, int len)
{
unsigned char buf[32], *bp;
int status;
/* read at least 2 bytes, no more than 32 */
if (len < 2)
len = 2;
else if (len > sizeof(buf))
len = sizeof(buf);
memset(buf, 0, sizeof buf);
status = read(fd, buf, len);
if (status < 0) {
perror("read");
return;
}
if (status != len) {
fprintf(stderr, "short read\n");
return;
}
printf("read(%2d, %2d): %02x %02x,", len, status,
buf[0], buf[1]);
status -= 2;
bp = buf + 2;
while (status-- > 0)
printf(" %02x", *bp++);
printf("\n");
}
static void do_msg(int fd, int len)
{
struct spi_ioc_transfer xfer[2];
unsigned char buf[32], *bp;
int status;
memset(xfer, 0, sizeof xfer);
memset(buf, 0, sizeof buf);
if (len > sizeof buf)
len = sizeof buf;
buf[0] = 0xaa;
xfer[0].tx_buf = (__u64) buf;
xfer[0].len = 1;
xfer[1].rx_buf = (__u64) buf;
xfer[1].len = len;
status = ioctl(fd, SPI_IOC_MESSAGE(2), xfer);
if (status < 0) {
perror("SPI_IOC_MESSAGE");
return;
}
printf("response(%2d, %2d): ", len, status);
for (bp = buf; len; len--)
printf(" %02x", *bp++);
printf("\n");
}
static void dumpstat(const char *name, int fd)
{
__u8 mode, lsb, bits;
__u32 speed;
if (ioctl(fd, SPI_IOC_RD_MODE, &mode) < 0) {
perror("SPI rd_mode");
return;
}
if (ioctl(fd, SPI_IOC_RD_LSB_FIRST, &lsb) < 0) {
perror("SPI rd_lsb_first");
return;
}
if (ioctl(fd, SPI_IOC_RD_BITS_PER_WORD, &bits) < 0) {
perror("SPI bits_per_word");
return;
}
if (ioctl(fd, SPI_IOC_RD_MAX_SPEED_HZ, &speed) < 0) {
perror("SPI max_speed_hz");
return;
}
printf("%s: spi mode %d, %d bits %sper word, %d Hz max\n",
name, mode, bits, lsb ? "(lsb first) " : "", speed);
}
int main(int argc, char **argv)
{
int c;
int readcount = 0;
int msglen = 0;
int fd;
const char *name;
while ((c = getopt(argc, argv, "hm:r:v")) != EOF) {
switch (c) {
case 'm':
msglen = atoi(optarg);
if (msglen < 0)
goto usage;
continue;
case 'r':
readcount = atoi(optarg);
if (readcount < 0)
goto usage;
continue;
case 'v':
verbose++;
continue;
case 'h':
case '?':
usage:
fprintf(stderr,
"usage: %s [-h] [-m N] [-r N] /dev/spidevB.D\n",
argv[0]);
return 1;
}
}
if ((optind + 1) != argc)
goto usage;
name = argv[optind];
fd = open(name, O_RDWR);
if (fd < 0) {
perror("open");
return 1;
}
dumpstat(name, fd);
if (msglen)
do_msg(fd, msglen);
if (readcount)
do_read(fd, readcount);
close(fd);
return 0;
}
+158
@@ -0,0 +1,158 @@
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>

#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/stat.h>

#include <linux/types.h>
#include <linux/spi/spidev.h>


static int verbose;

/* Read len bytes with plain read(2) and dump them in hex. */
static void do_read(int fd, int len)
{
	unsigned char	buf[32], *bp;
	int		status;

	/* read at least 2 bytes, no more than 32 */
	if (len < 2)
		len = 2;
	else if (len > sizeof(buf))
		len = sizeof(buf);
	memset(buf, 0, sizeof buf);

	status = read(fd, buf, len);
	if (status < 0) {
		perror("read");
		return;
	}
	if (status != len) {
		fprintf(stderr, "short read\n");
		return;
	}

	printf("read(%2d, %2d): %02x %02x,", len, status,
		buf[0], buf[1]);
	status -= 2;
	bp = buf + 2;
	while (status-- > 0)
		printf(" %02x", *bp++);
	printf("\n");
}

/* Send one command byte, then read len bytes, as a two-segment message. */
static void do_msg(int fd, int len)
{
	struct spi_ioc_transfer	xfer[2];
	unsigned char		buf[32], *bp;
	int			status;

	memset(xfer, 0, sizeof xfer);
	memset(buf, 0, sizeof buf);

	if (len > sizeof buf)
		len = sizeof buf;

	buf[0] = 0xaa;
	/* cast via unsigned long so 32-bit builds don't warn */
	xfer[0].tx_buf = (unsigned long) buf;
	xfer[0].len = 1;

	xfer[1].rx_buf = (unsigned long) buf;
	xfer[1].len = len;

	status = ioctl(fd, SPI_IOC_MESSAGE(2), xfer);
	if (status < 0) {
		perror("SPI_IOC_MESSAGE");
		return;
	}

	printf("response(%2d, %2d): ", len, status);
	for (bp = buf; len; len--)
		printf(" %02x", *bp++);
	printf("\n");
}

/* Print the device's current mode, bit order, word size and max speed. */
static void dumpstat(const char *name, int fd)
{
	__u8	mode, lsb, bits;
	__u32	speed;

	if (ioctl(fd, SPI_IOC_RD_MODE, &mode) < 0) {
		perror("SPI rd_mode");
		return;
	}
	if (ioctl(fd, SPI_IOC_RD_LSB_FIRST, &lsb) < 0) {
		perror("SPI rd_lsb_first");
		return;
	}
	if (ioctl(fd, SPI_IOC_RD_BITS_PER_WORD, &bits) < 0) {
		perror("SPI bits_per_word");
		return;
	}
	if (ioctl(fd, SPI_IOC_RD_MAX_SPEED_HZ, &speed) < 0) {
		perror("SPI max_speed_hz");
		return;
	}

	printf("%s: spi mode %d, %d bits %sper word, %u Hz max\n",
		name, mode, bits, lsb ? "(lsb first) " : "", speed);
}

int main(int argc, char **argv)
{
	int		c;
	int		readcount = 0;
	int		msglen = 0;
	int		fd;
	const char	*name;

	while ((c = getopt(argc, argv, "hm:r:v")) != EOF) {
		switch (c) {
		case 'm':
			msglen = atoi(optarg);
			if (msglen < 0)
				goto usage;
			continue;
		case 'r':
			readcount = atoi(optarg);
			if (readcount < 0)
				goto usage;
			continue;
		case 'v':
			verbose++;
			continue;
		case 'h':
		case '?':
usage:
			fprintf(stderr,
				"usage: %s [-h] [-m N] [-r N] /dev/spidevB.D\n",
				argv[0]);
			return 1;
		}
	}

	if ((optind + 1) != argc)
		goto usage;
	name = argv[optind];

	fd = open(name, O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	dumpstat(name, fd);

	if (msglen)
		do_msg(fd, msglen);

	if (readcount)
		do_read(fd, readcount);

	close(fd);
	return 0;
}
+200 -79
@@ -135,77 +135,58 @@ most general to most specific:
Components of Memory Policies
A Linux memory policy is a tuple consisting of a "mode" and an optional set
of nodes. The mode determine the behavior of the policy, while the
optional set of nodes can be viewed as the arguments to the behavior.
A Linux memory policy consists of a "mode", optional mode flags, and an
optional set of nodes. The mode determines the behavior of the policy,
the optional mode flags determine the behavior of the mode, and the
optional set of nodes can be viewed as the arguments to the policy
behavior.
Internally, memory policies are implemented by a reference counted
structure, struct mempolicy. Details of this structure will be discussed
in context, below, as required to explain the behavior.
Note: in some functions AND in the struct mempolicy itself, the mode
is called "policy". However, to avoid confusion with the policy tuple,
this document will continue to use the term "mode".
Linux memory policy supports the following 4 behavioral modes:
Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
context or scope dependent.
Default Mode--MPOL_DEFAULT: This mode is only used in the memory
policy APIs. Internally, MPOL_DEFAULT is converted to the NULL
memory policy in all policy scopes. Any existing non-default policy
will simply be removed when MPOL_DEFAULT is specified. As a result,
MPOL_DEFAULT means "fall back to the next most specific policy scope."
As mentioned in the Policy Scope section above, during normal
system operation, the System Default Policy is hard coded to
contain the Default mode.
For example, a NULL or default task policy will fall back to the
system default policy. A NULL or default vma policy will fall
back to the task policy.
In this context, default mode means "local" allocation--that is
attempt to allocate the page from the node associated with the cpu
where the fault occurs. If the "local" node has no memory, or the
node's memory can be exhausted [no free pages available], local
allocation will "fallback to"--attempt to allocate pages from--
"nearby" nodes, in order of increasing "distance".
When specified in one of the memory policy APIs, the Default mode
does not use the optional set of nodes.
Implementation detail -- subject to change: "Fallback" uses
a per node list of sibling nodes--called zonelists--built at
boot time, or when nodes or memory are added or removed from
the system [memory hotplug]. These per node zonelist are
constructed with nodes in order of increasing distance based
on information provided by the platform firmware.
When a task/process policy or a shared policy contains the Default
mode, this also means "local allocation", as described above.
In the context of a VMA, Default mode means "fall back to task
policy"--which may or may not specify Default mode. Thus, Default
mode can not be counted on to mean local allocation when used
on a non-shared region of the address space. However, see
MPOL_PREFERRED below.
The Default mode does not use the optional set of nodes.
It is an error for the set of nodes specified for this policy to
be non-empty.
MPOL_BIND: This mode specifies that memory must come from the
set of nodes specified by the policy.
The memory policy APIs do not specify an order in which the nodes
will be searched. However, unlike "local allocation", the Bind
policy does not consider the distance between the nodes. Rather,
allocations will fallback to the nodes specified by the policy in
order of numeric node id. Like everything in Linux, this is subject
to change.
set of nodes specified by the policy. Memory will be allocated from
the node in the set with sufficient free memory that is closest to
the node where the allocation takes place.
MPOL_PREFERRED: This mode specifies that the allocation should be
attempted from the single node specified in the policy. If that
allocation fails, the kernel will search other nodes, exactly as
it would for a local allocation that started at the preferred node
in increasing distance from the preferred node. "Local" allocation
policy can be viewed as a Preferred policy that starts at the node
allocation fails, the kernel will search other nodes, in order of
increasing distance from the preferred node based on information
provided by the platform firmware.
containing the cpu where the allocation takes place.
Internally, the Preferred policy uses a single node--the
preferred_node member of struct mempolicy. A "distinguished
value of this preferred_node, currently '-1', is interpreted
as "the node containing the cpu where the allocation takes
place"--local allocation. This is the way to specify
local allocation for a specific range of addresses--i.e. for
VMA policies.
preferred_node member of struct mempolicy. When the internal
mode flag MPOL_F_LOCAL is set, the preferred_node is ignored and
the policy is interpreted as local allocation. "Local" allocation
policy can be viewed as a Preferred policy that starts at the node
containing the cpu where the allocation takes place.
It is possible for the user to specify that local allocation is
always preferred by passing an empty nodemask with this mode.
If an empty nodemask is passed, the policy cannot use the
MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described
below.
MPOL_INTERLEAVE: This mode specifies that page allocations be
interleaved, on a page granularity, across the nodes specified in
@@ -231,6 +212,154 @@ Components of Memory Policies
the temporary interleaved system default policy works in this
mode.
Linux memory policy supports the following optional mode flags:
MPOL_F_STATIC_NODES: This flag specifies that the nodemask passed by
the user should not be remapped if the task or VMA's set of allowed
nodes changes after the memory policy has been defined.
Without this flag, anytime a mempolicy is rebound because of a
change in the set of allowed nodes, the node (Preferred) or
nodemask (Bind, Interleave) is remapped to the new set of
allowed nodes. This may result in nodes being used that were
previously undesired.
With this flag, if the user-specified nodes overlap with the
nodes allowed by the task's cpuset, then the memory policy is
applied to their intersection. If the two sets of nodes do not
overlap, the Default policy is used.
For example, consider a task that is attached to a cpuset with
mems 1-3 that sets an Interleave policy over the same set. If
the cpuset's mems change to 3-5, the Interleave will now occur
over nodes 3, 4, and 5. With this flag, however, since only node
3 is allowed from the user's nodemask, the "interleave" only
occurs over that node. If no nodes from the user's nodemask are
now allowed, the Default behavior is used.
MPOL_F_STATIC_NODES cannot be combined with the
MPOL_F_RELATIVE_NODES flag. It also cannot be used for
MPOL_PREFERRED policies that were created with an empty nodemask
(local allocation).
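The difference between the default rebind and MPOL_F_STATIC_NODES can be
modelled in userspace with plain bitmasks. This is a simplified sketch and
not the kernel's implementation: nodemasks are held in a uint64_t, and
NODE(), weight(), nth_set_bit(), ordinal_of(), rebind_default() and
rebind_static() are hypothetical names. The default rebind is assumed to
map the n-th node of the old allowed set to the n-th node of the new one; a
result of 0 from rebind_static() stands for "fall back to Default".

```c
#include <assert.h>
#include <stdint.h>

#define NODE(n) (UINT64_C(1) << (n))

/* Number of nodes set in the mask. */
static unsigned int weight(uint64_t mask)
{
	unsigned int w = 0;

	for (; mask; mask &= mask - 1)
		w++;
	return w;
}

/* Node number of the n-th (0-based) node set in the mask. */
static unsigned int nth_set_bit(uint64_t mask, unsigned int n)
{
	unsigned int b;

	assert(n < weight(mask));
	for (b = 0; b < 64; b++)
		if ((mask & NODE(b)) && n-- == 0)
			break;
	return b;
}

/* How many nodes in the mask are numbered below 'bit'. */
static unsigned int ordinal_of(uint64_t mask, unsigned int bit)
{
	return weight(mask & (NODE(bit) - 1));
}

/* Assumed default behavior: remap by position from old to new set. */
uint64_t rebind_default(uint64_t policy, uint64_t old_allowed,
			uint64_t new_allowed)
{
	unsigned int b, w = weight(new_allowed);
	uint64_t out = 0;

	if (w == 0)
		return policy;
	for (b = 0; b < 64; b++) {
		if (!(policy & NODE(b)))
			continue;
		if (old_allowed & NODE(b))
			out |= NODE(nth_set_bit(new_allowed,
					ordinal_of(old_allowed, b) % w));
		else
			out |= NODE(b);	/* not in old set: left alone */
	}
	return out;
}

/* MPOL_F_STATIC_NODES: keep only user nodes that are still allowed. */
uint64_t rebind_static(uint64_t user, uint64_t new_allowed)
{
	return user & new_allowed;
}
```

For the cpuset example above (mems 1-3 changing to 3-5), rebind_default()
yields nodes 3-5 while rebind_static() yields node 3 only.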
MPOL_F_RELATIVE_NODES: This flag specifies that the nodemask passed
by the user will be mapped relative to the task or VMA's
set of allowed nodes. The kernel stores the user-passed nodemask,
and if the allowed nodes change, then that original nodemask will
be remapped relative to the new set of allowed nodes.
Without this flag (and without MPOL_F_STATIC_NODES), anytime a
mempolicy is rebound because of a change in the set of allowed
nodes, the node (Preferred) or nodemask (Bind, Interleave) is
remapped to the new set of allowed nodes. That remap may not
preserve the relative nature of the user's passed nodemask to its
set of allowed nodes upon successive rebinds: a nodemask of
1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
allowed nodes is restored to its original state.
With this flag, the remap is done so that the node numbers from
the user's passed nodemask are relative to the set of allowed
nodes. In other words, if nodes 0, 2, and 4 are set in the user's
nodemask, the policy will be effected over the first (and in the
Bind or Interleave case, the third and fifth) nodes in the set of
allowed nodes. The nodemask passed by the user represents nodes
relative to the task or VMA's set of allowed nodes.
If the user's nodemask includes nodes that are outside the range
of the new set of allowed nodes (for example, node 5 is set in
the user's nodemask when the set of allowed nodes is only 0-3),
then the remap wraps around to the beginning of the nodemask and,
if not already set, sets the node in the mempolicy nodemask.
For example, consider a task that is attached to a cpuset with
mems 2-5 that sets an Interleave policy over the same set with
MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
interleave now occurs over nodes 3,5-7. If the cpuset's mems
then change to 0,2-3,5, then the interleave occurs over nodes
0,2-3,5.
Thanks to the consistent remapping, applications preparing
nodemasks to specify memory policies using this flag should
disregard their current, actual cpuset imposed memory placement
and prepare the nodemask as if they were always located on
memory nodes 0 to N-1, where N is the number of memory nodes the
policy is intended to manage. Let the kernel then remap to the
set of memory nodes allowed by the task's cpuset, as that may
change over time.
MPOL_F_RELATIVE_NODES cannot be combined with the
MPOL_F_STATIC_NODES flag. It also cannot be used for
MPOL_PREFERRED policies that were created with an empty nodemask
(local allocation).
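The relative remap can also be modelled with bitmasks. The sketch below
assumes the remap works in two steps, which is one way to read the
description above: the user's nodemask is first folded modulo the number of
allowed nodes (the wrap-around), and position n of the folded mask is then
mapped onto the n-th allowed node. NODE(), weight(), fold_nodes(),
map_onto() and relative_nodemask() are hypothetical names, and nodemasks
are limited to 64 nodes for simplicity.

```c
#include <assert.h>
#include <stdint.h>

#define NODE(n) (UINT64_C(1) << (n))

/* Number of nodes set in the mask. */
static unsigned int weight(uint64_t mask)
{
	unsigned int w = 0;

	for (; mask; mask &= mask - 1)
		w++;
	return w;
}

/* Wrap every node number into the range 0..size-1. */
static uint64_t fold_nodes(uint64_t mask, unsigned int size)
{
	uint64_t out = 0;
	unsigned int b;

	assert(size > 0 && size <= 64);
	for (b = 0; b < 64; b++)
		if (mask & NODE(b))
			out |= NODE(b % size);
	return out;
}

/* Map position n of 'positions' onto the n-th node set in 'allowed'. */
static uint64_t map_onto(uint64_t positions, uint64_t allowed)
{
	uint64_t out = 0;
	unsigned int b, pos = 0;

	for (b = 0; b < 64; b++) {
		if (!(allowed & NODE(b)))
			continue;
		if (positions & NODE(pos))
			out |= NODE(b);
		pos++;
	}
	return out;
}

/* Effective nodemask for a user mask under MPOL_F_RELATIVE_NODES. */
uint64_t relative_nodemask(uint64_t user, uint64_t allowed)
{
	unsigned int w = weight(allowed);

	if (w == 0)
		return 0;
	return map_onto(fold_nodes(user, w), allowed);
}
```

Under these assumptions, a user nodemask of 0, 2 and 4 with allowed nodes
3-7 selects the first, third and fifth allowed nodes, that is 3, 5 and 7;
a user nodemask naming node 5 with allowed nodes 0-3 wraps around to
node 1.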
MEMORY POLICY REFERENCE COUNTING
To resolve use/free races, struct mempolicy contains an atomic reference
count field. Internal interfaces, mpol_get()/mpol_put() increment and
decrement this reference count, respectively. mpol_put() will only free
the structure back to the mempolicy kmem cache when the reference count
goes to zero.
When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy. When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
on completion of the policy installation.
During run-time "usage" of the policy, we attempt to minimize atomic operations
on the reference count, as this can lead to cache lines bouncing between cpus
and NUMA nodes. "Usage" here means one of the following:
1) querying of the policy, either by the task itself [using the get_mempolicy()
API discussed below] or by another task using the /proc/<pid>/numa_maps
interface.
2) examination of the policy to determine the policy mode and associated node
or node lists, if any, for page allocation. This is considered a "hot
path". Note that for MPOL_BIND, the "usage" extends across the entire
allocation process, which may sleep during page reclamation, because the
BIND policy nodemask is used, by reference, to filter ineligible nodes.
We can avoid taking an extra reference during the usages listed above as
follows:
1) we never need to get/free the system default policy as this is never
changed nor freed, once the system is up and running.
2) for querying the policy, we do not need to take an extra reference on the
target task's task policy nor vma policies because we always acquire the
task's mm's mmap_sem for read during the query. The set_mempolicy() and
mbind() APIs [see below] always acquire the mmap_sem for write when
installing or replacing task or vma policies. Thus, there is no possibility
of a task or thread freeing a policy while another task or thread is
querying it.
3) Page allocation usage of task or vma policy occurs in the fault path where
we hold the mmap_sem for read. Again, because replacing the task or vma
policy requires that the mmap_sem be held for write, the policy can't be
freed out from under us while we're using it for page allocation.
4) Shared policies require special consideration. One task can replace a
shared memory policy while another task, with a distinct mmap_sem, is
querying or allocating a page based on the policy. To resolve this
potential race, the shared policy infrastructure adds an extra reference
to the shared policy during lookup while holding a spin lock on the shared
policy management structure. This requires that we drop this extra
reference when we're finished "using" the policy. We must drop the
extra reference on shared policies in the same query/allocation paths
used for non-shared policies. For this reason, shared policies are marked
as such, and the extra reference is dropped "conditionally"--i.e., only
for shared policies.
Because of this extra reference counting, and because we must look up
shared policies in a tree structure under spinlock, shared policies are
more expensive to use in the page allocation path. This is especially
true for shared policies on shared memory regions shared by tasks running
on different NUMA nodes. This extra overhead can be avoided by always
falling back to task or system default policy for shared memory regions,
or by prefaulting the entire shared memory region into memory and locking
it down. However, this might not be appropriate for all applications.
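The conditional drop of the extra reference described above can be sketched in user-space C11. This is a minimal model under stated assumptions, not the kernel implementation: struct ex_policy and the ex_policy_*() functions are hypothetical stand-ins for struct mempolicy and mpol_get()/mpol_put(), and the spin lock that protects the shared policy lookup is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mempolicy. */
struct ex_policy {
	atomic_int refcnt;	/* starts at 1: the installing task's reference */
	bool shared;		/* set for shared policies only */
};

struct ex_policy *ex_policy_new(bool shared)
{
	struct ex_policy *p = malloc(sizeof(*p));

	if (p) {
		atomic_init(&p->refcnt, 1);
		p->shared = shared;
	}
	return p;
}

/* Take an extra reference, as a shared policy lookup would. */
void ex_policy_get(struct ex_policy *p)
{
	assert(p != NULL);
	atomic_fetch_add(&p->refcnt, 1);
}

/* Drop one reference; returns true if this call freed the policy. */
bool ex_policy_put(struct ex_policy *p)
{
	if (atomic_fetch_sub(&p->refcnt, 1) != 1)
		return false;
	free(p);
	return true;
}

/*
 * Called at the end of a query or allocation path: only shared
 * policies took an extra reference during lookup, so only they
 * drop one here. Returns true if the policy was freed.
 */
bool ex_policy_cond_put(struct ex_policy *p)
{
	return p->shared ? ex_policy_put(p) : false;
}
```

In this model a task or vma policy passes through ex_policy_cond_put() without touching the counter, which is the point of marking shared policies: the common paths pay for the atomic operation only when the extra lookup reference exists.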
MEMORY POLICY APIs
Linux supports 3 system calls for controlling memory policy. These APIs
@@ -251,7 +380,9 @@ Set [Task] Memory Policy:
Sets the calling task's "task/process memory policy" to mode
specified by the 'mode' argument and the set of nodes defined
by 'nmask'. 'nmask' points to a bit mask of node ids containing
at least 'maxnode' ids.
at least 'maxnode' ids. Optional mode flags may be passed by
combining the 'mode' argument with the flag (for example:
MPOL_INTERLEAVE | MPOL_F_STATIC_NODES).
See the set_mempolicy(2) man page for more details
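How a combined 'mode' argument might be taken apart can be sketched as follows. This is an assumption-laden illustration, not the kernel's parsing code: the EX_-prefixed constants are hypothetical mirrors of the MPOL_* names, with values as I recall them from linux/mempolicy.h (real code should include that header instead), and split_mode_arg() is a made-up helper. It also applies the rule stated earlier that MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES cannot be combined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirrors of the MPOL_* constants; values are assumptions. */
enum { EX_MPOL_DEFAULT, EX_MPOL_PREFERRED, EX_MPOL_BIND, EX_MPOL_INTERLEAVE };
#define EX_MPOL_F_STATIC_NODES		(1 << 15)
#define EX_MPOL_F_RELATIVE_NODES	(1 << 14)
#define EX_MPOL_MODE_FLAGS \
	(EX_MPOL_F_STATIC_NODES | EX_MPOL_F_RELATIVE_NODES)

/*
 * Split a combined 'mode' argument into the policy mode and the
 * optional mode flags. Returns false for an unknown mode or for
 * the two mutually exclusive flags used together.
 */
bool split_mode_arg(int arg, int *mode, int *flags)
{
	assert(mode != NULL && flags != NULL);

	*flags = arg & EX_MPOL_MODE_FLAGS;
	*mode = arg & ~EX_MPOL_MODE_FLAGS;

	if (*mode < EX_MPOL_DEFAULT || *mode > EX_MPOL_INTERLEAVE)
		return false;
	if ((*flags & EX_MPOL_F_STATIC_NODES) &&
	    (*flags & EX_MPOL_F_RELATIVE_NODES))
		return false;
	return true;
}
```

Passing EX_MPOL_INTERLEAVE | EX_MPOL_F_STATIC_NODES yields the interleave mode with the static-nodes flag, while a mode with both flags set is rejected.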
@@ -303,29 +434,19 @@ MEMORY POLICIES AND CPUSETS
Memory policies work within cpusets as described above. For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints. If the nodemask
specified for the policy contains nodes that are not allowed by the cpuset, or
the intersection of the set of nodes specified for the policy and the set of
nodes with memory is the empty set, the policy is considered invalid
and cannot be installed.
specified for the policy contains nodes that are not allowed by the cpuset and
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
specified for the policy and the set of nodes with memory is used. If the
result is the empty set, the policy is considered invalid and cannot be
installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
onto and folded into the task's set of allowed nodes as previously described.
The interaction of memory policies and cpusets can be problematic for a
couple of reasons:
1) the memory policy APIs take physical node ids as arguments. As mentioned
above, it is illegal to specify nodes that are not allowed in the cpuset.
The application must query the allowed nodes using the get_mempolicy()
API with the MPOL_F_MEMS_ALLOWED flag to determine the allowed nodes and
restrict itself to those nodes. However, the resources available to a
cpuset can be changed by the system administrator, or a workload manager
application, at any time. So, a task may still get errors attempting to
specify policy nodes, and must query the allowed memories again.
2) when tasks in two cpusets share access to a memory region, such as shared
memory segments created by shmget() or mmap() with the MAP_ANONYMOUS and
MAP_SHARED flags, and any of the tasks install shared policy on the region,
only nodes whose memories are allowed in both cpusets may be used in the
policies. Obtaining this information requires "stepping outside" the
memory policy APIs to use the cpuset information and requires that one
know in which cpusets other tasks might be attaching to the shared region.
Furthermore, if the cpusets' allowed memory sets are disjoint, "local"
allocation is the only valid policy.
The interaction of memory policies and cpusets can be problematic when tasks
in two cpusets share access to a memory region, such as shared memory segments
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags. If
any of the tasks installs a shared policy on the region, only nodes whose
memories are allowed in both cpusets may be used in the policies. Obtaining
this information requires "stepping outside" the memory policy APIs to use the
cpuset information and requires that one know in which cpusets other tasks might
be attaching to the shared region. Furthermore, if the cpusets' allowed
memory sets are disjoint, "local" allocation is the only valid policy.
+51 -11
@@ -1,26 +1,61 @@
#
# Kbuild for top-level directory of the kernel
# This file takes care of the following:
# 1) Generate asm-offsets.h
# 2) Check for missing system calls
# 1) Generate bounds.h
# 2) Generate asm-offsets.h (may need bounds.h)
# 3) Check for missing system calls
#####
# 1) Generate asm-offsets.h
# 1) Generate bounds.h
bounds-file := include/linux/bounds.h
always := $(bounds-file)
targets := $(bounds-file) kernel/bounds.s
quiet_cmd_bounds = GEN $@
define cmd_bounds
(set -e; \
echo "#ifndef __LINUX_BOUNDS_H__"; \
echo "#define __LINUX_BOUNDS_H__"; \
echo "/*"; \
echo " * DO NOT MODIFY."; \
echo " *"; \
echo " * This file was generated by Kbuild"; \
echo " *"; \
echo " */"; \
echo ""; \
sed -ne $(sed-y) $<; \
echo ""; \
echo "#endif" ) > $@
endef
# We use internal kbuild rules to avoid the "is up to date" message from make
kernel/bounds.s: kernel/bounds.c FORCE
$(Q)mkdir -p $(dir $@)
$(call if_changed_dep,cc_s_c)
$(obj)/$(bounds-file): kernel/bounds.s Kbuild
$(Q)mkdir -p $(dir $@)
$(call cmd,bounds)
#####
# 2) Generate asm-offsets.h
#
offsets-file := include/asm-$(SRCARCH)/asm-offsets.h
always := $(offsets-file)
targets := $(offsets-file)
always += $(offsets-file)
targets += $(offsets-file)
targets += arch/$(SRCARCH)/kernel/asm-offsets.s
clean-files := $(addprefix $(objtree)/,$(targets))
# Default sed regexp - multiline due to syntax constraints
define sed-y
"/^->/{s:^->\([^ ]*\) [\$$#]*\([^ ]*\) \(.*\):#define \1 \2 /* \3 */:; s:->::; p;}"
"/^->/{s:->#\(.*\):/* \1 */:; \
s:^->\([^ ]*\) [\$$#]*\([^ ]*\) \(.*\):#define \1 \2 /* \3 */:; \
s:->::; p;}"
endef
# Override default regexp for specific architectures
sed-$(CONFIG_MIPS) := "/^@@@/{s/^@@@//; s/ \#.*\$$//; p;}"
quiet_cmd_offsets = GEN $@
define cmd_offsets
@@ -40,7 +75,8 @@ define cmd_offsets
endef
# We use internal kbuild rules to avoid the "is up to date" message from make
arch/$(SRCARCH)/kernel/asm-offsets.s: arch/$(SRCARCH)/kernel/asm-offsets.c FORCE
arch/$(SRCARCH)/kernel/asm-offsets.s: arch/$(SRCARCH)/kernel/asm-offsets.c \
$(obj)/$(bounds-file) FORCE
$(Q)mkdir -p $(dir $@)
$(call if_changed_dep,cc_s_c)
@@ -49,7 +85,7 @@ $(obj)/$(offsets-file): arch/$(SRCARCH)/kernel/asm-offsets.s Kbuild
$(call cmd,offsets)
#####
# 2) Check for missing system calls
# 3) Check for missing system calls
#
quiet_cmd_syscalls = CALL $<
@@ -58,3 +94,7 @@ quiet_cmd_syscalls = CALL $<
PHONY += missing-syscalls
missing-syscalls: scripts/checksyscalls.sh FORCE
$(call cmd,syscalls)
# Delete all targets during make clean
clean-files := $(addprefix $(objtree)/,$(targets))
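What the default sed-y expression in the Kbuild hunk above does to each marker line can be approximated in C. This is a sketch for illustration, not part of the build: offsets_line_to_c() is a hypothetical name, the sample marker lines are assumptions about what the generated .s file contains, and the function follows the three substitutions only for the common cases (comment marker, symbol/value/comment triple, bare marker).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Convert one "->" marker line from the generated assembly into a
 * header line. Returns 1 if the line was selected (it starts with
 * "->"), 0 otherwise; like 'sed -n', unselected lines produce nothing.
 */
int offsets_line_to_c(const char *line, char *out, size_t outsz)
{
	const char *sym, *sp1, *val, *sp2;

	assert(line != NULL && out != NULL && outsz > 0);

	if (strncmp(line, "->", 2) != 0)
		return 0;

	sym = line + 2;
	if (*sym == '#') {
		/* "->#text" becomes a C comment. */
		snprintf(out, outsz, "/* %s */", sym + 1);
		return 1;
	}

	sp1 = strchr(sym, ' ');
	if (sp1) {
		/* Skip the assembler's immediate prefix ('$' or '#'). */
		val = sp1 + 1;
		while (*val == '$' || *val == '#')
			val++;
		sp2 = strchr(val, ' ');
		if (sp2) {
			/* "->SYM $VAL EXPR" becomes a #define with a comment. */
			snprintf(out, outsz, "#define %.*s %.*s /* %s */",
				 (int)(sp1 - sym), sym,
				 (int)(sp2 - val), val, sp2 + 1);
			return 1;
		}
	}

	/* Anything else: just strip the "->" marker. */
	snprintf(out, outsz, "%s", sym);
	return 1;
}
```

For example, a marker line such as "->NR_PAGEFLAGS $22 __NR_PAGEFLAGS" would come out as "#define NR_PAGEFLAGS 22 /* __NR_PAGEFLAGS */", which is the form bounds.h and asm-offsets.h are built from.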
+3 -3
@@ -994,7 +994,7 @@ marvel_agp_configure(alpha_agp_info *agp)
* rate, but warn the user.
*/
printk("%s: unknown PLL setting RNGB=%lx (PLL6_CTL=%016lx)\n",
__FUNCTION__, IO7_PLL_RNGB(agp_pll), agp_pll);
__func__, IO7_PLL_RNGB(agp_pll), agp_pll);
break;
}
@@ -1044,13 +1044,13 @@ marvel_agp_translate(alpha_agp_info *agp, dma_addr_t addr)
if (addr < agp->aperture.bus_base ||
addr >= agp->aperture.bus_base + agp->aperture.size) {
printk("%s: addr out of range\n", __FUNCTION__);
printk("%s: addr out of range\n", __func__);
return -EINVAL;
}
pte = aper->arena->ptes[baddr >> PAGE_SHIFT];
if (!(pte & 1)) {
printk("%s: pte not valid\n", __FUNCTION__);
printk("%s: pte not valid\n", __func__);
return -EINVAL;
}
return (pte >> 1) << PAGE_SHIFT;
+9 -15
@@ -336,10 +336,7 @@ t2_direct_map_window1(unsigned long base, unsigned long length)
#if DEBUG_PRINT_FINAL_SETTINGS
printk("%s: setting WBASE1=0x%lx WMASK1=0x%lx TBASE1=0x%lx\n",
__FUNCTION__,
*(vulp)T2_WBASE1,
*(vulp)T2_WMASK1,
*(vulp)T2_TBASE1);
__func__, *(vulp)T2_WBASE1, *(vulp)T2_WMASK1, *(vulp)T2_TBASE1);
#endif
}
@@ -366,10 +363,7 @@ t2_sg_map_window2(struct pci_controller *hose,
#if DEBUG_PRINT_FINAL_SETTINGS
printk("%s: setting WBASE2=0x%lx WMASK2=0x%lx TBASE2=0x%lx\n",
__FUNCTION__,
*(vulp)T2_WBASE2,
*(vulp)T2_WMASK2,
*(vulp)T2_TBASE2);
__func__, *(vulp)T2_WBASE2, *(vulp)T2_WMASK2, *(vulp)T2_TBASE2);
#endif
}
@@ -377,15 +371,15 @@ static void __init
t2_save_configuration(void)
{
#if DEBUG_PRINT_INITIAL_SETTINGS
printk("%s: HAE_1 was 0x%lx\n", __FUNCTION__, srm_hae); /* HW is 0 */
printk("%s: HAE_2 was 0x%lx\n", __FUNCTION__, *(vulp)T2_HAE_2);
printk("%s: HAE_3 was 0x%lx\n", __FUNCTION__, *(vulp)T2_HAE_3);
printk("%s: HAE_4 was 0x%lx\n", __FUNCTION__, *(vulp)T2_HAE_4);
printk("%s: HBASE was 0x%lx\n", __FUNCTION__, *(vulp)T2_HBASE);
printk("%s: HAE_1 was 0x%lx\n", __func__, srm_hae); /* HW is 0 */
printk("%s: HAE_2 was 0x%lx\n", __func__, *(vulp)T2_HAE_2);
printk("%s: HAE_3 was 0x%lx\n", __func__, *(vulp)T2_HAE_3);
printk("%s: HAE_4 was 0x%lx\n", __func__, *(vulp)T2_HAE_4);
printk("%s: HBASE was 0x%lx\n", __func__, *(vulp)T2_HBASE);
printk("%s: WBASE1=0x%lx WMASK1=0x%lx TBASE1=0x%lx\n", __FUNCTION__,
printk("%s: WBASE1=0x%lx WMASK1=0x%lx TBASE1=0x%lx\n", __func__,
*(vulp)T2_WBASE1, *(vulp)T2_WMASK1, *(vulp)T2_TBASE1);
printk("%s: WBASE2=0x%lx WMASK2=0x%lx TBASE2=0x%lx\n", __FUNCTION__,
printk("%s: WBASE2=0x%lx WMASK2=0x%lx TBASE2=0x%lx\n", __func__,
*(vulp)T2_WBASE2, *(vulp)T2_WMASK2, *(vulp)T2_TBASE2);
#endif

Some files were not shown because too many files have changed in this diff.