You've already forked linux-apfs
mirror of
https://github.com/linux-apfs/linux-apfs.git
synced 2026-05-01 15:00:59 -07:00
Merge branch 'for-linus' into for-3.1/core
Conflicts: block/blk-throttle.c block/cfq-iosched.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This commit is contained in:
@@ -518,6 +518,14 @@ N: Zach Brown
|
||||
E: zab@zabbo.net
|
||||
D: maestro pci sound
|
||||
|
||||
M: David Brownell
|
||||
D: Kernel engineer, mentor, and friend. Maintained USB EHCI and
|
||||
D: gadget layers, SPI subsystem, GPIO subsystem, and more than a few
|
||||
D: device drivers. His encouragement also helped many engineers get
|
||||
D: started working on the Linux kernel. David passed away in early
|
||||
D: 2011, and will be greatly missed.
|
||||
W: https://lkml.org/lkml/2011/4/5/36
|
||||
|
||||
N: Gary Brubaker
|
||||
E: xavyer@ix.netcom.com
|
||||
D: USB Serial Empeg Empeg-car Mark I/II Driver
|
||||
|
||||
@@ -0,0 +1,56 @@
|
||||
What: /sys/class/backlight/<backlight>/<ambient light zone>_max
|
||||
What: /sys/class/backlight/<backlight>/l1_daylight_max
|
||||
What: /sys/class/backlight/<backlight>/l2_bright_max
|
||||
What: /sys/class/backlight/<backlight>/l3_office_max
|
||||
What: /sys/class/backlight/<backlight>/l4_indoor_max
|
||||
What: /sys/class/backlight/<backlight>/l5_dark_max
|
||||
Date: Mai 2011
|
||||
KernelVersion: 2.6.40
|
||||
Contact: device-drivers-devel@blackfin.uclinux.org
|
||||
Description:
|
||||
Control the maximum brightness for <ambient light zone>
|
||||
on this <backlight>. Values are between 0 and 127. This file
|
||||
will also show the brightness level stored for this
|
||||
<ambient light zone>.
|
||||
|
||||
What: /sys/class/backlight/<backlight>/<ambient light zone>_dim
|
||||
What: /sys/class/backlight/<backlight>/l2_bright_dim
|
||||
What: /sys/class/backlight/<backlight>/l3_office_dim
|
||||
What: /sys/class/backlight/<backlight>/l4_indoor_dim
|
||||
What: /sys/class/backlight/<backlight>/l5_dark_dim
|
||||
Date: Mai 2011
|
||||
KernelVersion: 2.6.40
|
||||
Contact: device-drivers-devel@blackfin.uclinux.org
|
||||
Description:
|
||||
Control the dim brightness for <ambient light zone>
|
||||
on this <backlight>. Values are between 0 and 127, typically
|
||||
set to 0. Full off when the backlight is disabled.
|
||||
This file will also show the dim brightness level stored for
|
||||
this <ambient light zone>.
|
||||
|
||||
What: /sys/class/backlight/<backlight>/ambient_light_level
|
||||
Date: Mai 2011
|
||||
KernelVersion: 2.6.40
|
||||
Contact: device-drivers-devel@blackfin.uclinux.org
|
||||
Description:
|
||||
Get conversion value of the light sensor.
|
||||
This value is updated every 80 ms (when the light sensor
|
||||
is enabled). Returns integer between 0 (dark) and
|
||||
8000 (max ambient brightness)
|
||||
|
||||
What: /sys/class/backlight/<backlight>/ambient_light_zone
|
||||
Date: Mai 2011
|
||||
KernelVersion: 2.6.40
|
||||
Contact: device-drivers-devel@blackfin.uclinux.org
|
||||
Description:
|
||||
Get/Set current ambient light zone. Reading returns
|
||||
integer between 1..5 (1 = daylight, 2 = bright, ..., 5 = dark).
|
||||
Writing a value between 1..5 forces the backlight controller
|
||||
to enter the corresponding ambient light zone.
|
||||
Writing 0 returns to normal/automatic ambient light level
|
||||
operation. The ambient light sensing feature on these devices
|
||||
is an extension to the API documented in
|
||||
Documentation/ABI/stable/sysfs-class-backlight.
|
||||
It can be enabled by writing the value stored in
|
||||
/sys/class/backlight/<backlight>/max_brightness to
|
||||
/sys/class/backlight/<backlight>/brightness.
|
||||
@@ -21,7 +21,7 @@ information will not be available.
|
||||
To extract cgroup statistics a utility very similar to getdelays.c
|
||||
has been developed, the sample output of the utility is shown below
|
||||
|
||||
~/balbir/cgroupstats # ./getdelays -C "/cgroup/a"
|
||||
~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup/a"
|
||||
sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0
|
||||
~/balbir/cgroupstats # ./getdelays -C "/cgroup"
|
||||
~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup"
|
||||
sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2
|
||||
|
||||
@@ -28,16 +28,19 @@ cgroups. Here is what you can do.
|
||||
- Enable group scheduling in CFQ
|
||||
CONFIG_CFQ_GROUP_IOSCHED=y
|
||||
|
||||
- Compile and boot into kernel and mount IO controller (blkio).
|
||||
- Compile and boot into kernel and mount IO controller (blkio); see
|
||||
cgroups.txt, Why are cgroups needed?.
|
||||
|
||||
mount -t cgroup -o blkio none /cgroup
|
||||
mount -t tmpfs cgroup_root /sys/fs/cgroup
|
||||
mkdir /sys/fs/cgroup/blkio
|
||||
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
|
||||
|
||||
- Create two cgroups
|
||||
mkdir -p /cgroup/test1/ /cgroup/test2
|
||||
mkdir -p /sys/fs/cgroup/blkio/test1/ /sys/fs/cgroup/blkio/test2
|
||||
|
||||
- Set weights of group test1 and test2
|
||||
echo 1000 > /cgroup/test1/blkio.weight
|
||||
echo 500 > /cgroup/test2/blkio.weight
|
||||
echo 1000 > /sys/fs/cgroup/blkio/test1/blkio.weight
|
||||
echo 500 > /sys/fs/cgroup/blkio/test2/blkio.weight
|
||||
|
||||
- Create two same size files (say 512MB each) on same disk (file1, file2) and
|
||||
launch two dd threads in different cgroup to read those files.
|
||||
@@ -46,12 +49,12 @@ cgroups. Here is what you can do.
|
||||
echo 3 > /proc/sys/vm/drop_caches
|
||||
|
||||
dd if=/mnt/sdb/zerofile1 of=/dev/null &
|
||||
echo $! > /cgroup/test1/tasks
|
||||
cat /cgroup/test1/tasks
|
||||
echo $! > /sys/fs/cgroup/blkio/test1/tasks
|
||||
cat /sys/fs/cgroup/blkio/test1/tasks
|
||||
|
||||
dd if=/mnt/sdb/zerofile2 of=/dev/null &
|
||||
echo $! > /cgroup/test2/tasks
|
||||
cat /cgroup/test2/tasks
|
||||
echo $! > /sys/fs/cgroup/blkio/test2/tasks
|
||||
cat /sys/fs/cgroup/blkio/test2/tasks
|
||||
|
||||
- At macro level, first dd should finish first. To get more precise data, keep
|
||||
on looking at (with the help of script), at blkio.disk_time and
|
||||
@@ -68,13 +71,13 @@ Throttling/Upper Limit policy
|
||||
- Enable throttling in block layer
|
||||
CONFIG_BLK_DEV_THROTTLING=y
|
||||
|
||||
- Mount blkio controller
|
||||
mount -t cgroup -o blkio none /cgroup/blkio
|
||||
- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)
|
||||
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
|
||||
|
||||
- Specify a bandwidth rate on particular device for root group. The format
|
||||
for policy is "<major>:<minor> <byes_per_second>".
|
||||
|
||||
echo "8:16 1048576" > /cgroup/blkio/blkio.read_bps_device
|
||||
echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.read_bps_device
|
||||
|
||||
Above will put a limit of 1MB/second on reads happening for root group
|
||||
on device having major/minor number 8:16.
|
||||
@@ -108,7 +111,7 @@ Hierarchical Cgroups
|
||||
CFQ and throttling will practically treat all groups at same level.
|
||||
|
||||
pivot
|
||||
/ | \ \
|
||||
/ / \ \
|
||||
root test1 test2 test3
|
||||
|
||||
Down the line we can implement hierarchical accounting/control support
|
||||
@@ -149,7 +152,7 @@ Proportional weight policy files
|
||||
|
||||
Following is the format.
|
||||
|
||||
#echo dev_maj:dev_minor weight > /path/to/cgroup/blkio.weight_device
|
||||
# echo dev_maj:dev_minor weight > blkio.weight_device
|
||||
Configure weight=300 on /dev/sdb (8:16) in this cgroup
|
||||
# echo 8:16 300 > blkio.weight_device
|
||||
# cat blkio.weight_device
|
||||
|
||||
@@ -138,11 +138,11 @@ With the ability to classify tasks differently for different resources
|
||||
the admin can easily set up a script which receives exec notifications
|
||||
and depending on who is launching the browser he can
|
||||
|
||||
# echo browser_pid > /mnt/<restype>/<userclass>/tasks
|
||||
# echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks
|
||||
|
||||
With only a single hierarchy, he now would potentially have to create
|
||||
a separate cgroup for every browser launched and associate it with
|
||||
approp network and other resource class. This may lead to
|
||||
appropriate network and other resource class. This may lead to
|
||||
proliferation of such cgroups.
|
||||
|
||||
Also lets say that the administrator would like to give enhanced network
|
||||
@@ -153,9 +153,9 @@ apps enhanced CPU power,
|
||||
With ability to write pids directly to resource classes, it's just a
|
||||
matter of :
|
||||
|
||||
# echo pid > /mnt/network/<new_class>/tasks
|
||||
# echo pid > /sys/fs/cgroup/network/<new_class>/tasks
|
||||
(after some time)
|
||||
# echo pid > /mnt/network/<orig_class>/tasks
|
||||
# echo pid > /sys/fs/cgroup/network/<orig_class>/tasks
|
||||
|
||||
Without this ability, he would have to split the cgroup into
|
||||
multiple separate ones and then associate the new cgroups with the
|
||||
@@ -310,21 +310,24 @@ subsystem, this is the case for the cpuset.
|
||||
To start a new job that is to be contained within a cgroup, using
|
||||
the "cpuset" cgroup subsystem, the steps are something like:
|
||||
|
||||
1) mkdir /dev/cgroup
|
||||
2) mount -t cgroup -ocpuset cpuset /dev/cgroup
|
||||
3) Create the new cgroup by doing mkdir's and write's (or echo's) in
|
||||
the /dev/cgroup virtual file system.
|
||||
4) Start a task that will be the "founding father" of the new job.
|
||||
5) Attach that task to the new cgroup by writing its pid to the
|
||||
/dev/cgroup tasks file for that cgroup.
|
||||
6) fork, exec or clone the job tasks from this founding father task.
|
||||
1) mount -t tmpfs cgroup_root /sys/fs/cgroup
|
||||
2) mkdir /sys/fs/cgroup/cpuset
|
||||
3) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
|
||||
4) Create the new cgroup by doing mkdir's and write's (or echo's) in
|
||||
the /sys/fs/cgroup virtual file system.
|
||||
5) Start a task that will be the "founding father" of the new job.
|
||||
6) Attach that task to the new cgroup by writing its pid to the
|
||||
/sys/fs/cgroup/cpuset/tasks file for that cgroup.
|
||||
7) fork, exec or clone the job tasks from this founding father task.
|
||||
|
||||
For example, the following sequence of commands will setup a cgroup
|
||||
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
|
||||
and then start a subshell 'sh' in that cgroup:
|
||||
|
||||
mount -t cgroup cpuset -ocpuset /dev/cgroup
|
||||
cd /dev/cgroup
|
||||
mount -t tmpfs cgroup_root /sys/fs/cgroup
|
||||
mkdir /sys/fs/cgroup/cpuset
|
||||
mount -t cgroup cpuset -ocpuset /sys/fs/cgroup/cpuset
|
||||
cd /sys/fs/cgroup/cpuset
|
||||
mkdir Charlie
|
||||
cd Charlie
|
||||
/bin/echo 2-3 > cpuset.cpus
|
||||
@@ -345,7 +348,7 @@ Creating, modifying, using the cgroups can be done through the cgroup
|
||||
virtual filesystem.
|
||||
|
||||
To mount a cgroup hierarchy with all available subsystems, type:
|
||||
# mount -t cgroup xxx /dev/cgroup
|
||||
# mount -t cgroup xxx /sys/fs/cgroup
|
||||
|
||||
The "xxx" is not interpreted by the cgroup code, but will appear in
|
||||
/proc/mounts so may be any useful identifying string that you like.
|
||||
@@ -354,23 +357,32 @@ Note: Some subsystems do not work without some user input first. For instance,
|
||||
if cpusets are enabled the user will have to populate the cpus and mems files
|
||||
for each new cgroup created before that group can be used.
|
||||
|
||||
As explained in section `1.2 Why are cgroups needed?' you should create
|
||||
different hierarchies of cgroups for each single resource or group of
|
||||
resources you want to control. Therefore, you should mount a tmpfs on
|
||||
/sys/fs/cgroup and create directories for each cgroup resource or resource
|
||||
group.
|
||||
|
||||
# mount -t tmpfs cgroup_root /sys/fs/cgroup
|
||||
# mkdir /sys/fs/cgroup/rg1
|
||||
|
||||
To mount a cgroup hierarchy with just the cpuset and memory
|
||||
subsystems, type:
|
||||
# mount -t cgroup -o cpuset,memory hier1 /dev/cgroup
|
||||
# mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
|
||||
|
||||
To change the set of subsystems bound to a mounted hierarchy, just
|
||||
remount with different options:
|
||||
# mount -o remount,cpuset,blkio hier1 /dev/cgroup
|
||||
# mount -o remount,cpuset,blkio hier1 /sys/fs/cgroup/rg1
|
||||
|
||||
Now memory is removed from the hierarchy and blkio is added.
|
||||
|
||||
Note this will add blkio to the hierarchy but won't remove memory or
|
||||
cpuset, because the new options are appended to the old ones:
|
||||
# mount -o remount,blkio /dev/cgroup
|
||||
# mount -o remount,blkio /sys/fs/cgroup/rg1
|
||||
|
||||
To Specify a hierarchy's release_agent:
|
||||
# mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
|
||||
xxx /dev/cgroup
|
||||
xxx /sys/fs/cgroup/rg1
|
||||
|
||||
Note that specifying 'release_agent' more than once will return failure.
|
||||
|
||||
@@ -379,17 +391,17 @@ when the hierarchy consists of a single (root) cgroup. Supporting
|
||||
the ability to arbitrarily bind/unbind subsystems from an existing
|
||||
cgroup hierarchy is intended to be implemented in the future.
|
||||
|
||||
Then under /dev/cgroup you can find a tree that corresponds to the
|
||||
tree of the cgroups in the system. For instance, /dev/cgroup
|
||||
Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the
|
||||
tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
|
||||
is the cgroup that holds the whole system.
|
||||
|
||||
If you want to change the value of release_agent:
|
||||
# echo "/sbin/new_release_agent" > /dev/cgroup/release_agent
|
||||
# echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
|
||||
|
||||
It can also be changed via remount.
|
||||
|
||||
If you want to create a new cgroup under /dev/cgroup:
|
||||
# cd /dev/cgroup
|
||||
If you want to create a new cgroup under /sys/fs/cgroup/rg1:
|
||||
# cd /sys/fs/cgroup/rg1
|
||||
# mkdir my_cgroup
|
||||
|
||||
Now you want to do something with this cgroup.
|
||||
|
||||
@@ -10,26 +10,25 @@ directly present in its group.
|
||||
|
||||
Accounting groups can be created by first mounting the cgroup filesystem.
|
||||
|
||||
# mkdir /cgroups
|
||||
# mount -t cgroup -ocpuacct none /cgroups
|
||||
# mount -t cgroup -ocpuacct none /sys/fs/cgroup
|
||||
|
||||
With the above step, the initial or the parent accounting group
|
||||
becomes visible at /cgroups. At bootup, this group includes all the
|
||||
tasks in the system. /cgroups/tasks lists the tasks in this cgroup.
|
||||
/cgroups/cpuacct.usage gives the CPU time (in nanoseconds) obtained by
|
||||
this group which is essentially the CPU time obtained by all the tasks
|
||||
With the above step, the initial or the parent accounting group becomes
|
||||
visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
|
||||
the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
|
||||
/sys/fs/cgroup/cpuacct.usage gives the CPU time (in nanoseconds) obtained
|
||||
by this group which is essentially the CPU time obtained by all the tasks
|
||||
in the system.
|
||||
|
||||
New accounting groups can be created under the parent group /cgroups.
|
||||
New accounting groups can be created under the parent group /sys/fs/cgroup.
|
||||
|
||||
# cd /cgroups
|
||||
# cd /sys/fs/cgroup
|
||||
# mkdir g1
|
||||
# echo $$ > g1
|
||||
|
||||
The above steps create a new group g1 and move the current shell
|
||||
process (bash) into it. CPU time consumed by this bash and its children
|
||||
can be obtained from g1/cpuacct.usage and the same is accumulated in
|
||||
/cgroups/cpuacct.usage also.
|
||||
/sys/fs/cgroup/cpuacct.usage also.
|
||||
|
||||
cpuacct.stat file lists a few statistics which further divide the
|
||||
CPU time obtained by the cgroup into user and system times. Currently
|
||||
|
||||
@@ -661,21 +661,21 @@ than stress the kernel.
|
||||
|
||||
To start a new job that is to be contained within a cpuset, the steps are:
|
||||
|
||||
1) mkdir /dev/cpuset
|
||||
2) mount -t cgroup -ocpuset cpuset /dev/cpuset
|
||||
1) mkdir /sys/fs/cgroup/cpuset
|
||||
2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
|
||||
3) Create the new cpuset by doing mkdir's and write's (or echo's) in
|
||||
the /dev/cpuset virtual file system.
|
||||
the /sys/fs/cgroup/cpuset virtual file system.
|
||||
4) Start a task that will be the "founding father" of the new job.
|
||||
5) Attach that task to the new cpuset by writing its pid to the
|
||||
/dev/cpuset tasks file for that cpuset.
|
||||
/sys/fs/cgroup/cpuset tasks file for that cpuset.
|
||||
6) fork, exec or clone the job tasks from this founding father task.
|
||||
|
||||
For example, the following sequence of commands will setup a cpuset
|
||||
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
|
||||
and then start a subshell 'sh' in that cpuset:
|
||||
|
||||
mount -t cgroup -ocpuset cpuset /dev/cpuset
|
||||
cd /dev/cpuset
|
||||
mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
|
||||
cd /sys/fs/cgroup/cpuset
|
||||
mkdir Charlie
|
||||
cd Charlie
|
||||
/bin/echo 2-3 > cpuset.cpus
|
||||
@@ -710,14 +710,14 @@ Creating, modifying, using the cpusets can be done through the cpuset
|
||||
virtual filesystem.
|
||||
|
||||
To mount it, type:
|
||||
# mount -t cgroup -o cpuset cpuset /dev/cpuset
|
||||
# mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset
|
||||
|
||||
Then under /dev/cpuset you can find a tree that corresponds to the
|
||||
tree of the cpusets in the system. For instance, /dev/cpuset
|
||||
Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
|
||||
tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
|
||||
is the cpuset that holds the whole system.
|
||||
|
||||
If you want to create a new cpuset under /dev/cpuset:
|
||||
# cd /dev/cpuset
|
||||
If you want to create a new cpuset under /sys/fs/cgroup/cpuset:
|
||||
# cd /sys/fs/cgroup/cpuset
|
||||
# mkdir my_cpuset
|
||||
|
||||
Now you want to do something with this cpuset.
|
||||
@@ -765,12 +765,12 @@ wrapper around the cgroup filesystem.
|
||||
|
||||
The command
|
||||
|
||||
mount -t cpuset X /dev/cpuset
|
||||
mount -t cpuset X /sys/fs/cgroup/cpuset
|
||||
|
||||
is equivalent to
|
||||
|
||||
mount -t cgroup -ocpuset,noprefix X /dev/cpuset
|
||||
echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
|
||||
mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
|
||||
echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent
|
||||
|
||||
2.2 Adding/removing cpus
|
||||
------------------------
|
||||
|
||||
@@ -22,16 +22,16 @@ removed from the child(ren).
|
||||
An entry is added using devices.allow, and removed using
|
||||
devices.deny. For instance
|
||||
|
||||
echo 'c 1:3 mr' > /cgroups/1/devices.allow
|
||||
echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow
|
||||
|
||||
allows cgroup 1 to read and mknod the device usually known as
|
||||
/dev/null. Doing
|
||||
|
||||
echo a > /cgroups/1/devices.deny
|
||||
echo a > /sys/fs/cgroup/1/devices.deny
|
||||
|
||||
will remove the default 'a *:* rwm' entry. Doing
|
||||
|
||||
echo a > /cgroups/1/devices.allow
|
||||
echo a > /sys/fs/cgroup/1/devices.allow
|
||||
|
||||
will add the 'a *:* rwm' entry to the whitelist.
|
||||
|
||||
|
||||
@@ -59,28 +59,28 @@ is non-freezable.
|
||||
|
||||
* Examples of usage :
|
||||
|
||||
# mkdir /containers
|
||||
# mount -t cgroup -ofreezer freezer /containers
|
||||
# mkdir /containers/0
|
||||
# echo $some_pid > /containers/0/tasks
|
||||
# mkdir /sys/fs/cgroup/freezer
|
||||
# mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer
|
||||
# mkdir /sys/fs/cgroup/freezer/0
|
||||
# echo $some_pid > /sys/fs/cgroup/freezer/0/tasks
|
||||
|
||||
to get status of the freezer subsystem :
|
||||
|
||||
# cat /containers/0/freezer.state
|
||||
# cat /sys/fs/cgroup/freezer/0/freezer.state
|
||||
THAWED
|
||||
|
||||
to freeze all tasks in the container :
|
||||
|
||||
# echo FROZEN > /containers/0/freezer.state
|
||||
# cat /containers/0/freezer.state
|
||||
# echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
|
||||
# cat /sys/fs/cgroup/freezer/0/freezer.state
|
||||
FREEZING
|
||||
# cat /containers/0/freezer.state
|
||||
# cat /sys/fs/cgroup/freezer/0/freezer.state
|
||||
FROZEN
|
||||
|
||||
to unfreeze all tasks in the container :
|
||||
|
||||
# echo THAWED > /containers/0/freezer.state
|
||||
# cat /containers/0/freezer.state
|
||||
# echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state
|
||||
# cat /sys/fs/cgroup/freezer/0/freezer.state
|
||||
THAWED
|
||||
|
||||
This is the basic mechanism which should do the right thing for user space task
|
||||
|
||||
@@ -1,8 +1,8 @@
|
||||
Memory Resource Controller
|
||||
|
||||
NOTE: The Memory Resource Controller has been generically been referred
|
||||
to as the memory controller in this document. Do not confuse memory
|
||||
controller used here with the memory controller that is used in hardware.
|
||||
NOTE: The Memory Resource Controller has generically been referred to as the
|
||||
memory controller in this document. Do not confuse memory controller
|
||||
used here with the memory controller that is used in hardware.
|
||||
|
||||
(For editors)
|
||||
In this document:
|
||||
@@ -70,6 +70,7 @@ Brief summary of control files.
|
||||
(See sysctl's vm.swappiness)
|
||||
memory.move_charge_at_immigrate # set/show controls of moving charges
|
||||
memory.oom_control # set/show oom controls.
|
||||
memory.numa_stat # show the number of memory usage per numa node
|
||||
|
||||
1. History
|
||||
|
||||
@@ -181,7 +182,7 @@ behind this approach is that a cgroup that aggressively uses a shared
|
||||
page will eventually get charged for it (once it is uncharged from
|
||||
the cgroup that brought it in -- this will happen on memory pressure).
|
||||
|
||||
Exception: If CONFIG_CGROUP_CGROUP_MEM_RES_CTLR_SWAP is not used..
|
||||
Exception: If CONFIG_CGROUP_CGROUP_MEM_RES_CTLR_SWAP is not used.
|
||||
When you do swapoff and make swapped-out pages of shmem(tmpfs) to
|
||||
be backed into memory in force, charges for pages are accounted against the
|
||||
caller of swapoff rather than the users of shmem.
|
||||
@@ -213,7 +214,7 @@ affecting global LRU, memory+swap limit is better than just limiting swap from
|
||||
OS point of view.
|
||||
|
||||
* What happens when a cgroup hits memory.memsw.limit_in_bytes
|
||||
When a cgroup his memory.memsw.limit_in_bytes, it's useless to do swap-out
|
||||
When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
|
||||
in this cgroup. Then, swap-out will not be done by cgroup routine and file
|
||||
caches are dropped. But as mentioned above, global LRU can do swapout memory
|
||||
from it for sanity of the system's memory management state. You can't forbid
|
||||
@@ -263,16 +264,17 @@ b. Enable CONFIG_RESOURCE_COUNTERS
|
||||
c. Enable CONFIG_CGROUP_MEM_RES_CTLR
|
||||
d. Enable CONFIG_CGROUP_MEM_RES_CTLR_SWAP (to use swap extension)
|
||||
|
||||
1. Prepare the cgroups
|
||||
# mkdir -p /cgroups
|
||||
# mount -t cgroup none /cgroups -o memory
|
||||
1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
|
||||
# mount -t tmpfs none /sys/fs/cgroup
|
||||
# mkdir /sys/fs/cgroup/memory
|
||||
# mount -t cgroup none /sys/fs/cgroup/memory -o memory
|
||||
|
||||
2. Make the new group and move bash into it
|
||||
# mkdir /cgroups/0
|
||||
# echo $$ > /cgroups/0/tasks
|
||||
# mkdir /sys/fs/cgroup/memory/0
|
||||
# echo $$ > /sys/fs/cgroup/memory/0/tasks
|
||||
|
||||
Since now we're in the 0 cgroup, we can alter the memory limit:
|
||||
# echo 4M > /cgroups/0/memory.limit_in_bytes
|
||||
# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
|
||||
|
||||
NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
|
||||
mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)
|
||||
@@ -280,11 +282,11 @@ mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)
|
||||
NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited).
|
||||
NOTE: We cannot set limits on the root cgroup any more.
|
||||
|
||||
# cat /cgroups/0/memory.limit_in_bytes
|
||||
# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
|
||||
4194304
|
||||
|
||||
We can check the usage:
|
||||
# cat /cgroups/0/memory.usage_in_bytes
|
||||
# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
|
||||
1216512
|
||||
|
||||
A successful write to this file does not guarantee a successful set of
|
||||
@@ -464,6 +466,24 @@ value for efficient access. (Of course, when necessary, it's synchronized.)
|
||||
If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP)
|
||||
value in memory.stat(see 5.2).
|
||||
|
||||
5.6 numa_stat
|
||||
|
||||
This is similar to numa_maps but operates on a per-memcg basis. This is
|
||||
useful for providing visibility into the numa locality information within
|
||||
an memcg since the pages are allowed to be allocated from any physical
|
||||
node. One of the usecases is evaluating application performance by
|
||||
combining this information with the application's cpu allocation.
|
||||
|
||||
We export "total", "file", "anon" and "unevictable" pages per-node for
|
||||
each memcg. The ouput format of memory.numa_stat is:
|
||||
|
||||
total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
|
||||
file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
|
||||
anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
|
||||
unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
|
||||
|
||||
And we have total = file + anon + unevictable.
|
||||
|
||||
6. Hierarchy support
|
||||
|
||||
The memory controller supports a deep hierarchy and hierarchical accounting.
|
||||
@@ -471,13 +491,13 @@ The hierarchy is created by creating the appropriate cgroups in the
|
||||
cgroup filesystem. Consider for example, the following cgroup filesystem
|
||||
hierarchy
|
||||
|
||||
root
|
||||
root
|
||||
/ | \
|
||||
/ | \
|
||||
a b c
|
||||
| \
|
||||
| \
|
||||
d e
|
||||
/ | \
|
||||
a b c
|
||||
| \
|
||||
| \
|
||||
d e
|
||||
|
||||
In the diagram above, with hierarchical accounting enabled, all memory
|
||||
usage of e, is accounted to its ancestors up until the root (i.e, c and root),
|
||||
|
||||
@@ -481,23 +481,6 @@ Who: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
|
||||
|
||||
----------------------------
|
||||
|
||||
What: namespace cgroup (ns_cgroup)
|
||||
When: 2.6.38
|
||||
Why: The ns_cgroup leads to some problems:
|
||||
* cgroup creation is out-of-control
|
||||
* cgroup name can conflict when pids are looping
|
||||
* it is not possible to have a single process handling
|
||||
a lot of namespaces without falling in a exponential creation time
|
||||
* we may want to create a namespace without creating a cgroup
|
||||
|
||||
The ns_cgroup is replaced by a compatibility flag 'clone_children',
|
||||
where a newly created cgroup will copy the parent cgroup values.
|
||||
The userspace has to manually create a cgroup and add a task to
|
||||
the 'tasks' file.
|
||||
Who: Daniel Lezcano <daniel.lezcano@free.fr>
|
||||
|
||||
----------------------------
|
||||
|
||||
What: iwlwifi disable_hw_scan module parameters
|
||||
When: 2.6.40
|
||||
Why: Hareware scan is the prefer method for iwlwifi devices for
|
||||
|
||||
@@ -843,6 +843,7 @@ Provides counts of softirq handlers serviced since boot time, for each cpu.
|
||||
TASKLET: 0 0 0 290
|
||||
SCHED: 27035 26983 26971 26746
|
||||
HRTIMER: 0 0 0 0
|
||||
RCU: 1678 1769 2178 2250
|
||||
|
||||
|
||||
1.3 IDE devices in /proc/ide
|
||||
|
||||
@@ -2598,6 +2598,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
|
||||
unlock ejectable media);
|
||||
m = MAX_SECTORS_64 (don't transfer more
|
||||
than 64 sectors = 32 KB at a time);
|
||||
n = INITIAL_READ10 (force a retry of the
|
||||
initial READ(10) command);
|
||||
o = CAPACITY_OK (accept the capacity
|
||||
reported by the device);
|
||||
r = IGNORE_RESIDUE (the device reports
|
||||
|
||||
@@ -11,7 +11,9 @@ with the difference that the orphan objects are not freed but only
|
||||
reported via /sys/kernel/debug/kmemleak. A similar method is used by the
|
||||
Valgrind tool (memcheck --leak-check) to detect the memory leaks in
|
||||
user-space applications.
|
||||
Kmemleak is supported on x86, arm, powerpc, sparc, sh, microblaze and tile.
|
||||
|
||||
Please check DEBUG_KMEMLEAK dependencies in lib/Kconfig.debug for supported
|
||||
architectures.
|
||||
|
||||
Usage
|
||||
-----
|
||||
|
||||
@@ -555,7 +555,7 @@ also have
|
||||
sync_min
|
||||
sync_max
|
||||
The two values, given as numbers of sectors, indicate a range
|
||||
withing the array where 'check'/'repair' will operate. Must be
|
||||
within the array where 'check'/'repair' will operate. Must be
|
||||
a multiple of chunk_size. When it reaches "sync_max" it will
|
||||
pause, rather than complete.
|
||||
You can use 'select' or 'poll' on "sync_completed" to wait for
|
||||
|
||||
@@ -520,59 +520,20 @@ Support for power domains is provided through the pwr_domain field of struct
|
||||
device. This field is a pointer to an object of type struct dev_power_domain,
|
||||
defined in include/linux/pm.h, providing a set of power management callbacks
|
||||
analogous to the subsystem-level and device driver callbacks that are executed
|
||||
for the given device during all power transitions, in addition to the respective
|
||||
subsystem-level callbacks. Specifically, the power domain "suspend" callbacks
|
||||
(i.e. ->runtime_suspend(), ->suspend(), ->freeze(), ->poweroff(), etc.) are
|
||||
executed after the analogous subsystem-level callbacks, while the power domain
|
||||
"resume" callbacks (i.e. ->runtime_resume(), ->resume(), ->thaw(), ->restore,
|
||||
etc.) are executed before the analogous subsystem-level callbacks. Error codes
|
||||
returned by the "suspend" and "resume" power domain callbacks are ignored.
|
||||
for the given device during all power transitions, instead of the respective
|
||||
subsystem-level callbacks. Specifically, if a device's pm_domain pointer is
|
||||
not NULL, the ->suspend() callback from the object pointed to by it will be
|
||||
executed instead of its subsystem's (e.g. bus type's) ->suspend() callback and
|
||||
anlogously for all of the remaining callbacks. In other words, power management
|
||||
domain callbacks, if defined for the given device, always take precedence over
|
||||
the callbacks provided by the device's subsystem (e.g. bus type).
|
||||
|
||||
Power domain ->runtime_idle() callback is executed before the subsystem-level
|
||||
->runtime_idle() callback and the result returned by it is not ignored. Namely,
|
||||
if it returns error code, the subsystem-level ->runtime_idle() callback will not
|
||||
be called and the helper function rpm_idle() executing it will return error
|
||||
code. This mechanism is intended to help platforms where saving device state
|
||||
is a time consuming operation and should only be carried out if all devices
|
||||
in the power domain are idle, before turning off the shared power resource(s).
|
||||
Namely, the power domain ->runtime_idle() callback may return error code until
|
||||
the pm_runtime_idle() helper (or its asychronous version) has been called for
|
||||
all devices in the power domain (it is recommended that the returned error code
|
||||
be -EBUSY in those cases), preventing the subsystem-level ->runtime_idle()
|
||||
callback from being run prematurely.
|
||||
|
||||
The support for device power domains is only relevant to platforms needing to
|
||||
use the same subsystem-level (e.g. platform bus type) and device driver power
|
||||
management callbacks in many different power domain configurations and wanting
|
||||
to avoid incorporating the support for power domains into the subsystem-level
|
||||
callbacks. The other platforms need not implement it or take it into account
|
||||
in any way.
|
||||
|
||||
|
||||
System Devices
|
||||
--------------
|
||||
System devices (sysdevs) follow a slightly different API, which can be found in
|
||||
|
||||
include/linux/sysdev.h
|
||||
drivers/base/sys.c
|
||||
|
||||
System devices will be suspended with interrupts disabled, and after all other
|
||||
devices have been suspended. On resume, they will be resumed before any other
|
||||
devices, and also with interrupts disabled. These things occur in special
|
||||
"sysdev_driver" phases, which affect only system devices.
|
||||
|
||||
Thus, after the suspend_noirq (or freeze_noirq or poweroff_noirq) phase, when
|
||||
the non-boot CPUs are all offline and IRQs are disabled on the remaining online
|
||||
CPU, then a sysdev_driver.suspend phase is carried out, and the system enters a
|
||||
sleep state (or a system image is created). During resume (or after the image
|
||||
has been created or loaded) a sysdev_driver.resume phase is carried out, IRQs
|
||||
are enabled on the only online CPU, the non-boot CPUs are enabled, and the
|
||||
resume_noirq (or thaw_noirq or restore_noirq) phase begins.
|
||||
|
||||
Code to actually enter and exit the system-wide low power state sometimes
|
||||
involves hardware details that are only known to the boot firmware, and
|
||||
may leave a CPU running software (from SRAM or flash memory) that monitors
|
||||
the system and manages its wakeup sequence.
|
||||
The support for device power management domains is only relevant to platforms
|
||||
needing to use the same device driver power management callbacks in many
|
||||
different power domain configurations and wanting to avoid incorporating the
|
||||
support for power domains into subsystem-level callbacks, for example by
|
||||
modifying the platform bus type. Other platforms need not implement it or take
|
||||
it into account in any way.
|
||||
|
||||
|
||||
Device Low Power (suspend) States
|
||||
|
||||
@@ -566,11 +566,6 @@ to do this is:
|
||||
pm_runtime_set_active(dev);
|
||||
pm_runtime_enable(dev);
|
||||
|
||||
The PM core always increments the run-time usage counter before calling the
|
||||
->prepare() callback and decrements it after calling the ->complete() callback.
|
||||
Hence disabling run-time PM temporarily like this will not cause any run-time
|
||||
suspend callbacks to be lost.
|
||||
|
||||
7. Generic subsystem callbacks
|
||||
|
||||
Subsystems may wish to conserve code space by using the set of generic power
|
||||
|
||||
@@ -9,7 +9,121 @@ If variable is of Type, use printk format specifier:
|
||||
size_t %zu or %zx
|
||||
ssize_t %zd or %zx
|
||||
|
||||
Raw pointer value SHOULD be printed with %p.
|
||||
Raw pointer value SHOULD be printed with %p. The kernel supports
|
||||
the following extended format specifiers for pointer types:
|
||||
|
||||
Symbols/Function Pointers:
|
||||
|
||||
%pF versatile_init+0x0/0x110
|
||||
%pf versatile_init
|
||||
%pS versatile_init+0x0/0x110
|
||||
%ps versatile_init
|
||||
%pB prev_fn_of_versatile_init+0x88/0x88
|
||||
|
||||
For printing symbols and function pointers. The 'S' and 's' specifiers
|
||||
result in the symbol name with ('S') or without ('s') offsets. Where
|
||||
this is used on a kernel without KALLSYMS - the symbol address is
|
||||
printed instead.
|
||||
|
||||
The 'B' specifier results in the symbol name with offsets and should be
|
||||
used when printing stack backtraces. The specifier takes into
|
||||
consideration the effect of compiler optimisations which may occur
|
||||
when tail-call's are used and marked with the noreturn GCC attribute.
|
||||
|
||||
On ia64, ppc64 and parisc64 architectures function pointers are
|
||||
actually function descriptors which must first be resolved. The 'F' and
|
||||
'f' specifiers perform this resolution and then provide the same
|
||||
functionality as the 'S' and 's' specifiers.
|
||||
|
||||
Kernel Pointers:
|
||||
|
||||
%pK 0x01234567 or 0x0123456789abcdef
|
||||
|
||||
For printing kernel pointers which should be hidden from unprivileged
|
||||
users. The behaviour of %pK depends on the kptr_restrict sysctl - see
|
||||
Documentation/sysctl/kernel.txt for more details.
|
||||
|
||||
Struct Resources:
|
||||
|
||||
%pr [mem 0x60000000-0x6fffffff flags 0x2200] or
|
||||
[mem 0x0000000060000000-0x000000006fffffff flags 0x2200]
|
||||
%pR [mem 0x60000000-0x6fffffff pref] or
|
||||
[mem 0x0000000060000000-0x000000006fffffff pref]
|
||||
|
||||
For printing struct resources. The 'R' and 'r' specifiers result in a
|
||||
printed resource with ('R') or without ('r') a decoded flags member.
|
||||
|
||||
MAC/FDDI addresses:
|
||||
|
||||
%pM 00:01:02:03:04:05
|
||||
%pMF 00-01-02-03-04-05
|
||||
%pm 000102030405
|
||||
|
||||
For printing 6-byte MAC/FDDI addresses in hex notation. The 'M' and 'm'
|
||||
specifiers result in a printed address with ('M') or without ('m') byte
|
||||
separators. The default byte separator is the colon (':').
|
||||
|
||||
Where FDDI addresses are concerned the 'F' specifier can be used after
|
||||
the 'M' specifier to use dash ('-') separators instead of the default
|
||||
separator.
|
||||
|
||||
IPv4 addresses:
|
||||
|
||||
%pI4 1.2.3.4
|
||||
%pi4 001.002.003.004
|
||||
%p[Ii][hnbl]
|
||||
|
||||
For printing IPv4 dot-separated decimal addresses. The 'I4' and 'i4'
|
||||
specifiers result in a printed address with ('i4') or without ('I4')
|
||||
leading zeros.
|
||||
|
||||
The additional 'h', 'n', 'b', and 'l' specifiers are used to specify
|
||||
host, network, big or little endian order addresses respectively. Where
|
||||
no specifier is provided the default network/big endian order is used.
|
||||
|
||||
IPv6 addresses:
|
||||
|
||||
%pI6 0001:0002:0003:0004:0005:0006:0007:0008
|
||||
%pi6 00010002000300040005000600070008
|
||||
%pI6c 1:2:3:4:5:6:7:8
|
||||
|
||||
For printing IPv6 network-order 16-bit hex addresses. The 'I6' and 'i6'
|
||||
specifiers result in a printed address with ('I6') or without ('i6')
|
||||
colon-separators. Leading zeros are always used.
|
||||
|
||||
The additional 'c' specifier can be used with the 'I' specifier to
|
||||
print a compressed IPv6 address as described by
|
||||
http://tools.ietf.org/html/rfc5952
|
||||
|
||||
UUID/GUID addresses:
|
||||
|
||||
%pUb 00010203-0405-0607-0809-0a0b0c0d0e0f
|
||||
%pUB 00010203-0405-0607-0809-0A0B0C0D0E0F
|
||||
%pUl 03020100-0504-0706-0809-0a0b0c0e0e0f
|
||||
%pUL 03020100-0504-0706-0809-0A0B0C0E0E0F
|
||||
|
||||
For printing 16-byte UUID/GUIDs addresses. The additional 'l', 'L',
|
||||
'b' and 'B' specifiers are used to specify a little endian order in
|
||||
lower ('l') or upper case ('L') hex characters - and big endian order
|
||||
in lower ('b') or upper case ('B') hex characters.
|
||||
|
||||
Where no additional specifiers are used the default little endian
|
||||
order with lower case hex characters will be printed.
|
||||
|
||||
struct va_format:
|
||||
|
||||
%pV
|
||||
|
||||
For printing struct va_format structures. These contain a format string
|
||||
and va_list as follows:
|
||||
|
||||
struct va_format {
|
||||
const char *fmt;
|
||||
va_list *va;
|
||||
};
|
||||
|
||||
Do not use this feature without some mechanism to verify the
|
||||
correctness of the format string and va_list arguments.
|
||||
|
||||
u64 SHOULD be printed with %llu/%llx, (unsigned long long):
|
||||
|
||||
@@ -32,4 +146,5 @@ Reminder: sizeof() result is of type size_t.
|
||||
Thank you for your cooperation and attention.
|
||||
|
||||
|
||||
By Randy Dunlap <rdunlap@xenotime.net>
|
||||
By Randy Dunlap <rdunlap@xenotime.net> and
|
||||
Andrew Murray <amurray@mpc-data.co.uk>
|
||||
|
||||
@@ -223,9 +223,10 @@ When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
|
||||
group created using the pseudo filesystem. See example steps below to create
|
||||
task groups and modify their CPU share using the "cgroups" pseudo filesystem.
|
||||
|
||||
# mkdir /dev/cpuctl
|
||||
# mount -t cgroup -ocpu none /dev/cpuctl
|
||||
# cd /dev/cpuctl
|
||||
# mount -t tmpfs cgroup_root /sys/fs/cgroup
|
||||
# mkdir /sys/fs/cgroup/cpu
|
||||
# mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
|
||||
# cd /sys/fs/cgroup/cpu
|
||||
|
||||
# mkdir multimedia # create "multimedia" group of tasks
|
||||
# mkdir browser # create "browser" group of tasks
|
||||
|
||||
@@ -129,9 +129,8 @@ priority!
|
||||
Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real
|
||||
CPU bandwidth to task groups.
|
||||
|
||||
This uses the /cgroup virtual file system and
|
||||
"/cgroup/<cgroup>/cpu.rt_runtime_us" to control the CPU time reserved for each
|
||||
control group.
|
||||
This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us"
|
||||
to control the CPU time reserved for each control group.
|
||||
|
||||
For more information on working with control groups, you should read
|
||||
Documentation/cgroups/cgroups.txt as well.
|
||||
@@ -150,7 +149,7 @@ For now, this can be simplified to just the following (but see Future plans):
|
||||
===============
|
||||
|
||||
There is work in progress to make the scheduling period for each group
|
||||
("/cgroup/<cgroup>/cpu.rt_period_us") configurable as well.
|
||||
("<cgroup>/cpu.rt_period_us") configurable as well.
|
||||
|
||||
The constraint on the period is that a subgroup must have a smaller or
|
||||
equal period to its parent. But realistically its not very useful _yet_
|
||||
|
||||
Some files were not shown because too many files have changed in this diff Show More
Reference in New Issue
Block a user