Subjct: Revert memory cgroup dirty_ratio Documentation.
The commit ece72400c2 adds documentation
for memcg's dirty ratio. But the function is not implemented yet.
Remove the documentation for avoiding confusing users.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...
o Allow hierarchical cgroup creation for blkio controller
o Currently we disallow it as both the io controller policies (throttling
as well as proportion bandwidth) do not support hierarhical accounting
and control. But the flip side is that blkio controller can not be used with
libvirt as libvirt creates a cgroup hierarchy deeper than 1 level.
<top-level-cgroup-dir>/<controller>/libvirt/qemu/<virtual-machine-groups>
o So this patch will allow creation of cgroup hierarhcy but at the backend
everything will be treated as flat. So if somebody created a an hierarchy
like as follows.
root
/ \
test1 test2
|
test3
CFQ and throttling will practically treat all groups at same level.
pivot
/ | \ \
root test1 test2 test3
o Once we have actual support for hierarchical accounting and control
then we can introduce another cgroup tunable file "blkio.use_hierarchy"
which will be 0 by default but if user wants to enforce hierarhical
control then it can be set to 1. This way there should not be any
ABI problems down the line.
o The only not so pretty part is introduction of extra file "use_hierarchy"
down the line. Kame-san had mentioned that hierarhical accounting is
expensive in memory controller hence they keep it off by default. I
suspect same will be the case for IO controller also as for each IO
completion we shall have to account IO through hierarchy up to the root.
if yes, then it probably is not a very bad idea to introduce this extra
file so that it will be used only when somebody needs it and some people
might enable hierarchy only in part of the hierarchy.
o This is how basically memory controller also uses "use_hierarhcy" and
they also allowed creation of hierarchies when actual backend support
was not available.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Reviewed-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
Tested-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The ns_cgroup is a control group interacting with the namespaces. When a
new namespace is created, a corresponding cgroup is automatically created
too. The cgroup name is the pid of the process who did 'unshare' or the
child of 'clone'.
This cgroup is tied with the namespace because it prevents a process to
escape the control group and use the post_clone callback, so the child
cgroup inherits the values of the parent cgroup.
Unfortunately, the more we use this cgroup and the more we are facing
problems with it:
(1) when a process unshares, the cgroup name may conflict with a
previous cgroup with the same pid, so unshare or clone return -EEXIST
(2) the cgroup creation is out of control because there may have an
application creating several namespaces where the system will
automatically create several cgroups in his back and let them on the
cgroupfs (eg. a vrf based on the network namespace).
(3) the mix of (1) and (2) force an administrator to regularly check
and clean these cgroups.
This patchset removes the ns_cgroup by adding a new flag to the cgroup and
the cgroupfs mount option. It enables the copy of the parent cgroup when
a child cgroup is created. We can then safely remove the ns_cgroup as
this flag brings a compatibility. We have now to manually create and add
the task to a cgroup, which is consistent with the cgroup framework.
This patch:
Sent as an answer to a previous thread around the ns_cgroup.
https://lists.linux-foundation.org/pipermail/containers/2009-June/018627.html
It adds a control file 'clone_children' for a cgroup. This control file
is a boolean specifying if the child cgroup should be a clone of the
parent cgroup or not. The default value is 'false'.
This flag makes the child cgroup to call the post_clone callback of all
the subsystem, if it is available.
At present, the cpuset is the only one which had implemented the
post_clone callback.
The option can be set at mount time by specifying the 'clone_children'
mount option.
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jamal Hadi Salim <hadi@cyberus.ca>
Cc: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits)
cfq-iosched: Fix a gcc 4.5 warning and put some comments
block: Turn bvec_k{un,}map_irq() into static inline functions
block: fix accounting bug on cross partition merges
block: Make the integrity mapped property a bio flag
block: Fix double free in blk_integrity_unregister
block: Ensure physical block size is unsigned int
blkio-throttle: Fix possible multiplication overflow in iops calculations
blkio-throttle: limit max iops value to UINT_MAX
blkio-throttle: There is no need to convert jiffies to milli seconds
blkio-throttle: Fix link failure failure on i386
blkio: Recalculate the throttled bio dispatch time upon throttle limit change
blkio: Add root group to td->tg_list
blkio: deletion of a cgroup was causes oops
blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: revert bad fix for memory hotplug causing bounces
Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: Prevent hang_check firing during long I/O
cfq: improve fsync performance for small files
...
Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h
Below you will find an updated version from the original series bunching all patches into one big patch
updating broken web addresses that are located in Documentation/*
Some of the addresses date as far far back as 1995 etc... so searching became a bit difficult,
the best way to deal with these is to use web.archive.org to locate these addresses that are outdated.
Now there are also some addresses pointing to .spec files some are located, but some(after searching
on the companies site)where still no where to be found. In this case I just changed the address
to the company site this way the users can contact the company and they can locate them for the users.
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Thomas Weber <weber@corscience.de>
Signed-off-by: Mike Frysinger <vapier.adi@gmail.com>
Cc: Paulo Marques <pmarques@grupopie.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Michael Neuling <mikey@neuling.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This patch adds support for moving charge of file pages, which include
normal file, tmpfs file and swaps of tmpfs file. It's enabled by setting
bit 1 of <target cgroup>/memory.move_charge_at_immigrate.
Unlike the case of anonymous pages, file pages(and swaps) in the range
mmapped by the task will be moved even if the task hasn't done page fault,
i.e. they might not be the task's "RSS", but other task's "RSS" that maps
the same file. And mapcount of the page is ignored(the page can be moved
even if page_mapcount(page) > 1). So, conditions that the page/swap
should be met to be moved is that it must be in the range mmapped by the
target task and it must be charged to the old cgroup.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds a feature to disable oom-killer for memcg, if disabled, of
course, tasks under memcg will stop.
But now, we have oom-notifier for memcg. And the world around memcg is
not under out-of-memory. memcg's out-of-memory just shows memcg hits
limit. Then, administrator or management daemon can recover the situation
by
- kill some process
- enlarge limit, add more swap.
- migrate some tasks
- remove file cache on tmps (difficult ?)
Unlike oom-killer, you can take enough information before killing tasks.
(by gcore, or, ps etc.)
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Considering containers or other resource management softwares in userland,
event notification of OOM in memcg should be implemented. Now, memcg has
"threshold" notifier which uses eventfd, we can make use of it for oom
notification.
This patch adds oom notification eventfd callback for memcg. The usage is
very similar to threshold notifier, but control file is memory.oom_control
and no arguments other than eventfd is required.
% cgroup_event_notifier /cgroup/A/memory.oom_control dummy
(About cgroup_event_notifier, see Documentation/cgroup/)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
vlynq: make whole Kconfig-menu dependant on architecture
add descriptive comment for TIF_MEMDIE task flag declaration.
EEPROM: max6875: Header file cleanup
EEPROM: 93cx6: Header file cleanup
EEPROM: Header file cleanup
agp: use NULL instead of 0 when pointer is needed
rtc-v3020: make bitfield unsigned
PCI: make bitfield unsigned
jbd2: use NULL instead of 0 when pointer is needed
cciss: fix shadows sparse warning
doc: inode uses a mutex instead of a semaphore.
uml: i386: Avoid redefinition of NR_syscalls
fix "seperate" typos in comments
cocbalt_lcdfb: correct sections
doc: Change urls for sparse
Powerpc: wii: Fix typo in comment
i2o: cleanup some exit paths
Documentation/: it's -> its where appropriate
UML: Fix compiler warning due to missing task_struct declaration
UML: add kernel.h include to signal.c
...
This patch fixes few usability and configurability issues.
o All the cgroup based controller options are configurable from
"Genral Setup/Control Group Support/" menu. blkio is the only exception.
Hence make this option visible in above menu and make it configurable from
there to bring it inline with rest of the cgroup based controllers.
o Get rid of CONFIG_DEBUG_CFQ_IOSCHED.
This option currently does two things.
- Enable printing of cgroup paths in blktrace
- Enables CONFIG_DEBUG_BLK_CGROUP, which in turn displays additional stat
files in cgroup.
If we are using group scheduling, blktrace data is of not really much use
if cgroup information is not present. To get this data, currently one has to
also enable CONFIG_DEBUG_CFQ_IOSCHED, which in turn brings the overhead of
all the additional debug stat files which is not desired.
Hence, this patch moves printing of cgroup paths under
CONFIG_CFQ_GROUP_IOSCHED.
This allows us to get rid of CONFIG_DEBUG_CFQ_IOSCHED completely. Now all
the debug stat files are controlled only by CONFIG_DEBUG_BLK_CGROUP which
can be enabled through config menu.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Divyesh Shah <dpshah@google.com>
Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
1) group_wait_time - This is the amount of time the cgroup had to wait to get a
timeslice for one of its queues from when it became busy, i.e., went from 0
to 1 request queued. This is different from the io_wait_time which is the
cumulative total of the amount of time spent by each IO in that cgroup waiting
in the scheduler queue. This stat is a great way to find out any jobs in the
fleet that are being starved or waiting for longer than what is expected (due
to an IO controller bug or any other issue).
2) empty_time - This is the amount of time a cgroup spends w/o any pending
requests. This stat is useful when a job does not seem to be able to use its
assigned disk share by helping check if that is happening due to an IO
controller bug or because the job is not submitting enough IOs.
3) idle_time - This is the amount of time spent by the IO scheduler idling
for a given cgroup in anticipation of a better request than the exising ones
from other queues/cgroups.
All these stats are recorded using start and stop events. When reading these
stats, we do not add the delta between the current time and the last start time
if we're between the start and stop events. We avoid doing this to make sure
that these numbers are always monotonically increasing when read. Since we're
using sched_clock() which may use the tsc as its source, it may induce some
inconsistency (due to tsc resync across cpus) if we included the current delta.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>