Per-zone dirty limits try to distribute page cache pages allocated for
writing across zones in proportion to the individual zone sizes, to reduce
the likelihood of reclaim having to write back individual pages from the
LRU lists in order to make progress.
This patch:
The amount of dirtyable pages should not include the full number of free
pages: there is a number of reserved pages that the page allocator and
kswapd always try to keep free.
The closer (reclaimable pages - dirty pages) is to the number of reserved
pages, the more likely it becomes for reclaim to run into dirty pages:
+----------+ ---
| anon | |
+----------+ |
| | |
| | -- dirty limit new -- flusher new
| file | | |
| | | |
| | -- dirty limit old -- flusher old
| | |
+----------+ --- reclaim
| reserved |
+----------+
| kernel |
+----------+
This patch introduces a per-zone dirty reserve that takes both the lowmem
reserve as well as the high watermark of the zone into account, and a
global sum of those per-zone values that is subtracted from the global
amount of dirtyable pages. The lowmem reserve is unavailable to page
cache allocations and kswapd tries to keep the high watermark free. We
don't want to end up in a situation where reclaim has to clean pages in
order to balance zones.
Not treating reserved pages as dirtyable on a global level is only a
conceptual fix. In reality, dirty pages are not distributed equally
across zones and reclaim runs into dirty pages on a regular basis.
But it is important to get this right before tackling the problem on a
per-zone level, where the distance between reclaim and the dirty pages is
mostly much smaller in absolute numbers.
[akpm@linux-foundation.org: fix highmem build]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Calling alloc_pages_exact_node() means the allocation only passes the
zonelist of a single node into the page allocator. If that node isn't
online, it's zonelist may never have been initialized causing a strange
oops that may not immediately be clear.
I recently debugged an issue where node 0 wasn't online and an allocator
was passing 0 to alloc_pages_exact_node() and it resulted in a NULL
pointer on zonelist->_zoneref. If CONFIG_DEBUG_VM is enabled, though, it
would be nice to catch this a bit earlier.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With CONFIG_DEBUG_PAGEALLOC configured, the CPU will generate an exception
on access (read,write) to an unallocated page, which permits us to catch
code which corrupts memory. However the kernel is trying to maximise
memory usage, hence there are usually few free pages in the system and
buggy code usually corrupts some crucial data.
This patch changes the buddy allocator to keep more free/protected pages
and to interlace free/protected and allocated pages to increase the
probability of catching corruption.
When the kernel is compiled with CONFIG_DEBUG_PAGEALLOC,
debug_guardpage_minorder defines the minimum order used by the page
allocator to grant a request. The requested size will be returned with
the remaining pages used as guard pages.
The default value of debug_guardpage_minorder is zero: no change from
current behaviour.
[akpm@linux-foundation.org: tweak documentation, s/flg/flag/]
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can place this in definitions that we expect the compiler to remove by
dead code elimination. If this assertion fails, we get a nice error
message at build time.
The GCC function attribute error("message") was added in version 4.3, so
we define a new macro __linktime_error(message) to expand to this for
GCC-4.3 and later. This will give us an error diagnostic from the
compiler on the line that fails. For other compilers
__linktime_error(message) expands to nothing, and we have to be content
with a link time error, but at least we will still get a build error.
BUILD_BUG() expands to the undefined function __build_bug_failed() and
will fail at link time if the compiler ever emits code for it. On GCC-4.3
and later, attribute((error())) is used so that the failure will be noted
at compile time instead.
Signed-off-by: David Daney <david.daney@cavium.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: DM <dm.n9107@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Colin Cross reported;
Under the following conditions, __alloc_pages_slowpath can loop forever:
gfp_mask & __GFP_WAIT is true
gfp_mask & __GFP_FS is false
reclaim and compaction make no progress
order <= PAGE_ALLOC_COSTLY_ORDER
These conditions happen very often during suspend and resume,
when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
allocations into __GFP_WAIT.
The oom killer is not run because gfp_mask & __GFP_FS is false,
but should_alloc_retry will always return true when order is less
than PAGE_ALLOC_COSTLY_ORDER.
In his fix, he avoided retrying the allocation if reclaim made no progress
and __GFP_FS was not set. The problem is that this would result in
GFP_NOIO allocations failing that previously succeeded which would be very
unfortunate.
The big difference between GFP_NOIO and suspend converting GFP_KERNEL to
behave like GFP_NOIO is that normally flushers will be cleaning pages and
kswapd reclaims pages allowing GFP_NOIO to succeed after a short delay.
The same does not necessarily apply during suspend as the storage device
may be suspended.
This patch special cases the suspend case to fail the page allocation if
reclaim cannot make progress and adds some documentation on how
gfp_allowed_mask is currently used. Failing allocations like this may
cause suspend to abort but that is better than a livelock.
[mgorman@suse.de: Rework fix to be suspend specific]
[rientjes@google.com: Move suspended device check to should_alloc_retry]
Reported-by: Colin Cross <ccross@android.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: new helper - d_make_root()
dcache: use a dispose list in select_parent
ceph: d_alloc_root() may fail
ext4: fix failure exits
isofs: inode leak on mount failure
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
igmp: Avoid zero delay when receiving odd mixture of IGMP queries
netdev: make net_device_ops const
bcm63xx: make ethtool_ops const
usbnet: make ethtool_ops const
net: Fix build with INET disabled.
net: introduce netif_addr_lock_nested() and call if when appropriate
net: correct lock name in dev_[uc/mc]_sync documentations.
net: sk_update_clone is only used in net/core/sock.c
8139cp: fix missing napi_gro_flush.
pktgen: set correct max and min in pktgen_setup_inject()
smsc911x: Unconditionally include linux/smscphy.h in smsc911x.h
asix: fix infinite loop in rx_fixup()
net: Default UDP and UNIX diag to 'n'.
r6040: fix typo in use of MCR0 register bits
net: fix sock_clone reference mismatch with tcp memcontrol
clock management changes for i.MX
Another simple series related to clock management, this time only for
imx.
* tag 'clk' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: mxs: select HAVE_CLK_PREPARE for clock
clk: add config option HAVE_CLK_PREPARE into Kconfig
ASoC: mxs-saif: convert to clk_prepare/clk_unprepare
video: mxsfb: convert to clk_prepare/clk_unprepare
serial: mxs-auart: convert to clk_prepare/clk_unprepare
net: flexcan: convert to clk_prepare/clk_unprepare
mtd: gpmi-lib: convert to clk_prepare/clk_unprepare
mmc: mxs-mmc: convert to clk_prepare/clk_unprepare
dma: mxs-dma: convert to clk_prepare/clk_unprepare
net: fec: add clk_prepare/clk_unprepare
ARM: mxs: convert platform code to clk_prepare/clk_unprepare
clk: add helper functions clk_prepare_enable and clk_disable_unprepare
Fix up trivial conflicts in drivers/net/ethernet/freescale/fec.c due to
commit 0ebafefcaa ("net: fec: add clk_prepare/clk_unprepare") clashing
trivially with commit e163cc97f9 ("net/fec: fix the .remove code").
Driver specific changes
Again, a lot of platforms have changes in here: pxa, samsung, omap,
at91, imx, ...
* tag 'drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (54 commits)
ARM: sa1100: clean up of the clock support
ARM: pxa: add dummy clock for sa1100-rtc
RTC: sa1100: support sa1100, pxa and mmp soc families
RTC: sa1100: remove redundant code of setting alarm
RTC: sa1100: Clean out ost register
Input: zylonite-wm97xx - replace IRQ_GPIO() with gpio_to_irq()
pcmcia: pxa: replace IRQ_GPIO() with gpio_to_irq()
ARM: EXYNOS: Modified files for SPI consolidation work
ARM: S5P64X0: Enable SDHCI support
ARM: S5P64X0: Add lookup of sdhci-s3c clocks using generic names
ARM: S5P64X0: Add HSMMC setup for host Controller
ARM: EXYNOS: Add USB OHCI support to ORIGEN board
USB: Add Samsung Exynos OHCI diver
ARM: EXYNOS: Add USB OHCI support to SMDKV310 board
ARM: EXYNOS: Add USB OHCI device
net: macb: fix build break with !CONFIG_OF
i2c: tegra: Support DVC controller in device tree
i2c: tegra: Add __devinit/exit to probe/remove
net/at91_ether: use gpio_is_valid for phy IRQ line
ARM: at91/net: add macb ethernet controller in 9g45/9g20 DT
...
New feature development
This adds support for new features, and contains stuff from most
platforms. A number of these patches could have fit into other
branches, too, but were small enough not to cause too much
confusion here.
* tag 'devel' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (28 commits)
mfd/db8500-prcmu: remove support for early silicon revisions
ARM: ux500: fix the smp_twd clock calculation
ARM: ux500: remove support for early silicon revisions
ARM: ux500: update register files
ARM: ux500: register DB5500 PMU dynamically
ARM: ux500: update ASIC detection for U5500
ARM: ux500: support DB8520
ARM: picoxcell: implement watchdog restart
ARM: OMAP3+: hwmod data: Add the default clockactivity for I2C
ARM: OMAP3: hwmod data: disable multiblock reads on MMC1/2 on OMAP34xx/35xx <= ES2.1
ARM: OMAP: USB: EHCI and OHCI hwmod structures for OMAP4
ARM: OMAP: USB: EHCI and OHCI hwmod structures for OMAP3
ARM: OMAP: hwmod data: Add support for AM35xx UART4/ttyO3
ARM: Orion: Remove address map info from all platform data structures
ARM: Orion: Get address map from plat-orion instead of via platform_data
ARM: Orion: mbus_dram_info consolidation
ARM: Orion: Consolidate the address map setup
ARM: Kirkwood: Add configuration for MPP12 as GPIO
ARM: Kirkwood: Recognize A1 revision of 6282 chip
ARM: ux500: update the MOP500 GPIO assignments
...
Device tree conversions for samsung and tegra
Both platforms had some initial device tree support, but this adds
much more to actually make it usable.
* tag 'dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (45 commits)
ARM: dts: Add intial dts file for EXYNOS4210 SoC, SMDKV310 and ORIGEN
ARM: EXYNOS: Add Exynos4 device tree enabled board file
rtc: rtc-s3c: Add device tree support
input: samsung-keypad: Add device tree support
ARM: S5PV210: Modify platform data for pl330 driver
ARM: S5PC100: Modify platform data for pl330 driver
ARM: S5P64x0: Modify platform data for pl330 driver
ARM: EXYNOS: Add a alias for pdma clocks
ARM: EXYNOS: Limit usage of pl330 device instance to non-dt build
ARM: SAMSUNG: Add device tree support for pl330 dma engine wrappers
DMA: PL330: Add device tree support
ARM: EXYNOS: Modify platform data for pl330 driver
DMA: PL330: Infer transfer direction from transfer request instead of platform data
DMA: PL330: move filter function into driver
serial: samsung: Fix build for non-Exynos4210 devices
serial: samsung: add device tree support
serial: samsung: merge probe() function from all SoC specific extensions
serial: samsung: merge all SoC specific port reset functions
ARM: SAMSUNG: register uart clocks to clock lookup list
serial: samsung: remove all uses of get_clksrc and set_clksrc
...
Fix up fairly trivial conflicts in arch/arm/mach-s3c2440/clock.c and
drivers/tty/serial/Kconfig both due to just adding code close to
changes.
Cleanups on various subarchitectures
Cleanup patches for various ARM platforms and some of their associated
drivers, the bulk of these is for mach-91.
Arnd ended up pulling in the restart branch from Russell in order to
fix up some simple but annoying merge conflicts.
* tag 'cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (44 commits)
arm/at91: fix build of stamp9g20
ARM: u300: delete memory.h
MAINTAINERS: add maintainer entry for Picochip picoxcell
ARM: picoxcell: move io mappings to common.c
ARM: picoxcell: don't reserve irq_descs
ARM: picoxcell: remove mach/memory.h
ARM: at91: delete the pcontrol_g20_defconfig
arm/tegra: Remove code that's ifndef CONFIG_ARM_GIC
arm/tegra: remove unused defines
arm/tegra: fix variable formatting in makefile
ARM: davinci: vpif: move code to driver core header from platform
ARM: at91/gpio: fix display of number of irq setuped
ARM: at91/gpio: drop PIN_BASE
ARM: at91/udc: use gpio_is_valid to check the gpio
ARM: at91/ohci: use gpio_is_valid to check the gpio
ARM: at91/nand: use gpio_is_valid to check the gpio
ARM: at91/mmc: use gpio_is_valid to check the gpio
ARM: at91/ide: use gpio_is_valid to check the gpio
ARM: at91/pata: use gpio_is_valid to check the gpio
ARM: at91/soc: use gpio_is_valid to check the gpio
...
Including trace/events/*.h TRACE_EVENT() macro headers in other headers
can cause strange side effects if another trace/event/*.h header
includes that header. Having trace/events/kmem.h inside slab_def.h
caused a compile error in sparc64 when changes were done to some header
files. Moving the kmem.h trace header out of slab.h and into slab.c
fixes the problem.
Note, both slub.c and slob.c already include the trace/events/kmem.h
file. Only slab.c had it missing.
Link: http://lkml.kernel.org/r/20120105190405.1e3191fb5a43b2a0f1655e1f@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> net/core/sock.c: In function 'sk_update_clone':
> net/core/sock.c:1278:3: error: implicit declaration of function 'sock_update_memcg'
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: Remove irqsafe_cpu_xxx variants
Fix up conflict in arch/x86/include/asm/percpu.h due to clash with
cebef5beed ("x86: Fix and improve percpu_cmpxchg{8,16}b_double()")
which edited the (now removed) irqsafe_cpu_cmpxchg*_double code.
* 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cgroup: fix to allow mounting a hierarchy by name
cgroup: move assignement out of condition in cgroup_attach_proc()
cgroup: Remove task_lock() from cgroup_post_fork()
cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
cgroup: only need to check oldcgrp==newgrp once
cgroup: remove redundant get/put of task struct
cgroup: remove redundant get/put of old css_set from migrate
cgroup: Remove unnecessary task_lock before fetching css_set on migration
cgroup: Drop task_lock(parent) on cgroup_fork()
cgroups: remove redundant get/put of css_set from css_set_check_fetched()
resource cgroups: remove bogus cast
cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
cgroup, cpuset: don't use ss->pre_attach()
cgroup: don't use subsys->can_attach_task() or ->attach_task()
cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
cgroup: improve old cgroup handling in cgroup_attach_proc()
cgroup: always lock threadgroup during migration
threadgroup: extend threadgroup_lock() to cover exit and exec
threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
...
Fix up conflict in kernel/cgroup.c due to commit e0197aae59: "cgroups:
fix a css_set not found bug in cgroup_attach_proc" that already
mentioned that the bug is fixed (differently) in Tejun's cgroup
patchset. This one, in other words.
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2/3/4: delete unneeded includes of module.h
ext{3,4}: Fix potential race when setversion ioctl updates inode
udf: Mark LVID buffer as uptodate before marking it dirty
ext3: Don't warn from writepage when readonly inode is spotted after error
jbd: Remove j_barrier mutex
reiserfs: Force inode evictions before umount to avoid crash
reiserfs: Fix quota mount option parsing
udf: Treat symlink component of type 2 as /
udf: Fix deadlock when converting file from in-ICB one to normal one
udf: Cleanup calling convention of inode_getblk()
ext2: Fix error handling on inode bitmap corruption
ext3: Fix error handling on inode bitmap corruption
ext3: replace ll_rw_block with other functions
ext3: NULL dereference in ext3_evict_inode()
jbd: clear revoked flag on buffers before a new transaction started
ext3: call ext3_mark_recovery_complete() when recovery is really needed
dev_uc_sync() and dev_mc_sync() are acquiring netif_addr_lock for
destination device of synchronization. Since netif_addr_lock is
already held at the time for source device, this triggers lockdep
deadlock warning.
There's no way this deadlock can happen so use spin_lock_nested() to
silence the warning.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'staging-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (466 commits)
net/hyperv: Add support for jumbo frame up to 64KB
net/hyperv: Add NETVSP protocol version negotiation
net/hyperv: Remove unnecessary kmap_atomic in netvsc driver
staging/rtl8192e: Register against lib80211
staging/rtl8192e: Convert to lib80211_crypt_info
staging/rtl8192e: Convert to lib80211_crypt_data and lib80211_crypt_ops
staging/rtl8192e: Add lib80211.h to rtllib.h
staging/mei: add watchdog device registration wrappers
drm/omap: GEM, deal with cache
staging: vt6656: int.c, int.h: Change return of function to void
staging: usbip: removed unused definitions from header
staging: usbip: removed dead code from receive function
staging:iio: Drop {mark,unmark}_in_use callbacks
staging:iio: Drop buffer mark_param_change callback
staging:iio: Drop the unused buffer enable() and is_enabled() callbacks
staging:iio: Drop buffer busy flag
staging:iio: Make sure a device is only opened once at a time
staging:iio: Disallow modifying buffer size when buffer is enabled
staging:iio: Disallow changing scan elements in all buffered modes
staging:iio: Use iio_buffer_enabled instead of open coding it
...
Fix up conflict in drivers/staging/iio/adc/ad799x_core.c (removal of
module_init due to using module_i2c_driver() helper, next to removal of
MODULE_ALIAS due to using MODULE_DEVICE_TABLE instead).
* 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (65 commits)
tty: serial: imx: move del_timer_sync() to avoid potential deadlock
imx: add polled io uart methods
imx: Add save/restore functions for UART control regs
serial/imx: let probing fail for the dt case without a valid alias
serial/imx: propagate error from of_alias_get_id instead of using -ENODEV
tty: serial: imx: Allow UART to be a source for wakeup
serial: driver for m32 arch should not have DEC alpha errata
serial/documentation: fix documented name of DCD cpp symbol
atmel_serial: fix spinlock lockup in RS485 code
tty: Fix memory leak in virtual console when enable unicode translation
serial: use DIV_ROUND_CLOSEST instead of open coding it
serial: add support for 400 and 800 v3 series Titan cards
serial: bfin-uart: Remove ASYNC_CTS_FLOW flag for hardware automatic CTS.
serial: bfin-uart: Enable hardware automatic CTS only when CTS pin is available.
serial: make FSL errata depend on 8250_CONSOLE, not just 8250
serial: add irq handler for Freescale 16550 errata.
serial: manually inline serial8250_handle_port
serial: make 8250 timeout use the specified IRQ handler
serial: export the key functions for an 8250 IRQ handler
serial: clean up parameter passing for 8250 Rx IRQ handling
...