* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (63 commits)
drivers/net/usb/asix.c: Fix pointer cast.
be2net: Bug fix to avoid disabling bottom half during firmware upgrade.
proc_dointvec: write a single value
hso: add support for new products
Phonet: fix potential use-after-free in pep_sock_close()
ath9k: remove VEOL support for ad-hoc
ath9k: change beacon allocation to prefer the first beacon slot
sock.h: fix kernel-doc warning
cls_cgroup: Fix build error when built-in
macvlan: do proper cleanup in macvlan_common_newlink() V2
be2net: Bug fix in init code in probe
net/dccp: expansion of error code size
ath9k: Fix rx of mcast/bcast frames in PS mode with auto sleep
wireless: fix sta_info.h kernel-doc warnings
wireless: fix mac80211.h kernel-doc warnings
iwlwifi: testing the wrong variable in iwl_add_bssid_station()
ath9k_htc: rare leak in ath9k_hif_usb_alloc_tx_urbs()
ath9k_htc: dereferencing before check in hif_usb_tx_cb()
rt2x00: Fix rt2800usb TX descriptor writing.
rt2x00: Fix failed SLEEP->AWAKE and AWAKE->SLEEP transitions.
...
This reverts commit 480b02df3a, since
Rafael reports that it causes occasional kernel paging request faults in
load_module().
Dropping the module lock and re-taking it deep in the call-chain is
definitely not the right thing to do. That just turns the mutex from a
lock into a "random non-locking data structure" that doesn't actually
protect what it's supposed to protect.
Requested-and-tested-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Brandon Philips <brandon@ifup.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The commit 00b7c3395a
"sysctl: refactor integer handling proc code"
modified the behaviour of writing to /proc.
Before the commit, write("1\n") to /proc/sys/kernel/printk succeeded. But
now it returns EINVAL.
This commit supports writing a single value to a multi-valued entry.
Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Reviewed-and-tested-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add global mutex zonelists_mutex to fix the possible race:
CPU0 CPU1 CPU2
(1) zone->present_pages += online_pages;
(2) build_all_zonelists();
(3) alloc_page();
(4) free_page();
(5) build_all_zonelists();
(6) __build_all_zonelists();
(7) zone->pageset = alloc_percpu();
In step (3,4), zone->pageset still points to boot_pageset, so bad
things may happen if 2+ nodes are in this state. Even if only 1 node
is accessing the boot_pageset, (3) may still consume too much memory
to fail the memory allocations in step (7).
Besides, atomic operation ensures alloc_percpu() in step (7) will never fail
since there is a new fresh memory block added in step(6).
[haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Andi Kleen <andi.kleen@intel.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For each new populated zone of hotadded node, need to update its pagesets
with dynamically allocated per_cpu_pageset struct for all possible CPUs:
1) Detach zone->pageset from the shared boot_pageset
at end of __build_all_zonelists().
2) Use mutex to protect zone->pageset when it's still
shared in onlined_pages()
Otherwises, multiple zones of different nodes would share same boot strapping
boot_pageset for same CPU, which will finally cause below kernel panic:
------------[ cut here ]------------
kernel BUG at mm/page_alloc.c:1239!
invalid opcode: 0000 [#1] SMP
...
Call Trace:
[<ffffffff811300c1>] __alloc_pages_nodemask+0x131/0x7b0
[<ffffffff81162e67>] alloc_pages_current+0x87/0xd0
[<ffffffff81128407>] __page_cache_alloc+0x67/0x70
[<ffffffff811325f0>] __do_page_cache_readahead+0x120/0x260
[<ffffffff81132751>] ra_submit+0x21/0x30
[<ffffffff811329c6>] ondemand_readahead+0x166/0x2c0
[<ffffffff81132ba0>] page_cache_async_readahead+0x80/0xa0
[<ffffffff8112a0e4>] generic_file_aio_read+0x364/0x670
[<ffffffff81266cfa>] nfs_file_read+0xca/0x130
[<ffffffff8117b20a>] do_sync_read+0xfa/0x140
[<ffffffff8117bf75>] vfs_read+0xb5/0x1a0
[<ffffffff8117c151>] sys_read+0x51/0x80
[<ffffffff8103c032>] system_call_fastpath+0x16/0x1b
RIP [<ffffffff8112ff13>] get_page_from_freelist+0x883/0x900
RSP <ffff88000d1e78a8>
---[ end trace 4bda28328b9990db ]
[akpm@linux-foundation.org: merge fix]
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Andi Kleen <andi.kleen@intel.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enable users to online CPUs even if the CPUs belongs to a numa node which
doesn't have onlined local memory.
The zonlists(pg_data_t.node_zonelists[]) of a numa node are created either
in system boot/init period, or at the time of local memory online. For a
numa node without onlined local memory, its zonelists are not initialized
at present. As a result, any memory allocation operations executed by
CPUs within this node will fail. In fact, an out-of-memory error is
triggered when attempt to online CPUs before memory comes to online.
This patch tries to create zonelists for such numa nodes, so that the
memory allocation for this node can be fallback'ed to other nodes.
[akpm@linux-foundation.org: remove unneeded export]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: minskey guo<chaohong.guo@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel applies some heuristics when deciding if memory should be
compacted or reclaimed to satisfy a high-order allocation. One of these
is based on the fragmentation. If the index is below 500, memory will not
be compacted. This choice is arbitrary and not based on data. To help
optimise the system and set a sensible default for this value, this patch
adds a sysctl extfrag_threshold. The kernel will only compact memory if
the fragmentation index is above the extfrag_threshold.
[randy.dunlap@oracle.com: Fix build errors when proc fs is not configured]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before applying this patch, cpuset updates task->mems_allowed and
mempolicy by setting all new bits in the nodemask first, and clearing all
old unallowed bits later. But in the way, the allocator may find that
there is no node to alloc memory.
The reason is that cpuset rebinds the task's mempolicy, it cleans the
nodes which the allocater can alloc pages on, for example:
(mpol: mempolicy)
task1 task1's mpol task2
alloc page 1
alloc on node0? NO 1
1 change mems from 1 to 0
1 rebind task1's mpol
0-1 set new bits
0 clear disallowed bits
alloc on node1? NO 0
...
can't alloc page
goto oom
This patch fixes this problem by expanding the nodes range first(set newly
allowed bits) and shrink it lazily(clear newly disallowed bits). So we
use a variable to tell the write-side task that read-side task is reading
nodemask, and the write-side task clears newly disallowed nodes after
read-side task ends the current memory allocation.
[akpm@linux-foundation.org: fix spello]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Paul Menage <menage@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin reported that the allocator may see an empty nodemask when
changing cpuset's mems[1]. It happens only on the kernel that do not do
atomic nodemask_t stores. (MAX_NUMNODES > BITS_PER_LONG)
But I found that there is also a problem on the kernel that can do atomic
nodemask_t stores. The problem is that the allocator can't find a node to
alloc page when changing cpuset's mems though there is a lot of free
memory. The reason is like this:
(mpol: mempolicy)
task1 task1's mpol task2
alloc page 1
alloc on node0? NO 1
1 change mems from 1 to 0
1 rebind task1's mpol
0-1 set new bits
0 clear disallowed bits
alloc on node1? NO 0
...
can't alloc page
goto oom
I can use the attached program reproduce it by the following step:
# mkdir /dev/cpuset
# mount -t cpuset cpuset /dev/cpuset
# mkdir /dev/cpuset/1
# echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
# echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
# echo $$ > /dev/cpuset/1/tasks
# numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> &
<nr_tasks> = max(nr_cpus - 1, 1)
# killall -s SIGUSR1 cpuset_mem_hog
# ./change_mems.sh
several hours later, oom will happen though there is a lot of free memory.
This patchset fixes this problem by expanding the nodes range first(set
newly allowed bits) and shrink it lazily(clear newly disallowed bits). So
we use a variable to tell the write-side task that read-side task is
reading nodemask, and the write-side task clears newly disallowed nodes
after read-side task ends the current memory allocation.
This patch:
In order to fix no node to alloc memory, when we want to update mempolicy
and mems_allowed, we expand the set of nodes first (set all the newly
nodes) and shrink the set of nodes lazily(clean disallowed nodes), But the
mempolicy's rebind functions may breaks the expanding.
So we restructure the mempolicy's rebind functions and split the rebind
work to two steps, just like the update of cpuset's mems: The 1st step:
expand the set of the mempolicy's nodes. The 2nd step: shrink the set of
the mempolicy's nodes. It is used when there is no real lock to protect
the mempolicy in the read-side. Otherwise we can do rebind work at once.
In order to implement it, we define
enum mpol_rebind_step {
MPOL_REBIND_ONCE,
MPOL_REBIND_STEP1,
MPOL_REBIND_STEP2,
MPOL_REBIND_NSTEP,
};
If the mempolicy needn't be updated by two steps, we can pass
MPOL_REBIND_ONCE to the rebind functions. Or we can pass
MPOL_REBIND_STEP1 to do the first step of the rebind work and pass
MPOL_REBIND_STEP2 to do the second step work.
Besides that, it maybe long time between these two step and we have to
release the lock that protects mempolicy and mems_allowed. If we hold the
lock once again, we must check whether the current mempolicy is under the
rebinding (the first step has been done) or not, because the task may
alloc a new mempolicy when we don't hold the lock. So we defined the
following flag to identify it:
#define MPOL_F_REBINDING (1 << 2)
The new functions will be used in the next patch.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Paul Menage <menage@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 3bbb9ec946 (timers: Introduce the concept of timer slack for
legacy timers) does not take the case into account when the timer is
already expired. This broke wireless drivers.
The solution is not to apply slack to already expired timers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
commit 64ce4c2f (time: Clean up warp_clock()) breaks the timezone
update in a very subtle way. To avoid the direct access to timekeeping
internals it adds the timezone delta to the current time with
timespec_add_safe(). This works nicely when the timezone delta is > 0.
If timezone delta is < 0 then the wrap check in timespec_add_safe()
triggers and timespec_add_safe() returns TIME_MAX and screws up
timekeeping completely.
The comment above timespec_add_safe() says:
It's assumed that both values are valid (>= 0)
Add the timezone seconds adjustment directly.
Reported-by: Rafael J. Wysocki <rjw@sisk.pl>
Tested-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* git://git.infradead.org/iommu-2.6:
intel-iommu: Set a more specific taint flag for invalid BIOS DMAR tables
intel-iommu: Combine the BIOS DMAR table warning messages
panic: Add taint flag TAINT_FIRMWARE_WORKAROUND ('I')
panic: Allow warnings to set different taint flags
intel-iommu: intel_iommu_map_range failed at very end of address space
intel-iommu: errors with smaller iommu widths
intel-iommu: Fix boot inside 64bit virtualbox with io-apic disabled
intel-iommu: use physfn to search drhd for VF
intel-iommu: Print out iommu seq_id
intel-iommu: Don't complain that ACPI_DMAR_SCOPE_TYPE_IOAPIC is not supported
intel-iommu: Avoid global flushes with caching mode.
intel-iommu: Use correct domain ID when caching mode is enabled
intel-iommu mistakenly uses offset_pfn when caching mode is enabled
intel-iommu: use for_each_set_bit()
intel-iommu: Fix section mismatch dmar_ir_support() uses dmar_tbl.
* 'modules' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
module: drop the lock while waiting for module to complete initialization.
MODULE_DEVICE_TABLE(isapnp, ...) does nothing
hisax_fcpcipnp: fix broken isapnp device table.
isapnp: move definitions to mod_devicetable.h so file2alias can reach them.
* 'for-2.6.35' of git://git.kernel.dk/linux-2.6-block: (86 commits)
pipe: set lower and upper limit on max pages in the pipe page array
pipe: add support for shrinking and growing pipes
drbd: This is now equivalent to drbd release 8.3.8rc1
drbd: Do not free p_uuid early, this is done in the exit code of the receiver
drbd: Null pointer deref fix to the large "multi bio rewrite"
drbd: Fix: Do not detach, if a bio with a barrier fails
drbd: Ensure to not trigger late-new-UUID creation multiple times
drbd: Do not Oops when C_STANDALONE when uuid gets generated
writeback: fix mixed up arguments to bdi_start_writeback()
writeback: fix problem with !CONFIG_BLOCK compilation
block: improve automatic native capacity unlocking
block: use struct parsed_partitions *state universally in partition check code
block,ide: simplify bdops->set_capacity() to ->unlock_native_capacity()
block: restart partition scan after resizing a device
buffer: make invalidate_bdev() drain all percpu LRU add caches
block: remove all rcu head initializations
writeback: fixups for !dirty_writeback_centisecs
writeback: bdi_writeback_task() must set task state before calling schedule()
writeback: ensure that WB_SYNC_NONE writeback with sb pinned is sync
drivers/block/drbd: Use kzalloc
...
We need at least two to guarantee proper POSIX behaviour, so
never allow a smaller limit than that.
Also expose a /proc/sys/fs/pipe-max-pages sysctl file that allows
root to define a sane upper limit. Make it default to 16 times the
default size, which is 16 pages.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() actions for
growing and shrinking the size of a pipe and adjusts pipe.c and splice.c
(and relay and network splice) usage to work with these larger (or smaller)
pipes.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'dbg-early-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
echi-dbgp: Add kernel debugger support for the usb debug port
earlyprintk,vga,kdb: Fix \b and \r for earlyprintk=vga with kdb
kgdboc: Add ekgdboc for early use of the kernel debugger
x86,early dr regs,kgdb: Allow kernel debugger early dr register access
x86,kgdb: Implement early hardware breakpoint debugging
x86, kgdb, init: Add early and late debug states
x86, kgdb: early trap init for early debug
* 'kdb-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: (25 commits)
kdb,debug_core: Allow the debug core to receive a panic notification
MAINTAINERS: update kgdb, kdb, and debug_core info
debug_core,kdb: Allow the debug core to process a recursive debug entry
printk,kdb: capture printk() when in kdb shell
kgdboc,kdb: Allow kdb to work on a non open console port
kgdb: Add the ability to schedule a breakpoint via a tasklet
mips,kgdb: kdb low level trap catch and stack trace
powerpc,kgdb: Introduce low level trap catching
x86,kgdb: Add low level debug hook
kgdb: remove post_primary_code references
kgdb,docs: Update the kgdb docs to include kdb
kgdboc,keyboard: Keyboard driver for kdb with kgdb
kgdb: gdb "monitor" -> kdb passthrough
sparc,sunzilog: Add console polling support for sunzilog serial driver
sh,sh-sci: Use NO_POLL_CHAR in the SCIF polled console code
kgdb,8250,pl011: Return immediately from console poll
kgdb: core changes to support kdb
kdb: core for kgdb back end (2 of 2)
kdb: core for kgdb back end (1 of 2)
kgdb,blackfin: Add in kgdb_arch_set_pc for blackfin
...