Commit Graph

322972 Commits

Author SHA1 Message Date
Rusty Russell 7a23eb28fa virtio_balloon: not EXPERIMENTAL any more.
It is not experimental in any vaguely-sane sense.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:14 +09:30
Michael S. Tsirkin 04679f34c1 virtio-balloon: dependency fix
Devices should depend on virtio, not select it.  It's supposed to be
selected by the particular driver, e.g. VIRTIO_PCI.
Make balloon depend on VIRTIO and EXPERIMENTAL
(to match description).

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:14 +09:30
Dan Carpenter f22cf8eb48 virtio-blk: fix NULL checking in virtblk_alloc_req()
Smatch complains about the inconsistent NULL checking here.  Fix it to
return NULL on failure.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (fixed accidental deletion)
2012-09-28 15:05:14 +09:30
Asias He c85a1f91b3 virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path
We need to support both REQ_FLUSH and REQ_FUA for bio based path since
it does not get the sequencing of REQ_FUA into REQ_FLUSH that request
based drivers can request.

REQ_FLUSH is emulated by:
A) If the bio has no data to write:
1. Send VIRTIO_BLK_T_FLUSH to device,
2. In the flush I/O completion handler, finish the bio

B) If the bio has data to write:
1. Send VIRTIO_BLK_T_FLUSH to device
2. In the flush I/O completion handler, send the actual write data to device
3. In the write I/O completion handler, finish the bio

REQ_FUA is emulated by:
1. Send the actual write data to device
2. In the write I/O completion handler, send VIRTIO_BLK_T_FLUSH to device
3. In the flush I/O completion handler, finish the bio

Changes in v7:
- Using vbr->flags to trace request type
- Dropped unnecessary struct virtio_blk *vblk parameter
- Reuse struct virtblk_req in bio done function

Cahnges in v6:
- Reworked REQ_FLUSH and REQ_FUA emulatation order

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:14 +09:30
Asias He a98755c559 virtio-blk: Add bio-based IO path for virtio-blk
This patch introduces bio-based IO path for virtio-blk.

Compared to request-based IO path, bio-based IO path uses driver
provided ->make_request_fn() method to bypasses the IO scheduler. It
handles the bio to device directly without allocating a request in block
layer. This reduces the IO path in guest kernel to achieve high IOPS
and lower latency. The downside is that guest can not use the IO
scheduler to merge and sort requests. However, this is not a big problem
if the backend disk in host side uses faster disk device.

When the bio-based IO path is not enabled, virtio-blk still uses the
original request-based IO path, no performance difference is observed.

Using a slow device e.g. normal SATA disk, the bio-based IO path for
sequential read and write are slower than req-based IO path due to lack
of merge in guest kernel. So we make the bio-based path optional.

Performance evaluation:
-----------------------------
1) Fio test is performed in a 8 vcpu guest with ramdisk based guest using
kvm tool.

Short version:
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost         : 28%, 24%, 21%, 16%
 Latency improvement: 32%, 17%, 21%, 16%

Long version:
 With bio-based IO path:
  seq-read  : io=2048.0MB, bw=116996KB/s, iops=233991 , runt= 17925msec
  seq-write : io=2048.0MB, bw=100829KB/s, iops=201658 , runt= 20799msec
  rand-read : io=3095.7MB, bw=112134KB/s, iops=224268 , runt= 28269msec
  rand-write: io=3095.7MB, bw=96198KB/s,  iops=192396 , runt= 32952msec
    clat (usec): min=0 , max=2631.6K, avg=58716.99, stdev=191377.30
    clat (usec): min=0 , max=1753.2K, avg=66423.25, stdev=81774.35
    clat (usec): min=0 , max=2915.5K, avg=61685.70, stdev=120598.39
    clat (usec): min=0 , max=1933.4K, avg=76935.12, stdev=96603.45
  cpu : usr=74.08%, sys=703.84%, ctx=29661403, majf=21354, minf=22460954
  cpu : usr=70.92%, sys=702.81%, ctx=77219828, majf=13980, minf=27713137
  cpu : usr=72.23%, sys=695.37%, ctx=88081059, majf=18475, minf=28177648
  cpu : usr=69.69%, sys=654.13%, ctx=145476035, majf=15867, minf=26176375
 With request-based IO path:
  seq-read  : io=2048.0MB, bw=91074KB/s, iops=182147 , runt= 23027msec
  seq-write : io=2048.0MB, bw=80725KB/s, iops=161449 , runt= 25979msec
  rand-read : io=3095.7MB, bw=92106KB/s, iops=184211 , runt= 34416msec
  rand-write: io=3095.7MB, bw=82815KB/s, iops=165630 , runt= 38277msec
    clat (usec): min=0 , max=1932.4K, avg=77824.17, stdev=170339.49
    clat (usec): min=0 , max=2510.2K, avg=78023.96, stdev=146949.15
    clat (usec): min=0 , max=3037.2K, avg=74746.53, stdev=128498.27
    clat (usec): min=0 , max=1363.4K, avg=89830.75, stdev=114279.68
  cpu : usr=53.28%, sys=724.19%, ctx=37988895, majf=17531, minf=23577622
  cpu : usr=49.03%, sys=633.20%, ctx=205935380, majf=18197, minf=27288959
  cpu : usr=55.78%, sys=722.40%, ctx=101525058, majf=19273, minf=28067082
  cpu : usr=56.55%, sys=690.83%, ctx=228205022, majf=18039, minf=26551985

2) Fio test is performed in a 8 vcpu guest with Fusion-IO based guest using
kvm tool.

Short version:
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost         : 11%, 11%, 13%, 10%
 Latency improvement: 10%, 10%, 12%, 10%
Long Version:
 With bio-based IO path:
  read : io=2048.0MB, bw=58920KB/s, iops=117840 , runt= 35593msec
  write: io=2048.0MB, bw=64308KB/s, iops=128616 , runt= 32611msec
  read : io=3095.7MB, bw=59633KB/s, iops=119266 , runt= 53157msec
  write: io=3095.7MB, bw=62993KB/s, iops=125985 , runt= 50322msec
    clat (usec): min=0 , max=1284.3K, avg=128109.01, stdev=71513.29
    clat (usec): min=94 , max=962339 , avg=116832.95, stdev=65836.80
    clat (usec): min=0 , max=1846.6K, avg=128509.99, stdev=89575.07
    clat (usec): min=0 , max=2256.4K, avg=121361.84, stdev=82747.25
  cpu : usr=56.79%, sys=421.70%, ctx=147335118, majf=21080, minf=19852517
  cpu : usr=61.81%, sys=455.53%, ctx=143269950, majf=16027, minf=24800604
  cpu : usr=63.10%, sys=455.38%, ctx=178373538, majf=16958, minf=24822612
  cpu : usr=62.04%, sys=453.58%, ctx=226902362, majf=16089, minf=23278105
 With request-based IO path:
  read : io=2048.0MB, bw=52896KB/s, iops=105791 , runt= 39647msec
  write: io=2048.0MB, bw=57856KB/s, iops=115711 , runt= 36248msec
  read : io=3095.7MB, bw=52387KB/s, iops=104773 , runt= 60510msec
  write: io=3095.7MB, bw=57310KB/s, iops=114619 , runt= 55312msec
    clat (usec): min=0 , max=1532.6K, avg=142085.62, stdev=109196.84
    clat (usec): min=0 , max=1487.4K, avg=129110.71, stdev=114973.64
    clat (usec): min=0 , max=1388.6K, avg=145049.22, stdev=107232.55
    clat (usec): min=0 , max=1465.9K, avg=133585.67, stdev=110322.95
  cpu : usr=44.08%, sys=590.71%, ctx=451812322, majf=14841, minf=17648641
  cpu : usr=48.73%, sys=610.78%, ctx=418953997, majf=22164, minf=26850689
  cpu : usr=45.58%, sys=581.16%, ctx=714079216, majf=21497, minf=22558223
  cpu : usr=48.40%, sys=599.65%, ctx=656089423, majf=16393, minf=23824409

3) Fio test is performed in a 8 vcpu guest with normal SATA based guest
using kvm tool.

Short version:
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost         : -10%, -10%, 4.4%, 0.5%
 Latency improvement: -12%, -15%, 2.5%, 0.8%
Long Version:
 With bio-based IO path:
  read : io=124812KB, bw=36537KB/s, iops=9060 , runt=  3416msec
  write: io=169180KB, bw=24406KB/s, iops=6065 , runt=  6932msec
  read : io=256200KB, bw=2089.3KB/s, iops=520 , runt=122630msec
  write: io=257988KB, bw=1545.7KB/s, iops=384 , runt=166910msec
    clat (msec): min=1 , max=1527 , avg=28.06, stdev=89.54
    clat (msec): min=2 , max=344 , avg=41.12, stdev=38.70
    clat (msec): min=8 , max=1984 , avg=490.63, stdev=207.28
    clat (msec): min=33 , max=4131 , avg=659.19, stdev=304.71
  cpu          : usr=4.85%, sys=17.15%, ctx=31593, majf=0, minf=7
  cpu          : usr=3.04%, sys=11.45%, ctx=39377, majf=0, minf=0
  cpu          : usr=0.47%, sys=1.59%, ctx=262986, majf=0, minf=16
  cpu          : usr=0.47%, sys=1.46%, ctx=337410, majf=0, minf=0

 With request-based IO path:
  read : io=150120KB, bw=40420KB/s, iops=10037 , runt=  3714msec
  write: io=194932KB, bw=27029KB/s, iops=6722 , runt=  7212msec
  read : io=257136KB, bw=2001.1KB/s, iops=498 , runt=128443msec
  write: io=258276KB, bw=1537.2KB/s, iops=382 , runt=168028msec
    clat (msec): min=1 , max=1542 , avg=24.84, stdev=32.45
    clat (msec): min=3 , max=628 , avg=35.62, stdev=39.71
    clat (msec): min=8 , max=2540 , avg=503.28, stdev=236.97
    clat (msec): min=41 , max=4398 , avg=653.88, stdev=302.61
  cpu          : usr=3.91%, sys=15.75%, ctx=26968, majf=0, minf=23
  cpu          : usr=2.50%, sys=10.56%, ctx=19090, majf=0, minf=0
  cpu          : usr=0.16%, sys=0.43%, ctx=20159, majf=0, minf=16
  cpu          : usr=0.18%, sys=0.53%, ctx=81364, majf=0, minf=0

How to use:
-----------------------------
Add 'virtio_blk.use_bio=1' to kernel cmdline or 'modprobe virtio_blk
use_bio=1' to enable ->make_request_fn() based I/O path.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Asias He <asias@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:13 +09:30
Alexey Khoroshilov 33e1afc3d8 virtio: console: fix error handling in init() function
If register_virtio_driver() fails, virtio-ports class is not destroyed.
The patch adds error handling of register_virtio_driver().

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Acked-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:13 +09:30
Yoshihiro YUNOMAE 5b8fa822b7 tools: Fix pthread flag for Makefile of trace-agent used by virtio-trace
pthread flag should not be -lpthread but -pthread using gcc. The -lpthread
links the external multithread library. On the other hand, the -pthread manages
both the gcc's preprocessor and linker to be able to compile with pthread.

Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:13 +09:30
Yoshihiro YUNOMAE 108fc82596 tools: Add guest trace agent as a user tool
This patch adds a user tool, "trace agent" for sending trace data of a guest to
a Host in low overhead. This agent has the following functions:
 - splice a page of ring-buffer to read_pipe without memory copying
 - splice the page from write_pipe to virtio-console without memory copying
 - write trace data to stdout by using -o option
 - controlled by start/stop orders from a Host

Changes in v2:
 - Cleanup (change fprintf() to pr_err() and an include guard)

Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Acked-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:13 +09:30
Masami Hiramatsu 8ca84a50e5 virtio/console: Allocate scatterlist according to the current pipe size
Allocate scatterlist according to the current pipe size.
This allows splicing bigger buffer if the pipe size has
been changed by fcntl.

Changes in v2:
 - Just a minor fix for avoiding a confliction with previous patch.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:12 +09:30
Masami Hiramatsu d55cb6cf14 ftrace: Allow stealing pages from pipe buffer
Use generic steal operation on pipe buffer to allow stealing
ring buffer's read page from pipe buffer.

Note that this could reduce the performance of splice on the
splice_write side operation without affinity setting.
Since the ring buffer's read pages are allocated on the
tracing-node, but the splice user does not always execute
splice write side operation on the same node. In this case,
the page will be accessed from the another node.
Thus, it is strongly recommended to assign the splicing
thread to corresponding node.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:12 +09:30
Masami Hiramatsu efe75d24a6 virtio/console: Wait until the port is ready on splice
Wait if the port is not connected or full on splice
like as write is doing.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:12 +09:30
Masami Hiramatsu ec8fc87015 virtio/console: Add a failback for unstealable pipe buffer
Add a failback memcpy path for unstealable pipe buffer.
If buf->ops->steal() fails, virtio-serial tries to
copy the page contents to an allocated page, instead
of just failing splice().

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:12 +09:30
Masami Hiramatsu eb5e89fc70 virtio/console: Add splice_write support
Enable to use splice_write from pipe to virtio-console port.
This steals pages from pipe and directly send it to host.

Note that this may accelerate only the guest to host path.

Changes in v2:
 - Use GFP_KERNEL instead of GFP_ATOMIC in syscall context function.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:12 +09:30
Linus Torvalds 925a6f0bf8 Merge tag 'hwspinlock-3.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/ohad/hwspinlock
Pull hwspinlock fix from Ohad Ben-Cohen:
 "A single hwspinlock fix by Wei Yongjun, which prevents potential NULL
  dereferences"

* tag 'hwspinlock-3.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/ohad/hwspinlock:
  hwspinlock/core: move the dereference below the NULL test
2012-09-18 11:58:54 -07:00
Miklos Szeredi b161dfa693 vfs: dcache: use DCACHE_DENTRY_KILLED instead of DCACHE_DISCONNECTED in d_kill()
IBM reported a soft lockup after applying the fix for the rename_lock
deadlock.  Commit c83ce989cb ("VFS: Fix the nfs sillyrename regression
in kernel 2.6.38") was found to be the culprit.

The nfs sillyrename fix used DCACHE_DISCONNECTED to indicate that the
dentry was killed.  This flag can be set on non-killed dentries too,
which results in infinite retries when trying to traverse the dentry
tree.

This patch introduces a separate flag: DCACHE_DENTRY_KILLED, which is
only set in d_kill() and makes try_to_ascend() test only this flag.

IBM reported successful test results with this patch.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-18 11:23:51 -07:00
Linus Torvalds 4651afbbae Merge branch 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull another workqueue fix from Tejun Heo:
 "Unfortunately, yet another late fix.  This too is discovered and fixed
  by Lai.  This bug was introduced during this merge window by commit
  25511a4776 ("workqueue: reimplement CPU online rebinding to handle
  idle workers") which started using WORKER_REBIND flag for idle rebind
  too.

  The bug is relatively easy to trigger if the CPU rapidly goes through
  off, on and then off (and stay off).  The fix is on the safer side.
  This hasn't been on linux-next yet but I'm pushing early so that it
  can get more exposure before v3.6 release."

* 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn()
2012-09-17 16:05:23 -07:00
Lai Jiangshan 960bd11bf2 workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn()
busy_worker_rebind_fn() didn't clear WORKER_REBIND if rebinding failed
(CPU is down again).  This used to be okay because the flag wasn't
used for anything else.

However, after 25511a477 "workqueue: reimplement CPU online rebinding
to handle idle workers", WORKER_REBIND is also used to command idle
workers to rebind.  If not cleared, the worker may confuse the next
CPU_UP cycle by having REBIND spuriously set or oops / get stuck by
prematurely calling idle_worker_rebind().

  WARNING: at /work/os/wq/kernel/workqueue.c:1323 worker_thread+0x4cd/0x5
 00()
  Hardware name: Bochs
  Modules linked in: test_wq(O-)
  Pid: 33, comm: kworker/1:1 Tainted: G           O 3.6.0-rc1-work+ #3
  Call Trace:
   [<ffffffff8109039f>] warn_slowpath_common+0x7f/0xc0
   [<ffffffff810903fa>] warn_slowpath_null+0x1a/0x20
   [<ffffffff810b3f1d>] worker_thread+0x4cd/0x500
   [<ffffffff810bc16e>] kthread+0xbe/0xd0
   [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10
  ---[ end trace e977cf20f4661968 ]---
  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: [<ffffffff810b3db0>] worker_thread+0x360/0x500
  PGD 0
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  Modules linked in: test_wq(O-)
  CPU 0
  Pid: 33, comm: kworker/1:1 Tainted: G        W  O 3.6.0-rc1-work+ #3 Bochs Bochs
  RIP: 0010:[<ffffffff810b3db0>]  [<ffffffff810b3db0>] worker_thread+0x360/0x500
  RSP: 0018:ffff88001e1c9de0  EFLAGS: 00010086
  RAX: 0000000000000000 RBX: ffff88001e633e00 RCX: 0000000000004140
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
  RBP: ffff88001e1c9ea0 R08: 0000000000000000 R09: 0000000000000001
  R10: 0000000000000002 R11: 0000000000000000 R12: ffff88001fc8d580
  R13: ffff88001fc8d590 R14: ffff88001e633e20 R15: ffff88001e1c6900
  FS:  0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000000000000000 CR3: 00000000130e8000 CR4: 00000000000006f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process kworker/1:1 (pid: 33, threadinfo ffff88001e1c8000, task ffff88001e1c6900)
  Stack:
   ffff880000000000 ffff88001e1c9e40 0000000000000001 ffff88001e1c8010
   ffff88001e519c78 ffff88001e1c9e58 ffff88001e1c6900 ffff88001e1c6900
   ffff88001e1c6900 ffff88001e1c6900 ffff88001fc8d340 ffff88001fc8d340
  Call Trace:
   [<ffffffff810bc16e>] kthread+0xbe/0xd0
   [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10
  Code: b1 00 f6 43 48 02 0f 85 91 01 00 00 48 8b 43 38 48 89 df 48 8b 00 48 89 45 90 e8 ac f0 ff ff 3c 01 0f 85 60 01 00 00 48 8b 53 50 <8b> 02 83 e8 01 85 c0 89 02 0f 84 3b 01 00 00 48 8b 43 38 48 8b
  RIP  [<ffffffff810b3db0>] worker_thread+0x360/0x500
   RSP <ffff88001e1c9de0>
  CR2: 0000000000000000

There was no reason to keep WORKER_REBIND on failure in the first
place - WORKER_UNBOUND is guaranteed to be set in such cases
preventing incorrectly activating concurrency management.  Always
clear WORKER_REBIND.

tj: Updated comment and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-09-17 15:42:31 -07:00
Linus Torvalds 08077ca849 Merge branch 'akpm' (Andrew's patch-bomb)
Merge fixes from Andrew Morton:
 "13 patches.  12 are fixes and one is a little preparatory thing for
  Andi."

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (13 commits)
  memory hotplug: fix section info double registration bug
  mm/page_alloc: fix the page address of higher page's buddy calculation
  drivers/rtc/rtc-twl.c: ensure all interrupts are disabled during probe
  compiler.h: add __visible
  pid-namespace: limit value of ns_last_pid to (0, max_pid)
  include/net/sock.h: squelch compiler warning in sk_rmem_schedule()
  slub: consider pfmemalloc_match() in get_partial_node()
  slab: fix starting index for finding another object
  slab: do ClearSlabPfmemalloc() for all pages of slab
  nbd: clear waiting_queue on shutdown
  MAINTAINERS: fix TXT maintainer list and source repo path
  mm/ia64: fix a memory block size bug
  memory hotplug: reset pgdat->kswapd to NULL if creating kernel thread fails
2012-09-17 15:01:14 -07:00
qiuxishi f14851af0e memory hotplug: fix section info double registration bug
There may be a bug when registering section info.  For example, on my
Itanium platform, the pfn range of node0 includes the other nodes, so
other nodes' section info will be double registered, and memmap's page
count will equal to 3.

  node0: start_pfn=0x100,    spanned_pfn=0x20fb00, present_pfn=0x7f8a3, => 0x000100-0x20fc00
  node1: start_pfn=0x80000,  spanned_pfn=0x80000,  present_pfn=0x80000, => 0x080000-0x100000
  node2: start_pfn=0x100000, spanned_pfn=0x80000,  present_pfn=0x80000, => 0x100000-0x180000
  node3: start_pfn=0x180000, spanned_pfn=0x80000,  present_pfn=0x80000, => 0x180000-0x200000

  free_all_bootmem_node()
	register_page_bootmem_info_node()
		register_page_bootmem_info_section()

When hot remove memory, we can't free the memmap's page because
page_count() is 2 after put_page_bootmem().

  sparse_remove_one_section()
	free_section_usemap()
		free_map_bootmem()
			put_page_bootmem()

[akpm@linux-foundation.org: add code comment]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17 15:00:38 -07:00
Li Haifeng 0ba8f2d593 mm/page_alloc: fix the page address of higher page's buddy calculation
The heuristic method for buddy has been introduced since commit
43506fad21 ("mm/page_alloc.c: simplify calculation of combined index
of adjacent buddy lists").  But the page address of higher page's buddy
was wrongly calculated, which will lead page_is_buddy to fail for ever.
IOW, the heuristic method would be disabled with the wrong page address
of higher page's buddy.

Calculating the page address of higher page's buddy should be based
higher_page with the offset between index of higher page and index of
higher page's buddy.

Signed-off-by: Haifeng Li <omycle@gmail.com>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KyongHo Cho <pullip.cho@samsung.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: <stable@vger.kernel.org>	[2.6.38+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17 15:00:38 -07:00
Kevin Hilman 8dcebaa9a0 drivers/rtc/rtc-twl.c: ensure all interrupts are disabled during probe
On some platforms, bootloaders are known to do some interesting RTC
programming.  Without going into the obscurities as to why this may be
the case, suffice it to say the the driver should not make any
assumptions about the state of the RTC when the driver loads.  In
particular, the driver probe should be sure that all interrupts are
disabled until otherwise programmed.

This was discovered when finding bursty I2C traffic every second on
Overo platforms.  This I2C overhead was keeping the SoC from hitting
deep power states.  The cause was found to be the RTC firing every
second on the I2C-connected TWL PMIC.

Special thanks to Felipe Balbi for suggesting to look for a rogue driver
as the source of the I2C traffic rather than the I2C driver itself.

Special thanks to Steve Sakoman for helping track down the source of the
continuous RTC interrups on the Overo boards.

Signed-off-by: Kevin Hilman <khilman@ti.com>
Cc: Felipe Balbi <balbi@ti.com>
Tested-by: Steve Sakoman <steve@sakoman.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Tested-by: Shubhrajyoti Datta <omaplinuxkernel@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17 15:00:38 -07:00
Andi Kleen 9a858dc7ce compiler.h: add __visible
gcc 4.6+ has support for a externally_visible attribute that prevents the
optimizer from optimizing unused symbols away.  Add a __visible macro to
use it with that compiler version or later.

This is used (at least) by the "Link Time Optimization" patchset.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17 15:00:38 -07:00
Andrew Vagin 579035dc5d pid-namespace: limit value of ns_last_pid to (0, max_pid)
The kernel doesn't check the pid for negative values, so if you try to
write -2 to /proc/sys/kernel/ns_last_pid, you will get a kernel panic.

The crash happens because the next pid is -1, and alloc_pidmap() will
try to access to a nonexistent pidmap.

  map = &pid_ns->pidmap[pid/BITS_PER_PAGE];

Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17 15:00:38 -07:00
Chuck Lever 35c448a8a3 include/net/sock.h: squelch compiler warning in sk_rmem_schedule()
This warning:

  In file included from linux/include/linux/tcp.h:227:0,
                   from linux/include/linux/ipv6.h:221,
                   from linux/include/net/ipv6.h:16,
                   from linux/include/linux/sunrpc/clnt.h:26,
                   from linux/net/sunrpc/stats.c:22:
  linux/include/net/sock.h: In function `sk_rmem_schedule':
  linux/nfs-2.6/include/net/sock.h:1339:13: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]

is seen with gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2) using the
-Wextra option.

Commit c76562b670 ("netvm: prevent a stream-specific deadlock")
accidentally replaced the "size" parameter of sk_rmem_schedule() with an
unsigned int.  This changes the semantics of the comparison in the
return statement.

In sk_wmem_schedule we have syntactically the same comparison, but
"size" is a signed integer.  In addition, __sk_mem_schedule() takes a
signed integer for its "size" parameter, so there is an implicit type
conversion in sk_rmem_schedule() anyway.

Revert the "size" parameter back to a signed integer so that the
semantics of the expressions in both sk_[rw]mem_schedule() are exactly
the same.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17 15:00:38 -07:00
Joonsoo Kim 8ba00bb68a slub: consider pfmemalloc_match() in get_partial_node()
get_partial() is currently not checking pfmemalloc_match() meaning that
it is possible for pfmemalloc pages to leak to non-pfmemalloc users.
This is a problem in the following situation.  Assume that there is a
request from normal allocation and there are no objects in the per-cpu
cache and no node-partial slab.

In this case, slab_alloc enters the slow path and new_slab_objects() is
called which may return a PFMEMALLOC page.  As the current user is not
allowed to access PFMEMALLOC page, deactivate_slab() is called
([5091b74a: mm: slub: optimise the SLUB fast path to avoid pfmemalloc
checks]) and returns an object from PFMEMALLOC page.

Next time, when we get another request from normal allocation,
slab_alloc() enters the slow-path and calls new_slab_objects().  In
new_slab_objects(), we call get_partial() and get a partial slab which
was just deactivated but is a pfmemalloc page.  We extract one object
from it and re-deactivate.

  "deactivate -> re-get in get_partial -> re-deactivate" occures repeatedly.

As a result, access to PFMEMALLOC page is not properly restricted and it
can cause a performance degradation due to frequent deactivation.
deactivation frequently.

This patch changes get_partial_node() to take pfmemalloc_match() into
account and prevents the "deactivate -> re-get in get_partial()
scenario.  Instead, new_slab() is called.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17 15:00:38 -07:00