Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered
by splice_shrink_spd() called from vmsplice_to_pipe().
Commit 35f3d14dbb (pipe: add support for shrinking and growing pipes)
added the capability to adjust pipe->buffers.
The problem is that some paths don't hold the pipe mutex and assume
pipe->buffers doesn't change for their duration.
Fix this by adding an nr_pages_max field to struct splice_pipe_desc, and
using it in place of pipe->buffers where appropriate.
splice_shrink_spd() loses its struct pipe_inode_info argument.
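The idea of recording the capacity once can be sketched in userspace C as follows. This is only an illustration under stated assumptions: struct spd_sketch, spd_grow() and spd_shrink() are hypothetical stand-ins, not the kernel's definitions, and calloc()/free() replace the kernel allocator. The point is that the shrink path decides from the value captured at grow time, so a concurrent resize of the pipe cannot make it free the inline array or leak a heap array.

```c
#include <stdlib.h>

#define PIPE_DEF_BUFFERS 16

/* Hypothetical, simplified stand-in for struct splice_pipe_desc. */
struct spd_sketch {
	void **pages;                         /* page pointer array in use */
	unsigned int nr_pages_max;            /* capacity captured at grow time */
	void *inline_pages[PIPE_DEF_BUFFERS]; /* storage for the common small case */
};

/* Take one snapshot of the pipe size and size the array from it. */
int spd_grow(struct spd_sketch *spd, unsigned int pipe_buffers)
{
	spd->nr_pages_max = pipe_buffers;
	if (spd->nr_pages_max <= PIPE_DEF_BUFFERS) {
		spd->pages = spd->inline_pages;
		return 0;
	}
	spd->pages = calloc(spd->nr_pages_max, sizeof(*spd->pages));
	return spd->pages ? 0 : -1;
}

/* Decide from the snapshot, never from the live pipe value. */
void spd_shrink(struct spd_sketch *spd)
{
	if (spd->nr_pages_max <= PIPE_DEF_BUFFERS)
		return;
	free(spd->pages);
	spd->pages = NULL;
}
```

Because spd_shrink() no longer needs the live pipe, it takes only the descriptor, which mirrors the dropped argument mentioned above.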
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Tom Herbert <therbert@google.com>
Cc: stable <stable@vger.kernel.org> # 2.6.35
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkg_destroy() caches @blkg->q in local variable @q. While there are
two places which need @blkg->q, only lockdep_assert_held() used the
local variable, leading to an unused local variable warning if lockdep is
configured out. Drop the local variable and just use @blkg->q
directly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rakesh Iyer <rni@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On module load, create a debugfs parent 'rssd' in the debugfs root. Then, for
each device, create a new node named after the corresponding disk. Under the
new node, two entries, 'registers' and 'flags', are created.
NOTE: These entries were removed from sysfs in the previous patch.
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch removes the 'registers' and 'flags' entries from sysfs and updates
the ABI file to reflect this change.
Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When policy data allocation fails in the middle, blkg_alloc() invokes
blkg_free() to destroy the half-constructed blkg. This ends up
calling pd_exit_fn() on policy data that never went through
pd_init_fn(). Fix it by making blkg_alloc() call pd_init_fn()
immediately after each policy data allocation.
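The ordering described above can be illustrated with a small userspace sketch. All names here (struct group, group_alloc(), group_free(), pd_init(), pd_exit(), the fail_at parameter) are hypothetical and only model the pattern; they are not the blkcg API. Each policy data object is initialised right after it is allocated, so the shared free path only ever runs the exit hook on initialised objects.

```c
#include <stdlib.h>

#define NR_POLICIES 3   /* hypothetical number of policies */

struct policy_data { int initialized; };

struct group {
	struct policy_data *pd[NR_POLICIES];
};

int exit_calls_on_uninitialized;   /* counts the bug this ordering avoids */

void pd_init(struct policy_data *pd) { pd->initialized = 1; }

void pd_exit(struct policy_data *pd)
{
	if (!pd->initialized)
		exit_calls_on_uninitialized++;
}

/* Shared teardown: runs the exit hook on every policy data that exists. */
void group_free(struct group *g)
{
	for (int i = 0; i < NR_POLICIES; i++) {
		if (!g->pd[i])
			continue;
		pd_exit(g->pd[i]);
		free(g->pd[i]);
	}
	free(g);
}

/* fail_at simulates an allocation failure at that index (-1: no failure). */
struct group *group_alloc(int fail_at)
{
	struct group *g = calloc(1, sizeof(*g));

	if (!g)
		return NULL;
	for (int i = 0; i < NR_POLICIES; i++) {
		struct policy_data *pd =
			(i == fail_at) ? NULL : calloc(1, sizeof(*pd));

		if (!pd) {
			group_free(g);   /* safe: earlier entries are initialised */
			return NULL;
		}
		g->pd[i] = pd;
		pd_init(pd);             /* initialise before the next allocation */
	}
	return g;
}
```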
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cfq may be built with or without blkcg support depending on
CONFIG_CFQ_GROUP_IOSCHED. If blkcg support is disabled, most of the
related code is ifdef'd out, but some parts are left dangling:
blkcg_policy_cfq is left zero-filled and blkcg_policy_[un]register()
calls are made on it.
Feeding a zero-filled policy to blkcg_policy_register() is incorrect and
triggers the following WARN_ON() if CONFIG_BLK_CGROUP &&
!CONFIG_CFQ_GROUP_IOSCHED.
------------[ cut here ]------------
WARNING: at block/blk-cgroup.c:867
Modules linked in:
Modules linked in:
CPU: 3 Not tainted 3.4.0-09547-gfb21aff #1
Process swapper/0 (pid: 1, task: 000000003ff80000, ksp: 000000003ff7f8b8)
Krnl PSW : 0704100180000000 00000000003d76ca (blkcg_policy_register+0xca/0xe0)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
Krnl GPRS: 0000000000000000 00000000014b85ec 00000000014b85b0 0000000000000000
000000000096fb60 0000000000000000 00000000009a8e78 0000000000000048
000000000099c070 0000000000b6f000 0000000000000000 000000000099c0b8
00000000014b85b0 0000000000667580 000000003ff7fd98 000000003ff7fd70
Krnl Code: 00000000003d76be: a7280001 lhi %r2,1
00000000003d76c2: a7f4ffdf brc 15,3d7680
#00000000003d76c6: a7f40001 brc 15,3d76c8
>00000000003d76ca: a7c8ffea lhi %r12,-22
00000000003d76ce: a7f4ffce brc 15,3d766a
00000000003d76d2: a7f40001 brc 15,3d76d4
00000000003d76d6: a7c80000 lhi %r12,0
00000000003d76da: a7f4ffc2 brc 15,3d765e
Call Trace:
([<0000000000b6f000>] initcall_debug+0x0/0x4)
[<0000000000989e8a>] cfq_init+0x62/0xd4
[<00000000001000ba>] do_one_initcall+0x3a/0x170
[<000000000096fb60>] kernel_init+0x214/0x2bc
[<0000000000623202>] kernel_thread_starter+0x6/0xc
[<00000000006231fc>] kernel_thread_starter+0x0/0xc
no locks held by swapper/0/1.
Last Breaking-Event-Address:
[<00000000003d76c6>] blkcg_policy_register+0xc6/0xe0
---[ end trace b8ef4903fcbf9dd3 ]---
This patch fixes the problem by ensuring all blkcg support code is
inside CONFIG_CFQ_GROUP_IOSCHED.
* blkcg_policy_cfq declaration and blkg_to_cfqg() definition are moved
inside the first CONFIG_CFQ_GROUP_IOSCHED block. __maybe_unused is
dropped from blkcg_policy_cfq decl.
* blkcg_deactivate_policy() invocation is moved inside ifdef. This
also makes the activation logic match cfq_init_queue().
* All blkcg_policy_[un]register() invocations are moved inside ifdef.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <20120601112954.GC3535@osiris.boeblingen.de.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cfq_init() would return zero after kmem cache creation failure. Fix
so that it returns -ENOMEM.
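A minimal sketch of this kind of error-path bug and its fix, assuming a goto-style cleanup where the return variable still holds zero from an earlier successful step. The function sketch_init() and its parameters are hypothetical; they only model the control flow and are not the cfq code.

```c
#include <errno.h>
#include <stddef.h>

/*
 * register_ret: result of an earlier registration step (0 on success).
 * pool: result of the cache creation, NULL on failure.
 */
int sketch_init(int register_ret, void *pool)
{
	int ret = register_ret;

	if (ret)
		return ret;

	/*
	 * Without this assignment, ret would still be 0 from the
	 * successful registration and the failure below would be
	 * reported as success.
	 */
	ret = -ENOMEM;
	if (!pool)
		goto err;

	return 0;
err:
	/* undo the earlier registration here */
	return ret;
}
```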
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Should be 'exynos5_xxx' instead of 'exonys5_xxx'.
The typo was introduced by commit 30b842889e ("Merge tag 'soc2' of
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc")
during the v3.5 merge window.
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
[ My bad - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull some left-over PM patches from Rafael J. Wysocki.
* 'pm-acpi' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / PM: Make acpi_pm_device_sleep_state() follow the specification
ACPI / PM: Make __acpi_bus_get_power() cover D3cold correctly
ACPI / PM: Fix error messages in drivers/acpi/bus.c
rtc-cmos / PM: report wakeup event on ACPI RTC alarm
ACPI / PM: Generate wakeup events on fixed power button
This reverts commit 5ceb9ce6fe.
That commit seems to be the cause of the mm compaction list corruption
issues that Dave Jones reported. The locking (or rather, absence
thereof) is dubious, as is the use of the 'page' variable once it has
been found to be outside the pageblock range.
So revert it for now; we can revisit this for 3.6. If we even need to:
as Minchan Kim says, "The patch wasn't a bug fix and even test workload
was very theoretical".
Reported-and-tested-by: Dave Jones <davej@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
New tmpfs use of !PageUptodate pages for fallocate() is triggering the
WARNING: at mm/page-writeback.c:1990 when __set_page_dirty_nobuffers()
is called from migrate_page_copy() for compaction.
It is anomalous that migration should use __set_page_dirty_nobuffers()
on an address_space that does not participate in dirty and writeback
accounting; and this has also been observed to insert surprising dirty
tags into a tmpfs radix_tree, despite tmpfs not using tags at all.
We should probably give migrate_page_copy() a better way to preserve the
tag and migrate accounting info, when mapping_cap_account_dirty(). But
that needs some more work: so in the interim, avoid the warning by using
a simple SetPageDirty on PageSwapBacked pages.
Reported-and-tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The comment above it says "Stat data, not accessed from path walking",
but in fact some of the inode fields we use for the common stat data were
way down at the end of the inode, causing unnecessary cache misses for the
common stat operations.
The inode structure is pretty big, and this can change padding depending
on field width, but at least on the common 64-bit configurations this
doesn't change the size. Some of our inode layout has historically been
chosen to try to avoid unnecessary padding fields, but cache locality is at
least as important for layout, if not more.
Noticed by looking at kernel profiles, and noticing that the "i_blkbits"
access stood out like a sore thumb.
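The layout effect can be checked with offsetof(). The sketch below uses two hypothetical structs (not struct inode) and assumes a 64-byte cache line and an object that starts on a line boundary. It shows that moving a small field next to the other fields read together with it puts them on the same line.

```c
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Hypothetical layout: a stat-related field stranded at the end. */
struct scattered {
	unsigned long size;
	char other[200];        /* unrelated fields in between */
	unsigned int blkbits;
};

/* Same fields, with the stat-related ones kept together. */
struct grouped {
	unsigned long size;
	unsigned int blkbits;
	char other[200];
};

/* True if two field offsets fall in the same cache line. */
int same_cache_line(size_t off_a, size_t off_b)
{
	return off_a / CACHE_LINE == off_b / CACHE_LINE;
}
```

Both structs hold the same members, so on common ABIs the regrouping costs no extra space while saving a cache miss on the combined access.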
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull device-mapper updates from Alasdair G Kergon:
"Improve multipath's retrying mechanism in some defined circumstances
and provide a simple reserve/release mechanism for userspace tools to
access thin provisioning metadata while the pool is in use."
* tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
dm thin: provide userspace access to pool metadata
dm thin: use slab mempools
dm mpath: allow ioctls to trigger pg init
dm mpath: delay retry of bypassed pg
dm mpath: reduce size of struct multipath
This patch implements two new messages that can be sent to the thin
pool target allowing it to take a snapshot of the _metadata_. This
read-only snapshot can be accessed by userland, concurrently with the
live target.
Only one metadata snapshot can be held at a time. The pool's status
line will give the block location for the current msnap.
Since version 0.1.5 of the userland thin provisioning tools, the
thin_dump program displays the msnap as follows:
thin_dump -m <msnap root> <metadata dev>
Available here: https://github.com/jthornber/thin-provisioning-tools
Now that userland can access the metadata we can do various things
that have traditionally been kernel side tasks:
i) Incremental backups.
By using metadata snapshots we can work out what blocks have
changed over time. Combined with data snapshots we can ensure
the data doesn't change while we back it up.
A short proof of concept script can be found here:
https://github.com/jthornber/thinp-test-suite/blob/master/incremental_backup_example.rb
ii) Migration of thin devices from one pool to another.
iii) Merging snapshots back into an external origin.
iv) Asynchronous replication.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use dedicated caches prefixed with a "dm_" name rather than relying on
kmalloc mempools backed by generic slab caches so the memory usage of
thin provisioning (and any leaks) can be accounted for independently.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
After the failure of a group of paths, any alternative paths that
need initialising do not become available until further I/O is sent to
the device. Until this has happened, ioctls return -EAGAIN.
With this patch, new paths are made available in response to an ioctl
too. The processing of the ioctl gets delayed until this has happened.
Instead of returning an error, we submit a work item to kmultipathd
(that will potentially activate the new path) and retry in ten
milliseconds.
Note that the patch doesn't retry an ioctl if the ioctl itself fails due
to a path failure. Such retries should be handled intelligently by the
code that generated the ioctl in the first place, noting that some SCSI
commands should not be retried because they are not idempotent (XOR write
commands). For commands that could be retried, there is a danger that
if the device rejected the SCSI command, the path could be erroneously
marked as failed, and the request would be retried on another path which
might fail too. It can be determined if the failure happens on the
device or on the SCSI controller, but there is no guarantee that all
SCSI drivers set these flags correctly.
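The retry behaviour for the -EAGAIN case can be modelled as a simple loop. This is a userspace simulation under stated assumptions: struct mpath_sketch, try_ioctl() and ioctl_with_retry() are hypothetical names, the path activation is simulated by a counter, and the 10 ms sleep is recorded rather than performed. Only -EAGAIN is retried; any other result, including an error from the ioctl itself, is returned to the caller unchanged.

```c
#include <errno.h>

#define RETRY_DELAY_MS 10

/* Hypothetical model of a multipath device with no usable path yet. */
struct mpath_sketch {
	int paths_ready_after;  /* attempts needed before a path is usable */
	int attempts;           /* how many times the ioctl was tried */
	int slept_ms;           /* total simulated delay */
};

/* Returns -EAGAIN while no initialised path is available. */
int try_ioctl(struct mpath_sketch *m)
{
	if (m->attempts++ < m->paths_ready_after)
		return -EAGAIN;
	return 0;
}

int ioctl_with_retry(struct mpath_sketch *m)
{
	int r;

	while ((r = try_ioctl(m)) == -EAGAIN) {
		/* real code would queue path-initialisation work here */
		m->slept_ms += RETRY_DELAY_MS;  /* stands in for a 10 ms sleep */
	}
	return r;
}
```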
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If I/O needs retrying and only bypassed priority groups are available,
set the pg_init_delay_retry flag to wait before retrying.
If, for example, the reason for the bypass is that the controller is
getting reset or there is a firmware upgrade happening, retrying right
away would cause a flood of log messages and retries for what could be a
few seconds or even several minutes.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move multipath structure's 'lock' and 'queue_size' members to eliminate
two 4-byte holes. Also use a bit within a single unsigned int for each
existing flag (saves 8 bytes). This allows future flags to be added
without each consuming an unsigned int.
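The saving from packing flags into bits can be seen with sizeof. The two structs below are illustrative only: the flag names are assumptions chosen for the example, and the 8-byte figure holds on ABIs where unsigned int is 4 bytes.

```c
#include <stddef.h>

/* One unsigned int per flag: 12 bytes with a 4-byte int. */
struct flags_as_ints {
	unsigned int queue_io;
	unsigned int queue_if_no_path;
	unsigned int saved_queue_if_no_path;
};

/* One bit per flag, sharing a single unsigned int: 4 bytes. */
struct flags_as_bits {
	unsigned int queue_io:1;
	unsigned int queue_if_no_path:1;
	unsigned int saved_queue_if_no_path:1;
};

/* Bytes saved by switching representation. */
size_t bytes_saved(void)
{
	return sizeof(struct flags_as_ints) - sizeof(struct flags_as_bits);
}
```

Up to 29 further one-bit flags fit in the same word before the struct grows again, assuming a 32-bit unsigned int.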
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Pull networking updates from David Miller:
1) Make syn floods consume significantly less resources by
a) Not pre-COW'ing routing metrics for SYN/ACKs
b) Mirroring the device queue mapping of the SYN for the SYN/ACK
reply.
Both from Eric Dumazet.
2) Fix calculation errors in Byte Queue Limiting, from Hiroaki SHIMODA.
3) Validate the length requested when building a paged SKB for a
socket, so we don't overrun the page vector accidentally. From Jason
Wang.
4) When netlabel is disabled, we abort all IP option processing when we
see a CIPSO option. This isn't the right thing to do; we should
simply skip over it and continue processing the remaining options
(if any). Fix from Paul Moore.
5) SRIOV fixes for the Mellanox driver from Jack Morgenstein and Marcel
Apfelbaum.
6) 8139cp enables the receiver before the ring address is properly
programmed, which potentially lets the device crap over random
memory. Fix from Jason Wang.
7) e1000/e1000e fixes for i217 RST handling, and an improper buffer
address reference in jumbo RX frame processing from Bruce Allan and
Sebastian Andrzej Siewior, respectively.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
fec_mpc52xx: fix timestamp filtering
mcs7830: Implement link state detection
e1000e: fix Rapid Start Technology support for i217
e1000: look into the page instead of skb->data for e1000_tbi_adjust_stats()
r8169: call netif_napi_del at errpaths and at driver unload
tcp: reflect SYN queue_mapping into SYNACK packets
tcp: do not create inetpeer on SYNACK message
8139cp/8139too: terminate the eeprom access with the right opmode
8139cp: set ring address before enabling receiver
cipso: handle CIPSO options correctly when NetLabel is disabled
net: sock: validate data_len before allocating skb in sock_alloc_send_pskb()
bql: Avoid possible inconsistent calculation.
bql: Avoid unneeded limit decrement.
bql: Fix POSDIFF() to integer overflow aware.
net/mlx4_core: Fix obscure mlx4_cmd_box parameter in QUERY_DEV_CAP
net/mlx4_core: Check port out-of-range before using in mlx4_slave_cap
net/mlx4_core: Fixes for VF / Guest startup flow
net/mlx4_en: Fix improper use of "port" parameter in mlx4_en_event
net/mlx4_core: Fix number of EQs used in ICM initialisation
net/mlx4_core: Fix the slave_id out-of-range test in mlx4_eq_int
Pull straggler x86 fixes from Peter Anvin:
"Three groups of patches:
- EFI boot stub documentation and the ability to print error messages;
- Removal of PTRACE_ARCH_PRCTL for x32 (obsolete interface which
should never have been ported, and the port is broken and
potentially dangerous.)
- ftrace stack corruption fixes. I'm not super-happy about the
technical implementation, but it is probably the least invasive in
the short term. In the future I would like a single method for
nesting the debug stack, however."
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, x32, ptrace: Remove PTRACE_ARCH_PRCTL for x32
x86, efi: Add EFI boot stub documentation
x86, efi; Add EFI boot stub console support
x86, efi: Only close open files in error path
ftrace/x86: Do not change stacks in DEBUG when calling lockdep
x86: Allow nesting of the debug stack IDT setting
x86: Reset the debug_stack update counter
ftrace: Use breakpoint method to update ftrace caller
ftrace: Synchronize variable setting with breakpoints
This reverts the tty layer change to use per-tty locking, because it's
not correct yet, and fixing it will require some more deep surgery.
The main revert is d29f3ef39b ("tty_lock: Localise the lock"), but
there are several smaller commits that built upon it; they also get
reverted here. The list of reverted commits is:
fde86d3108 - tty: add lockdep annotations
8f6576ad47 - tty: fix ldisc lock inversion trace
d3ca8b64b9 - pty: Fix lock inversion
b1d679afd7 - tty: drop the pty lock during hangup
abcefe5fc3 - tty/amiserial: Add missing argument for tty_unlock()
fd11b42e35 - cris: fix missing tty arg in wait_event_interruptible_tty call
d29f3ef39b - tty_lock: Localise the lock
The revert had a trivial conflict in the 68360serial.c staging driver
that got removed in the meantime.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>