Pull block updates from Jens Axboe:
"This is a bit bigger than it should be, but I did not want to
send it off last week due to both wanting extra testing, and expecting
a fix for the bounce regression as well. In any case, this contains:
- Fix for the blk-merge.c compilation warning on gcc 5.x from me.
- A set of back/front SG gap merge fixes, from me and from Sagi.
This ensures that we honor SG gapping for integrity payloads as
well.
- Two small fixes for null_blk from Matias, fixing a leak and a
capacity propagation issue.
- A blkcg fix from Tejun, fixing a NULL dereference.
- A fast clone optimization from Ming, fixing a performance
regression present since arbitrarily sized bios were introduced.
- Also from Ming, a regression fix for bouncing IOs"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix bounce_end_io
block: blk-merge: fast-clone bio when splitting rw bios
block: blkg_destroy_all() should clear q->root_blkg and ->root_rl.blkg
block: Copy a user iovec if it includes gaps
block: Refuse adding appending a gapped integrity page to a bio
block: Refuse request/bio merges with gaps in the integrity payload
block: Check for gaps on front and back merges
null_blk: fix wrong capacity when bs is not 512 bytes
null_blk: fix memory leak on cleanup
block: fix bogus compiler warnings in blk-merge.c
When bio bounce is involved, one new bio and its biovecs are
cloned from the incoming bio, which may itself be a fast-cloned bio
from an upper layer (such as dm).
So it is wrong to assume that the start index of the incoming
(original) bio's io vector is zero: it can be any value between
0 and (bi_max_vecs - 1), especially after a bio split.
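The indexing problem described above can be sketched in user-space C. This is a toy model, not the kernel code: `toy_vec`, `toy_bio`, and `toy_count_bounced()` are hypothetical names, and the only point is that the walk over the original bio starts at its own start index instead of at zero.

```c
#include <stddef.h>

/* Toy stand-ins for struct bio_vec and struct bio (hypothetical names). */
struct toy_vec {
    int page;                   /* page identifier */
};

struct toy_bio {
    const struct toy_vec *vecs; /* vector array, possibly shared with a parent */
    size_t start;               /* index of the first vector this bio covers */
    size_t count;               /* number of vectors this bio covers */
};

/*
 * Count how many pages were replaced by bounce pages. The i-th vector of
 * the bounce bio corresponds to orig->vecs[orig->start + i], not to
 * orig->vecs[i]: a split or fast-cloned bio may start anywhere in the array.
 */
size_t toy_count_bounced(const struct toy_bio *bounce,
                         const struct toy_bio *orig)
{
    size_t n = 0;

    for (size_t i = 0; i < bounce->count && i < orig->count; i++) {
        if (bounce->vecs[bounce->start + i].page !=
            orig->vecs[orig->start + i].page)
            n++;
    }
    return n;
}
```

With an original bio that starts at index 2 of a four-entry array, comparing against index 0 would look at the wrong pages and misreport which ones were bounced.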
This patch fixes the oops seen when booting Fedora on i386, which is
often accompanied by the following kernel log:
> [ 9.026738] systemd[1]: Switching root.
> [ 9.036467] systemd-journald[149]: Received SIGTERM from PID 1
> (systemd).
> [ 9.082262] BUG: Bad page state in process kworker/u5:1 pfn:372ac
> [ 9.083989] page:f3d32ae0 count:0 mapcount:0 mapping:f2252178
> index:0x16a
> [ 9.085755] flags: 0x40020021(locked|lru|mappedtodisk)
> [ 9.087284] page dumped because: page still charged to cgroup
> [ 9.088772] bad because of flags:
> [ 9.089731] flags: 0x21(locked|lru)
> [ 9.090818] page->mem_cgroup:f2c3e400
Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Tested-by: Adam Williamson <awilliam@redhat.com>
Cc: Ming Lin <mlin@kernel.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
biovecs have been immutable since v3.13, so it isn't necessary
to allocate biovecs for the newly cloned bios. This saves
one extra biovec allocation and copy; that allocation is often
not fixed-length and is a bit more expensive.
For example, if 'max_sectors_kb' of null_blk's queue is set
to 16 (32 sectors) via sysfs just to force more splits, this patch
increases throughput by about 70% in the sequential read test over
null_blk (direct io, bs: 1M).
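The saving can be illustrated with a small user-space sketch. It assumes a toy bio that is only an iterator (`idx`, `count`) over an immutable vector array; `toy_bio` and `toy_split()` are hypothetical names and do not mirror the kernel's bio_split() signature.

```c
#include <stddef.h>

/* Toy stand-in for struct bio_vec (hypothetical). */
struct toy_vec {
    unsigned int len;
};

/* Toy bio: an iterator over a vector array that is never modified. */
struct toy_bio {
    const struct toy_vec *vecs; /* immutable, so it may be shared */
    size_t idx;                 /* first vector covered by this bio */
    size_t count;               /* number of vectors covered */
};

/*
 * Split off the first n vectors. The returned head is a "fast clone":
 * it copies only the iterator and keeps pointing at the same vector
 * array, so no vector allocation or copy is needed. The original bio is
 * advanced past the part that was split off.
 */
struct toy_bio toy_split(struct toy_bio *bio, size_t n)
{
    struct toy_bio head = *bio;

    if (n > bio->count)
        n = bio->count;
    head.count = n;
    bio->idx += n;
    bio->count -= n;
    return head;
}
```

After a split, both bios reference the same array and the remainder starts at a non-zero index, which is why code that walks a bio must honor its iterator.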
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Ming Lin <ming.l@ssi.samsung.com>
Cc: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
This fixes a performance regression introduced by commit 54efd50bfd,
and allows us to take full advantage of the fact that we have immutable
bio_vecs. Hand applied, as it rejected violently with commit
5014c311ba.
Signed-off-by: Jens Axboe <axboe@fb.com>
While making the root blkg unconditional, ec13b1d6f0 ("blkcg: always
create the blkcg_gq for the root blkcg") removed the part which clears
q->root_blkg and ->root_rl.blkg during q exit. This leaves the two
pointers dangling after blkg_destroy_all(). The blk-throttle exit path
performs blkg traversals and dereferences ->root_blkg, which can lead
to the following oops.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000558
IP: [<ffffffff81389746>] __blkg_lookup+0x26/0x70
...
task: ffff88001b4e2580 ti: ffff88001ac0c000 task.ti: ffff88001ac0c000
RIP: 0010:[<ffffffff81389746>] [<ffffffff81389746>] __blkg_lookup+0x26/0x70
...
Call Trace:
[<ffffffff8138d14a>] blk_throtl_drain+0x5a/0x110
[<ffffffff8138a108>] blkcg_drain_queue+0x18/0x20
[<ffffffff81369a70>] __blk_drain_queue+0xc0/0x170
[<ffffffff8136a101>] blk_queue_bypass_start+0x61/0x80
[<ffffffff81388c59>] blkcg_deactivate_policy+0x39/0x100
[<ffffffff8138d328>] blk_throtl_exit+0x38/0x50
[<ffffffff8138a14e>] blkcg_exit_queue+0x3e/0x50
[<ffffffff8137016e>] blk_release_queue+0x1e/0xc0
...
While the bug is a straightforward use-after-free bug, it is tricky to
reproduce because blkg release is RCU protected and the rest of the
exit path usually finishes before the RCU grace period.
This patch fixes the bug by updating blkg_destroy_all() to clear
q->root_blkg and ->root_rl.blkg.
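The pattern behind the fix, clearing every cached pointer to an object when that object is destroyed, can be sketched as follows. This is a user-space toy: `toy_queue`, `toy_destroy_all()`, and `toy_root_id()` are hypothetical, and how the real exit paths react to a cleared pointer is outside this sketch.

```c
#include <stdlib.h>

/* Toy stand-in for struct blkcg_gq (hypothetical). */
struct toy_blkg {
    int id;
};

/* Toy queue holding two cached pointers to the same root group. */
struct toy_queue {
    struct toy_blkg *root_blkg;     /* models q->root_blkg */
    struct toy_blkg *root_rl_blkg;  /* models q->root_rl.blkg */
};

/* Destroy the root group and clear both cached pointers to it. */
void toy_destroy_all(struct toy_queue *q)
{
    free(q->root_blkg);
    q->root_blkg = NULL;
    q->root_rl_blkg = NULL;
}

/*
 * A later user can now test the pointer instead of touching freed
 * memory. Returns the root group's id, or -1 if it is already gone.
 */
int toy_root_id(const struct toy_queue *q)
{
    return q->root_blkg ? q->root_blkg->id : -1;
}
```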
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Richard W.M. Jones" <rjones@redhat.com>
Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Link: http://lkml.kernel.org/g/CA+5PVA5rzQ0s4723n5rHBcxQa9t0cW8BPPBekr_9aMRoWt2aYg@mail.gmail.com
Fixes: ec13b1d6f0 ("blkcg: always create the blkcg_gq for the root blkcg")
Cc: stable@vger.kernel.org # v4.2+
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For drivers that don't support gaps in the SG lists handed to
them we must bounce (copy the user buffers) and pass a bio that
does not include gaps. This doesn't matter for any current user,
but it will allow iser, which can't handle gaps, to use the
block virtual boundary instead of driver-local bounce
buffering when handling SG_IO commands.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This is only theoretical at the moment given that the only
subsystems that generate integrity payloads are the block layer
itself and the scsi target (which generates well-aligned integrity
payloads). But once we expose integrity metadata to user space,
we'll need to refuse appending a page with a gap (if the queue
virtual boundary is set).
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If a driver sets the block queue virtual boundary mask, it means that
it cannot handle gaps, so we must not allow those in the integrity
payload either.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Fixed up by me to have duplicate integrity merge functions, depending
on whether block integrity is enabled or not. Fixes a compilation
issue with CONFIG_BLK_DEV_INTEGRITY unset.
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull blk-cg updates from Jens Axboe:
"A bit later in the cycle, but this has been in the block tree for a
while. This is basically four patchsets from Tejun that improve our
buffered cgroup writeback. It was dependent on the other cgroup
changes, but they went in earlier in this cycle.
Series 1 is a set of 5 patches with cgroup writeback updates:
- bdi_writeback iteration fix which could lead to some wb's being
skipped or repeated during e.g. sync under memory pressure.
- Simplification of wb work wait mechanism.
- Writeback tracepoints updated to report cgroup.
Series 2 is a set of updates for the CFQ cgroup writeback handling:
cfq has always charged all async IOs to the root cgroup. It didn't
have much choice as writeback didn't know about cgroups and there
was no way to tell who to blame for a given writeback IO.
writeback finally grew support for cgroups and now tags each
writeback IO with the appropriate cgroup to charge it against.
This patchset updates cfq so that it follows the blkcg each bio is
tagged with. Async cfq_queues are now shared across cfq_group,
which is per-cgroup, instead of per-request_queue cfq_data. This
makes all IOs follow the weight based IO resource distribution
implemented by cfq.
- Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.
- Other misc review points addressed, acks added and rebased.
Series 3 is the blkcg policy cleanup patches:
This patchset contains assorted cleanups for blkcg_policy methods
and blk[c]g_policy_data handling.
- alloc/free added for blkg_policy_data. exit dropped.
- alloc/free added for blkcg_policy_data.
- blk-throttle's async percpu allocation is replaced with direct
allocation.
- all methods now take blk[c]g_policy_data instead of blkcg_gq or
blkcg.
And finally, series 4 is a set of patches cleaning up the blkcg stats
handling:
blkcg's stats have always been somewhat of a mess. This patchset
tries to improve the situation a bit.
- The following patches added to consolidate blkcg entry point and
blkg creation. This is in itself an improvement and helps with
collecting common stats on bio issue.
- per-blkg stats now accounted on bio issue rather than request
completion so that bio based and request based drivers can behave
the same way. The issue was spotted by Vivek.
- cfq-iosched implements custom recursive stats and blk-throttle
implements custom per-cpu stats. This patchset makes the blkcg core
support both by default.
- cfq-iosched and blk-throttle keep track of the same stats
multiple times. Unify them"
* 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
blkcg: implement interface for the unified hierarchy
blkcg: misc preparations for unified hierarchy interface
blkcg: separate out tg_conf_updated() from tg_set_conf()
blkcg: move body parsing from blkg_conf_prep() to its callers
blkcg: mark existing cftypes as legacy
blkcg: rename subsystem name from blkio to io
blkcg: refine error codes returned during blkcg configuration
blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
blkcg: remove cfqg_stats->sectors
blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
blkcg: make blkcg_[rw]stat per-cpu
blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
blkcg: consolidate blkg creation in blkcg_bio_issue_check()
blk-throttle: improve queue bypass handling
blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
blkcg: inline [__]blkg_lookup()
...
Pull ext3 removal, quota & udf fixes from Jan Kara:
"The biggest change in the pull is the removal of the ext3 filesystem
driver (~28k lines removed). The ext4 driver is a full-featured
replacement these days and both RH and SUSE have used it for several
years without issues. Also there are some workarounds in VM & block
layer mainly for ext3 which we could eventually get rid of.
The other larger change is the addition of proper error handling for
dquot_initialize(). The rest is small fixes and cleanups"
[ I wasn't convinced about the ext3 removal and worried about things
falling through the cracks for legacy users, but ext4 maintainers
piped up and were all unanimously in favor of removal, and maintaining
all legacy ext3 support inside ext4. - Linus ]
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Don't modify filesystem for read-only mounts
quota: remove an unneeded condition
ext4: memory leak on error in ext4_symlink()
mm/Kconfig: NEED_BOUNCE_POOL: clean-up condition
ext4: Improve ext4 Kconfig test
block: Remove forced page bouncing under IO
fs: Remove ext3 filesystem driver
doc: Update doc about journalling layer
jfs: Handle error from dquot_initialize()
reiserfs: Handle error from dquot_initialize()
ocfs2: Handle error from dquot_initialize()
ext4: Handle error from dquot_initialize()
ext2: Handle error from dquot_initalize()
quota: Propagate error from ->acquire_dquot()
We are checking for gaps to the previous bio_vec, which can
only detect gaps on back merges. Moreover, at the point where
we check for a gap, we don't know if we will attempt a back
or a front merge. Thus, check for a gap to prev in a back merge
attempt and check for a gap to next in a front merge attempt.
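A minimal sketch of the two checks, assuming a virtual boundary mask of the form (boundary - 1) and a gap rule in which the following segment must start at offset zero and the preceding segment must end on the boundary. The names `toy_seg`, `toy_back_merge_gap()`, and `toy_front_merge_gap()` are hypothetical; the kernel helpers take different arguments.

```c
#include <stdbool.h>

/* Toy segment: offset within its page and length in bytes (hypothetical). */
struct toy_seg {
    unsigned long offset;
    unsigned long len;
};

/*
 * There is a gap between two adjacent segments if the second does not
 * start at offset 0 or the first does not end on the virtual boundary.
 * A mask of 0 means the device has no such constraint.
 */
static bool toy_gap(const struct toy_seg *prev, const struct toy_seg *next,
                    unsigned long mask)
{
    if (!mask)
        return false;
    return next->offset != 0 || ((prev->offset + prev->len) & mask) != 0;
}

/* Back merge: the bio is appended, so compare request tail with bio head. */
bool toy_back_merge_gap(const struct toy_seg *req_last,
                        const struct toy_seg *bio_first, unsigned long mask)
{
    return toy_gap(req_last, bio_first, mask);
}

/* Front merge: the bio is prepended, so compare bio tail with request head. */
bool toy_front_merge_gap(const struct toy_seg *bio_last,
                         const struct toy_seg *req_first, unsigned long mask)
{
    return toy_gap(bio_last, req_first, mask);
}
```

The two wrappers differ only in which segment plays the role of "prev", which is exactly the information that is missing if the check runs before the merge direction is known.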
Signed-off-by: Jens Axboe <axboe@fb.com>
[sagig: Minor rename change]
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
The compiler can't figure out that bvprv is initialized whenever 'prev'
is set to 1 as well. Use a pointer to bvprv instead, setting it to NULL
initially, and get rid of the 'prev' tracking. This dumbs it down
enough that gcc is happy.
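The pointer-instead-of-flag pattern can be sketched in isolation. This is not the blk-merge.c code: `toy_seg` and `toy_count_segments()` are hypothetical, and the merge rule (two entries merge when they are contiguous) is a simplification chosen for the example.

```c
#include <stddef.h>

/* Toy stand-in for struct bio_vec: a start address and a length. */
struct toy_seg {
    unsigned long start;
    unsigned long len;
};

/*
 * Count physical segments, merging entries that are contiguous with the
 * previous one. Tracking the previous entry through a pointer that starts
 * out NULL replaces a separate "have previous" flag plus a struct copy,
 * so the compiler can see that the entry is never read before it is set.
 */
size_t toy_count_segments(const struct toy_seg *v, size_t n)
{
    const struct toy_seg *prev = NULL;
    size_t segs = 0;

    for (size_t i = 0; i < n; i++) {
        if (!prev || prev->start + prev->len != v[i].start)
            segs++;     /* first entry, or not contiguous: new segment */
        prev = &v[i];
    }
    return segs;
}
```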
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull SG updates from Jens Axboe:
"This contains a set of scatter-gather related changes/fixes for 4.3:
- Add support for limited chaining of sg tables even for
architectures that do not set ARCH_HAS_SG_CHAIN. From Christoph.
- Add sg chain support to target_rd. From Christoph.
- Fixup open coded sg->page_link in crypto/omap-sham. From
Christoph.
- Fixup open coded crypto ->page_link manipulation. From Dan.
- Also from Dan, automated fixup of manual sg_unmark_end()
manipulations.
- Also from Dan, automated fixup of open coded sg_phys()
implementations.
- From Robert Jarzmik, addition of an sg table splitting helper that
drivers can use"
* 'for-4.3/sg' of git://git.kernel.dk/linux-block:
lib: scatterlist: add sg splitting function
scatterlist: use sg_phys()
crypto/omap-sham: remove an open coded access to ->page_link
scatterlist: remove open coded sg_unmark_end instances
crypto: replace scatterwalk_sg_chain with sg_chain
target/rd: always chain S/G list
scatterlist: allow limited chaining without ARCH_HAS_SG_CHAIN
Pull core block updates from Jens Axboe:
"This first core part of the block IO changes contains:
- Cleanup of the bio IO error signaling from Christoph. We used to
rely on the uptodate bit and passing around of an error, now we
store the error in the bio itself.
- Improvement of the above from myself, by shrinking the bio size
down again to fit in two cachelines on x86-64.
- Revert of the max_hw_sectors cap removal from an earlier revision,
from Jeff Moyer. The removal caused performance regressions in
various tests. Reinstate the limit, and bump it to a more reasonable
size instead.
- Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
Most devices have huge trim limits, which can cause nasty latencies
when deleting files. Enable the admin to configure the size down.
We will look into having a more sane default instead of UINT_MAX
sectors.
- Improvement of the SG gaps logic from Keith Busch.
- Enable the block core to handle arbitrarily sized bios, which
enables a nice simplification of bio_add_page() (which is an IO hot
path). From Kent.
- Improvements to the partition io stats accounting, making it
faster. From Ming Lei.
- Also from Ming Lei, a basic fixup for overflow of the sysfs pending
file in blk-mq, as well as a fix for a blk-mq timeout race
condition.
- Ming Lin has been carrying Kent's above-mentioned patches forward
for a while, and testing them. Ming also did a few fixes around
that.
- Sasha Levin found and fixed a use-after-free problem introduced by
the bio->bi_error changes from Christoph.
- Small blk cgroup cleanup from Viresh Kumar"
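The bio error signaling change listed first above can be sketched in user space. This is a toy model with hypothetical `toy_` names: the completion callback takes only the bio, and the error travels in a field of the bio instead of being passed as a separate argument.

```c
#include <errno.h>
#include <stddef.h>

struct toy_bio;
typedef void (toy_end_io_t)(struct toy_bio *bio);

/* Toy bio carrying its own completion status (hypothetical layout). */
struct toy_bio {
    int bi_error;               /* 0 on success, negative errno on failure */
    toy_end_io_t *bi_end_io;    /* completion callback, takes only the bio */
    void *bi_private;           /* owner's context */
};

/* Complete the bio; the callback reads the status from the bio itself. */
void toy_bio_endio(struct toy_bio *bio)
{
    if (bio->bi_end_io)
        bio->bi_end_io(bio);
}

/* Fail the bio: record the error in the bio, then complete it. */
void toy_bio_io_error(struct toy_bio *bio)
{
    bio->bi_error = -EIO;
    toy_bio_endio(bio);
}

/* Example callback: copy the status into the owner's int. */
void toy_record_error(struct toy_bio *bio)
{
    *(int *)bio->bi_private = bio->bi_error;
}
```

Because the callback reads the status from the bio, it must do so before dropping its last reference to that bio.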
* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
blk: Fix bio_io_vec index when checking bvec gaps
block: Replace SG_GAPS with new queue limits mask
block: bump BLK_DEF_MAX_SECTORS to 2560
Revert "block: remove artifical max_hw_sectors cap"
blk-mq: fix race between timeout and freeing request
blk-mq: fix buffer overflow when reading sysfs file of 'pending'
Documentation: update notes in biovecs about arbitrarily sized bios
block: remove bio_get_nr_vecs()
fs: use helper bio_add_page() instead of open coding on bi_io_vec
block: kill merge_bvec_fn() completely
md/raid5: get rid of bio_fits_rdev()
md/raid5: split bio for chunk_aligned_read
block: remove split code in blkdev_issue_{discard,write_same}
btrfs: remove bio splitting and merge_bvec_fn() calls
bcache: remove driver private bio splitting code
block: simplify bio_add_page()
block: make generic_make_request handle arbitrarily sized bios
blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
block: don't access bio->bi_error after bio_put()
block: shrink struct bio down to 2 cache lines again
...
The SG_GAPS queue flag caused checks for bio vector alignment against
PAGE_SIZE, but the device may have different constraints. This patch
adds a queue limit that a driver with such constraints can set, which
allows requests that would otherwise have been split unnecessarily. The
new gaps check takes the request_queue as a parameter to simplify the
logic around invoking this function.
This new limit makes the queue flag redundant, so remove it and
all of its users. Device-mappers will inherit the correct settings
through blk_stack_limits().
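How a stacked device might inherit such a limit can be sketched as below. The rule used here, keeping the smaller non-zero mask, is an assumption of this sketch and is not stated in the text above; `toy_limits` and `toy_stack_limits()` are hypothetical names.

```c
/* Toy subset of queue limits (hypothetical). */
struct toy_limits {
    unsigned long virt_boundary_mask;   /* 0 means "no constraint" */
};

/* Smaller of two values, treating 0 as "unset". */
static unsigned long toy_min_not_zero(unsigned long a, unsigned long b)
{
    if (a == 0)
        return b;
    if (b == 0)
        return a;
    return a < b ? a : b;
}

/*
 * Fold the limits of a lower device into the stacked (top) device.
 * Assumption of this sketch: an unset mask never overrides a set one,
 * and between two set masks the smaller one is kept.
 */
void toy_stack_limits(struct toy_limits *top, const struct toy_limits *bottom)
{
    top->virt_boundary_mask = toy_min_not_zero(top->virt_boundary_mask,
                                               bottom->virt_boundary_mask);
}
```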
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
cgroup is trying to make the interface consistent across different
controllers. For weight based resource control, the knob should have
the range [1, 10000] and default to 100. This patch updates
cfq-iosched so that the weight range conforms. The internal
calculations have enough range and the widening of the weight range
shouldn't cause any problem.
* blkcg_policy->cpd_bind_fn() is added. If present, this is invoked
when blkcg is attached to a hierarchy.
* cfq_cpd_init() is updated to use the new default value on the
unified hierarchy.
* cfq_cpd_bind() callback is implemented to clear per-blkg configs and
apply the default config matching the hierarchy type.
* cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
is moved into !CONFIG_CFQ_GROUP_IOSCHED block. cfq_cpd_bind() is
now responsible for initializing the initial weights when blkcg is
enabled.
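The bind step in the list above can be sketched as a callback that resets every group to the default of the hierarchy it is attached to. The unified default of 100 comes from the text; the legacy default of 500 is an assumption of this sketch, and `toy_group` and `toy_cpd_bind()` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

enum {
    TOY_WEIGHT_DFL        = 100,    /* unified hierarchy default (from the text) */
    TOY_WEIGHT_LEGACY_DFL = 500     /* legacy default: assumed for this sketch */
};

/* Toy per-group configuration (hypothetical). */
struct toy_group {
    unsigned int weight;        /* effective weight */
    unsigned int dev_weight;    /* per-device override, 0 if unset */
};

/*
 * Called when the controller is attached to a hierarchy: clear the
 * per-group overrides and apply the default matching the hierarchy type.
 */
void toy_cpd_bind(struct toy_group *groups, size_t n, bool on_unified)
{
    unsigned int dfl = on_unified ? TOY_WEIGHT_DFL : TOY_WEIGHT_LEGACY_DFL;

    for (size_t i = 0; i < n; i++) {
        groups[i].dev_weight = 0;
        groups[i].weight = dfl;
    }
}
```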
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg is going to switch to cgroup common weight range as defined by
CGROUP_WEIGHT_* on the unified hierarchy. In preparation, rename
CFQ_WEIGHT_* constants to CFQ_WEIGHT_LEGACY_*.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The blkcg interface grew to be the biggest of all controllers and
unfortunately the most inconsistent too. The interface files are
inconsistent with a number of close duplicates. Some files have
recursive variants while others don't. There's a distinction between
normal and leaf weights which isn't intuitive and there are a lot of
stat knobs which don't make much sense outside of debugging and expose
too many implementation details to userland.
In the unified hierarchy, everything is always hierarchical and
internal nodes can't have tasks, which renders moot the two structural
issues twisting the current interface. The interface has to be updated
significantly anyway and this is a good chance to revamp it as a whole.
This patch implements blkcg interface for the unified hierarchy.
* (from a previous patch) blkcg is identified by "io" instead of
"blkio" on the unified hierarchy. Given that the whole interface is
updated anyway, the rename shouldn't carry noticeable conversion
overhead.
* The original interface, which consisted of 27 files, is replaced with
the following three files.
blkio.stat : per-blkcg stats
blkio.weight : per-cgroup and per-cgroup-queue weight settings
blkio.max : per-cgroup-queue bps and iops max limits
Documentation/cgroups/unified-hierarchy.txt updated accordingly.
v2: blkcg_policy->dfl_cftypes wasn't removed on
blkcg_policy_unregister() corrupting the cftypes list. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
tg_set_conf() largely consists of parsing and setting the new
config and the follow-up application and propagation. This patch
separates out the latter part into tg_conf_updated(). This will be
used to implement interface for the unified hierarchy.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, blkg_conf_prep() expects input to be of the following form
MAJ:MIN NUM
and reads the NUM part into blkg_conf_ctx->v. This is quite
restrictive and gets in the way in implementing blkcg interface for
the unified hierarchy. This patch updates blkg_conf_prep() so that it
expects
MAJ:MIN BODY_STR
where BODY_STR is an arbitrary string. blkg_conf_ctx->v is replaced
with ->body which is a char pointer pointing to the start of BODY_STR.
Parsing of the body is moved to blkg_conf_prep()'s callers.
To allow using, for example, strsep() on blkg_conf_ctx->body, it is a
non-const pointer, and to accommodate that, const is dropped from @input
too.
This doesn't cause any behavior changes.
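The split between the device key and the body can be sketched in user space. This is an illustration with hypothetical names (`toy_conf_ctx`, `toy_conf_prep()`), not the kernel function: it parses MAJ:MIN, requires whitespace after it, and hands back a writable pointer to the start of BODY_STR for the caller to parse.

```c
#include <ctype.h>
#include <errno.h>
#include <stdio.h>

/* Toy counterpart of blkg_conf_ctx (hypothetical layout). */
struct toy_conf_ctx {
    unsigned int major;
    unsigned int minor;
    char *body;         /* start of BODY_STR inside the input buffer */
};

/*
 * Parse "MAJ:MIN BODY_STR". Only the device key is interpreted here;
 * the body is left untouched for the caller. The input is non-const so
 * that callers may tokenize the body in place, e.g. with strsep().
 * Returns 0 on success or -EINVAL on a malformed key.
 */
int toy_conf_prep(char *input, struct toy_conf_ctx *ctx)
{
    int key_len = 0;
    char *body;

    if (sscanf(input, "%u:%u%n", &ctx->major, &ctx->minor, &key_len) != 2)
        return -EINVAL;

    body = input + key_len;
    if (!isspace((unsigned char)*body))
        return -EINVAL;     /* key must be followed by whitespace */
    while (isspace((unsigned char)*body))
        body++;

    ctx->body = body;
    return 0;
}
```

A caller that expects a single number can run strtoul() on `ctx->body`, while a caller that expects key=value pairs can tokenize the same buffer.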
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkio interface has become messy over time and is currently the
largest. In addition to the inconsistent naming scheme, it has
multiple stat files which report more or less the same thing, a number
of debug stat files which expose internal details which shouldn't have
been part of the public interface in the first place, recursive and
non-recursive stats and leaf and non-leaf knobs.
Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
don't make any sense on the unified hierarchy as only leaf cgroups can
contain processes. cgroups is going through a major interface
revision with the unified hierarchy involving significant fundamental
usage changes and given that a significant portion of the interface
doesn't make sense anymore, it's a good time to reorganize the
interface.
As the first step, this patch renames the externally visible subsystem
name from "blkio" to "io". This is more concise, matches the other
two major subsystem names, "cpu" and "memory", and is better suited as
blkcg will be involved in anything writeback related too, whether an
actual block device is involved or not.
As the subsystem legacy_name is set to "blkio", the only userland
visible change outside the unified hierarchy is that blkcg is reported
as "io" instead of "blkio" in the subsystem initialized message during
boot. On the unified hierarchy, blkcg now appears as "io".
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg currently returns -EINVAL for most errors which can be pretty
confusing given that the failure modes are quite varied. Update the
error returns so that
* -EINVAL only for syntactic errors.
* -ERANGE if the value is out of range.
* -ENODEV if the target device can't be found.
* -EOPNOTSUPP if the policy is not enabled on the target device.
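The mapping above can be sketched as a small user-space configuration routine. Everything here is hypothetical (`toy_dev`, `toy_set_weight()`, the device table, and the order of the checks); the weight range [1, 10000] is borrowed from the unified hierarchy range mentioned elsewhere in this log purely to have something to range-check.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy device table entry (hypothetical). */
struct toy_dev {
    unsigned int major;
    unsigned int minor;
    bool policy_enabled;
    unsigned long weight;
};

/*
 * Apply "MAJ:MIN WEIGHT" to a device table.
 *   -EINVAL      input does not parse
 *   -ERANGE      weight outside [1, 10000]
 *   -ENODEV      no such device
 *   -EOPNOTSUPP  policy not enabled on that device
 */
int toy_set_weight(struct toy_dev *devs, size_t ndevs, const char *input)
{
    unsigned int major, minor;
    unsigned long weight;

    if (sscanf(input, "%u:%u %lu", &major, &minor, &weight) != 3)
        return -EINVAL;
    if (weight < 1 || weight > 10000)
        return -ERANGE;

    for (size_t i = 0; i < ndevs; i++) {
        if (devs[i].major != major || devs[i].minor != minor)
            continue;
        if (!devs[i].policy_enabled)
            return -EOPNOTSUPP;
        devs[i].weight = weight;
        return 0;
    }
    return -ENODEV;
}
```

Distinct codes let a caller tell a typo from an out-of-range value or a missing device without having to guess from a single -EINVAL.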
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkg_to_cfqg() and blkcg_to_cfqgd() on a valid blkg with the policy
enabled are guaranteed to return non-NULL and the counterpart in
blk-throttle doesn't have these checks either. Remove the spurious
NULL checks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>