Fsync performance for small files achieved by cfq on high-end disks is
lower than what deadline can achieve, due to idling introduced between
the sync write happening in process context and the journal commit.
Moreover, when competing with a sequential reader, a process writing
small files and fsync-ing them is starved.
This patch fixes the two problems by:
- marking journal commits as WRITE_SYNC, so that they get the REQ_NOIDLE
flag set,
- force all queues that have REQ_NOIDLE requests to be put in the noidle
tree.
Having the queue associated to the fsync-ing process and the one associated
to journal commits in the noidle tree allows:
- switching between them without idling,
- fairness vs. competing idling queues, since they will be serviced only
after the noidle tree expires its slice.
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
All the blkdev_issue_* helpers can only sanely be used for synchronous
caller. To issue cache flushes or barriers asynchronously the caller needs
to set up a bio by itself with a completion callback to move the asynchronous
state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always
specified when calling blkdev_issue_* and also remove the now unused flags
argument to blkdev_issue_flush and blkdev_issue_zeroout. For
blkdev_issue_discard we need to keep it for the secure discard flag, which
gains a more descriptive name and loses the bitops vs flag confusion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When a new disk is being discovered, add_disk() first ties the bdev to gendisk
(via register_disk()->blkdev_get()) and only after that calls
bdi_register_bdev(). Because register_disk() also creates disk's kobject, it
can happen that userspace manages to open and modify the device's data (or
inode) before its BDI is properly initialized leading to a warning in
__mark_inode_dirty().
Fix the problem by registering BDI early enough.
This patch addresses https://bugzilla.kernel.org/show_bug.cgi?id=16312
Cc: stable@kernel.org
Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Actual implementation of throttling policy in block layer. Currently it
implements READ and WRITE bytes per second throttling logic. IOPS throttling
comes in later patches.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o cgroup chagnes for throttle policy.
o Introduces READ and WRITE bytes per second throttling rules.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o This patch prepares the base for introducing new IO control policies.
Currently all the code is written knowing there is only one policy
and that is proportional bandwidth. Creating infrastructure for newer
policies to come in.
o Also there were many functions which were generated using macro. It was
very confusing. Got rid of those.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Kill extra "dev weight" header which is printed when somebody reads
blkio.weight_device file. This really seems to be out of convention. No other
blkio files are printing any header at the start of file. I think it is ok
to just print values and how to interpret values should be part of
documentation.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This is the third patch in a series which adds support for
storing partition metadata, optionally, off of the hd_struct.
One major use for that data is being able to resolve partition
by other identities than just the index on a block device. Device
enumeration varies by platform and there's a benefit to being able
to use something like EFI GPT's GUIDs to determine the correct
block device and partition to mount as the root.
This change adds that support to root= by adding support for
the following syntax:
root=PARTUUID=hex-uuid
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
I'm reposting this patch series as v4 since there have been no additional
comments, and I cleaned up one extra bit of unneeded code (in 3/3). The patches
are against Linus's tree: 2bfc96a127
(2.6.36-rc3).
Would this patchset be suitable for inclusion in an mm branch?
This changes adds a partition_meta_info struct which itself contains a
union of structures that provide partition table specific metadata.
This change leaves the union empty. The subsequent patch includes an
implementation for CONFIG_EFI_PARTITION-based metadata.
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Change type of 2nd parameter of blk_rq_aligned() into unsigned long
and remove unnecessary casting. Now we can call it with 'uaddr'
instead of 'ubuf' in __blk_rq_map_user() so that it can remove
following warnings from sparse:
block/blk-map.c:57:31: warning: incorrect type in argument 2 (different address spaces)
block/blk-map.c:57:31: expected void *addr
block/blk-map.c:57:31: got void [noderef] <asn:1>*ubuf
However blk_rq_map_kern() needs one more local variable to handle it.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Some controllers have a hardware limit on the number of protection
information scatter-gather list segments they can handle.
Introduce a max_integrity_segments limit in the block layer and provide
a new scsi_host_template setting that allows HBA drivers to provide a
value suitable for the hardware.
Add support for honoring the integrity segment limit when merging both
bios and requests.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
We have several users of min_not_zero, each of them using their own
definition. Move the define to kernel.h.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
Remove support for barriers on discards, which is unused now. Also
remove the DISCARD_NOBARRIER I/O type in favour of just setting the
rw flags up locally in blkdev_issue_discard.
tj: Also remove DISCARD_SECURE and use REQ_SECURE directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently __blk_rq_prep_clone() copies only REQ_WRITE and REQ_DISCARD.
There's no reason to omit other command flags and REQ_FUA needs to be
copied to implement FUA support in request-based dm.
REQ_COMMON_MASK which specifies flags to be copied from bio to request
already identifies all the command flags. Define REQ_CLONE_MASK to be
the same as REQ_COMMON_MASK for clarity and make __blk_rq_prep_clone()
copy all flags in the mask.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
rq->rq_disk and bio->bi_bdev->bd_disk may differ if a request has
passed through remapping drivers. FSEQ_DATA request incorrectly
followed bio->bi_bdev->bd_disk ending up being issued w/ mismatching
rq_disk. Make it follow orig_rq->rq_disk.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Tested-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
While completing a request from a REQ_FLUSH/FUA sequence, another
request can be pushed to the request queue. If a driver tests
elv_queue_empty() before completing a request and runs the queue again
only if the queue wasn't empty, this may lead to hang. Please note
that most drivers either kick the queue unconditionally or test queue
emptiness after completing the current request and don't have this
problem.
This patch removes this possibility by making REQ_FLUSH/FUA sequence
code kick the queue if the queue was empty before completing a request
from REQ_FLUSH/FUA sequence.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
init_flush_request() only set REQ_FLUSH when initializing flush
requests making them READ requests. Use WRITE_FLUSH instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We need to call blk_rq_init and elv_insert for all cases in queue_next_fseq,
so take these calls into common code. Also move the end_io initialization
from queue_flush into queue_next_fseq and rename queue_flush to
init_flush_request now that it's old name doesn't apply anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
There are a number of make_request based drivers which don't support
cache flushes. Filter out flush bio's in __generic_make_request() so
that they don't have to worry about them. All FLUSH/FUA requests with
data are converted to regular IO requests and empty ones are completed
immediately.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Now that the backend conversion is complete, export sequenced
FLUSH/FUA capability through REQ_FLUSH/FUA flags. REQ_FLUSH means the
device cache should be flushed before executing the request. REQ_FUA
means that the data in the request should be on non-volatile media on
completion.
Block layer will choose the correct way of implementing the semantics
and execute it. The request may be passed to the device directly if
the device can handle it; otherwise, it will be sequenced using one or
more proxy requests. Devices will never see REQ_FLUSH and/or FUA
which it doesn't support.
Also, unlike the original REQ_HARDBARRIER, REQ_FLUSH/FUA requests are
never failed with -EOPNOTSUPP. If the underlying device doesn't
support FLUSH/FUA, the block layer simply make those noop. IOW, it no
longer distinguishes between writeback cache which doesn't support
cache flush and writethrough/no cache. Devices which have WB cache
w/o flush are very difficult to come by these days and there's nothing
much we can do anyway, so it doesn't make sense to require everyone to
implement -EOPNOTSUPP handling. This will simplify filesystems and
block drivers as they can drop -EOPNOTSUPP retry logic for barriers.
* QUEUE_ORDERED_* are removed and QUEUE_FSEQ_* are moved into
blk-flush.c.
* REQ_FLUSH w/o data can also be directly passed to drivers without
sequencing but some drivers assume that zero length requests don't
have rq->bio which isn't true for these requests requiring the use
of proxy requests.
* REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
copied from bio to request.
* WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
WRITE_FLUSH_FUA are added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>