The use of the union reduces the size of closure struct by taking advantage
of the current size of its members. The offset of func in work_struct
equals the size of the first three members, so that work.work_func will
just reference the forth member - fn.
This is smart but dangerous. It can be broken if work_struct or the other
structs get changed, and can be a bit difficult to debug.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The time spent searching for things to write back "counts" for the
actual rate achieved, so don't flush the accumulated rate with each
chunk.
This will maintain better fidelity to user-commanded rates, but it
may slightly increase the burstiness of writeback. The writeback
lock needs improvement to help mitigate this.
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The previous code artificially limited writeback rate to 1000000
blocks/second (NSEC_PER_MSEC), which is a rate that can be met on fast
hardware. The rate limiting code works fine (though with decreased
precision) up to 3 orders of magnitude faster, so use NSEC_PER_SEC.
Additionally, ensure that uint32_t is used as a type for rate throughout
the rate management so that type checking/clamp_t can work properly.
bch_next_delay should be rewritten for increased precision and better
handling of high rates and long sleep periods, but this is adequate for
now.
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reported-by: Coly Li <colyli@suse.de>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This works in conjunction with the new PI controller. Currently, in
real-world workloads, the rate controller attempts to write back 1
sector per second. In practice, these minimum-rate writebacks are
between 4k and 60k in test scenarios, since bcache aggregates and
attempts to do contiguous writes and because filesystems on top of
bcachefs typically write 4k or more.
Previously, bcache used to guarantee to write at least once per second.
This means that the actual writeback rate would exceed the configured
amount by a factor of 8-120 or more.
This patch adjusts to be willing to sleep up to 2.5 seconds, and to
target writing 4k/second. On the smallest writes, it will sleep 1
second like before, but many times it will sleep longer and load the
backing device less. This keeps the loading on the cache and backing
device related to writeback more consistent when writing back at low
rates.
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bcache uses a control system to attempt to keep the amount of dirty data
in cache at a user-configured level, while not responding excessively to
transients and variations in write rate. Previously, the system was a
PD controller; but the output from it was integrated, turning the
Proportional term into an Integral term, and turning the Derivative term
into a crude Proportional term. Performance of the controller has been
uneven in production, and it has tended to respond slowly, oscillate,
and overshoot.
This patch set replaces the current control system with an explicit PI
controller and tuning that should be correct for most hardware. By
default, it attempts to write at a rate that would retire 1/40th of the
current excess blocks per second. An integral term in turn works to
remove steady state errors.
IMO, this yields benefits in simplicity (removing weighted average
filtering, etc) and system performance.
Another small change is a tunable parameter is introduced to allow the
user to specify a minimum rate at which dirty blocks are retired.
There is a slight difference from earlier versions of the patch in
integral handling to prevent excessive negative integral windup.
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If an IO operation fails, and we didn't successfully read data from the
cache, don't writeback invalid/partial data to the backing disk.
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Flag for bypass if the IO is for read-ahead or background, unless the
read-ahead request is for metadata (eg, from gfs2).
Bypass if:
bio->bi_opf & (REQ_RAHEAD|REQ_BACKGROUND) &&
!(bio->bi_opf & REQ_META))
Writeback if:
op_is_sync(bio->bi_opf) ||
bio->bi_opf & (REQ_META|REQ_PRIO)
Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Current partition support of bcache is confusing and buggy. It tries to
trace non-continuous device minor numbers by an ida bit string, and
mistakenly mixed bcache device index with minor numbers. This design
generates several negative results,
- Index of bcache device name is not consecutive under /dev/. If there are
3 bcache devices, they name will be,
/dev/bcache0, /dev/bcache16, /dev/bcache32
Only bcache code indexes bcache device name is such an interesting way.
- First minor number of each bcache device is traced by ida bit string.
One bcache device will occupy 16 bits, this is not a good idea. Indeed
only one bit is enough.
- Because minor number and bcache device index are mixed, a device index
is allocated by ida_simple_get(), but an first minor number is sent into
ida_simple_remove() to release the device. It confused original author
too.
Root cause of the above errors is, bcache code should not handle device
minor numbers at all! A standard process to support multiple partitions in
Linux kernel is,
- Device driver provides major device number, and indexes multiple device
instances.
- Device driver does not allocat nor trace device minor number, only
provides a first minor number of a given device instance, and sets how
many minor numbers (paritions) the device instance may have.
All rested stuffs are handled by block layer code, most of the details can
be found from block/{genhd, partition-generic}.c files.
This patch re-writes multiple partitions support for bcache. It makes
whole things to be more clear, and uses ida bit string in a more efficeint
way.
- Ida bit string only traces bcache device index, not minor number. For a
bcache device with 128 partitions, only one bit in ida bit string is
enough.
- Device minor number and device index are separated in concept. Device
index is used for /dev node naming, and ida bit string trace. Minor
number is calculated from device index and only used to initialize
first_minor of a bcache device.
- It does not follow any standard for 16 partitions on a bcache device.
This patch sets 128 partitions on single bcache device at max, this is
the limitation from GPT (GUID Partition Table) and supported by fdisk.
Considering a typical device minor number is 20 bits width, each bcache
device may have 128 partitions (7 bits), there can be 8192 bcache devices
existing on system. For most common deployment for a single server in
now days, it should be enough.
[minor spelling fixes in commit message by Michael Lyle]
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Eric Wheeler <bcache@lists.ewheeler.net>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Code comments in alloc.c:bch_alloc_sectors() mentions a function
name find_data_bucket(), the correct function name should be
pick_data_bucket() indeed. bch_alloc_sectors() is a quite important
function in bcache allocation code, fixing the typo may help
other people to have less confusion.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In bcache code, sysfs entries are created before all resources get
allocated, e.g. allocation thread of a cache set.
There is posibility for NULL pointer deference if a resource is accessed
but which is not initialized yet. Indeed Jorg Bornschein catches one on
cache set allocation thread and gets a kernel oops.
The reason for this bug is, when bch_bucket_alloc() is called during
cache set registration and attaching, ca->alloc_thread is not properly
allocated and initialized yet, call wake_up_process() on ca->alloc_thread
triggers NULL pointer deference failure. A simple and fast fix is, before
waking up ca->alloc_thread, checking whether it is allocated, and only
wake up ca->alloc_thread when it is not NULL.
Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Jorg Bornschein <jb@capsec.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: stable@vger.kernel.org
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fixes below error with clang:
../drivers/md/bcache/sysfs.c:759:3: error: function definition is not allowed here
{ return *((uint16_t *) r) - *((uint16_t *) l); }
^
../drivers/md/bcache/sysfs.c:789:32: error: use of undeclared identifier 'cmp'
sort(p, n, sizeof(uint16_t), cmp, NULL);
^
2 errors generated.
v2:
rename function to __bch_cache_cmp
Signed-off-by: Peter Foley <pefoley2@pefoley.com>
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit 4ad23a9764 ("MD: use per-cpu counter for writes_pending"),
the wait_queue is only got invoked if THREAD_WAKEUP is not set previously.
With above change, I can see process_metadata_update could always hang on
the wait queue, because mddev->thread could stay on 'D' status and the
THREAD_WAKEUP flag is not cleared since there are lots of place to wake up
mddev->thread. Then deadlock happened as follows:
linux175:~ # ps aux|grep md|grep D
root 20117 0.0 0.0 0 0 ? D 03:45 0:00 [md0_raid1]
root 20125 0.0 0.0 0 0 ? D 03:45 0:00 [md0_cluster_rec]
linux175:~ # cat /proc/20117/stack
[<ffffffffa0635604>] dlm_lock_sync+0x94/0xd0 [md_cluster]
[<ffffffffa0635674>] lock_token+0x34/0xd0 [md_cluster]
[<ffffffffa0635804>] metadata_update_start+0x64/0x110 [md_cluster]
[<ffffffffa04d985b>] md_update_sb.part.58+0x9b/0x860 [md_mod]
[<ffffffffa04da035>] md_update_sb+0x15/0x30 [md_mod]
[<ffffffffa04dc066>] md_check_recovery+0x266/0x490 [md_mod]
[<ffffffffa06450e2>] raid1d+0x42/0x810 [raid1]
[<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod]
[<ffffffff81091741>] kthread+0x101/0x140
linux175:~ # cat /proc/20125/stack
[<ffffffffa0636679>] recv_daemon+0x3f9/0x5c0 [md_cluster]
[<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod]
[<ffffffff81091741>] kthread+0x101/0x140
So let's revert the part of code in the commit to resovle the problem since
we can't get lots of benefits of previous change.
Fixes: 4ad23a9764 ("MD: use per-cpu counter for writes_pending")
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Pull block fixes from Jens Axboe:
"A collection of fixes for this series. This contains:
- NVMe pull request from Christoph, one uuid attribute fix, and one
fix for the controller memory buffer address for remapped BARs.
- use-after-free fix for bsg, from Benjamin Block.
- bcache race/use-after-free fix for a list traversal, fixing a
regression in this merge window. From Coly Li.
- null_blk change configfs dependency change from a 'depends' to a
'select'. This is a change from this merge window as well. From me.
- nbd signal fix from Josef, fixing a regression introduced with the
status code changes.
- nbd MAINTAINERS mailing list entry update.
- blk-throttle stall fix from Joseph Qi.
- blk-mq-debugfs fix from Omar, fixing an issue where we don't
register the IO scheduler debugfs directory, if the driver is
loaded with it. Only shows up if you switch through the sysfs
interface"
* 'for-linus' of git://git.kernel.dk/linux-block:
bsg-lib: fix use-after-free under memory-pressure
nvme-pci: Use PCI bus address for data/queues in CMB
blk-mq-debugfs: fix device sched directory for default scheduler
null_blk: change configfs dependency to select
blk-throttle: fix possible io stall when upgrade to max
MAINTAINERS: update list for NBD
nbd: fix -ERESTARTSYS handling
nvme: fix visibility of "uuid" ns attribute
bcache: use llist_for_each_entry_safe() in __closure_wake_up()
Pull device mapper fixes from Mike Snitzer:
- a stable fix for the alignment of the event number reported at the
end of the 'DM_LIST_DEVICES' ioctl.
- a couple stable fixes for the DM crypt target.
- a DM raid health status reporting fix.
* tag 'for-4.14/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm raid: fix incorrect status output at the end of a "recover" process
dm crypt: reject sector_size feature if device length is not aligned to it
dm crypt: fix memory leak in crypt_ctr_cipher_old()
dm ioctl: fix alignment of event number in the device list
We already have a queue_is_rq_based helper to check if a request_queue
is request based, so we can remove the flag for it.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are three important fields that indicate the overall health and
status of an array: dev_health, sync_ratio, and sync_action. They tell
us the condition of the devices in the array, and the degree to which
the array is synchronized.
This commit fixes a condition that is reported incorrectly. When a member
of the array is being rebuilt or a new device is added, the "recover"
process is used to synchronize it with the rest of the array. When the
process is complete, but the sync thread hasn't yet been reaped, it is
possible for the state of MD to be:
mddev->recovery = [ MD_RECOVERY_RUNNING MD_RECOVERY_RECOVER MD_RECOVERY_DONE ]
curr_resync_completed = <max dev size> (but not MaxSector)
and all rdevs to be In_sync.
This causes the 'array_in_sync' output parameter that is passed to
rs_get_progress() to be computed incorrectly and reported as 'false' --
or not in-sync. This in turn causes the dev_health status characters to
be reported as all 'a', rather than the proper 'A'.
This can cause erroneous output for several seconds at a time when tools
will want to be checking the condition due to events that are raised at
the end of a sync process. Fix this by properly calculating the
'array_in_sync' return parameter in rs_get_progress().
Also, remove an unnecessary intermediate 'recovery_cp' variable in
rs_get_progress().
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A recent patch aimed to cause md_write_start() to fail (rather than
block) when the mddev was suspending, so as to avoid deadlocks.
Unfortunately the test in wait_event() was wrong, and it didn't change
behaviour at all.
We wait_event() must wait until the metadata is written OR the array is
suspending.
Fixes: cc27b0c78c ("md: fix deadlock between mddev_suspend() and md_write_start()")
Cc: stable@vger.kernel.org
Reported-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
If a crypt mapping uses optional sector_size feature, additional
restrictions to mapped device segment size must be applied in
constructor, otherwise the device activation will fail later.
Fixes: 8f0009a225 ("dm crypt: optionally support larger encryption sector size")
Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Instead of adding weird retry logic in that function, utilize
__GFP_NOFAIL to ensure that the vm takes care of handling any
potential retries appropriately. This means we don't have to
call free_more_memory() from here.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
static checker reports a potential integer overflow. Cap the worker count to
avoid the overflow.
Reported:-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Shaohua Li <shli@fb.com>
raid_map calls pers->make_request, which missed the suspend check. Fix it with
the new md_handle_request API.
Fix: cc27b0c78c79(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
md_submit_flush_data calls pers->make_request, which missed the suspend check.
Fix it with the new md_handle_request API.
Reported-by: Nate Dailey <nate.dailey@stratus.com>
Tested-by: Nate Dailey <nate.dailey@stratus.com>
Fix: cc27b0c78c79(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>