Coverity complains about a possible copy-paste error in the verification
of the namespace atomic parameters (CID 1642811). While the check is
correct, the code (and the intention) is unclear.
Fix this by reworking how the parameters are verified. Peter also
identified that the realize function was not correctly erroring out if
parameters were misconfigured, so fix that too.
Lastly, change the error messages to be more describing.
Coverity: CID 1642811
Fixes: bce51b8370 ("hw/nvme: add atomic boundary support")
Fixes: 3b41acc962 ("hw/nvme: enable ns atomic writes")
Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
This doesn’t seem to be specified anywhere, but is something we probably
want to be clear. I believe it is reasonable to implicitly assume that
callbacks are run in the current thread (unless explicitly noted
otherwise), so codify that assumption.
Some implementations don’t actually fulfill this contract yet. The next
patches should rectify that.
Note: I don’t know of any user-visible bugs produced by not running AIO
callbacks in the original context. AIO functionality is generally
mapped to coroutines through the use of bdrv_co_io_em_complete(), which
can run in any AioContext, and will always wake the yielding coroutine
in its original context. The only benefit here is that running
bdrv_co_io_em_complete() in the original context will make that
aio_co_wake() most likely a simpler qemu_coroutine_enter() instead of
scheduling the wakeup through AioContext.co_schedule_bh.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-17-hreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
aio_co_wake() is generally safe to call regardless of whether the
coroutine is already yielding or not. If it is not yet yielding, it
will be scheduled to run when it does yield.
Caveats:
- The caller must be independent of the coroutine (to ensure the
coroutine must be yielding if both are in the same AioContext), i.e.
must not be the same coroutine
- The coroutine must yield at some point
Make note of this so callers can reason that their use is safe.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-2-hreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch is pure refactoring: instead of hard-coding permission to
use a protocol prefix when creating an image, the drivers can now pass
in a parameter, comparable to what they could already do for opening a
pre-existing image. This patch is purely mechanical (all drivers pass
in true for now), but it will enable the next patch to cater to
drivers that want to differ in behavior for the primary image vs. any
secondary images that are opened at the same time as creating the
primary image.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20250915213919.3121401-5-eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
AioContext has its own io_uring instance for file descriptor monitoring.
The disk I/O io_uring code was developed separately. Originally I
thought the characteristics of file descriptor monitoring and disk I/O
were too different, requiring separate io_uring instances.
Now it has become clear to me that it's feasible to share a single
io_uring instance for file descriptor monitoring and disk I/O. We're not
using io_uring's IOPOLL feature or anything else that would require a
separate instance.
Unify block/io_uring.c and util/fdmon-io_uring.c using the new
aio_add_sqe() API that allows user-defined io_uring sqe submission. Now
block/io_uring.c just needs to submit readv/writev/fsync and most of the
io_uring-specific logic is handled by fdmon-io_uring.c.
There are two immediate advantages:
1. Fewer system calls. There is no need to monitor the disk I/O io_uring
ring fd from the file descriptor monitoring io_uring instance. Disk
I/O completions are now picked up directly. Also, sqes are
accumulated in the sq ring until the end of the event loop iteration
and there are fewer io_uring_enter(2) syscalls.
2. Less code duplication.
Note that error_setg() messages are not supposed to end with
punctuation, so I removed a '.' for the non-io_uring build error
message.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-15-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Introduce the aio_add_sqe() API for submitting io_uring requests in the
current AioContext. This allows other components in QEMU, like the block
layer, to take advantage of io_uring features without creating their own
io_uring context.
This API supports nested event loops just like file descriptor
monitoring and BHs do. This comes at a complexity cost: CQE callbacks
must be placed on a list so that nested event loops can invoke pending
CQE callbacks from parent event loops. If you're wondering why
CqeHandler exists instead of just a callback function pointer, this is
why.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-14-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The ppoll and epoll file descriptor monitoring implementations rely on
the event loop's generic file descriptor, timer, and BH dispatch code to
invoke user callbacks.
The io_uring file descriptor monitoring implementation will need
io_uring-specific dispatch logic for CQE handlers for custom SQEs.
Introduce a new FDMonOps ->dispatch() callback that allows file
descriptor monitoring implementations to invoke user callbacks. The next
patch will use this new callback.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20251104022933.618123-13-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When aio_context_new() -> aio_context_setup() fails at startup it
doesn't really matter whether errors are returned to the caller or the
process terminates immediately.
However, it is not acceptable to terminate when hotplugging --object
iothread at runtime. Refactor aio_context_setup() so that errors can be
propagated. The next commit will set errp when fdmon_io_uring_setup()
fails.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-10-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
g_source_destroy() only removes the GSource from the GMainContext it's
attached to, if any. It does not free it.
Use g_source_unref() instead so that the AioContext (which embeds a
GSource) is freed. There is no need to call g_source_destroy() in
aio_context_new() because the GSource isn't attached to a GMainContext
yet.
aio_ctx_finalize() expects everything to be set up already, so introduce
the new ctx->initialized boolean and do nothing when called with
!initialized. This also requires moving aio_context_setup() down after
event_notifier_init() since aio_ctx_finalize() won't release any
resources that aio_context_setup() acquired.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-9-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is no need for aio_context_use_g_source() now that epoll(7) and
io_uring(7) file descriptor monitoring works with the glib event loop.
AioContext doesn't need to be notified that GSource is being used.
On hosts with io_uring support this now enables fdmon-io_uring.c by
default, replacing fdmon-poll.c and fdmon-epoll.c. In other words, the
event loop will use io_uring!
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-8-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
AioContext's glib integration only supports ppoll(2) file descriptor
monitoring. epoll(7) and io_uring(7) disable themselves and switch back
to ppoll(2) when the glib event loop is used. The main loop thread
cannot use epoll(7) or io_uring(7) because it always uses the glib event
loop.
Future QEMU features may require io_uring(7). One example is uring_cmd
support in FUSE exports. Each feature could create its own io_uring(7)
context and integrate it into the event loop, but this is inefficient
due to extra syscalls. It would be more efficient to reuse the
AioContext's existing fdmon-io_uring.c io_uring(7) context because
fdmon-io_uring.c will already be active on systems where Linux io_uring
is available.
In order to keep fdmon-io_uring.c's AioContext operational even when the
glib event loop is used, extend FDMonOps with an API similar to
GSourceFuncs so that file descriptor monitoring can integrate into the
glib event loop.
A quick summary of the GSourceFuncs API:
- prepare() is called each event loop iteration before waiting for file
descriptors and timers.
- check() is called to determine whether events are ready to be
dispatched after waiting.
- dispatch() is called to process events.
More details here: https://docs.gtk.org/glib/struct.SourceFuncs.html
Move the ppoll(2)-specific code from aio-posix.c into fdmon-poll.c and
also implement epoll(7)- and io_uring(7)-specific file descriptor
monitoring code for glib event loops.
Note that it's still faster to use aio_poll() rather than the glib event
loop since glib waits for file descriptor activity with ppoll(2) and
does not support adaptive polling. But at least epoll(7) and io_uring(7)
now work in glib event loops.
Splitting this into multiple commits without temporarily breaking
AioContext proved difficult so this commit makes all the changes. The
next commit will remove the aio_context_use_g_source() API because it is
no longer needed.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-7-stefanha@redhat.com>
[kwolf: Build fixes; fix AioContext.list_lock use after destroy]
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a new --limits option to 'qemu-img info' that displays the block
limits for the image and all of its children, making the information
more accessible for human users than in QMP. This option is not enabled
by default because it can be a lot of output that isn't usually relevant
if you're not specifically trying to diagnose some I/O problem.
This makes the same information automatically also available in HMP
'info block -v'.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251024123041.51254-4-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Patches to expose the limits in QAPI have made clear that the existing
documentation of BlockLimits could be improved: The meaning of
min_mem_alignment and opt_mem_alignment could be clearer, and talking
about better alignment values isn't helpful when we only detect these
values and never choose them.
Make the changes in the BlockLimits documentation now, so that the
patches exposing the fields in QAPI can use descriptions consistent with
it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251024123041.51254-2-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If a filtered child is resized, the size of the parent node is now
also refreshed (recursively for chains of filtered children).
For filter block drivers that do not implement .bdrv_co_getlength(),
this commit does not change the current behavior, because
bdrv_co_refresh_total_sectors() will used the current size via the
passed-in hint. This is the case for block drivers for (some) block
jobs, as well as copy-before-write.
Block jobs already set up a blocker preventing a QMP block_resize
operation while the job is running. That does not directly cover an
associated 'file' node of a 'raw' node, but resizing such a 'file'
node is already prevented too (backup, commit, mirror and stream were
checked).
The other case is copy-before-write. This commit does not change the
fact that the copy-before-write node still has the same size after its
filtered child is resized.
Block drivers that do implement .bdrv_co_getlength() and where
.is_filter is true, already returned the length of the file child, so
there is no change before and after this commit, with two exceptions:
1. preallocate can return an early data_end and otherwise queries the
file child, but that special casing is not changed.
2. blkverify returns the length of the test file. This commit does not
affect that behavior.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250917115509.401015-4-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch allows stats-intervals to be used for storage
devices with the -device option. It accepts a list of interval
lengths in JSON format.
It configures and collects the stats in the BlockBackend layer
through the storage device that consumes the BlockBackend.
Signed-off-by: Chandan Somani <csomani@redhat.com>
Message-ID: <20251003220039.1336663-1-csomani@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function bdrv_open_child_common() calls
bdrv_graph_wrlock_drained(), which must be called with the graph
unlocked. Mark it and its two callers bdrv_open_file_child() and
bdrv_open_child() as GRAPH_UNLOCKED. This requires temporarily
unlocking in vmdk_parse_extents() and making the locked section
shorter in vmdk_open().
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-48-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The functions blk_log_writes_close(), blkverify_close(),
quorum_close(), vmdk_close() via vmdk_free_extents(), and other
bdrv_close() implementations call bdrv_graph_wrlock_drained(), which
must be called with the graph unlocked. They are reached via the
BlockDriver's bdrv_close() callback and the bdrv_close() wrapper,
which are also marked as GRAPH_UNLOCKED_PTR and GRAPH_UNLOCKED.
Furthermore, the function bdrv_close() also calls bdrv_drained_begin()
and bdrv_graph_wrlock_drained(), so there are additional reasons for
marking it GRAPH_UNLOCKED.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-47-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>