Commit Graph

909 Commits

Author SHA1 Message Date
Jens Axboe
1251d2025c io_uring/sqpoll: early exit thread if task_context wasn't allocated
Ideally we'd want to simply kill the task rather than wake it, but for
now let's just add a startup check that causes the thread to exit.
This can only happen if io_uring_alloc_task_context() fails, which
generally requires fault injection.

Reported-by: Ubisectech Sirius <bugreport@ubisectech.com>
Fixes: af5d68f889 ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-18 20:22:42 -06:00
Jens Axboe
e21e1c45e1 io_uring: clear opcode specific data for an early failure
If failure happens before the opcode prep handler is called, ensure that
we clear the opcode specific area of the request, which holds data
specific to that request type. This prevents errors where opcode
handlers either don't get to clear per-request private data since prep
isn't even called.

Reported-and-tested-by: syzbot+f8e9a371388aa62ecab4@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-16 11:24:50 -06:00
Jens Axboe
f3a640cca9 io_uring/net: ensure async prep handlers always initialize ->done_io
If we get a request with IOSQE_ASYNC set, then we first run the prep
async handlers. But if we then fail setting it up and want to post
a CQE with -EINVAL, we use ->done_io. This was previously guarded with
REQ_F_PARTIAL_IO, and the normal setup handlers do set it up before any
potential errors, but we need to cover the async setup too.

Fixes: 9817ad8589 ("io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-16 10:33:19 -06:00
Jens Axboe
2b35b8b43e io_uring/waitid: always remove waitid entry for cancel all
We know the request is either being removed, or already in the process of
being removed through task_work, so we can delete it from our waitid list
upfront. This is important for remove all conditions, as we otherwise
will find it multiple times and prevent cancelation progress.

Remove the dead check in cancelation as well for the hash_node being
empty or not. We already have a waitid reference check for ownership,
so we don't need to check the list too.

Cc: stable@vger.kernel.org
Fixes: f31ecf671d ("io_uring: add IORING_OP_WAITID support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-15 15:42:49 -06:00
Jens Axboe
30dab608c3 io_uring/futex: always remove futex entry for cancel all
We know the request is either being removed, or already in the process of
being removed through task_work, so we can delete it from our futex list
upfront. This is important for remove all conditions, as we otherwise
will find it multiple times and prevent cancelation progress.

Cc: stable@vger.kernel.org
Fixes: 194bb58c60 ("io_uring: add support for futex wake and wait")
Fixes: 8f350194d5 ("io_uring: add support for vectored futex waits")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-15 15:37:15 -06:00
Pavel Begunkov
5e3afe580a io_uring: fix poll_remove stalled req completion
Taking the ctx lock is not enough to use the deferred request completion
infrastructure, it'll get queued into the list but no one would expect
it there, so it will sit there until next io_submit_flush_completions().
It's hard to care about the cancellation path, so complete it via tw.

Fixes: ef7dfac51d ("io_uring/poll: serialize poll linked timer start with poll removal")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c446740bc16858f8a2a8dcdce899812f21d15f23.1710514702.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-15 09:36:56 -06:00
Gabriel Krisman Bertazi
67d1189d10 io_uring: Fix release of pinned pages when __io_uaddr_map fails
Looking at the error path of __io_uaddr_map, if we fail after pinning
the pages for any reasons, ret will be set to -EINVAL and the error
handler won't properly release the pinned pages.

I didn't manage to trigger it without forcing a failure, but it can
happen in real life when memory is heavily fragmented.

Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Fixes: 223ef47431 ("io_uring: don't allow IORING_SETUP_NO_MMAP rings on highmem pages")
Link: https://lore.kernel.org/r/20240313213912.1920-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13 16:08:25 -06:00
Pavel Begunkov
9219e4a9d4 io_uring/kbuf: rename is_mapped
In buffer lists we have ->is_mapped as well as ->is_mmap, it's
pretty hard to stay sane double checking which one means what,
and in the long run there is a high chance of an eventual bug.
Rename ->is_mapped into ->is_buf_ring.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c4838f4d8ad506ad6373f1c305aee2d2c1a89786.1710343154.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13 14:50:42 -06:00
Pavel Begunkov
2c5c0ba117 io_uring: simplify io_pages_free
We never pass a null (top-level) pointer, remove the check.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0e1a46f9a5cd38e6876905e8030bdff9b0845e96.1710343154.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13 14:50:42 -06:00
Pavel Begunkov
cef59d1ea7 io_uring: clean rings on NO_MMAP alloc fail
We make a few cancellation judgements based on ctx->rings, so let's
zero it afer deallocation for IORING_SETUP_NO_MMAP just like it's
done with the mmap case. Likely, it's not a real problem, but zeroing
is safer and better tested.

Cc: stable@vger.kernel.org
Fixes: 03d89a2de2 ("io_uring: support for user allocated memory for rings/sqes")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9ff6cdf91429b8a51699c210e1f6af6ea3f8bdcf.1710255382.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-12 09:21:36 -06:00
Jens Axboe
0a3737db84 io_uring/rw: return IOU_ISSUE_SKIP_COMPLETE for multishot retry
If read multishot is being invoked from the poll retry handler, then we
should return IOU_ISSUE_SKIP_COMPLETE rather than -EAGAIN. If not, then
a CQE will be posted with -EAGAIN rather than triggering the retry when
the file is flagged as readable again.

Cc: stable@vger.kernel.org
Reported-by: Sargun Dhillon <sargun@meta.com>
Fixes: fc68fcda04 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-12 08:29:47 -06:00
Jens Axboe
6f0974eccb io_uring: don't save/restore iowait state
This kind of state is per-syscall, and since we're doing the waiting off
entering the io_uring_enter(2) syscall, there's no way that iowait can
already be set for this case. Simplify it by setting it if we need to,
and always clearing it to 0 when done.

Fixes: 7b72d661f1 ("io_uring: gate iowait schedule on having pending requests")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-11 15:02:59 -06:00
Linus Torvalds
d2c84bdce2 Merge tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:

 - Make running of task_work internal loops more fair, and unify how the
   different methods deal with them (me)

 - Support for per-ring NAPI. The two minor networking patches are in a
   shared branch with netdev (Stefan)

 - Add support for truncate (Tony)

 - Export SQPOLL utilization stats (Xiaobing)

 - Multishot fixes (Pavel)

 - Fix for a race in manipulating the request flags via poll (Pavel)

 - Cleanup the multishot checking by making it generic, moving it out of
   opcode handlers (Pavel)

 - Various tweaks and cleanups (me, Kunwu, Alexander)

* tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux: (53 commits)
  io_uring: Fix sqpoll utilization check racing with dying sqpoll
  io_uring/net: dedup io_recv_finish req completion
  io_uring: refactor DEFER_TASKRUN multishot checks
  io_uring: fix mshot io-wq checks
  io_uring/net: add io_req_msg_cleanup() helper
  io_uring/net: simplify msghd->msg_inq checking
  io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE
  io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io
  io_uring/net: correctly handle multishot recvmsg retry setup
  io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler
  io_uring: fix io_queue_proc modifying req->flags
  io_uring: fix mshot read defer taskrun cqe posting
  io_uring/net: fix overflow check in io_recvmsg_mshot_prep()
  io_uring/net: correct the type of variable
  io_uring/sqpoll: statistics of the true utilization of sq threads
  io_uring/net: move recv/recvmsg flags out of retry loop
  io_uring/kbuf: flag request if buffer pool is empty after buffer pick
  io_uring/net: improve the usercopy for sendmsg/recvmsg
  io_uring/net: move receive multishot out of the generic msghdr path
  io_uring/net: unify how recvmsg and sendmsg copy in the msghdr
  ...
2024-03-11 11:35:31 -07:00
Gabriel Krisman Bertazi
606559dc4f io_uring: Fix sqpoll utilization check racing with dying sqpoll
Commit 3fcb9d1720 ("io_uring/sqpoll: statistics of the true
utilization of sq threads"), currently in Jens for-next branch, peeks at
io_sq_data->thread to report utilization statistics. But, If
io_uring_show_fdinfo races with sqpoll terminating, even though we hold
the ctx lock, sqd->thread might be NULL and we hit the Oops below.

Note that we could technically just protect the getrusage() call and the
sq total/work time calculations.  But showing some sq
information (pid/cpu) and not other information (utilization) is more
confusing than not reporting anything, IMO.  So let's hide it all if we
happen to race with a dying sqpoll.

This can be triggered consistently in my vm setup running
sqpoll-cancel-hang.t in a loop.

BUG: kernel NULL pointer dereference, address: 00000000000007b0
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 16587 Comm: systemd-coredum Not tainted 6.8.0-rc3-g3fcb9d17206e-dirty #69
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
RIP: 0010:getrusage+0x21/0x3e0
Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89
RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282
RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60
RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000
R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000
R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000
FS:  00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __die_body+0x1a/0x60
 ? page_fault_oops+0x154/0x440
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? do_user_addr_fault+0x174/0x7c0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? exc_page_fault+0x63/0x140
 ? asm_exc_page_fault+0x22/0x30
 ? getrusage+0x21/0x3e0
 ? seq_printf+0x4e/0x70
 io_uring_show_fdinfo+0x9db/0xa10
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? vsnprintf+0x101/0x4d0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? seq_vprintf+0x34/0x50
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? seq_printf+0x4e/0x70
 ? seq_show+0x16b/0x1d0
 ? __pfx_io_uring_show_fdinfo+0x10/0x10
 seq_show+0x16b/0x1d0
 seq_read_iter+0xd7/0x440
 seq_read+0x102/0x140
 vfs_read+0xae/0x320
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __do_sys_newfstat+0x35/0x60
 ksys_read+0xa5/0xe0
 do_syscall_64+0x50/0x110
 entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f786ec1db4d
Code: e8 46 e3 01 00 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 80 3d d9 ce 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
RSP: 002b:00007ffcb361a4b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 000055a4c8fe42f0 RCX: 00007f786ec1db4d
RDX: 0000000000000400 RSI: 000055a4c8fe48a0 RDI: 0000000000000006
RBP: 00007f786ecfb0b0 R08: 00007f786ecfb2a8 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f786ecfaf60
R13: 000055a4c8fe42f0 R14: 0000000000000000 R15: 00007ffcb361a628
 </TASK>
Modules linked in:
CR2: 00000000000007b0
---[ end trace 0000000000000000 ]---
RIP: 0010:getrusage+0x21/0x3e0
Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89
RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282
RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60
RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000
R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000
R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000
FS:  00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x1ce00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Fixes: 3fcb9d1720 ("io_uring/sqpoll: statistics of the true utilization of sq threads")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20240309003256.358-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-09 07:27:09 -07:00
Pavel Begunkov
1af04699c5 io_uring/net: dedup io_recv_finish req completion
There are two block in io_recv_finish() completing the request, which we
can combine and remove jumping.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0e338dcb33c88de83809fda021cba9e7c9681620.1709905727.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:59:20 -07:00
Pavel Begunkov
e0e4ab52d1 io_uring: refactor DEFER_TASKRUN multishot checks
We disallow DEFER_TASKRUN multishots from running by io-wq, which is
checked by individual opcodes in the issue path. We can consolidate all
it in io_wq_submit_work() at the same time moving the checks out of the
hot path.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e492f0f11588bb5aa11d7d24e6f53b7c7628afdb.1709905727.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:58:23 -07:00
Pavel Begunkov
3a96378e22 io_uring: fix mshot io-wq checks
When checking for concurrent CQE posting, we're not only interested in
requests running from the poll handler but also strayed requests ended
up in normal io-wq execution. We're disallowing multishots in general
from io-wq, not only when they came in a certain way.

Cc: stable@vger.kernel.org
Fixes: 17add5cea2 ("io_uring: force multishot CQEs into task context")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d8c5b36a39258036f93301cd60d3cd295e40653d.1709905727.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:58:23 -07:00
Jens Axboe
d9b441889c io_uring/net: add io_req_msg_cleanup() helper
For the fast inline path, we manually recycle the io_async_msghdr and
free the iovec, and then clear the REQ_F_NEED_CLEANUP flag to avoid
that needing doing in the slower path. We already do that in 2 spots, and
in preparation for adding more, add a helper and use it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:57:27 -07:00
Jens Axboe
fb6328bc2a io_uring/net: simplify msghd->msg_inq checking
Just check for larger than zero rather than check for non-zero and
not -1. This is easier to read, and also protects against any errants
< 0 values that aren't -1.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:56:31 -07:00
Jens Axboe
186daf2385 io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE
We only use the flag for this purpose, so rename it accordingly. This
further prevents various other use cases of it, keeping it clean and
consistent. Then we can also check it in one spot, when it's being
attempted recycled, and remove some dead code in io_kbuf_recycle_ring().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:56:27 -07:00
Jens Axboe
9817ad8589 io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io
Ensure that prep handlers always initialize sr->done_io before any
potential failure conditions, and with that, we now it's always been
set even for the failure case.

With that, we don't need to use the REQ_F_PARTIAL_IO flag to gate on that.
Additionally, we should not overwrite req->cqe.res unless sr->done_io is
actually positive.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:56:21 -07:00
Jens Axboe
deaef31bc1 io_uring/net: correctly handle multishot recvmsg retry setup
If we loop for multishot receive on the initial attempt, and then abort
later on to wait for more, we miss a case where we should be copying the
io_async_msghdr from the stack to stable storage. This leads to the next
retry potentially failing, if the application had the msghdr on the
stack.

Cc: stable@vger.kernel.org
Fixes: 9bb66906f2 ("io_uring: support multishot in recvmsg")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07 17:48:03 -07:00
Jens Axboe
b5311dbc2c io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler
This flag should not be persistent across retries, so ensure we clear
it before potentially attemting a retry.

Fixes: c3f9109dbc ("io_uring/kbuf: flag request if buffer pool is empty after buffer pick")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07 13:22:05 -07:00
Pavel Begunkov
1a8ec63b2b io_uring: fix io_queue_proc modifying req->flags
With multiple poll entries __io_queue_proc() might be running in
parallel with poll handlers and possibly task_work, we should not be
carelessly modifying req->flags there. io_poll_double_prepare() handles
a similar case with locking but it's much easier to move it into
__io_arm_poll_handler().

Cc: stable@vger.kernel.org
Fixes: 595e52284d ("io_uring/poll: don't enable lazy wake for POLLEXCLUSIVE")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/455cc49e38cf32026fa1b49670be8c162c2cb583.1709834755.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07 11:10:28 -07:00
Pavel Begunkov
70581dcd06 io_uring: fix mshot read defer taskrun cqe posting
We can't post CQEs from io-wq with DEFER_TASKRUN set, normal completions
are handled but aux should be explicitly disallowed by opcode handlers.

Cc: stable@vger.kernel.org
Fixes: fc68fcda04 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6fb7cba6f5366da25f4d3eb95273f062309d97fa.1709740837.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07 06:30:33 -07:00