4552 Commits

Author SHA1 Message Date
Kent Overstreet
1a2b74d0a2 bcachefs: fix build on 32 bit in get_random_u64_below()
bare 64 bit divides not allowed, whoops

arm-linux-gnueabi-ld: drivers/char/random.o: in function `__get_random_u64_below':
drivers/char/random.c:602:(.text+0xc70): undefined reference to `__aeabi_uldivmod'

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 19:45:54 -04:00
Kent Overstreet
90fd9ad5b0 bcachefs: Change btree wb assert to runtime error
We just had a report of the assert for "btree in write buffer for
non-write buffer btree" popping during the 6.14 upgrade.

- 150TB filesystem, after a reboot the upgrade was able to continue from
  where it left off, so no major damage.

But with 6.14 about to come out we want to get this tracked down asap,
and need more data if other users hit this.

Convert the BUG_ON() to an emergency read-only, and print out btree, the
key itself, and stack trace from the original write buffer update (which
did not have this check before).

Reported-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 10:25:25 -04:00
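The conversion described in 90fd9ad5b0 (a fatal assert replaced by an emergency read-only plus diagnostics) can be sketched in userspace C roughly as follows. This is a minimal illustration, not the bcachefs code: `demo_fs`, `demo_btree_uses_write_buffer()` and `demo_write_buffer_check()` are hypothetical names, and the rule for which btrees use the write buffer is made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical filesystem state, for illustration only. */
struct demo_fs {
	bool read_only;
};

/* Hypothetical rule: only btree IDs >= 10 go through the write buffer. */
static bool demo_btree_uses_write_buffer(unsigned btree_id)
{
	return btree_id >= 10;
}

/*
 * Instead of crashing the machine (the BUG_ON() approach), report the
 * offending update, flip the filesystem read-only, and return an error
 * so the caller can unwind.
 */
static int demo_write_buffer_check(struct demo_fs *fs, unsigned btree_id,
				   unsigned long long key)
{
	assert(fs != NULL);

	if (!demo_btree_uses_write_buffer(btree_id)) {
		fprintf(stderr,
			"write buffer update for non-write buffer btree %u, key %llu\n",
			btree_id, key);
		fs->read_only = true;
		return -1;
	}

	return 0;
}
```

The point of the pattern is that the machine stays up and the log carries enough context (btree, key) to debug the report afterwards.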
Kent Overstreet
9c18ea7ffe bcachefs: bch2_get_random_u64_below()
steal the (clever) algorithm from get_random_u32_below()

this fixes a bug where we were passing roundup_pow_of_two() a 64 bit
number - we're squaring device latencies now:

[  +1.681698] ------------[ cut here ]------------
[  +0.000010] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
[  +0.000011] shift exponent 64 is too large for 64-bit type 'long unsigned int'
[  +0.000011] CPU: 1 UID: 0 PID: 196 Comm: kworker/u32:13 Not tainted 6.14.0-rc6-dave+ #10
[  +0.000012] Hardware name: ASUS System Product Name/PRIME B460I-PLUS, BIOS 1301 07/13/2021
[  +0.000005] Workqueue: events_unbound __bch2_read_endio [bcachefs]
[  +0.000354] Call Trace:
[  +0.000005]  <TASK>
[  +0.000007]  dump_stack_lvl+0x5d/0x80
[  +0.000018]  ubsan_epilogue+0x5/0x30
[  +0.000008]  __ubsan_handle_shift_out_of_bounds.cold+0x61/0xe6
[  +0.000011]  bch2_rand_range.cold+0x17/0x20 [bcachefs]
[  +0.000231]  bch2_bkey_pick_read_device+0x547/0x920 [bcachefs]
[  +0.000229]  __bch2_read_extent+0x1e4/0x18e0 [bcachefs]
[  +0.000241]  ? bch2_btree_iter_peek_slot+0x3df/0x800 [bcachefs]
[  +0.000180]  ? bch2_read_retry_nodecode+0x270/0x330 [bcachefs]
[  +0.000230]  bch2_read_retry_nodecode+0x270/0x330 [bcachefs]
[  +0.000230]  bch2_rbio_retry+0x1fa/0x600 [bcachefs]
[  +0.000224]  ? bch2_printbuf_make_room+0x71/0xb0 [bcachefs]
[  +0.000243]  ? bch2_read_csum_err+0x4a4/0x610 [bcachefs]
[  +0.000278]  bch2_read_csum_err+0x4a4/0x610 [bcachefs]
[  +0.000227]  ? __bch2_read_endio+0x58b/0x870 [bcachefs]
[  +0.000220]  __bch2_read_endio+0x58b/0x870 [bcachefs]
[  +0.000268]  ? try_to_wake_up+0x31c/0x7f0
[  +0.000011]  ? process_one_work+0x176/0x330
[  +0.000008]  process_one_work+0x176/0x330
[  +0.000008]  worker_thread+0x252/0x390
[  +0.000008]  ? __pfx_worker_thread+0x10/0x10
[  +0.000006]  kthread+0xec/0x230
[  +0.000011]  ? __pfx_kthread+0x10/0x10
[  +0.000009]  ret_from_fork+0x31/0x50
[  +0.000009]  ? __pfx_kthread+0x10/0x10
[  +0.000008]  ret_from_fork_asm+0x1a/0x30
[  +0.000012]  </TASK>
[  +0.000046] ---[ end trace ]---

Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-13 12:40:22 -04:00
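As a rough illustration of the multiply-and-reject idea that 9c18ea7ffe borrows from get_random_u32_below(), here is a userspace sketch for 64-bit bounds. It is not the kernel implementation: `demo_random_u64()` is a hypothetical, non-cryptographic stand-in for the kernel's random source, and `unsigned __int128` is a GCC/Clang extension used here for the 64x64 multiply. The approach needs no power-of-two rounding, so it avoids the out-of-range shift shown in the UBSAN report.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a 64-bit random source; NOT secure. */
static uint64_t demo_random_u64(void)
{
	uint64_t r = 0;

	/* rand() guarantees at least 15 random bits per call. */
	for (int i = 0; i < 5; i++)
		r = (r << 15) | ((uint64_t)rand() & 0x7fff);
	return r;
}

/*
 * Return a uniformly distributed value in [0, ceil): multiply a random
 * 64-bit value by ceil, keep the high 64 bits of the 128-bit product,
 * and reject the few low-half values that would introduce bias.
 */
static uint64_t demo_random_u64_below(uint64_t ceil)
{
	assert(ceil != 0);

	unsigned __int128 mult = (unsigned __int128)ceil * demo_random_u64();

	/* Bias is only possible when the low 64 bits are below ceil. */
	if ((uint64_t)mult < ceil) {
		uint64_t bound = -ceil % ceil;	/* equals 2^64 mod ceil */

		while ((uint64_t)mult < bound)
			mult = (unsigned __int128)ceil * demo_random_u64();
	}

	return (uint64_t)(mult >> 64);
}
```

Note that the `%` on a 64-bit value is fine in userspace; kernel code built for 32-bit targets has to use a division helper instead, which is what 1a2b74d0a2 addresses.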
Kent Overstreet
69a5a13a22 bcachefs: target_congested -> get_random_u32_below()
get_random_u32_below() has a better algorithm than bch2_rand_range(),
it just didn't exist at the time.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-13 12:39:21 -04:00
Kent Overstreet
3bcde88d38 bcachefs: fix tiny leak in bch2_dev_add()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-13 00:23:19 -04:00
Kent Overstreet
dbac8feb23 bcachefs: Make sure trans is unlocked when submitting read IO
We were still using the trans after the unlock, leading to this bug in
the retry path:

00255 ------------[ cut here ]------------
00255 kernel BUG at fs/bcachefs/btree_iter.c:3348!
00255 Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
00255 bcachefs (0ca38fe8-0a26-41f9-9b5d-6a27796c7803): /fiotest offset 86048768: no device to read from:
00255   u64s 8 type extent 4098:168192:U32_MAX len 128 ver 0: durability: 0 crc: c_size 128 size 128 offset 0 nonce 0 csum crc32c 0:8040a368  compress none ec: idx 83 block 1 ptr: 0:302:128 gen 0
00255 bcachefs (0ca38fe8-0a26-41f9-9b5d-6a27796c7803): /fiotest offset 85983232: no device to read from:
00255   u64s 8 type extent 4098:168064:U32_MAX len 128 ver 0: durability: 0 crc: c_size 128 size 128 offset 0 nonce 0 csum crc32c 0:43311336  compress none ec: idx 83 block 1 ptr: 0:302:0 gen 0
00255 Modules linked in:
00255 CPU: 5 UID: 0 PID: 304 Comm: kworker/u70:2 Not tainted 6.14.0-rc6-ktest-g526aae23d67d #16040
00255 Hardware name: linux,dummy-virt (DT)
00255 Workqueue: events_unbound bch2_rbio_retry
00255 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00255 pc : __bch2_trans_get+0x100/0x378
00255 lr : __bch2_trans_get+0xa0/0x378
00255 sp : ffffff80c865b760
00255 x29: ffffff80c865b760 x28: 0000000000000000 x27: ffffff80d76ed880
00255 x26: 0000000000000018 x25: 0000000000000000 x24: ffffff80f4ec3760
00255 x23: ffffff80f4010140 x22: 0000000000000056 x21: ffffff80f4ec0000
00255 x20: ffffff80f4ec3788 x19: ffffff80d75f8000 x18: 00000000ffffffff
00255 x17: 2065707974203820 x16: 7334367520200a3a x15: 0000000000000008
00255 x14: 0000000000000001 x13: 0000000000000100 x12: 0000000000000006
00255 x11: ffffffc080b47a40 x10: 0000000000000000 x9 : ffffffc08038dea8
00255 x8 : ffffff80d75fc018 x7 : 0000000000000000 x6 : 0000000000003788
00255 x5 : 0000000000003760 x4 : ffffff80c922de80 x3 : ffffff80f18f0000
00255 x2 : ffffff80c922de80 x1 : 0000000000000130 x0 : 0000000000000006
00255 Call trace:
00255  __bch2_trans_get+0x100/0x378 (P)
00255  bch2_read_io_err+0x98/0x260
00255  bch2_read_endio+0xb8/0x2d0
00255  __bch2_read_extent+0xce8/0xfe0
00255  __bch2_read+0x2a8/0x978
00255  bch2_rbio_retry+0x188/0x318
00255  process_one_work+0x154/0x390
00255  worker_thread+0x20c/0x3b8
00255  kthread+0xf0/0x1b0
00255  ret_from_fork+0x10/0x20
00255 Code: 6b01001f 54ffff01 79408460 3617fec0 (d4210000)
00255 ---[ end trace 0000000000000000 ]---
00255 Kernel panic - not syncing: Oops - BUG: Fatal exception
00255 SMP: stopping secondary CPUs
00255 Kernel Offset: disabled
00255 CPU features: 0x000,00000070,00000010,8240500b
00255 Memory Limit: none
00255 ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]---

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-11 11:21:44 -04:00
Roxana Nicolescu
58517f4df8 bcachefs: Initialize from_inode members for bch_io_opts
When there is no inode source, all "from_inode" members in the structure
bch_io_opts should be set to false.

Fixes: 7a7c43a0c1 ("bcachefs: Add bch_io_opts fields for indicating whether the opts came from the inode")
Reported-by: syzbot+c17ad4b4367b72a853cb@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c17ad4b4367b72a853cb
Signed-off-by: Roxana Nicolescu <nicolescu.roxana@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-11 11:19:33 -04:00
Alan Huang
3a04334d62 bcachefs: Fix b->written overflow
When a bset extends past the end of the btree node, we should not add
its sectors to b->written, since doing so would overflow b->written.

Reported-by: syzbot+3cb3d9e8c3f197754825@syzkaller.appspotmail.com
Tested-by: syzbot+3cb3d9e8c3f197754825@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-11 09:19:23 -04:00
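The kind of guard 3a04334d62 describes can be sketched as below. This is a simplified model under stated assumptions, not the bcachefs code: `demo_node` and `demo_account_bset()` are hypothetical, and the 16-bit `written` field is chosen only to make the overflow risk visible.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified btree node; sizes are in sectors. */
struct demo_node {
	uint16_t written;	/* sectors accounted for so far */
	uint16_t sectors;	/* total size of the node */
};

/*
 * Account for a bset of bset_sectors. Returns 0 on success, or -1
 * (leaving written untouched) if the bset would run past the end of
 * the node, which is the case that could otherwise overflow written.
 */
static int demo_account_bset(struct demo_node *b, uint32_t bset_sectors)
{
	assert(b->written <= b->sectors);

	if (bset_sectors > (uint32_t)(b->sectors - b->written))
		return -1;

	b->written = (uint16_t)(b->written + bset_sectors);
	return 0;
}
```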
Kent Overstreet
8ba73f53dc bcachefs: copygc now skips non-rw devices
There's no point in doing copygc on non-rw devices: the fragmentation
doesn't matter if we're not writing to them, and we may not have
anywhere to put the data on our other devices.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-06 18:15:01 -05:00
Kent Overstreet
33255c161a bcachefs: Fix bch2_dev_journal_alloc() spuriously failing
Previously, we fixed journal resize spuriously failing with
-BCH_ERR_open_buckets_empty, but initial journal allocation was missed
because it didn't invoke the "block on allocator" loop at all.

Factor out the "loop on allocator" code to fix that.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-06 18:15:01 -05:00
Kent Overstreet
4a4f9b5c7c bcachefs: Don't set BCH_FEATURE_incompat_version_field unless requested
We shouldn't be setting incompatible bits or the incompatible version
field unless explicitly requested or allowed - otherwise we break mounting
with old kernels or userspace.

Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-28 19:07:33 -05:00
Kent Overstreet
eb54d2695b bcachefs: Fix truncate sometimes failing and returning 1
__bch_truncate_folio() may return 1 to indicate dirtiness of the folio
being truncated, needed for fpunch to get the i_size writes correct.

But truncate was forgetting to clear ret, and sometimes returning it as
an error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-26 19:31:05 -05:00
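The bug class fixed in eb54d2695b (a positive "informational" return value leaking out as an error) can be illustrated with a small sketch. The functions below are hypothetical stand-ins, not the bcachefs functions; they only assume the convention the message describes: negative means error, 0 means success with a clean folio, 1 means success with a dirty folio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: negative = error, 0 = clean, 1 = dirty. */
static int demo_truncate_folio(bool fail, bool dirty)
{
	if (fail)
		return -5;	/* stand-in for -EIO */
	return dirty ? 1 : 0;
}

/* Hypothetical caller: must return only 0 or a negative error. */
static int demo_truncate(bool fail, bool dirty, bool *was_dirty)
{
	assert(was_dirty != NULL);

	int ret = demo_truncate_folio(fail, dirty);
	if (ret < 0)
		return ret;

	*was_dirty = ret > 0;
	ret = 0;	/* the positive value is information, not an error */

	/* ... the remaining truncate work would go here ... */
	return ret;
}
```

Without the `ret = 0` line, a later `return ret` would hand the caller a 1, which is the failure mode the commit title describes.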
Alan Huang
677bdb7346 bcachefs: Fix deadlock
This fixes two deadlocks:

1. A deadlock involving pcpu_alloc_mutex, as pointed out by syzbot [1].
2. A recursion deadlock.

The root cause is that we hold the bc lock during alloc_percpu; fix it
by following the pattern used by __btree_node_mem_alloc().

[1] https://lore.kernel.org/all/66f97d9a.050a0220.6bad9.001d.GAE@google.com/T/

Reported-by: syzbot+fe63f377148a6371a9db@syzkaller.appspotmail.com
Tested-by: syzbot+fe63f377148a6371a9db@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-26 19:31:05 -05:00
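One common way to avoid holding a lock across an allocation, which is the root cause named in 677bdb7346, is to allocate first and only then take the lock. The sketch below is a single-threaded model under stated assumptions: the "lock" is just a flag so the rule can be checked with asserts, all names are hypothetical, and it does not claim to reproduce what __btree_node_mem_alloc() actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical cache guarded by a modeled (flag-only) lock. */
struct demo_cache {
	bool  locked;
	void *slot;
};

static void demo_lock(struct demo_cache *c)
{
	assert(!c->locked);	/* would be a recursion deadlock */
	c->locked = true;
}

static void demo_unlock(struct demo_cache *c)
{
	assert(c->locked);
	c->locked = false;
}

/* Stand-in for an allocation that may sleep or take other locks. */
static void *demo_alloc(const struct demo_cache *c, size_t bytes)
{
	assert(!c->locked);	/* must never be called under the lock */
	return calloc(1, bytes);
}

static int demo_cache_fill(struct demo_cache *c, size_t bytes)
{
	void *p = demo_alloc(c, bytes);	/* allocate before locking */

	if (!p)
		return -1;

	demo_lock(c);
	if (c->slot) {
		/* Already filled: drop our copy outside the lock. */
		demo_unlock(c);
		free(p);
		return 0;
	}
	c->slot = p;
	demo_unlock(c);
	return 0;
}
```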
Kent Overstreet
7909d1fb90 bcachefs: Check for -BCH_ERR_open_buckets_empty in journal resize
This fixes occasional failures from journal resize.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-26 19:31:05 -05:00
Kent Overstreet
4804f3ac26 bcachefs: Revert directory i_size
This turned out to have several bugs, which were missed because the fsck
code wasn't properly reporting errors - whoops.

Kicking it out for now, hopefully it can make 6.15.

Cc: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-26 19:30:38 -05:00
Kent Overstreet
cf3e696026 bcachefs: fix bch2_extent_ptr_eq()
Reviewed-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-23 23:35:33 -05:00
Alan Huang
c522093b02 bcachefs: Fix memmove when move keys down
This fix alone doesn't resolve [1], but it should be applied before
debugging that.

[1] https://syzkaller.appspot.com/bug?extid=38a0cbd267eff2d286ff

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-20 16:40:34 -05:00
Kent Overstreet
68aaa63716 bcachefs: print op->nonce on data update inconsistency
"nonce inconsistency" is popping up again, causing us to go emergency
read-only.

This one looks less serious, i.e. specific to the encryption path and
not indicative of a data corruption bug. But we'll need more info to
track it down.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-20 16:39:28 -05:00
Kent Overstreet
b04974f759 bcachefs: Fix srcu lock warning in btree_update_nodes_written()
We don't want to be holding the srcu lock while waiting on btree write
completions - easily fixed.

Reported-by: Janpieter Sollie <janpieter.sollie@edpnet.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-19 18:52:42 -05:00
Kent Overstreet
4fd509c10f bcachefs: Fix bch2_indirect_extent_missing_error()
We had some error handling confusion here;
-BCH_ERR_missing_indirect_extent is thrown by
trans_trigger_reflink_p_segment(); at this point we haven't decided
whether we're generating an error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-19 17:33:13 -05:00
Kent Overstreet
b9ddb3e1a8 bcachefs: Fix fsck directory i_size checking
Error handling was wrong, causing unhandled transaction restart errors.

check_directory_size() was also inefficient, since keys in multiple
snapshots would be iterated over once for every snapshot. Convert it to
the same scheme used for i_sectors and subdir count checking.

Cc: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-19 13:52:27 -05:00
Alan Huang
406e445b3c bcachefs: Reuse transaction
bch2_nocow_write_convert_unwritten is already in transaction context:

00191 ========= TEST   generic/648
00242 kernel BUG at fs/bcachefs/btree_iter.c:3332!
00242 Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
00242 Modules linked in:
00242 CPU: 4 UID: 0 PID: 2593 Comm: fsstress Not tainted 6.13.0-rc3-ktest-g345af8f855b7 #14403
00242 Hardware name: linux,dummy-virt (DT)
00242 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00242 pc : __bch2_trans_get+0x120/0x410
00242 lr : __bch2_trans_get+0xcc/0x410
00242 sp : ffffff80d89af600
00242 x29: ffffff80d89af600 x28: ffffff80ddb23000 x27: 00000000fffff705
00242 x26: ffffff80ddb23028 x25: ffffff80d8903fe0 x24: ffffff80ebb30168
00242 x23: ffffff80c8aeb500 x22: 000000000000005d x21: ffffff80d8904078
00242 x20: ffffff80d8900000 x19: ffffff80da9e8000 x18: 0000000000000000
00242 x17: 64747568735f6c61 x16: 6e72756f6a20726f x15: 0000000000000028
00242 x14: 0000000000000004 x13: 000000000000f787 x12: ffffffc081bbcdc8
00242 x11: 0000000000000000 x10: 0000000000000003 x9 : ffffffc08094efbc
00242 x8 : 000000001092c111 x7 : 000000000000000c x6 : ffffffc083c31fc4
00242 x5 : ffffffc083c31f28 x4 : ffffff80c8aeb500 x3 : ffffff80ebb30000
00242 x2 : 0000000000000001 x1 : 0000000000000a21 x0 : 000000000000028e
00242 Call trace:
00242  __bch2_trans_get+0x120/0x410 (P)
00242  bch2_inum_offset_err_msg+0x48/0xb0
00242  bch2_nocow_write_convert_unwritten+0x3d0/0x530
00242  bch2_nocow_write+0xeb0/0x1000
00242  __bch2_write+0x330/0x4e8
00242  bch2_write+0x1f0/0x530
00242  bch2_direct_write+0x530/0xc00
00242  bch2_write_iter+0x160/0xbe0
00242  vfs_write+0x1cc/0x360
00242  ksys_write+0x5c/0xf0
00242  __arm64_sys_write+0x20/0x30
00242  invoke_syscall.constprop.0+0x54/0xe8
00242  do_el0_svc+0x44/0xc0
00242  el0_svc+0x34/0xa0
00242  el0t_64_sync_handler+0x104/0x130
00242  el0t_64_sync+0x154/0x158
00242 Code: 6b01001f 54ffff01 79408460 3617fec0 (d4210000)
00242 ---[ end trace 0000000000000000 ]---
00242 Kernel panic - not syncing: Oops - BUG: Fatal exception

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-12 18:44:50 -05:00
Alan Huang
531323a2ef bcachefs: Pass _orig_restart_count to trans_was_restarted
_orig_restart_count is currently unused; according to the logic,
trans_was_restarted should be using _orig_restart_count.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-12 18:40:19 -05:00
Kent Overstreet
9cf6b84b71 bcachefs: CONFIG_BCACHEFS_INJECT_TRANSACTION_RESTARTS
Incorrectly handled transaction restarts can be a source of heisenbugs;
add a mode where we randomly inject them to shake them out.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-12 18:40:19 -05:00
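The idea behind 9cf6b84b71 (randomly forcing restarts so that code which mishandles them fails quickly) can be sketched in userspace as follows. Everything here is hypothetical: `demo_trans`, the `inject_one_in` knob and the error value are illustration-only names, and the real config option may work differently.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_ERR_RESTART 1

/* Hypothetical transaction with a restart-injection knob. */
struct demo_trans {
	unsigned restart_count;
	unsigned inject_one_in;	/* 0 disables injection */
};

/* Randomly fail with a restart, roughly once per inject_one_in calls. */
static int demo_maybe_inject_restart(struct demo_trans *trans)
{
	assert(trans->inject_one_in != 1);	/* would never make progress */

	if (trans->inject_one_in &&
	    (unsigned)rand() % trans->inject_one_in == 0) {
		trans->restart_count++;
		return -DEMO_ERR_RESTART;
	}
	return 0;
}

/* An operation that is safe to rerun from the top after a restart. */
static int demo_increment(struct demo_trans *trans, int *counter)
{
	int pending = *counter + 1;	/* stage the update */
	int ret = demo_maybe_inject_restart(trans);

	if (ret)
		return ret;		/* nothing was committed */

	*counter = pending;		/* commit */
	return 0;
}

/* Retry loop: rerun the whole operation until it is not restarted. */
static int demo_run(struct demo_trans *trans, int *counter)
{
	int ret;

	do {
		ret = demo_increment(trans, counter);
	} while (ret == -DEMO_ERR_RESTART);

	return ret;
}
```

If `demo_increment()` committed before the injection point, repeated restarts would double-count, and a test like "run N times, expect the counter to equal N" would expose it.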
Kent Overstreet
9f734cd076 bcachefs: Fix want_new_bset() so we write until the end of the btree node
want_new_bset() returns the address of a new bset to initialize if we
wish to do so in a btree node - either because the previous one is too
big, or because it's been written.

The case for 'previous bset was written' was wrong: it's only supposed
to check whether we have space in the node for one more block, but
because it subtracted the header from the space available it would never
initialize a new bset if we were down to the last block in a node.

Fixing this results in fewer btree node splits/compactions, which fixes
a bug with flushing the journal to go read-only sometimes not
terminating or taking excessively long.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-11 10:10:32 -05:00
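The off-by-a-header mistake described in 9f734cd076 can be shown with plain arithmetic. The two functions below are a hypothetical reduction of the check, not the want_new_bset() code; sizes are in bytes and all names and parameters are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Buggy variant: subtracts the header before comparing with one block. */
static bool demo_room_for_bset_buggy(size_t node_bytes, size_t written_bytes,
				     size_t block_bytes, size_t header_bytes)
{
	assert(written_bytes <= node_bytes);

	size_t remaining = node_bytes - written_bytes;

	return remaining >= header_bytes &&
	       remaining - header_bytes >= block_bytes;
}

/* Fixed variant: only asks whether one more block fits in the node. */
static bool demo_room_for_bset_fixed(size_t node_bytes, size_t written_bytes,
				     size_t block_bytes)
{
	assert(written_bytes <= node_bytes);

	return node_bytes - written_bytes >= block_bytes;
}
```

With exactly one block left, the buggy check compares "one block minus the header" against "one block" and always answers no, so the last block of the node is never used.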