58 Commits

Ernesto A. Fernández 6fcc69c05d Don't refer to record values as "data"
The paper I followed in the early days of the driver (before the
official reference was published) referred to record values as "data",
so some of my oldest code uses that name as well. This is inconsistent,
and not very clear because the word "data" shows up everywhere. Reword
all instances of this I can identify.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2024-12-04 21:04:23 -03:00
Ernesto A. Fernández 514a2164d8 Split orphan cleanup among multiple transactions
Gilgamesh reports via email that the following iozone benchmark is
failing:

  ./iozone -a -i 0 -r 1M -n 1g -g 1g -f /mnt/apfs/test

I've tried this myself on a small (3 GiB) filesystem, and it is the
commit after the unlink that fails, complaining about the main free
queue being full:

  APFS (800010g): free queue has too many nodes (348 > 347) (apfs_free_queue_insert_nocache:849)

The problem is that random writes to a big file create a lot of extents,
and when the file is deleted they all get dumped to the main free queue
as separate records. The queue gets full and the commit fails.

Stop forcing file deletion to happen in a single transaction. Instead,
set up a workqueue whenever an orphan inode gets evicted, and leave the
cleanup for another time. Also allow the cleanup to execute only
partially (deleting the first extents like the official driver) and
reschedule itself (again via inode eviction).
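
Roughly, the idea looks like this (a sketch only: the struct and helper
names are made up, and the direct requeue stands in for the real
rescheduling, which happens through inode eviction):

  #include <linux/workqueue.h>

  /* Sketch only: struct and helper names are hypothetical */
  struct apfs_orphan {
          struct work_struct work;
          /* ... enough state to find the orphan's extent records ... */
  };

  static void apfs_orphan_cleanup_work(struct work_struct *work)
  {
          struct apfs_orphan *orphan = container_of(work, struct apfs_orphan, work);

          /* One transaction: drop only the first few extents, like the
           * official driver, so the free queue can't overflow */
          if (apfs_delete_front_extents(orphan))
                  schedule_work(&orphan->work);   /* extents remain: go again */
  }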

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2024-03-22 21:20:27 -03:00
Ernesto A. Fernández 0313a0fee1 Rework handling of ephemeral objects
In pretty much all containers I've encountered so far, every object has
a single block. There is an annoying exception though: containers with a
size between ~6.7 TiB and ~7.8 TiB have a multiblock spaceman. I have no
simple way to deal with this because my transactions currently handle
ephemeral objects very roughly, ignoring all the cpm metadata and just
updating each block in the expected header location.

Instead of trying to hack this in somehow, start doing the right thing
as explained in the official reference and keep all ephemeral objects in
memory at all times. So, when the first transaction starts, go through
the cpm metadata and read each object along with its size and oid. When
a new object gets created (or an old one gets deleted), all changes
happen in memory. At commit time, checksums are updated and all objects
get written to disk, but nothing needs to be read.

So this is all much cleaner, and I took this opportunity to support
multiple cpm blocks as well.
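
For illustration, the per-object bookkeeping could look something like
this (a hypothetical struct, not the driver's actual layout):

  /* Hypothetical: one entry per ephemeral object, built on first transaction */
  struct apfs_ephemeral_object_info {
          u64 oid;        /* object id, read from the cpm metadata */
          u32 size;       /* in bytes: no longer assumed to be one block */
          void *object;   /* contents, kept in memory for the whole mount */
  };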

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2024-03-01 21:12:56 -03:00
Ernesto A. Fernández bcedfe1752 Read error before setting error pointer to null
I'm in a rush to get the tag out today and I'm making a mess. The
previous patch set an error pointer to null in apfs_node_split(), but I
didn't notice that the actual error would get read later. Reorder this.
I would usually just squash this into the previous patch, but in my rush
I had already pushed it.
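
In other words, the corrected ordering is roughly (arguments are
illustrative):

  child = apfs_create_node(sb, storage);
  if (IS_ERR(child)) {
          err = PTR_ERR(child);   /* read the error first... */
          child = NULL;           /* ...then clear the pointer for cleanup */
          goto fail;
  }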

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2024-01-07 19:10:40 -03:00
Ernesto A. Fernández 8b2287d91f Don't free error pointers in apfs_node_split()
We mistakenly free an error pointer if apfs_create_node() happens to
fail. Set the node to null before the cleanup, but I guess I should try
to avoid error pointers in the future.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2024-01-07 19:03:09 -03:00
Ernesto A. Fernández fe13bc9636 Don't refresh queries pointing to deleted nodes
Running generic/013 for a few minutes still leaves the filesystem in a
corrupted state, with a missing inode record. This is another regression
caused by my big rework of the node split logic. When a single-record
node is needed but the parent update fails with -EAGAIN,
apfs_create_single_rec_node() leaves the query pointing to the aborted
new node. The refresh does not change this, so the inode record ends up
abandoned in this ghost node.

In the end, all of this is caused by the weird way I originally designed
the query struct, which forces me to always keep a more or less valid
leaf query instead of just nuking the whole thing and starting over. I'd
like to change this eventually, but for now just adjust the cleanup for
apfs_create_single_rec_node().

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2024-01-06 01:48:44 -03:00
Ernesto A. Fernández e7983cb3fe Restart btree modifications after a node split
I'm encountering some rare failures in generic/075. The problem happens
when a node split is triggered by inserting a record into the first
position. The parent needs to get updated of course, but we've lost the
reference to it after the split, so the filesystem becomes corrupted.

The reality is that this stuff has always been too hacky, and new corner
cases keep popping up. Stop trying to guess which queries need to know
about a split and which might have forgotten their ancestors: just restart the whole btree
operation after any split, refreshing the query with a simple rerun.
This has to be the safest possible approach, and I don't expect it to
hurt performance, since node splits should never be a bottleneck.

Big changes are needed to get it to work though, because the b-tree must
be in a consistent state whenever the query gets refreshed. There are
two different considerations here. First, no split can take place unless
the parent has enough room to index the new child node; otherwise the
parent split would trigger a query refresh while half the child's
records are orphaned. To sum up, every cascade of node splits must
happen from root to leaf.

The second consideration is that, even if a node has enough room, no
updates related to the actual requested operation (the leaf record
insertion/deletion/replacement) can take place at any level until all
splits are done. A different policy would mean that a query refresh
might become necessary halfway through the operation, and so the leaf
record could be found to be unexpectedly missing, or covered by the
wrong index key.

So at a given level: the split, if needed, always comes first, followed
by the function call for parent updates, and only then the node gets
modified. After a split an EAGAIN error gets thrown, which then bubbles
back up to the leaf (either through the splits themselves, or through
the parent update calls), so only a single split can happen for the
whole b-tree on each refresh iteration - always at the highest possible
level in the split cascade that would have been triggered before this
patch. The btree, of course, remains valid on EAGAIN, and the leaf
records remain the same.

Inside apfs_node_split() itself, no changes to the old node can be done
until the new one is safely inserted in the b-tree. If that triggers a
split at a higher level, then the new node must be properly cleaned up
(making use of the recent changes to apfs_delete_node()) and, as usual,
EAGAIN must get bubbled up.

A similar cleanup needs to take place in the one other function that
creates non-root nodes: apfs_create_single_rec_node(). The new
single-record node must be repeatedly destroyed until the parent manages
to index it without any splits.
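
From the caller's point of view, the whole thing ends up looking
something like this (function names are only illustrative):

  again:
          err = apfs_btree_op(query);     /* insert/delete/replace a record */
          if (err == -EAGAIN) {
                  /* a single split ran; the tree is valid, so just rerun */
                  err = apfs_query_refresh(query);
                  if (!err)
                          goto again;
          }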

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2024-01-04 22:01:40 -03:00
Ernesto A. Fernández e872d3971a Allow NULL to be passed to apfs_node_free()
Allow NULL node pointers to be "freed" with apfs_node_free(). This will
simplify some cleanups soon.
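
That is, make it behave like kfree():

  void apfs_node_free(struct apfs_node *node)
  {
          if (!node)
                  return;         /* freeing NULL is now a harmless no-op */
          /* ... release the node as before ... */
  }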

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-29 23:48:30 -03:00
Ernesto A. Fernández eb6dd65019 Don't let node insert/replace fail with ENOSPC
Currently node insertions/replacements are attempted immediately and they
return -ENOSPC when there is no room, letting the caller know that the
node needs to get split. But in future patches I'll need to reorder the
operations so that the node only gets updated after all of its ancestors.

The previous patches already ensured that apfs_node_replace() will never
erroneously report lack of space, no matter how fragmented the node may
be. This was already true of apfs_node_insert(). So from now on we can
safely run space checks ahead of time with only the node metadata, use
that to decide if a split is needed, and only update the records after
all that is done.
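
Schematically, with a hypothetical helper for the space check
(signatures are illustrative):

  /* the check reads only the node metadata, no records get touched */
  if (!apfs_node_has_room(node, key_len, val_len))
          return apfs_node_split(query);  /* records updated after this */
  err = apfs_node_insert(query, key, key_len, val, val_len); /* no -ENOSPC */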

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-29 23:22:46 -03:00
Ernesto A. Fernández 0c8beaa186 Also defragment freed space in apfs_node_replace()
In the near future we want to make sure that apfs_node_replace() will
never fail as long as it has enough space. The defragmentation from the
previous patch is not enough for this, because it still leaves the
replaced record in place instead of moving those useless bytes into the
free area.

To solve this as simply as possible, set the table-of-contents entry for
the replaced record to zero length, for either the key or the value (or
both), depending on which we are replacing. The defragmentation code
will take care of the rest. Of course this weird toc entry remains in
place if the function fails, but it will be replaced eventually so it
should not matter.
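
Schematically, assuming the usual kvloc/nloc layout of toc entries in
variable-size trees (variable names here are made up):

  /* entry points to the toc entry of the record being replaced */
  if (key_replaced)
          entry->k.len = 0;       /* old key bytes are now dead weight */
  if (value_replaced)
          entry->v.len = 0;
  /* the defragmentation pass sweeps them into the free area */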

By the way, this hack cannot be used on trees with fixed key/value
sizes, because those sizes are simply not reported in the toc. Of course
records in such trees can generally be replaced in place, so they won't
need defragmentation. There is one exception though: the ghost records
of the free queue. Luckily we don't currently call replace on free queue
leaves, so we can safely ignore this problem. Leave an assert in place
so that I don't forget.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-29 22:04:16 -03:00
Ernesto A. Fernández 2adf86cba9 Defragment the node in apfs_node_replace()
Like we already do in apfs_node_insert(), attempt to defragment the node
before returning -ENOSPC in apfs_node_replace(). This shouldn't make
much of a difference right now (other than packing the records a bit
tighter together), but in future patches it will help us prevent
apfs_node_replace() from ever failing in normal operation.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-29 21:56:02 -03:00
Ernesto A. Fernández f883518a96 Move toc entry update into new function
Define a new apfs_node_update_toc_entry() to be used when replacing
records in a node. This makes apfs_node_replace() somewhat cleaner, but
more importantly it will allow that part to be reused in future patches.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-29 19:40:24 -03:00
Ernesto A. Fernández 4eb83167a9 Document some node functions that may throw ENOSPC
Improve the documentation of the record allocation functions to make it
clear that they may return -ENOSPC, because that case will often need to
be handled separately in the future.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-29 19:35:22 -03:00
Ernesto A. Fernández e053f3b2b5 Call apfs_delete_node() with a node, not a query
Some changes I'm working on will require being able to delete nodes that
have not yet been indexed in a btree. This cannot be done with the
current version of apfs_delete_node(), because it handles the parent
node modifications itself. So from now on, let the caller handle that if
it's necessary. This is also more symmetrical with apfs_create_node(),
as would be expected.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-29 18:28:01 -03:00
Ernesto A. Fernández 196ac31122 Don't allow a stale offset for ghost records
When apfs_node_locate_data() gets called for a ghost record (that is, a
record without a value), the function correctly returns a length of
zero, but the offset is left unset, so a stale number may remain in the
query structure. This does not matter, of course, because a zero-length
value will never be read, but I've found it very distracting while
debugging a separate issue with query offsets. Just set it to zero.

To be consistent, do the same in apfs_node_insert(), though I haven't
been bothered by this one yet.
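
Roughly (the field names are illustrative):

  /* Sketch: locating the value of a ghost record */
  if (ghost) {
          query->off = 0; /* no value: don't leave a stale offset behind */
          query->len = 0;
          return 0;
  }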

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-28 19:48:25 -03:00
Ernesto A. Fernández f1775ed0ef Turn the search key into a query field
I've been working on making some very needed changes to how node splits
work. The plan is to always rerun queries after a split, to keep the
query chain valid. One problem I've encountered while implementing this
is that the search key is not really preserved by apfs_update_inode().
We have a pointer to it, but it was allocated on the stack of some other
function, so the data is just noise.

I've always thought it was silly to allocate both a query and a key
before a lookup. Instead, turn the search key into a field within the
query structure, and copy it to the children as the query gets run. This
will ensure that the key remains available for future reruns.
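
The struct change itself is simple (the layout shown here is
illustrative):

  struct apfs_query {
          struct apfs_node *node;         /* node being searched */
          struct apfs_key key;            /* was: struct apfs_key *key */
          /* ... */
  };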

One thing to note is that, while this simplifies most callers a little
bit, some of them still allocate a key on the stack and later copy it to
the query. I didn't want to make any big changes and risk introducing
bugs at this point, since I'm busy with other stuff.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-20 20:33:03 -03:00
Ernesto A. Fernández 4dab24ef6b Stop triggering kernel panics for silly reasons
I was reading the list of deprecated interfaces at [0] and it mentions
that the use of BUG() and BUG_ON() is discouraged in the kernel, and
that we should be prepared to fail gracefully even for impossible
situations.

I definitely agree with this: so far I've used BUG_ON() to implement
ASSERT(), but in practice I always turn that into a WARN_ON() while
debugging. Just leave it as WARN_ON() permanently.
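
In other words, something like:

  #define ASSERT(expr)    WARN_ON(!(expr))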

I also use BUG() in apfs_query_storage() to mark an impossible query
tree flag. I can't imagine how this would ever be reached, but still
keep going and return an invalid storage type. This way the error can
get reported later, after calling either apfs_create_node() or
apfs_read_node(), which have been modified to fail gracefully and
complain with apfs_alert() in this situation. I also considered putting
an assertion in apfs_query_storage(), but that would require reordering
apfs.h and it seems unnecessarily messy.

Warnings should not be abused either, because of the panic_on_warn
option. Replace our one call to WARN_ON(), inside apfs_btree_replace(),
with another apfs_alert(). This function was already failing gracefully.

[0]: https://docs.kernel.org/process/deprecated.html#bug-and-bug-on

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-10-27 15:44:47 -03:00
Ernesto A. Fernández 2d6cdba8f6 Fix trivial style issues
Some code that wasn't originally written by me has a style that differs
slightly from the rest of the module. Since I've been refactoring
compress.c quite a lot lately, the inconsistencies have become more
visible, so I think it's a good time to switch everything to the typical
kernel style.

For this purpose I'm running checkpatch.pl for the first time in years,
and it has detected a handful of other issues that I introduced myself.
Fix most of the issues now, but I prefer to leave the zero-length array
stuff for a separate patch.

For the record, at the moment I'm choosing to ignore style inconsistency
in the compression libraries, even in libzbitmap which I wrote myself. I
prefer to keep them as close to upstream as possible.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-10-26 21:44:30 -03:00
Ernesto A. Fernández c54e3a7423 Silence b_blocknr width warnings on old kernels
While fixing the build for old kernels, I noticed that in some of them
you get a warning when you printk a buffer head's b_blocknr as 0x%llx.
I haven't checked but I'm guessing that, at some point, sector_t on
64-bit archs went from being 'unsigned long' to 'unsigned long long',
so these warnings went away. Anyway, just cast b_blocknr before printing
it in all kernels.
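
That is, schematically:

  /* harmless where sector_t is already unsigned long long */
  pr_debug("block 0x%llx\n", (unsigned long long)bh->b_blocknr);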

At some point I do need to attempt a 32-bit build though. It's been a
long time, and I'm sure there will be much worse problems than this one.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-05-19 18:49:38 -03:00
Ernesto A. Fernández 8b99af84d1 Try to avoid failing insertion on fragmented nodes
I'm getting somewhat regular failures on generic/112 because the ip free
queue for the test image has a one-node limit, and it sometimes gets
full. The thing is, the node actually has plenty of room, but it's all
in the key free list, so I can't allocate values. From now on, if an
insertion fails with ENOSPC, check the available room and defragment
the node if it should be enough. Then retry the insertion instead of
jumping straight to a split.
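
Schematically (the helper names are made up):

  err = apfs_node_insert(query, key, key_len, val, val_len);
  if (err == -ENOSPC && apfs_node_total_room(node) >= key_len + val_len) {
          apfs_defragment_node(node);     /* reclaim the key free list */
          err = apfs_node_insert(query, key, key_len, val, val_len);
  }
  /* only if this still fails does the node get split */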

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-04-17 20:05:26 -03:00
Ernesto A. Fernández d4797823ea Improve error reporting
The driver is much closer to being usable, so I might start getting
subtler bug reports soon. To make them easier to handle, put error
messages all over the place. I should have done this from the beginning,
but I guess I didn't fully understand the need back then.

From now on, my general policy will be to use apfs_warn() for user errors or
unsupported features; apfs_err() for things that are probably corruption
or io errors; and apfs_alert() for things that are most likely bugs.
These last two should be rare, so the same error/alert will be thrown by
several layers in the callstack to provide as much information as
possible. Be careful and don't flood the console in normal situations.

Also, make messages with a log level lower than warning output their
function name and line number, which I think will help debugging more
than the actual messages.
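
The helpers could end up looking roughly like this (apfs_msg() as
sketched here is an assumption, not the exact implementation):

  /* err/alert carry __func__ and __LINE__; warnings stay terse */
  #define apfs_err(sb, fmt, ...) \
          apfs_msg(sb, KERN_ERR, __func__, __LINE__, fmt, ##__VA_ARGS__)
  #define apfs_alert(sb, fmt, ...) \
          apfs_msg(sb, KERN_ALERT, __func__, __LINE__, fmt, ##__VA_ARGS__)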

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-04-05 21:59:24 -03:00
Ernesto A. Fernández 9c39466269 Make updates to blocks_freed and blocks_alloced
The apfs_total_blocks_freed and apfs_total_blocks_alloced fields of the
volume superblock keep track of the total number of blocks that have
been freed/alloced through the whole lifetime of the volume so far.
Since the value of these fields depends on historical operations, it
can't be calculated from the volume contents, so I figured I could just
ignore them without consequences.

This seems to be mostly correct, but the official fsck does complain if
the freed blocks appear to be more than the alloced blocks. It's just a
warning, not an error, but it does often trigger after the official
driver makes changes to one of my images. So, try to be careful with
these two fields from now on.
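
Keeping them updated is cheap: at commit time it comes down to something
like this (the local counter names are made up):

  le64_add_cpu(&vsb_raw->apfs_total_blocks_alloced, blocks_alloced_now);
  le64_add_cpu(&vsb_raw->apfs_total_blocks_freed, blocks_freed_now);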

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-03-06 22:05:29 -03:00
Ernesto A. Fernández 7a5d4085ea Fix apfs_node_min_table_size() for non-leaf nodes
In the previous patch I forgot to consider that index nodes have
different value sizes than leaf nodes, even for fixed key/value size
trees. Fix it.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-02-22 17:30:45 -03:00
Ernesto A. Fernández d29943de45 Fix apfs_node_min_table_size() for new tree types
When I started working with OMAP_SNAPSHOT trees, I forgot to add support
for them inside apfs_node_min_table_size(). The result is that the index
of each of those trees is initialized to a tiny minimum size, as if the
keys and values were of variable length. This went unnoticed for a while
because of other bugs in my fsck, which I've recently fixed in
5a20a212d8f0 ("apfsck: fix checks of index size for fixed kv size"). The
official fsck always complained though; I just wasn't using it:

  ** Checking the snapshot metadata.
  error: btn: invalid btn_table_space (0, 64), given btn_flags (0x7)

So, implement apfs_node_min_table_size() properly for the OMAP_SNAPSHOT
tree type. For consistency with the fsck, also implement the FEXT tree
type, but I don't actually support writes to sealed volumes in the
driver at the moment.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-02-21 22:55:31 -03:00
Ernesto A. Fernández 5cce4b3818 Fix cknodes mount option on readwrite mode
Luflosi reported a long time ago that the cknodes mount option does
not work well on readwrite mounts:

  https://github.com/linux-apfs/linux-apfs-rw/issues/15

I don't care much for the cknodes option, and I don't know if anybody is
actually using it, but I shouldn't have ignored a simple bug like this
for such a long time.

The problem here is that there are several places in the code where an
object will get its checksum verified after getting changed, but before
the transaction is committed and the checksum updated. As a solution,
refuse to verify buffers that are already part of a transaction. Try to
make sure that a check happens before the join.
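
Schematically (the buffer flag and helper names are made up):

  static bool apfs_node_checksum_ok(struct super_block *sb,
                                    struct buffer_head *bh)
  {
          if (buffer_trans(bh))   /* already joined the transaction */
                  return true;    /* checksum is stale by design: skip it */
          return apfs_verify_checksum(sb, bh);
  }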

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-01-30 20:43:16 -03:00