58 Commits

Ernesto A. Fernández 6fcc69c05d Don't refer to record values as "data"
The paper I followed in the early days of the driver (before the
official reference was published) referred to record values as "data",
so some of my oldest code uses that name as well. This is inconsistent,
and not very clear because the word "data" shows up everywhere. Reword
all instances of this I can identify.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2024-12-04 21:04:23 -03:00
Ernesto A. Fernández 514a2164d8 Split orphan cleanup among multiple transactions
Gilgamesh reports via email that the following iozone benchmark is
failing:

  ./iozone -a -i 0 -r 1M -n 1g -g 1g -f /mnt/apfs/test

I've tried this myself on a small (3 GiB) filesystem, and it is the
commit after the unlink that fails, complaining about the main free
queue being full:

  APFS (800010g): free queue has too many nodes (348 > 347) (apfs_free_queue_insert_nocache:849)

The problem is that random writes to a big file create a lot of extents,
and when the file is deleted they all get dumped to the main free queue
as separate records. The queue gets full and the commit fails.

Stop forcing file deletion to happen in a single transaction. Instead,
set up a workqueue whenever an orphan inode gets evicted, and leave the
cleanup for another time. Also allow the cleanup to execute only
partially (deleting the first extents like the official driver) and
reschedule itself (again via inode eviction).
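
Roughly, the idea looks like this (a sketch only: the struct and helper
names are made up, and the direct requeue stands in for the real
rescheduling, which happens through inode eviction):

  #include <linux/workqueue.h>

  /* Sketch only: struct and helper names are hypothetical */
  struct apfs_orphan {
          struct work_struct work;
          /* ... enough state to find the orphan's extent records ... */
  };

  static void apfs_orphan_cleanup_work(struct work_struct *work)
  {
          struct apfs_orphan *orphan = container_of(work, struct apfs_orphan, work);

          /* One transaction: drop only the first few extents, like the
           * official driver, so the free queue can't overflow */
          if (apfs_delete_front_extents(orphan))
                  schedule_work(&orphan->work);   /* extents remain: go again */
  }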

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2024-03-22 21:20:27 -03:00
Ernesto A. Fernández 0313a0fee1 Rework handling of ephemeral objects
In pretty much all containers I've encountered so far, every object has
a single block. There is an annoying exception though: containers with a
size between ~6.7 TiB and ~7.8 TiB have a multiblock spaceman. I have no
simple way to deal with this because my transactions currently handle
ephemeral objects very roughly, ignoring all the cpm metadata and just
updating each block in the expected header location.

Instead of trying to hack this in somehow, start doing the right thing
as explained in the official reference and keep all ephemeral objects in
memory at all times. So, when the first transaction starts, go through
the cpm metadata and read each object along with its size and oid. When
a new object gets created (or an old one gets deleted), all changes
happen in memory. At commit time, checksums are updated and all objects
get written to disk, but nothing needs to be read.

So this is all much cleaner, and I took this opportunity to support
multiple cpm blocks as well.
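
For illustration, the per-object bookkeeping could look something like
this (a hypothetical struct, not the driver's actual layout):

  /* Hypothetical: one entry per ephemeral object, built on first transaction */
  struct apfs_ephemeral_object_info {
          u64 oid;        /* object id, read from the cpm metadata */
          u32 size;       /* in bytes: no longer assumed to be one block */
          void *object;   /* contents, kept in memory for the whole mount */
  };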

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2024-03-01 21:12:56 -03:00
Ernesto A. Fernández bcedfe1752 Read error before setting error pointer to null
I'm in a rush to get the tag out today and I'm making a mess. The
previous patch set an error pointer to null in apfs_node_split(), but I
didn't notice that the actual error would get read later. Reorder this.
I would usually just squash this into the previous patch, but in my rush
I had already pushed it.
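
In other words, the corrected ordering is roughly (arguments are
illustrative):

  child = apfs_create_node(sb, storage);
  if (IS_ERR(child)) {
          err = PTR_ERR(child);   /* read the error first... */
          child = NULL;           /* ...then clear the pointer for cleanup */
          goto fail;
  }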

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2024-01-07 19:10:40 -03:00
Ernesto A. Fernández 8b2287d91f Don't free error pointers in apfs_node_split()
We mistakenly free an error pointer if apfs_create_node() happens to
fail. Set the node to null before the cleanup, but I guess I should try
to avoid error pointers in the future.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2024-01-07 19:03:09 -03:00
Ernesto A. Fernández fe13bc9636 Don't refresh queries pointing to deleted nodes
Running generic/013 for a few minutes still leaves the filesystem in a
corrupted state, with a missing inode record. This is another regression
caused by my big rework of the node split logic. When a single-record
node is needed but the parent update fails with -EAGAIN,
apfs_create_single_rec_node() leaves the query pointing to the aborted
new node. The refresh does not change this, so the inode record ends up
abandoned in this ghost node.

In the end, all of this is caused by the weird way I originally designed
the query struct, which forces me to always keep a more or less valid
leaf query instead of just nuking the whole thing and starting over. I'd
like to change this eventually, but for now just adjust the cleanup for
apfs_create_single_rec_node().

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2024-01-06 01:48:44 -03:00
Ernesto A. Fernández e7983cb3fe Restart btree modifications after a node split
I'm encountering some rare failures in generic/075. The problem happens
when a node split is triggered by inserting a record into the first
position. The parent needs to get updated of course, but we've lost the
reference to it after the split, so the filesystem becomes corrupted.

The reality is that this stuff has always been too hacky, and new corner
cases keep popping up. Stop trying to guess which queries need to know
about a split and which might have forgotten their ancestors: just restart the whole btree
operation after any split, refreshing the query with a simple rerun.
This has to be the safest possible approach, and I don't expect it to
hurt performance, since node splits should never be a bottleneck.

Big changes are needed to get it to work though, because the b-tree must
be in a consistent state whenever the query gets refreshed. There are
two different considerations here. First, no split can take place unless
the parent has enough room to index the new child node; otherwise the
parent split would trigger a query refresh while half the child's
records are orphaned. To sum up, every cascade of node splits must
happen from root to leaf.

The second consideration is that, even if a node has enough room, no
updates related to the actual requested operation (the leaf record
insertion/deletion/replacement) can take place at any level until all
splits are done. A different policy would mean that a query refresh
might become necessary halfway through the operation, and so the leaf
record could be found to be unexpectedly missing, or covered by the
wrong index key.

So at a given level: the split, if needed, always comes first, followed
by the function call for parent updates, and only then the node gets
modified. After a split an EAGAIN error gets thrown, which then bubbles
back up to the leaf (either through the splits themselves, or through
the parent update calls), so only a single split can happen for the
whole b-tree on each refresh iteration - always at the highest possible
level in the split cascade that would have been triggered before this
patch. The btree, of course, remains valid on EAGAIN, and the leaf
records remain the same.

Inside apfs_node_split() itself, no changes to the old node can be done
until the new one is safely inserted in the b-tree. If that triggers a
split at a higher level, then the new node must be properly cleaned up
(making use of the recent changes to apfs_delete_node()) and, as usual,
EAGAIN must get bubbled up.

A similar cleanup needs to take place in the one other function that
creates non-root nodes: apfs_create_single_rec_node(). The new
single-record node must be repeatedly destroyed until the parent manages
to index it without any splits.
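
From the caller's point of view, the whole thing ends up looking
something like this (function names are only illustrative):

  again:
          err = apfs_btree_op(query);     /* insert/delete/replace a record */
          if (err == -EAGAIN) {
                  /* a single split ran; the tree is valid, so just rerun */
                  err = apfs_query_refresh(query);
                  if (!err)
                          goto again;
          }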

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2024-01-04 22:01:40 -03:00
Ernesto A. Fernández e872d3971a Allow NULL to be passed to apfs_node_free()
Allow NULL node pointers to be "freed" with apfs_node_free(). This will
simplify some cleanups soon.
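
That is, make it behave like kfree():

  void apfs_node_free(struct apfs_node *node)
  {
          if (!node)
                  return;         /* freeing NULL is now a harmless no-op */
          /* ... release the node as before ... */
  }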

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-29 23:48:30 -03:00
Ernesto A. Fernández eb6dd65019 Don't let node insert/replace fail with ENOSPC
Currently node insertions/replacements are attempted immediately and they
return -ENOSPC when there is no room, letting the caller know that the
node needs to get split. But in future patches I'll need to reorder the
operations so that the node only gets updated after all of its ancestors.

The previous patches already ensured that apfs_node_replace() will never
erroneously report lack of space, no matter how fragmented the node may
be. This was already true of apfs_node_insert(). So from now on we can
safely run space checks ahead of time with only the node metadata, use
that to decide if a split is needed, and only update the records after
all that is done.
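
Schematically, with a hypothetical helper for the space check
(signatures are illustrative):

  /* the check reads only the node metadata, no records get touched */
  if (!apfs_node_has_room(node, key_len, val_len))
          return apfs_node_split(query);  /* records updated after this */
  err = apfs_node_insert(query, key, key_len, val, val_len); /* no -ENOSPC */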

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-29 23:22:46 -03:00
Ernesto A. Fernández 0c8beaa186 Also defragment freed space in apfs_node_replace()
In the near future we want to make sure that apfs_node_replace() will
never fail as long as it has enough space. The defragmentation from the
previous patch is not enough for this, because it still leaves the
replaced record in place instead of moving those useless bytes into the
free area.

To solve this as simply as possible, set the table-of-contents entry for
the replaced record to zero length, for either the key or the value (or
both), depending on which we are replacing. The defragmentation code
will take care of the rest. Of course this weird toc entry remains in
place if the function fails, but it will be replaced eventually so it
should not matter.
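
Schematically, assuming the usual kvloc/nloc layout of toc entries in
variable-size trees (variable names here are made up):

  /* entry points to the toc entry of the record being replaced */
  if (key_replaced)
          entry->k.len = 0;       /* old key bytes are now dead weight */
  if (value_replaced)
          entry->v.len = 0;
  /* the defragmentation pass sweeps them into the free area */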

By the way, this hack cannot be used on trees with fixed key/value
sizes, because those sizes are simply not reported in the toc. Of course
records in such trees can generally be replaced in place, so they won't
need defragmentation. There is one exception though: the ghost records
of the free queue. Luckily we don't currently call replace on free queue
leaves, so we can safely ignore this problem. Leave an assert in place
so that I don't forget.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-29 22:04:16 -03:00
Ernesto A. Fernández 2adf86cba9 Defragment the node in apfs_node_replace()
Like we already do in apfs_node_insert(), attempt to defragment the node
before returning -ENOSPC in apfs_node_replace(). This shouldn't make
much of a difference right now (other than packing the records a bit
tighter together), but in future patches it will help us prevent
apfs_node_replace() from ever failing in normal operation.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-29 21:56:02 -03:00
Ernesto A. Fernández f883518a96 Move toc entry update into new function
Define a new apfs_node_update_toc_entry() to be used when replacing
records in a node. This makes apfs_node_replace() somewhat cleaner, but
more importantly it will allow that part to be reused in future patches.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-29 19:40:24 -03:00
Ernesto A. Fernández 4eb83167a9 Document some node functions that may throw ENOSPC
Improve the documentation of the record allocation functions to make it
clear that they may return -ENOSPC, because that case will often need to
be handled separately in the future.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-29 19:35:22 -03:00
Ernesto A. Fernández e053f3b2b5 Call apfs_delete_node() with a node, not a query
Some changes I'm working on will require being able to delete nodes that
have not yet been indexed in a btree. This cannot be done with the
current version of apfs_delete_node(), because it handles the parent
node modifications itself. So from now on, let the caller handle that if
it's necessary. This is also more symmetrical with apfs_create_node(),
as would be expected.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-29 18:28:01 -03:00
Ernesto A. Fernández 196ac31122 Don't allow a stale offset for ghost records
When apfs_node_locate_data() gets called for a ghost record (that is, a
record without a value), the function correctly returns a length of
zero, but the offset is left unset, so a stale number may remain in the
query structure. This does not matter, of course, because a zero-length
value will never be read, but I've found it very distracting while
debugging a separate issue with query offsets. Just set it to zero.

To be consistent, do the same in apfs_node_insert(), though I haven't
been bothered by this one yet.
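
Roughly (the field names are illustrative):

  /* Sketch: locating the value of a ghost record */
  if (ghost) {
          query->off = 0; /* no value: don't leave a stale offset behind */
          query->len = 0;
          return 0;
  }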

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-28 19:48:25 -03:00
Ernesto A. Fernández f1775ed0ef Turn the search key into a query field
I've been working on making some very needed changes to how node splits
work. The plan is to always rerun queries after a split, to keep the
query chain valid. One problem I've encountered while implementing this
is that the search key is not really preserved by apfs_update_inode().
We have a pointer to it, but it was allocated on the stack of some other
function, so the data is just noise.

I've always thought it was silly to allocate both a query and a key
before a lookup. Instead, turn the search key into a field within the
query structure, and copy it to the children as the query gets run. This
will ensure that the key remains available for future reruns.
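
The struct change itself is simple (the layout shown here is
illustrative):

  struct apfs_query {
          struct apfs_node *node;         /* node being searched */
          struct apfs_key key;            /* was: struct apfs_key *key */
          /* ... */
  };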

One thing to note is that, while this simplifies most callers a little
bit, some of them still allocate a key on the stack and later copy it to
the query. I didn't want to make any big changes and risk introducing
bugs at this point, since I'm busy with other stuff.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-12-20 20:33:03 -03:00
Ernesto A. Fernández 4dab24ef6b Stop triggering kernel panics for silly reasons
I was reading the list of deprecated interfaces at [0] and it mentions
that the use of BUG() and BUG_ON() is discouraged in the kernel, and
that we should be prepared to fail gracefully even for impossible
situations.

I definitely agree with this: so far I've used BUG_ON() to implement
ASSERT(), but in practice I always turn that into a WARN_ON() while
debugging. Just leave it as WARN_ON() permanently.
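
In other words, something like:

  #define ASSERT(expr)    WARN_ON(!(expr))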

I also use BUG() in apfs_query_storage() to mark an impossible query
tree flag. I can't imagine how this would ever be reached, but still
keep going and return an invalid storage type. This way the error can
get reported later, after calling either apfs_create_node() or
apfs_read_node(), which have been modified to fail gracefully and
complain with apfs_alert() in this situation. I also considered putting
an assertion in apfs_query_storage(), but that would require reordering
apfs.h and it seems unnecessarily messy.

Warnings should not be abused either, because of the panic_on_warn
option. Replace our one call to WARN_ON(), inside apfs_btree_replace(),
with another apfs_alert(). This function was already failing gracefully.

[0]: https://docs.kernel.org/process/deprecated.html#bug-and-bug-on

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-10-27 15:44:47 -03:00
Ernesto A. Fernández 2d6cdba8f6 Fix trivial style issues
Some code that wasn't originally written by me has a style that differs
slightly from the rest of the module. Since I've been refactoring
compress.c quite a lot lately, the inconsistencies have become more
visible, so I think it's a good time to switch everything to the typical
kernel style.

For this purpose I'm running checkpatch.pl for the first time in years,
and it has detected a handful of other issues that I introduced myself.
Fix most of the issues now, but I prefer to leave the zero-length array
stuff for a separate patch.

For the record, at the moment I'm choosing to ignore style inconsistency
in the compression libraries, even in libzbitmap which I wrote myself. I
prefer to keep them as close to upstream as possible.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-10-26 21:44:30 -03:00
Ernesto A. Fernández c54e3a7423 Silence b_blocknr width warnings on old kernels
While fixing the build for old kernels, I noticed that in some of them
you get a warning when you printk a buffer head's b_blocknr as 0x%llx.
I haven't checked but I'm guessing that, at some point, sector_t on
64-bit archs went from being 'unsigned long' to 'unsigned long long',
so these warnings went away. Anyway, just cast b_blocknr before printing
it in all kernels.
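
That is, schematically:

  /* harmless where sector_t is already unsigned long long */
  pr_debug("block 0x%llx\n", (unsigned long long)bh->b_blocknr);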

At some point I do need to attempt a 32-bit build though. It's been a
long time, and I'm sure there will be much worse problems than this one.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-05-19 18:49:38 -03:00
Ernesto A. Fernández 8b99af84d1 Try to avoid failing insertion on fragmented nodes
I'm getting somewhat regular failures on generic/112 because the ip free
queue for the test image has a one-node limit, and it sometimes gets
full. The thing is, the node actually has plenty of room, but it's all
in the key free list, so I can't allocate values. From now on, if an
insertion fails with ENOSPC, check the available room and defragment
the node if it should be enough. Then retry the insertion instead of
jumping straight to a split.
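
Schematically (the helper names are made up):

  err = apfs_node_insert(query, key, key_len, val, val_len);
  if (err == -ENOSPC && apfs_node_total_room(node) >= key_len + val_len) {
          apfs_defragment_node(node);     /* reclaim the key free list */
          err = apfs_node_insert(query, key, key_len, val, val_len);
  }
  /* only if this still fails does the node get split */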

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-04-17 20:05:26 -03:00
Ernesto A. Fernández d4797823ea Improve error reporting
The driver is much closer to being usable, so I might start getting
subtler bug reports soon. To make them easier to handle, put error
messages all over the place. I should have done this from the beginning,
but I guess I didn't fully understand the need back then.

From now on, my general policy will be to use apfs_warn() for user errors or
unsupported features; apfs_err() for things that are probably corruption
or io errors; and apfs_alert() for things that are most likely bugs.
These last two should be rare, so the same error/alert will be thrown by
several layers in the callstack to provide as much information as
possible. Be careful and don't flood the console in normal situations.

Also, make messages with a log level lower than warning output their
function name and line number, which I think will help debugging more
than the actual messages.
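
The helpers could end up looking roughly like this (apfs_msg() as
sketched here is an assumption, not the exact implementation):

  /* err/alert carry __func__ and __LINE__; warnings stay terse */
  #define apfs_err(sb, fmt, ...) \
          apfs_msg(sb, KERN_ERR, __func__, __LINE__, fmt, ##__VA_ARGS__)
  #define apfs_alert(sb, fmt, ...) \
          apfs_msg(sb, KERN_ALERT, __func__, __LINE__, fmt, ##__VA_ARGS__)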

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-04-05 21:59:24 -03:00
Ernesto A. Fernández 9c39466269 Make updates to blocks_freed and blocks_alloced
The apfs_total_blocks_freed and apfs_total_blocks_alloced fields of the
volume superblock keep track of the total number of blocks that have
been freed/alloced through the whole lifetime of the volume so far.
Since the value of these fields depends on historical operations, it
can't be calculated from the volume contents, so I figured I could just
ignore them without consequences.

This seems to be mostly correct, but the official fsck does complain if
the freed blocks appear to be more than the alloced blocks. It's just a
warning, not an error, but it does often trigger after the official
driver makes changes to one of my images. So, try to be careful with
these two fields from now on.
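
Keeping them updated is cheap: at commit time it comes down to something
like this (the local counter names are made up):

  le64_add_cpu(&vsb_raw->apfs_total_blocks_alloced, blocks_alloced_now);
  le64_add_cpu(&vsb_raw->apfs_total_blocks_freed, blocks_freed_now);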

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-03-06 22:05:29 -03:00
Ernesto A. Fernández 7a5d4085ea Fix apfs_node_min_table_size() for non-leaf nodes
In the previous patch I forgot to consider that index nodes have
different value sizes than leaf nodes, even for fixed key/value size
trees. Fix it.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-02-22 17:30:45 -03:00
Ernesto A. Fernández d29943de45 Fix apfs_node_min_table_size() for new tree types
When I started working with OMAP_SNAPSHOT trees, I forgot to add support
for them inside apfs_node_min_table_size(). The result is that the index
of each of those trees is initialized to a tiny minimum size, as if the
keys and values were of variable length. This went unnoticed for a while
because of other bugs in my fsck, which I've recently fixed in
5a20a212d8f0 ("apfsck: fix checks of index size for fixed kv size"). The
official fsck always complained though; I just wasn't using it:

  ** Checking the snapshot metadata.
  error: btn: invalid btn_table_space (0, 64), given btn_flags (0x7)

So, implement apfs_node_min_table_size() properly for the OMAP_SNAPSHOT
tree type. For consistency with the fsck, also implement the FEXT tree
type, but I don't actually support writes to sealed volumes in the
driver at the moment.

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-02-21 22:55:31 -03:00
Ernesto A. Fernández 5cce4b3818 Fix cknodes mount option on readwrite mode
Luflosi reported a long time ago that the cknodes mount option does
not work well on readwrite mounts:

  https://github.com/linux-apfs/linux-apfs-rw/issues/15

I don't care much for the cknodes option, and I don't know if anybody is
actually using it, but I shouldn't have ignored a simple bug like this
for such a long time.

The problem here is that there are several places in the code where an
object will get its checksum verified after getting changed, but before
the transaction is committed and the checksum updated. As a solution,
refuse to verify buffers that are already part of a transaction. Try to
make sure that a check happens before the join.
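
Schematically (the buffer flag and helper names are made up):

  static bool apfs_node_checksum_ok(struct super_block *sb,
                                    struct buffer_head *bh)
  {
          if (buffer_trans(bh))   /* already joined the transaction */
                  return true;    /* checksum is stale by design: skip it */
          return apfs_verify_checksum(sb, bh);
  }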

Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
2023-01-30 20:43:16 -03:00