Update the offset of the current query's record key if
apfs_node_insert() expands the table of contents. Otherwise, the
function may still find later on that there's no room for the new
record and return ENOSPC, leaving behind an inconsistent query. A
caller that attempts to refresh such a query would run into trouble.
This bug was discovered with generic/013 of xfstests.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
The header for apfs_node_split() claims that the query parent will
always be set to NULL on success, but this isn't done correctly if the
node is only defragmented. This is a problem because
apfs_btree_insert() relies on that behaviour to decide whether a split
has been attempted yet, so the current error will make it loop
forever.
So fix apfs_node_split() and make it act as documented.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
I never implemented a way to deal with huge records that take over a
whole node because I thought such a thing wouldn't happen unless xattrs
were used. I was wrong: generic/013 from xfstests regularly aborts
transactions when it tries to create a large inode record right next to
a large (symlink) xattr record.
The difficulty here is that my code always inserts a new record right
after another, in the same node. The node is split and defragmented if
full, but the two records will always be consecutive. If they overflow
the node by themselves, nothing can be done.
So, implement a new insertion mechanism for this special case: just
create a whole new node and put the new record in it. This is probably
bad in terms of disk usage, but it should be rare outside of xfstests.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
After a node gets split, its query will no longer have the parent
field set. If the record we intend to insert is still too big to fit
in the resulting node half, a second split will be triggered, and the
lack of a parent in which to update the index records will result in a
null pointer dereference.
To prevent this, check if a query's parent is gone before a split and,
if that's the case, refresh it. To be safe, also add a check for the
parent inside apfs_node_split().
This is all necessary to fix crashes in generic/013 of xfstests, but it
doesn't fully solve the problems with splits yet. When inserting a
sufficiently large inline xattr, it's possible that a defragmented node
with a single record will still be unable to receive it. Add a check for
this and return -EOPNOTSUPP, to prevent an infinite loop; but I will
need to deal with it for real at some point.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
When a new record key or value is allocated inside the slot left behind
by a larger one that was removed, the free space count gets corrupted.
The problem is that a new entry is added for the room remaining in the
slot, which increases the free space count even though that space was
already counted. Fix this.
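The fix can be illustrated with a toy model in plain C (the structure
names and sizes here are made up for illustration; they are not the
module's actual definitions):

```c
#include <assert.h>

/* Toy model of a node's free list: holes left behind by removed
 * records, plus a running total of free bytes. */
struct free_entry { int off; int len; };

struct node_info {
	struct free_entry list[16];
	int count;	/* entries in the free list */
	int free_bytes;	/* total free space in the node */
};

/*
 * Allocate @len bytes from the first hole that is big enough.  The
 * point of the fix: when the hole is larger than needed, shrink the
 * existing entry in place instead of adding a new entry for the
 * remainder, and subtract only @len from the free space count; the
 * remainder was already counted when the hole was first created.
 */
static int node_alloc_from_free_list(struct node_info *node, int len)
{
	int i, off;

	for (i = 0; i < node->count; i++) {
		if (node->list[i].len < len)
			continue;
		off = node->list[i].off;
		node->list[i].off += len;
		node->list[i].len -= len;
		if (node->list[i].len == 0)
			node->list[i] = node->list[--node->count];
		node->free_bytes -= len;
		return off;
	}
	return -1; /* no hole is big enough */
}
```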
After the previous commit ff6e5be159 ("Avoid splitting nodes with a
single record"), this bug started causing failures in generic/001 of
xfstests. This appears to be just a coincidence and not a problem with
that patch.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Generic/011 in xfstests causes a lot of internal fragmentation in the
catalog records. This eventually creates nodes that have a single record
but no free space at all. Splits of this node may be requested, but the
current code doesn't support them (there's even a TODO about it),
because dealing with the resulting temporarily empty nodes would be very
tricky.
For now, avoid the split in these situations and just defragment the
original node. Sadly, this approach won't be enough to deal with the
potentially huge records of inline xattrs: a node holding just one of
those may be legitimately full.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Add support for extent reference tree queries to apfs_delete_node().
The nodes are physical just like those of the object map, so it's a
simple change. It's needed immediately because generic/001 from xfstests
exercises this code path.
One thing to note is that extent reference entries are usually deleted
when an orphan inode gets cleaned up, so the EOPNOTSUPP error code from
this function doesn't reach userland. This made the source of the
problem less obvious, so from now on print an error message to the
console as well.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Attempts to truncate large files are currently failing. The problem is
that a whole new free queue entry gets added for each block in the file,
and the free queue soon runs into the limits of having a single
checkpoint-mapping block. CPM creation will need to be implemented
eventually, but for now just be less wasteful and allow each entry to
cover a range.
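A minimal sketch of the range idea, with made-up structure names and a
fixed-size queue for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Each queue entry covers a range of blocks instead of a single one. */
struct fq_entry { uint64_t bno; uint64_t count; };

struct free_queue {
	struct fq_entry entries[64];
	int used;
};

/*
 * Queue a freed block.  Truncation frees a file's blocks in order, so
 * most calls just extend the last entry's range instead of consuming
 * a new entry.  Returns 0 on success, -1 if the queue is full.
 */
static int fq_free_block(struct free_queue *fq, uint64_t bno)
{
	if (fq->used > 0) {
		struct fq_entry *last = &fq->entries[fq->used - 1];

		if (bno == last->bno + last->count) {
			last->count++; /* contiguous: extend the range */
			return 0;
		}
	}
	if (fq->used == 64)
		return -1;
	fq->entries[fq->used].bno = bno;
	fq->entries[fq->used].count = 1;
	fq->used++;
	return 0;
}
```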
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Never call apfs_delete_node() for a root node. This is clearly
documented, but I somehow made a mess of it anyway, and it will become a
problem as soon as we start deleting physical extents.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Some compilers complain that bno and oid may be used uninitialized in
apfs_create_node(). This isn't true in practice, because the code path
that triggers the warning will never be run if the function was called
correctly; this is documented by an "ASSERT(false)".
Deal with it anyway, both to avoid the noise during compilation, and to
prevent serious damage if I accidentally misuse this function in the
future.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Our allocation scheme is very naive for now: we just look for the first
available block. We can still allocate large extents this way because we
only write to one file at a time, but the process is interrupted as soon
as a metadata block is needed.
Until I can work on improving allocation, try to avoid this problem by
allocating metadata blocks at the end of the disk.
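The policy boils down to scanning the allocation map from opposite
ends; a toy version (block count and data structures invented for the
example):

```c
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 32

static bool block_used[NBLOCKS];

/*
 * First-fit from the front for file data, first-fit from the back for
 * metadata, so that a metadata block needed mid-write doesn't cut a
 * growing data extent short.  Returns the block number, or -1 if the
 * device is full.
 */
static int alloc_block(bool metadata)
{
	int i;

	if (metadata) {
		for (i = NBLOCKS - 1; i >= 0; i--) {
			if (!block_used[i]) {
				block_used[i] = true;
				return i;
			}
		}
	} else {
		for (i = 0; i < NBLOCKS; i++) {
			if (!block_used[i]) {
				block_used[i] = true;
				return i;
			}
		}
	}
	return -1;
}
```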
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Record insertions are not usually done in the first position of a node:
the record would go to the previous node instead. So, the case when
there is no previous node needs to be treated separately.
This problem wasn't noticed before because catalog insertions are never
done in the first position, since the first few are always taken by
records for the root directory's parent and for the private directory.
It's for the extent reference tree that this becomes noticeable, and
when old blocks are actually being freed, which was only recently
implemented.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Currently the Apple fsck gives the following error:
error: btn: invalid btn_table_space (0, 320), given btn_flags (0x5)
Stan figured out that, when a node has keys and values of a fixed size,
we are expected to preallocate a table of contents that covers the whole
node. So do that.
Weirdly, the node space calculation doesn't seem to take into account
the presence of a footer in root nodes.
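The resulting arithmetic is simple; a sketch with invented field sizes
(the real on-disk sizes differ):

```c
#include <assert.h>

/*
 * With fixed-size keys and values each record costs one table of
 * contents entry plus key_len + val_len bytes of data, so the
 * preallocated toc should have one entry for every record that could
 * possibly fit in the node's free area.
 */
static int toc_entries_for_fixed(int space, int toc_entry_size,
				 int key_len, int val_len)
{
	return space / (toc_entry_size + key_len + val_len);
}
```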
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Right now the module doesn't bother reusing free space from a node until
it's split. This is a serious problem because it's common for the ip
free queue to be banned from ever having more than one node.
For reuse to be possible, whenever a record is removed (or resized),
keep track of the free space: either move the edges of the key/value
data if it's a tail record, or add the new hole to the proper free list.
Then, on allocation, try to use space from the free lists first.
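The removal-side bookkeeping can be sketched like this (a simplified
layout with invented names; real nodes pack keys and values in
separate areas):

```c
#include <assert.h>

struct hole { int off; int len; };

/* Simplified node: values are packed from val_edge to the end of the
 * node, and holes inside the packed area go to a free list. */
struct node_state {
	int val_edge;
	struct hole holes[16];
	int nholes;
};

/*
 * On record removal, either move the edge of the value area (when the
 * freed value sits right at the edge, i.e. a tail record) or remember
 * the hole so that a later allocation can reuse it.
 */
static void node_free_val(struct node_state *node, int off, int len)
{
	if (off == node->val_edge) {
		node->val_edge += len;
	} else {
		node->holes[node->nholes].off = off;
		node->holes[node->nholes].len = len;
		node->nholes++;
	}
}
```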
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Since I don't deal well with internal fragmentation, free queue nodes
that have been flushed end up getting full and splitting anyway. The
result is two almost-empty nodes, so the following flushes quickly
require a node deletion. Implement it.
I still need to deal with internal fragmentation soon, though. In my
test filesystem, the free queue for the internal pool is not really
allowed to ever have more than one node, and ignoring that restriction
could bring problems with the Apple driver.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
There is no need for the volume superblock to belong to the current
transaction while an ephemeral node is deleted, and in fact it's usually
not the case. Move that assertion to a better location to avoid
triggering it.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Separate the in-memory volume and container superblocks, and keep all
mounted containers in a linked list. Each container will hold a pointer
to its block device; when a mount is requested, we traverse the list to
check if it's a new container. Keep one vfs superblock for each mounted
volume, but only assign it a fake anonymous bdev; all disk operations
must be forwarded to the container's bdev, with the use of the new
apfs_sb_bread() and apfs_map_bh() functions.
All the mount changes require that we implement our own ->mount()
function, closely based on mount_bdev() and the btrfs equivalent. I
can't claim to be confident that all changes here are correct; much
more testing is needed.
To simplify access to the container, define two new helpers similar to
APFS_SB(): APFS_NXI() to retrieve the container superblock info, and
APFS_SM() to retrieve the space manager. Also move the usual assertion
that an object is part of the current transaction to its own inline
function; this saves me from rewriting all the callers and has the added
benefit of silencing "unused variable" warnings when the module is built
without APFS_DEBUG.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
If the object isn't virtual, the bno variable may never be set. The
block number is saved in the buffer head anyway, so just use that. For
some reason, I only get warnings about this with gcc 4.9.2, not with
10.2.0.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Start a new out-of-tree repository, like linux-apfs-oot but with write
support.
To get the module to build independently, rewrite the Makefile and
add a definition for the APFS_SUPER_MAGIC macro. Since the intention is
to support a range of kernel versions, use preprocessor checks to handle
kernels without statx, without iversion, and without SB_RDONLY.
Provide a README file based on the original documentation, but with
additional build and mount instructions. Add a LICENSE file as well.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>