Update the offset of the current query's record key if
apfs_node_insert() expands the table of contents. Otherwise, the
function may still find later on that there's no room for the new
record and return ENOSPC, leaving behind an inconsistent query. A
caller that attempts to refresh such a query would run into trouble.
This bug was discovered with generic/013 of xfstests.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
The header for apfs_node_split() claims that the query parent will
always be set to NULL on success, but this isn't done correctly if the
node is only defragmented. This is a problem because
apfs_btree_insert() relies on that behaviour to decide whether a split
has been attempted yet, so the current error will make it loop
forever.
So fix apfs_node_split() and make it act as documented.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
I never implemented a way to deal with huge records that take over a
whole node because I thought such a thing wouldn't happen unless xattrs
were used. I was wrong: generic/013 from xfstests regularly aborts
transactions when it tries to create a large inode record right next to
a large (symlink) xattr record.
The difficulty here is that my code always inserts a new record right
after another, in the same node. The node is split and defragmented if
full, but the two records will always be consecutive. If they overflow
the node by themselves, nothing can be done.
So, implement a new insertion mechanism for this special case: just
create a whole new node and put the new record in it. This is probably
bad in terms of disk usage, but it should be rare outside of xfstests.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
After a node gets split, its query will no longer have the parent
field set. If the record we intend to insert is still too big to fit
in the resulting node half, a second split will be triggered, and the
lack of a parent in which to update the index records will result in a
null pointer dereference.
To prevent this, check if a query's parent is gone before a split and,
if that's the case, refresh it. To be safe, also add a check for the
parent inside apfs_node_split().
This is all necessary to fix crashes in generic/013 of xfstests, but it
doesn't fully solve the problems with splits yet. When inserting a
sufficiently large inline xattr, it's possible that a defragmented node
with a single record will still be unable to receive it. Add a check for
this and return -EOPNOTSUPP, to prevent an infinite loop; but I will
need to deal with it for real at some point.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
When a new record key or value is allocated inside the slot left behind
by a larger one that was removed, the free space count gets corrupted.
The problem is that a new entry is added for the room remaining in the
slot, which increases the free space count even though that space was
already counted. Fix this.
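The fix can be illustrated with a toy model in plain C (the structure
names and sizes here are made up for illustration; they are not the
module's actual definitions):

```c
#include <assert.h>

/* Toy model of a node's free list: holes left behind by removed
 * records, plus a running total of free bytes. */
struct free_entry { int off; int len; };

struct node_info {
	struct free_entry list[16];
	int count;	/* entries in the free list */
	int free_bytes;	/* total free space in the node */
};

/*
 * Allocate @len bytes from the first hole that is big enough.  The
 * point of the fix: when the hole is larger than needed, shrink the
 * existing entry in place instead of adding a new entry for the
 * remainder, and subtract only @len from the free space count; the
 * remainder was already counted when the hole was first created.
 */
static int node_alloc_from_free_list(struct node_info *node, int len)
{
	int i, off;

	for (i = 0; i < node->count; i++) {
		if (node->list[i].len < len)
			continue;
		off = node->list[i].off;
		node->list[i].off += len;
		node->list[i].len -= len;
		if (node->list[i].len == 0)
			node->list[i] = node->list[--node->count];
		node->free_bytes -= len;
		return off;
	}
	return -1; /* no hole is big enough */
}
```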
After the previous commit ff6e5be159 ("Avoid splitting nodes with a
single record"), this bug started causing failures in generic/001 of
xfstests. This appears to be just a coincidence and not a problem with
that patch.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Generic/011 in xfstests causes a lot of internal fragmentation in the
catalog records. This eventually creates nodes that have a single record
but no free space at all. Splits of this node may be requested, but the
current code doesn't support them (there's even a TODO about it),
because dealing with the resulting temporarily empty nodes would be very
tricky.
For now, avoid the split in these situations and just defragment the
original node. Sadly, this approach won't be enough to deal with the
potentially huge records of inline xattrs: a node holding just one of
those may be legitimately full.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Add support for extent reference tree queries to apfs_delete_node().
The nodes are physical just like those of the object map, so it's a
simple change. It's needed immediately because generic/001 from xfstests
exercises this code path.
One thing to note is that extent reference entries are usually deleted
when an orphan inode gets cleaned up, so the EOPNOTSUPP error code from
this function doesn't reach userland. This made the source of the
problem less obvious, so from now on print an error message to the
console as well.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Attempts to truncate large files are currently failing. The problem is
that a whole new free queue entry gets added for each block in the file,
and the free queue soon runs into the limits of having a single
checkpoint-mapping block. CPM creation will need to be implemented
eventually, but for now just be less wasteful and allow each entry to
cover a range.
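A minimal sketch of the range idea, with made-up structure names and a
fixed-size queue for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Each queue entry covers a range of blocks instead of a single one. */
struct fq_entry { uint64_t bno; uint64_t count; };

struct free_queue {
	struct fq_entry entries[64];
	int used;
};

/*
 * Queue a freed block.  Truncation frees a file's blocks in order, so
 * most calls just extend the last entry's range instead of consuming
 * a new entry.  Returns 0 on success, -1 if the queue is full.
 */
static int fq_free_block(struct free_queue *fq, uint64_t bno)
{
	if (fq->used > 0) {
		struct fq_entry *last = &fq->entries[fq->used - 1];

		if (bno == last->bno + last->count) {
			last->count++; /* contiguous: extend the range */
			return 0;
		}
	}
	if (fq->used == 64)
		return -1;
	fq->entries[fq->used].bno = bno;
	fq->entries[fq->used].count = 1;
	fq->used++;
	return 0;
}
```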
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Never call apfs_delete_node() for a root node. This is clearly
documented, but I somehow made a mess of it anyway, and it will become a
problem as soon as we start deleting physical extents.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Some compilers complain that bno and oid may be used uninitialized in
apfs_create_node(). This isn't true in practice, because the code path
that triggers the warning will never be run if the function was called
correctly; this is documented by an "ASSERT(false)".
Deal with it anyway, both to avoid the noise during compilation, and to
prevent serious damage if I accidentally misuse this function in the
future.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Our allocation scheme is very naive for now: we just look for the first
available block. We can still allocate large extents this way because we
only write to one file at a time, but the process is interrupted as soon
as a metadata block is needed.
Until I can work on improving allocation, try to avoid this problem by
allocating metadata blocks at the end of the disk.
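The policy boils down to scanning the allocation map from opposite
ends; a toy version (block count and data structures invented for the
example):

```c
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 32

static bool block_used[NBLOCKS];

/*
 * First-fit from the front for file data, first-fit from the back for
 * metadata, so that a metadata block needed mid-write doesn't cut a
 * growing data extent short.  Returns the block number, or -1 if the
 * device is full.
 */
static int alloc_block(bool metadata)
{
	int i;

	if (metadata) {
		for (i = NBLOCKS - 1; i >= 0; i--) {
			if (!block_used[i]) {
				block_used[i] = true;
				return i;
			}
		}
	} else {
		for (i = 0; i < NBLOCKS; i++) {
			if (!block_used[i]) {
				block_used[i] = true;
				return i;
			}
		}
	}
	return -1;
}
```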
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Record insertions are not usually done in the first position of a node:
the record would go to the previous node instead. So, the case when
there is no previous node needs to be treated separately.
This problem wasn't noticed before because catalog insertions are never
done in the first position, since the first few are always taken by
records for the root directory's parent and for the private directory.
It's for the extent reference tree that this becomes noticeable, and
when old blocks are actually being freed, which was only recently
implemented.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Currently the Apple fsck gives the following error:
error: btn: invalid btn_table_space (0, 320), given btn_flags (0x5)
Stan figured out that, when a node has keys and values of a fixed size,
we are expected to preallocate a table of contents that covers the whole
node. So do that.
Weirdly, the node space calculation doesn't seem to take into account
the presence of a footer in root nodes.
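The resulting arithmetic is simple; a sketch with invented field sizes
(the real on-disk sizes differ):

```c
#include <assert.h>

/*
 * With fixed-size keys and values each record costs one table of
 * contents entry plus key_len + val_len bytes of data, so the
 * preallocated toc should have one entry for every record that could
 * possibly fit in the node's free area.
 */
static int toc_entries_for_fixed(int space, int toc_entry_size,
				 int key_len, int val_len)
{
	return space / (toc_entry_size + key_len + val_len);
}
```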
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Right now the module doesn't bother reusing free space from a node until
it's split. This is a serious problem because it's common for the ip
free queue to be banned from ever having more than one node.
For reuse to be possible, whenever a record is removed (or resized),
keep track of the free space: either move the edges of the key/value
data if it's a tail record, or add the new hole to the proper free list.
Then, on allocation, try to use space from the free lists first.
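The removal-side bookkeeping can be sketched like this (a simplified
layout with invented names; real nodes pack keys and values in
separate areas):

```c
#include <assert.h>

struct hole { int off; int len; };

/* Simplified node: values are packed from val_edge to the end of the
 * node, and holes inside the packed area go to a free list. */
struct node_state {
	int val_edge;
	struct hole holes[16];
	int nholes;
};

/*
 * On record removal, either move the edge of the value area (when the
 * freed value sits right at the edge, i.e. a tail record) or remember
 * the hole so that a later allocation can reuse it.
 */
static void node_free_val(struct node_state *node, int off, int len)
{
	if (off == node->val_edge) {
		node->val_edge += len;
	} else {
		node->holes[node->nholes].off = off;
		node->holes[node->nholes].len = len;
		node->nholes++;
	}
}
```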
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Since I don't deal well with internal fragmentation, free queue nodes
that have been flushed end up getting full and splitting anyway. The
result is two almost-empty nodes, so the following flushes quickly
require a node deletion. Implement it.
I still need to deal with internal fragmentation soon, though. In my
test filesystem, the free queue for the internal pool is not really
allowed to ever have more than one node, and ignoring that restriction
could bring problems with the Apple driver.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
There is no need for the volume superblock to belong to the current
transaction while an ephemeral node is deleted, and in fact it's usually
not the case. Move that assertion to a better location to avoid
triggering it.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Separate the in-memory volume and container superblocks, and keep all
mounted containers in a linked list. Each container will hold a pointer
to its block device; when a mount is requested, we traverse the list to
check if it's a new container. Keep one vfs superblock for each mounted
volume, but only assign it a fake anonymous bdev; all disk operations
must be forwarded to the container's bdev, with the use of the new
apfs_sb_bread() and apfs_map_bh() functions.
All the mount changes require that we implement our own ->mount()
function, closely based on mount_bdev() and the btrfs equivalent. I
can't claim to be confident that all changes here are correct; much
more testing is needed.
To simplify access to the container, define two new helpers similar to
APFS_SB(): APFS_NXI() to retrieve the container superblock info, and
APFS_SM() to retrieve the space manager. Also move the usual assertion
that an object is part of the current transaction to its own inline
function; this saves me from rewriting all the callers and has the added
benefit of silencing "unused variable" warnings when the module is built
without APFS_DEBUG.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
If the object isn't virtual, the bno variable may never be set. The
block number is saved in the buffer head anyway, so just use that. For
some reason, I only get warnings about this with gcc 4.9.2, not with
10.2.0.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Start a new out-of-tree repository, like linux-apfs-oot but with write
support.
To get the module to build independently, rewrite the Makefile and
add a definition for the APFS_SUPER_MAGIC macro. Since the intention is
to support a range of kernel versions, use preprocessor checks to handle
kernels without statx, without iversion, and without SB_RDONLY.
Provide a README file based on the original documentation, but with
additional build and mount instructions. Add a LICENSE file as well.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>