In pretty much all containers I've encountered so far, every object has
a single block. There is an annoying exception though: containers with a
size between ~6.7 TiB and ~7.8 TiB have a multiblock spaceman. I have no
simple way to deal with this because my transactions currently handle
ephemeral objects very roughly, ignoring all the cpm metadata and just
updating each block in the expected header location.
Instead of trying to hack this in somehow, start doing the right thing
as explained in the official reference and keep all ephemeral objects in
memory at all times. So, when the first transaction starts, go through
the cpm metadata and read each object along with its size and oid. When
a new object gets created (or an old one gets deleted), all changes
happen in memory. On commit time, checksums are updated and all objects
get written to disk, but nothing needs to be read.
So this is all much cleaner, and I took this opportunity to support
multiple cpm blocks as well.
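The scheme can be sketched in userspace C; every name here (eph_obj, eph_create, and friends) is a made-up illustration of keeping ephemeral objects in memory between the first transaction and commit, not the driver's real API:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory ephemeral object: all names are illustrative */
struct eph_obj {
	uint64_t oid;		/* object id taken from the cpm metadata */
	size_t size;		/* size in blocks */
	void *data;		/* contents, kept in memory at all times */
	struct eph_obj *next;
};

static struct eph_obj *eph_list;

/* Called once per object when the first transaction reads the cpm area */
static void eph_create(uint64_t oid, size_t size)
{
	struct eph_obj *obj = malloc(sizeof(*obj));

	obj->oid = oid;
	obj->size = size;
	obj->data = calloc(size, 4096);
	obj->next = eph_list;
	eph_list = obj;
}

/* Deletion happens purely in memory */
static void eph_delete(uint64_t oid)
{
	struct eph_obj **p = &eph_list;

	while (*p && (*p)->oid != oid)
		p = &(*p)->next;
	if (*p) {
		struct eph_obj *gone = *p;

		*p = gone->next;
		free(gone->data);
		free(gone);
	}
}

/* On commit, every surviving object gets written out; nothing is read */
static int eph_commit(void)
{
	int writes = 0;

	for (struct eph_obj *o = eph_list; o; o = o->next)
		writes++;	/* checksum update + write would go here */
	return writes;
}
```

The point of the sketch is the commit path: since everything lives in memory, flushing is a pure sequence of writes.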
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
While fixing the build for old kernels, I noticed that in some of them
you get a warning when you printk a buffer head's b_blocknr as 0x%llx.
I haven't checked but I'm guessing that, at some point, sector_t on
64-bit archs went from being 'unsigned long' to 'unsigned long long',
so these warnings went away. Anyway, just cast b_blocknr before printing
it in all kernels.
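A minimal userspace stand-in for the pattern, assuming the old 'unsigned long' definition of sector_t (print_blocknr is a made-up helper, not driver code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The 32-bit-era definition that made %llx warn */
typedef unsigned long sector_t;

static int print_blocknr(char *buf, size_t len, sector_t blocknr)
{
	/* the cast keeps the %llx format correct on every kernel */
	return snprintf(buf, len, "block 0x%llx",
			(unsigned long long)blocknr);
}
```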
At some point I do need to attempt a 32-bit build though. It's been a
long time, and I'm sure there will be much worse problems than this one.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
The driver is much closer to being usable, so I might start getting
subtler bug reports soon. To make them easier to handle, put error
messages all over the place. I should have done this from the beginning,
but I guess I didn't fully understand the need back then.
From now on, my general policy will be to use apfs_warn() for user errors or
unsupported features; apfs_err() for things that are probably corruption
or io errors; and apfs_alert() for things that are most likely bugs.
These last two should be rare, so the same error/alert will be thrown by
several layers in the callstack to provide as much information as
possible. Be careful not to flood the console in normal situations.
Also, make messages with a log level lower than warning output their
function name and line number, which I think will help debugging more
than the actual messages.
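A userspace sketch of the policy, with snprintf standing in for printk (the real macros would look different; this only illustrates the function/line idea for the levels below warning):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static char logbuf[256];	/* stand-in for the console */

/* warn: user errors and unsupported features, no location info */
#define apfs_warn(fmt, ...) \
	snprintf(logbuf, sizeof(logbuf), "apfs: " fmt, ##__VA_ARGS__)

/* err and alert sit below warning, so they carry function and line */
#define apfs_err(fmt, ...) \
	snprintf(logbuf, sizeof(logbuf), "apfs: %s:%d: " fmt, \
		 __func__, __LINE__, ##__VA_ARGS__)
#define apfs_alert(fmt, ...) apfs_err("BUG: " fmt, ##__VA_ARGS__)
```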
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
The apfs_total_blocks_freed and apfs_total_blocks_alloced fields of the
volume superblock keep track of the total number of blocks that have
been freed/alloced through the whole lifetime of the volume so far.
Since the value of these fields depends on historical operations, it
can't be calculated from the volume contents, so I figured I could just
ignore them without consequences.
This seems to be mostly correct, but the official fsck does complain if
the freed blocks appear to be more than the alloced blocks. It's just a
warning, not an error, but it does often trigger after the official
driver makes changes to one of my images. So, try to be careful with
these two fields from now on.
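The bookkeeping itself is trivial; this sketch uses the on-disk field names from above, but the struct and helpers are illustrative, not the driver's:

```c
#include <assert.h>
#include <stdint.h>

/* Lifetime counters from the volume superblock */
struct vol_counters {
	uint64_t total_blocks_alloced;
	uint64_t total_blocks_freed;
};

static void count_alloc(struct vol_counters *c, uint64_t blocks)
{
	c->total_blocks_alloced += blocks;
}

static void count_free(struct vol_counters *c, uint64_t blocks)
{
	c->total_blocks_freed += blocks;
}

/* The official fsck warns when freed appears to exceed alloced */
static int counters_sane(const struct vol_counters *c)
{
	return c->total_blocks_freed <= c->total_blocks_alloced;
}
```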
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
There are two bugs in the way apfs_fs_alloc_count is calculated after a
snapshot. I had not noticed them so far because it randomly gets the
right number as long as no node splits happen, but the new test I'm
working on (called apfs/006 for now) does split nodes and reports the
corruption.
Anyway, the problems are that the volume superblock should never get
counted in apfs_fs_alloc_count, not even after a snapshot; and that
regular CoW of virtual nodes needs to increase apfs_fs_alloc_count if
the original is preserved in a snapshot. Fix both now.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Luflosi reported a long time ago that the cknodes mount option does
not work well on readwrite mounts:
https://github.com/linux-apfs/linux-apfs-rw/issues/15
I don't care much for the cknodes option, and I don't know if anybody is
actually using it, but I shouldn't ignore a simple bug like this for
such a long time.
The problem here is that there are several places in the code where an
object will get its checksum verified after getting changed, but before
the transaction is committed and the checksum updated. As a solution,
refuse to verify buffers that are already part of a transaction. Try to
make sure that a check happens before the join.
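A minimal model of the fix, with a made-up buffer struct standing in for the real buffer head and transaction state:

```c
#include <assert.h>
#include <stdbool.h>

/* A buffer that already joined the transaction has been modified, so
 * its on-disk checksum is stale and must not be verified */
struct buffer {
	bool in_transaction;	/* stands in for the real dirty/trans bit */
	bool csum_ok;		/* result of the actual fletcher check */
};

static bool obj_verify_csum(const struct buffer *bh)
{
	if (bh->in_transaction)
		return true;	/* refuse to verify: checksum is stale */
	return bh->csum_ok;
}
```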
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
The build has been broken for 32-bit machines since at least commit
777191438f ("Implement deletion of ephemeral blocks"), from April
2021. At least one person has recently noticed and filed a bug report:
https://github.com/linux-apfs/linux-apfs-rw/issues/32
The problem is the usual: 64-bit divisions. Fix it.
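For context: on 32-bit kernels a plain u64 division generates a call into libgcc (__udivdi3), which the kernel doesn't link against, so div_u64() from <linux/math64.h> is the usual fix; power-of-two cases can just shift. A userspace stand-in:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace sketch; the real div_u64() avoids the libgcc call */
static uint64_t div_u64_sketch(uint64_t dividend, uint32_t divisor)
{
	return dividend / divisor;
}

/* Divisions by a power of two can simply become shifts */
static uint64_t blks_from_bytes(uint64_t bytes, int blk_shift)
{
	return bytes >> blk_shift;	/* no 64-bit division at all */
}
```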
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Define an in-memory omap structure to be shared by all mounts of the
same volume, including both the current transaction and the snapshots.
The on-disk omap is already shared by all of them, so this will make for
saner code.
The patch fixes two existing issues. One is a regression introduced by
39878b3753be ("Preserve omap records in snapshots"), which was causing
regular failures in generic/013 of xfstests. The problem was with the
hacky way in which I told apart container and volume omaps, by setting
sbi->s_vsb_raw to NULL before calling apfs_map_volume_super(); it seems
that the dentry cache or something needs this to be set at all times.
Now that the "latest_snap" field has been moved from the sbi to the omap
struct, we no longer need to handle containers and volumes differently:
container omaps are the same as volume omaps without snapshots.
The other issue is that snapshot mounts only read the location of the
omap root once (on mount), but subsequent writes to the volume will free
this original block and potentially allocate it for something else. I've
never seen this in practice yet, but I would expect the snapshot mount
to start complaining about filesystem corruption. Now that the omap root
node is shared by all mounts of the volume, this problem should be gone.
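The shared structure could look roughly like this; the field names are illustrative, loosely echoing the description above, not the driver's actual layout:

```c
#include <assert.h>
#include <stdint.h>

/* One instance per volume, shared by the current mount and every
 * snapshot mount, refcounted like the real thing would be */
struct apfs_omap_sketch {
	uint64_t root_bno;	/* current omap root, updated on CoW */
	uint64_t latest_snap;	/* xid of the most recent snapshot */
	int refcnt;		/* one per mount sharing this omap */
};

/* A container omap is just a volume omap that never has snapshots */
static int omap_has_snapshots(const struct apfs_omap_sketch *omap)
{
	return omap->latest_snap != 0;
}
```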
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
With the previous patch in place we can now take snapshots, but they
become immediately corrupted when a new transaction begins and changes
the omap records. This happens even if file contents aren't modified,
so it becomes impossible to test this new functionality at all.
So, always preserve omap records that belong to a snapshot. For this
check, the disk layout provides om_most_recent_snap in the object map;
add a matching s_latest_snap field to the in-memory superblock structure
to make it easier to work with.
Bigger changes are needed in apfs_omap_lookup_block(), because we need
to work with an omap record's xid, which has so far mostly been ignored.
Replace apfs_bno_from_query() with a new apfs_omap_map_from_query()
which doesn't discard so much valuable metadata. This change also allows
us to start preserving the flags in the omap record, which may also be
important.
Another problem in apfs_omap_lookup_block() is that we need a way to
tell apart volume omaps from the container omap, which won't be affected
by snapshots. This is all a little hacky, and it would be much cleaner
if I had an in-memory omap struct to pass around instead of the root
nodes. I should look into that in the future.
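The richer lookup result and the preservation check can be sketched like this; the struct loosely follows the apfs_omap_map_from_query() idea, but the names and the helper are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Keep the xid and flags instead of discarding everything but bno */
struct omap_map {
	uint64_t xid;	/* transaction that wrote this record */
	uint64_t bno;	/* physical block number */
	uint32_t flags;	/* preserved on-disk flags */
};

/* A record is frozen by a snapshot when its xid is not newer than the
 * latest snapshot's, so CoW must write a new record instead */
static int omap_rec_in_snapshot(const struct omap_map *map,
				uint64_t latest_snap_xid)
{
	return map->xid <= latest_snap_xid;
}
```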
With this patch in place, my fsck reports no corruption after a single
snapshot is created and the volume is unmounted.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Don't call apfs_sb_bread() to work on blocks that have no valuable data
and will just be overwritten. It's wasteful because the driver must
sleep until the unnecessary read is done. Work instead with a new
apfs_getblk(). The result is a 5% performance improvement when rsyncing
a whole volume.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
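A userspace model of the difference (both helpers here are stand-ins for the real buffer-head calls, not the driver's implementations):

```c
#include <assert.h>
#include <stdlib.h>

static int reads_issued;

/* bread-style: sleeps on a disk read before returning the buffer */
static void *sb_bread_sketch(size_t blocksize)
{
	reads_issued++;		/* the wasteful synchronous read */
	return calloc(1, blocksize);
}

/* getblk-style: just maps the buffer, which is enough when the whole
 * block will be overwritten anyway */
static void *getblk_sketch(size_t blocksize)
{
	return calloc(1, blocksize);
}
```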
A division is currently done by apfs_fletcher64() to get the size, in u32
units, of
the block to hash. This has a small but clear performance cost, at close
to 2% in my tests. There is no reason not to fix it, so just do it.
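Since block sizes are powers of two, the division can become a shift. This is only a Fletcher-style sketch of the change (the real apfs_fletcher64() differs, e.g. in its modulus):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t fletcher64_sketch(const uint32_t *data, size_t len)
{
	size_t cnt = len >> 2;	/* was: len / sizeof(u32) */
	uint64_t sum1 = 0, sum2 = 0;

	for (size_t i = 0; i < cnt; i++) {
		sum1 += data[i];
		sum2 += sum1;
	}
	return (sum2 << 32) | (sum1 & 0xFFFFFFFF);
}
```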
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Attempts to truncate large files are currently failing. The problem is
that a whole new free queue entry gets added for each block in the file,
and the free queue soon runs into the limits of having a single
checkpoint-mapping block. CPM creation will need to be implemented
eventually, but for now just be less wasteful and allow each entry to
cover a range.
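Range coalescing can be sketched like this; the structure and helper are illustrative, not the driver's free queue code:

```c
#include <assert.h>
#include <stdint.h>

/* One entry now covers a whole range of freed blocks */
struct fq_entry {
	uint64_t bno;	/* first block in the range */
	uint64_t count;	/* blocks covered by this one entry */
};

/* Returns the number of entries in use after queuing 'bno' */
static int fq_free_block(struct fq_entry *q, int used, uint64_t bno)
{
	if (used && q[used - 1].bno + q[used - 1].count == bno) {
		q[used - 1].count++;	/* grow the range, no new entry */
		return used;
	}
	q[used].bno = bno;
	q[used].count = 1;
	return used + 1;
}
```

Freeing a large file's contiguous extent now costs a single entry instead of one per block, which is what keeps the free queue inside a single checkpoint-mapping block.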
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Our allocation scheme is very naive for now: we just look for the first
available block. We can still allocate large extents this way because we
only write to one file at a time, but the process is interrupted as soon
as a metadata block is needed.
Until I can work on improving allocation, try to avoid this problem by
allocating metadata blocks at the end of the disk.
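On a toy bitmap, the workaround looks like this (both allocators are made-up stand-ins for the real space manager paths):

```c
#include <assert.h>

#define NBLOCKS 16
static unsigned char used[NBLOCKS];

/* Data keeps the naive first-fit search from the front of the disk */
static int alloc_data(void)
{
	for (int i = 0; i < NBLOCKS; i++)
		if (!used[i]) { used[i] = 1; return i; }
	return -1;
}

/* Metadata searches backwards from the end, so it no longer lands in
 * the middle of a growing data extent */
static int alloc_metadata(void)
{
	for (int i = NBLOCKS - 1; i >= 0; i--)
		if (!used[i]) { used[i] = 1; return i; }
	return -1;
}
```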
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Since I don't deal well with internal fragmentation, free queue nodes
that have been flushed end up getting full and splitting anyway. The
result is two almost-empty nodes, so the following flushes quickly
require a node deletion. Implement it.
I still need to deal with internal fragmentation soon, though. In my
test filesystem, the free queue for the internal pool is not really
allowed to ever have more than one node, and ignoring that restriction
could bring problems with the Apple driver.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Separate the in-memory volume and container superblocks, and keep all
mounted containers in a linked list. Each container will hold a pointer
to its block device; when a mount is requested, we traverse the list to
check if it's a new container. Keep one vfs superblock for each mounted
volume, but only assign it a fake anonymous bdev; all disk operations
must be forwarded to the container's bdev, with the use of the new
apfs_sb_bread() and apfs_map_bh() functions.
All the mount changes require that we implement our own ->mount()
function, closely based on mount_bdev() and the btrfs equivalent. I
can't claim to be confident that all changes here are correct; much more
testing is needed.
To simplify access to the container, define two new helpers similar to
APFS_SB(): APFS_NXI() to retrieve the container superblock info, and
APFS_SM() to retrieve the space manager. Also move the usual assertion
that an object is part of the current transaction to its own inline
function; this saves me from rewriting all the callers and has the added
benefit of silencing "unused variable" warnings when the module is built
without APFS_DEBUG.
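The helper shape can be sketched like this; the real APFS_NXI() and APFS_SM() take a struct super_block pointer, and the struct layouts here are stand-ins:

```c
#include <assert.h>

struct spaceman { int dummy; };

struct nx_sb_info {
	struct spaceman *nx_spaceman;	/* container-wide space manager */
};

struct vol_sb_info {
	struct nx_sb_info *v_nxi;	/* shared container superblock info */
};

/* Every volume reaches the container, and through it the spaceman */
#define APFS_NXI(vsbi)	((vsbi)->v_nxi)
#define APFS_SM(vsbi)	(APFS_NXI(vsbi)->nx_spaceman)
```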
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Start a new out-of-tree repository, like linux-apfs-oot but with write
support.
To get the module to build independently, rewrite the Makefile and
add a definition for the APFS_SUPER_MAGIC macro. Since the intention is
to support a range of kernel versions, use preprocessor checks to handle
kernels without statx, without iversion, and without SB_RDONLY.
Provide a README file based on the original documentation, but with
additional build and mount instructions. Add a LICENSE file as well.
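The compatibility approach can be sketched with the SB_RDONLY case: older kernels spell the read-only flag MS_RDONLY, so a preprocessor shim keeps one spelling in the code (the MS_RDONLY value below is a userspace stand-in):

```c
#include <assert.h>

#define MS_RDONLY 1		/* stand-in for the old kernel macro */

#ifndef SB_RDONLY
#define SB_RDONLY MS_RDONLY	/* fall back on kernels without it */
#endif

static int mount_is_readonly(unsigned long flags)
{
	return (flags & SB_RDONLY) != 0;
}
```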
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>