In pretty much all containers I've encountered so far, every object has
a single block. There is an annoying exception though: containers with a
size between ~6.7 TiB and ~7.8 TiB have a multiblock spaceman. I have no
simple way to deal with this because my transactions currently handle
ephemeral objects very roughly, ignoring all the cpm metadata and just
updating each block in the expected header location.
Instead of trying to hack this in somehow, start doing the right thing
as explained in the official reference and keep all ephemeral objects in
memory at all times. So, when the first transaction starts, go through
the cpm metadata and read each object along with its size and oid. When
a new object gets created (or an old one gets deleted), all changes
happen in memory. On commit time, checksums are updated and all objects
get written to disk, but nothing needs to be read.
So this is all much cleaner, and I took this opportunity to support
multiple cpm blocks as well.
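The scheme can be sketched in userspace C; every name here (eph_obj, eph_create, and friends) is a made-up illustration of keeping ephemeral objects in memory between the first transaction and commit, not the driver's real API:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory ephemeral object: all names are illustrative */
struct eph_obj {
	uint64_t oid;		/* object id taken from the cpm metadata */
	size_t size;		/* size in blocks */
	void *data;		/* contents, kept in memory at all times */
	struct eph_obj *next;
};

static struct eph_obj *eph_list;

/* Called once per object when the first transaction reads the cpm area */
static void eph_create(uint64_t oid, size_t size)
{
	struct eph_obj *obj = malloc(sizeof(*obj));

	obj->oid = oid;
	obj->size = size;
	obj->data = calloc(size, 4096);
	obj->next = eph_list;
	eph_list = obj;
}

/* Deletion happens purely in memory */
static void eph_delete(uint64_t oid)
{
	struct eph_obj **p = &eph_list;

	while (*p && (*p)->oid != oid)
		p = &(*p)->next;
	if (*p) {
		struct eph_obj *gone = *p;

		*p = gone->next;
		free(gone->data);
		free(gone);
	}
}

/* On commit, every surviving object gets written out; nothing is read */
static int eph_commit(void)
{
	int writes = 0;

	for (struct eph_obj *o = eph_list; o; o = o->next)
		writes++;	/* checksum update + write would go here */
	return writes;
}
```

The point of the sketch is the commit path: since everything lives in memory, flushing is a pure sequence of writes.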
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
While fixing the build for old kernels, I noticed that in some of them
you get a warning when you printk a buffer head's b_blocknr as 0x%llx.
I haven't checked but I'm guessing that, at some point, sector_t on
64-bit archs went from being 'unsigned long' to 'unsigned long long',
so these warnings went away. Anyway, just cast b_blocknr before printing
it in all kernels.
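A minimal userspace stand-in for the pattern, assuming the old 'unsigned long' definition of sector_t (print_blocknr is a made-up helper, not driver code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The 32-bit-era definition that made %llx warn */
typedef unsigned long sector_t;

static int print_blocknr(char *buf, size_t len, sector_t blocknr)
{
	/* the cast keeps the %llx format correct on every kernel */
	return snprintf(buf, len, "block 0x%llx",
			(unsigned long long)blocknr);
}
```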
At some point I do need to attempt a 32-bit build though. It's been a
long time, and I'm sure there will be much worse problems than this one.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
The driver is much closer to being usable, so I might start getting
subtler bug reports soon. To make them easier to handle, put error
messages all over the place. I should have done this from the beginning,
but I guess I didn't fully understand the need back then.
From now on, my general policy will be to use apfs_warn() for user errors or
unsupported features; apfs_err() for things that are probably corruption
or io errors; and apfs_alert() for things that are most likely bugs.
These last two should be rare, so the same error/alert will be thrown by
several layers in the callstack to provide as much information as
possible. Be careful not to flood the console in normal situations.
Also, make messages with a log level lower than warning output their
function name and line number, which I think will help debugging more
than the actual messages.
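A userspace sketch of the policy, with snprintf standing in for printk (the real macros would look different; this only illustrates the function/line idea for the levels below warning):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static char logbuf[256];	/* stand-in for the console */

/* warn: user errors and unsupported features, no location info */
#define apfs_warn(fmt, ...) \
	snprintf(logbuf, sizeof(logbuf), "apfs: " fmt, ##__VA_ARGS__)

/* err and alert sit below warning, so they carry function and line */
#define apfs_err(fmt, ...) \
	snprintf(logbuf, sizeof(logbuf), "apfs: %s:%d: " fmt, \
		 __func__, __LINE__, ##__VA_ARGS__)
#define apfs_alert(fmt, ...) apfs_err("BUG: " fmt, ##__VA_ARGS__)
```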
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
The apfs_total_blocks_freed and apfs_total_blocks_alloced fields of the
volume superblock keep track of the total number of blocks that have
been freed/alloced through the whole lifetime of the volume so far.
Since the value of these fields depends on historical operations, it
can't be calculated from the volume contents, so I figured I could just
ignore them without consequences.
This seems to be mostly correct, but the official fsck does complain if
the freed blocks appear to be more than the alloced blocks. It's just a
warning, not an error, but it does often trigger after the official
driver makes changes to one of my images. So, try to be careful with
these two fields from now on.
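The bookkeeping itself is trivial; this sketch uses the on-disk field names from above, but the struct and helpers are illustrative, not the driver's:

```c
#include <assert.h>
#include <stdint.h>

/* Lifetime counters from the volume superblock */
struct vol_counters {
	uint64_t total_blocks_alloced;
	uint64_t total_blocks_freed;
};

static void count_alloc(struct vol_counters *c, uint64_t blocks)
{
	c->total_blocks_alloced += blocks;
}

static void count_free(struct vol_counters *c, uint64_t blocks)
{
	c->total_blocks_freed += blocks;
}

/* The official fsck warns when freed appears to exceed alloced */
static int counters_sane(const struct vol_counters *c)
{
	return c->total_blocks_freed <= c->total_blocks_alloced;
}
```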
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
There are two bugs in the way apfs_fs_alloc_count is calculated after a
snapshot. I had not noticed them so far because it randomly gets the
right number as long as no node splits happen, but the new test I'm
working on (called apfs/006 for now) does split nodes and reports the
corruption.
Anyway, the problems are that the volume superblock should never get
counted in apfs_fs_alloc_count, not even after a snapshot; and that
regular CoW of virtual nodes needs to increase apfs_fs_alloc_count if
the original is preserved in a snapshot. Fix both now.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Luflosi reported a long time ago that the cknodes mount option does
not work well on readwrite mounts:
https://github.com/linux-apfs/linux-apfs-rw/issues/15
I don't care much for the cknodes option, and I don't know if anybody is
actually using it, but I shouldn't ignore a simple bug like this for
such a long time.
The problem here is that there are several places in the code where an
object will get its checksum verified after getting changed, but before
the transaction is committed and the checksum updated. As a solution,
refuse to verify buffers that are already part of a transaction. Try to
make sure that a check happens before the join.
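A minimal model of the fix, with a made-up buffer struct standing in for the real buffer head and transaction state:

```c
#include <assert.h>
#include <stdbool.h>

/* A buffer that already joined the transaction has been modified, so
 * its on-disk checksum is stale and must not be verified */
struct buffer {
	bool in_transaction;	/* stands in for the real dirty/trans bit */
	bool csum_ok;		/* result of the actual fletcher check */
};

static bool obj_verify_csum(const struct buffer *bh)
{
	if (bh->in_transaction)
		return true;	/* refuse to verify: checksum is stale */
	return bh->csum_ok;
}
```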
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
The build has been broken for 32-bit machines since at least commit
777191438f ("Implement deletion of ephemeral blocks"), from April
2021. At least one person has recently noticed and filed a bug report:
https://github.com/linux-apfs/linux-apfs-rw/issues/32
The problem is the usual: 64-bit divisions. Fix it.
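For context: on 32-bit kernels a plain u64 division generates a call into libgcc (__udivdi3), which the kernel doesn't link against, so div_u64() from <linux/math64.h> is the usual fix; power-of-two cases can just shift. A userspace stand-in:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace sketch; the real div_u64() avoids the libgcc call */
static uint64_t div_u64_sketch(uint64_t dividend, uint32_t divisor)
{
	return dividend / divisor;
}

/* Divisions by a power of two can simply become shifts */
static uint64_t blks_from_bytes(uint64_t bytes, int blk_shift)
{
	return bytes >> blk_shift;	/* no 64-bit division at all */
}
```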
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Define an in-memory omap structure to be shared by all mounts of the
same volume, including both the current transaction and the snapshots.
The on-disk omap is already shared by all of them, so this will make for
saner code.
The patch fixes two existing issues. One is a regression introduced by
39878b3753be ("Preserve omap records in snapshots"), which was causing
regular failures in generic/013 of xfstests. The problem was with the
hacky way in which I told apart container and volume omaps, by setting
sbi->s_vsb_raw to NULL before calling apfs_map_volume_super(); it seems
that the dentry cache or something needs this to be set at all times.
Now that the "latest_snap" field has been moved from the sbi to the omap
struct, we no longer need to handle containers and volumes differently:
container omaps are the same as volume omaps without snapshots.
The other issue is that snapshot mounts only read the location of the
omap root once (on mount), but subsequent writes to the volume will free
this original block and potentially allocate it for something else. I've
never seen this in practice yet, but I would expect the snapshot mount
to start complaining about filesystem corruption. Now that the omap root
node is shared by all mounts of the volume, this problem should be gone.
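The shared structure could look roughly like this; the field names are illustrative, loosely echoing the description above, not the driver's actual layout:

```c
#include <assert.h>
#include <stdint.h>

/* One instance per volume, shared by the current mount and every
 * snapshot mount, refcounted like the real thing would be */
struct apfs_omap_sketch {
	uint64_t root_bno;	/* current omap root, updated on CoW */
	uint64_t latest_snap;	/* xid of the most recent snapshot */
	int refcnt;		/* one per mount sharing this omap */
};

/* A container omap is just a volume omap that never has snapshots */
static int omap_has_snapshots(const struct apfs_omap_sketch *omap)
{
	return omap->latest_snap != 0;
}
```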
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
With the previous patch in place we can now take snapshots, but they
become immediately corrupted when a new transaction begins and changes
the omap records. This happens even if file contents aren't modified,
so it becomes impossible to test this new functionality at all.
So, always preserve omap records that belong to a snapshot. For this
check, the disk layout provides om_most_recent_snap in the object map;
add a matching s_latest_snap field to the in-memory superblock structure
to make it easier to work with.
Bigger changes are needed in apfs_omap_lookup_block(), because we need
to work with an omap record's xid, which has so far mostly been ignored.
Replace apfs_bno_from_query() with a new apfs_omap_map_from_query()
which doesn't discard so much valuable metadata. This change also allows
us to start preserving the flags in the omap record, which may also be
important.
Another problem in apfs_omap_lookup_block() is that we need a way to
tell apart volume omaps from the container omap, which won't be affected
by snapshots. This is all a little hacky, and it would be much cleaner
if I had an in-memory omap struct to pass around instead of the root
nodes. I should look into that in the future.
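The richer lookup result and the preservation check can be sketched like this; the struct loosely follows the apfs_omap_map_from_query() idea, but the names and the helper are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Keep the xid and flags instead of discarding everything but bno */
struct omap_map {
	uint64_t xid;	/* transaction that wrote this record */
	uint64_t bno;	/* physical block number */
	uint32_t flags;	/* preserved on-disk flags */
};

/* A record is frozen by a snapshot when its xid is not newer than the
 * latest snapshot's, so CoW must write a new record instead */
static int omap_rec_in_snapshot(const struct omap_map *map,
				uint64_t latest_snap_xid)
{
	return map->xid <= latest_snap_xid;
}
```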
With this patch in place, my fsck reports no corruption after a single
snapshot is created and the volume is unmounted.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Don't call apfs_sb_bread() to work on blocks that have no valuable data
and will just be overwritten. It's wasteful because the driver must
sleep until the unnecessary read is done. Work instead with a new
apfs_getblk(). The result is a 5% performance improvement when rsyncing
a whole volume.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
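A userspace model of the difference (both helpers here are stand-ins for the real buffer-head calls, not the driver's implementations):

```c
#include <assert.h>
#include <stdlib.h>

static int reads_issued;

/* bread-style: sleeps on a disk read before returning the buffer */
static void *sb_bread_sketch(size_t blocksize)
{
	reads_issued++;		/* the wasteful synchronous read */
	return calloc(1, blocksize);
}

/* getblk-style: just maps the buffer, which is enough when the whole
 * block will be overwritten anyway */
static void *getblk_sketch(size_t blocksize)
{
	return calloc(1, blocksize);
}
```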
A division is currently done by apfs_fletcher64() to get the size, in u32
units, of
the block to hash. This has a small but clear performance cost, at close
to 2% in my tests. There is no reason not to fix it, so just do it.
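Since block sizes are powers of two, the division can become a shift. This is only a Fletcher-style sketch of the change (the real apfs_fletcher64() differs, e.g. in its modulus):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t fletcher64_sketch(const uint32_t *data, size_t len)
{
	size_t cnt = len >> 2;	/* was: len / sizeof(u32) */
	uint64_t sum1 = 0, sum2 = 0;

	for (size_t i = 0; i < cnt; i++) {
		sum1 += data[i];
		sum2 += sum1;
	}
	return (sum2 << 32) | (sum1 & 0xFFFFFFFF);
}
```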
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Attempts to truncate large files are currently failing. The problem is
that a whole new free queue entry gets added for each block in the file,
and the free queue soon runs into the limits of having a single
checkpoint-mapping block. CPM creation will need to be implemented
eventually, but for now just be less wasteful and allow each entry to
cover a range.
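Range coalescing can be sketched like this; the structure and helper are illustrative, not the driver's free queue code:

```c
#include <assert.h>
#include <stdint.h>

/* One entry now covers a whole range of freed blocks */
struct fq_entry {
	uint64_t bno;	/* first block in the range */
	uint64_t count;	/* blocks covered by this one entry */
};

/* Returns the number of entries in use after queuing 'bno' */
static int fq_free_block(struct fq_entry *q, int used, uint64_t bno)
{
	if (used && q[used - 1].bno + q[used - 1].count == bno) {
		q[used - 1].count++;	/* grow the range, no new entry */
		return used;
	}
	q[used].bno = bno;
	q[used].count = 1;
	return used + 1;
}
```

Freeing a large file's contiguous extent now costs a single entry instead of one per block, which is what keeps the free queue inside a single checkpoint-mapping block.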
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Our allocation scheme is very naive for now: we just look for the first
available block. We can still allocate large extents this way because we
only write to one file at a time, but the process is interrupted as soon
as a metadata block is needed.
Until I can work on improving allocation, try to avoid this problem by
allocating metadata blocks at the end of the disk.
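On a toy bitmap, the workaround looks like this (both allocators are made-up stand-ins for the real space manager paths):

```c
#include <assert.h>

#define NBLOCKS 16
static unsigned char used[NBLOCKS];

/* Data keeps the naive first-fit search from the front of the disk */
static int alloc_data(void)
{
	for (int i = 0; i < NBLOCKS; i++)
		if (!used[i]) { used[i] = 1; return i; }
	return -1;
}

/* Metadata searches backwards from the end, so it no longer lands in
 * the middle of a growing data extent */
static int alloc_metadata(void)
{
	for (int i = NBLOCKS - 1; i >= 0; i--)
		if (!used[i]) { used[i] = 1; return i; }
	return -1;
}
```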
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Since I don't deal well with internal fragmentation, free queue nodes
that have been flushed end up getting full and splitting anyway. The
result is two almost-empty nodes, so the following flushes quickly
require a node deletion. Implement it.
I still need to deal with internal fragmentation soon, though. In my
test filesystem, the free queue for the internal pool is not really
allowed to ever have more than one node, and ignoring that restriction
could bring problems with the Apple driver.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Separate the in-memory volume and container superblocks, and keep all
mounted containers in a linked list. Each container will hold a pointer
to its block device; when a mount is requested, we traverse the list to
check if it's a new container. Keep one vfs superblock for each mounted
volume, but only assign it a fake anonymous bdev; all disk operations
must be forwarded to the container's bdev, with the use of the new
apfs_sb_bread() and apfs_map_bh() functions.
All the mount changes require that we implement our own ->mount()
function, closely based on mount_bdev() and the btrfs equivalent. I
can't claim to be confident that all changes here are correct; much more
testing is needed.
To simplify access to the container, define two new helpers similar to
APFS_SB(): APFS_NXI() to retrieve the container superblock info, and
APFS_SM() to retrieve the space manager. Also move the usual assertion
that an object is part of the current transaction to its own inline
function; this saves me from rewriting all the callers and has the added
benefit of silencing "unused variable" warnings when the module is built
without APFS_DEBUG.
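The helper shape can be sketched like this; the real APFS_NXI() and APFS_SM() take a struct super_block pointer, and the struct layouts here are stand-ins:

```c
#include <assert.h>

struct spaceman { int dummy; };

struct nx_sb_info {
	struct spaceman *nx_spaceman;	/* container-wide space manager */
};

struct vol_sb_info {
	struct nx_sb_info *v_nxi;	/* shared container superblock info */
};

/* Every volume reaches the container, and through it the spaceman */
#define APFS_NXI(vsbi)	((vsbi)->v_nxi)
#define APFS_SM(vsbi)	(APFS_NXI(vsbi)->nx_spaceman)
```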
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Start a new out-of-tree repository, like linux-apfs-oot but with write
support.
To get the module to build independently, rewrite the Makefile and
add a definition for the APFS_SUPER_MAGIC macro. Since the intention is
to support a range of kernel versions, use preprocessor checks to handle
kernels without statx, without iversion, and without SB_RDONLY.
Provide a README file based on the original documentation, but with
additional build and mount instructions. Add a LICENSE file as well.
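The compatibility approach can be sketched with the SB_RDONLY case: older kernels spell the read-only flag MS_RDONLY, so a preprocessor shim keeps one spelling in the code (the MS_RDONLY value below is a userspace stand-in):

```c
#include <assert.h>

#define MS_RDONLY 1		/* stand-in for the old kernel macro */

#ifndef SB_RDONLY
#define SB_RDONLY MS_RDONLY	/* fall back on kernels without it */
#endif

static int mount_is_readonly(unsigned long flags)
{
	return (flags & SB_RDONLY) != 0;
}
```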
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>