The build is broken for the 6.19 release candidate:
https://github.com/linux-apfs/linux-apfs-rw/issues/109
The problem is that the inode i_state field can no longer be accessed
directly. Switch to the new helpers, using a wrapper to keep the code
consistent for all kernel versions.
Note that I always use the *_raw version of the helpers, to avoid a lockdep
check that will not pass. I can't tell for sure whether that failing check
points to a real bug.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
The build is broken for the 6.18 release candidate:
https://github.com/linux-apfs/linux-apfs-rw/issues/107
The problem is that ->d_name has become const, and that qualifier gets
dropped when we pass it around in the driver. Just treat it as const for
all kernel versions.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
The ->mkdir() method now needs to return a dentry struct instead of an
int, but this is apparently meant for clustered filesystems that may need
to instantiate a different dentry, so just return NULL on success, like
most filesystems do.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
If a function fails with ENOSPC halfway through an operation, the
transaction will get aborted so that the filesystem isn't left in an
inconsistent state. This means that the mount will go read-only, and
that the user will lose any uncommitted changes.
So, it's better if we detect ENOSPC before starting an operation. I
tried this before with commit a52b73ed97 ("Check free space before
starting a transaction"), but the code was messy and it would have
required constant updates, which I neglected of course. The problem was
that I wanted to be as accurate as possible with my calculations of
required space, but I'm not sure there is any point.
Instead, require a large fixed amount of space (512 KiB) for most
operations, and a much smaller but still almost certainly sufficient
amount of space (80 KiB) for deletions. Hopefully this will be enough
for users to maneuver around ENOSPC situations. It's not perfect, but no
CoW filesystem can be truly perfect in this regard.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Gilgamesh reports via email that the following iozone benchmark is
failing:
./iozone -a -i 0 -r 1M -n 1g -g 1g -f /mnt/apfs/test
I've tried this myself on a small (3 GiB) filesystem, and it is the
commit after the unlink that fails, complaining about the main free
queue being full:
APFS (800010g): free queue has too many nodes (348 > 347) (apfs_free_queue_insert_nocache:849)
The problem is that random writes to a big file create a lot of extents,
and when the file is deleted they all get dumped to the main free queue
as separate records. The queue gets full and the commit fails.
Stop forcing file deletion to happen in a single transaction. Instead,
queue a work item whenever an orphan inode gets evicted, and leave the
cleanup for another time. Also allow the cleanup to execute only
partially (deleting the first extents, like the official driver does)
and to reschedule itself (again via inode eviction).
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
The previous patch introduced a regression in generic/013, seen when the
test is allowed to run for 20 minutes or so. The result is a corrupted
filesystem where one inode appears to have too many dentries, and
another inode has too few to balance.
The problem is that query refreshes just rerun the query once and expect
the correct result, but dentry queries for an existing record actually
need to be called repeatedly to deal with hash collisions. For
normalization-sensitive filesystems (which I use for most of my tests)
"collisions" are constant because there is no hash. So if a split
happens during a dentry deletion, chances are very high that the refresh
will end up pointing to the wrong record.
Rather than add awkward code to apfs_query_refresh() to deal with this,
transform the multiple query from apfs_dentry_lookup() into a regular
exact query before returning, setting the key to the unnormalized name.
The refresh code will have no trouble working with that.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
I've been working on making some very needed changes to how node splits
work. The plan is to always rerun queries after a split, to keep the
query chain valid. One problem I've encountered while implementing this
is that the search key is not really preserved by apfs_update_inode().
We have a pointer to it, but it was allocated on the stack of some other
function, so by then the data is just noise.
I've always thought it was silly to allocate both a query and a key
before a lookup. Instead, turn the search key into a field within the
query structure, and copy it to the children as the query gets run. This
will ensure that the key remains available for future reruns.
One thing to note is that, while this simplifies most callers a little
bit, some of them still allocate a key on the stack and later copy it to
the query. I didn't want to make any big changes and risk introducing
bugs at this point, since I'm busy with other stuff.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
The build appears to be broken for some configurations on the Linux 6.5
release candidate:
https://github.com/linux-apfs/linux-apfs-rw/issues/49
Some sort of compile-time overflow check is going on for strcpy(), and
several of my calls are triggering false positives. Just be safe and
switch them all to strscpy(). The qstr handling in particular looked a
bit too weird anyway.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
The driver is much closer to being usable, so I might start getting
subtler bug reports soon. To make them easier to handle, put error
messages all over the place. I should have done this from the beginning,
but I guess I didn't fully understand the need back then.
From now on, my general policy will be to use apfs_warn() for user errors
or unsupported features; apfs_err() for things that are probably corruption
or io errors; and apfs_alert() for things that are most likely bugs.
These last two should be rare, so the same error/alert will be thrown by
several layers in the call stack, to provide as much information as
possible. Be careful not to flood the console in normal situations.
Also, make messages with a log level lower than warning output their
function name and line number, which I think will help debugging more
than the actual messages.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Make a few changes to orphan link behaviour to better match the official
implementation. It's not clear to me that this is mandatory, but I want
my fsck to work properly with official images: I've already modified it
to that end, so these changes need to be applied here as well to avoid
potential reports of corruption.
Anyway, there are four changes to orphan creation:
- Report the inode's link count as 0, that is, don't count the orphan
link itself.
- Preserve the original primary name for the inode.
- Preserve the inode's parent id.
- Update the volume's file count immediately, instead of waiting until
the orphan gets purged.
In essence, stop treating orphaned inodes as real files.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
The official apfs implementation uses the "0x%llx-dead" template to name
its orphan files. It's not really mandatory that all implementations
follow the same scheme, but try to do it anyway.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
I'm trying to build with "-W" for the first time, and it reports that
apfs_free_query() doesn't use the 'sb' parameter. Get rid of it.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Add a 'data' field to in-memory objects and point it to the buffer
head's 'b_data'; from now on, always access the data through this field
instead of the bh. To make sure I remembered to clean up all instances,
rename the buffer head in the object from 'bh' to 'o_bh'.
The point is to allow nodes to work with a data block that exists only
in memory, and the next patch will take advantage of this to improve
apfs_node_split(). Since objects that are not nodes will not need this
feature, only adjust the cleanup done in apfs_node_release() for now,
though maybe object cleanups should all be done in one place.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
The 'hashed' argument is never used by apfs_init_drec_key(), since we
can figure that out from the superblock. Get rid of it.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Profiling has shown that the module spends too much time inside
apfs_lookup(). I had never really checked if the dentry cache was
working, so naturally that turned out to be the problem: I had always
assumed that the strings inside qstr structures were null-terminated,
but they may actually be a single component in a pathname, terminated by
a forward slash. The result was that we always searched the cache for
full pathnames, which were naturally never found.
To fix this, always pass the filename length from the qstr to the unicode
handlers, and make them work without assuming null termination.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
It seems that d_instantiate_new() was introduced in 4.17 to fix a
problem with lockdep. I won't use lockdep myself in such an old kernel,
and I don't expect users to care. So, to get the module to build with
those kernels, just revert the code to unlock_new_inode/d_instantiate,
as was done before.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
I'm running generic/257 of xfstests, and it reports that getdents64()
returns repeated values for d_off. It's been years since I implemented
this stuff, but it would seem that I thought only the final value of the
context position would matter, and I neglected to update it after each
call to dir_emit(). Fix this.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Implement the methods needed to support the sync family of syscalls, such
as sync() and fsync(). Do this by just committing the whole transaction,
which is not very performant but should be technically valid. I would
like to come back to this eventually and get it right, but it will
probably be tricky and I would need to understand how btrfs and others
handle it.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>
Don't write inode metadata (including the cached extent) to the
memory-mapped blocks until the transaction is about to be committed.
This should allow several consecutive metadata changes to be done in a
single operation. Most importantly, a new extent for appended content
will only be written once, in the end. Right now it's rewritten after
each block.
The performance improvement from this is not as big as I had hoped: it's
under 20%. I'm also sure this isn't the right way to approach this,
since I could be using the dirty inode lists from the vfs. So I'll need
to revise this patch at some point, but I'm leaving it for now.
Signed-off-by: Ernesto A. Fernández <ernesto@corellium.com>