Currently, when CHASE_PARENT is specified, we chase the parent directory
of the symlink itself. Let's change this and chase the parent directory
of the symlink target so that trying to open the actual file later with
O_NOFOLLOW doesn't fail with ELOOP.
To get the current behavior, callers can add CHASE_NOFOLLOW to chase
the parent directory of the symlink itself.
Currently, when CHASE_MKDIR_0755 is specified, we create all components
of the path as directories. Instead, let's change the flag to only create
parent directories and leave the final component of the PATH untouched.
Also, allow CHASE_NONEXISTENT with CHASE_MKDIR_0755 now that it doesn't
create all components anymore.
Finally, rework chase_symlinks_and_open() and chase_symlinkat_at_and_open()
to always chase the parent directory and use xopenat() to open the final
component of the path. This allows us to pass O_CREAT to create the file or
directory (O_DIRECTORY) if it is missing. If CHASE_PARENT is configured, we
just reopen the parent directory that we chased.
Use the flags MEMFD_EXEC or MEMFD_NOEXEC_SEAL as applicable.
These warnings instruct the kernel wether the memfd is executable or
not.
Without specifying those flags the kernel will emit the following
warning since version 6.3,
commit 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC"):
kernel: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=1 'systemd'
For nspawn and services we first turn off two-way propagation of mounts
from host to sandbox via MS_SLAVE, and then set MS_SHARED again, so that
we create a new mount prop peer group again, and that we provide
behaviour similar to what we provide on the host further down the tree.
Let's do the same in detach_mount_namespace(), which we use for the
temporary mounts in the implementation of --image= in various tools.
This doesn't fix any immediate issue, but ensures we expose somewhat
systematic behaviour: whenever we detach mount namespaces we always set
things back to MS_SLAVE in the child.
On some ARM platforms, the dynamic linker could use PROT_BTI memory protection
flag with `mprotect(..., PROT_BTI | PROT_EXEC)` to enable additional memory
protection for executable pages. But `MemoryDenyWriteExecute=yes` blocks this
with seccomp filter denying all `mprotect(..., x | PROT_EXEC)`.
Newly preferred method is to use prctl(PR_SET_MDWE) on supported kernels. Then
in-kernel implementation can allow PROT_BTI as necessary, without weakening
MDWE. In-kernel version may also be extended to more sophisticated protections
in the future.
All daemons use a similar scheme to read their main config files and theirs
drop-ins. The main config files are always stored in /etc/systemd directory and
it's easy enough to construct the name of the drop-in directories based on the
name of the main config file.
Hence the new helper does that internally, which allows to reduce and simplify
the args passed previously to config_parse_many_nulstr().
Besides the overall code simplification it results:
16 files changed, 87 insertions(+), 159 deletions(-)
it allows to identify clearly the locations in the code where configuration
files are parsed.