Previously it was possible to set delegate property for scope, but you
were not able to allow unprivileged process to manage the scope's cgroup
hierarchy. This is useful when launching manager process that will run
unprivileged but is supposed to manage its own (scope) sub-hierarchy.
Fixes#21683
since in this specific case (r == 0) `errno` is irrelevant and most likely
set to zero, leading up to a confusing message:
```
[ 120.595085] H systemd[1]: session-5.scope: No PIDs left to attach to the scope's control group, refusing: Success
[ 120.595144] H systemd[1]: session-5.scope: Failed with result 'resources'.
```
If all processes we are supposed to add are gone by the time we are
ready to do so, let's fail.
THis is heavily based on Cunlong Li's work, who thankfully tracked this
down.
Replaces: #20577
In general we almost never hit those asserts in production code, so users see
them very rarely, if ever. But either way, we just need something that users
can pass to the developers.
We have quite a few of those asserts, and some have fairly nice messages, but
many are like "WTF?" or "???" or "unexpected something". The error that is
printed includes the file location, and function name. In almost all functions
there's at most one assert, so the function name alone is enough to identify
the failure for a developer. So we don't get much extra from the message, and
we might just as well drop them.
Dropping them makes our code a tiny bit smaller, and most importantly, improves
development experience by making it easy to insert such an assert in the code
without thinking how to phrase the argument.
This mirrors the change done for systemd-resolved in
9793530228. Quoting that patch:
> We generally operate on the assumption that a source is "gone" as soon as we
> unref it. This is generally true because we have the only reference. But if
> something else holds the reference, our unref doesn't really stop the source
> and it could fire again.
In particular, we take temporary references from sd-event code, and when called
from an sd-event callback, we could temporarily see this elevated reference
count. This patch doesn't seem to change anything, but I think it's nicer to do
the same change as in other places and not rely on _unref() immediately
disabling the source.
Commit 428a9f6f1d freed u->pids which is
problematic since the references to this unit in m->watch_pids were no more
removed when the unit was freed.
This patch makes sure to clean all this refs up before freeing u->pids by
calling unit_unwatch_all_pids().
Otherwise if a daemon-reload happens somewhere between the enqueue of the job
start for the scope unit and scope_start() then u->pids might be lost and none
of the processes specified by "PIDs=" will be moved into the scope cgroup.
If processes remain in the unit's cgroup after the final SIGKILL is
sent and the unit has exceeded stop timeout, don't release the unit's
cgroup information. Pid1 will have failed to `rmdir` the cgroup path due
to processes remaining in the cgroup and releasing would leave the cgroup
path on the file system with no tracking for pid1 to clean it up.
Instead, keep the information around until the last process exits and pid1
sends the cgroup empty notification. The service/scope can then prune
the cgroup if the unit is inactive/failed.
In all the other cases, I think the code was clearer with the static table.
Here, not so much. And because of the existing dump code, the vtables cannot
be made static and need to remain exported. I still think it's worth to do the
change to have the cmdline introspection, but I'm disappointed with how this
came out.
With cgroup v2 the cgroup freezer is implemented as a cgroup
attribute called cgroup.freeze. cgroup can be frozen by writing "1"
to the file and kernel will send us a notification through
"cgroup.events" after the operation is finished and processes in the
cgroup entered quiescent state, i.e. they are not scheduled to
run. Writing "0" to the attribute file does the inverse and process
execution is resumed.
This commit exposes above low-level functionality through systemd's DBus
API. Each unit type must provide specialized implementation for these
methods, otherwise, we return an error. So far only service, scope, and
slice unit types provide the support. It is possible to check if a
given unit has the support using CanFreeze() DBus property.
Note that DBus API has a synchronous behavior and we dispatch the reply
to freeze/thaw requests only after the kernel has notified us that
requested operation was completed.
Similar, refuse triggering deps on units that cannot trigger.
And rework how we ignore After= dependencies on device units, to work
the same way.
See: #14142
Just as `RuntimeMaxSec=` is supported for service units, add support for
it to scope units. This will gracefully kill a scope after the timeout
expires from the moment the scope enters the running state.
This could be used for time-limited login sessions, for example.
Signed-off-by: Philip Withnall <withnall@endlessm.com>
Fixes: #12035
No functional change, just adjusting code to follow the same pattern
everywhere. In particular, never call _verify() on an already loaded unit,
but return early from the caller instead. This makes the code a bit easier
to follow.
unit_load_fragment_and_dropin() and unit_load_fragment_and_dropin_optional()
are really the same, with one minor difference in behaviour. Let's drop
the second function.
"_optional" in the name suggests that it's the "dropin" part that is optional.
(Which it is, but in this case, we mean the fragment to be optional.)
I think the new version with a flag is easier to understand.
This is the most basic consumer of the new systemd-vs-kernel checker,
both acting as a reasonable standalone exerciser of the code, and also
as a way for easy inspection of deviations from systemd internal state.