Files
Frederik Lotter 27b88f8362 systemd: fix snapd systemd-unit stop progress notifications (#11340)
* systemd: fix snapd systemd-unit stop progress notifications

The current behaviour of the systemd stop progress code is to print
the list of systemd units not stopped after 250ms, and then after
every 20 seconds (notice ticker), erroneously print out the same list
even though some of the units may have stopped in the meantime.

However, the bigger issue is that the non-mocked (real) code starts the
stop sequence by making a blocking systemctl stop request and then
afterwords (once all units are stopped) proceed to the progress code,
which has no effect and exits immediately.

The unit tests does not pick this up because the mock systemctl does
not no anything, and therefore the simulation of progress works as
expected because the output of the 'systemctl show' progress calls are
mocked to simulate certain outcomes.

Finally, the 'Stop()' function provided a 'timeout' argument which
suggested the user could control when a stop attempt should timeout
and fail. However, this feature did not work as the original call was
blocking. Furthermore, this features does not make sense because it
creates a second place where timeouts are introduced. The first place
is in the systemd back-end where 'TimeoutStopSec' is optionally
configured by snapd. In contrast with systemd's actual timeout
mechanism described, the 'Stop()' call offered another timeout, but
this mechanism is inconsistently used all over the code-base with
different hard-coded timeout values, not matching the underlying
service defaults to configured values. In the worse case scenario
the snapd 'Stop()' would return an error due to timeout while systemd
is still busy stopping a unit, because it was a longer unit timeout
configure. The timeout value offered by snapd has no real purpose if
carefully analysed.

The following changes were made:

1. Remove the systemd Stop() timeout argument. The timeout is now
   automatically provided by systemdctl when it returns (blocking).
   This allows the timeout to always be aligned with the actual
   timeout 'TimeoutStopSec' provided by the unit, and takes into
   account systemd timeout behaviour related to 'ExecStop' commands.

2. Change the blocking systemctl stop command to async run in a go
   routine allowing progress tracking to continue in parallel. Note
   that the call to Stop() is still blocking because the progress
   code waits until either units are all stopped or the stop command
   returns.

3. Change the notice ticker to rather implement an initial notice
   silence period (1s). After this expires, the first notification
   is generated with the list of units still not stopped. This
   delay was chosen based on empirical tests performed on a
   Raspberry Pi 3 (representing slower targets) under heavy IO and
   CPU load (where it took up to 750ms for 4 services to stop).

4. Change the progress logic to always send a notification after
   the notice silence period on any change to the list of unit
   we are still waiting for, and always only poll the state of the
   remaining units.

5. Add additional unit tests to better cover the new progress and
   notification logic, and make existing tests more thorough.

6. Add critical sections inside the mocked systemctl functions to
   prevent corruption during concurrent modifications to some of
   the test suite logs and counters. This was needed because the
   Stop() is now issued from a go routine.

The following example shows the expected output for 6 systemd units
terminated with the 'Stop' function. The last four (test3 - test6)
has a random post SIGTERM delay before the process exits. The following
notifications are generated for this example after the 1s notification
silence period expires:

Waiting for "test3", "test4", "test5", "test6" to stop.
Waiting for "test3", "test4", "test5" to stop.
Waiting for "test3", "test5" to stop.
Waiting for "test3" to stop.

This patch is required by https://github.com/snapcore/snapd/pull/11113

Signed-off-by: Fred Lotter <fred.lotter@canonical.com>

* typos

* Improved the system stop progress logic.

* Number of small review improvements

* Remove redundant channel write

* reviews: add thread safety into MockSystemctl

* many: have a separate systemd.MockSystemctlWithDelay

keep MockSystemctl signature as before

as we need to introduce controlled delays only in the systemd package
own tests so far

Co-authored-by: Samuele Pedroni <pedronis@lucediurna.net>
2022-05-20 17:29:05 +02:00
..
2021-07-16 18:10:25 +02:00
2021-07-16 18:10:25 +02:00