302 Commits

Author SHA1 Message Date
Samuele Pedroni
d92fa5844e wrappers/services.go: do not restart disabled or inactive services
Merge pull request #10266 from mardy/service-restart


This implementes the suggestion described in the forum:
https://forum.snapcraft.io/t/command-line-interface-to-manipulate-services/262/47

Fixes: https://bugs.launchpad.net/snapd/+bug/1803212
2021-06-18 09:48:18 +02:00
Samuele Pedroni
54e84fb8db many: fix imports order (according to gci)
had to make the comment in wrappers/services_test.go a one-liner
otherwise half of it is lost

last set of files needing changing (as per current master)
2021-06-16 09:54:31 +02:00
Ian Johnson
bcbc27a816 systemd: refactor "[not set]" handling in getPropertyUintValue
As suggested by Michael

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
2021-06-11 09:54:22 -05:00
Alberto Mardegan
256dafa6c2 many: address review comments 2021-06-11 11:36:40 +03:00
Ian Johnson
00c84e7e2e systemd: refactor property parsers for int values in CurrentTasksCount, etc
As suggested by Maciej, Michael and Samuele in snapcore/snapd#10354.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
2021-06-10 15:31:31 -05:00
Alberto Mardegan
8d7b3e68be systemd: create a systemdtest package 2021-06-09 16:53:03 +03:00
Alberto Mardegan
0bb437a669 many: address review comments 2021-06-09 08:55:44 +03:00
Alberto Mardegan
1f6d7efd01 Merge branch 'master' into service-restart 2021-06-09 08:38:56 +03:00
Alberto Mardegan
f5fba92f9e Merge branch 'master' into service-restart 2021-06-08 08:55:17 +03:00
Ian Johnson
e71b122c53 systemd/systemd: use ParseUint and parse a 64 bit max size for MemoryCurrent
This is needed because on older systemd systems such as Xenial, CentOS 7 and
Amazon Linux 2, systemd reports the current memory usage as 16 Exbibytes, which
does not fit into a 32-bit unsigned int.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
2021-06-07 09:18:11 -05:00
Ian Johnson
14a709fe81 systemd: add CurrentTasksCount to get number of tasks in a slice/service
This will be used for quota group tracking purposes.

Also refactor to share some common code around getting various properties from
systemctl like this that share some parsing code.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
2021-06-07 09:17:57 -05:00
Ian Johnson
5904235e4a systemd: also recognize "unknown" is-active output as being inactive
CentOS 7 and Amazon Linux 2 with systemd 219 report an unknown, non-existent
slice unit when queried whether it is active or not a "unknown", so add this
string to the cases that mean the unit is not active.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
2021-06-07 09:17:11 -05:00
Samuele Pedroni
ea65e1c4fc systemd: rename Error.exitCodeErr to runErr for clarity 2021-06-03 14:33:36 +02:00
Michael Vogt
887a7f2523 systemd: improve systemctl error reporting
We recently had a build failure on trusty in a schroot environment.
It turned out that the issue was that no systemctl binary was installed
there. But the error was a bit misleading:
```
[--version] failed with exit status 0:
```

This commit improves the error reporting for non-exit code errors
and errors that do not have an associated exit status.
2021-06-02 18:08:16 +02:00
Ian Johnson
73b89dac11 Merge pull request #10306 from anonymouse64/feature/quota-groups-9
snap/quota: add CurrentMemoryUsage for current memory usage of a quota group

The function returns 0 for slices that don't currently exist.
2021-05-26 12:32:41 -05:00
Ian Johnson
a93e9eff30 systemd/systemd.go: adjust error message for consistency
Thanks to Samuele for the suggestion

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
2021-05-26 11:57:32 -05:00
Ian Johnson
9f3904f674 systemd/systemd.go: adjust error messages
Thanks to Michael and Samuele for the suggestions.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
2021-05-26 09:51:08 -05:00
Ian Johnson
42a8bc304e systemd/systemd.go: use systemctlError interface instead of *systemd.Error
This lets us actually mock an inactive service result outside of the systemd
package without mocking the systemctl command itself.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
2021-05-25 21:03:47 -05:00
Ian Johnson
172ed92467 systemd: add CurrentMemoryUsage to get current memory usage for a unit
The unit can be a service or a slice. If the unit or slice doesn't exist or is
not active, then an error is returned.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
2021-05-25 14:44:20 -05:00
Alberto Mardegan
7ddeddf110 systemd: move mock to systemd package 2021-05-18 12:46:37 +03:00
Pawel Stolowski
c2479d7bdd systemd: wait for zfs mounts
* Wait for zfs mounts.

* Add a comment linking to https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1922293

* Update mountunit test for After=zfs-mount.service.
2021-04-23 12:12:52 +02:00
Ian Johnson
406dbc0c96 o/servicestate/servicemgr.go: add ensure loop for snap service units (#10164)
This fixes LP #1924805 with a few different parts.

We stop using Requires=usr-lib-snapd.mount in snap app service unit files, since this causes those services to be killed when, as part of a normal snapd refresh (not an app snap refresh) we call systemctl stop usr-lib-snapd.mount).
We introduce a new servicestate manager Ensure() function which will use EnsureSnapServices to re-write snap service unit files at most once per-snapd start. This ensures that units which were written with Requires=usr-lib-snapd.mount are re-written now to use Wants=usr-lib-snapd.mount instead.
This new servicestate Ensure() function will also track what changes were made to services, and if any services were modified to no longer include Requires= (thus fixing the bug for future snapd refreshes), then we will go and decide if any of those modified units were running and killed by systemd during the snapd snap refresh process. If they were running, then we will restart them.
If we identify services which were killed and we tried to start them back up again, but we failed for whatever reason, we will fall back to rebooting immediately to try and recover the system to a working state.

Fixes: https://bugs.launchpad.net/snapd/+bug/1924805

* o/servicestate/servicemgr.go: add ensure loop for snap service units

This is a first attempt to fix LP #1924805 by having snapd when it starts up
refresh service unit definitions and also manually start services which were
killed by systemd when we stopped usr-lib-snapd.mount during a snapd snap
refresh.

Also change Requires=usr-lib-snapd.mount to Wants=usr-lib-snapd.mount to
prevent future snap services from being killed when the snapd snap is
refreshed.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* wrappers,o/snapstate/backend: try Wants instead of Requires for usr-lib-snapd.mount

* daemon/api_base_test.go: improve test comments

This was confusing to debug when failing, including the actual response Result
in the comment and also explaining that the revision must be of type string
should help folks in the future debugging failures here.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* wrappers: export SnapdToolingMountUnit

This might be better served in dirs long-term, but for now we ought to use it
from servicestate regardless of where it lives to reduce duplication.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* systemd, o/servicestate: move InactiveEnter... to systemd package; refactor

Refactor the bug fixing restarting bits of ServiceManager.Ensure() to a
separate function so it can be more easily tested and is simpler to read.

Also use the exported constant for usr-lib-snapd.mount, use the newly exported
systemd.InactiveEnterTimestamp method to get the times for units, and fix some
subtle bugs such as not running for the snapd snap, using snapstate for getting
the device context for easier mocking, and not passing a nil timings to
StartServices which panics on a nil timings.

Also add a MockEnsureSnapServices to easily disable running this logic from
other packages.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* tests/daemon, overlord: disable running servicemgr.Ensure() during tests

The servicestate manager's Ensure() loop now tries to use systemctl during it's
start, so it will race with the existing tests over which bit of code uses the
mocked systemctl first, sometimes failing the test when the Ensure() loop wins.

Instead, just disable running that Ensure() loop entirely.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* o/servicestate/servicemgr_test.go: add some basic unit tests

This is not yet a complete set of unit tests, but is a good start.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* systemd: add basic unit tests

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>

* o/servicestate: handle vitality-rank in service options, restart on failure

Also refactor the unit tests so they are less duplicative and add cases for:
* a service using VitalityRank
* needing to reboot because of failures
* disabled services not getting restarted

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* o/servicestate/servicemgr.go: rename helper

This makes it more clear that this helper is only called when we have a snapd
snap.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* o/servicestate/servicemgr_test.go: use a common time format variable

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* o/servicestate: rename mock helper, check that it is only used from inside test

Thanks to Samuele for the suggestion.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* o/servicestate/servicemgr.go: delete implemented TODO

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* o/managers_test.go: add first high level managers test for service state

More to follow...

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* o/servicestate/servicemgr_test.go: check errors from Chtimes

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* o/managers_test.go: refactor current ensure snap services test, add another

Refactor the existing test to handle the case where snapd needs to restart
services because they were killed, and add another test case where snapd fails
to start the services and thus falls back to attempting a reboot instead.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* o/servicestate/servicemgr_test.go: refactor systemctl checks

The issue is that the cleanup function that was deferred was being run as part
of the tear down function, and thus the Check() was not being raised up.
Instead making it something that is deferred directly by the Test*() functions
ensures that the fixtures are still properly setup and so they fail the test
properly.

Also fix 2 unit tests which were passing, but this was not caught due to the
above issue, and would only fail with the tests/unit/go spread test.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* systemd: handle extra whitespace at the end of output in InactiveEnterTimestamp

In real systemd output, there is a newline at the end of this string.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* o/servicestate/servicemgr.go: truncate all times to second precision

When we switched to using systemctl show --property InactiveEnterTimestamp
instead of using d-bus, we lost a significant amount of precision of the
timestamp, and now we only get times accurate to the second, while the time
precision for the modification time of the usr-lib-snapd.mount unit file is
much more accurate, ironically actually leading to the seemingly true fact that
the usr-lib-snapd.mount file was modified _after_ usr-lib-snapd.mount was
successfully stopped, since we only have the time that the usr-lib-snapd.mount
unit being stopped to the second. For example, these are the times from a VM:

usr-lib-snapd.mount file modification time = 2021-04-21 02:58:03.142412741
usr-lib-snapd.mount InactiveEnterTimestamp = 2021-04-21 02:58:03
snap.test-snap.svc1.service InactiveEnterTimestamp = 2021-04-21 02:58:03

This means that the time range we end up forming is the nonsensical range:

[ 2021-04-21 02:58:03.142412741 , 2021-04-21 02:58:03 ]

and the code is written such that the candidate timestamp of
2021-04-21 02:58:03 is less than the lower end of the range, and thus must be
outside of the range.

The fix implemented here is to truncate all times to the least precise time we
have, which is that of systemctl, meaning that all times get truncated to
seconds and the above example then becomes:

usr-lib-snapd.mount file modification time = 2021-04-21 02:58:03
usr-lib-snapd.mount InactiveEnterTimestamp = 2021-04-21 02:58:03
snap.test-snap.svc1.service InactiveEnterTimestamp = 2021-04-21 02:58:03

and has the very specific range of a single second:

[ 2021-04-21 02:58:03 , 2021-04-21 02:58:03 ]

which at least does contain the time of the snap service being stopped and thus
ends up fixing the bug.

In addition, in order to have the most predictable behavior here, we should
also always fix our time range here so that if the lower time bound somehow
ends up being in the future significantly compared to the upper time bound, we
just fix the range to be only the upper time bound, since that number at least
should be of the same precision that the candidate snap service stop times.

We should eventually switch back to using D-Bus for getting this property to
have greater precision.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* overlord: test/adjust ServiceManager accessor

* systemd: move method to match interface order

* o/servicestate: move helper function

* o/servicestate: adjust to new EnsureSnapServices signature/behavior

* o/servicestate: try to restart services only conditonally

if there was no Requires=usr-lib-snapd.mount there is no so far known
reason for any service to have been stopped by snapd refresh code

* o/servicestate: drop addressed TODOs

* overlord/servicestate: implement shortcut for file mod time before boot time

Implement a TODO in the code about taking a shortcut where we know that the
usr-lib-snapd.mount unit file was written before the time of the current boot,
we know that no services could have been killed due to a refresh of snapd
during this boot (and if we rebooted during the change, but after
usr-lib-snapd.mount was written then services would not be killed and we would
revert anyways).

Add some unit tests around this.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* o/servicestate/servicemgr.go: use only MockCommand tests over func mocking

The two tests here were basically identical and there's not much to be gained
from having the additional mocking style here when we can just use MockCommand
for both cases in relatively simple implementations.

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

* overlord: fixup incomplete/confusing comments

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>

Co-authored-by: Samuele Pedroni <pedronis@lucediurna.net>
Co-authored-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
2021-04-22 16:33:04 +02:00
Paweł Stołowski
ebf9f687d2 Fix format. 2021-04-06 14:53:03 +00:00
Paweł Stołowski
21b0210ab1 Add doc comment for Installed flag. 2021-04-06 14:53:03 +00:00
Paweł Stołowski
25ff4f2b20 Enrich UnitStatus returned by systemd.Status() with Installed flag; it's
true if UnitFileState property is not empty.
2021-04-06 14:53:03 +00:00