#rb PJ.Kack
#jira UE-203381
-Some users have reported seeing their long cooks fail due to a machine having an unreliable connection or networkcard. In these cases the network outage can be swiftly fixed but if VA failed to pull a payload before then, the cook will terminate and need to be restarted which can cost a lot of time.
- It was requested that we add an optional way to have the system retry payload pulling when running in unattended mode but also to wait for X time (usually many minutes) before trying again. If it is likely that the connection will be restored within those few minutes then waiting will be much less costly than restarting a cook.
- That payloads can be pulled on many threads at the same time makes the logic a little tricky, so rather than counting how many payloads have failed vs the retry counter we count how many times we've logged a message to the user as this logging is protected by a critical section and acts as a way to "group" together failed pulls that occur around about the same time. We then reset this counter to 0 if we detect a successful pull.
- It is possible that a pull fails because the payload is missing, in which case this logic will probably cause the counter to reset frequently and the error to not become fatal for quite some time (possibly until the cook has almost finished) but it is quite unlikly to occur and due to this I have favored erring towards simple code rather than trying to track individual payload failures vs grouped failures vs successful pulls.
- Note: That when backends fail to pull payloads they generally log errors, which will eventually cause most of our processes to return non zero to indicate failure. VA should not log errors while we are inside of a retry loop and only print out errors when we detect a problem that we cannot solve to avoid this. This is being addressed as it's own work item.
[CL 30930392 by paul chipchase in 5.4 branch]
#rb PJ.Kack
#jira UE-180383
#rnx
- With the new error flow we block the process on a failed pull and then quit the process if the connection cannot be recovered, this means that we never reach the code that reports which payloads failed.
- Added similar reporting to FVirtualizationManager::OnPayloadPullError so we can get a log of which payloads failed to pull.
-- A pull failure should not really be dependant on the payload itself so this info isn't that important but it can be useful to eliminate potential issues by allowing us to confirm that the payload does in fact exist in persistent storage and the user was just unable to access it.
[CL 28833473 by paul chipchase in ue5-main branch]
#rb none
- If we want to use slate dialogs we cannot use LazyInitConnections, however slate dialogs only work on programs that a) support slate and b) are running in some interactive form. Commandline programs, or the editor runnning in unattended mode or running a commandlet cannot make use of the slate dialog.
- If this new flag 'DisableLazyInitIfInteractive' is set to true and 'LazyInitConnections' is set to true then the VA system will lazy init its connections unless we are running an interactive program that could make use of the slate dialogs, in which case 'LazyInitConnections' is disabled for safety.
- Ideally we'd merge these two values into a single enum but I am hoping that we can change how we connect in the near term in which case we will simple deprecate 'DisableLazyInitIfInteractive'. The use case is pretty niche at the moment so this will probably be easier for the next stage of work.
[CL 26230486 by paul chipchase in ue5-main branch]
#rb none
#rnx
#preflight 645c95abaa3c584c0b1c91f0
- A url can now be provided via 'ini:Engine:[Core.VirtualizationModule]:ConnectionHelpUrl=""' that will be added to various connection errors. If the error message is via FMessageLog then the user will be able to click the url to go straight to the link.
- This sort of help will be very company specific, this url will allow teams to build up their own FAQ for their users.
- Currently used by the source control backend if the initial connection fails but will be expanded to other areas.
- The url is currently parsed and owned by the virtualization manager but at the moment the connection error is issued by each backend. So right now there is a static accessor allowing backends to get the value to use. This would not work properly for 3rd party backends and will be changed in the near term.
[CL 25424000 by paul chipchase in ue5-main branch]
#rb none
#jira UE-181418
#rnx
#preflight 6447c5482804595a04d1e375
- Each pull operation now takes a FText that can be filled in with additional info about any failures that will cause the new error dialog to be displayed.
-- At the moment we only have info to provide from the source control backend if the connection itself failed.
-- In the future we might want to completely overhaul our error logging and display all messages to the user but our logging system is not really built for that level of redirection at the moment.
- Only pass on the error if it came from a persistent backend, we don't need to worry the user about cached backends which are allowed to fail.
[CL 25178752 by paul chipchase in ue5-main branch]
#rb none
#jira UE-182205
#preflight 643d2901c947f6523a2f5845
- Most backends support some form of existence check which is run before a push to prevent uploading data that is already there. Sometimes we want to ignore that check (bugs in the backend etc) and need a way to force that to happen.
- The new config option [Core.VirtualizationModule]:ForceCachingOnPull (default false) will do this when set to true.
- The file and DDC backends will respect the flag but for now the source control backend will continue to check to avoid accidental pointless revisions being added.
[CL 25066347 by paul chipchase in ue5-main branch]
#rb none
#jira UE-180383
#preflight 64183a83691c5ebc155d21a2
- The payload pull error dialog now suports an additional custom line in the dialog that can be set via Engine:[Core.VirtualizationModule]:AdditionalPullErrorMsg. The intent is to allow teams to provide additional help and info to their users.
- The default error text is now localized.
[CL 24714178 by paul chipchase in ue5-main branch]
#rb none
#jira UE-180383
#preflight 64147be21c44ff98b969b1aa
- The new flow will display an error dialog to the user informing them of the problem and ask if they want to retry or quit the process. This is quite a shift if how we have been handling failed payload pulls up until and this is the first rough implementation so for now it is off by default.
- We only show one dialog at a time and if a new error occurs while we are waiting for user input, that thread is set to wait on the existing dialogs outcome.
[CL 24689856 by paul chipchase in ue5-main branch]
#rb PJ.Kack
#jira UE-177013
#preflight 63ece283956709374a03eb8a
- The existing config file option 'EnableCacheAfterPull' is now deprecated. If we detect that it is still in the config file we will warn the user about it but continue to apply the value to maintain compatibility.
- Add a new config file option 'EnableCacheOnPull' which controls if a payload is pushed to cached storage after it has been pulled from persistent storage. (Default true)
- Add a new config file option 'EnableCacheOnPush' which controls if a payload is pushed to cached storage before it is pushed to persistent storage. (Default true)
- We now store the info about caching in a single enum bitfield ECachingPolicy.
- Added an up to date comment section about ini file set up for FVirtualizationManager which was previously missing.
[CL 24230636 by paul chipchase in ue5-main branch]
#rb Per.Larsson
#jira UE-177015
#preflight 63eb71ea956709374ab61d5e
- So far nobody has proposed any worthwhile use case for virtualizing engine content given the following downsides:
-- We cannot ever do so internally as we would be unable to distribute the virtualized packages.
-- Projects and the engine packages cannot always be accessed via the same client spec, leading to problems if we try.
-- Engine content is typically small and not suited to virtualization in the first place.
- If we remove the option then we can exclude the engine packages from our VA commandlets which will give less confusing output (at the moment if you try to virtualize a project containing a single package we will report that several thousand packages were found as we need to parse engine content too)
- Deprecated the ini file options, if we detect them we will warn the user that the option is still present and that engine content now will never virtualize.
[CL 24208899 by paul chipchase in ue5-main branch]
#rb Per.Larsson
#jira UE-169626
#rnx
#preflight 63bd670a71079a8d1c0e837b
- Since the API was forcing the caller to pass in a results structure to be filled in, we might as well make it the return value.
- Added a ::WasSuccessful method to the results structures that can be used instead of checking if the result had errors or not.
- Remove the reset method from FVirtualizationResult/FRehydrationResult as they no longer need it.
- The older deprecated methods still use the results enum, so we cannot easily deprecate those enums yet.
[CL 23626072 by paul chipchase in ue5-main branch]
#rb Per.Larsson
#jira UE-169626
#rnx
#preflight 63bd0ebfd862fdd347bce1fe
### VA System
- This allows us to provide the user with more ways to customize the rehydration and return more detailed info about it if the calling code wishes to log additional info. In both cases we can extend the options and the data returned without changing the API.
- At the moment the only flag we support is 'Checkout', which requests that the rehydration process checkout any package that it needs to modify rather than giving an error. This means that the user does not need to check packages out from revision control before running the rehydration process.
-- We still check if packages can be modified and warn the user if they can't, as package files could be locked in other ways.
- The rehydration process will now long the time taken if verbose logging is set for the category 'LogVirtualization'
### UnrealVirtualizationTool
- The virtualize command now reports how many packages were checked out if the flag was set.
- The rehydration command now supports a '-Checkout' commandline flag, which when enabled will use the new api to checkout the packages that need to be checked out when rehydrated.
[CL 23625132 by paul chipchase in ue5-main branch]
#rb Per.Larsson
#jira UE-169626
#rnx
#preflight 639c4112012902cb8db43e13
- This allows us to provide the user with more ways to customize the virtualization and return more detailed info about it if the calling code wishes to log additional info. In both cases we can extend the options and the data returned without changing the API.
- Previously if we virtualized a package that was not checked out in revision control we would warn the user and then skip updating the package file on disk. This means the payloads would be uploaded but the user would be left with no local changes. Since sometimes we know we don't need to check out any package (virtualizing the packages in a change list for example) we don't want to always incur the cost of polling reivision control to see which packages do need checking out. This is why we now allow the caller to request package files be checked out via the new options enum EVirtualizationOptions.
-- If the EVirtualizationOptions::Checkout flag is provided we will poll the revision control status of all package files and then check out those which need it.
-- We still check if packages can be modified and warn the user if they can't, as package files could be locked in other ways.
- Added a new utility function to SourceControlUtilties to make it easier to check out packages. There is similar functionality elsewhere in the code base but the virtualization module is too low level to make use of it.
- Updated existing code that calls ::TryVirtualizePackages and add cases of ''using namespace UE::Virtualization' where required to improve readability.
- The UnrealVirtualizationTool now supports a new cmdline option "-checkout" that can be used when virtualizing packages. This will checkout any package that was actually virtualized so the result can be saved back out to the workspace domain. This means we no longer require the caller to have checked out the packages before running the tool.
[CL 23536832 by paul chipchase in ue5-main branch]
#rb Per.Larsson
#jira UE-163093
#rnx
#preflight 63526abcae33b04ec1ee4a65
- The caller can now request more than one payload to be pulled froim the system at a time. In theory this will allow backends to pull data quicker.
-- The file system backend works with the new system but makes no attempt to take advantage as the backend is not intended for serious production use.
-- The DDC backend should be taking full advantage of the new batching API
-- Note that although the source control API does attempt to take advantage of batching, the internal source control API implementation is not doing so. This will be addressed in a future submit.
- This does make the pulling logic a bit more complicated in FVirtualizationManager as we need to deal with some payloads being found in the first backend, some in the second and some in the third etc.
-- To help deal with this a new class FPullRequestCollection has been added which abstracts a lot of complexity away.
[CL 22707955 by paul chipchase in ue5-main branch]
#rb Per.Larsson
#rnx
#preflight 634805f7f93be0f634b8ed5b
- Added a new FArchive 'FFixedBufferWriterArchive' that will let me serialize to an already allocated, fixed sized memory buffer.
- Moved the code for opening and reading packages to shared utility functions so that it is easier to keep the error messages consistent.
- Moved common code to a new function ::TryRehydrateBuilder so that both rehydration methods can share it.
- Note that technically we are changing the API since now we accept a TConstArrayView for the package file paths for both the virtualization and rehydration paths, but existing code will be compatible with this change.
[CL 22511406 by paul chipchase in ue5-main branch]
#rb Per.Larsson
#rnx
#preflight 633ad936c250f60502b2d8cb
- Everything was working but when a cvar is set via a device profile we expect the cvar to be registered sooner, hence the warning.
- Added a new define UE_USE_GLOBAL_CVAR to FVirtualizationManager.cpp, when set we use a static global cvar for 'VA.AllowPkgVirtualization' rather then creating and registering one when the system is mounted.
- Since the cvar can now outlive the manager we also needed to add a way to remove the bound lambda from the cvar when the manager is destroyed.
[CL 22322164 by paul chipchase in ue5-main branch]
#rb Per.Larsson
#jira UE-164278
#rnx
#preflight 63284a248c3def91aa62bd83
- This replaces the old cvar 'VA.DisableVirtualization' which is being removed.
- We will log each time the value is switched to aid in future debugging.
- The name needed to be changed as we needed to invert how it worked, also "DiableVirtualization" was a bit misleading and made people think it turned the entire system off, not just the process of virtualizing packages on submit.
-- The member bEnablePayloadVirtualization was also renamed to bAllowPackageVirtualization to make its purpose clearer.
- The cmdline version was changed from 'VA-DisableVirtualization' to 'VA-SkipPkgVirtualization' for similar reasons.
[CL 22146910 by paul chipchase in ue5-main branch]
#rb Per.Larsson
#jira UE-163103
#rnx
#preflight 6318989c2b7fe03eb664e9f0
### VirtualizationSystem/VirtualizationManager
- Added an overload of ::Push taking just one FPUshRequest so that people don't have to keep adding MakeArrayView boiler plate when pushing a single request
- Change the order of the last two parameters for the raw ::Push call as this will group all of the payload specific parameters together and leave the target storage type at the end. It is unlikely that anyone is calling the older version but it has been deprecated for safety.
### IVirtualizationBackend
- The none FPushRequest overload of ::PushData is no longer virtual, it just converts the parameters to FPushRequest and calls that overload instead. In this way we now only have one pushing code path in our backends. We could probably look into removing this overload at this point (since the higher level IVirtualizationSystem will now convert all push requests into a FPushRequest form) but it is not considered worth it at the moment when the simple overload covers our needs.
- Removed EPushResult in favour of just returning true/false for the overall operation.If the caller needs a more detailed breakdown then they will have to use an overload that takes an FPushRequest over raw parameters.
-- At the moment FPushRequest does not contain a full breakdown of what happened, so with this submit we are effectively losing the ability to find out if the payload was already in the backends or not, however the batch version of push was already not returning this info so it is not a big loss. Fixing FPushRequest to return a better break down of what happened will be done in UE-160942
- Removed the none batch Push paths from the source control and ddc backends as they already supported batch pushing.
- File backend needed to be converted to supporting batch pushing, which is pretty much the same code as before except we need to iterate over the container of FPushRequests.
-- The backend does not early out on error as it tends to be quite fast. We might want to consider an official policy for the VA system, if we should early out of errors or not.
[CL 21907558 by paul chipchase in ue5-main branch]
#rb Devin.Doucette
#jira UE-161599
#rnx
#preflight 6303c8d65a5d4e4624e7bf52
- There are some use cases that require the VA system to be initialized and configured correctly but would prefer that the backend connections only run if absolutely needed (usually when a payload is pulled or pushed for the first time), this change provides four different ways of doing this:
-- Setting [Core.VirtualizationModule]LazyInitConnections=true in the Engine ini file
-- Setting the define 'UE_VIRTUALIZATION_CONNECTION_LAZY_INIT' to 1 in a programs .target.cs
-- Running with the commandline option -VA-LazyInitConnections
-- Setting the cvar 'VA.LazyInitConnections' to 1 (only works if it is set before the VA system is initialized, changing it mid editor via the console does nothing)
--- Note that after the config file, each setting there only opts into lazy initializing the connections, setting the cvar to 0 for example will not prevent the cmdline from opting in etc.
- In the future we will allow the connection code to run async, so the latency can be hidden behind the editor loading, but for the current use case we are taking the minimal approach.
-- This means we only support the backend being in 3 states. No connection has been made yet, the connection is broken and the connection is working.
-- To keep things simple we only record if we have attempted to connect the backends or not. We don't check individual backends nor do we try to reconnect failed ones etc. This is all scheduled for a future work item.
- If the connections are not initialized when the VA system is, we wait until the first time someone calls one of the virtualization methods that will actually use a connection: Push/Pull/Query
-- We try connecting all of the backends at once, even if they won't be used in the call to keep things simple.
- Only the source control backend makes use of the connection system. The horde storage (http) backend could take advantage too, but it is currently unused and most likely going to just be deleted so there seemed little point updating it.
- If we try to run an operation on an unconnected backend we only log to verbose. This is to maintain existing behaviour where a failed backend would not be mounted at all. This logging will likely be revisited in a future work item.
[CL 21511855 by paul chipchase in ue5-main branch]