Add config file options so that the virtualization system is able to retry pulling failed payloads when running in unattended mode (defaults to off)

#rb PJ.Kack
#jira UE-203381

- Some users have reported that their long cooks fail because a machine has an unreliable connection or network card. In these cases the network outage can be fixed quickly, but if VA failed to pull a payload before then, the cook terminates and has to be restarted, which can cost a lot of time.
- It was requested that we add an optional way to have the system retry payload pulling when running in unattended mode, and to wait for X time (usually many minutes) before trying again. If the connection is likely to be restored within those few minutes, waiting will be much less costly than restarting a cook.
- Because payloads can be pulled on many threads at the same time, the logic is a little tricky. Rather than comparing the number of failed payloads against the retry count, we count how many times we've logged a message to the user. This logging is protected by a critical section and acts as a way to "group" together failed pulls that occur at around the same time. We then reset this counter to 0 if we detect a successful pull.
- It is possible that a pull fails because the payload is missing, in which case this logic will probably cause the counter to reset frequently and the error to not become fatal for quite some time (possibly until the cook has almost finished). This is quite unlikely to occur, so I have favored simple code over trying to track individual payload failures vs grouped failures vs successful pulls.
- Note: when backends fail to pull payloads they generally log errors, which will eventually cause most of our processes to return non-zero to indicate failure. To avoid this, VA should not log errors while we are inside a retry loop and should only print errors when we detect a problem that we cannot solve. This is being addressed as its own work item.

[CL 30930392 by paul chipchase in 5.4 branch]
2026-03-26 18:15:20 -07:00 · 2024-01-26 12:57:29 -05:00
parent a834e64e3c
commit 0d0bba910f
3 changed files with 75 additions and 5 deletions
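
The grouping described in the commit message can be sketched in standard C++ as follows. This is a minimal illustration, not the engine code: the class `FUnattendedRetryGate`, its method names, and the use of `std::mutex` with `try_to_lock` in place of the engine's critical section are all assumptions made for the example. The first thread to report a failure takes the lock, counts one grouped attempt, and waits; threads that fail while the lock is held block until that wait ends and reuse its decision instead of consuming another retry.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the retry state held by the virtualization manager.
class FUnattendedRetryGate
{
public:
    FUnattendedRetryGate(int InRetryCount, std::chrono::milliseconds InRetryWait)
        : RetryCount(InRetryCount), RetryWait(InRetryWait)
    {
        assert(InRetryWait.count() >= 0); // a negative wait makes no sense
    }

    // Zero or negative retry counts disable the system.
    bool ShouldRetryWhenUnattended() const { return RetryCount > 0; }

    // Any successful pull resets the grouped failure counter.
    void OnPullSucceeded() { FailureMsgCount.store(0); }

    // Called by a thread whose pull failed. Returns true if the caller should retry.
    bool OnPullFailed()
    {
        if (!ShouldRetryWhenUnattended())
        {
            return false;
        }

        std::unique_lock<std::mutex> Lock(Mutex, std::try_to_lock);
        if (!Lock.owns_lock())
        {
            // Another thread is already handling this outage: wait for it to
            // finish, then share its decision without using up another retry.
            Lock.lock();
            return bLastDecision;
        }

        // This thread speaks for the whole group of near-simultaneous failures.
        bLastDecision = FailureMsgCount.load() < RetryCount;
        if (bLastDecision)
        {
            ++FailureMsgCount;                      // one "message" per group
            std::this_thread::sleep_for(RetryWait); // give the network time to recover
        }
        return bLastDecision;
    }

private:
    const int RetryCount;
    const std::chrono::milliseconds RetryWait;
    std::mutex Mutex;
    bool bLastDecision = false; // only read or written while holding Mutex
    std::atomic<int> FailureMsgCount{0};
};
```

With a retry count of 2, two consecutive grouped failures return `true` and a third returns `false`; a call to `OnPullSucceeded()` in between restores the full budget. This also shows the trade-off noted in the commit message: a persistently missing payload interleaved with successful pulls keeps resetting the counter.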
--- a/Engine/Source/Developer/Virtualization/Private/VirtualizationManager.h
+++ b/Engine/Source/Developer/Virtualization/Private/VirtualizationManager.h
@@ -135,6 +135,17 @@ struct FAnalyticsEventAttribute;
 *												before it was pulled from a backend later in the hierarchy. Can be used to try and skip
 *												expensive existence checks, or if a backend is in a bad state where it believes it has the payload
 *												but is unable to actually return the data. [Default=false]
+ * UnattendedRetryCount [int32]:				How many times the process should retry pulling payloads after a failure is encountered if the
+												process is unattended. Usually when a payload pull fails we ask the user to try and fix the issue
+												and retry, but in unattended mode we just log an error and terminate the process. In some cases
+												such as build machines with unreliable internet it is possible that the process could recover in
+												which case setting this value might help. Zero or negative values will disable the system. 
+												Note: If many pulls are occurring at the same time on many threads a very short network outage might
+												spawn many errors in which case we try to group these errors into a single 'try' so 32 errors on 32
+												threads would not immediately blow past a retry count of 30 for example. [Default=0]
+ * UnattendedRetryTimer [int32]:				If 'UnattendedRetryCount' is set to a positive value then this value sets how long (in seconds)
+ *												the process should wait after a failure is encountered before retrying the pull. Depending on the
+												likely cause of the failure you may want to set this value to several minutes. [Default=0]
 */

 namespace UE::Virtualization
@@ -274,6 +285,9 @@ private:
 	/** Determines if the default filtering behavior is to virtualize a payload or not */
 	bool ShouldVirtualizeAsDefault() const;

+	/** Returns true if the process will attempt to retry a failed pull when the process is in unattended mode */
+	bool ShouldRetryWhenUnattended() const;
+
 	void BroadcastEvent(TConstArrayView<FPullRequest> Ids, ENotification Event);
 	
 private:
@@ -331,6 +345,13 @@ private:

 	/** Optional url used to augment connection failure error messages */
 	static FString ConnectionHelpUrl;
+
+	/** The number of times to retry pulling when errors are encountered in an unattended process, values <=0 disable the system */
+	int32 UnattendedRetryCount = 0;
+
+	/** How long (in seconds) to wait after payload pulling errors before retrying. Does nothing if 'UnattendedRetryCount' is disabled */
+	int32 UnattendedRetryTimer = 0;
+
 private:

 	/** The name of the current project */
@@ -360,6 +381,9 @@ private:
 	/** Our notification Event */
 	FOnNotification NotificationEvent;

+	/** Track how many times we've displayed a message about failed payload pulls since the last successful pull (only used in unattended mode)*/
+	mutable std::atomic<int32> UnattendedFailureMsgCount = 0;
+
 	// Members after this point are used for debugging operations only!

 	struct FDebugValues