Detecting Activity failures
A Workflow can detect different kinds of Activity Execution failures through the following timeouts:
Schedule-To-Start Timeoutβ
What is a Schedule-To-Start Timeout in Temporal?
A Schedule-To-Start Timeout is the maximum amount of time that is allowed from when an Activity Task is scheduled (that is, placed in a Task Queue) to when a Worker starts (that is, picks up from the Task Queue) that Activity Task. In other words, it's a limit for how long an Activity Task can be enqueued.
The moment that the Task is picked by the Worker, from the Task Queue, is considered to be the start of the Activity Task Execution for the purposes of the Schedule-To-Start Timeout and associated metrics. This definition of "Start" avoids issues that a clock difference between the Temporal Service and a Worker might create.
Schedule-To-Start Timeout period
"Schedule" in Schedule-To-Start and Schedule-To-Close have different frequency guarantees.
The Schedule-To-Start Timeout is enforced for each Activity Task, whereas the Schedule-To-Close Timeout is enforced once per Activity Execution. Thus, "Schedule" in Schedule-To-Start refers to the scheduling moment of every Activity Task in the sequence of Activity Tasks that make up the Activity Execution, while "Schedule" in Schedule-To-Close refers to the first Activity Task in that sequence.
A Retry Policy attached to an Activity Execution retries an Activity Task.
Start-To-Close Timeout period with retries
This timeout has two primary use cases:
- Detect whether an individual Worker has crashed.
- Detect whether the fleet of Workers polling the Task Queue is not able to keep up with the rate of Activity Tasks.
The default Schedule-To-Start Timeout is β (infinity).
If this timeout is used, we recommend setting this timeout to the maximum time a Workflow Execution is willing to wait for an Activity Execution in the presence of all possible Worker outages, and have a concrete plan in place to reroute Activity Tasks to a different Task Queue. This timeout does not trigger any retries regardless of the Retry Policy, as a retry would place the Activity Task back into the same Task Queue. We do not recommend using this timeout unless you know what you are doing.
In most cases, we recommend monitoring the temporal_activity_schedule_to_start_latency
metric to know when Workers slow down picking up Activity Tasks, instead of setting this timeout.
Start-To-Close Timeoutβ
What is a Start-To-Close Timeout in Temporal?
A Start-To-Close Timeout is the maximum time allowed for a single Activity Task Execution.
The default Start-To-Close Timeout is the same as the default Schedule-To-Close Timeout.
An Activity Execution must have either this timeout (Start-To-Close) or the Schedule-To-Close Timeout set. We recommend always setting this timeout; however, make sure that Start-To-Close Timeout is always set to be longer than the maximum possible time for the Activity Execution to complete. For long running Activity Executions, we recommend also using Activity Heartbeats and Heartbeat Timeouts.
We strongly recommend setting a Start-To-Close Timeout.
The Temporal Server doesn't detect failures when a Worker loses communication with the Server or crashes. Therefore, the Temporal Server relies on the Start-To-Close Timeout to force Activity retries.
The main use case for the Start-To-Close timeout is to detect when a Worker crashes after it has started executing an Activity Task.
Start-To-Close Timeout period
A Retry Policy attached to an Activity Execution retries an Activity Task Execution. Thus, the Start-To-Close Timeout is applied to each Activity Task Execution within an Activity Execution.
If the first Activity Task Execution returns an error the first time, then the full Activity Execution might look like this:
Start-To-Close Timeout period with retries
If this timeout is reached, the following actions occur:
- An ActivityTaskTimedOut Event is written to the Workflow Execution's mutable state.
- If a Retry Policy dictates a retry, the Temporal Service schedules another Activity Task.
- The attempt count increments by 1 in the Workflow Execution's mutable state.
- The Start-To-Close Timeout timer is reset.
Schedule-To-Close Timeoutβ
What is a Schedule-To-Close Timeout in Temporal?
A Schedule-To-Close Timeout is the maximum amount of time allowed for the overall Activity Execution, from when the first Activity Task is scheduled to when the last Activity Task, in the chain of Activity Tasks that make up the Activity Execution, reaches a Closed status.
Schedule-To-Close Timeout period
Example Schedule-To-Close Timeout period for an Activity Execution that has a chain Activity Task Executions:
Schedule-To-Close Timeout period with a retry
The default Schedule-To-Close Timeout is β (infinity).
An Activity Execution must have either this timeout (Schedule-To-Close) or Start-To-Close set. This timeout can be used to control the overall duration of an Activity Execution in the face of failures (repeated Activity Task Executions), without altering the Maximum Attempts field of the Retry Policy.
We strongly recommend setting a Start-To-Close Timeout.
The Temporal Server doesn't detect failures when a Worker loses communication with the Server or crashes. Therefore, the Temporal Server relies on the Start-To-Close Timeout to force Activity retries.
Activity Heartbeatβ
What is an Activity Heartbeat in Temporal?
An Activity Heartbeat is a ping from the Worker that is executing the Activity to the Temporal Service. Each ping informs the Temporal Service that the Activity Execution is making progress and the Worker has not crashed.
Activity Heartbeats work in conjunction with a Heartbeat Timeout.
Activity Heartbeats are implemented within the Activity Definition. Custom progress information can be included in the Heartbeat which can then be used by the Activity Execution should a retry occur.
An Activity Heartbeat can be recorded as often as needed (e.g. once a minute or every loop iteration). It is often a good practice to Heartbeat on anything but the shortest Activity Function Execution. Temporal SDKs control the rate at which Heartbeats are sent to the Temporal Service.
Heartbeating is not required from Local Activities, and does nothing.
For long-running Activities, we recommend using a relatively short Heartbeat Timeout and a frequent Heartbeat. That way if a Worker fails it can be handled in a timely manner.
A Heartbeat can include an application layer payload that can be used to save Activity Execution progress. If an Activity Task Execution times out due to a missed Heartbeat, the next Activity Task can access and continue with that payload.
Activity Cancellations are delivered to Activities from the Temporal Service when they Heartbeat. Activities that don't Heartbeat can't receive a Cancellation. Heartbeat throttling may lead to Cancellation getting delivered later than expected.
Throttlingβ
Heartbeats may not always be sent to the Temporal Serviceβthey may be throttled by the Worker. The throttle interval is the smaller of the following:
- If
heartbeatTimeout
is provided,heartbeatTimeout * 0.8
; otherwise,defaultHeartbeatThrottleInterval
maxHeartbeatThrottleInterval
defaultHeartbeatThrottleInterval
is 30 seconds by default, and maxHeartbeatThrottleInterval
is 60 seconds by default.
Each can be set in Worker options.
Throttling is implemented as follows:
- After sending a Heartbeat, the Worker sets a timer for the throttle interval.
- The Worker stops sending Heartbeats, but continues receiving Heartbeats from the Activity and remembers the most recent one.
- When the timer fires, the Worker:
- Sends the most recent Heartbeat.
- Sets the timer again.
Throttling allows the Worker to reduce network traffic and load on the Temporal Service by suppressing Heartbeats that arenβt necessary to prevent a Heartbeat Timeout. Throttling does not apply to the final Heartbeat message in the case of Activity Failure. If an Activity fails just after recording progress information in a Heartbeat message, that progress information will be available during the next retry attempt, provided that the Worker itself did not crash before delivering it to the Temporal Service.
Which Activities should Heartbeat?β
Heartbeating is best thought about not in terms of time, but in terms of "How do you know you are making progress?" For short-term operations, progress updates are not a requirement. However, checking the progress and status of Activity Executions that run over long periods is almost always useful.
Consider the following when setting Activity Hearbeats:
-
Your underlying task must be able to report definite progress. Note that your Workflow cannot read this progress information while the Activity is still executing (or it would have to store it in Event History). You can report progress to external sources if you need it exposed to the user.
-
Your Activity Execution is long-running, and you need to verify whether the Worker that is processing your Activity is still alive and has not run out of memory or silently crashed.
For example, the following scenarios are suitable for Heartbeating:
- Reading a large file from Amazon S3.
- Running a ML training job on some local GPUs.
And the following scenarios are not suitable for Heartbeating:
- Making a quick API call.
- Reading a small file from disk.
Heartbeat Timeoutβ
What is a Heartbeat Timeout in Temporal?
A Heartbeat Timeout is the maximum time between Activity Heartbeats.
Heartbeat Timeout periods
If this timeout is reached, the Activity Task fails and a retry occurs if a Retry Policy dictates it.