Skip to main content

Retry

Automatically retry failed tasks before marking them as failed. Retries can be immediate or use exponential backoff to avoid overwhelming a recovering service.

Usage

Immediate Retry

Retry up to 3 times with no delay between attempts:

getLoad := plan.TaskFunc("get-load",
func(
ctx context.Context,
c *client.Client,
) (*orchestrator.Result, error) {
resp, err := c.Node.Load(ctx, "_any")
if err != nil {
return nil, err
}

return orchestrator.CollectionResult(
resp.Data,
func(r client.LoadResult) orchestrator.HostResult {
return orchestrator.HostResult{
Hostname: r.Hostname,
Changed: r.Changed,
Error: r.Error,
}
},
), nil
},
)
getLoad.OnError(orchestrator.Retry(3))

Retry with Exponential Backoff

Add exponential backoff between retry attempts using WithRetryBackoff. The delay doubles on each attempt, clamped to the max interval:

// Retry 3 times: ~1s, ~2s, ~4s between attempts.
getLoad.OnError(orchestrator.Retry(3,
orchestrator.WithRetryBackoff(1*time.Second, 30*time.Second),
))

Plan-Level Retry with Backoff

Set as the default strategy for all tasks in the plan:

plan := orchestrator.NewPlan(client,
orchestrator.OnError(orchestrator.Retry(3,
orchestrator.WithRetryBackoff(
500*time.Millisecond,
10*time.Second,
),
)),
)

OnRetry Hook

Use the OnRetry hook to observe retry attempts:

hooks := orchestrator.Hooks{
OnRetry: func(
task *orchestrator.Task,
attempt int,
err error,
) {
fmt.Printf("[retry] %s attempt=%d error=%q\n",
task.Name(), attempt, err)
},
}

Transient Poll Errors

Job polling automatically retries transient HTTP errors (404, 500) with exponential backoff. This handles the race where the agent hasn't written results yet when the SDK first polls. Non-transient errors (401, 403, network failures) fail immediately.

Example

See examples/sdk/orchestrator/features/retry.go for a complete working example that simulates transient failures to demonstrate all three retry strategies.