Retries & Error Handling
Last updated April 7, 2026
RetryPolicy, continueOnError, onFailure hooks, and the DLQ.
Things fail. Networks flake, disks fill up, APIs return 503. The workflow engine gives you several layers of error handling — retry policies at the job level, continueOnError at the step level, onFailure hooks for cleanup, and the dead-letter queue for runs that exhausted every option. Pick the right layer for the failure you're dealing with.
The layers
| Layer | Scope | Use for |
|---|---|---|
| retries (job) | Whole job re-runs | Flaky tests, transient network errors, rate limits |
| continueOnError (step) | Single step failure | Non-critical side effects (notifications, analytics) |
| hooks.onFailure (job) | Cleanup after failure | Releasing locks, rolling back, alerting |
| DLQ | Terminal run state | Manual intervention after everything else gave up |
Each operates on a different unit. Retries re-run the whole job; continueOnError keeps a job going past a broken step; hooks fire after the fact; the DLQ is where runs go to die.
RetryPolicy
Declared on a job:
```ts
interface RetryPolicy {
  max: number;               // non-negative int
  backoff: 'exp' | 'lin';    // default 'exp'
  initialIntervalMs: number; // positive int, default 1000
  maxIntervalMs?: number;    // positive int
}
```

When a job fails, the engine waits according to the backoff and re-queues the whole job. max is the number of retries, not the total attempt count — max: 2 means up to 3 total attempts (1 original + 2 retries).
Backoff modes
Exponential ('exp', default):

```
attempt 1: initialInterval
attempt 2: initialInterval × 2
attempt 3: initialInterval × 4
...
capped at maxIntervalMs
```

Linear ('lin'):

```
attempt 1: initialInterval
attempt 2: initialInterval × 2
attempt 3: initialInterval × 3
...
capped at maxIntervalMs
```

With initialIntervalMs: 1000 (the default), exponential gives you 1s, 2s, 4s, 8s, 16s…; linear gives you 1s, 2s, 3s, 4s, 5s…. Exponential is usually right for external API failures (it gives the upstream time to recover); linear is fine for in-process flakiness.
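The two schedules can be sketched as a small helper. This is purely illustrative — a hypothetical function mirroring the RetryPolicy semantics above, not part of the engine's API:

```typescript
type Backoff = 'exp' | 'lin';

// Compute the wait before each retry (1-based attempt index), following
// the RetryPolicy semantics: exponential doubles, linear grows by one
// initial interval per attempt, both capped at maxIntervalMs.
function retryDelays(
  max: number,
  backoff: Backoff,
  initialIntervalMs: number,
  maxIntervalMs = Infinity,
): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= max; attempt++) {
    const factor = backoff === 'exp' ? 2 ** (attempt - 1) : attempt;
    delays.push(Math.min(initialIntervalMs * factor, maxIntervalMs));
  }
  return delays;
}

// With the 1000ms default:
// retryDelays(3, 'exp', 1000) → [1000, 2000, 4000]
// retryDelays(3, 'lin', 1000) → [1000, 2000, 3000]
```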
Example: flaky integration tests
```json
{
  "jobs": {
    "test": {
      "runsOn": "sandbox",
      "steps": [
        { "name": "Run tests", "uses": "builtin:shell", "with": { "command": "pnpm test:integration" } }
      ],
      "retries": {
        "max": 3,
        "backoff": "exp",
        "initialIntervalMs": 5000,
        "maxIntervalMs": 60000
      }
    }
  }
}
```

Up to 4 attempts total. Waits 5s, 10s, then 20s between retries; maxIntervalMs would cap any interval at 60s.
No retries by default
Without a retries field, a job fails on the first failure. max: 0 is the same as omitting retries entirely.
Retries re-run the whole job
This is important. Retries are a job-level concept — the engine doesn't retry individual steps. If step 3 of 5 fails and retries: { max: 2 } is set, the engine retries the entire job from step 1. Steps that succeeded the first time run again on the retry.
If your steps have side effects (writing files, creating resources, posting to Slack), think about idempotency. If that's painful, consider splitting the risky work into its own job so retries don't clobber the rest.
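For example, instead of retrying a five-job pipeline because one flaky upload, the upload can live in its own job with its own policy. A sketch — the plugin:storage:upload handler and its inputs are made up for illustration, and how jobs are sequenced is covered in the Jobs doc:

```json
{
  "jobs": {
    "upload-artifacts": {
      "runsOn": "sandbox",
      "steps": [
        { "name": "Upload", "uses": "plugin:storage:upload", "with": { "bucket": "releases" } }
      ],
      "retries": { "max": 2, "backoff": "exp", "initialIntervalMs": 5000 }
    }
  }
}
```

Only the upload re-runs on failure; steps that already succeeded in other jobs aren't touched.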
Retries don't mask failed in the history
Every retry attempt is recorded in the job run as a separate attempt value. The run history shows all attempts, not just the final outcome. Studio visualizes this as "attempt 1 failed, attempt 2 succeeded" — useful for spotting flakiness trends.
continueOnError on steps
```ts
continueOnError?: boolean // default false
```

When true, a failed step is recorded as failed but the job continues to the next step. Subsequent steps run normally; the job's final status is determined by whether any non-continueOnError step failed.
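The status rule can be sketched like this (an illustration of the semantics just described, not the engine's actual code):

```typescript
interface StepResult {
  failed: boolean;
  continueOnError: boolean; // from the step's spec, default false
}

// A job fails only if some step failed WITHOUT continueOnError set.
function jobStatus(steps: StepResult[]): 'success' | 'failed' {
  return steps.some((s) => s.failed && !s.continueOnError) ? 'failed' : 'success';
}
```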
When to use it
Non-critical side effects. Analytics posting, notification sending, advisory lint checks, optional cleanup.
```json
{
  "steps": [
    { "name": "Build", "uses": "builtin:shell", "with": { "command": "pnpm build" } },
    { "name": "Test", "uses": "builtin:shell", "with": { "command": "pnpm test" } },
    {
      "name": "Upload coverage",
      "uses": "plugin:codecov:upload",
      "continueOnError": true
    },
    {
      "name": "Notify Slack",
      "uses": "plugin:notify:slack",
      "continueOnError": true
    }
  ]
}
```

If codecov or Slack are down, the build still counts as successful. The failed steps are recorded but don't block the pipeline.
When NOT to use it
Never on a step another step depends on. If step B reads steps.a.outputs.xyz, then continueOnError on A means B will run against an undefined output. Almost always what you want instead is a gate or an if check on B.
Shell throwOnError vs step continueOnError
These are orthogonal. throwOnError is an input to the shell handler — it controls whether a non-zero exit fails the step. continueOnError controls whether a failed step fails the job.
| throwOnError | continueOnError | Non-zero exit result |
|---|---|---|
| false (default) | any | Step succeeds; outputs.ok = false |
| true | false (default) | Step fails; job fails |
| true | true | Step fails; job continues |
Use throwOnError: false when you want to inspect the exit code in a later step. Use continueOnError: true when you want the job to survive a known-failing step.
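The table can be encoded as a tiny decision function — a sketch of the semantics for reference, not engine code:

```typescript
// Outcome of a non-zero shell exit, given the two flags.
function nonZeroExitOutcome(
  throwOnError: boolean,
  continueOnError: boolean,
): { stepFailed: boolean; jobContinues: boolean } {
  if (!throwOnError) {
    // The shell handler swallows the exit code; the step succeeds
    // and reports outputs.ok = false for later steps to inspect.
    return { stepFailed: false, jobContinues: true };
  }
  // The step fails; the job continues only if the step opted out
  // of failing the job via continueOnError.
  return { stepFailed: true, jobContinues: continueOnError };
}
```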
hooks.onFailure
```ts
hooks?: {
  pre?: StepSpec[];
  post?: StepSpec[];
  onSuccess?: StepSpec[];
  onFailure?: StepSpec[];
}
```

The onFailure hook runs after the main steps if any of them failed. Use it for cleanup that only matters when things went wrong.
Execution order
For a failed job: pre → steps (fail) → onFailure → post.
If the main steps succeed: pre → steps → onSuccess → post.
post always runs; onSuccess and onFailure are mutually exclusive based on the main-steps outcome.
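The ordering can be written down as a one-liner (illustrative only):

```typescript
type Phase = 'pre' | 'steps' | 'onSuccess' | 'onFailure' | 'post';

// Phase sequence for a job run, per the ordering above: post always
// runs, and exactly one of onSuccess/onFailure fires.
function phaseOrder(stepsSucceeded: boolean): Phase[] {
  return ['pre', 'steps', stepsSucceeded ? 'onSuccess' : 'onFailure', 'post'];
}
```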
Example: rollback and alert on deploy failure
```json
{
  "jobs": {
    "deploy": {
      "runsOn": "sandbox",
      "secrets": ["DEPLOY_KEY", "SLACK_WEBHOOK"],
      "steps": [
        { "name": "Deploy", "uses": "plugin:release:deploy", "with": { "env": "production" } }
      ],
      "hooks": {
        "onFailure": [
          { "name": "Rollback", "uses": "plugin:release:rollback" },
          {
            "name": "Notify Slack",
            "uses": "plugin:notify:slack",
            "with": { "message": "Prod deploy FAILED, rolled back. Check run ${{ trigger.runId }}" }
          }
        ],
        "post": [
          { "name": "Release lock", "uses": "builtin:shell", "with": { "command": "./scripts/release-lock.sh" } }
        ]
      }
    }
  }
}
```

On failure: Deploy fails → Rollback runs → Slack alert posts → Lock releases. On success: Deploy succeeds → Lock releases. Either way, the lock is released.
Hook steps are full steps
They have the full StepSpec shape. They can use builtins, plugin handlers, have their own with, env, if, timeoutMs, and continueOnError. A hook step that itself fails is recorded but doesn't change the job outcome (if the main steps succeeded, the job is still successful even if onSuccess failed).
Failing from inside a step
When a plugin handler wants to fail the step, it throws. The engine catches the exception, records it in the step's error field, and transitions the step to 'failed'.
```ts
interface StepRunError {
  message: string;
  code?: string;
  stack?: string;
  timestamp?: string;
  details?: Record<string, unknown>;
}
```

- message — human-readable.
- code — machine-readable identifier your handler defines (e.g., 'ENOENT', 'RATE_LIMIT', 'INVALID_CONFIG').
- stack — stack trace for debugging.
- details — anything else you want to attach.
Clients can branch on code in UIs and subsequent logic. Prefer stable, documented codes over free-form messages.
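A sketch of branching on code in client logic — the codes and hint strings here are hypothetical examples a handler might define, not built-ins:

```typescript
interface StepRunError {
  message: string;
  code?: string;
  details?: Record<string, unknown>;
}

// Map a failed step's error to an operator-facing hint. Branching on a
// stable code is more robust than pattern-matching the message text.
function failureHint(error: StepRunError): string {
  switch (error.code) {
    case 'RATE_LIMIT':
      return 'Transient: add or raise a RetryPolicy on this job.';
    case 'INVALID_CONFIG':
      return 'Permanent: fix the step config; retrying will not help.';
    default:
      return `Unclassified failure: ${error.message}`;
  }
}
```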
The dead-letter queue
```ts
type RunState = 'queued' | 'running' | 'success' | 'failed' | 'cancelled' | 'skipped' | 'dlq';
```

When a run exhausts its retries and the engine can't make progress, it lands in the 'dlq' state — the dead-letter queue. This is the terminal sink for runs that need manual intervention.
'dlq' is distinct from 'failed':
- 'failed' — the run finished with a failure. No more work will be done. Typical cause: a job failed, no retries left.
- 'dlq' — the run is parked because the engine explicitly gave up. Something unusual happened (engine crash mid-run, repeated scheduler failures, unhandled infrastructure error). Needs human attention.
DLQ runs are listed in Studio with a distinct status. From there an operator can inspect the history, fix whatever was wrong, and either replay the run or cancel it.
There's no spec-level configuration for DLQ behavior — it's engine-internal. If a run ends up in DLQ, treat it as a bug to investigate, not a normal failure mode.
Job states vs step states
The two state enums differ in the failure/recovery fields:
```ts
type JobState = 'queued' | 'running' | 'success' | 'failed' | 'cancelled' | 'skipped' | 'interrupted';
type StepState = RunState | 'waiting_approval';
```

- 'interrupted' is job-only. A job is marked 'interrupted' when the engine restarts mid-execution. It's not a hard failure — the engine can attempt to resume it on the next boot.
- 'waiting_approval' is step-only. See Gates & Approvals.
- 'dlq' is run-only.
Timeouts
Three separate timeout layers, each with its own scope:
- WorkflowSpec — no timeout (runs as long as needed)
- JobSpec — timeoutMs?: number (≤ 24h)
- StepSpec — timeoutMs?: number (≤ 24h)

A step timing out fails that step, but doesn't necessarily fail the job (that's what continueOnError is for). A job timing out does fail the job. Timeouts at the two layers are independent — you can have a 5-minute step timeout inside a 30-minute job timeout.
No defaults at any layer: without an explicit timeoutMs, there's no upper bound from the engine side. The shell handler has its own default of 5 minutes (ShellInput.timeout), but that's enforced by the handler, not by the engine.
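A sketch of the two layers nested, using fields documented above (the values are arbitrary):

```json
{
  "jobs": {
    "build": {
      "runsOn": "sandbox",
      "timeoutMs": 1800000,
      "steps": [
        {
          "name": "Compile",
          "uses": "builtin:shell",
          "with": { "command": "pnpm build" },
          "timeoutMs": 300000
        }
      ]
    }
  }
}
```

The 5-minute step timeout and the 30-minute job timeout are enforced independently; whichever is exceeded first ends its own unit.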
Patterns
Flaky test with bounded retries
```json
{
  "retries": { "max": 3, "backoff": "exp", "initialIntervalMs": 5000 }
}
```

Optional side-effect step
```json
{
  "name": "Post to analytics",
  "uses": "plugin:analytics:track",
  "continueOnError": true
}
```

Cleanup that must always happen
```json
{
  "hooks": {
    "post": [
      { "name": "Release lock", "uses": "builtin:shell", "with": { "command": "./scripts/release.sh" } }
    ]
  }
}
```

Retry with cleanup between attempts
Retries re-run the whole job, so pre hooks run on every retry:
```json
{
  "hooks": {
    "pre": [
      { "name": "Reset state", "uses": "builtin:shell", "with": { "command": "./scripts/reset.sh" } }
    ]
  },
  "retries": { "max": 2, "backoff": "exp", "initialIntervalMs": 10000 }
}
```

Alert on failure, tag on success
```json
{
  "hooks": {
    "onSuccess": [ { "name": "Tag release", "uses": "plugin:release:tag" } ],
    "onFailure": [ { "name": "Alert", "uses": "plugin:notify:slack" } ]
  }
}
```

What NOT to do
- Don't use retries for logic errors. If the code is wrong, retrying won't fix it. Retries are for transient failures — network, rate limits, race conditions on external state.
- Don't use continueOnError to paper over real problems. A failing step that you don't care about is noise in the run history. Either fix it or remove it.
- Don't stack retries high. max: 10 just means you'll fail slower. Three retries is usually the ceiling — if it's still broken after three attempts, it's not transient.
- Don't rely on onFailure for cleanup that must always happen. onFailure only runs on failure. Cleanup that must always happen goes in post, not onFailure.
What to read next
- Jobs — where hooks and retries are declared.
- Steps — continueOnError and step-level timeouts.
- Gates & Approvals — decision routing as an alternative to retries.
- Spec Reference → RetryPolicy — the full schema.