KB Labs Docs

Retries & Error Handling

Last updated April 7, 2026


RetryPolicy, continueOnError, onFailure hooks, and the DLQ.

Things fail. Networks flake, disks fill up, APIs return 503. The workflow engine gives you several layers of error handling — retry policies at the job level, continueOnError at the step level, onFailure hooks for cleanup, and the dead-letter queue for runs that exhausted every option. Pick the right layer for the failure you're dealing with.

The layers

| Layer | Scope | Use for |
| --- | --- | --- |
| retries (job) | Whole job re-runs | Flaky tests, transient network errors, rate limits |
| continueOnError (step) | Single step failure | Non-critical side effects (notifications, analytics) |
| hooks.onFailure (job) | Cleanup after failure | Releasing locks, rolling back, alerting |
| DLQ | Terminal run state | Manual intervention after everything else gave up |

Each operates on a different unit. Retries re-run the whole job; continueOnError keeps a job going past a broken step; hooks fire after the fact; the DLQ is where runs go to die.

RetryPolicy

Declared on a job:

TypeScript
interface RetryPolicy {
  max: number;                        // non-negative int
  backoff: 'exp' | 'lin';             // default 'exp'
  initialIntervalMs: number;          // positive int, default 1000
  maxIntervalMs?: number;             // positive int
}

When a job fails, the engine waits according to the backoff and re-queues the whole job. max is the number of retries, not the total attempt count — max: 2 means up to 3 total attempts (1 original + 2 retries).

Backoff modes

Exponential ('exp', default):

wait before retry 1: initialIntervalMs
wait before retry 2: initialIntervalMs × 2
wait before retry 3: initialIntervalMs × 4
...
capped at maxIntervalMs

Linear ('lin'):

wait before retry 1: initialIntervalMs
wait before retry 2: initialIntervalMs × 2
wait before retry 3: initialIntervalMs × 3
...
capped at maxIntervalMs

With initialIntervalMs: 1000 (the default), exponential gives you 1s, 2s, 4s, 8s, 16s…; linear gives you 1s, 2s, 3s, 4s, 5s…. Exponential is usually right for external API failures (gives the upstream time to recover); linear is fine for in-process flakiness.
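The two schedules can be sketched as a small function. This is only an illustration of the arithmetic described above, not the engine's actual code; retryDelayMs and retrySchedule are hypothetical names, and the RetryPolicy shape is copied from the interface earlier on this page.

```typescript
// Hypothetical helpers (not engine APIs): compute the wait before each retry
// from the RetryPolicy fields documented above.
interface RetryPolicy {
  max: number;
  backoff: 'exp' | 'lin';
  initialIntervalMs: number;
  maxIntervalMs?: number;
}

// Wait before the n-th retry, where retry is 1-based.
function retryDelayMs(policy: RetryPolicy, retry: number): number {
  const factor = policy.backoff === 'exp' ? Math.pow(2, retry - 1) : retry;
  const raw = policy.initialIntervalMs * factor;
  return policy.maxIntervalMs === undefined ? raw : Math.min(raw, policy.maxIntervalMs);
}

// Every wait a job could go through before its retries are exhausted.
function retrySchedule(policy: RetryPolicy): number[] {
  const waits: number[] = [];
  for (let retry = 1; retry <= policy.max; retry++) {
    waits.push(retryDelayMs(policy, retry));
  }
  return waits;
}

const expWaits = retrySchedule({ max: 5, backoff: 'exp', initialIntervalMs: 1000, maxIntervalMs: 10000 });
const linWaits = retrySchedule({ max: 5, backoff: 'lin', initialIntervalMs: 1000 });
console.log(expWaits.join(', ')); // 1000, 2000, 4000, 8000, 10000
console.log(linWaits.join(', ')); // 1000, 2000, 3000, 4000, 5000
```

In the exponential run, the fifth wait would be 16000 ms uncapped; the maxIntervalMs of 10000 clamps it.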

Example: flaky integration tests

JSON
{
  "jobs": {
    "test": {
      "runsOn": "sandbox",
      "steps": [
        { "name": "Run tests", "uses": "builtin:shell", "with": { "command": "pnpm test:integration" } }
      ],
      "retries": {
        "max": 3,
        "backoff": "exp",
        "initialIntervalMs": 5000,
        "maxIntervalMs": 60000
      }
    }
  }
}

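Up to 4 attempts total, with waits of 5s, 10s, and 20s between them. The 60s cap never applies here because the longest wait is 20s; it would only matter with a higher max (a fifth retry would otherwise wait 80s).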

No retries by default

Without a retries field, a job fails on the first failure. max: 0 is the same as omitting retries entirely.

Retries re-run the whole job

This is important. Retries are a job-level concept — the engine doesn't retry individual steps. If step 3 of 5 fails and retries: { max: 2 } is set, the engine retries the entire job from step 1. Steps that succeeded the first time run again on the retry.

If your steps have side effects (writing files, creating resources, posting to Slack), think about idempotency. If that's painful, consider splitting the risky work into its own job so retries don't clobber the rest.
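To make the idempotency concern concrete, here is a toy model of job-level retries. It is a sketch under simplifying assumptions (synchronous steps, no waiting between attempts), and runJob, Step, and the step names are hypothetical, not engine APIs.

```typescript
// Hypothetical model of job-level retries. A step signals failure by throwing.
type Step = { name: string; run: () => void };

function runJob(steps: Step[], maxRetries: number): { attempts: number; ok: boolean; log: string[] } {
  const log: string[] = [];
  // max counts retries, so the job gets maxRetries + 1 attempts in total.
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    let failed = false;
    for (const step of steps) {
      log.push(`attempt ${attempt}: ${step.name}`);
      try {
        step.run();
      } catch {
        failed = true;
        break; // remaining steps of this attempt do not run
      }
    }
    if (!failed) return { attempts: attempt, ok: true, log };
    // On failure the next attempt starts again from the first step.
  }
  return { attempts: maxRetries + 1, ok: false, log };
}

let uploads = 0;    // side effect performed by the first step
let flakyCalls = 0; // the second step fails on its first call only
const result = runJob(
  [
    { name: 'upload', run: () => { uploads++; } },
    { name: 'flaky', run: () => { flakyCalls++; if (flakyCalls < 2) throw new Error('503'); } },
  ],
  2,
);
console.log(result.attempts, uploads); // 2 2
```

The upload step succeeded on the first attempt and still ran again on the second, which is exactly the case where a non-idempotent side effect would be duplicated.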

Retries don't mask failures in the history

Every retry attempt is recorded in the job run as a separate attempt value. The run history shows all attempts, not just the final outcome. Studio visualizes this as "attempt 1 failed, attempt 2 succeeded" — useful for spotting flakiness trends.

continueOnError on steps

TypeScript
continueOnError?: boolean    // default false

When true, a failed step is recorded as failed but the job continues to the next step. Subsequent steps run normally; the job's final status is determined by whether any non-continueOnError step failed.
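The status rule can be expressed as a one-line predicate. This is a minimal sketch of the rule as stated above; StepResult and jobStatus are hypothetical names, not part of the engine's API.

```typescript
// Hypothetical model: a job fails only if some failed step did not opt into
// continueOnError.
interface StepResult {
  name: string;
  failed: boolean;
  continueOnError?: boolean; // default false
}

function jobStatus(results: StepResult[]): 'success' | 'failed' {
  const blockingFailure = results.some((r) => r.failed && !r.continueOnError);
  return blockingFailure ? 'failed' : 'success';
}

console.log(jobStatus([
  { name: 'Build', failed: false },
  { name: 'Upload coverage', failed: true, continueOnError: true },
])); // success

console.log(jobStatus([
  { name: 'Build', failed: true },
  { name: 'Upload coverage', failed: true, continueOnError: true },
])); // failed
```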

When to use it

Non-critical side effects. Analytics posting, notification sending, advisory lint checks, optional cleanup.

JSON
{
  "steps": [
    { "name": "Build",   "uses": "builtin:shell", "with": { "command": "pnpm build" } },
    { "name": "Test",    "uses": "builtin:shell", "with": { "command": "pnpm test" } },
    {
      "name": "Upload coverage",
      "uses": "plugin:codecov:upload",
      "continueOnError": true
    },
    {
      "name": "Notify Slack",
      "uses": "plugin:notify:slack",
      "continueOnError": true
    }
  ]
}

If codecov or Slack are down, the build still counts as successful. The failed steps are recorded but don't block the pipeline.

When NOT to use it

Never on a step another step depends on. If step B reads steps.a.outputs.xyz, then continueOnError on A means B will run against an undefined output. Almost always what you want instead is a gate or an if check on B.

Shell throwOnError vs step continueOnError

These are orthogonal. throwOnError is an input to the shell handler — it controls whether a non-zero exit fails the step. continueOnError controls whether a failed step fails the job.

| throwOnError | continueOnError | Non-zero exit result |
| --- | --- | --- |
| false (default) | any | Step succeeds; outputs.ok = false |
| true | false (default) | Step fails; job fails |
| true | true | Step fails; job continues |

Use throwOnError: false when you want to inspect the exit code in a later step. Use continueOnError: true when you want the job to survive a known-failing step.
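The interaction of the two flags can also be read as a small decision function. This sketch only restates the table above for a non-zero exit; Outcome and nonZeroExitOutcome are hypothetical names.

```typescript
// Hypothetical restatement of the table: what happens when a shell command
// exits non-zero, given the two flags.
interface Outcome {
  step: 'success' | 'failed';
  jobContinues: boolean;
}

function nonZeroExitOutcome(throwOnError: boolean, continueOnError: boolean): Outcome {
  if (!throwOnError) {
    // The handler swallows the exit code; the step succeeds with outputs.ok = false.
    return { step: 'success', jobContinues: true };
  }
  // The handler fails the step; continueOnError decides whether the job goes on.
  return { step: 'failed', jobContinues: continueOnError };
}

console.log(nonZeroExitOutcome(false, false)); // step succeeds, job continues
console.log(nonZeroExitOutcome(true, false));  // step fails, job fails
console.log(nonZeroExitOutcome(true, true));   // step fails, job continues
```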

hooks.onFailure

TypeScript
hooks?: {
  pre?:       StepSpec[];
  post?:      StepSpec[];
  onSuccess?: StepSpec[];
  onFailure?: StepSpec[];
}

The onFailure hook runs after the main steps if any of them failed. Use it for cleanup that only matters when things went wrong.

Execution order

For a failed job: pre → steps (fail) → onFailure → post.

If the main steps succeed: pre → steps → onSuccess → post.

post always runs; onSuccess and onFailure are mutually exclusive based on the main-steps outcome.
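As a quick reference, the ordering can be written as a function of the main-steps outcome. This is a sketch of the two sequences listed above only; phaseOrder is a hypothetical name, and the case where a pre hook itself fails is not modeled because this page does not describe it.

```typescript
// Hypothetical helper: the phase sequence for one job attempt.
type Phase = 'pre' | 'steps' | 'onSuccess' | 'onFailure' | 'post';

function phaseOrder(mainStepsFailed: boolean): Phase[] {
  // Exactly one of onSuccess / onFailure runs; post runs in both cases.
  return ['pre', 'steps', mainStepsFailed ? 'onFailure' : 'onSuccess', 'post'];
}

console.log(phaseOrder(true).join(' → '));  // pre → steps → onFailure → post
console.log(phaseOrder(false).join(' → ')); // pre → steps → onSuccess → post
```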

Example: rollback and alert on deploy failure

JSON
{
  "jobs": {
    "deploy": {
      "runsOn": "sandbox",
      "secrets": ["DEPLOY_KEY", "SLACK_WEBHOOK"],
      "steps": [
        { "name": "Deploy", "uses": "plugin:release:deploy", "with": { "env": "production" } }
      ],
      "hooks": {
        "onFailure": [
          { "name": "Rollback", "uses": "plugin:release:rollback" },
          {
            "name": "Notify Slack",
            "uses": "plugin:notify:slack",
            "with": { "message": "Prod deploy FAILED, rolled back. Check run ${{ trigger.runId }}" }
          }
        ],
        "post": [
          { "name": "Release lock", "uses": "builtin:shell", "with": { "command": "./scripts/release-lock.sh" } }
        ]
      }
    }
  }
}

On failure: Deploy fails → Rollback runs → Slack alert posts → Lock releases. On success: Deploy succeeds → Lock releases. Either way, the lock is released.

Hook steps are full steps

They have the full StepSpec shape: they can use builtins or plugin handlers and can set their own with, env, if, timeoutMs, and continueOnError. A hook step that itself fails is recorded but doesn't change the job outcome (if the main steps succeeded, the job is still successful even if onSuccess failed).

Failing from inside a step

When a plugin handler wants to fail the step, it throws. The engine catches the exception, records it in the step's error field, and transitions the step to 'failed'.

TypeScript
interface StepRunError {
  message: string;
  code?: string;
  stack?: string;
  timestamp?: string;
  details?: Record<string, unknown>;
}
  • message — human-readable.
  • code — machine-readable identifier your handler defines (e.g., 'ENOENT', 'RATE_LIMIT', 'INVALID_CONFIG').
  • stack — stack trace for debugging.
  • details — anything else you want to attach.

Clients can branch on code in UIs and subsequent logic. Prefer stable, documented codes over free-form messages.
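One way a handler might attach a stable code is to throw an Error subclass that carries it. The sketch below is an assumption about how that could look: HandlerError and toStepRunError are hypothetical, and the engine's real mapping from a thrown exception to StepRunError may differ. Only the StepRunError shape is taken from this page.

```typescript
interface StepRunError {
  message: string;
  code?: string;
  stack?: string;
  timestamp?: string;
  details?: Record<string, unknown>;
}

// Hypothetical error class a plugin handler could throw.
class HandlerError extends Error {
  code: string;
  details?: Record<string, unknown>;
  constructor(message: string, code: string, details?: Record<string, unknown>) {
    super(message);
    this.name = 'HandlerError';
    this.code = code;
    this.details = details;
  }
}

// Hypothetical conversion, shown only to illustrate the record's fields.
function toStepRunError(err: unknown): StepRunError {
  if (!(err instanceof Error)) return { message: String(err) };
  const extra = err as Error & { code?: unknown; details?: unknown };
  const out: StepRunError = { message: err.message, stack: err.stack };
  if (typeof extra.code === 'string') out.code = extra.code;
  if (typeof extra.details === 'object' && extra.details !== null) {
    out.details = extra.details as Record<string, unknown>;
  }
  return out;
}

const recorded = toStepRunError(
  new HandlerError('Too many requests', 'RATE_LIMIT', { retryAfterMs: 30000 }),
);
// Branch on the stable code, never on the message text.
console.log(recorded.code === 'RATE_LIMIT' ? 'back off and retry' : 'fail fast'); // back off and retry
```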

The dead-letter queue

TypeScript
type RunState = 'queued' | 'running' | 'success' | 'failed' | 'cancelled' | 'skipped' | 'dlq';

When a run exhausts its retries and the engine can't make progress, it lands in the 'dlq' state — the dead-letter queue. This is the terminal sink for runs that need manual intervention.

'dlq' is distinct from 'failed':

  • 'failed' — the run finished with a failure. No more work will be done. Typical cause: a job failed, no retries left.
  • 'dlq' — the run is parked because the engine explicitly gave up. Something unusual happened (engine crash mid-run, repeated scheduler failures, unhandled infrastructure error). Needs human attention.

DLQ runs are listed in Studio with a distinct status. From there an operator can inspect the history, fix whatever was wrong, and either replay the run or cancel it.

There's no spec-level configuration for DLQ behavior — it's engine-internal. If a run ends up in DLQ, treat it as a bug to investigate, not a normal failure mode.

Job states vs step states

The two state enums share a common core and differ in the failure/recovery states:

TypeScript
type JobState  = 'queued' | 'running' | 'success' | 'failed' | 'cancelled' | 'skipped' | 'interrupted';
type StepState = 'queued' | 'running' | 'success' | 'failed' | 'cancelled' | 'skipped' | 'waiting_approval';

  • 'interrupted' is job-only. A job is marked 'interrupted' when the engine restarts mid-execution. It's not a hard failure; the engine can attempt to resume it on the next boot.
  • 'waiting_approval' is step-only. See Gates & Approvals.
  • 'dlq' is run-only, so it appears in RunState but in neither JobState nor StepState.

Timeouts

Three spec levels, two of which accept a timeout, each with its own scope:

TypeScript
WorkflowSpec  — no timeout (runs as long as needed)
JobSpec       — timeoutMs?: number  (≤ 24h)
StepSpec      — timeoutMs?: number  (≤ 24h)

A step timing out doesn't necessarily fail the job (that's what continueOnError is for). A job timing out does fail the job. Timeouts at both layers are independent — you can have a 5-minute step timeout inside a 30-minute job timeout.

No defaults at any layer: without an explicit timeoutMs, there's no upper bound from the engine side. The shell handler has its own default of 5 minutes (ShellInput.timeout), but that's enforced by the handler, not by the engine.

Patterns

Flaky test with bounded retries

JSON
{
  "retries": { "max": 3, "backoff": "exp", "initialIntervalMs": 5000 }
}

Optional side-effect step

JSON
{
  "name": "Post to analytics",
  "uses": "plugin:analytics:track",
  "continueOnError": true
}

Cleanup that must always happen

JSON
{
  "hooks": {
    "post": [
      { "name": "Release lock", "uses": "builtin:shell", "with": { "command": "./scripts/release.sh" } }
    ]
  }
}

Retry with cleanup between attempts

Retries re-run the whole job, so pre hooks run on every retry:

JSON
{
  "hooks": {
    "pre": [
      { "name": "Reset state", "uses": "builtin:shell", "with": { "command": "./scripts/reset.sh" } }
    ]
  },
  "retries": { "max": 2, "backoff": "exp", "initialIntervalMs": 10000 }
}

Alert on failure, tag on success

JSON
{
  "hooks": {
    "onSuccess": [ { "name": "Tag release", "uses": "plugin:release:tag" } ],
    "onFailure": [ { "name": "Alert",       "uses": "plugin:notify:slack" } ]
  }
}

What NOT to do

  • Don't use retries for logic errors. If the code is wrong, retrying won't fix it. Retries are for transient failures — network, rate limits, race conditions on external state.
  • Don't use continueOnError to paper over real problems. A failing step that you don't care about is noise in the run history. Either fix it or remove it.
  • Don't stack retries high. max: 10 just means you'll fail slower. Three retries is usually the ceiling: if the job is still broken after three retries (four attempts), the failure isn't transient.
  • Don't rely on onFailure for cleanup that must always happen. onFailure only runs on failure. Cleanup that must always happen goes in post, not onFailure.