Runbooks
Last updated April 7, 2026
Copy-paste procedures for common operational tasks.
Runbooks are the "when X happens, do Y" cookbook for operators. Each entry is a specific scenario with concrete steps. Keep them short and actionable — runbooks aren't documentation of how things work, they're documentation of what to do.
For general operational guidance see Deployment, Observability, and Security. For incidents specifically, see Incident Response.
Service lifecycle
Restart a service
kb-dev restart rest

For production deployments with systemd:

sudo systemctl restart kb-rest

Or with Kubernetes:

kubectl rollout restart deployment/kb-rest-api

Check the service came back up:
kb-dev status
# Or:
curl -f http://localhost:5050/api/v1/health

Stop everything

kb-dev stop

Stops services in reverse dependency order. Safe — drains in-flight requests before killing processes.
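Reverse-dependency ordering is just a topological sort run backwards. A sketch using coreutils tsort against a hypothetical service graph (the real graph comes from each service's dependsOn declaration; the service names here are illustrative):

```shell
# Hypothetical dependency pairs: "A B" means A must be up before B.
# tsort emits a valid start order; tac reverses it into a stop order.
printf '%s\n' \
  'qdrant rest' \
  'redis rest' \
  'state-daemon workflow' \
  'rest gateway' | tsort > start-order.txt

cat start-order.txt   # start order: infrastructure first
tac start-order.txt   # stop order: dependents first
```

This is only a model of what kb-dev does internally; in practice you just run kb-dev stop.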
Start in a specific order
kb-dev already respects dependsOn, but if you need to bring up a subset:
kb-dev start infra # Qdrant, Redis, state-daemon
kb-dev start backend # REST, workflow, marketplace, gateway
kb-dev start studio # UI

Check which services are running

kb-dev status

Shows every declared service with its current state (running, stopped, unhealthy) and the last health check result.
Rolling updates
Deploy a new platform version
For stateless services (gateway, REST API, marketplace):
- Deploy new version alongside the old.
- Shift traffic on the load balancer.
- Drain the old version (stop accepting new requests, wait for in-flight to complete).
- Shut down.
Connected host agents reconnect during cutover. See Gateway → Self-Hosted → Rollout strategy.
Upgrade the workflow daemon
The workflow daemon is stateful — runs in progress survive a restart via the 'interrupted' state:
- Deploy the new version.
- Restart the daemon: sudo systemctl restart kb-workflow (or equivalent).
- The old daemon exits; in-flight runs are marked 'interrupted'.
- The new daemon starts; its recovery path picks up interrupted runs and resumes them.
For the zero-downtime version, you'd need two workflow daemons pointing at a shared job queue — not currently supported. For now: brief downtime during the restart is the expected behavior.
Upgrade a plugin
pnpm kb marketplace update @scope/plugin-cli
pnpm kb marketplace clear-cache
# Restart whichever service hosts the plugin's handlers:
kb-dev restart rest # if the plugin has REST routes
# No restart needed for CLI-only plugins

Plugin management
Install a new plugin
pnpm kb marketplace install @scope/plugin-cli
pnpm kb marketplace clear-cache

After install, the plugin is immediately available from the CLI. For REST routes and Studio pages, restart the hosting service:

kb-dev restart rest

Disable a plugin without uninstalling
pnpm kb marketplace plugins disable @scope/plugin-cli

Sets enabled: false in .kb/marketplace.lock. The plugin stays installed but doesn't load on the next service startup. Re-enable:

pnpm kb marketplace plugins enable @scope/plugin-cli

Uninstall a plugin completely
pnpm kb marketplace uninstall @scope/plugin-cli
pnpm kb marketplace clear-cache

Removes the entry from marketplace.lock and runs pnpm remove to drop the package from node_modules.
Clear the plugin registry cache
pnpm kb marketplace clear-cache

Every long-running service caches the plugin registry in memory. If a plugin was installed, updated, or linked and isn't showing up, clear the cache and restart. This is the single most common fix for "my plugin doesn't appear".
Logs
Tail a service's logs
kb-dev logs rest -f

Follows stdout/stderr from the rest service. Pass -f to follow; omit it for a one-shot snapshot.
For systemd:
sudo journalctl -u kb-rest -f

For Kubernetes:

kubectl logs -f deployment/kb-rest-api

Query logs by request ID
With Loki / Elasticsearch / Datadog set up, filter by requestId:
serviceId:* AND requestId:req_abc123

You'll see every log line for that request across every service it touched. This is the fastest way to debug a specific failure — get the requestId from the error, paste it into the log query.
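Without an aggregator, the same correlation works against local log files with plain grep. A self-contained sketch — the sample logs and paths below are fabricated for illustration; adapt to wherever your services write:

```shell
# Fabricate two service logs so the grep has something to find (demo only)
mkdir -p /tmp/kb-logs-demo
printf '%s\n' \
  '{"serviceId":"rest","requestId":"req_abc123","msg":"POST /api/v1/runs"}' \
  '{"serviceId":"rest","requestId":"req_zzz999","msg":"GET /health"}' > /tmp/kb-logs-demo/rest.log
printf '%s\n' \
  '{"serviceId":"workflow","requestId":"req_abc123","msg":"run started"}' > /tmp/kb-logs-demo/workflow.log

# Every line for one request, across all service log files:
grep -h 'req_abc123' /tmp/kb-logs-demo/*.log
```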
Increase log verbosity temporarily
Set KB_LOG_LEVEL=debug in the service env and restart:
sudo systemctl edit kb-rest
# Add:
# [Service]
# Environment=KB_LOG_LEVEL=debug
sudo systemctl daemon-reload
sudo systemctl restart kb-rest

Revert when you're done. debug is very verbose and will flood your log aggregator.
Config changes
Apply a kb.config.json change
Platform services read kb.config.json at startup. To apply a change:
- Edit .kb/kb.config.json.
- Restart every service that reads the changed section:
  - Changed platform.adapters.* → restart everything.
  - Changed platform.execution.* → restart REST, workflow, marketplace.
  - Changed platform.core.* → restart whichever service owns that concern.
  - Changed profiles.* → plugins pick up changes via useConfig(), which is cache-aware but doesn't auto-refresh. Restart to be safe.
  - Changed gateway.* → restart the gateway.

kb-dev restart

Restarts all managed services in dependency order.
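For orientation, the section names above map onto a kb.config.json shaped roughly like this. This is a hypothetical skeleton: only the top-level section names come from this page, and every field value is illustrative:

```json
{
  "platform": {
    "adapters": { "llm": { "provider": "openai" } },
    "execution": { "maxConcurrentRuns": 4 },
    "core": { "logLevel": "info" }
  },
  "profiles": { "default": {} },
  "gateway": { "port": 4000 }
}
```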
Rotate a secret
- Update the secret in your secret manager.
- Update the env var on the service process.
- Restart the service.
For GATEWAY_JWT_SECRET specifically: rotation invalidates all outstanding JWTs simultaneously. Clients will get 401s until they re-authenticate. Coordinate the rotation with your clients if you have long-running sessions.
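The reason the 401s are immediate: an HS256-style token signature is an HMAC over the token body, so a signature minted under the old secret can never verify under the new one. A stdlib-plus-openssl illustration (not the gateway's actual code; the token body is made up):

```shell
body='{"sub":"client_1","exp":1767225600}'

# Same token body, HMAC'd under the old and the new secret
old_sig=$(printf '%s' "$body" | openssl dgst -sha256 -hmac 'old-secret' -hex | awk '{print $NF}')
new_sig=$(printf '%s' "$body" | openssl dgst -sha256 -hmac 'new-secret' -hex | awk '{print $NF}')

# Different secret -> different signature -> verification fails -> 401
if [ "$old_sig" != "$new_sig" ]; then
  echo 'signature mismatch: token rejected'
fi
```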
Infrastructure
Reindex Mind RAG
After adding new files or making significant changes:
rm -rf .kb/cache/* # clear stale cache
pnpm kb mind rag-index --scope default # reindex

For scope-specific reindexing:

pnpm kb mind rag-index --scope default \
  --include "packages/core/**/*.ts"

See the kb-labs-mind plugin README for the full command surface.
Back up SQLite data
Single-host deployments often use SQLite for the primary database, analytics, logs, and the gateway host store. Back them up with standard file-system tools:
# Stop services before backing up SQLite files to get a consistent snapshot:
kb-dev stop
tar czf kb-labs-backup-$(date +%Y%m%d).tar.gz \
.kb/database/ \
.kb/analytics/ \
.kb/storage/ \
.kb/mind/
kb-dev start

Or use SQLite's .backup command for online backups without stopping services:

sqlite3 .kb/database/kb.sqlite ".backup 'kb.sqlite.backup'"

SQLite's .backup produces a consistent snapshot even under concurrent writes.
Restore from a SQLite backup
kb-dev stop
mv .kb/database/kb.sqlite .kb/database/kb.sqlite.old
cp /path/to/backup.sqlite .kb/database/kb.sqlite
kb-dev start

Verify the restore by checking that services come up healthy (kb-dev status) and that analytics queries return expected data.
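You can also ask SQLite itself whether the restored file is intact. A sketch against a throwaway database standing in for .kb/database/kb.sqlite (requires the sqlite3 CLI; the table and path are illustrative):

```shell
# Build a throwaway database for the demo
rm -f /tmp/kb-restore-demo.sqlite
sqlite3 /tmp/kb-restore-demo.sqlite \
  "CREATE TABLE runs(id TEXT); INSERT INTO runs VALUES ('run_1');"

# "ok" means the file passed SQLite's internal consistency checks
sqlite3 /tmp/kb-restore-demo.sqlite 'PRAGMA integrity_check;'
```

Run the same PRAGMA against the real file after a restore; anything other than "ok" means the backup itself is damaged.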
Clear stale workspaces
The workspace adapter (worktree or local-fs) creates per-run workspaces under .kb/runtime/workspaces/. They're supposed to be GC'd after the run completes, but crashes can leave orphans:
# List current workspaces:
ls .kb/runtime/workspaces/
# Remove orphans (safe when services are stopped):
kb-dev stop workflow
rm -rf .kb/runtime/workspaces/ws_*
kb-dev start workflow

If you're using the worktree adapter, also prune the git worktree list:
cd <your-repo>
git worktree prune

Host agent
Register a new host agent
On the developer's machine:
kb agent register --gateway https://gateway.example.com

Prompts for authentication, writes ~/.kb/agent.json, and registers with the gateway.
Start / stop the host agent
kb agent start
kb agent stop
kb agent status

For debugging, run in the foreground:

node infra/kb-labs-host-agent/apps/host-agent-app/dist/index.js

Revoke a host agent
From the server side:
# Via the REST API:
curl -X DELETE https://gateway.example.com/api/v1/hosts/host_abc123 \
  -H "Authorization: Bearer $ADMIN_TOKEN"

The host's client ID is marked invalid in the gateway's host store; future connections with that ID are refused.
Health checks
Verify platform is healthy
kb-dev status # dev mode

Or query each service's /health endpoint:
curl -f http://localhost:4000/health # gateway
curl -f http://localhost:5050/api/v1/health # rest
curl -f http://localhost:7778/health # workflow
curl -f http://localhost:5070/health # marketplace
curl -f http://localhost:7777/health # state daemon

200 means alive; anything else (or connection refused) means the service is broken.
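The per-service checks can be wrapped in one loop for a quick sweep. A sketch using the dev-mode ports listed above (in an environment where nothing is listening, every endpoint reports unreachable):

```shell
check_health() {
  # -f: treat HTTP error statuses as failures; --max-time bounds the hang on dead ports
  if curl -sf -o /dev/null --max-time 2 "$1"; then
    echo "$1 healthy"
  else
    echo "$1 unreachable"
  fi
}

for url in \
  http://localhost:4000/health \
  http://localhost:5050/api/v1/health \
  http://localhost:7778/health; do
  check_health "$url"
done
```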
Diagnose why a service won't start
- Tail the service's logs (kb-dev logs <service> or journalctl -u kb-<service>).
- Look for the first error during bootstrap — usually it's a missing env var, bad config, or an unreachable dependency.
- Verify infrastructure dependencies: curl http://localhost:6333/ (Qdrant), redis-cli ping (Redis).
- Check kb.config.json is valid JSON: jq . .kb/kb.config.json.
- Run pnpm kb doctor if available — it diagnoses common issues.
Troubleshooting shortcuts
| Symptom | Likely fix |
|---|---|
| Plugin doesn't appear after install | pnpm kb marketplace clear-cache |
| REST API returns 401 | Check Authorization header; check GATEWAY_JWT_SECRET didn't rotate |
| Workflow runs stuck in queued | Check workflow daemon logs; verify broker isn't backed up |
| Studio pages show "Failed to load" | Hard-reload browser; plugin Studio remotes are cached aggressively |
| LLM calls fail with "not configured" | Check platform.adapters.llm in kb.config.json; verify OPENAI_API_KEY is set |
| "Module not found" in a plugin | Plugin was rebuilt but cache wasn't cleared; pnpm kb marketplace clear-cache |
| Port already in use | Another service is on that port; lsof -i :5050 to find it |
| High memory usage | Reindex Mind RAG; check for plugins without quota limits |
| Slow CLI startup | Clear registry cache; check for slow adapter init (network timeouts) |
What to read next
- Incident Response — specific incident playbooks.
- Observability — detailed metrics, logs, traces setup.
- Deployment — for new deployments, not for operating existing ones.