Production Readiness
Last updated April 7, 2026
Checklist for moving a KB Labs deployment from dev to production.
Going to production with KB Labs means turning on the things that are disabled or loose in dev: real secrets, hard quotas, sandbox isolation, observability, backups. This guide is the checklist — every item with its reason and the reference page for how to do it.
Nothing here is novel; it's all cross-cutting across the docs. Use it as a "did I miss anything" list, not as primary reading.
Before you start
- Everything works in dev on a local workspace.
- You have the plugins you actually want installed and working.
- You have a specific environment (server, cluster, cloud provider) in mind.
If any of these is fuzzy, fix that first. This guide is about turning on the last mile, not about designing the deployment.
Configuration
kb.config.json
- No secrets in `kb.config.json`. Every sensitive value should be an env var reference or a placeholder your deploy pipeline replaces. The file gets committed; treat it as public.
- Set `platform.execution.mode` explicitly. Don't rely on `auto`. Pick `worker-pool` for single-node trusted deployments, `container` for multi-tenant or untrusted. See Concepts → Execution Model.
- Declare every adapter you need. Missing adapters default to NoOp; plugins that rely on them will silently degrade. List them all in `platform.adapters`.
- Set `platform.core.resources.defaultQuotas`. Per-tenant quotas have defaults — set them explicitly so you know what the cap is. See Concepts → Multi-Tenancy.
- Pin adapter versions. Don't use `^` ranges for production; pin to exact versions in `kb.config.json` entries so updates are deliberate.
- Commit the config. Put `kb.config.json` in version control so rollbacks are possible.
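Putting the items above together, a production `kb.config.json` might be shaped like this. The nesting follows the key paths named in this checklist; the leaf values under `adapters` and `defaultQuotas` are left as empty placeholders, since their schemas live in the Configuration reference:

```json
{
  "platform": {
    "execution": {
      "mode": "worker-pool",
      "workerPool": { "max": 10 }
    },
    "adapters": {},
    "core": {
      "resources": {
        "defaultQuotas": {}
      }
    }
  }
}
```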
See Configuration → kb.config.json.
Profiles
- Use a named production profile. Don't run the default profile in production. Define a `production` profile with explicit settings so dev and prod differ visibly.
- Set `KB_PROFILE=production` on every service process.
- Review profile inheritance. If your production profile extends another, verify the inherited settings are what you want.
Environment variables
- `GATEWAY_JWT_SECRET` — 32+ byte random value from a secret manager. Mandatory. Without it, the gateway runs with an insecure default and warns at startup.
- `GATEWAY_INTERNAL_SECRET` — separate 32+ byte value. Required when `platform.execution.mode === 'container'`.
- `NODE_ENV=production` on every service process.
- `KB_LOG_LEVEL=info` (or `warn` in very quiet deployments). Never `debug` or `trace` in prod — they flood log aggregators.
- `KB_TENANT_ID` — if you're single-tenant, set it to a meaningful identifier for billing and attribution.
- Adapter secrets — `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `QDRANT_API_KEY`, etc. Pull from your secret manager.
- Path vars — `KB_PROJECT_ROOT` if the services run in a directory different from the project.
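A minimal sketch of the production environment, as a shell fragment. In a real deployment you would generate each secret once, store it in your secret manager, and inject it at deploy time; the inline `openssl rand` calls here are only to show the expected size:

```shell
# Each secret must be an independent 32-byte random value.
# Generate once and store in your secret manager; do not regenerate per boot.
export GATEWAY_JWT_SECRET="$(openssl rand -hex 32)"
export GATEWAY_INTERNAL_SECRET="$(openssl rand -hex 32)"  # only needed in container mode

export NODE_ENV=production
export KB_PROFILE=production
export KB_LOG_LEVEL=info

# Adapter secrets come from the secret manager too, never hardcoded:
# export OPENAI_API_KEY=...
```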
See Configuration → Environment Variables.
Security
- Run services as a non-root user. Each service should have a dedicated user with no login shell.
- Filesystem permissions. The service user owns the workspace; no other user has write access.
- Don't expose internal services. Only the gateway is public-facing. REST API, workflow daemon, marketplace, state daemon should all be on an internal network.
- TLS termination. Put Caddy, nginx, or Traefik in front of the gateway. The gateway itself doesn't do TLS.
- Rate limit at the reverse proxy. For HTTP-level DDoS protection. Platform-level rate limits are per-tenant, not per-source-IP.
- Review every plugin's permissions. Each `@installed/plugin-cli/manifest.ts` declares permissions — know what each plugin can do before you install it.
- Don't install random plugins. Plugins from the marketplace run with whatever permissions they declare. Treat plugin installs as dependency audits.
- Never use `fullEnvPreset` in production. It grants full env access. Trusted plugins only.
- Rotate secrets on a schedule. JWT secret, API keys, service account credentials. Your security policy dictates the schedule; the platform supports in-place rotation via env var + restart.
- Container execution for untrusted plugins. If you're running plugins from the marketplace or from sources you don't fully trust, use `mode: 'container'` to get OS-level isolation.
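As one concrete shape for the TLS and rate-limit items above, an nginx sketch that terminates TLS and rate-limits per source IP in front of the gateway. The hostname, certificate paths, upstream port, and rates are all placeholders for your environment:

```nginx
# One shared zone tracking request rate per client IP.
limit_req_zone $binary_remote_addr zone=kb:10m rate=20r/s;

server {
    listen 443 ssl;
    server_name kb.example.com;

    ssl_certificate     /etc/ssl/kb.pem;
    ssl_certificate_key /etc/ssl/kb.key;

    location / {
        limit_req zone=kb burst=40;
        # The gateway itself speaks plain HTTP on an internal port.
        proxy_pass http://127.0.0.1:8080;
    }
}
```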
See Operations → Security and Gateway → Self-Hosted.
Observability
- Ship logs to a log aggregator. Loki, Elasticsearch, Datadog, whatever you have. Default stdout logging is not a production answer.
- Scrape Prometheus metrics. Every service exposes `/metrics`. Add scrape configs for each.
- Set up alerts. See Operations → Observability → Alerting for starter PromQL rules.
- Log correlation IDs. Verify `traceId`, `requestId`, and `tenantId` are preserved in your log pipeline. Without them, cross-service debugging is impossible.
- Analytics persistence. Configure a real analytics adapter (SQLite, DuckDB, or a custom one). NoOp means no historical data.
- Test incident response. Run a drill. Simulate a service crash, a rogue plugin, a secret leak. If your playbook doesn't work in the drill, fix it before you need it.
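A `prometheus.yml` fragment for the scrape item above. The job names, hostnames, and metrics port are placeholders — point each target at wherever that service actually listens (Prometheus scrapes `/metrics` by default, so no `metrics_path` is needed):

```yaml
scrape_configs:
  - job_name: kb-gateway
    static_configs:
      - targets: ["gateway.internal:9464"]
  - job_name: kb-rest-api
    static_configs:
      - targets: ["rest-api.internal:9464"]
  - job_name: kb-workflow-daemon
    static_configs:
      - targets: ["workflow.internal:9464"]
```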
See Operations → Observability and Operations → Incident Response.
Data durability
- Back up SQLite files. If you're using SQLite for analytics, databases, or the gateway host store, back up on a schedule. See Operations → Runbooks → Back up SQLite data.
- Put persistent volumes on real storage. Not tmpfs, not container-local FS. EBS, GCP PD, Azure disk, or equivalent.
- Back up the workspace lock file. `.kb/marketplace.lock` is the source of truth for installed plugins. Lose it and discovery returns nothing until you re-install everything.
- Test restore. A backup you haven't tested restoring is not a backup.
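The SQLite backup itself can be as small as a cron entry around `sqlite3 .backup`, which snapshots page-by-page under SQLite's own locking and is safe even while a service holds the file open, unlike a raw `cp`. The paths below are placeholders, and a throwaway database is created so the sketch runs standalone:

```shell
DB="${KB_SQLITE_DB:-/tmp/kb-demo.db}"
DEST="${KB_SQLITE_BACKUP:-/tmp/kb-demo.backup.db}"

# Demo table so the snippet is runnable on its own; in production $DB is
# the live analytics / gateway host-store database.
sqlite3 "$DB" "CREATE TABLE IF NOT EXISTS events (id INTEGER);"

# Consistent online snapshot, safe against concurrent writers.
sqlite3 "$DB" ".backup '$DEST'"

# Never trust an unverified copy.
sqlite3 "$DEST" "PRAGMA integrity_check;"
```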
Rolling updates
- Deploy via CI, not manually. Automate the build → test → publish → deploy pipeline. Manual deploys are the #1 source of production incidents.
- Version every plugin. Pin plugin versions in the marketplace lock; don't run `pnpm kb marketplace update` without a changelog review.
- Roll out gradually. For multi-instance deployments, roll out to one instance, observe, then the rest.
- Rollback plan. Know how to roll back. `git revert` + redeploy is the usual path; make sure it actually works.
- Workflow daemon restart impact. The workflow daemon is stateful. Restarting it interrupts in-flight runs. Schedule restarts during low-traffic windows and know the `'interrupted'` recovery path.
See Operations → Runbooks → Rolling updates.
Scaling
- Gateway can scale horizontally. All instances need the same `GATEWAY_JWT_SECRET`. Host registry should be backed by a shared database for persistence across instances.
- REST API can scale horizontally. Stateless except for the read-only plugin registry.
- Workflow daemon cannot scale horizontally yet. Run one instance, vertically scale. Multi-instance workflow is roadmap.
- Marketplace and state daemon are low-load. One instance each is plenty.
- Size worker pools. `platform.execution.workerPool.max` should match your concurrency needs. The default of 10 is fine for most deployments.
- Monitor memory and CPU. Set resource requests and limits in Kubernetes; use `systemd` cgroups on bare hosts.
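For the bare-host case, a systemd drop-in is enough to put cgroup limits on a service. The unit name and the specific limits here are illustrative:

```ini
# /etc/systemd/system/kb-gateway.service.d/resources.conf
[Service]
MemoryMax=2G
CPUQuota=200%
TasksMax=512
```

After adding the drop-in, run `systemctl daemon-reload` and restart the unit for the limits to take effect.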
Multi-tenancy (if applicable)
- Enable distributed state. `platform.core.resourceBroker.distributed: true` plus a running state daemon.
- Require `X-Tenant-ID` on every request. Don't fall back to the default tenant in production.
- Validate tenant claims. Your auth middleware should verify the caller is allowed to act as the claimed tenant.
- Configure per-tenant quotas. At least set `defaultQuotas` on the resource broker.
- Plugins must key on `ctx.tenantId`. Every cache key, storage path, and state broker operation needs tenant scoping. Review every plugin's data access patterns.
- Billing integration. Analytics events carry the tenant ID — wire them into your billing system.
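The tenant-keying rule above can be made mechanical with one small helper that every cache, storage, and state access goes through. The `PluginContext` shape here is illustrative — only `ctx.tenantId` is named by the docs:

```typescript
// Illustrative context shape; the real plugin context has more fields.
interface PluginContext {
  tenantId: string;
}

// Prefix every cache/storage/state key with the tenant ID so one tenant
// can never read or clobber another tenant's entries.
function scopedKey(ctx: PluginContext, key: string): string {
  if (!ctx.tenantId) {
    throw new Error("refusing to build a key without a tenant ID");
  }
  return `${ctx.tenantId}:${key}`;
}

// Two tenants caching under the same logical key stay separate:
scopedKey({ tenantId: "acme" }, "embeddings/index");   // "acme:embeddings/index"
scopedKey({ tenantId: "globex" }, "embeddings/index"); // "globex:embeddings/index"
```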
Dev ergonomics (for the team)
- Document the deployment. A short README covering how to access logs, where the config lives, who to page for incidents.
- Local dev parity. Make sure team members can reproduce production issues locally. Provide a `dev.config.json` that mirrors production structure.
- Onboarding runbook. New team members should be able to set up a local KB Labs environment in under an hour.
- Testing strategy. Unit tests, integration tests, smoke tests. See Guides → Testing.
The final checklist
Before you call it done:
- Everything works on a fresh server/cluster. Not just on the dev's machine.
- Secrets are in a secret manager. Not in `kb.config.json`, not in committed `.env` files.
- Logs flow to an aggregator. You can search for an error from the last 24 hours in under 30 seconds.
- Metrics are scraped. Your Prometheus shows latency, error rate, and resource usage for every service.
- Alerts are set up. Critical alerts (service down, high error rate, disk full) page someone.
- Backups run. And at least one has been test-restored.
- Rollback works. You've rehearsed reverting a bad deploy.
- Incident runbook exists. Everyone on the team knows where it is.
- At least two people can operate the system. Don't have a single point of failure in the team.
Missing any of these? It's not production-ready. Fix before going live.
What to read next
- Operations → Deployment — deployment topologies.
- Operations → Security — security model and threat model.
- Operations → Observability — logs, metrics, traces.
- Operations → Runbooks — day-to-day operational procedures.
- Operations → Incident Response — playbooks for when things go wrong.
- Gateway → Self-Hosted — gateway-specific production concerns.