Production Readiness
Last updated April 7, 2026
Checklist for moving a KB Labs deployment from dev to production.
Going to production with KB Labs means turning on the things that are disabled or loose in dev: real secrets, hard quotas, sandbox isolation, observability, backups. This guide is the checklist — every item with its reason and the reference page for how to do it.
Nothing here is novel; it's all cross-cutting across the docs. Use it as a "did I miss anything" list, not as primary reading.
Before you start
- Everything works in dev on a local workspace.
- You have the plugins you actually want installed and working.
- You have a specific environment (server, cluster, cloud provider) in mind.
If any of these is fuzzy, fix that first. This guide is about turning on the last mile, not about designing the deployment.
Configuration
kb.config.json
- No secrets in `kb.config.json`. Every sensitive value should be an env var reference or a placeholder your deploy pipeline replaces. The file gets committed; treat it as public.
- Set `platform.execution.mode` explicitly. Don't rely on `auto`. Pick `worker-pool` for single-node trusted deployments, `container` for multi-tenant or untrusted. See Concepts → Execution Model.
- Declare every adapter you need. Missing adapters default to NoOp; plugins that rely on them will silently degrade. List them all in `platform.adapters`.
- Set `platform.core.resources.defaultQuotas`. Per-tenant quotas have defaults — set them explicitly so you know what the cap is. See Concepts → Multi-Tenancy.
- Pin adapter versions. Don't use `^` ranges for production; pin to exact versions in `kb.config.json` entries so updates are deliberate.
- Commit the config. Put `kb.config.json` in version control so rollbacks are possible.
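Putting the items above together, a production `kb.config.json` might be shaped like this. The nesting follows the key paths named in this checklist; the leaf values under `adapters` and `defaultQuotas` are left as empty placeholders, since their schemas live in the Configuration reference:

```json
{
  "platform": {
    "execution": {
      "mode": "worker-pool",
      "workerPool": { "max": 10 }
    },
    "adapters": {},
    "core": {
      "resources": {
        "defaultQuotas": {}
      }
    }
  }
}
```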
See Configuration → kb.config.json.
Profiles
- Use a named production profile. Don't run the default profile in production. Define a `production` profile with explicit settings so dev and prod differ visibly.
- Set `KB_PROFILE=production` on every service process.
- Review profile inheritance. If your production profile extends another, verify the inherited settings are what you want.
Environment variables
- `GATEWAY_JWT_SECRET` — 32+ byte random value from a secret manager. Mandatory. Without it, the gateway runs with an insecure default and warns at startup.
- `GATEWAY_INTERNAL_SECRET` — separate 32+ byte value. Required when `platform.execution.mode === 'container'`.
- `NODE_ENV=production` on every service process.
- `KB_LOG_LEVEL=info` (or `warn` in very quiet deployments). Never `debug` or `trace` in prod — they flood log aggregators.
- `KB_TENANT_ID` — if you're single-tenant, set it to a meaningful identifier for billing and attribution.
- Adapter secrets — `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `QDRANT_API_KEY`, etc. Pull from your secret manager.
- Path vars — `KB_PROJECT_ROOT` if the services run in a directory different from the project.
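A minimal sketch of the production environment, as a shell fragment. In a real deployment you would generate each secret once, store it in your secret manager, and inject it at deploy time; the inline `openssl rand` calls here are only to show the expected size:

```shell
# Each secret must be an independent 32-byte random value.
# Generate once and store in your secret manager; do not regenerate per boot.
export GATEWAY_JWT_SECRET="$(openssl rand -hex 32)"
export GATEWAY_INTERNAL_SECRET="$(openssl rand -hex 32)"  # only needed in container mode

export NODE_ENV=production
export KB_PROFILE=production
export KB_LOG_LEVEL=info

# Adapter secrets come from the secret manager too, never hardcoded:
# export OPENAI_API_KEY=...
```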
See Configuration → Environment Variables.
Security
- Run services as a non-root user. Each service should have a dedicated user with no login shell.
- Filesystem permissions. The service user owns the workspace; no other user has write access.
- Don't expose internal services. Only the gateway is public-facing. REST API, workflow daemon, marketplace, state daemon should all be on an internal network.
- TLS termination. Put Caddy, nginx, or Traefik in front of the gateway. The gateway itself doesn't do TLS.
- Rate limit at the reverse proxy. For HTTP-level DDoS protection. Platform-level rate limits are per-tenant, not per-source-IP.
- Review every plugin's permissions. Each `@installed/plugin-cli/manifest.ts` declares permissions — know what each plugin can do before you install it.
- Don't install random plugins. Plugins from the marketplace run with whatever permissions they declare. Treat plugin installs as dependency audits.
- Never use `fullEnvPreset` in production. It grants full env access. Trusted plugins only.
- Rotate secrets on a schedule. JWT secret, API keys, service account credentials. Your security policy dictates the schedule; the platform supports in-place rotation via env var + restart.
- Container execution for untrusted plugins. If you're running plugins from the marketplace or from sources you don't fully trust, use `mode: 'container'` to get OS-level isolation.
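As one concrete shape for the TLS and rate-limit items above, an nginx sketch that terminates TLS and rate-limits per source IP in front of the gateway. The hostname, certificate paths, upstream port, and rates are all placeholders for your environment:

```nginx
# One shared zone tracking request rate per client IP.
limit_req_zone $binary_remote_addr zone=kb:10m rate=20r/s;

server {
    listen 443 ssl;
    server_name kb.example.com;

    ssl_certificate     /etc/ssl/kb.pem;
    ssl_certificate_key /etc/ssl/kb.key;

    location / {
        limit_req zone=kb burst=40;
        # The gateway itself speaks plain HTTP on an internal port.
        proxy_pass http://127.0.0.1:8080;
    }
}
```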
See Operations → Security and Gateway → Self-Hosted.
Observability
- Ship logs to a log aggregator. Loki, Elasticsearch, Datadog, whatever you have. Default stdout logging is not a production answer.
- Scrape Prometheus metrics. Every service exposes `/metrics`. Add scrape configs for each.
- Set up alerts. See Operations → Observability → Alerting for starter PromQL rules.
- Log correlation IDs. Verify `traceId`, `requestId`, and `tenantId` are preserved in your log pipeline. Without them, cross-service debugging is impossible.
- Analytics persistence. Configure a real analytics adapter (SQLite, DuckDB, or a custom one). NoOp means no historical data.
- Test incident response. Run a drill. Simulate a service crash, a rogue plugin, a secret leak. If your playbook doesn't work in the drill, fix it before you need it.
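A `prometheus.yml` fragment for the scrape item above. The job names, hostnames, and metrics port are placeholders — point each target at wherever that service actually listens (Prometheus scrapes `/metrics` by default, so no `metrics_path` is needed):

```yaml
scrape_configs:
  - job_name: kb-gateway
    static_configs:
      - targets: ["gateway.internal:9464"]
  - job_name: kb-rest-api
    static_configs:
      - targets: ["rest-api.internal:9464"]
  - job_name: kb-workflow-daemon
    static_configs:
      - targets: ["workflow.internal:9464"]
```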
See Operations → Observability and Operations → Incident Response.
Data durability
- Back up SQLite files. If you're using SQLite for analytics, databases, or the gateway host store, back up on a schedule. See Operations → Runbooks → Back up SQLite data.
- Put persistent volumes on real storage. Not tmpfs, not container-local FS. EBS, GCP PD, Azure disk, or equivalent.
- Back up the workspace lock file. `.kb/marketplace.lock` is the source of truth for installed plugins. Lose it and discovery returns nothing until you re-install everything.
- Test restore. A backup you haven't tested restoring is not a backup.
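The SQLite backup itself can be as small as a cron entry around `sqlite3 .backup`, which snapshots page-by-page under SQLite's own locking and is safe even while a service holds the file open, unlike a raw `cp`. The paths below are placeholders, and a throwaway database is created so the sketch runs standalone:

```shell
DB="${KB_SQLITE_DB:-/tmp/kb-demo.db}"
DEST="${KB_SQLITE_BACKUP:-/tmp/kb-demo.backup.db}"

# Demo table so the snippet is runnable on its own; in production $DB is
# the live analytics / gateway host-store database.
sqlite3 "$DB" "CREATE TABLE IF NOT EXISTS events (id INTEGER);"

# Consistent online snapshot, safe against concurrent writers.
sqlite3 "$DB" ".backup '$DEST'"

# Never trust an unverified copy.
sqlite3 "$DEST" "PRAGMA integrity_check;"
```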
Rolling updates
- Deploy via CI, not manually. Automate the build → test → publish → deploy pipeline. Manual deploys are the #1 source of production incidents.
- Version every plugin. Pin plugin versions in the marketplace lock; don't run `pnpm kb marketplace update` without a changelog review.
- Roll out gradually. For multi-instance deployments, roll out to one instance, observe, then the rest.
- Rollback plan. Know how to roll back. `git revert` + redeploy is the usual path; make sure it actually works.
- Workflow daemon restart impact. The workflow daemon is stateful. Restarting it interrupts in-flight runs. Schedule restarts during low-traffic windows and know the `'interrupted'` recovery path.
See Operations → Runbooks → Rolling updates.
Scaling
- Gateway can scale horizontally. All instances need the same `GATEWAY_JWT_SECRET`. Host registry should be backed by a shared database for persistence across instances.
- REST API can scale horizontally. Stateless except for the read-only plugin registry.
- Workflow daemon cannot scale horizontally yet. Run one instance, vertically scale. Multi-instance workflow is roadmap.
- Marketplace and state daemon are low-load. One instance each is plenty.
- Size worker pools. `platform.execution.workerPool.max` should match your concurrency needs. The default of 10 is fine for most deployments.
- Monitor memory and CPU. Set resource requests and limits in Kubernetes; use `systemd` cgroups on bare hosts.
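For the bare-host case, a systemd drop-in is enough to put cgroup limits on a service. The unit name and the specific limits here are illustrative:

```ini
# /etc/systemd/system/kb-gateway.service.d/resources.conf
[Service]
MemoryMax=2G
CPUQuota=200%
TasksMax=512
```

After adding the drop-in, run `systemctl daemon-reload` and restart the unit for the limits to take effect.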
Multi-tenancy (if applicable)
- Enable distributed state. `platform.core.resourceBroker.distributed: true` plus a running state daemon.
- Require `X-Tenant-ID` on every request. Don't fall back to the default tenant in production.
- Validate tenant claims. Your auth middleware should verify the caller is allowed to act as the claimed tenant.
- Configure per-tenant quotas. At least set `defaultQuotas` on the resource broker.
- Plugins must key on `ctx.tenantId`. Every cache key, storage path, and state broker operation needs tenant scoping. Review every plugin's data access patterns.
- Billing integration. Analytics events carry the tenant ID — wire them into your billing system.
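The tenant-keying rule above can be made mechanical with one small helper that every cache, storage, and state access goes through. The `PluginContext` shape here is illustrative — only `ctx.tenantId` is named by the docs:

```typescript
// Illustrative context shape; the real plugin context has more fields.
interface PluginContext {
  tenantId: string;
}

// Prefix every cache/storage/state key with the tenant ID so one tenant
// can never read or clobber another tenant's entries.
function scopedKey(ctx: PluginContext, key: string): string {
  if (!ctx.tenantId) {
    throw new Error("refusing to build a key without a tenant ID");
  }
  return `${ctx.tenantId}:${key}`;
}

// Two tenants caching under the same logical key stay separate:
scopedKey({ tenantId: "acme" }, "embeddings/index");   // "acme:embeddings/index"
scopedKey({ tenantId: "globex" }, "embeddings/index"); // "globex:embeddings/index"
```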
Dev ergonomics (for the team)
- Document the deployment. A short README covering how to access logs, where the config lives, who to page for incidents.
- Local dev parity. Make sure team members can reproduce production issues locally. Provide a `dev.config.json` that mirrors production structure.
- Onboarding runbook. New team members should be able to set up a local KB Labs environment in under an hour.
- Testing strategy. Unit tests, integration tests, smoke tests. See Guides → Testing.
The final checklist
Before you call it done:
- Everything works on a fresh server/cluster. Not just on the dev's machine.
- Secrets are in a secret manager. Not in `kb.config.json`, not in committed `.env` files.
- Logs flow to an aggregator. You can search for an error from the last 24 hours in under 30 seconds.
- Metrics are scraped. Your Prometheus shows latency, error rate, and resource usage for every service.
- Alerts are set up. Critical alerts (service down, high error rate, disk full) page someone.
- Backups run. And at least one has been test-restored.
- Rollback works. You've rehearsed reverting a bad deploy.
- Incident runbook exists. Everyone on the team knows where it is.
- At least two people can operate the system. Don't have a single point of failure in the team.
Missing any of these? It's not production-ready. Fix before going live.
What to read next
- Operations → Deployment — deployment topologies.
- Operations → Security — security model and threat model.
- Operations → Observability — logs, metrics, traces.
- Operations → Runbooks — day-to-day operational procedures.
- Operations → Incident Response — playbooks for when things go wrong.
- Gateway → Self-Hosted — gateway-specific production concerns.