Observability
Last updated April 7, 2026
Logs, metrics, traces, and diagnostic events — how to see what's happening in a running platform.
KB Labs has three observability layers: structured logs, Prometheus metrics, and analytics events. Each serves a different purpose: logs capture what happened and why, metrics capture rates and quantities, analytics captures business-level events (plugin runs, workflow completions, cache hits). A fourth layer — distributed tracing — is wired into log correlation but doesn't yet emit to an OpenTelemetry collector out of the box.
All four are produced by the same platform singleton, so the same observability pipeline covers every service: REST API, workflow daemon, marketplace, gateway, state daemon.
Logs
Every service uses platform.logger from @kb-labs/core-platform — a structured logger that writes JSON records with correlation IDs, tenant IDs, trace IDs, and free-form metadata. The adapter that implements it is @kb-labs/adapters-pino by default.
What gets logged
- Bootstrap events — adapter init, config load, startup diagnostics.
- Request lifecycle — every incoming REST request with `requestId`, `traceId`, method, path, status, duration.
- Plugin execution — every handler call with `pluginId`, `pluginVersion`, `host`, input summary, result summary, duration.
- Adapter calls — wrapped LLM/cache/storage calls with timing and token counts.
- Diagnostic events — significant state changes (adapter failure, plugin disabled, workspace materialization error) tagged with machine-readable `code` and `reasonCode` fields.
- User-land logs — anything plugin handlers emit via `useLogger()`.
The log format
```json
{
  "level": "info",
  "time": 1729500000000,
  "serviceId": "rest",
  "logsSource": "rest",
  "layer": "rest",
  "requestId": "req_abc123",
  "traceId": "trace_xyz",
  "spanId": "span_1",
  "tenantId": "default",
  "pluginId": "@kb-labs/commit",
  "message": "commit:commit started",
  "meta": {
    "scope": "packages/core",
    "dryRun": false
  }
}
```

Fields are attached by the correlated logger (`createCorrelatedLogger`) — each service binds its `serviceId`, `logsSource`, and `layer` at bootstrap, so every log line from that service carries them automatically.
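To make the binding behavior concrete, here is a minimal sketch of a correlated logger. The function name echoes `createCorrelatedLogger` from the docs, but the signature, the `sink` parameter, and the record-merging details are illustrative assumptions, not the actual @kb-labs/core-platform API:

```typescript
type LogRecord = Record<string, unknown>;

interface BaseFields {
  serviceId: string;
  logsSource: string;
  layer: string;
}

// Illustrative sketch — not the real @kb-labs/core-platform export.
// Service-level bindings are merged into every record, so each line
// carries serviceId/logsSource/layer automatically.
function createCorrelatedLogger(base: BaseFields, sink: (r: LogRecord) => void) {
  return {
    info(message: string, fields: LogRecord = {}) {
      sink({ level: 'info', time: Date.now(), ...base, ...fields, message });
    },
  };
}

// Bind once at bootstrap...
const records: LogRecord[] = [];
const logger = createCorrelatedLogger(
  { serviceId: 'rest', logsSource: 'rest', layer: 'rest' },
  (r) => records.push(r),
);

// ...then each call only supplies the per-request correlation fields.
logger.info('commit:commit started', {
  requestId: 'req_abc123',
  traceId: 'trace_xyz',
  tenantId: 'default',
});
```

The payoff is that request handlers never have to repeat the service identity; they only attach what varies per request.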
Shipping logs
Options, in order of complexity:
- stdout — default. Pipe to `journalctl` / Kubernetes logs / Docker logs. Fine for dev and small deployments.
- File persistence — use `@kb-labs/adapters-log-sqlite` as an extension adapter. Every log record goes to both Pino (stdout) and SQLite, enabling query through the logs API (`GET /api/v1/logs`).
- Ring buffer — use `@kb-labs/adapters-log-ringbuffer` for an in-memory buffer of recent logs that Studio can stream without touching persistent storage.
- External shippers — run a log collector (Vector, Fluent Bit, Promtail) alongside each service to ship JSON logs to Loki/Elasticsearch/Datadog.
The ring buffer and SQLite persistence extensions both hook into the logger via onLog — see Adapters → Overview → Extension points for how they wire themselves in.
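A ring-buffer extension in the spirit of `@kb-labs/adapters-log-ringbuffer` is easy to picture. This sketch assumes an `onLog(record)` hook as the extension point (the hook name comes from the docs above; the class, capacity handling, and record shape are illustrative):

```typescript
type LogRecord = Record<string, unknown>;

// Illustrative sketch of a bounded in-memory log buffer.
class LogRingBuffer {
  private buf: LogRecord[] = [];

  constructor(private capacity: number) {}

  // Called by the logger for every record; oldest entries are evicted
  // once the buffer is full, so memory use stays bounded.
  onLog = (record: LogRecord): void => {
    this.buf.push(record);
    if (this.buf.length > this.capacity) this.buf.shift();
  };

  // What a Studio-style stream would read: the most recent records.
  recent(): LogRecord[] {
    return [...this.buf];
  }
}

const ring = new LogRingBuffer(2);
ring.onLog({ message: 'first' });
ring.onLog({ message: 'second' });
ring.onLog({ message: 'third' }); // evicts 'first'
```

The design trade-off versus SQLite persistence is deliberate: no disk I/O on the hot logging path, at the cost of losing history on restart.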
Log levels
```ts
type LogLevel = 'trace' | 'debug' | 'info' | 'warn' | 'error' | 'fatal';
```

Default production level is `info`. `debug` is verbose (every request, every adapter call); use it when investigating an issue. `trace` is firehose-level — only useful inside small time windows.

Override at service startup with `KB_LOG_LEVEL=debug` or `LOG_LEVEL=debug`.
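The env vars above suggest a small resolution step at startup. A sketch, assuming `KB_LOG_LEVEL` takes precedence over `LOG_LEVEL` and unknown values fall back to `info` (the precedence and fallback are assumptions, not documented behavior):

```typescript
type LogLevel = 'trace' | 'debug' | 'info' | 'warn' | 'error' | 'fatal';

const LEVELS: readonly string[] = ['trace', 'debug', 'info', 'warn', 'error', 'fatal'];

// Resolve the service log level from the environment. Assumed order:
// KB_LOG_LEVEL, then LOG_LEVEL, then the documented default of 'info'.
// Unrecognized values also fall back to 'info' rather than crashing boot.
function resolveLogLevel(env: Record<string, string | undefined>): LogLevel {
  const raw = env.KB_LOG_LEVEL ?? env.LOG_LEVEL ?? '';
  return LEVELS.includes(raw) ? (raw as LogLevel) : 'info';
}
```

Falling back on bad input (instead of throwing) is the safer choice here: a typo in an env var should not take a service down.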
Metrics
Every platform service exposes Prometheus-format metrics. Most serve them at /metrics; the REST API serves them under /api/v1/metrics (as reflected in the scrape config below). The REST API service has two collectors running in parallel:
- Request metrics — per-route request counts, latency histograms, error rates (from `rest-api/src/middleware/metrics.ts`).
- System metrics — CPU, memory, event loop lag, open handles, GC pauses (from `rest-api/src/services/system-metrics-collector.ts`).
Standard metrics
Every service emits:
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| `kb_service_uptime_seconds` | gauge | service | How long since last start |
| `kb_http_requests_total` | counter | service, method, route, status | Request count |
| `kb_http_request_duration_ms` | histogram | service, method, route, status | Latency distribution |
| `kb_process_cpu_seconds_total` | counter | service | CPU time |
| `kb_process_memory_rss_bytes` | gauge | service | Resident memory |
| `kb_event_loop_lag_seconds` | histogram | service | Event loop delay |
Plus service-specific metrics. The REST API adds kb_rest_domain_operation_* for platform-level events (plugin executions, adapter calls). The workflow daemon adds kb_workflow_runs_total, kb_workflow_run_duration_seconds, kb_workflow_jobs_in_flight, etc. The gateway adds kb_gateway_upstream_requests_total for each upstream.
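The standard metrics above support the usual PromQL patterns. Two starting-point queries, assuming the label names from the table (tune windows and groupings to taste):

```promql
# p99 request latency per route over the last 5 minutes,
# computed from the duration histogram buckets.
histogram_quantile(0.99,
  sum by (service, route, le) (rate(kb_http_request_duration_ms_bucket[5m])))

# 5xx error ratio per service. Note the sum by (service) on both sides:
# without it, label matching would pair each 5xx series with itself.
sum by (service) (rate(kb_http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(kb_http_requests_total[5m]))
```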
Scraping
Standard Prometheus scrape config:
```yaml
scrape_configs:
  - job_name: 'kb-rest-api'
    static_configs:
      - targets: ['rest-api:5050']
    metrics_path: '/api/v1/metrics'
  - job_name: 'kb-workflow'
    static_configs:
      - targets: ['workflow:7778']
    metrics_path: '/metrics'
  - job_name: 'kb-gateway'
    static_configs:
      - targets: ['gateway:4000']
    metrics_path: '/metrics'
```

In Kubernetes, use a ServiceMonitor if you're running prometheus-operator:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kb-labs-services
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: kb-labs
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```

Alerting
Suggestions for PromQL alerting rules:
```yaml
groups:
  - name: kb-labs
    rules:
      - alert: KbHighErrorRate
        expr: |
          sum by (service) (rate(kb_http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(kb_http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High 5xx rate on {{ $labels.service }}"
      - alert: KbEventLoopLag
        expr: histogram_quantile(0.99, rate(kb_event_loop_lag_seconds_bucket[5m])) > 0.1
        for: 10m
        annotations:
          summary: "Event loop lag > 100ms on {{ $labels.service }}"
      - alert: KbWorkflowRunBacklog
        expr: kb_workflow_jobs_in_flight > 100
        for: 15m
        annotations:
          summary: "Workflow backlog building up"
```

Tune thresholds to your actual traffic patterns. The above are starting points, not universal defaults.
Analytics events
The third layer is business-level events, emitted through useAnalytics() and handled by whichever analytics adapter is configured (@kb-labs/adapters-analytics-sqlite by default, with DuckDB and file-based options).
What gets tracked
Every handler execution, LLM call, cache operation, workflow run, and marketplace action produces an analytics event with the schema defined in IAnalytics → `AnalyticsEvent`:
```ts
interface AnalyticsEvent {
  id: string;
  schema: 'kb.v1';
  type: string; // e.g. 'llm.completion.completed'
  ts: string;
  ingestTs: string;
  source: { product: string; version: string };
  runId: string;
  actor?: { type: 'user' | 'agent' | 'ci'; id?: string; name?: string };
  ctx?: Record<string, string | number | boolean | null>;
  payload?: unknown;
}
```

Standard event types:
- `llm.completion.completed` — { totalTokens, totalCost, model, tier, durationMs }
- `llm.chatWithTools.completed` — same plus tool call counts
- `embeddings.embed.completed` — { totalTokens, totalCost, durationMs }
- `vectorStore.search.completed` — { totalSearches, durationMs }
- `cache.hit` / `cache.miss` — { namespace, durationMs }
- `storage.read.completed` / `storage.write.completed` — { bytesRead, bytesWritten, durationMs }
- `plugin.command.completed` / `plugin.command.failed` — { pluginId, commandId, durationMs, exitCode }
- `workflow.run.completed` / `workflow.run.failed` — { workflowId, runId, durationMs, jobsTotal, jobsFailed }
Plugins can emit custom events via useAnalytics().track(name, payload).
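To show what a fully-formed event looks like against the interface above, here is a sketch that builds one by hand. In plugin code you would call `useAnalytics().track(name, payload)` instead; the `buildEvent` helper, the ID scheme, and the `source` values are all illustrative:

```typescript
// AnalyticsEvent shape as documented in IAnalytics.
interface AnalyticsEvent {
  id: string;
  schema: 'kb.v1';
  type: string;
  ts: string;
  ingestTs: string;
  source: { product: string; version: string };
  runId: string;
  actor?: { type: 'user' | 'agent' | 'ci'; id?: string; name?: string };
  ctx?: Record<string, string | number | boolean | null>;
  payload?: unknown;
}

// Hypothetical helper — the adapter would normally fill in id/ts/ingestTs.
function buildEvent(type: string, runId: string, payload?: unknown): AnalyticsEvent {
  const now = new Date().toISOString();
  return {
    id: `evt_${Math.random().toString(36).slice(2)}`, // illustrative ID scheme
    schema: 'kb.v1',
    type,
    ts: now,
    ingestTs: now,
    source: { product: 'my-plugin', version: '1.0.0' }, // hypothetical source
    runId,
    payload,
  };
}

// A custom event from a hypothetical review plugin.
const evt = buildEvent('myplugin.review.completed', 'run_123', { files: 4 });
```

Keeping custom event types namespaced (`myplugin.*`) avoids collisions with the standard `llm.*` / `cache.*` / `workflow.*` families.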
Querying analytics
Adapters with a read side expose:
```ts
analytics.getEvents({ type, source, from, to, limit, offset });
analytics.getStats(); // aggregate counts
analytics.getDailyStats({ groupBy, breakdownBy, metrics }); // time-series
```

See Adapters → IAnalytics for the full query API.
The REST API exposes these queries at /api/v1/analytics/* so Studio dashboards can read them without plugin code. See routes/analytics.ts.
Analytics vs metrics
Metrics are counters and histograms aggregated at scrape time. Good for rates ("how many requests per second") and distributions ("what's the p99 latency").
Analytics events are individual event records stored durably. Good for per-event queries ("which commits did plugin X make today"), breakdowns ("cost per model per day"), and auditing.
They're complementary. Use metrics for dashboards and alerting; use analytics for cost attribution, debugging specific incidents, and answering questions like "what did the plugin do in this run".
Distributed tracing
Every request carries a traceId and spanId in the correlated logger. When a request enters the gateway and is forwarded to an upstream, the trace ID is propagated via the X-Trace-ID header. Downstream services log under the same trace ID, so you can follow a single request across services by grepping for the ID in your log aggregator.
This is correlation, not full tracing. The platform does not emit OpenTelemetry spans out of the box. If you want to feed traces into Jaeger, Tempo, or Honeycomb, you have two options:
- Instrument the gateway — put an OpenTelemetry reverse proxy (like nginx with the OTel module) in front of the gateway. It produces real spans for every HTTP request.
- Patch the bootstrap — add `@opentelemetry/sdk-node` to your service startup. This is a custom patch; the platform doesn't include it.
For most deployments, correlation IDs in logs are enough — "I can find every log line for this request" covers most debugging scenarios without the overhead of a full tracing pipeline.
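The propagation step itself is small. A sketch of what a gateway hop might do with the `X-Trace-ID` header (the header name is from the docs; the ID format and mint-if-missing behavior are assumptions):

```typescript
// Reuse the incoming trace ID if the caller supplied one, otherwise mint
// a fresh one, and forward it downstream so every service logs under the
// same ID. Header keys are lowercased, as Node's http module normalizes them.
function propagateTraceId(incoming: Record<string, string>): Record<string, string> {
  const traceId =
    incoming['x-trace-id'] ?? `trace_${Math.random().toString(36).slice(2, 10)}`;
  return { ...incoming, 'x-trace-id': traceId };
}

const withExisting = propagateTraceId({ 'x-trace-id': 'trace_xyz' });
const minted = propagateTraceId({});
```

Minting at the edge means even requests from clients that send no trace header still get a single ID covering the whole hop chain.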
Diagnostic events
A separate channel for significant state changes and errors. Emitted via logDiagnosticEvent() from @kb-labs/core-platform:
```ts
import { logDiagnosticEvent } from '@kb-labs/core-platform';

logDiagnosticEvent(platform.logger, {
  domain: 'registry',
  event: 'gateway.hosts.restore',
  level: 'error',
  reasonCode: 'registry_restore_failed',
  message: 'Failed to restore gateway host registry',
  outcome: 'failed',
  error: err,
  serviceId: 'gateway',
  evidence: { persistentStore: true },
});
```

Diagnostic events are structured log entries with machine-readable `domain`, `event`, and `reasonCode` fields. They're what alerting hooks into — "fire an alert when we see a diagnostic with `reasonCode=registry_restore_failed`".
All platform services emit diagnostics at bootstrap (when adapters fail to load), at runtime (when significant things fail), and at shutdown (when cleanup fails). They show up as regular log entries with the extra fields — no separate pipeline.
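If your logs land in Loki, alerting on a `reasonCode` is a LogQL query over the JSON fields. A sketch, assuming the records are shipped as JSON lines under a `service` label (label names depend on your collector config):

```logql
# Find the diagnostic events themselves:
{service="gateway"} | json | reasonCode="registry_restore_failed"

# Or as an alerting-style expression (fires if any occurred in 5m):
sum(count_over_time(
  {service="gateway"} | json | reasonCode="registry_restore_failed" [5m]
)) > 0
```

The equivalent in Elasticsearch or Datadog is a saved query on the `reasonCode` field; the point is the same — diagnostics are queried where the logs live.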
Studio dashboards
Studio has built-in dashboards that read from the analytics and metrics endpoints. Navigate to /analytics in Studio to see:
- Events over time, grouped by type.
- LLM usage (tokens, cost, models) over time.
- Cache hit rates.
- Workflow run durations and outcomes.
- Per-tenant breakdowns (in multi-tenant deployments).
The dashboards are built on the same useData hook plugins use for their own pages. You can extend them by writing your own Studio plugin pages that query the analytics API — see Plugins → Studio Pages.
Debugging a specific request
When something goes wrong and you want to trace a specific request:
- Get the `requestId` from the failure — Studio error screens, Slack notifications, or HTTP response headers (`X-Request-ID`).
- Query logs: `serviceId:* AND requestId:req_abc123`.
- You'll see the full request lifecycle across every service it touched (gateway → REST → plugin → LLM).
- Cross-reference with analytics: the same `runId` or `requestId` appears in analytics events, so you can see exactly which LLM model was called, what tokens were used, what the final cost was.
- If metrics show a spike at the same time, it's probably a systemic issue; if only this one request was affected, it's a specific failure.
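Even without a log aggregator, the JSON log format makes this filter trivial to do by hand. A sketch over newline-delimited JSON logs (the field names match the log format documented above; the helper itself is illustrative):

```typescript
// Minimal shape we care about from each log line; other fields pass through.
interface LogLine {
  serviceId?: string;
  requestId?: string;
  message?: string;
  [k: string]: unknown;
}

// Parse NDJSON logs and keep only the lines for one request,
// preserving order so the lifecycle reads top to bottom.
function linesForRequest(ndjson: string, requestId: string): LogLine[] {
  return ndjson
    .split('\n')
    .filter(Boolean)
    .map((line) => JSON.parse(line) as LogLine)
    .filter((entry) => entry.requestId === requestId);
}

// Sample: one request crossing gateway and REST, plus an unrelated line.
const sample = [
  '{"serviceId":"gateway","requestId":"req_abc123","message":"forwarded"}',
  '{"serviceId":"rest","requestId":"req_abc123","message":"commit:commit started"}',
  '{"serviceId":"rest","requestId":"req_other","message":"unrelated"}',
].join('\n');

const hits = linesForRequest(sample, 'req_abc123');
```

This is essentially what the `serviceId:* AND requestId:...` aggregator query does, minus the indexing.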
Observability gotchas
- Log level affects everything. Setting `KB_LOG_LEVEL=debug` in production will flood your log aggregator with per-request adapter call logs. Use `debug` for short debugging windows, not as the default.
- Metrics cardinality matters. Don't add high-cardinality labels to Prometheus metrics (per-user, per-request-id). They'll blow up your Prometheus memory and scrape times. Use analytics events for per-event data.
- Analytics can be slow. The SQLite adapter is fine for small-to-medium volumes; at scale, use the DuckDB adapter (`@kb-labs/adapters-analytics-duckdb`) or ship events to a dedicated pipeline.
- Trace IDs need correlation. They're only useful if you can query across services. Make sure your log aggregator preserves the `traceId` field and lets you filter on it.
- Diagnostic events are logs, not a separate stream. Alerting rules need to query the log aggregator for fields like `reasonCode` — they're not a dedicated API.
What to read next
- Adapters → IAnalytics — the full analytics adapter API.
- Adapters → ILogger — the logger interface.
- Operations → Deployment — where to run observability collectors.
- Operations → Runbooks — specific incident-response patterns.
- Services → REST API — the service that hosts analytics and metrics endpoints.