KB Labs Docs

Observability

Last updated April 7, 2026


Logs, metrics, traces, and diagnostic events — how to see what's happening in a running platform.

KB Labs has three observability layers: structured logs, Prometheus metrics, and analytics events. Each serves a different purpose: logs capture what happened and why, metrics capture rates and quantities, analytics captures business-level events (plugin runs, workflow completions, cache hits). A fourth layer — distributed tracing — is wired into log correlation but doesn't yet emit to an OpenTelemetry collector out of the box.

All four are produced by the same platform singleton, so the same observability pipeline covers every service: REST API, workflow daemon, marketplace, gateway, state daemon.

Logs

Every service uses platform.logger from @kb-labs/core-platform — a structured logger that writes JSON records with correlation IDs, tenant IDs, trace IDs, and free-form metadata. The adapter that implements it is @kb-labs/adapters-pino by default.

What gets logged

  • Bootstrap events — adapter init, config load, startup diagnostics.
  • Request lifecycle — every incoming REST request with requestId, traceId, method, path, status, duration.
  • Plugin execution — every handler call with pluginId, pluginVersion, host, input summary, result summary, duration.
  • Adapter calls — wrapped LLM/cache/storage calls with timing and token counts.
  • Diagnostic events — significant state changes (adapter failure, plugin disabled, workspace materialization error) tagged with machine-readable code and reasonCode fields.
  • User-land logs — anything plugin handlers emit via useLogger().

The log format

JSON
{
  "level": "info",
  "time": 1729500000000,
  "serviceId": "rest",
  "logsSource": "rest",
  "layer": "rest",
  "requestId": "req_abc123",
  "traceId": "trace_xyz",
  "spanId": "span_1",
  "tenantId": "default",
  "pluginId": "@kb-labs/commit",
  "message": "commit:commit started",
  "meta": {
    "scope": "packages/core",
    "dryRun": false
  }
}

Fields are attached by the correlated logger (createCorrelatedLogger) — each service binds its serviceId, logsSource, and layer at bootstrap, so every log line from that service carries them automatically.
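The binding mechanics can be sketched as a wrapper that merges bound fields into every record. This is a simplified illustration of the idea, not the real createCorrelatedLogger implementation:

```typescript
type LogFields = Record<string, unknown>;
type LogFn = (record: LogFields) => void;

// Hypothetical sketch: bind service-level fields once, merge them into every record.
function correlatedLogger(base: LogFn, bound: LogFields): LogFn {
  return (record) => base({ ...bound, ...record, time: Date.now() });
}

// Usage: each service binds serviceId/logsSource/layer at bootstrap.
const records: LogFields[] = [];
const log = correlatedLogger((r) => records.push(r), {
  serviceId: 'rest',
  logsSource: 'rest',
  layer: 'rest',
});

log({ level: 'info', requestId: 'req_abc123', message: 'commit:commit started' });
// records[0] now carries serviceId, logsSource, and layer without the caller passing them.
```

Callers only supply per-request fields; everything bound at bootstrap rides along automatically.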

Shipping logs

Options, in order of complexity:

  • stdout — default. Pipe to journalctl / Kubernetes logs / Docker logs. Fine for dev and small deployments.
  • File persistence — use @kb-labs/adapters-log-sqlite as an extension adapter. Every log record goes to both Pino (stdout) and SQLite, enabling queries through the logs API (GET /api/v1/logs).
  • Ring buffer — use @kb-labs/adapters-log-ringbuffer for an in-memory buffer of recent logs that Studio can stream without touching persistent storage.
  • External shippers — run a log collector (Vector, Fluentbit, Promtail) alongside each service to ship JSON logs to Loki/Elasticsearch/Datadog.

The ring buffer and SQLite persistence extensions both hook into the logger via onLog — see Adapters → Overview → Extension points for how they wire themselves in.
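As an illustration of the ring-buffer idea, an onLog hook can push each record into a fixed-capacity buffer. This is a simplified sketch, not the actual @kb-labs/adapters-log-ringbuffer code:

```typescript
// Hypothetical sketch of a ring-buffer log sink: keeps only the last `capacity` records.
class LogRingBuffer<T> {
  private buf: T[] = [];
  constructor(private capacity: number) {}

  // Called from an onLog hook for every emitted record.
  onLog(record: T): void {
    this.buf.push(record);
    if (this.buf.length > this.capacity) this.buf.shift(); // drop the oldest
  }

  // What a UI would stream: a snapshot of recent records, oldest first.
  recent(): T[] {
    return [...this.buf];
  }
}

const ring = new LogRingBuffer<string>(3);
['a', 'b', 'c', 'd'].forEach((r) => ring.onLog(r));
// ring.recent() is now ['b', 'c', 'd']
```

Because the buffer is bounded, memory use is constant regardless of log volume, which is why it suits live streaming without persistent storage.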

Log levels

TypeScript
type LogLevel = 'trace' | 'debug' | 'info' | 'warn' | 'error' | 'fatal';

Default production level is info. debug is verbose (every request, every adapter call); use it when investigating an issue. trace is firehose-level — only useful inside small time windows.

Override at service startup with KB_LOG_LEVEL=debug or LOG_LEVEL=debug.
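The resolution can be sketched as a small helper. This is illustrative only, and it assumes KB_LOG_LEVEL takes precedence over LOG_LEVEL; the platform's actual resolution logic may differ:

```typescript
const LEVELS = ['trace', 'debug', 'info', 'warn', 'error', 'fatal'] as const;
type LogLevel = (typeof LEVELS)[number];

// Hypothetical sketch: KB_LOG_LEVEL wins over LOG_LEVEL (assumption);
// anything unrecognized falls back to the 'info' default.
function resolveLogLevel(env: Record<string, string | undefined>): LogLevel {
  const raw = env.KB_LOG_LEVEL ?? env.LOG_LEVEL ?? 'info';
  return (LEVELS as readonly string[]).includes(raw) ? (raw as LogLevel) : 'info';
}
```

Falling back to 'info' on an unrecognized value keeps a typo in the environment from silently enabling firehose-level logging.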

Metrics

Every platform service exposes Prometheus-format metrics at /metrics (the REST API serves them under /api/v1/metrics). Each service runs two collectors in parallel: the standard set shared by all services, and service-specific metrics on top:

Standard metrics

Every service emits:

Metric                         Type       Labels                          Purpose
kb_service_uptime_seconds      gauge      service                         How long since last start
kb_http_requests_total         counter    service, method, route, status  Request count
kb_http_request_duration_ms    histogram  service, method, route, status  Latency distribution
kb_process_cpu_seconds_total   counter    service                         CPU time
kb_process_memory_rss_bytes    gauge      service                         Resident memory
kb_event_loop_lag_seconds      histogram  service                         Event loop delay

Plus service-specific metrics. The REST API adds kb_rest_domain_operation_* for platform-level events (plugin executions, adapter calls). The workflow daemon adds kb_workflow_runs_total, kb_workflow_run_duration_seconds, kb_workflow_jobs_in_flight, etc. The gateway adds kb_gateway_upstream_requests_total for each upstream.

Scraping

Standard Prometheus scrape config:

YAML
scrape_configs:
  - job_name: 'kb-rest-api'
    static_configs:
      - targets: ['rest-api:5050']
    metrics_path: '/api/v1/metrics'
 
  - job_name: 'kb-workflow'
    static_configs:
      - targets: ['workflow:7778']
    metrics_path: '/metrics'
 
  - job_name: 'kb-gateway'
    static_configs:
      - targets: ['gateway:4000']
    metrics_path: '/metrics'

In Kubernetes, use a ServiceMonitor if you're running prometheus-operator:

YAML
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kb-labs-services
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: kb-labs
  endpoints:
  - port: http
    path: /metrics
    interval: 30s

Alerting

Suggestions for PromQL alerting rules:

YAML
groups:
- name: kb-labs
  rules:
  - alert: KbHighErrorRate
    expr: |
      rate(kb_http_requests_total{status=~"5.."}[5m])
        / rate(kb_http_requests_total[5m]) > 0.05
    for: 5m
    annotations:
      summary: "High 5xx rate on {{ $labels.service }}"
 
  - alert: KbEventLoopLag
    expr: histogram_quantile(0.99, rate(kb_event_loop_lag_seconds_bucket[5m])) > 0.1
    for: 10m
    annotations:
      summary: "Event loop lag > 100ms on {{ $labels.service }}"
 
  - alert: KbWorkflowRunBacklog
    expr: kb_workflow_jobs_in_flight > 100
    for: 15m
    annotations:
      summary: "Workflow backlog building up"

Tune thresholds to your actual traffic patterns. The above are starting points, not universal defaults.

Analytics events

The third layer is business-level events, emitted through useAnalytics() and handled by whichever analytics adapter is configured (@kb-labs/adapters-analytics-sqlite by default, with DuckDB and file-based options).

What gets tracked

Every handler execution, LLM call, cache operation, workflow run, and marketplace action produces an analytics event with the schema defined in IAnalytics → AnalyticsEvent:

TypeScript
interface AnalyticsEvent {
  id: string;
  schema: 'kb.v1';
  type: string;                     // e.g. 'llm.completion.completed'
  ts: string;
  ingestTs: string;
  source: { product: string; version: string };
  runId: string;
  actor?: { type: 'user' | 'agent' | 'ci'; id?: string; name?: string };
  ctx?: Record<string, string | number | boolean | null>;
  payload?: unknown;
}

Standard event types:

  • llm.completion.completed — { totalTokens, totalCost, model, tier, durationMs }
  • llm.chatWithTools.completed — same plus tool call counts
  • embeddings.embed.completed — { totalTokens, totalCost, durationMs }
  • vectorStore.search.completed — { totalSearches, durationMs }
  • cache.hit / cache.miss — { namespace, durationMs }
  • storage.read.completed / storage.write.completed — { bytesRead, bytesWritten, durationMs }
  • plugin.command.completed / plugin.command.failed — { pluginId, commandId, durationMs, exitCode }
  • workflow.run.completed / workflow.run.failed — { workflowId, runId, durationMs, jobsTotal, jobsFailed }

Plugins can emit custom events via useAnalytics().track(name, payload).
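The track() contract can be illustrated with an in-memory sink that stamps each event with the envelope fields from AnalyticsEvent. This is a hypothetical sketch for illustration, not one of the real adapters:

```typescript
interface AnalyticsEvent {
  id: string;
  schema: 'kb.v1';
  type: string;
  ts: string;
  ingestTs: string;
  source: { product: string; version: string };
  runId: string;
  payload?: unknown;
}

// Hypothetical in-memory analytics sink illustrating the event envelope.
class InMemoryAnalytics {
  events: AnalyticsEvent[] = [];
  constructor(
    private source = { product: 'example', version: '0.0.0' }, // assumed values
    private runId = 'run_1',
  ) {}

  track(type: string, payload?: unknown): AnalyticsEvent {
    const now = new Date().toISOString();
    const event: AnalyticsEvent = {
      id: `evt_${this.events.length + 1}`,
      schema: 'kb.v1',
      type,
      ts: now,       // when the event happened
      ingestTs: now, // when the sink recorded it (identical here)
      source: this.source,
      runId: this.runId,
      payload,
    };
    this.events.push(event);
    return event;
  }
}

const analytics = new InMemoryAnalytics();
const ev = analytics.track('commit.suggested', { files: 3 });
// ev.schema === 'kb.v1', ev.type === 'commit.suggested'
```

The caller supplies only the type and payload; the envelope (id, timestamps, source, runId) is filled in by the adapter.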

Querying analytics

Adapters with a read side expose:

TypeScript
analytics.getEvents({ type, source, from, to, limit, offset });
analytics.getStats();                               // aggregate counts
analytics.getDailyStats({ groupBy, breakdownBy, metrics }); // time-series

See Adapters → IAnalytics for the full query API.
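A getDailyStats-style aggregation amounts to bucketing raw events by day. Here is a minimal in-memory sketch of that idea (illustrative only, not the adapter implementation):

```typescript
interface RawEvent {
  type: string;
  ts: string; // ISO 8601 timestamp
}

// Hypothetical sketch: count events of one type per UTC day.
function dailyCounts(events: RawEvent[], type: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const e of events) {
    if (e.type !== type) continue;
    const day = e.ts.slice(0, 10); // 'YYYY-MM-DD' prefix of the ISO timestamp
    counts[day] = (counts[day] ?? 0) + 1;
  }
  return counts;
}

const counts = dailyCounts(
  [
    { type: 'cache.hit', ts: '2026-04-01T10:00:00Z' },
    { type: 'cache.hit', ts: '2026-04-01T11:00:00Z' },
    { type: 'cache.miss', ts: '2026-04-01T11:05:00Z' },
    { type: 'cache.hit', ts: '2026-04-02T09:00:00Z' },
  ],
  'cache.hit',
);
// counts → { '2026-04-01': 2, '2026-04-02': 1 }
```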

The REST API exposes these queries at /api/v1/analytics/* so Studio dashboards can read them without plugin code. See routes/analytics.ts.

Analytics vs metrics

Metrics are counters and histograms aggregated at scrape time. Good for rates ("how many requests per second") and distributions ("what's the p99 latency").

Analytics events are individual event records stored durably. Good for per-event queries ("which commits did plugin X make today"), breakdowns ("cost per model per day"), and auditing.

They're complementary. Use metrics for dashboards and alerting; use analytics for cost attribution, debugging specific incidents, and answering questions like "what did the plugin do in this run".

Distributed tracing

Every request carries a traceId and spanId in the correlated logger. When a request enters the gateway and is forwarded to an upstream, the trace ID is propagated via the X-Trace-ID header. Downstream services log under the same trace ID, so you can follow a single request across services by grepping for the ID in your log aggregator.
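The propagation step can be sketched as a small helper; the shapes here are illustrative, only the X-Trace-ID header name comes from the behavior described above:

```typescript
type HeaderMap = Record<string, string>;

// Hypothetical sketch: reuse the incoming trace ID if present, otherwise mint one,
// and attach it to the headers sent to the upstream service.
function propagateTrace(incoming: HeaderMap, upstream: HeaderMap): HeaderMap {
  const traceId =
    incoming['x-trace-id'] ?? `trace_${Math.random().toString(36).slice(2, 10)}`;
  return { ...upstream, 'x-trace-id': traceId };
}

const out = propagateTrace(
  { 'x-trace-id': 'trace_xyz' },
  { accept: 'application/json' },
);
// out['x-trace-id'] === 'trace_xyz'; downstream logs share the same trace ID
```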

This is correlation, not full tracing. The platform does not emit OpenTelemetry spans out of the box. If you want to feed traces into Jaeger, Tempo, or Honeycomb, you have two options:

  1. Instrument the gateway — put an OpenTelemetry reverse proxy (like nginx with the OTel module) in front of the gateway. It produces real spans for every HTTP request.
  2. Patch the bootstrap — add @opentelemetry/sdk-node to your service startup. This is a custom patch; the platform doesn't include it.

For most deployments, correlation IDs in logs are enough — "I can find every log line for this request" covers most debugging scenarios without the overhead of a full tracing pipeline.

Diagnostic events

A separate channel for significant state changes and errors. Emitted via logDiagnosticEvent() from @kb-labs/core-platform:

TypeScript
import { logDiagnosticEvent } from '@kb-labs/core-platform';
 
logDiagnosticEvent(platform.logger, {
  domain: 'registry',
  event: 'gateway.hosts.restore',
  level: 'error',
  reasonCode: 'registry_restore_failed',
  message: 'Failed to restore gateway host registry',
  outcome: 'failed',
  error: err,
  serviceId: 'gateway',
  evidence: { persistentStore: true },
});

Diagnostic events are structured log entries with machine-readable domain, event, and reasonCode fields. They're what alerting hooks into — "fire an alert when we see a diagnostic with reasonCode=registry_restore_failed".

All platform services emit diagnostics at bootstrap (when adapters fail to load), at runtime (when significant things fail), and at shutdown (when cleanup fails). They show up as regular log entries with the extra fields — no separate pipeline.

Studio dashboards

Studio has built-in dashboards that read from the analytics and metrics endpoints. Navigate to /analytics in Studio to see:

  • Events over time, grouped by type.
  • LLM usage (tokens, cost, models) over time.
  • Cache hit rates.
  • Workflow run durations and outcomes.
  • Per-tenant breakdowns (in multi-tenant deployments).

The dashboards are built on the same useData hook plugins use for their own pages. You can extend them by writing your own Studio plugin pages that query the analytics API — see Plugins → Studio Pages.

Debugging a specific request

When something goes wrong and you want to trace a specific request:

  1. Get the requestId from the failure — Studio error screens, Slack notifications, or HTTP response headers (X-Request-ID).
  2. Query logs: serviceId:* AND requestId:req_abc123.
  3. You'll see the full request lifecycle across every service it touched (gateway → REST → plugin → LLM).
  4. Cross-reference with analytics: the same runId or requestId appears in analytics events, so you can see exactly which LLM model was called, what tokens were used, what the final cost was.
  5. If metrics show a spike at the same time, it's probably a systemic issue; if only this one request was affected, it's a specific failure.
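The log-side part of those steps can be sketched as a filter over structured records. The record shape here is illustrative:

```typescript
interface LogRecord {
  time: number;
  serviceId: string;
  requestId?: string;
  message: string;
}

// Hypothetical sketch: pull every record for one request, across services, in time order.
function requestTimeline(logs: LogRecord[], requestId: string): LogRecord[] {
  return logs
    .filter((l) => l.requestId === requestId)
    .sort((a, b) => a.time - b.time);
}

const timeline = requestTimeline(
  [
    { time: 3, serviceId: 'rest', requestId: 'req_abc123', message: 'handler done' },
    { time: 1, serviceId: 'gateway', requestId: 'req_abc123', message: 'forwarded' },
    { time: 2, serviceId: 'rest', requestId: 'req_other', message: 'unrelated' },
  ],
  'req_abc123',
);
// timeline: the gateway 'forwarded' record first, then the rest 'handler done' record
```

This is exactly what a log-aggregator query like serviceId:* AND requestId:req_abc123 does, just expressed in code.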

Observability gotchas

  • Log level affects everything. Setting KB_LOG_LEVEL=debug in production will flood your log aggregator with per-request adapter call logs. Use debug for short debugging windows, not as the default.
  • Metrics cardinality matters. Don't add high-cardinality labels to Prometheus metrics (per-user, per-request-id). They'll blow up your Prometheus memory and scrape times. Use analytics events for per-event data.
  • Analytics can be slow. The SQLite adapter is fine for small-to-medium volumes; at scale, use the DuckDB adapter (@kb-labs/adapters-analytics-duckdb) or ship events to a dedicated pipeline.
  • Trace IDs need correlation. They're only useful if you can query across services. Make sure your log aggregator preserves the traceId field and lets you filter on it.
  • Diagnostic events are logs, not a separate stream. Alerting rules need to query the log aggregator for fields like reasonCode — they're not a dedicated API.