Observability
trstctl's serving control plane is instrumented so an operator can answer "is it healthy, and if not, where does it hurt" from telemetry alone (B6). Every request is traced, counted, and access-logged, and the real dependencies are health- and readiness-probed.
Endpoints
| Path | Purpose | Auth |
|---|---|---|
| /healthz | Liveness: the process is up and the signer (if configured) is reachable. | none |
| /readyz | Readiness: probes the real dependencies (PostgreSQL, NATS JetStream, the signer). Returns 200 when all are up, 503 with a per-dependency body when any is down. | none |
| /metrics | Prometheus metrics in the text exposition format. | none |
/readyz is what a Kubernetes readiness probe should target: when a dependency
drops, readiness flips to 503 and the pod is removed from rotation, while
/healthz (liveness) stays green so the pod is not killed for a transient
dependency blip.
curl -fksS https://localhost:8443/readyz # {"status":"ok","checks":{"db":"ok","nats":"ok","signer":"ok"}}
curl -fksS https://localhost:8443/metrics # # TYPE trstctl_http_requests_total counter ...
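The 200/503 behaviour of /readyz can be sketched as a small handler. This is a minimal illustration, not the control plane's actual code: the `check` type, the `alwaysUp`/`alwaysDown` probes, the `get` helper, and the "down"/"unavailable" strings in the failure body are all assumptions made for the example. Only the all-ok body shape is taken from the curl output above.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// check is a hypothetical dependency probe: nil means the dependency is up.
type check func(ctx context.Context) error

// Stand-in probes used only for this example.
func alwaysUp(context.Context) error   { return nil }
func alwaysDown(context.Context) error { return errors.New("connection refused") }

// readyz answers 200 when every check passes and 503 when any fails.
// The body lists each dependency either way. The "down" and "unavailable"
// strings are assumed for illustration.
func readyz(checks map[string]check) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		status, code := "ok", http.StatusOK
		results := make(map[string]string, len(checks))
		for name, probe := range checks {
			results[name] = "ok"
			if err := probe(r.Context()); err != nil {
				results[name] = "down"
				status, code = "unavailable", http.StatusServiceUnavailable
			}
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(map[string]any{"status": status, "checks": results})
	})
}

// get runs one in-memory GET against h and returns the status code and body.
func get(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	h := readyz(map[string]check{"db": alwaysUp, "nats": alwaysDown, "signer": alwaysUp})
	code, body := get(h, "/readyz")
	fmt.Println(code, body)
}
```

Because the handler reports every dependency rather than stopping at the first failure, an operator reading a 503 body can see all of the failing dependencies at once.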
Metrics
The control plane emits, at minimum:
- trstctl_http_requests_total{method,route,code}: a counter of HTTP requests by method, normalized route, and status code.
- trstctl_http_request_duration_seconds{method,route}: a latency histogram (with _bucket, _sum, _count).
- trstctl_signer_up: 1 when the out-of-process signer is healthy, else 0.
- trstctl_signer_restarts_total: cumulative relaunches of the signer child by the supervisor.
The signer is a separate, HTTP-less process (AN-4), so it cannot expose its own
/metrics; the control plane samples its health and restart count on a fixed
cadence and publishes them on the same registry as everything else. The sampler is
a background worker that stops cleanly on shutdown.
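A sampler of this kind might look like the sketch below. The `signerStatus` interface, the `signerGauges` type, and the 10 ms interval are hypothetical stand-ins chosen for the example; the real sampler, its cadence, and its registry types are not shown in this document.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// signerStatus is a hypothetical view of the supervisor that owns the signer child.
type signerStatus interface {
	Healthy() bool
	Restarts() uint64
}

// signerGauges holds the last sampled values. A real registry would expose
// them as trstctl_signer_up and trstctl_signer_restarts_total.
type signerGauges struct {
	up       atomic.Int64
	restarts atomic.Uint64
}

// sample copies the signer's current state into the gauges.
func (g *signerGauges) sample(s signerStatus) {
	var up int64
	if s.Healthy() {
		up = 1
	}
	g.up.Store(up)
	g.restarts.Store(s.Restarts())
}

// runSampler samples immediately, then once per interval, until ctx is cancelled.
func runSampler(ctx context.Context, s signerStatus, g *signerGauges, every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		g.sample(s)
		select {
		case <-ctx.Done():
			return // clean stop on shutdown
		case <-ticker.C:
		}
	}
}

// fakeSigner is a fixed-state stand-in used only for this example.
type fakeSigner struct {
	healthy  bool
	restarts uint64
}

func (f fakeSigner) Healthy() bool    { return f.healthy }
func (f fakeSigner) Restarts() uint64 { return f.restarts }

func main() {
	// The timeout stands in for a shutdown signal cancelling the context.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	g := &signerGauges{}
	runSampler(ctx, fakeSigner{healthy: true, restarts: 2}, g, 10*time.Millisecond)
	fmt.Println("trstctl_signer_up", g.up.Load())
	fmt.Println("trstctl_signer_restarts_total", g.restarts.Load())
}
```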
Routes are normalized — opaque path segments (UUIDs, long hex ids, numeric
ids) are collapsed to :id — so per-id paths do not explode label cardinality and
no identifier leaks into a label.
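Route normalization of this kind can be sketched as a per-segment rewrite. The function name `normalizeRoute`, the three patterns, and the 16-character threshold for a "long hex id" are assumptions made for illustration; the actual rules used by the control plane may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Patterns for path segments treated as opaque identifiers.
// The 16-character minimum for hex ids is an assumed threshold.
var (
	uuidRe    = regexp.MustCompile(`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)
	longHexRe = regexp.MustCompile(`^[0-9a-fA-F]{16,}$`)
	numericRe = regexp.MustCompile(`^[0-9]+$`)
)

// normalizeRoute replaces every opaque segment of path with ":id" so the
// result is safe to use as a low-cardinality metric label.
func normalizeRoute(path string) string {
	segs := strings.Split(path, "/")
	for i, s := range segs {
		if uuidRe.MatchString(s) || longHexRe.MatchString(s) || numericRe.MatchString(s) {
			segs[i] = ":id"
		}
	}
	return strings.Join(segs, "/")
}

func main() {
	for _, p := range []string{
		"/healthz",
		"/v1/certs/42",
		"/v1/certs/123e4567-e89b-12d3-a456-426614174000/revoke",
		"/v1/keys/deadbeefdeadbeefdeadbeef",
	} {
		fmt.Println(p, "->", normalizeRoute(p))
	}
}
```

Matching whole segments, rather than searching inside them, keeps ordinary path words such as "certs" or "v1" untouched.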
Scrape /metrics with the example config in
deploy/observability/prometheus.example.yml.
Tracing
Every request is part of a distributed trace using the W3C Trace Context standard, so it interoperates with OpenTelemetry/Jaeger collectors on the wire:
- An inbound traceparent header is continued; otherwise a new trace starts.
- The trace id is returned on the response traceparent header and included in the structured access log, so a request is correlatable end to end.
- The trace spans subsystems: the readiness probes for PostgreSQL, NATS, and the signer run as child spans of the request, so one trace shows where time goes across dependencies.
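The continue-or-start rule in the first bullet can be sketched as follows. This is a simplified illustration, not the tracer in internal/observ: the function names are hypothetical, only version 00 headers are accepted (the W3C specification asks parsers to be more lenient with future versions), and marking new traces as sampled (flags 01) is an assumption.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"regexp"
	"strings"
)

// traceparentRe matches a version-00 header: 00-<trace-id>-<parent-id>-<flags>.
var traceparentRe = regexp.MustCompile(`^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)

// randomHex returns n random bytes as 2n lowercase hex characters.
func randomHex(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// continueOrStart returns the trace id to use for this request and the
// traceparent value to send back. A valid inbound header is continued with
// a fresh span id; anything else starts a new trace.
func continueOrStart(header string) (traceID, traceparent string) {
	flags := "01" // assumed default for a new trace: sampled
	m := traceparentRe.FindStringSubmatch(header)
	// All-zero trace ids and parent ids are invalid per the specification.
	if m != nil && m[1] != strings.Repeat("0", 32) && m[2] != strings.Repeat("0", 16) {
		traceID, flags = m[1], m[3]
	} else {
		traceID = randomHex(16)
	}
	return traceID, "00-" + traceID + "-" + randomHex(8) + "-" + flags
}

func main() {
	id, tp := continueOrStart("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println("continued:", id, tp)
	id, tp = continueOrStart("")
	fmt.Println("started:  ", id, tp)
}
```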
!!! note "OTLP export is a follow-up"
    The trace model is OpenTelemetry-shaped and W3C-traceparent-compatible on
    the wire today. Exporting spans over OTLP to a collector is wired behind a
    pluggable exporter seam (observ.Exporter) and is a tracked follow-up; the
    control plane does not bundle the OTel SDK yet.
Structured logs
The control plane logs in structured JSON (or text — set TRSTCTL_LOG_FORMAT)
via log/slog, wired into the serving path. Each request emits one access-log
record carrying the trace_id correlation field plus the method, normalized
route, status, response size, and duration.
Logs contain zero secret material (AN-8): the access log never records the
Authorization header, the request body, or the query string. From the request
it keeps only the method and the normalized route, alongside the status,
response size, duration, and trace_id. This is asserted by a test.
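An access-log middleware with these properties could be sketched with log/slog as below. The names `accessLog`, `statusRecorder`, `traceIDOf`, and `logOnce`, the "http_request" message, and the "bytes" and "duration" field names are hypothetical; `traceIDOf` is a crude stand-in for the real tracer, and the identity function passed as `route` stands in for route normalization.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// statusRecorder captures the status code and response size.
type statusRecorder struct {
	http.ResponseWriter
	status int
	bytes  int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func (r *statusRecorder) Write(p []byte) (int, error) {
	n, err := r.ResponseWriter.Write(p)
	r.bytes += n
	return n, err
}

// traceIDOf is a stand-in for the tracer: it reads the trace id field of an
// inbound traceparent header without validating it.
func traceIDOf(r *http.Request) string {
	if parts := strings.Split(r.Header.Get("traceparent"), "-"); len(parts) == 4 {
		return parts[1]
	}
	return ""
}

// accessLog emits one record per request. It reads r.URL.Path only, so the
// query string, headers such as Authorization, and the body never reach the log.
func accessLog(logger *slog.Logger, route func(string) string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		logger.LogAttrs(r.Context(), slog.LevelInfo, "http_request",
			slog.String("trace_id", traceIDOf(r)),
			slog.String("method", r.Method),
			slog.String("route", route(r.URL.Path)),
			slog.Int("status", rec.status),
			slog.Int("bytes", rec.bytes),
			slog.Duration("duration", time.Since(start)),
		)
	})
}

// logOnce sends one request through the middleware and returns the log output.
func logOnce(target, authz string) string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	inner := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_, _ = w.Write([]byte("ok"))
	})
	req := httptest.NewRequest(http.MethodGet, target, nil)
	req.Header.Set("Authorization", authz)
	accessLog(logger, func(p string) string { return p }, inner).ServeHTTP(httptest.NewRecorder(), req)
	return buf.String()
}

func main() {
	fmt.Print(logOnce("/v1/certs?token=s3cr3t", "Bearer s3cr3t"))
}
```

Building the record from an explicit allowlist of fields, rather than logging the request and redacting afterwards, means a newly added header or parameter cannot leak by default.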
Dashboards & alerts
Baseline operator assets ship under
deploy/observability/:
- alerts.yml: Prometheus alerting rules for control plane down, 5xx error rate above 5%, p99 latency above 1s, signer down, and signer restarting repeatedly. Every metric the rules reference is one the control plane actually emits (asserted by a test, so a rule can't reference a metric that does not exist).
- dashboard.json: a Grafana dashboard with request rate, error ratio, latency percentiles, throughput by status code, and signer up / restarts.
- prometheus.example.yml: a ready-to-use scrape + rules config.
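The "every referenced metric is emitted" check could be approximated as in the sketch below. It is not the project's actual test: `unknownMetrics` and `baseName` are hypothetical names, the rule expressions in `main` are invented examples, and scanning the rules text with a regular expression is a simplification of parsing the PromQL properly.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// metricRe finds trstctl_-prefixed metric names in rule expressions.
var metricRe = regexp.MustCompile(`\btrstctl_[a-z0-9_]+`)

// baseName maps a histogram series name (_bucket, _sum, _count) back to the
// histogram it belongs to, when that histogram is in the emitted set.
func baseName(name string, emitted map[string]bool) string {
	for _, suffix := range []string{"_bucket", "_sum", "_count"} {
		if base := strings.TrimSuffix(name, suffix); base != name && emitted[base] {
			return base
		}
	}
	return name
}

// unknownMetrics returns the sorted, de-duplicated metric names that the
// rules reference but the emitted set does not contain.
func unknownMetrics(rules string, emitted map[string]bool) []string {
	seen := map[string]bool{}
	var missing []string
	for _, name := range metricRe.FindAllString(rules, -1) {
		base := baseName(name, emitted)
		if !emitted[base] && !seen[base] {
			seen[base] = true
			missing = append(missing, base)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	emitted := map[string]bool{
		"trstctl_http_requests_total":           true,
		"trstctl_http_request_duration_seconds": true,
		"trstctl_signer_up":                     true,
		"trstctl_signer_restarts_total":         true,
	}
	// Invented expressions; the last one references a metric that is not emitted.
	rules := `
expr: histogram_quantile(0.99, rate(trstctl_http_request_duration_seconds_bucket[5m])) > 1
expr: trstctl_signer_up == 0
expr: rate(trstctl_queue_depth_total[5m]) > 0
`
	fmt.Println(unknownMetrics(rules, emitted))
}
```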
Plugging a new component in
Observability is a default of the codebase, not a per-sprint afterthought: a new
serving surface or background worker registers its metrics, structured logs,
health/readiness, and tracing through the shared internal/observ library (the
same Registry, Middleware, Readiness, Tracer, and the SignerMetrics-style
helpers) rather than rolling its own. Background workers take a context and stop on
cancellation so shutdown stays graceful. New trstctl_ alert metrics are held to
the same reality test, so a dashboard or alert can never reference a metric the code
does not emit.
Configuration
| Variable | Default | Meaning |
|---|---|---|
| TRSTCTL_LOG_LEVEL | info | debug, info, warn, or error. |
| TRSTCTL_LOG_FORMAT | json | json or text. |
/metrics and /readyz are always served and unauthenticated; restrict them at
your ingress / network policy if you do not want them publicly reachable.