Operations & resilience

The serving control plane is built so one overloaded or failing part cannot take down the rest (AN-7). This page covers the resilience controls in the live path: bulkheads, the per-tenant rate limiter, graceful drain, and the fail-closed signer timeout.

Bulkheads (isolation + backpressure)

Each subsystem runs on its own bounded worker pool with a bounded queue: the API, the projection workers, the outbox dispatcher, and the signing path. When a pool is saturated it rejects fast rather than blocking — an API flood returns 503 with a Retry-After header instead of consuming capacity another subsystem needs.
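The reject-fast behaviour can be sketched in Go as a bounded worker pool in front of a bounded queue. This is a minimal illustration, not the trstctl implementation: the Bulkhead type, its methods, ErrSaturated, and the pool sizes are all hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrSaturated is returned when the queue is full. An HTTP handler would
// translate it into 503 with a Retry-After header. (Hypothetical name.)
var ErrSaturated = errors.New("bulkhead: pool saturated")

// Bulkhead is a hypothetical bounded worker pool with a bounded queue.
type Bulkhead struct {
	queue chan func()
	wg    sync.WaitGroup
}

// NewBulkhead starts `workers` goroutines that consume from a queue holding
// at most `queueSize` pending tasks.
func NewBulkhead(workers, queueSize int) *Bulkhead {
	b := &Bulkhead{queue: make(chan func(), queueSize)}
	b.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer b.wg.Done()
			for task := range b.queue {
				task()
			}
		}()
	}
	return b
}

// Submit enqueues a task without blocking and rejects fast when the queue is full.
func (b *Bulkhead) Submit(task func()) error {
	select {
	case b.queue <- task:
		return nil
	default:
		return ErrSaturated
	}
}

// Close stops intake and waits for queued and running tasks to finish.
func (b *Bulkhead) Close() {
	close(b.queue)
	b.wg.Wait()
}

func main() {
	api := NewBulkhead(1, 1) // one worker, one queue slot
	release := make(chan struct{})
	started := make(chan struct{})

	// Occupy the single worker, then fill the single queue slot.
	_ = api.Submit(func() { close(started); <-release })
	<-started
	_ = api.Submit(func() {})

	// The pool is now saturated, so this submit is rejected immediately.
	fmt.Println(api.Submit(func() {}))

	close(release)
	api.Close()
}
```

Giving each subsystem its own Bulkhead value is what keeps a flood in one pool from consuming workers or queue slots that belong to another.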

Because the pools are isolated, a saturated API cannot starve the things you rely on to observe and recover: /healthz, /readyz, and /metrics are served outside the API bulkhead and keep answering even while the API sheds load. The continuous outbox dispatcher runs on its own pool, so a backlog of external calls applies backpressure to itself (it sheds a sweep rather than piling up) without touching API capacity.

The pool sizes ship with conservative defaults and are tuned per deployment.

Rate limiting (per tenant, PostgreSQL-backed)

A per-tenant token bucket sheds load on the guarded routes. Its state is persisted in PostgreSQL, with no Redis required, so the limit holds across every replica. Each tenant may make TRSTCTL_RATE_LIMIT_REQUESTS calls per TRSTCTL_RATE_LIMIT_WINDOW: the bucket admits a burst of up to that many requests and refills steadily over the window. Over-budget requests get 429 Too Many Requests with a Retry-After header. The check runs after authentication and authorization, so each request is charged to its own tenant's budget and one noisy tenant cannot exhaust the control plane for the others (AN-1).

Variable                     Default  Meaning
TRSTCTL_RATE_LIMIT_ENABLED   true     Turn per-tenant rate limiting on/off.
TRSTCTL_RATE_LIMIT_REQUESTS  600      Burst/budget per window, per tenant.
TRSTCTL_RATE_LIMIT_WINDOW    1m       The refill window (Go duration).

Graceful drain on shutdown

On SIGTERM the control plane drains without losing in-flight work: it stops accepting new connections, stops the outbox dispatcher, drains the per-subsystem worker pools (finishing queued and running tasks), runs a final outbox sweep so no enqueued external effect is lost (AN-6), then closes the event log and datastore in order.

Fail-closed signing

Issuance is bounded by a per-operation timeout. If the out-of-process signer (AN-4) is slow, unreachable, or stopped, IssueLeaf fails closed — it returns an error within the timeout and never falls back to an in-process signature. This is exercised by fault injection (a deliberately slow signer) in the test suite.

What an operator should watch

Pair this with Observability: the trstctl_http_requests_total counter shows 429/503 shedding as it happens, and the alert rules fire on sustained error rate or latency. A rising 503 rate points at a saturated subsystem; a rising 429 rate points at a tenant over budget.
