Observability

Diminuendo provides three observability pillars: structured logging via Effect’s built-in logger, distributed tracing via OpenTelemetry, and deep health checks that probe upstream dependencies. Each is designed for zero-configuration local development and opt-in production instrumentation.

Logging

Diminuendo uses Effect’s built-in logging system, which integrates directly with the Effect runtime’s fiber scheduler. Every Effect.log* call captures the current fiber’s context (span, annotations) and routes through the configured logger implementation.

Logger Configuration

The logger is configured by two environment variables:
Variable               Effect
LOG_LEVEL              Minimum severity: trace, debug, info, warning, error, fatal. Default: info
DEV_MODE / NODE_ENV    Format selection: pretty-print in dev, JSON in production
const loggerLayer = config.devMode
  ? Logger.replace(Logger.defaultLogger, Logger.prettyLoggerDefault)
  : Logger.json

Production: JSON Logger

In production (NODE_ENV=production or DEV_MODE not set), logs are emitted as structured JSON, one object per line. This format is optimized for ingestion by log aggregators (Datadog, Grafana Loki, CloudWatch Logs):
{"level":"INFO","message":"Gateway listening on 0.0.0.0:8080","timestamp":"2024-03-01T12:00:00.000Z","fiber":"#1"}
{"level":"DEBUG","message":"PodiumClient: POST /api/v1/instances","timestamp":"2024-03-01T12:00:01.234Z","fiber":"#5"}

Development: Pretty Logger

In development, logs use Effect’s prettyLoggerDefault, which renders human-readable output with color coding:
12:00:00.000 INFO  Gateway listening on 0.0.0.0:8080 (dev mode - auth bypassed)
12:00:01.234 DEBUG PodiumClient: POST /api/v1/instances body={...}

Log Level Recommendations

Level      Use Case
error      Unrecoverable failures, data corruption, service crashes
warning    Recoverable issues: stale session recovery failures, missing optional config, degraded dependencies
info       Service lifecycle events: startup, shutdown, configuration summary, connection events
debug      Request/response details: Podium API calls, WebSocket frame details, SQL queries
trace      Fiber scheduling, Effect runtime internals (rarely needed)

In production, use info as the default log level. Switch to debug temporarily when diagnosing issues — the additional output includes every Podium API call, every SQLite worker command, and every WebSocket message type.
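Conceptually, the LOG_LEVEL gate is just an ordering over severities: a message passes only if its level is at or above the configured minimum. A minimal sketch of that rule (shouldLog is illustrative, not Effect's actual logger API):

```typescript
// Severity order used by the LOG_LEVEL gate, lowest to highest.
const severities = ["trace", "debug", "info", "warning", "error", "fatal"] as const
type Level = (typeof severities)[number]

// A message is emitted only if its level is at or above the minimum.
function shouldLog(minimum: Level, level: Level): boolean {
  return severities.indexOf(level) >= severities.indexOf(minimum)
}
```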

OpenTelemetry Tracing

Distributed tracing is opt-in. Set OTEL_EXPORTER_OTLP_ENDPOINT to enable it. If the variable is unset, the tracing subsystem is completely inert — no spans are created, no overhead is incurred.

Initialization

Tracing is initialized once at startup via initTracing(). The function is idempotent and safe to call multiple times:
await initTracing(process.env.OTEL_SERVICE_NAME ?? "diminuendo-gateway")
Initialization dynamically imports the OpenTelemetry packages:
  • @opentelemetry/api
  • @opentelemetry/sdk-trace-node
  • @opentelemetry/exporter-trace-otlp-http
  • @opentelemetry/sdk-trace-base
The OpenTelemetry packages are optional dependencies. If they are not installed, initTracing() catches the import error and silently disables tracing. The gateway runs identically with or without these packages in node_modules.

Configuration

Variable                     Default             Description
OTEL_EXPORTER_OTLP_ENDPOINT  (none)              OTLP HTTP endpoint (e.g., http://localhost:4318)
OTEL_SERVICE_NAME            diminuendo-gateway  Service name in trace metadata

The exporter sends traces to {OTEL_EXPORTER_OTLP_ENDPOINT}/v1/traces using the OTLP HTTP protocol. A BatchSpanProcessor batches spans for efficient network transmission.

withSpan() Helper

The withSpan() function wraps any Effect in an OpenTelemetry span. If tracing is disabled, it passes the Effect through unchanged (zero overhead):
export function withSpan<A, E, R>(
  name: string,
  effect: Effect.Effect<A, E, R>,
  attributes?: Record<string, string | number | boolean>,
): Effect.Effect<A, E, R>
Span lifecycle is managed correctly even under fiber interruption:
  • On success: span status is set to OK and the span is ended
  • On failure or interruption: span status is set to ERROR with a diagnostic message, and the span is ended
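The lifecycle guarantee above boils down to a try/catch/finally shape. A simplified sketch using a stub span (the real withSpan() wraps Effect values and the OTel span API, so these names are illustrative):

```typescript
interface StubSpan {
  status: "UNSET" | "OK" | "ERROR"
  ended: boolean
}

// Runs body inside the span; the span is always ended, whether the
// body succeeds or throws.
function withStubSpan<A>(span: StubSpan, body: () => A): A {
  try {
    const result = body()
    span.status = "OK" // success: status OK
    return result
  } catch (err) {
    span.status = "ERROR" // failure: status ERROR
    throw err
  } finally {
    span.ended = true // span is ended on every path
  }
}
```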

Trace ID Propagation

The currentTraceId() function returns the active span’s trace ID if OTel is enabled, or a random 32-character hex string otherwise. This ID is propagated through event envelopes, enabling correlation between client-visible events and server-side traces:
export function currentTraceId(): string {
  if (tracingEnabled && otelApi) {
    const span = otelApi.trace.getActiveSpan?.()
    if (span) {
      const ctx = span.spanContext()
      if (ctx?.traceId) return ctx.traceId
    }
  }
  return crypto.randomUUID().replace(/-/g, "").slice(0, 32)
}

Graceful Degradation

The tracing subsystem is designed for complete graceful degradation:
Condition                            Behavior
OTEL_EXPORTER_OTLP_ENDPOINT not set  Tracing disabled; withSpan() is a pass-through
OTel packages not installed          initTracing() catches the import error; tracing disabled
Collector unreachable                BatchSpanProcessor buffers and retries; no impact on gateway
initTracing() called multiple times  Idempotent; second call is a no-op

Health Endpoint

The gateway exposes a GET /health endpoint that performs deep health checks against upstream dependencies.

Response Format

{
  "status": "ok",
  "uptime": 3600000,
  "connections": 42,
  "dependencies": [
    {
      "name": "podium",
      "status": "ok",
      "latencyMs": 12
    },
    {
      "name": "ensemble",
      "status": "ok",
      "latencyMs": 8
    }
  ],
  "version": "0.1.0"
}

Health Check Logic

The endpoint probes each configured upstream service by sending a GET request to {service_url}/health with a 2-second timeout:
1. Probe Dependencies

Podium and Ensemble (if configured) are probed in parallel. Each probe measures latency and captures the HTTP status.
2. Classify Each Dependency

  • 200 OK with latency under timeout: ok
  • Non-200 HTTP status: degraded (with error detail)
  • Timeout or connection error: unhealthy (with error message)
3. Compute Overall Status

  • If Podium is unhealthy: overall status is unhealthy (Podium is critical)
  • If any dependency is not ok but Podium is available: overall status is degraded
  • If all dependencies are ok: overall status is ok
4. Return Response

  • 200 for ok or degraded status
  • 503 for unhealthy status
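The classification and overall-status rules above can be sketched as two pure functions. The names (Probe, classify, overallStatus) are illustrative, not the gateway's actual identifiers:

```typescript
type Health = "ok" | "degraded" | "unhealthy"

interface Probe {
  name: string
  httpStatus?: number // undefined => timeout or connection error
  latencyMs?: number
}

// Per-dependency classification: 200 => ok, other status => degraded,
// no response at all => unhealthy.
function classify(p: Probe): Health {
  if (p.httpStatus === undefined) return "unhealthy"
  return p.httpStatus === 200 ? "ok" : "degraded"
}

// Overall status: Podium is critical; anything else degrades at most.
function overallStatus(deps: { name: string; status: Health }[]): Health {
  const podium = deps.find((d) => d.name === "podium")
  if (podium?.status === "unhealthy") return "unhealthy"
  if (deps.some((d) => d.status !== "ok")) return "degraded"
  return "ok"
}
```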

Response Fields

Field         Type                             Description
status        "ok" | "degraded" | "unhealthy"  Overall gateway health
uptime        number                           Milliseconds since gateway started
connections   number                           Number of active session subscriptions
dependencies  DependencyStatus[]               Per-dependency health details
version       string                           Gateway version
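For consumers of this endpoint, the response can be typed as follows. The field names follow the table above; the interface names and the optional error field are assumptions, not the gateway's exact declarations:

```typescript
interface DependencyStatus {
  name: string
  status: "ok" | "degraded" | "unhealthy"
  latencyMs?: number
  error?: string // present for degraded/unhealthy probes
}

interface HealthResponse {
  status: "ok" | "degraded" | "unhealthy"
  uptime: number // milliseconds since gateway start
  connections: number // active session subscriptions
  dependencies: DependencyStatus[]
  version: string
}

// The example payload from the Response Format section, typed.
const example: HealthResponse = {
  status: "ok",
  uptime: 3600000,
  connections: 42,
  dependencies: [{ name: "podium", status: "ok", latencyMs: 12 }],
  version: "0.1.0",
}
```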

Dependency Criticality

Podium is the only critical dependency. If Podium is unreachable, the gateway cannot create or manage agent sessions, so the overall status is unhealthy (503). Ensemble is non-critical — if it is unreachable, the gateway reports degraded (200) because agent sessions can still function without gateway-level inference.
The health endpoint does not require authentication. It is designed for load balancer health checks and monitoring systems. Do not expose sensitive information in the response.

Load Balancer Integration

Configure your load balancer to probe GET /health periodically:
Health check path: /health
Expected status: 200
Interval: 10s
Timeout: 5s
Unhealthy threshold: 3 consecutive 503 responses
An instance returning 503 (Podium unhealthy) should be removed from the load balancer pool. An instance returning 200 with degraded status should remain in the pool — it can still serve requests, but operators should investigate the degraded dependency.
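On Kubernetes, the same probe policy could be expressed as a readiness probe. This fragment is illustrative; the port follows the listen address shown earlier (0.0.0.0:8080), and a failing probe removes the pod from Service endpoints, matching the pool-removal semantics above:

```yaml
# Illustrative readiness probe mirroring the load balancer settings above.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10      # Interval: 10s
  timeoutSeconds: 5      # Timeout: 5s
  failureThreshold: 3    # Unhealthy threshold: 3 consecutive failures
```

Because degraded still returns 200, a degraded instance stays ready and keeps serving traffic, as intended.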