# Ensemble Integration

Ensemble is the LLM inference proxy that handles model routing, rate limiting, and cost tracking. Diminuendo connects to Ensemble as an upstream service for any gateway-level inference needs that are separate from the agent’s own LLM calls (which flow through Podium).

## EnsembleClient

The EnsembleClient is an Effect service (Context.Tag) that provides two inference methods and a health probe:

```typescript
export class EnsembleClient extends Context.Tag("EnsembleClient")<EnsembleClient, {
  readonly generate: (params: GenerateParams) => Effect.Effect<GenerateResult, EnsembleError>
  readonly generateStream: (params: GenerateParams) => Effect.Effect<ReadableStream<string>, EnsembleError>
  readonly isHealthy: Effect.Effect<boolean>
}>() {}
```

### generate()

Sends a synchronous inference request and returns the complete response:

```typescript
interface GenerateParams {
  readonly model: string
  readonly messages: ReadonlyArray<{ readonly role: string; readonly content: string }>
  readonly maxTokens?: number
  readonly temperature?: number
}

interface GenerateResult {
  readonly content: string
  readonly usage: {
    readonly inputTokens: number
    readonly outputTokens: number
  }
}
```

The request is sent as a POST to `{ENSEMBLE_URL}/api/v1/generate` with a 60-second timeout.
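The wire format can be sketched as follows. `buildGenerateRequest` is a hypothetical helper named for illustration; the real client wraps this call in Effect and the resilience layer described below.

```typescript
// Hypothetical helper illustrating the request shape generate() sends.
// GenerateParams is repeated here so the sketch is self-contained.
interface GenerateParams {
  readonly model: string
  readonly messages: ReadonlyArray<{ readonly role: string; readonly content: string }>
  readonly maxTokens?: number
  readonly temperature?: number
}

const buildGenerateRequest = (baseUrl: string, apiKey: string, params: GenerateParams) => ({
  url: `${baseUrl}/api/v1/generate`,
  init: {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(params),
    // Matches the documented 60-second timeout.
    signal: AbortSignal.timeout(60_000),
  },
})
```

The actual call is then roughly `fetch(url, init)`, with the JSON response decoded into a GenerateResult.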

### generateStream()

Sends a streaming inference request and returns a `ReadableStream<string>` that yields text chunks as they arrive:

```typescript
const stream = await Effect.runPromise(ensemble.generateStream({
  model: "claude-3.5-sonnet",
  messages: [{ role: "user", content: "Explain quicksort" }],
}))

const reader = stream.getReader()
while (true) {
  const { done, value } = await reader.read()
  if (done) break
  process.stdout.write(value)
}
```

The request is sent as a POST to `{ENSEMBLE_URL}/api/v1/generate/stream` with a 120-second timeout. The response body is piped through a TextDecoderStream.

## Configuration

The EnsembleClientLive layer reads two environment variables from the application config:

| Variable | Default | Description |
| --- | --- | --- |
| `ENSEMBLE_URL` | `http://localhost:5180` | Base URL for the Ensemble API |
| `ENSEMBLE_API_KEY` | (empty) | Bearer token for authentication |

If `ENSEMBLE_URL` is set to a non-default value but `ENSEMBLE_API_KEY` is empty, the gateway logs a warning at startup: “ENSEMBLE_URL is set but ENSEMBLE_API_KEY is empty — Ensemble integration will fail.” This catches a common misconfiguration where the URL is set in the environment but the secret was not provisioned.
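That check amounts to a simple predicate. A minimal sketch, assuming a `DEFAULT_ENSEMBLE_URL` constant and a function name of our own choosing:

```typescript
// Sketch of the startup misconfiguration check; names are illustrative,
// not the gateway's actual code.
const DEFAULT_ENSEMBLE_URL = "http://localhost:5180"

const ensembleConfigWarning = (url: string, apiKey: string): string | null =>
  url !== DEFAULT_ENSEMBLE_URL && apiKey === ""
    ? "ENSEMBLE_URL is set but ENSEMBLE_API_KEY is empty — Ensemble integration will fail."
    : null
```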

## No-Op Fallback

If either `ENSEMBLE_URL` or `ENSEMBLE_API_KEY` is missing, the EnsembleClient falls back to a no-op implementation. Both `generate` and `generateStream` return `Effect.fail(new EnsembleError({ message: "Ensemble unavailable: ..." }))`, and the `isHealthy` probe returns `false`. This ensures the gateway starts successfully even without Ensemble configured: it degrades gracefully rather than failing to boot.
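Effect specifics aside, the fallback behaves like the promise-based analogue below. All names are illustrative; the real implementation returns failed Effects rather than rejected promises.

```typescript
// Stand-in for the real EnsembleError (an Effect tagged error in the gateway).
class EnsembleError extends Error {}

// No-op client: every inference call fails with the same explanatory error,
// and the health probe always reports unhealthy.
const makeNoOpEnsembleClient = (reason: string) => ({
  generate: () => Promise.reject(new EnsembleError(`Ensemble unavailable: ${reason}`)),
  generateStream: () => Promise.reject(new EnsembleError(`Ensemble unavailable: ${reason}`)),
  isHealthy: () => Promise.resolve(false),
})

// The gateway selects the no-op client when either variable is missing,
// so startup succeeds without Ensemble configured.
const selectClientKind = (url: string | undefined, apiKey: string | undefined): "live" | "noop" =>
  url && apiKey ? "live" : "noop"
```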

## Resilience

Both `generate` and `generateStream` are wrapped with resilience patterns to handle transient failures from the inference service.

### Circuit Breaker

| Parameter | Value |
| --- | --- |
| Failure threshold | 5 consecutive failures |
| Cooldown period | 30 seconds |
| Reset behavior | Half-open after cooldown; first success closes the breaker |

When the circuit breaker opens, all inference calls fail immediately with EnsembleError (status code 503) rather than attempting the HTTP request. This prevents a failing Ensemble service from consuming gateway resources with timeout-bound requests.
```typescript
const breaker = yield* makeCircuitBreaker({ failureThreshold: 5, cooldownMs: 30_000 })

// Runs an inference effect through the breaker, retries transient failures,
// and maps an open-breaker rejection to an EnsembleError with status 503.
const wrapWithResilience = <A>(
  effect: Effect.Effect<A, EnsembleError>,
  breaker: CircuitBreaker,
): Effect.Effect<A, EnsembleError> =>
  breaker.execute(effect).pipe(
    Effect.retry(ensembleRetry),
    Effect.catchTag("CircuitBreakerOpen", (err: CircuitBreakerOpen) =>
      Effect.fail(new EnsembleError({ message: err.message, statusCode: 503 })),
    ),
  )
```

### Exponential Retry

| Parameter | Value |
| --- | --- |
| Backoff schedule | 1s, 2s, 4s |
| Max retries | 2 |

Retries are applied after circuit breaker execution. If a request fails and the circuit breaker is still closed, the request is retried with exponential backoff. After 2 retries (3 total attempts), the error propagates.
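The backoff arithmetic can be sketched with a small helper (the name is an assumption). With the documented budget of 2 retries, only the first two steps of the 1s/2s/4s schedule are exercised:

```typescript
// Exponential backoff: delays start at baseMs and double on each retry.
const retryDelaysMs = (baseMs: number, maxRetries: number): number[] =>
  Array.from({ length: maxRetries }, (_, attempt) => baseMs * 2 ** attempt)
```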

## Health Probe

The `isHealthy` property sends a GET request to `{ENSEMBLE_URL}/health` with a 2-second timeout:

```typescript
const isHealthy: Effect.Effect<boolean> = Effect.tryPromise({
  try: () => fetch(`${ensembleUrl}/health`, { signal: AbortSignal.timeout(2000) }),
  // The mapped error value is never observed; failures are normalized below.
  catch: () => false as never,
}).pipe(
  // A non-2xx response counts as unhealthy.
  Effect.map((res) => res.ok),
  // Network errors and timeouts also collapse to an unhealthy result.
  Effect.catchAll(() => Effect.succeed(false)),
)
```
This probe is used by the gateway’s /health endpoint. If Ensemble is unreachable or returns a non-200 status, the health check reports Ensemble as degraded. Unlike Podium (which is critical), an unhealthy Ensemble degrades the overall health status but does not mark the gateway as unhealthy.
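A sketch of that aggregation rule; the status names and function are assumptions about the gateway's health endpoint, not its actual code:

```typescript
type HealthStatus = "healthy" | "degraded" | "unhealthy"

// Podium is critical: if it is down, the gateway is unhealthy.
// Ensemble is not: if only Ensemble is down, the gateway is merely degraded.
const overallStatus = (podiumHealthy: boolean, ensembleHealthy: boolean): HealthStatus =>
  !podiumHealthy ? "unhealthy" : ensembleHealthy ? "healthy" : "degraded"
```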

## Usage Tracking

Ensemble usage is tracked through the gateway’s event system. When an agent consumes tokens during a turn, the gateway maps Podium’s usage events to two client-facing event types:

### usage_update

Emitted once per inference call within a turn:

```json
{
  "type": "usage_update",
  "sessionId": "...",
  "turnId": "...",
  "model": "claude-3.5-sonnet",
  "inputTokens": 1500,
  "outputTokens": 350,
  "cachedTokens": 200,
  "costMicroDollars": 4200,
  "seq": 42,
  "ts": 1709312400000
}
```

| Field | Description |
| --- | --- |
| `model` | The model identifier used for this inference |
| `inputTokens` | Number of input tokens consumed |
| `outputTokens` | Number of output tokens generated |
| `cachedTokens` | Number of tokens served from cache |
| `costMicroDollars` | Cost in micro-dollars (1/1,000,000 of a dollar) |
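For display, a client might convert micro-dollars back to dollars. The helper below is illustrative, not part of the gateway:

```typescript
// Convert an integer micro-dollar cost to dollars (1 USD = 1,000,000 micro-dollars).
const microDollarsToUsd = (micro: number): number => micro / 1_000_000
```

The example payload's `costMicroDollars` of 4200 corresponds to $0.0042.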

### usage_context

Emitted to report the agent’s context window utilization:

```json
{
  "type": "usage_context",
  "sessionId": "...",
  "turnId": "...",
  "totalTokens": 45000,
  "maxTokens": 200000,
  "percentUsed": 22.5,
  "seq": 43,
  "ts": 1709312400000
}
```

| Field | Description |
| --- | --- |
| `totalTokens` | Total tokens currently in the context window |
| `maxTokens` | Maximum context window size for the active model |
| `percentUsed` | Percentage of context window consumed |
Frontend clients use usage_context events to render a context window utilization indicator, helping users understand how much context capacity remains before the agent needs to summarize or truncate its working memory.
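The `percentUsed` field is derived from the two token counts; an illustrative computation:

```typescript
// percentUsed = totalTokens / maxTokens, expressed as a percentage.
const percentUsed = (totalTokens: number, maxTokens: number): number =>
  (totalTokens / maxTokens) * 100
```

For the example payload above, 45000 / 200000 gives 22.5%.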