# Ensemble Integration

Ensemble is the LLM inference proxy that handles model routing, rate limiting, and cost tracking. Diminuendo connects to Ensemble as an upstream service for any gateway-level inference needs that are separate from the agent’s own LLM calls (which flow through Podium).

## EnsembleClient

The EnsembleClient is an Effect service (Context.Tag) that provides two inference methods and a health probe:

```typescript
export class EnsembleClient extends Context.Tag("EnsembleClient")<EnsembleClient, {
  readonly generate: (params: GenerateParams) => Effect.Effect<GenerateResult, EnsembleError>
  readonly generateStream: (params: GenerateParams) => Effect.Effect<ReadableStream<string>, EnsembleError>
  readonly isHealthy: Effect.Effect<boolean>
}>() {}
```

### generate()

Sends a synchronous inference request and returns the complete response:

```typescript
interface GenerateParams {
  readonly model: string
  readonly messages: ReadonlyArray<{ readonly role: string; readonly content: string }>
  readonly maxTokens?: number
  readonly temperature?: number
}

interface GenerateResult {
  readonly content: string
  readonly usage: {
    readonly inputTokens: number
    readonly outputTokens: number
  }
}
```

The request is sent as a POST to `{ENSEMBLE_URL}/api/v1/generate` with a 60-second timeout.
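The wire format can be sketched as follows. `buildGenerateRequest` is a hypothetical helper named for illustration; the real client wraps this call in Effect and the resilience layer described below.

```typescript
// Hypothetical helper illustrating the request shape generate() sends.
// GenerateParams is repeated here so the sketch is self-contained.
interface GenerateParams {
  readonly model: string
  readonly messages: ReadonlyArray<{ readonly role: string; readonly content: string }>
  readonly maxTokens?: number
  readonly temperature?: number
}

const buildGenerateRequest = (baseUrl: string, apiKey: string, params: GenerateParams) => ({
  url: `${baseUrl}/api/v1/generate`,
  init: {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(params),
    // Matches the documented 60-second timeout.
    signal: AbortSignal.timeout(60_000),
  },
})
```

The actual call is then roughly `fetch(url, init)`, with the JSON response decoded into a GenerateResult.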

### generateStream()

Sends a streaming inference request and returns a `ReadableStream<string>` that yields text chunks as they arrive:

```typescript
const stream = await Effect.runPromise(ensemble.generateStream({
  model: "claude-3.5-sonnet",
  messages: [{ role: "user", content: "Explain quicksort" }],
}))

const reader = stream.getReader()
while (true) {
  const { done, value } = await reader.read()
  if (done) break
  process.stdout.write(value)
}
```

The request is sent as a POST to `{ENSEMBLE_URL}/api/v1/generate/stream` with a 120-second timeout. The response body is piped through a TextDecoderStream.

## Configuration

The EnsembleClientLive layer reads two environment variables from the application config:

| Variable | Default | Description |
| --- | --- | --- |
| `ENSEMBLE_URL` | `http://localhost:5180` | Base URL for the Ensemble API |
| `ENSEMBLE_API_KEY` | (empty) | Bearer token for authentication |

If `ENSEMBLE_URL` is set to a non-default value but `ENSEMBLE_API_KEY` is empty, the gateway logs a warning at startup: “ENSEMBLE_URL is set but ENSEMBLE_API_KEY is empty — Ensemble integration will fail.” This catches a common misconfiguration where the URL is set in the environment but the secret was not provisioned.
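That check amounts to a simple predicate. A minimal sketch, assuming a `DEFAULT_ENSEMBLE_URL` constant and a function name of our own choosing:

```typescript
// Sketch of the startup misconfiguration check; names are illustrative,
// not the gateway's actual code.
const DEFAULT_ENSEMBLE_URL = "http://localhost:5180"

const ensembleConfigWarning = (url: string, apiKey: string): string | null =>
  url !== DEFAULT_ENSEMBLE_URL && apiKey === ""
    ? "ENSEMBLE_URL is set but ENSEMBLE_API_KEY is empty — Ensemble integration will fail."
    : null
```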

## No-Op Fallback

If either `ENSEMBLE_URL` or `ENSEMBLE_API_KEY` is missing, the EnsembleClient falls back to a no-op implementation. Both `generate` and `generateStream` return `Effect.fail(new EnsembleError({ message: "Ensemble unavailable: ..." }))`, and the `isHealthy` probe returns `false`. This ensures the gateway starts successfully even without Ensemble configured: it degrades gracefully rather than failing to boot.
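Effect specifics aside, the fallback behaves like the promise-based analogue below. All names are illustrative; the real implementation returns failed Effects rather than rejected promises.

```typescript
// Stand-in for the real EnsembleError (an Effect tagged error in the gateway).
class EnsembleError extends Error {}

// No-op client: every inference call fails with the same explanatory error,
// and the health probe always reports unhealthy.
const makeNoOpEnsembleClient = (reason: string) => ({
  generate: () => Promise.reject(new EnsembleError(`Ensemble unavailable: ${reason}`)),
  generateStream: () => Promise.reject(new EnsembleError(`Ensemble unavailable: ${reason}`)),
  isHealthy: () => Promise.resolve(false),
})

// The gateway selects the no-op client when either variable is missing,
// so startup succeeds without Ensemble configured.
const selectClientKind = (url: string | undefined, apiKey: string | undefined): "live" | "noop" =>
  url && apiKey ? "live" : "noop"
```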

## Resilience

Both `generate` and `generateStream` are wrapped with resilience patterns to handle transient failures from the inference service.

### Circuit Breaker

| Parameter | Value |
| --- | --- |
| Failure threshold | 5 consecutive failures |
| Cooldown period | 30 seconds |
| Reset behavior | Half-open after cooldown; first success closes the breaker |

When the circuit breaker opens, all inference calls fail immediately with EnsembleError (status code 503) rather than attempting the HTTP request. This prevents a failing Ensemble service from consuming gateway resources with timeout-bound requests.
```typescript
const breaker = yield* makeCircuitBreaker({ failureThreshold: 5, cooldownMs: 30_000 })

// Runs an inference effect through the breaker, retries transient failures,
// and maps an open-breaker rejection to an EnsembleError with status 503.
const wrapWithResilience = <A>(
  effect: Effect.Effect<A, EnsembleError>,
  breaker: CircuitBreaker,
): Effect.Effect<A, EnsembleError> =>
  breaker.execute(effect).pipe(
    Effect.retry(ensembleRetry),
    Effect.catchTag("CircuitBreakerOpen", (err: CircuitBreakerOpen) =>
      Effect.fail(new EnsembleError({ message: err.message, statusCode: 503 })),
    ),
  )
```

### Exponential Retry

| Parameter | Value |
| --- | --- |
| Backoff schedule | 1s, 2s, 4s |
| Max retries | 2 |

Retries are applied after circuit breaker execution. If a request fails and the circuit breaker is still closed, the request is retried with exponential backoff. After 2 retries (3 total attempts), the error propagates.
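The backoff arithmetic can be sketched with a small helper (the name is an assumption). With the documented budget of 2 retries, only the first two steps of the 1s/2s/4s schedule are exercised:

```typescript
// Exponential backoff: delays start at baseMs and double on each retry.
const retryDelaysMs = (baseMs: number, maxRetries: number): number[] =>
  Array.from({ length: maxRetries }, (_, attempt) => baseMs * 2 ** attempt)
```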

## Health Probe

The `isHealthy` property sends a GET request to `{ENSEMBLE_URL}/health` with a 2-second timeout:

```typescript
const isHealthy: Effect.Effect<boolean> = Effect.tryPromise({
  try: () => fetch(`${ensembleUrl}/health`, { signal: AbortSignal.timeout(2000) }),
  // The mapped error value is never observed; failures are normalized below.
  catch: () => false as never,
}).pipe(
  // A non-2xx response counts as unhealthy.
  Effect.map((res) => res.ok),
  // Network errors and timeouts also collapse to an unhealthy result.
  Effect.catchAll(() => Effect.succeed(false)),
)
```
This probe is used by the gateway’s /health endpoint. If Ensemble is unreachable or returns a non-200 status, the health check reports Ensemble as degraded. Unlike Podium (which is critical), an unhealthy Ensemble degrades the overall health status but does not mark the gateway as unhealthy.
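A sketch of that aggregation rule; the status names and function are assumptions about the gateway's health endpoint, not its actual code:

```typescript
type HealthStatus = "healthy" | "degraded" | "unhealthy"

// Podium is critical: if it is down, the gateway is unhealthy.
// Ensemble is not: if only Ensemble is down, the gateway is merely degraded.
const overallStatus = (podiumHealthy: boolean, ensembleHealthy: boolean): HealthStatus =>
  !podiumHealthy ? "unhealthy" : ensembleHealthy ? "healthy" : "degraded"
```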

## Usage Tracking

Ensemble usage is tracked through the gateway’s event system. When an agent consumes tokens during a turn, the gateway maps Podium’s usage events to two client-facing event types:

### usage_update

Emitted once per inference call within a turn:

```json
{
  "type": "usage_update",
  "sessionId": "...",
  "turnId": "...",
  "model": "claude-3.5-sonnet",
  "inputTokens": 1500,
  "outputTokens": 350,
  "cachedTokens": 200,
  "costMicroDollars": 4200,
  "seq": 42,
  "ts": 1709312400000
}
```

| Field | Description |
| --- | --- |
| `model` | The model identifier used for this inference |
| `inputTokens` | Number of input tokens consumed |
| `outputTokens` | Number of output tokens generated |
| `cachedTokens` | Number of tokens served from cache |
| `costMicroDollars` | Cost in micro-dollars (1/1,000,000 of a dollar) |
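For display, a client might convert micro-dollars back to dollars. The helper below is illustrative, not part of the gateway:

```typescript
// Convert an integer micro-dollar cost to dollars (1 USD = 1,000,000 micro-dollars).
const microDollarsToUsd = (micro: number): number => micro / 1_000_000
```

The example payload's `costMicroDollars` of 4200 corresponds to $0.0042.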

### usage_context

Emitted to report the agent’s context window utilization:

```json
{
  "type": "usage_context",
  "sessionId": "...",
  "turnId": "...",
  "totalTokens": 45000,
  "maxTokens": 200000,
  "percentUsed": 22.5,
  "seq": 43,
  "ts": 1709312400000
}
```

| Field | Description |
| --- | --- |
| `totalTokens` | Total tokens currently in the context window |
| `maxTokens` | Maximum context window size for the active model |
| `percentUsed` | Percentage of context window consumed |
Frontend clients use usage_context events to render a context window utilization indicator, helping users understand how much context capacity remains before the agent needs to summarize or truncate its working memory.
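The `percentUsed` field is derived from the two token counts; an illustrative computation:

```typescript
// percentUsed = totalTokens / maxTokens, expressed as a percentage.
const percentUsed = (totalTokens: number, maxTokens: number): number =>
  (totalTokens / maxTokens) * 100
```

For the example payload above, 45000 / 200000 gives 22.5%.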