# Ensemble Integration
Ensemble is the LLM inference proxy that handles model routing, rate limiting, and cost tracking. Diminuendo connects to Ensemble as an upstream service for any gateway-level inference needs that are separate from the agent’s own LLM calls (which flow through Podium).
## EnsembleClient
The EnsembleClient is an Effect service (Context.Tag) that provides two inference methods and a health probe:
```typescript
export class EnsembleClient extends Context.Tag("EnsembleClient")<EnsembleClient, {
  readonly generate: (params: GenerateParams) => Effect.Effect<GenerateResult, EnsembleError>
  readonly generateStream: (params: GenerateParams) => Effect.Effect<ReadableStream<string>, EnsembleError>
  readonly isHealthy: Effect.Effect<boolean>
}>() {}
```
### generate()
Sends a synchronous inference request and returns the complete response:
```typescript
interface GenerateParams {
  readonly model: string
  readonly messages: ReadonlyArray<{ readonly role: string; readonly content: string }>
  readonly maxTokens?: number
  readonly temperature?: number
}

interface GenerateResult {
  readonly content: string
  readonly usage: {
    readonly inputTokens: number
    readonly outputTokens: number
  }
}
```
The request is sent as a POST to {ENSEMBLE_URL}/api/v1/generate with a 60-second timeout.
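The mapping from GenerateParams to that POST body can be sketched as a small serializer. This is an illustrative sketch, not the actual client code, and the assumption is a pass-through JSON encoding in which optional fields are omitted rather than sent as undefined:

```typescript
interface GenerateParams {
  readonly model: string
  readonly messages: ReadonlyArray<{ readonly role: string; readonly content: string }>
  readonly maxTokens?: number
  readonly temperature?: number
}

// Illustrative serializer: required fields always present, optional fields
// included only when set. The wire field names are assumptions.
const toRequestBody = (params: GenerateParams): string => {
  const body: Record<string, unknown> = {
    model: params.model,
    messages: params.messages,
  }
  if (params.maxTokens !== undefined) body.maxTokens = params.maxTokens
  if (params.temperature !== undefined) body.temperature = params.temperature
  return JSON.stringify(body)
}
```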
### generateStream()
Sends a streaming inference request and returns a ReadableStream<string> that yields text chunks as they arrive:
```typescript
const stream = await Effect.runPromise(ensemble.generateStream({
  model: "claude-3.5-sonnet",
  messages: [{ role: "user", content: "Explain quicksort" }],
}))

const reader = stream.getReader()
while (true) {
  const { done, value } = await reader.read()
  if (done) break
  process.stdout.write(value)
}
```
The request is sent as a POST to {ENSEMBLE_URL}/api/v1/generate/stream with a 120-second timeout. The response body is piped through a TextDecoderStream.
## Configuration
The EnsembleClientLive layer reads two environment variables from the application config:

| Variable | Default | Description |
|---|---|---|
| ENSEMBLE_URL | http://localhost:5180 | Base URL for the Ensemble API |
| ENSEMBLE_API_KEY | (empty) | Bearer token for authentication |
If ENSEMBLE_URL is set to a non-default value but ENSEMBLE_API_KEY is empty, the gateway logs a warning at startup: “ENSEMBLE_URL is set but ENSEMBLE_API_KEY is empty — Ensemble integration will fail.” This catches a common misconfiguration where the URL is set in the environment but the secret was not provisioned.
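The check behind that warning is a simple predicate. A sketch, using the documented default URL and an illustrative helper name (not taken from the gateway source):

```typescript
// Warn when ENSEMBLE_URL was overridden but no API key was provisioned.
const DEFAULT_ENSEMBLE_URL = "http://localhost:5180"

const shouldWarnMissingKey = (ensembleUrl: string, ensembleApiKey: string): boolean =>
  ensembleUrl !== DEFAULT_ENSEMBLE_URL && ensembleApiKey.length === 0
```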
## No-Op Fallback
If either ENSEMBLE_URL or ENSEMBLE_API_KEY is missing, the EnsembleClient falls back to a no-op implementation. Both generate and generateStream return Effect.fail(new EnsembleError({ message: "Ensemble unavailable: ..." })). The isHealthy probe returns false.
This ensures the gateway starts successfully even without Ensemble configured — it degrades gracefully rather than failing to boot.
## Resilience
Both generate and generateStream are wrapped with resilience patterns to handle transient failures from the inference service.
### Circuit Breaker

| Parameter | Value |
|---|---|
| Failure threshold | 5 consecutive failures |
| Cooldown period | 30 seconds |
| Reset behavior | Half-open after cooldown; first success closes the breaker |
When the circuit breaker opens, all inference calls fail immediately with EnsembleError (status code 503) rather than attempting the HTTP request. This prevents a failing Ensemble service from consuming gateway resources with timeout-bound requests.
```typescript
const breaker = yield* makeCircuitBreaker({ failureThreshold: 5, cooldownMs: 30_000 })

const wrapWithResilience = <A>(
  effect: Effect.Effect<A, EnsembleError>,
  breaker: CircuitBreaker,
): Effect.Effect<A, EnsembleError> =>
  breaker.execute(effect).pipe(
    Effect.retry(ensembleRetry),
    Effect.catchTag("CircuitBreakerOpen", (err: CircuitBreakerOpen) =>
      Effect.fail(new EnsembleError({ message: err.message, statusCode: 503 })),
    ),
  )
```
### Exponential Retry

| Parameter | Value |
|---|---|
| Backoff schedule | 1s, 2s, 4s |
| Max retries | 2 |
Retries are applied after circuit breaker execution. If a request fails and the circuit breaker is still closed, the request is retried with exponential backoff. After 2 retries (3 total attempts), the error propagates.
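The schedule doubles from a 1-second base. A sketch of the delay computation (the helper name is illustrative; in the client this would likely be expressed as an Effect Schedule such as Schedule.exponential):

```typescript
// Delay before retry attempt n (0-indexed): 1s, 2s, 4s, ...
const backoffMs = (attempt: number, baseMs = 1_000): number => baseMs * 2 ** attempt
```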
## Health Probe
The isHealthy property sends a GET request to {ENSEMBLE_URL}/health with a 2-second timeout:
```typescript
const isHealthy: Effect.Effect<boolean> = Effect.tryPromise({
  try: () => fetch(`${ensembleUrl}/health`, { signal: AbortSignal.timeout(2000) }),
  catch: () => false as never,
}).pipe(
  Effect.map((res) => res.ok),
  Effect.catchAll(() => Effect.succeed(false)),
)
```
This probe is used by the gateway’s /health endpoint. If Ensemble is unreachable or returns a non-200 status, the health check reports Ensemble as degraded. Unlike Podium (which is critical), an unhealthy Ensemble degrades the overall health status but does not mark the gateway as unhealthy.
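That critical-versus-optional distinction can be sketched as a small aggregation function (the status names and function name are assumptions, not taken from the gateway source):

```typescript
type GatewayStatus = "healthy" | "degraded" | "unhealthy"

// Podium is critical: if it is down, the gateway is unhealthy. Ensemble is
// optional: if only Ensemble is down, the gateway reports degraded.
const overallStatus = (podiumHealthy: boolean, ensembleHealthy: boolean): GatewayStatus =>
  !podiumHealthy ? "unhealthy" : ensembleHealthy ? "healthy" : "degraded"
```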
## Usage Tracking
Ensemble usage is tracked through the gateway’s event system. When an agent consumes tokens during a turn, the gateway maps Podium’s usage events to two client-facing event types:
### usage_update
Emitted per-inference call within a turn:
```json
{
  "type": "usage_update",
  "sessionId": "...",
  "turnId": "...",
  "model": "claude-3.5-sonnet",
  "inputTokens": 1500,
  "outputTokens": 350,
  "cachedTokens": 200,
  "costMicroDollars": 4200,
  "seq": 42,
  "ts": 1709312400000
}
```

| Field | Description |
|---|---|
| model | The model identifier used for this inference |
| inputTokens | Number of input tokens consumed |
| outputTokens | Number of output tokens generated |
| cachedTokens | Number of tokens served from cache |
| costMicroDollars | Cost in micro-dollars (1/1,000,000 of a dollar) |
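Reporting cost in integer micro-dollars avoids floating-point cents on the wire; converting for display is a single division (the helper name is illustrative):

```typescript
// 1 micro-dollar = 1/1,000,000 USD, so the example's 4200 is $0.0042.
const microDollarsToUsd = (micro: number): number => micro / 1_000_000
```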
### usage_context
Emitted to report the agent’s context window utilization:
```json
{
  "type": "usage_context",
  "sessionId": "...",
  "turnId": "...",
  "totalTokens": 45000,
  "maxTokens": 200000,
  "percentUsed": 22.5,
  "seq": 43,
  "ts": 1709312400000
}
```

| Field | Description |
|---|---|
| totalTokens | Total tokens currently in the context window |
| maxTokens | Maximum context window size for the active model |
| percentUsed | Percentage of context window consumed |
Frontend clients use usage_context events to render a context window utilization indicator, helping users understand how much context capacity remains before the agent needs to summarize or truncate its working memory.
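percentUsed is derived from the other two fields; in the example above, 45,000 of 200,000 tokens is 22.5%. A sketch of the computation (the server emits the value, so this helper is purely illustrative):

```typescript
// Fraction of the context window consumed, as a percentage.
const percentUsed = (totalTokens: number, maxTokens: number): number =>
  (totalTokens / maxTokens) * 100
```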