Scalability & Horizontal Scaling

Diminuendo’s architecture was designed from the outset to make horizontal scaling a natural consequence of its data model, not an afterthought bolted on through distributed consensus protocols or shared-nothing clustering. The key insight is straightforward: if no two gateway instances ever need to write to the same database, then scaling out is simply a matter of routing tenants to instances.

Per-Tenant Data Isolation

Every tenant in Diminuendo receives its own SQLite database file for session metadata, and every session receives its own dedicated database for conversation history, events, and usage records:
data/
  tenants/
    {tenantId}/
      registry.db          # Session metadata for this tenant
  sessions/
    {sessionId}/
      session.db           # Conversation history, events, turn usage
There is no shared database between tenants or sessions. A query against tenant acme’s registry physically cannot touch tenant globex’s data — they reside in different files on different filesystem paths. There is no WHERE tenant_id = ? clause to forget, no row-level security policy to misconfigure, no cross-tenant join to accidentally permit.
This isolation extends to deletion semantics. Removing a session means deleting a directory. Removing a tenant means deleting a directory tree. No cascading deletes, no orphaned foreign key references, no vacuum passes over a shared tablespace.
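A minimal sketch of this layout in TypeScript (the DATA_ROOT constant and helper names are illustrative, not Diminuendo's actual API):

```typescript
import * as path from "node:path";
import * as fs from "node:fs";

// Root of the on-disk layout described above (illustrative constant).
const DATA_ROOT = "data";

// Per-tenant registry: session metadata for one tenant only.
function tenantRegistryPath(tenantId: string): string {
  return path.join(DATA_ROOT, "tenants", tenantId, "registry.db");
}

// Per-session database: conversation history, events, turn usage.
function sessionDbPath(sessionId: string): string {
  return path.join(DATA_ROOT, "sessions", sessionId, "session.db");
}

// Removing a tenant is a recursive directory removal -- no cascading
// deletes, because no other tenant's data shares this subtree.
function deleteTenant(tenantId: string): void {
  fs.rmSync(path.join(DATA_ROOT, "tenants", tenantId), {
    recursive: true,
    force: true,
  });
}
```

Because every path is derived from a tenant or session ID, a query for one tenant cannot be routed to another tenant's file by construction.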

Why This Enables Horizontal Scaling

Since there is no shared state between tenants, multiple Diminuendo instances can serve different tenants independently. A load balancer can route by tenant ID — using sticky sessions or tenant-affinity routing — to ensure all requests for a given tenant reach the same instance. The fundamental invariant is simple: at any point in time, exactly one gateway instance is responsible for a given tenant’s data. This is trivially satisfied by a load balancer that hashes on the tenant ID extracted from the JWT’s tenant_id claim.
                     Load Balancer
                   (tenant-affinity)
                 /         |         \
       Instance A      Instance B      Instance C
      tenant: acme    tenant: globex   tenant: initech
  data/tenants/acme/ data/tenants/globex/ data/tenants/initech/
Adding capacity means adding instances and rebalancing the tenant-to-instance mapping. Moving a tenant involves no dump-and-restore step: copy its data/tenants/{tenantId}/ directory to the new instance and update the routing table.
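The routing invariant can be sketched with a simple hash-based mapping (the instance names and routeTenant helper are illustrative; a production balancer would likely use consistent hashing so that adding an instance remaps only a fraction of tenants):

```typescript
import { createHash } from "node:crypto";

// Hypothetical routing table: gateway instances the load balancer knows about.
const instances = ["instance-a", "instance-b", "instance-c"];

// Deterministically map a tenant ID (e.g. the JWT's tenant_id claim) to one
// instance, so every request for that tenant lands on the same gateway.
function routeTenant(tenantId: string): string {
  const digest = createHash("sha256").update(tenantId).digest();
  const bucket = digest.readUInt32BE(0) % instances.length;
  return instances[bucket];
}
```

Since the mapping is a pure function of the tenant ID, any balancer replica computes the same answer, which is what makes the "exactly one instance per tenant" invariant cheap to enforce.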

Sticky Session Requirement

WebSocket connections are inherently stateful. Each connected client maintains in-memory state on the gateway instance: the ActiveSession record, the ConnectionState tracking authentication and subscriptions, and the event streaming fiber that consumes Podium events and publishes them to session topics. A client must reconnect to the same instance that holds its session’s in-memory state. If a load balancer routes a reconnecting client to a different instance, that instance will not have the session’s active Podium connection, event fiber, or subscriber registrations.
Sticky sessions are a hard requirement for the current architecture, not a performance optimization. The gateway does not replicate in-memory state between instances. A misdirected WebSocket connection will fail to find an active session and will require the client to re-join, which triggers a fresh Podium connection and state snapshot.
During rolling deployments, tenants can be redistributed across instances by leveraging the stale session recovery mechanism: when an instance restarts, it queries all non-idle sessions across its known tenants and resets them to inactive. Clients reconnect, receive a state_snapshot reflecting the reset state, and the session activates cleanly on the new instance.

SQLite as Scaling Advantage

The choice of SQLite over PostgreSQL is often perceived as a scalability limitation. In Diminuendo’s architecture, it is precisely the opposite — SQLite enables a scaling model that a shared database would complicate:

No Cluster to Manage

There is no PostgreSQL primary, no read replicas, no connection pooler (PgBouncer/pgcat), no failover orchestrator. Each instance manages its own local files.

Copy-Based Backup

Backing up a tenant means copying a directory. Restoring means placing files. No pg_dump, no WAL archiving, no point-in-time recovery infrastructure.

Per-Session Archival

Completed sessions can be archived independently — compress the session directory, upload to object storage, and delete locally. No DELETE FROM events WHERE session_id = ? on a multi-terabyte table.

WAL Concurrency

WAL mode allows concurrent reads without blocking the writer. The two-worker architecture places reads and writes on separate threads, so a long-running history query never stalls event persistence.

Resource Budget Per Instance

Each Diminuendo instance enforces bounded resource consumption through carefully sized caches and rate limiters:
| Resource | Bound | Eviction Policy |
| --- | --- | --- |
| Writer DB cache | 128 max open handles | LRU eviction |
| Reader DB cache | 64 max open handles | LRU eviction |
| Auth rate limiter | 10,000 IP entries | Periodic cleanup (60s interval) |
| Per-connection dedup buffer | 5,000 events | Per-connection, cleared on disconnect |
| Prepared statement cache | WeakMap per DB handle | GC’d when DB handle is evicted |
| Per-connection rate limit | 60 messages per 10s window | Sliding window, per connection |
These bounds ensure that memory consumption grows linearly with the number of active sessions (up to the LRU cache cap) and then plateaus. An instance serving 1,000 concurrent sessions uses approximately the same memory as one serving 200, because at most 128 session databases are open simultaneously.
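The handle caches behave like a classic LRU with an eviction hook. A minimal sketch (the class and the onEvict hook are illustrative; in the gateway, eviction would close the underlying SQLite handle):

```typescript
// Minimal LRU cache in the spirit of the 128-entry writer DB cache above.
// Map preserves insertion order, so the first key is the least recently used.
class LruCache<K, V> {
  private entries = new Map<K, V>();

  constructor(
    private readonly maxSize: number,
    private readonly onEvict: (value: V) => void, // e.g. close a DB handle
  ) {}

  get(key: K): V | undefined {
    const value = this.entries.get(key);
    if (value !== undefined) {
      // Re-insert to mark as most recently used.
      this.entries.delete(key);
      this.entries.set(key, value);
    }
    return value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxSize) {
      const [oldestKey, oldestValue] = this.entries.entries().next().value!;
      this.entries.delete(oldestKey);
      this.onEvict(oldestValue);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

With maxSize fixed at 128, memory for open handles plateaus regardless of how many sessions exist on disk.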

Vertical Scaling Limits

Diminuendo runs on Bun’s single-threaded JavaScript event loop, with SQLite I/O offloaded to dedicated Web Workers. The practical bottlenecks for a single instance are:
  1. CPU for JSON serialization — every WebSocket message is JSON.parse’d on receipt and JSON.stringify’d on send. For high-throughput sessions with rapid text_delta events, this is the dominant CPU cost.
  2. SQLite write throughput — the writer worker batches commands (50ms or 100 commands, whichever comes first) and executes them within transactions. This sustains thousands of writes per second, but a single writer is ultimately serialized.
  3. WebSocket connection count — Bun’s event loop can handle thousands of concurrent WebSocket connections, but each connection consumes a file descriptor and a small amount of memory for its WsData state.
The multi-worker architecture moves SQLite I/O off the main thread, ensuring that database writes never block event delivery. For most workloads, a single instance can serve hundreds of concurrent agent sessions before any of these limits become relevant.
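The writer's batching policy (100 commands or 50 ms, whichever comes first) can be sketched as follows; the class name is illustrative, and the flush callback stands in for executing one SQLite transaction:

```typescript
// Sketch of the writer worker's batching policy: queue commands and flush
// them as one batch after 100 commands or 50 ms, whichever comes first.
class WriteBatcher<T> {
  private queue: T[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private readonly flush: (batch: T[]) => void, // one transaction per batch
    private readonly maxBatch = 100,
    private readonly maxDelayMs = 50,
  ) {}

  enqueue(command: T): void {
    this.queue.push(command);
    if (this.queue.length >= this.maxBatch) {
      this.drain(); // size threshold reached: flush immediately
    } else if (this.timer === null) {
      this.timer = setTimeout(() => this.drain(), this.maxDelayMs);
    }
  }

  private drain(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    this.flush(batch);
  }
}
```

Batching amortizes the per-transaction fsync cost across many commands, which is how a single serialized writer sustains thousands of writes per second.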

What Would Require Redis or PostgreSQL

The current architecture is designed for tenant-affinity routing, where each tenant is served by exactly one instance. Several capabilities would require shared infrastructure:
Cross-Instance Event Delivery

If a client connects to instance A but the session’s Podium events arrive on instance B (because the Podium connection was established there), instance A has no way to receive those events. A shared pub/sub layer (Redis Streams, NATS) would be needed to bridge events across instances.

Live Session Handoff

Moving an active session from one instance to another — for example, during a rolling deployment — currently requires the session to be deactivated and reactivated. A shared state store would enable live handoff without interrupting the Podium connection.

Global Rate Limiting

The auth rate limiter tracks attempts per IP address within a single instance. A coordinated attacker distributing attempts across instances would bypass per-instance limits. A shared rate limiter (Redis-backed sliding window) would provide global protection.

Shared Billing Ledger

The BillingService currently operates per-instance with local credit reservation. A multi-instance deployment serving the same tenant from different instances would require a shared ledger to prevent over-spending.
These capabilities are not yet needed for the current deployment model. The architecture is designed so that adding them later is additive — it requires new service implementations behind the existing Effect Layer interfaces, not rewrites of the core logic.

Stale Recovery on Restart

When a gateway instance restarts (whether due to deployment, crash, or scaling event), it performs stale session recovery as part of its startup sequence:
  1. Enumerate known tenants — The instance queries all known tenant IDs from the data/tenants/ directory, plus the default tenant (dev in dev mode, default otherwise).
  2. Query non-idle sessions — For each tenant, the instance queries the registry database for sessions whose status is not inactive — these are sessions that were active when the previous instance process died.
  3. Reset to inactive — Each stale session is reset to inactive. This is safe because Podium connections do not survive process death — the WebSocket to the Podium coordinator was severed when the process exited, and the compute instance has already been reclaimed or timed out.
  4. Resume normal operation — When clients reconnect and join these sessions, they receive a state_snapshot showing inactive status. The client can then trigger re-activation, which creates a fresh Podium instance and establishes a new connection.
This recovery runs as a forked daemon fiber — it executes concurrently with the server startup and does not block incoming connections. Up to 4 tenants are reconciled in parallel.
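The bounded fan-out can be sketched with plain Promises standing in for Effect fibers (forEachLimited and recoverTenant are illustrative names, not the gateway's actual API):

```typescript
// Run an async task over all items with at most `limit` in flight at once,
// mirroring the "up to 4 tenants reconciled in parallel" behavior above.
async function forEachLimited<T>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<void>,
): Promise<void> {
  const pending = [...items];
  const workers = Array.from({ length: Math.min(limit, pending.length) }, async () => {
    while (pending.length > 0) {
      const item = pending.shift()!; // single-threaded event loop: shift is safe
      await fn(item);
    }
  });
  await Promise.all(workers);
}

// Per-tenant recovery: query the registry for non-idle sessions and reset
// them to "inactive" -- safe because Podium connections died with the process.
async function recoverTenant(_tenantId: string): Promise<void> {
  // (registry query + reset elided in this sketch)
}

async function recoverAllTenants(tenantIds: string[]): Promise<void> {
  await forEachLimited(tenantIds, 4, recoverTenant);
}
```

Running this as a background task alongside server startup matches the forked-daemon-fiber behavior: recovery proceeds while new connections are already being accepted.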