KnowMe
Section 8

Scalability & Operations

Production deployment patterns, horizontal scaling strategies, observability, and the operational lessons learned from rsky's failures. This section addresses how to run the stack reliably at scale.

Lessons from rsky's Failures

Blacksky's rsky implementation attempted a full AT Protocol stack in Rust but suffered critical production failures. Understanding these failures informs our architecture decisions.

OOM Crashes

Unbounded memory growth from accumulating firehose data without backpressure. The relay consumer lacked flow control.

Our approach: Use Tap's built-in backpressure and ack-based flow control instead of raw firehose consumption.
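
A minimal sketch of the idea using a bounded tokio channel (the Event shape, channel size, and cursor handling are illustrative, not Tap's actual API): when the consumer falls behind, the producer awaits instead of buffering, so memory stays flat rather than growing until OOM.

// Sketch: ack-based flow control with a bounded channel.
use tokio::sync::mpsc;

// Hypothetical event type; the real firehose payload differs.
struct Event { seq: u64 }

async fn index(ev: &Event) {
    // ... validate and write to the database ...
    let _ = ev.seq;
}

#[tokio::main]
async fn main() {
    // Bounded to 1024 in-flight events: `send` awaits when the buffer
    // is full, which is the backpressure the raw firehose consumer lacked.
    let (tx, mut rx) = mpsc::channel::<Event>(1024);

    tokio::spawn(async move {
        for seq in 0..10_000u64 {
            if tx.send(Event { seq }).await.is_err() {
                break; // consumer gone; stop producing
            }
        }
    });

    let mut cursor = 0u64;
    while let Some(ev) = rx.recv().await {
        index(&ev).await;
        // Advance the ack cursor only after the durable write, so a
        // crash replays from `cursor` instead of dropping events.
        cursor = ev.seq;
    }
    let _ = cursor;
}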

Data Integrity Loss

Users couldn't see their own posts. The AppView-PDS interaction lost records during indexing, likely due to serialization bugs.

Our approach: Two-tier validation: compile-time types for known schemas, runtime validation for dynamic ones. Never silently drop records.
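
A sketch of the two tiers with serde (the Post fields are abbreviated and the fallback policy is an assumption): known lexicons deserialize into compile-time-checked structs; anything else is kept as runtime-validated JSON instead of being dropped.

use serde::Deserialize;
use serde_json::Value;

// Tier 1: compile-time shape for a known lexicon (abbreviated).
#[derive(Deserialize)]
struct Post {
    text: String,
    #[serde(rename = "createdAt")]
    created_at: String,
}

// Tier 2: unknown lexicons stay as raw JSON, still validated and indexed.
enum Record {
    Post(Post),
    Unknown(Value),
}

fn parse(collection: &str, raw: &[u8]) -> Result<Record, serde_json::Error> {
    match collection {
        "app.bsky.feed.post" => Ok(Record::Post(serde_json::from_slice(raw)?)),
        _ => Ok(Record::Unknown(serde_json::from_slice(raw)?)),
    }
    // A parse error propagates to the caller (e.g., into a dead-letter
    // table for inspection); it is never a silent drop.
}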

Monolithic Architecture

PDS, relay, AppView, and feed generators all tightly coupled in one deployment. A failure in one component cascaded to all.

Our approach: Strict service boundaries. Each component deploys independently with its own scaling profile.
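
One way to keep a boundary from cascading (a sketch; the feed RPC and timeout value are hypothetical): the API tier treats a slow or dead dependency as a degraded response rather than a shared failure.

use std::time::Duration;
use tokio::time::timeout;

// Hypothetical RPC to the independently deployed indexer.
async fn fetch_feed_from_indexer() -> Result<Vec<String>, std::io::Error> {
    Ok(vec!["post-1".into()])
}

// The API stays up even when the indexer is down: a timeout or error
// yields an empty (or cached) feed instead of a cascading failure.
async fn get_feed() -> Vec<String> {
    match timeout(Duration::from_millis(250), fetch_feed_from_indexer()).await {
        Ok(Ok(feed)) => feed,
        _ => Vec::new(),
    }
}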

No Observability

When things broke, operators had no visibility into what was happening. No metrics, no structured logging, no tracing.

Our approach: Instrument everything from day one. Tracing spans, Prometheus metrics, structured JSON logs.

Deployment Architecture

| Service         | Scaling    | Resources (Baseline) | Notes                                            |
|-----------------|------------|----------------------|--------------------------------------------------|
| Tranquil PDS    | Vertical   | 2 CPU, 4 GB RAM      | Single instance per domain; scale DB separately  |
| Indigo Relay    | Vertical   | 4 CPU, 8 GB RAM      | Single instance; I/O bound, not CPU bound        |
| Indigo Tap      | Vertical   | 1 CPU, 2 GB RAM      | One per AppView; lightweight bridge              |
| AppView Indexer | Vertical   | 2 CPU, 4 GB RAM      | Single writer; bottleneck is DB write throughput |
| AppView API     | Horizontal | 1 CPU, 2 GB RAM each | Stateless; scale with HPA on request latency     |
| Conduit         | Vertical   | 1 CPU, 1 GB RAM      | Efficient Rust binary; handles thousands of rooms |
| LiveKit SFU     | Horizontal | 4 CPU, 8 GB RAM each | Scale with concurrent media sessions             |

Observability Stack

Every component exports metrics, traces, and structured logs. The observability stack provides full visibility into the system's behavior, enabling rapid diagnosis of issues.

Metrics

  • Prometheus exposition (see the sketch after this list)
  • Events/sec throughput
  • Indexing lag (seconds)
  • Query latency p50/p99
  • Connection pool utilization
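
A sketch of the exposition side with the prometheus crate (the metric names mirror the alerts below but are otherwise illustrative):

use prometheus::{Encoder, TextEncoder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Counter: total firehose events processed.
    let events_total = prometheus::register_counter!(
        "indexer_events_total", "Total firehose events processed")?;
    // Gauge: how far the indexer trails the firehose.
    let lag = prometheus::register_gauge!(
        "indexer_lag_seconds", "Indexing lag behind the firehose in seconds")?;
    // Histogram: query latency, from which p50/p99 are derived.
    let latency = prometheus::register_histogram!(
        "appview_query_latency_seconds", "AppView query latency in seconds")?;

    events_total.inc();
    lag.set(1.2);
    latency.observe(0.042);

    // Render the /metrics payload in the Prometheus text format.
    let mut buf = Vec::new();
    TextEncoder::new().encode(&prometheus::gather(), &mut buf)?;
    println!("{}", String::from_utf8(buf)?);
    Ok(())
}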

Tracing

  • OpenTelemetry spans
  • Cross-service correlation
  • Event → Index → Query path
  • Agent task lifecycle
  • Jaeger/Tempo backend

Logging

  • Structured JSON (tracing crate)
  • Request ID propagation
  • DID-scoped context
  • Error classification
  • Loki/Elasticsearch backend
// Rust observability setup with tracing + OTEL
use tracing_subscriber::{fmt, EnvFilter, layer::SubscriberExt, util::SubscriberInitExt};
use opentelemetry::global;
use opentelemetry_otlp::WithExportConfig;

let tracer = opentelemetry_otlp::new_pipeline()
    .tracing()
    .with_exporter(opentelemetry_otlp::new_exporter().tonic())
    .install_batch(opentelemetry::runtime::Tokio)?;

tracing_subscriber::registry()
    .with(EnvFilter::from_default_env())
    .with(fmt::layer().json())
    .with(tracing_opentelemetry::layer().with_tracer(tracer))
    .init();

// Now every #[tracing::instrument] emits spans + logs
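
As a usage sketch (the handler name and arguments are illustrative, not the project's actual code), an instrumented function then carries the request ID and DID context on every span and log line it emits:

// Hypothetical indexer handler; `skip(record)` keeps the raw bytes
// out of the span while request_id and did are recorded as fields.
#[tracing::instrument(skip(record))]
async fn index_record(request_id: String, did: String, record: Vec<u8>) -> Result<(), std::io::Error> {
    // This event inherits the request_id and did span fields.
    tracing::info!(size = record.len(), "indexing record");
    // ... validate the record and write it to the database ...
    Ok(())
}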

Critical Alerts

| Severity | Condition                          | Meaning / Response                                                              |
|----------|------------------------------------|---------------------------------------------------------------------------------|
| Critical | indexer_lag_seconds > 30           | Indexer falling behind firehose. Check DB write throughput and connection pool.  |
| Critical | tap_connection_status != connected | Tap lost connection to relay. Check network, relay health, and auto-reconnect.   |
| Warning  | pds_auth_failures_5m > 100         | Elevated auth failures. Possible credential stuffing or a misconfigured agent.   |
| Warning  | appview_query_p99 > 500ms          | Query latency degraded. Check DB indexes, connection pool, and query plans.      |
| Warning  | memory_usage_percent > 85          | Memory pressure. Check for unbounded caches or connection leaks.                 |
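
For context, the gauge behind the first alert can be derived as wall-clock time minus the commit timestamp of the last indexed event (a sketch; the millisecond timestamp field is an assumption):

use std::time::{SystemTime, UNIX_EPOCH};

// indexer_lag_seconds = now - timestamp of the last indexed event.
fn lag_seconds(last_event_unix_ms: u64) -> f64 {
    let now_ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before Unix epoch")
        .as_millis() as u64;
    now_ms.saturating_sub(last_event_unix_ms) as f64 / 1000.0
}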