Scalability & Operations
Production deployment patterns, horizontal scaling strategies, observability, and the operational lessons learned from rsky's failures. This section addresses how to run the stack reliably at scale.
Lessons from rsky's Failures
Blacksky's rsky implementation attempted a full AT Protocol stack in Rust but suffered critical production failures. Understanding these failures informs our architecture decisions.
OOM Crashes
Unbounded memory growth from accumulating firehose data without backpressure. The relay consumer lacked flow control.
Our approach: Use Tap's built-in backpressure and ack-based flow control instead of raw firehose consumption.
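The ack-based flow-control idea can be sketched with a bounded queue: the producer blocks when the consumer falls behind instead of buffering without limit. This is a minimal std-only illustration, not the actual Tap API; the `Event` type and `run_pipeline` helper are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical event type standing in for a firehose commit.
struct Event {
    seq: u64,
}

// Bounded channel => backpressure: send() blocks once `capacity`
// events are in flight, so memory stays bounded (the rsky OOM
// failure mode was the opposite: an unbounded buffer).
fn run_pipeline(total: u64, capacity: usize) -> u64 {
    let (tx, rx) = mpsc::sync_channel::<Event>(capacity);
    let producer = thread::spawn(move || {
        for seq in 0..total {
            tx.send(Event { seq }).expect("consumer hung up");
        }
        // tx dropped here; the consumer loop below then terminates.
    });
    let mut acked = 0;
    for ev in rx {
        // ...process and durably persist the event here...
        acked = ev.seq + 1; // advance the ack cursor only after success
    }
    producer.join().unwrap();
    acked
}

fn main() {
    let acked = run_pipeline(10_000, 64);
    println!("acked {acked} events with bounded memory");
}
```

The key property is that the ack cursor only moves after processing succeeds, so a crash resumes from the last acknowledged event rather than losing data.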
Data Integrity Loss
Users couldn't see their own posts. The AppView-PDS interaction lost records during indexing, likely due to serialization bugs.
Our approach: two-tier validation, with compile-time types for known schemas and runtime validation for dynamic ones. Never silently drop records.
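One way to enforce "never silently drop" is to make the indexing result a two-armed type, so a record that fails runtime validation is quarantined with a reason rather than discarded. A std-only sketch with illustrative names (the real AppView would validate against the record's Lexicon schema):

```rust
// Outcome of indexing one record: either stored, or kept aside
// with a reason for later replay and inspection. There is no
// silent-drop arm by construction.
#[derive(Debug)]
enum Indexed {
    Stored(String), // record accepted into the index (by rkey)
    Quarantined { rkey: String, reason: String },
}

fn index_record(rkey: &str, json: &str) -> Indexed {
    // Stand-in runtime check; a real implementation would run full
    // schema validation here.
    if json.trim_start().starts_with('{') {
        Indexed::Stored(rkey.to_string())
    } else {
        Indexed::Quarantined {
            rkey: rkey.to_string(),
            reason: "not a JSON object".into(),
        }
    }
}

fn main() {
    println!("{:?}", index_record("3k2a", "{\"text\":\"hello\"}"));
    println!("{:?}", index_record("3k2b", "not-json"));
}
```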
Monolithic Architecture
PDS, relay, AppView, and feed generators all tightly coupled in one deployment. A failure in one component cascaded to all.
Our approach: Strict service boundaries. Each component deploys independently with its own scaling profile.
No Observability
When things broke, operators had no visibility into what was happening. No metrics, no structured logging, no tracing.
Our approach: Instrument everything from day one. Tracing spans, Prometheus metrics, structured JSON logs.
Deployment Architecture
| Service | Scaling | Resources (Baseline) | Notes |
|---|---|---|---|
| Tranquil PDS | Vertical | 2 CPU, 4GB RAM | Single instance per domain; scale DB separately |
| Indigo Relay | Vertical | 4 CPU, 8GB RAM | Single instance; I/O bound, not CPU bound |
| Indigo Tap | Vertical | 1 CPU, 2GB RAM | One per AppView; lightweight bridge |
| AppView Indexer | Vertical | 2 CPU, 4GB RAM | Single writer; bottleneck is DB write throughput |
| AppView API | Horizontal | 1 CPU, 2GB RAM each | Stateless; scale with HPA on request latency |
| Conduit | Vertical | 1 CPU, 1GB RAM | Efficient Rust binary; handles thousands of rooms |
| LiveKit SFU | Horizontal | 4 CPU, 8GB RAM each | Scale per concurrent media sessions |
Observability Stack
Every component exports metrics, traces, and structured logs. The observability stack provides full visibility into the system's behavior, enabling rapid diagnosis of issues.
Metrics
- Prometheus exposition
- Events/sec throughput
- Indexing lag (seconds)
- Query latency p50/p99
- Connection pool utilization
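For reference, the Prometheus text exposition format these metrics use is simple enough to sketch by hand. In practice a service would use the `prometheus` crate and serve the rendered body at `/metrics`; this std-only helper is illustrative.

```rust
// Render a single gauge in Prometheus text exposition format:
// a HELP line, a TYPE line, then the sample itself.
fn render_gauge(name: &str, help: &str, value: f64) -> String {
    format!("# HELP {name} {help}\n# TYPE {name} gauge\n{name} {value}\n")
}

fn main() {
    let body = render_gauge("indexer_lag_seconds", "Seconds behind the firehose", 2.5);
    print!("{body}");
}
```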
Tracing
- OpenTelemetry spans
- Cross-service correlation
- Event → Index → Query path
- Agent task lifecycle
- Jaeger/Tempo backend
Logging
- Structured JSON (tracing crate)
- Request ID propagation
- DID-scoped context
- Error classification
- Loki/Elasticsearch backend
```rust
// Rust observability setup with tracing + OTEL.
// Crates: tracing-subscriber, opentelemetry, opentelemetry-otlp,
// tracing-opentelemetry.
use tracing_subscriber::{
    fmt, layer::SubscriberExt, util::SubscriberInitExt, EnvFilter,
};

fn init_telemetry() -> Result<(), opentelemetry::trace::TraceError> {
    // Export spans over OTLP/gRPC to the configured collector.
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(opentelemetry_otlp::new_exporter().tonic())
        .install_batch(opentelemetry::runtime::Tokio)?;

    // Layered subscriber: env-driven filtering, JSON logs, OTEL spans.
    tracing_subscriber::registry()
        .with(EnvFilter::from_default_env())
        .with(fmt::layer().json())
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .init();
    Ok(())
}

// Now every #[tracing::instrument] emits spans + structured logs
```
Critical Alerts
- `indexer_lag_seconds > 30`: Indexer falling behind the firehose. Check DB write throughput and connection pool.
- `tap_connection_status != connected`: Tap lost its connection to the relay. Check network, relay health, and auto-reconnect.
- `pds_auth_failures_5m > 100`: Elevated auth failures. Possible credential stuffing or a misconfigured agent.
- `appview_query_p99 > 500ms`: Query latency degraded. Check DB indexes, connection pool, and query plans.
- `memory_usage_percent > 85`: Memory pressure. Check for unbounded caches or connection leaks.
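The `indexer_lag_seconds` signal above can be derived as wall-clock time minus the timestamp of the last event the indexer committed. A small sketch with illustrative names, using only the standard library:

```rust
use std::time::{Duration, SystemTime};

// Lag = now - timestamp of the last durably indexed event.
// Clamps to zero if clocks disagree slightly.
fn indexer_lag_seconds(last_indexed_event_time: SystemTime) -> f64 {
    SystemTime::now()
        .duration_since(last_indexed_event_time)
        .unwrap_or(Duration::ZERO)
        .as_secs_f64()
}

fn main() {
    let last = SystemTime::now() - Duration::from_secs(5);
    let lag = indexer_lag_seconds(last);
    println!("lag = {lag:.1}s, alert = {}", lag > 30.0);
}
```

A gauge like this is what the `indexer_lag_seconds > 30` alert rule would evaluate against.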