KnowMe
Section 8

Scalability & Operations

Production deployment patterns, horizontal scaling strategies, observability, and the operational lessons learned from rsky's failures. This section addresses how to run the stack reliably at scale.

Lessons from rsky's Failures

Blacksky's rsky implementation attempted a full AT Protocol stack in Rust but suffered critical production failures. Understanding these failures informs our architecture decisions.

OOM Crashes

Unbounded memory growth from accumulating firehose data without backpressure. The relay consumer lacked flow control.

Our approach: Use Tap's built-in backpressure and ack-based flow control instead of raw firehose consumption.
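
A minimal sketch of the idea using a bounded tokio channel (the Event shape, channel size, and cursor handling are illustrative, not Tap's actual API): when the consumer falls behind, the producer awaits instead of buffering, so memory stays flat rather than growing until OOM.

// Sketch: ack-based flow control with a bounded channel.
use tokio::sync::mpsc;

// Hypothetical event type; the real firehose payload differs.
struct Event { seq: u64 }

async fn index(ev: &Event) {
    // ... validate and write to the database ...
    let _ = ev.seq;
}

#[tokio::main]
async fn main() {
    // Bounded to 1024 in-flight events: `send` awaits when the buffer
    // is full, which is the backpressure the raw firehose consumer lacked.
    let (tx, mut rx) = mpsc::channel::<Event>(1024);

    tokio::spawn(async move {
        for seq in 0..10_000u64 {
            if tx.send(Event { seq }).await.is_err() {
                break; // consumer gone; stop producing
            }
        }
    });

    let mut cursor = 0u64;
    while let Some(ev) = rx.recv().await {
        index(&ev).await;
        // Advance the ack cursor only after the durable write, so a
        // crash replays from `cursor` instead of dropping events.
        cursor = ev.seq;
    }
    let _ = cursor;
}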

Data Integrity Loss

Users couldn't see their own posts. The AppView-PDS interaction lost records during indexing, likely due to serialization bugs.

Our approach: Two-tier validation: compile-time types for known schemas, runtime validation for dynamic ones. Never silently drop records.
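
A sketch of the two tiers with serde (the Post fields are abbreviated and the fallback policy is an assumption): known lexicons deserialize into compile-time-checked structs; anything else is kept as runtime-validated JSON instead of being dropped.

use serde::Deserialize;
use serde_json::Value;

// Tier 1: compile-time shape for a known lexicon (abbreviated).
#[derive(Deserialize)]
struct Post {
    text: String,
    #[serde(rename = "createdAt")]
    created_at: String,
}

// Tier 2: unknown lexicons stay as raw JSON, still validated and indexed.
enum Record {
    Post(Post),
    Unknown(Value),
}

fn parse(collection: &str, raw: &[u8]) -> Result<Record, serde_json::Error> {
    match collection {
        "app.bsky.feed.post" => Ok(Record::Post(serde_json::from_slice(raw)?)),
        _ => Ok(Record::Unknown(serde_json::from_slice(raw)?)),
    }
    // A parse error propagates to the caller (e.g., into a dead-letter
    // table for inspection); it is never a silent drop.
}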

Monolithic Architecture

PDS, relay, AppView, and feed generators all tightly coupled in one deployment. A failure in one component cascaded to all.

Our approach: Strict service boundaries. Each component deploys independently with its own scaling profile.
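
One way to keep a boundary from cascading (a sketch; the feed RPC and timeout value are hypothetical): the API tier treats a slow or dead dependency as a degraded response rather than a shared failure.

use std::time::Duration;
use tokio::time::timeout;

// Hypothetical RPC to the independently deployed indexer.
async fn fetch_feed_from_indexer() -> Result<Vec<String>, std::io::Error> {
    Ok(vec!["post-1".into()])
}

// The API stays up even when the indexer is down: a timeout or error
// yields an empty (or cached) feed instead of a cascading failure.
async fn get_feed() -> Vec<String> {
    match timeout(Duration::from_millis(250), fetch_feed_from_indexer()).await {
        Ok(Ok(feed)) => feed,
        _ => Vec::new(),
    }
}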

No Observability

When things broke, operators had no visibility into what was happening. No metrics, no structured logging, no tracing.

Our approach: Instrument everything from day one. Tracing spans, Prometheus metrics, structured JSON logs.

Deployment Architecture

| Service         | Scaling    | Resources (Baseline) | Notes                                            |
|-----------------|------------|----------------------|--------------------------------------------------|
| Tranquil PDS    | Vertical   | 2 CPU, 4 GB RAM      | Single instance per domain; scale DB separately  |
| Indigo Relay    | Vertical   | 4 CPU, 8 GB RAM      | Single instance; I/O bound, not CPU bound        |
| Indigo Tap      | Vertical   | 1 CPU, 2 GB RAM      | One per AppView; lightweight bridge              |
| AppView Indexer | Vertical   | 2 CPU, 4 GB RAM      | Single writer; bottleneck is DB write throughput |
| AppView API     | Horizontal | 1 CPU, 2 GB RAM each | Stateless; scale with HPA on request latency     |
| Conduit         | Vertical   | 1 CPU, 1 GB RAM      | Efficient Rust binary; handles thousands of rooms |
| LiveKit SFU     | Horizontal | 4 CPU, 8 GB RAM each | Scale with concurrent media sessions             |

Observability Stack

Every component exports metrics, traces, and structured logs. The observability stack provides full visibility into the system's behavior, enabling rapid diagnosis of issues.

Metrics

  • Prometheus exposition (see the sketch after this list)
  • Events/sec throughput
  • Indexing lag (seconds)
  • Query latency p50/p99
  • Connection pool utilization
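
A sketch of the exposition side with the prometheus crate (the metric names mirror the alerts below but are otherwise illustrative):

use prometheus::{Encoder, TextEncoder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Counter: total firehose events processed.
    let events_total = prometheus::register_counter!(
        "indexer_events_total", "Total firehose events processed")?;
    // Gauge: how far the indexer trails the firehose.
    let lag = prometheus::register_gauge!(
        "indexer_lag_seconds", "Indexing lag behind the firehose in seconds")?;
    // Histogram: query latency, from which p50/p99 are derived.
    let latency = prometheus::register_histogram!(
        "appview_query_latency_seconds", "AppView query latency in seconds")?;

    events_total.inc();
    lag.set(1.2);
    latency.observe(0.042);

    // Render the /metrics payload in the Prometheus text format.
    let mut buf = Vec::new();
    TextEncoder::new().encode(&prometheus::gather(), &mut buf)?;
    println!("{}", String::from_utf8(buf)?);
    Ok(())
}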

Tracing

  • OpenTelemetry spans
  • Cross-service correlation
  • Event → Index → Query path
  • Agent task lifecycle
  • Jaeger/Tempo backend

Logging

  • Structured JSON (tracing crate)
  • Request ID propagation
  • DID-scoped context
  • Error classification
  • Loki/Elasticsearch backend
// Rust observability setup with tracing + OTEL
use tracing_subscriber::{fmt, EnvFilter, layer::SubscriberExt, util::SubscriberInitExt};
use opentelemetry::global;
use opentelemetry_otlp::WithExportConfig;

let tracer = opentelemetry_otlp::new_pipeline()
    .tracing()
    .with_exporter(opentelemetry_otlp::new_exporter().tonic())
    .install_batch(opentelemetry::runtime::Tokio)?;

tracing_subscriber::registry()
    .with(EnvFilter::from_default_env())
    .with(fmt::layer().json())
    .with(tracing_opentelemetry::layer().with_tracer(tracer))
    .init();

// Now every #[tracing::instrument] emits spans + logs
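
As a usage sketch (the handler name and arguments are illustrative, not the project's actual code), an instrumented function then carries the request ID and DID context on every span and log line it emits:

// Hypothetical indexer handler; `skip(record)` keeps the raw bytes
// out of the span while request_id and did are recorded as fields.
#[tracing::instrument(skip(record))]
async fn index_record(request_id: String, did: String, record: Vec<u8>) -> Result<(), std::io::Error> {
    // This event inherits the request_id and did span fields.
    tracing::info!(size = record.len(), "indexing record");
    // ... validate the record and write it to the database ...
    Ok(())
}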

Critical Alerts

| Severity | Condition                          | Meaning / Response                                                              |
|----------|------------------------------------|---------------------------------------------------------------------------------|
| Critical | indexer_lag_seconds > 30           | Indexer falling behind firehose. Check DB write throughput and connection pool.  |
| Critical | tap_connection_status != connected | Tap lost connection to relay. Check network, relay health, and auto-reconnect.   |
| Warning  | pds_auth_failures_5m > 100         | Elevated auth failures. Possible credential stuffing or a misconfigured agent.   |
| Warning  | appview_query_p99 > 500ms          | Query latency degraded. Check DB indexes, connection pool, and query plans.      |
| Warning  | memory_usage_percent > 85          | Memory pressure. Check for unbounded caches or connection leaks.                 |
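
For context, the gauge behind the first alert can be derived as wall-clock time minus the commit timestamp of the last indexed event (a sketch; the millisecond timestamp field is an assumption):

use std::time::{SystemTime, UNIX_EPOCH};

// indexer_lag_seconds = now - timestamp of the last indexed event.
fn lag_seconds(last_event_unix_ms: u64) -> f64 {
    let now_ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before Unix epoch")
        .as_millis() as u64;
    now_ms.saturating_sub(last_event_unix_ms) as f64 / 1000.0
}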