Retrospective: Building a Seamless Omnichannel IoT Architecture with RabbitMQ (A 3-Week POC)

The architecture behind Wattlog.pro's BLE-to-cloud data pipeline.

Table of Contents

1. The Architecture at a Glance
- 1.1 Why mobile doesn't connect directly to RabbitMQ
- 1.2 Achieving true omnichannel capabilities
2. AWS Infrastructure Layer
- 2.1 What each service does
- 2.2 Why I kept it on a single EC2
3. Testing Strategy
4. Security
5. Performance Optimisations
6. Scalability
7. Architectural Trade-offs
- Pros
- Cons
What I learned

When building a modern application that bridges local Bluetooth Low Energy (BLE) hardware with a cloud backend, one of the biggest challenges is handling wildly different client environments. Desktop apps have horsepower and stable network connections, while mobile apps face strict battery constraints, backgrounding limitations, and frequent network switches.

Over the last three weeks, I set out to build a Proof of Concept (POC) to solve this, with a hard infrastructure budget of just $20 USD per month. I designed an Event-Driven Architecture (EDA) centered around a RabbitMQ message bus. By layering a Backend-for-Frontend (BFF) pattern on top of it for mobile, I created a highly decoupled, transport-agnostic system that enables a true omnichannel user experience without breaking the bank.

Here is a retrospective deep dive into what I built during this sprint: who talks to whom, what worked, and where the cost-driven trade-offs live.

1. The Architecture at a Glance

My system features three distinct transport paths, all terminating in RabbitMQ as the common backbone:

┌─────────────────────────────────────────────────────────────────────────┐
│  Desktop bridge (Python, wattlog_bridge)                                │
│    BLE ──→ amqps://<broker>:<port>     ──→ ble.events.{client_id}       │
│    BLE ←── amqps://<broker>:<port>     ←── ble.commands.{client_id}     │
│    DIRECTLY to RabbitMQ — TCP + TLS over AMQPS                          │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
                          ┌──────────────────┐
                          │    RabbitMQ      │ ← single broker, per-env queues
                          │  (prod)          │   ble.events.{X}, ble.commands.{X}
                          └──────────────────┘
                                  ▲
                                  │
┌─────────────────────────────────────────────────────────────────────────┐
│  Mobile bridge (Android Java / iOS Swift / fallback BleBridgeLoop.ts)   │
│    BLE ──→ WebSocket /ws/bridge ──→ [backend forwards to RMQ]           │
│    BLE ←── WebSocket /ws/bridge ←── [backend reads from RMQ commands]   │
│    VIA BACKEND as a proxy — does NOT connect to RMQ directly            │
└─────────────────────────────────────────────────────────────────────────┘
                                  ▲
                                  │
┌─────────────────────────────────────────────────────────────────────────┐
│  Backend (WorkoutSessionManager)                                        │
│    Consumes ble.events.{client_id} → live UI                            │
│    Publishes to ble.commands.{client_id} → ERG/slope/scan               │
└─────────────────────────────────────────────────────────────────────────┘

1.1 Why mobile doesn't connect directly to RabbitMQ

While the desktop bridge connects natively via AMQP, mobile devices route their traffic through a WebSocket Protocol Translation Gateway on the backend. I made this choice for three reasons:

Client stability. iOS and Android lack stable, native AMQP clients comparable to Python's aio-pika. RabbitMQ over WebSocket via the Web STOMP plugin exists, but it pushes unnecessary complexity into the broker layer.
Unified authentication. RabbitMQ requires per-client credentials, and injecting and rotating those securely on mobile apps is risky. A WebSocket proxy lets the mobile app reuse the same JWT as the rest of the REST API.
Battery and network optimisation. A single TLS WebSocket connection survives mobile OS sleep states and network transitions much better than AMQP heartbeats.

Logically, the mobile client sees the same RabbitMQ queues as the desktop client; the transport is simply WebSockets instead of AMQP.
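The translation step in the gateway can be sketched as two small pure functions. This is a minimal illustration, not the actual gateway code: the function names are hypothetical, and it assumes bridge frames are JSON objects, which the text does not state. Only the ble.events.{client_id} and ble.commands.{client_id} names come from the diagram above.

```python
import json

EVENTS_PREFIX = "ble.events."      # routing-key prefixes taken from the diagram above
COMMANDS_PREFIX = "ble.commands."


def ws_frame_to_amqp(client_id: str, frame_text: str) -> tuple[str, bytes]:
    """Translate one inbound WebSocket text frame into (routing_key, body)."""
    payload = json.loads(frame_text)  # assumption: frames are JSON text
    if not isinstance(payload, dict):
        raise ValueError("bridge frames must be JSON objects")
    routing_key = f"{EVENTS_PREFIX}{client_id}"
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return routing_key, body


def amqp_to_ws_frame(routing_key: str, body: bytes, client_id: str) -> str:
    """Translate one broker command message into an outbound WebSocket frame."""
    expected = f"{COMMANDS_PREFIX}{client_id}"
    if routing_key != expected:
        # Never forward another client's commands down this socket.
        raise ValueError(f"refusing to forward {routing_key!r} to {client_id!r}")
    return body.decode("utf-8")


key, body = ws_frame_to_amqp("abc", '{"hr": 142}')
frame = amqp_to_ws_frame("ble.commands.abc", b'{"erg":200}', "abc")
```

Keeping the translation free of I/O means it can be unit-tested without a broker or a socket, while the surrounding WebSocket handler only has to move bytes.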

1.2 Achieving true omnichannel capabilities

The most powerful aspect of this design is that the core business logic (WorkoutSessionManager) is completely transport-agnostic. The backend only listens to ble.events.{client_id} — it does not care if the bytes arrived via AMQP from a laptop or via WebSocket from an iPhone.

A user can start a workout on their desktop, pause, and seamlessly resume on mobile mid-workout. The backend keeps reading the unbroken flow of messages from the same queue, blind to the fact that the underlying hardware and transport protocol just changed.

2. AWS Infrastructure Layer

The whole platform is provisioned with Terraform on AWS. I deliberately chose a single-VPC, single-EC2 layout: small enough to be cheap, big enough to scale vertically before I ever need to split it.

                      ┌─────────────────────────────┐
       Users ─────►   │  CloudFront (TLS edge)      │  ◄── ACM cert (us-east-1)
                      │  /api/*, /ws/*, SPA assets  │
                      └────────────┬────────────────┘
                                   │ HTTP :8080 (CF prefix list)
                                   ▼
              ┌──────────────────────────────────────────────┐
              │  EC2 (Ubuntu, Docker Compose)                │
              │  ┌────────────────────────────────────────┐  │
              │  │  nginx — :80 :443 :8080 :5671 :5673    │  │
              │  ├────────────────────────────────────────┤  │
              │  │  FastAPI api (prod)                    │  │
              │  ├────────────────────────────────────────┤  │
              │  │  RabbitMQ                              │  │
              │  ├────────────────────────────────────────┤  │
              │  │  PostgreSQL + PgBouncer                │  │
              │  └────────────────────────────────────────┘  │
              │  Elastic IP, gp3 encrypted root volume       │
              └──────────────────────────────────────────────┘
                                   │
       ┌───────────────────────────┼───────────────────────────┐
       ▼                           ▼                           ▼
┌─────────────┐           ┌────────────────┐         ┌──────────────────┐
│  Route 53   │           │ Secrets Manager│         │ S3 (releases +   │
│  DNS + ACM  │           │ JWT/DB/RMQ pwd │         │   DB backups)    │
└─────────────┘           └────────────────┘         └──────────────────┘
                                   │
                                   ▼
                          ┌─────────────────┐
                          │  SES (e-mail)   │
                          └─────────────────┘

2.1 What each service does

Service	Role
CloudFront	TLS edge, geo-distributed cache, SPA delivery, forwards `/api/` and `/ws/` to the EC2 origin.
ACM	Public TLS certificates (CloudFront cert pinned to `us-east-1`).
Route 53	DNS, plus the DNS validation records for ACM and the DNS-01 challenges for Let's Encrypt.
EC2 + Docker	Single instance running nginx + FastAPI + RabbitMQ + PostgreSQL + PgBouncer via Docker Compose.
Secrets Manager	Stores `JWT_SECRET`, `POSTGRES_PASSWORD`, `RABBITMQ_PASSWORD`. Pulled at boot by the app.
IAM	EC2 instance profile with least-privilege policies for Secrets, SES, S3 (releases + backups), R53.
S3	Signed downloads of desktop installer + nightly PostgreSQL dumps.
SES	Transactional e-mail (verification, password reset, waitlist).

2.2 Why I kept it on a single EC2

For a project at this stage (hundreds of concurrent rides, not millions), the operational simplicity of one box wins. There is no cross-AZ chatter, no inter-service latency, and PostgreSQL → PgBouncer → API talks over local Docker networking. The Terraform module is small enough that any engineer can read the whole infrastructure in one sitting.

The trade-off is obvious: no high availability. Section 6 covers exactly how I plan to evolve out of that without rewriting the app.

In a POC phase, engineering velocity and iteration speed are more valuable than five-nines of availability.

3. Testing Strategy

A distributed system with three transports (AMQP, WebSocket, HTTP), two client platforms, and stateful in-memory sessions is impossible to verify with unit tests alone. I layered testing into four tiers.

3.1 Unit tests: pure business logic

The interval engine, FTP detection, zone resolver, HRV analysis, decoupling metric, etc. are all isolated from I/O. They run in milliseconds and form the bulk of the suite (tests/test_*.py). Anything in services/ that does math or rules lives here.

3.2 Integration tests: broker + DB in-process

WorkoutSessionManager is tested against an in-process aio-pika robust connection and a temporary SQLite/PostgreSQL session factory. This proves the engine + RMQ publisher + DB writer wiring works without needing a full HTTP stack.

3.3 E2E scenario tests: real broker, real API

tests/e2e/test_scenarios.py boots the actual FastAPI server, RabbitMQ, and an rmq_bridge_sim subprocess pretending to be a desktop bridge. Each test walks a full ride:

1. Register + log in.
2. Create athlete + register a BLE bridge client.
3. Spawn the simulator subprocess with that client_id.
4. Import a .zwo workout file.
5. Start a session, watch /ws/live for samples.
6. Stop and save, then assert sample rows landed in PostgreSQL.

3.4 Stress tests: N concurrent simulated users

tests/e2e/test_stress.py runs the same flow in parallel across four tiers:

Tier	Concurrent users	Used for
`smoke`	5	CI sanity check
`basic`	10	Local dev validation
`load`	100	Pre-release performance gate
`spike`	500	Capacity planning experiments

A semaphore caps simultaneous simulator spawns at 20 to avoid drowning the event loop. The harness collects p50 / p95 / p99 latency per stage, total throughput, and first error per stage — which is how the admission control limit (40 concurrent sessions) was originally calibrated.
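The two mechanics described above, a semaphore that caps parallel spawns and a percentile report, can be sketched with the standard library alone. This is an illustration rather than the real harness: simulated_user, run_tier, and percentiles are hypothetical names, the simulator spawn is replaced by a short sleep, and the latencies are synthetic.

```python
import asyncio
import statistics

MAX_PARALLEL_SPAWNS = 20  # the cap mentioned above


async def simulated_user(user_id: int, gate: asyncio.Semaphore, state: dict) -> float:
    """Pretend to spawn a simulator; returns a synthetic latency in seconds."""
    async with gate:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.001)  # stand-in for the real spawn + ride
        state["active"] -= 1
    return 0.010 + (user_id % 10) * 0.001  # deterministic fake latency


async def run_tier(users: int) -> tuple[list[float], int]:
    """Run one tier and report latencies plus the peak number of parallel spawns."""
    gate = asyncio.Semaphore(MAX_PARALLEL_SPAWNS)
    state = {"active": 0, "peak": 0}
    latencies = await asyncio.gather(
        *(simulated_user(i, gate, state) for i in range(users))
    )
    return list(latencies), state["peak"]


def percentiles(samples: list[float]) -> dict[str, float]:
    """p50 / p95 / p99 from the 99 cut points of statistics.quantiles."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


latencies, peak = asyncio.run(run_tier(100))  # the `load` tier size
report = percentiles(latencies)
```

Tracking the peak inside the semaphore makes the cap itself testable, which is useful when the limit is later tuned.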

3.5 What I explicitly do NOT mock

Database, broker, and the BLE bridge protocol are all real in E2E. Mocking those would defeat the purpose of testing a distributed system — the bugs I care about live precisely in the seams between transports.

4. Security

Security in an IoT system spans hardware credentials, transport, message authentication, and tenancy. Here is how each layer is handled.

4.1 Transport security

All public traffic terminates TLS at CloudFront or at the EC2 nginx layer (TLSv1.2 / TLSv1.3 only, modern cipher list).
The AMQP port is exposed only as AMQPS — the nginx stream {} block terminates TLS using Let's Encrypt certificates and proxies plaintext AMQP to the broker on the loopback network. The broker is never reachable in plaintext from the public internet.
HTTP → HTTPS redirect is enforced at the CloudFront viewer policy.

4.2 Authentication

Users: bcrypt password hashing runs in a dedicated thread pool, because bcrypt is CPU-bound and blocks the event loop if called inline. JWTs are signed with HS256; the secret is loaded from AWS Secrets Manager at boot, and the app refuses to start in production if the secret falls back to the dev default. Argon2 is the more modern choice, but bcrypt was sufficient for this POC; offloading it to a thread pool was the critical architectural fix.
Refresh tokens: stored hashed in PostgreSQL with an individually-revocable row per device.
Rate limiting: login is rate-limited per username (not just per IP) via slowapi, so distributed credential stuffing against a single account is still throttled.
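The thread-pool offload for password hashing follows a common asyncio pattern, sketched below. To keep the example stdlib-only, PBKDF2 from hashlib stands in for bcrypt; the pool size, iteration count, and function names are assumptions for illustration and do not come from the codebase.

```python
import asyncio
import hashlib
import hmac
import os
from concurrent.futures import ThreadPoolExecutor

# Dedicated pool so hashing never competes with the loop's default executor.
HASH_POOL = ThreadPoolExecutor(max_workers=2, thread_name_prefix="pw-hash")
ITERATIONS = 50_000  # illustrative only; tune for real hardware


def _hash_blocking(password: str, salt: bytes) -> bytes:
    # PBKDF2 stands in for bcrypt here so the sketch needs only the stdlib.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)


async def hash_password(password: str, salt: bytes) -> bytes:
    """Run the CPU-bound hash off the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(HASH_POOL, _hash_blocking, password, salt)


async def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    candidate = await hash_password(password, salt)
    return hmac.compare_digest(candidate, expected)  # constant-time comparison


async def demo() -> tuple[bool, bool]:
    salt = os.urandom(16)
    stored = await hash_password("correct horse", salt)
    return (
        await verify_password("correct horse", salt, stored),
        await verify_password("wrong", salt, stored),
    )


good, bad = asyncio.run(demo())
```

While a hash is running in the pool, the event loop stays free to serve WebSocket frames and other requests.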

4.3 BLE bridge authentication

This is the most subtle part. Each bridge gets a dedicated RabbitMQ account provisioned through the RabbitMQ HTTP Management API:

User: client_id (UUID v4).
Permissions:
configure → deny everything (bridge never declares topology).
write → ble.events.{client_id} and ble.commands.{client_id}.
read → ble.commands.{client_id}.
The password is generated server-side and returned once to the bridge at registration time.

This means a compromised bridge can only publish to its own queues, and can only read its own commands — it cannot eavesdrop on or impersonate another user's hardware.
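RabbitMQ expresses permissions as three regular expressions per user and vhost (configure, write, read), matched against resource names. The sketch below builds patterns equivalent to the permissions above and checks them locally with Python's re module; bridge_permissions is a hypothetical helper, and the exact patterns used in the real provisioning code are not shown in this post.

```python
import json
import re
import uuid


def bridge_permissions(client_id: str) -> dict[str, str]:
    """Build the three permission regexes for one bridge account."""
    cid = re.escape(client_id)  # UUID hyphens must not act as regex syntax
    return {
        "configure": "^$",  # matches only the empty name, so nothing can be declared
        "write": rf"^ble\.(events|commands)\.{cid}$",
        "read": rf"^ble\.commands\.{cid}$",
    }


client_id = str(uuid.uuid4())
other_id = str(uuid.uuid4())
perms = bridge_permissions(client_id)
payload = json.dumps(perms)  # JSON body for the management API permissions call
```

Anchoring every pattern with ^ and $ matters: an unanchored pattern would also match any longer name that merely contains the client ID.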

4.4 Multi-tenant isolation in the backend

WorkoutSessionManager instances are keyed by (user_id, athlete_id), not globally. A bug here previously caused cross-user 409s under load — it is now covered by a dedicated stress test.
The /ws/live and /ws/bridge endpoints verify that the supplied client_id belongs to the authenticated user_id via BleClientRepository.get(client_id=..., user_id=...) before accepting the connection. Unowned client IDs return WS close code 4003.
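Both rules above, tenant-scoped session keys and an ownership check before accepting a socket, fit in a few lines. This is a sketch under assumptions: InMemoryBleClientRepo is a stand-in for BleClientRepository, authorize_ws is a hypothetical helper, and integer user IDs are assumed for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional

WS_CLOSE_NOT_OWNER = 4003  # close code named in the text


@dataclass
class InMemoryBleClientRepo:
    """Stand-in for BleClientRepository; maps client_id -> owning user_id."""
    owners: dict[str, int] = field(default_factory=dict)

    def get(self, *, client_id: str, user_id: int) -> Optional[str]:
        return client_id if self.owners.get(client_id) == user_id else None


def authorize_ws(repo: InMemoryBleClientRepo, *, client_id: str, user_id: int) -> Optional[int]:
    """Return None to accept the socket, or a close code to reject it."""
    if repo.get(client_id=client_id, user_id=user_id) is None:
        return WS_CLOSE_NOT_OWNER
    return None


# Session registry keyed by (user_id, athlete_id), never by athlete_id alone.
sessions: dict[tuple[int, int], str] = {}
sessions[(1, 7)] = "session-for-user-1"
sessions[(2, 7)] = "session-for-user-2"  # same athlete_id, different tenant

repo = InMemoryBleClientRepo(owners={"bridge-a": 1})
```

Because the lookup takes both client_id and user_id, a caller cannot distinguish "does not exist" from "belongs to someone else", which avoids leaking which client IDs are valid.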

4.5 Secrets handling

Application secrets (JWT_SECRET, DB password, RMQ password) live in AWS Secrets Manager and are pulled at process startup by the API container.
Terraform stores only placeholder strings — real values are injected out-of-band with aws secretsmanager put-secret-value. The lifecycle { ignore_changes = [secret_string] } block prevents accidental rotation via terraform apply.
EC2 access is via an SSH key generated by Terraform (tls_private_key ed25519) and stored locally with 0600 permissions. The SSH security group is allow-listed by CIDR — never 0.0.0.0/0.

4.6 Encryption in transit

Every client-facing hop is encrypted. The one deliberate plaintext hop that leaves the box, CloudFront → origin, is called out below. The matrix tracks each transport hop end-to-end:

Hop	Protocol	TLS terminator	Cert source
Browser / mobile app → CloudFront	HTTPS (TLS 1.2/1.3)	CloudFront edge	ACM (us-east-1)
Browser → EC2 direct (bypass CF)	HTTPS (TLS 1.2/1.3)	nginx `:443`	Let's Encrypt
Browser / mobile → WebSocket `/ws/*`	WSS (TLS 1.2/1.3)	CloudFront / nginx	ACM / Let's Encrypt
Desktop bridge → broker	AMQPS	nginx `stream {}` `:5671`	Let's Encrypt (dedicated subdomain)
Mobile bridge → broker	WSS → app → AMQP loopback	CloudFront → nginx	ACM
CloudFront → EC2 origin	HTTP `:8080`	terminated at CF; allow-listed by CloudFront managed prefix list	n/a
API → RabbitMQ (intra-container)	AMQP	Docker loopback network	n/a
API → PostgreSQL (via PgBouncer)	TCP	Docker loopback network	n/a
App → AWS (Secrets Manager, SES, S3)	HTTPS	AWS SDK signs + TLS	AWS managed

Key points worth calling out explicitly:

AMQP is never exposed in plaintext. The broker listens on :5672 only on the internal Docker network. The public-facing port is :5671, an nginx stream {} block that terminates TLS (TLS 1.2/1.3 only) with a dedicated Let's Encrypt cert before proxying the decrypted AMQP frames to the broker on loopback.
CloudFront → origin is intentionally HTTP, not HTTPS. The edge-to-origin hop is restricted at the security group level by an AWS managed prefix list (com.amazonaws.global.cloudfront.origin-facing), so only CloudFront edge IPs can reach :8080. I skipped origin TLS to avoid double-termination cost, on the reasoning that an attacker would need a source IP inside the CloudFront prefix list to reach the port at all. The traffic on this hop still travels unencrypted between CloudFront and the origin, which I consider acceptable for this validation phase only. For production, end-to-end TLS via an Application Load Balancer or a local origin certificate is on the roadmap.
WebSocket survives TLS termination. Both /ws/live and /ws/bridge are upgraded at the public TLS edge and proxied as plaintext WebSocket to the FastAPI container on the Docker network. The connection stays open via proxy_read_timeout 3600s on both paths.
Modern protocols only. Every nginx TLS terminator allows only TLSv1.2 and TLSv1.3 (TLS 1.0/1.1 and SSLv3 are disabled), and CloudFront uses TLSv1.2_2021 as the minimum security policy.
Cert rotation is automated. Let's Encrypt renews via certbot using Route 53 DNS-01 (IAM policy certbot_dns on the EC2 role). ACM certs renew automatically. No human in the loop.

4.7 Defence in depth

nginx terminates :443 directly, so even if CloudFront is misconfigured or bypassed, clients that reach the origin directly are still served over TLS rather than plaintext.
Root EBS volume is encrypted at rest (encrypted = true).
Netdata for performance monitoring is bound to 127.0.0.1 only and reached over an SSH tunnel — never exposed to the internet.

5. Performance Optimisations

A workout-tracking app is, at its core, a time-series workload. Every active ride emits roughly 1 sample per second per metric (heart rate, power, cadence, speed). A 90-minute session lasts 5,400 seconds, so it produces ~5,400 rows per metric, or roughly 21,600 rows across heart_rate_samples and sensor_samples combined when all four metrics are recorded. At 40 concurrent rides that adds up to ~14M rows per month, and the workload is write-heavy, append-only, and queried almost exclusively by training_id + time range.

That shape is exactly what TimescaleDB was designed for. Here is the optimisation roadmap, ordered from cheapest to most invasive.

5.1 What is already in place

Server-side batching. WorkoutSessionManager buffers HR and sensor samples in memory and flushes every 60 seconds in one transaction per table (_FLUSH_INTERVAL_SECONDS = 60). That is one commit for ~60 rows instead of 60 commits of 1 row each: roughly 60× fewer commits per table, and correspondingly fewer WAL flushes.
PgBouncer. Transaction pooling sits between FastAPI and PostgreSQL, so the asyncpg pool can be small (DB_POOL_SIZE=5) per worker without starving connections at burst.
Composite indexes. (training_id, timestamp) is indexed on both sample tables — every analytics query is a bounded range scan, never a sequential scan.
CASCADE deletes with passive_deletes. Training deletion no longer loads child rows into Python — Postgres deletes them in-place.
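The server-side batching idea from the list above can be sketched as a small buffer that writes once per interval. This is a simplified model, not WorkoutSessionManager itself: SampleBuffer and write_batch are hypothetical names, and the clock is passed in explicitly so the behaviour is deterministic.

```python
from typing import Callable

FLUSH_INTERVAL_SECONDS = 60  # mirrors _FLUSH_INTERVAL_SECONDS from the text


class SampleBuffer:
    """Collects samples in memory and writes them out in one batch per interval."""

    def __init__(self, write_batch: Callable[[list[dict]], None], now: float) -> None:
        self._write_batch = write_batch
        self._pending: list[dict] = []
        self._last_flush = now

    def add(self, sample: dict, now: float) -> None:
        self._pending.append(sample)
        if now - self._last_flush >= FLUSH_INTERVAL_SECONDS:
            self.flush(now)

    def flush(self, now: float) -> None:
        if self._pending:
            self._write_batch(self._pending)  # one transaction for the whole batch
            self._pending = []
        self._last_flush = now


batches: list[list[dict]] = []
buffer = SampleBuffer(batches.append, now=0.0)
for second in range(1, 121):  # two minutes of 1 Hz heart-rate samples
    buffer.add({"t": second, "hr": 140}, now=float(second))
```

Two minutes of 1 Hz samples end up as two writes of 60 rows each. The cost of the design is that up to one interval of samples lives only in memory, so a real implementation also needs a final flush when the session stops.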

5.2 TimescaleDB: the natural next step

TimescaleDB is a PostgreSQL extension (not a separate database) that turns regular tables into hypertables — transparently partitioned by time. It is a drop-in replacement: existing SQL keeps working, ORM code does not change, Alembic migrations still apply.

For our workload it unlocks four wins:

a) Hypertables for sample tables

Convert heart_rate_samples and sensor_samples to hypertables partitioned by timestamp with a 7-day chunk interval. Effects:

INSERTs hit the smallest, hottest chunk — index pages stay in RAM during a ride.
Range queries on WHERE timestamp BETWEEN ... automatically prune irrelevant chunks (constraint exclusion), often skipping >95% of data without an index lookup.
VACUUM stays cheap because old chunks are immutable.

SELECT create_hypertable('heart_rate_samples', 'timestamp',
                         chunk_time_interval => INTERVAL '7 days',
                         migrate_data => TRUE);

SELECT create_hypertable('sensor_samples', 'timestamp',
                         chunk_time_interval => INTERVAL '7 days',
                         migrate_data => TRUE);

b) Native compression on cold chunks

Chunks older than ~14 days are read-only history. TimescaleDB can columnar-compress them in-place, typically achieving 10–20× size reduction on this kind of numeric time-series data, with queries still working transparently.

ALTER TABLE sensor_samples
  SET (timescaledb.compress,
       timescaledb.compress_segmentby = 'training_id, data_type',
       timescaledb.compress_orderby   = 'timestamp');

SELECT add_compression_policy('sensor_samples', INTERVAL '14 days');

Result: a year of historical data fits comfortably on the existing gp3 volume; backups shrink proportionally; cold queries (yearly trends, fitness analytics) run faster because they read fewer pages.

c) Continuous aggregates for analytics

Today, the fitness-analytics, weekly-summary, and decoupling endpoints all aggregate raw samples on read. That works at 40 sessions but won't at 4,000. Continuous aggregates are materialised views that TimescaleDB keeps incrementally refreshed in the background:

CREATE MATERIALIZED VIEW power_1min
WITH (timescaledb.continuous) AS
SELECT training_id,
       time_bucket(INTERVAL '1 minute', timestamp) AS bucket,
       AVG(value)::float AS avg_power,
       MAX(value)::float AS max_power
  FROM sensor_samples
 WHERE data_type = 'POWER'
 GROUP BY training_id, bucket;

SELECT add_continuous_aggregate_policy('power_1min',
        start_offset => INTERVAL '7 days',
        end_offset   => INTERVAL '1 minute',
        schedule_interval => INTERVAL '1 minute');

A weekly-summary query that scans 50,000 raw rows becomes a query against ~800 pre-aggregated rows — orders of magnitude faster, with no application changes beyond pointing the analytics service at the view.
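To make the row-count reduction concrete, the same 1-minute bucketing can be reproduced in plain Python. This only illustrates what the aggregate computes; it says nothing about how TimescaleDB maintains it incrementally. bucket_1min is a hypothetical helper and the power values are synthetic.

```python
from collections import defaultdict


def bucket_1min(samples: list[tuple[int, float]]) -> dict[int, dict[str, float]]:
    """Group (epoch_seconds, watts) pairs into 1-minute buckets with avg and max."""
    grouped: dict[int, list[float]] = defaultdict(list)
    for ts, watts in samples:
        grouped[ts - ts % 60].append(watts)  # same idea as a 1-minute time_bucket
    return {
        start: {"avg_power": sum(vals) / len(vals), "max_power": max(vals)}
        for start, vals in grouped.items()
    }


# 50,000 synthetic 1 Hz power samples, as in the weekly-summary example.
raw = [(t, 200.0 + (t % 60)) for t in range(50_000)]
buckets = bucket_1min(raw)
```

Here 50,000 raw rows collapse into 834 buckets, which matches the "~800 pre-aggregated rows" figure quoted above.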

d) Retention policies

If the product ever wants "free tier keeps 90 days of detail", that is one SQL statement:

SELECT add_retention_policy('sensor_samples', INTERVAL '90 days');

Old chunks are dropped wholesale (a DDL operation), not row-by-row deletes — which means no bloat, no VACUUM storm.

5.3 Migration cost (honest assessment)

TimescaleDB is genuinely low-risk for this codebase:

Application code does not change. ORM queries on hypertables work identically.
Alembic migration is a single CREATE EXTENSION + two create_hypertable() calls. Existing rows are migrated in-place with migrate_data => TRUE.
The official timescale/timescaledb-ha Docker image is a drop-in for postgres:16-alpine — change one line in docker-compose.prod.yml.
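On RDS this does not carry over directly: Amazon RDS for PostgreSQL has not offered TimescaleDB among its supported extensions, so adopting it means either keeping PostgreSQL self-hosted on EC2 or moving to Timescale's own managed service.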

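Two watch-outs. First, every unique index on a hypertable, including the primary key, must contain the partitioning column, so the autoincrement id primary key on the sample tables has to be replaced by a composite key that includes timestamp before create_hypertable() will succeed. Second, foreign keys from a hypertable out to a regular table (e.g. sensor_samples.training_id → trainings.id) work as usual, but foreign keys from another table into a hypertable were unsupported for a long time and only arrived in recent TimescaleDB releases. The current schema is fine on that second point because nothing references sample rows by id, but it is worth keeping in mind for future migrations.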

5.4 Other performance wins (no Timescale needed)

These are independent improvements that compound with the above:

COPY instead of multi-row INSERT for the flush path. asyncpg's copy_records_to_table() is roughly 3–5× faster than INSERT ... VALUES (...), (...) for batches >100 rows. At 60-second flushes with 40 concurrent rides this matters.
Drop the id PK on sample tables and use (training_id, timestamp) as a composite primary key, adding data_type on sensor_samples, where several metrics share the same timestamp. The autoincrement BIGINT is dead weight: nothing references samples by id, and the PK index becomes redundant with the time-range index.
JSON columns → JSONB for extra_data and rr_intervals_ms. Today they are stored as TEXT and re-parsed in Python on every read. JSONB is a binary format that Postgres can index and query itself.
Read replicas for analytics endpoints. Fitness-analytics, weekly-summary, and history queries can be routed to a replica while writes stay on the primary. Combined with continuous aggregates, this fully isolates the read path from the live-ride write path.
EXPLAIN-driven indexing for Training queries. A partial index on trainings(user_id, started_at DESC) covering only status = 'completed' would speed up the history list endpoint without bloating writes during a ride.

5.5 What this changes at the architecture level

Adopting TimescaleDB makes the time-series story explicit instead of implicit. The architecture stays the same — RabbitMQ as the event bus, WorkoutSessionManager as the per-session brain, PostgreSQL as the system of record — but PostgreSQL becomes a proper time-series store for the workload that demands it, without forcing a second database into the stack.

6. Scalability

The current architecture comfortably handles 40 concurrent active sessions (the value of MAX_ACTIVE_SESSIONS, enforced at the start of every session-creation endpoint with a 503 system_at_capacity and a Retry-After header). That number is not a property of RabbitMQ — it is a property of one EC2 box and the in-memory _sessions dict.
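The admission check described above is small enough to express as a pure function. This is a sketch: admit is a hypothetical helper, the 30-second Retry-After value and the success response are assumptions, and only MAX_ACTIVE_SESSIONS, the 503 status, and the system_at_capacity code come from the text.

```python
MAX_ACTIVE_SESSIONS = 40   # limit quoted in the text
RETRY_AFTER_SECONDS = 30   # assumed value; the source does not state one


def admit(active_sessions: int) -> tuple[int, dict[str, str], dict[str, str]]:
    """Return (status, headers, body) for a session-creation request."""
    if active_sessions >= MAX_ACTIVE_SESSIONS:
        return (
            503,
            {"Retry-After": str(RETRY_AFTER_SECONDS)},
            {"error": "system_at_capacity"},
        )
    return 201, {}, {"status": "created"}  # assumed success shape


below_limit = admit(39)
at_limit = admit(40)
```

Rejecting at the door keeps the rides that are already running healthy, instead of degrading every session once the box is saturated.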

Here is how the system scales today, and how it can scale further without rewriting the application.

6.1 Where the limits actually live

Component	Bottleneck today	Hard limit
RabbitMQ	Memory + file descriptors for per-client queues	Tens of thousands of queues
FastAPI workers	One asyncio loop; CPU at samples × users	~hundreds of sessions/box
`_sessions` dict	In-memory state on one API process	Pinned to a single host
PostgreSQL	Write throughput on sample inserts, mitigated by batching	Vertical scaling first
PgBouncer	Pool exhaustion under burst	Tunable, low-cost to raise

6.2 Built-in scalability properties (already in place)

Total transport decoupling. Core business logic is isolated from connection drops and protocol specifics — the same _sessions[...] manager serves AMQP and WebSocket clients identically.
Per-client queues. Because every bridge gets its own ble.events.{client_id} and ble.commands.{client_id}, fanout is bounded per user — there is no global "firehose" exchange to saturate.
Buffering for free. If the API is under heavy load, samples queue up in RMQ rather than getting dropped at the wire.
Stateless API endpoints (except _sessions). Auth, athletes, templates, history, exports — none of those touch in-memory state.

6.3 The hard problem: `_sessions` is in-process

The single biggest obstacle to horizontal scaling is that WorkoutSessionManager instances live inside one Python process's memory. You cannot run two API replicas and load-balance arbitrarily, because session-affinity is required for tick scheduling, ERG commands, and the live WebSocket fan-out.

There are three complementary ways to break this constraint:

Step 1: Sticky routing (cheap)

Add a session-affinity layer at the load balancer (CloudFront does not do this; an ALB or Envoy does). Hash on user_id or on a session cookie so all WebSocket connections for one user always hit the same API replica. This unlocks N replicas immediately — N × 40 sessions — at the cost of slightly imbalanced load.
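The hashing variant of sticky routing can be illustrated with a stable hash of the user ID. This sketch shows the idea only; a real load balancer implements affinity in its own configuration (cookies or hash policies), and replica_for is a hypothetical helper.

```python
import hashlib


def replica_for(user_id: str, replica_count: int) -> int:
    """Map a user to a replica index with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % replica_count


assignments = {uid: replica_for(uid, 3) for uid in ("alice", "bob", "carol", "dave")}
```

Plain modulo hashing remaps most users whenever the replica count changes, so a consistent-hashing scheme is the usual refinement when replicas are added or removed during live rides.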

Step 2: Externalise session state (medium)

Move _sessions from a Python dict into Redis (or a dedicated "session-coordinator" service) keyed by (user_id, athlete_id). The manager itself still runs in one worker, but which worker owns a given session becomes discoverable, so any API node can route the user there. This is the same pattern as sticky routing, but enforced by the application rather than the LB, which means it survives LB swaps.
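The ownership lookup at the heart of this step is a set-if-absent operation on a shared store. The sketch below uses a plain dict as a stand-in for Redis; OwnerRegistry, the key format, and the worker names are all hypothetical, and a real deployment would also need expiry so that a crashed worker's claims do not live forever.

```python
from typing import Optional


class OwnerRegistry:
    """Dict-backed stand-in for a shared store such as Redis."""

    def __init__(self) -> None:
        self._owners: dict[str, str] = {}

    def claim(self, user_id: int, athlete_id: int, worker: str) -> str:
        """Set-if-absent: the first worker to claim a session keeps it."""
        key = f"session:{user_id}:{athlete_id}"
        return self._owners.setdefault(key, worker)

    def lookup(self, user_id: int, athlete_id: int) -> Optional[str]:
        return self._owners.get(f"session:{user_id}:{athlete_id}")

    def release(self, user_id: int, athlete_id: int) -> None:
        self._owners.pop(f"session:{user_id}:{athlete_id}", None)


registry = OwnerRegistry()
first = registry.claim(1, 7, "api-1")
second = registry.claim(1, 7, "api-2")  # loses the race, learns the real owner
```

The losing worker gets the winner's name back from claim, which is exactly the information it needs to forward the request.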

Step 3: Pull manager logic off the API hot path (deep)

Today, ticks, ERG decisions, and DB flushes all happen inside the FastAPI process. A natural next step is to spin them out as a small SessionWorker service that consumes from ble.events.{client_id} directly. The API would become a thin shell that only:

accepts HTTP/WS connections,
streams ble.live to UI,
publishes commands.

At that point RabbitMQ stops being a buffer and starts being the actual work queue — and SessionWorkers scale horizontally and independently of the API. This is the pattern that lets one cluster handle thousands of concurrent rides without redesigning the BLE bridge protocol.

6.4 Infrastructure-level scaling moves

These don't require touching application code at all, but they do require lifting the $20 budget:

Vertical first. The current EC2 is a single instance. Moving to a larger instance type (more vCPUs + RAM) is by far the cheapest way to raise the 40-session limit before any sharding is needed.
Managed broker. Replace the self-hosted RabbitMQ container with Amazon MQ for RabbitMQ — same protocol, but multi-AZ, automatic failover, and managed upgrades. The application connection string is the only change.
Managed database. Move PostgreSQL to RDS with read replicas. Reads (history, analytics, exports) go to a replica; writes (sample inserts during a ride) stay on the primary. Backups become RDS snapshots instead of cron + S3.
Multi-AZ EC2 behind ALB. Once the application is replica-safe (Step 1 or 2 above), run two EC2s in two AZs behind an Application Load Balancer. CloudFront keeps its current role; ALB sits between CloudFront and the EC2 fleet.
Auto-scaling group. Driven by CPU + active-session count, not just CPU — because asyncio CPU usage under-reports real concurrency.

6.5 Roadmap, in priority order

Vertical scale the EC2 (zero code change).
Sticky routing at ALB + multi-AZ replicas (Step 1).
Move RabbitMQ to Amazon MQ + PostgreSQL to RDS Multi-AZ.
Externalise _sessions to Redis (Step 2).
Extract SessionWorker as an independent consumer (Step 3).

Roadmap items 1–3 buy roughly 10× headroom each without touching the bridge protocol. Items 4–5 are what it takes to start serving tens of thousands of concurrent rides.

7. Architectural Trade-offs

Every system design balances trade-offs. Here is how this one stacks up.

Pros

Extreme cost efficiency. Running this entire real-time stack for under $20/month proves that Event-Driven Architecture does not require a massive cloud bill from day one.
Total decoupling. Core business logic is isolated from connection drops and protocol specifics, making it highly testable and maintainable.
Platform optimisation. No one-size-fits-all transport. Desktop gets the raw performance of direct AMQP; mobile gets the battery-friendly resilience of WebSockets.
Horizontal scalability headroom. RabbitMQ acts as a buffer, and the per-client queue topology means future workers can consume any user's stream from any node.
Strong tenancy story. Per-client RMQ accounts mean a compromised bridge cannot reach another user's data.

Cons

Mobile latency penalty. Mobile traffic takes an extra hop (Mobile → Proxy → RMQ → Backend), adding minor serialisation overhead.
State-management complexity. The WS proxy must translate WebSocket connection states into AMQP logic, ensuring graceful teardowns when mobile users drop offline.
Tracing difficulty. Debugging an end-to-end failure means checking mobile logs, proxy logs, RMQ queues, and backend logs — correlation IDs across all four are non-negotiable.
Single-host today. The in-memory _sessions dict is the price paid for current simplicity; Section 6 is the exit plan.
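The tracing point in the list above comes down to one rule: mint a correlation ID once, at the edge, and copy it unchanged at every hop. A minimal sketch follows, with a hypothetical envelope format and helper names; AMQP messages also carry a standard correlation_id property that could hold the same value.

```python
import json
import uuid
from typing import Optional


def new_envelope(payload: dict, correlation_id: Optional[str] = None) -> dict:
    """Wrap a payload; reuse the caller's correlation ID or mint a new one."""
    return {"correlation_id": correlation_id or str(uuid.uuid4()), "payload": payload}


def log_line(hop: str, envelope: dict) -> str:
    """One structured log line per hop, always carrying the correlation ID."""
    return json.dumps({"hop": hop, "correlation_id": envelope["correlation_id"]})


# The ID is minted once on the device and copied at every hop, never regenerated.
at_mobile = new_envelope({"hr": 142})
at_proxy = new_envelope(at_mobile["payload"], at_mobile["correlation_id"])
at_backend = new_envelope(at_proxy["payload"], at_proxy["correlation_id"])
lines = [
    log_line(hop, env)
    for hop, env in (("mobile", at_mobile), ("proxy", at_proxy), ("backend", at_backend))
]
```

With every log line keyed by the same ID, following one sample across the mobile, proxy, and backend logs becomes a single search.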

What I learned

Building this architecture in a compressed 3-week timeline, under a strict financial constraint, forced me to be incredibly pragmatic. It proved that separating the transport protocol from the event bus early prevents massive rewrites down the line. If I were starting over on day one, I would introduce an internal correlation ID middleware from the first commit to streamline log tracing across the WebSocket translation layer. However, keeping the system stateful, relying on Docker Compose, and embracing the single-box paradigm allowed me to iterate quickly and build a solid foundation for the cycling analytics core, all for the price of a few cups of coffee.

This document describes the system at a high level.