Retrospective: Building a Seamless Omnichannel IoT Architecture with RabbitMQ (A 3-Week POC)
Table of Contents
- 1. The Architecture at a Glance
- 2. AWS Infrastructure Layer
- 3. Testing Strategy
- 4. Security
- 5. Performance Optimisations
- 6. Scalability
- 7. Architectural Trade-offs
- What I learned
When building a modern application that bridges local Bluetooth Low Energy (BLE) hardware with a cloud backend, one of the biggest challenges is handling wildly different client environments. Desktop apps have horsepower and stable network connections, while mobile apps face strict battery constraints, backgrounding limitations, and frequent network switches.
Over the last three weeks, I set out to build a Proof of Concept (POC) to solve this, with a hard infrastructure budget of just $20 USD per month. I designed an Event-Driven Architecture (EDA) centered around a RabbitMQ message bus. By layering a Backend-for-Frontend (BFF) pattern on top of it for mobile, I created a decoupled, transport-agnostic system that enables an omnichannel user experience without breaking the bank.
Here is a retrospective deep dive into what I built during this sprint: who talks to whom, what worked, and where my cost-driven trade-offs live.
1. The Architecture at a Glance
My system features three distinct transport paths, all terminating in RabbitMQ as the common backbone:
┌─────────────────────────────────────────────────────────────────────────┐
│ Desktop bridge (Python, wattlog_bridge) │
│ BLE ──→ amqps://<broker>:<port> ──→ ble.events.{client_id} │
│ BLE ←── amqps://<broker>:<port> ←── ble.commands.{client_id} │
│ DIRECTLY to RabbitMQ — TCP + TLS over AMQPS │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────┐
│ RabbitMQ │ ← single broker, per-env queues
│ (prod) │ ble.events.{X}, ble.commands.{X}
└──────────────────┘
▲
│
┌─────────────────────────────────────────────────────────────────────────┐
│ Mobile bridge (Android Java / iOS Swift / fallback BleBridgeLoop.ts) │
│ BLE ──→ WebSocket /ws/bridge ──→ [backend forwards to RMQ] │
│ BLE ←── WebSocket /ws/bridge ←── [backend reads from RMQ commands] │
│ VIA BACKEND as a proxy — does NOT connect to RMQ directly │
└─────────────────────────────────────────────────────────────────────────┘
▲
│
┌─────────────────────────────────────────────────────────────────────────┐
│ Backend (WorkoutSessionManager) │
│ Consumes ble.events.{client_id} → live UI │
│ Publishes to ble.commands.{client_id} → ERG/slope/scan │
└─────────────────────────────────────────────────────────────────────────┘
Why mobile doesn't connect directly to RabbitMQ
While the desktop bridge connects natively via AMQP, mobile devices route their traffic through a WebSocket Protocol Translation Gateway on the backend. I made this choice for three reasons:
- Client stability. iOS and Android lack stable, native AMQP clients comparable to Python's aio-pika. RMQ-over-WebSocket via the STOMP plugin exists, but pushes unnecessary complexity into the broker layer.
- Unified authentication. RabbitMQ requires per-client credentials. Injecting and rotating those securely on mobile apps is risky. A WebSocket proxy lets the mobile app reuse the same JWT as the rest of our REST API.
- Battery and network optimisation. A single TLS WebSocket connection survives mobile OS sleep states and network transitions much better than AMQP heartbeats.
Logically, the mobile client sees the same RabbitMQ queues as the desktop client; the transport is simply WebSockets instead of AMQP.
Achieving true omnichannel capabilities
The most powerful aspect of this design is that the core business logic
(WorkoutSessionManager) is completely transport-agnostic. The backend
only listens to ble.events.{client_id} — it does not care if the bytes
arrived via AMQP from a laptop or via WebSocket from an iPhone.
A user can start a workout on their desktop, pause, and seamlessly resume on mobile mid-workout. The backend keeps reading the unbroken flow of messages from the same queue, blind to the fact that the underlying hardware and transport protocol just changed.
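The transport-agnostic idea can be sketched in a few lines. This is a minimal illustration, not the project's code: `InMemoryBus`, `on_amqp_delivery`, `on_websocket_frame`, and `consume` are hypothetical names, and a plain dict stands in for the broker. Only the `ble.events.{client_id}` naming comes from the diagram above.

```python
from dataclasses import dataclass, field


def events_queue(client_id: str) -> str:
    """Queue name used in the diagram above: ble.events.{client_id}."""
    return f"ble.events.{client_id}"


@dataclass
class InMemoryBus:
    """Hypothetical stand-in for the RabbitMQ broker: queue name -> messages."""
    queues: dict[str, list[bytes]] = field(default_factory=dict)

    def publish(self, queue: str, body: bytes) -> None:
        self.queues.setdefault(queue, []).append(body)


def on_amqp_delivery(bus: InMemoryBus, client_id: str, body: bytes) -> None:
    # Desktop path: the bridge publishes to its own queue over AMQPS.
    bus.publish(events_queue(client_id), body)


def on_websocket_frame(bus: InMemoryBus, client_id: str, frame: bytes) -> None:
    # Mobile path: the backend gateway forwards the WebSocket frame unchanged.
    bus.publish(events_queue(client_id), frame)


def consume(bus: InMemoryBus, client_id: str) -> list[bytes]:
    # The session manager only reads the queue; it never sees the transport.
    return list(bus.queues.get(events_queue(client_id), []))
```

Both entry points end in the same `publish` call, which is the whole point: whatever consumes the queue cannot tell which path a message took.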
2. AWS Infrastructure Layer
The whole platform is provisioned with Terraform on AWS. I deliberately chose a single-VPC, single-EC2 layout: small enough to be cheap, big enough to scale vertically before I ever need to split it.
┌─────────────────────────────┐
Users ─────► │ CloudFront (TLS edge) │ ◄── ACM cert (us-east-1)
│ /api/*, /ws/*, SPA assets │
└────────────┬────────────────┘
│ HTTP :8080 (CF prefix list)
▼
┌──────────────────────────────────────────────┐
│ EC2 (Ubuntu, Docker Compose) │
│ ┌────────────────────────────────────────┐ │
│ │ nginx — :80 :443 :8080 :5671 :5673 │ │
│ ├────────────────────────────────────────┤ │
│ │ FastAPI api (prod) │ │
│ ├────────────────────────────────────────┤ │
│ │ RabbitMQ │ │
│ ├────────────────────────────────────────┤ │
│ │ PostgreSQL + PgBouncer │ │
│ └────────────────────────────────────────┘ │
│ Elastic IP, gp3 encrypted root volume │
└──────────────────────────────────────────────┘
│
┌───────────────────────────┼───────────────────────────┐
▼ ▼ ▼
┌─────────────┐ ┌────────────────┐ ┌──────────────────┐
│ Route 53 │ │ Secrets Manager│ │ S3 (releases + │
│ DNS + ACM │ │ JWT/DB/RMQ pwd │ │ DB backups) │
└─────────────┘ └────────────────┘ └──────────────────┘
│
▼
┌─────────────────┐
│ SES (e-mail) │
└─────────────────┘
What each service does
| Service | Role |
|---|---|
| CloudFront | TLS edge, geo-distributed cache, SPA delivery, forwards /api/* and /ws/* to the EC2 origin. |
| ACM | Public TLS certificates (CloudFront cert pinned to us-east-1). |
| Route 53 | DNS + ACM DNS-01 validation. |
| EC2 + Docker | Single instance running nginx + FastAPI + RabbitMQ + PostgreSQL + PgBouncer via Docker Compose. |
| Secrets Manager | Stores JWT_SECRET, POSTGRES_PASSWORD, RABBITMQ_PASSWORD. Pulled at boot by the app. |
| IAM | EC2 instance profile with least-privilege policies for Secrets, SES, S3 (releases + backups), R53. |
| S3 | Signed downloads of desktop installer + nightly PostgreSQL dumps. |
| SES | Transactional e-mail (verification, password reset, waitlist). |
Why I kept it on a single EC2
For a project at this stage (hundreds of concurrent rides, not millions), the operational simplicity of one box wins. There is no cross-AZ chatter, no inter-service latency, and PostgreSQL → PgBouncer → API talks over local Docker networking. The Terraform module is small enough that any engineer can read the whole infrastructure in one sitting.
The trade-off is obvious: no high availability. Section 6 covers how I plan to evolve out of that without rewriting the app.
In a POC phase, engineering velocity and iteration speed are more valuable than five-nines of availability.
3. Testing Strategy
A distributed system with three transports (AMQP, WebSocket, HTTP), two client platforms, and stateful in-memory sessions is impossible to verify with unit tests alone. I layered testing into four tiers.
3.1 Unit tests: pure business logic
The interval engine, FTP detection, zone resolver, HRV analysis, decoupling
metric, etc. are all isolated from I/O. They run in milliseconds and form
the bulk of the suite (tests/test_*.py). Anything in services/ that
does math or rules lives here.
3.2 Integration tests: broker + DB in-process
WorkoutSessionManager is tested against an in-process aio-pika robust
connection and a temporary SQLite/PostgreSQL session factory. This proves
the engine + RMQ publisher + DB writer wiring works without needing a full
HTTP stack.
3.3 E2E scenario tests: real broker, real API
tests/e2e/test_scenarios.py boots the actual FastAPI server, RabbitMQ,
and a rmq_bridge_sim subprocess pretending to be a desktop bridge. Each
test walks a full ride:
- Register + log in.
- Create athlete + register a BLE bridge client.
- Spawn the simulator subprocess with that client_id.
- Import a .zwo workout file.
- Start a session, watch /ws/live for samples.
- Stop and save, then assert sample rows landed in PostgreSQL.
3.4 Stress tests: N concurrent simulated users
tests/e2e/test_stress.py runs the same flow in parallel across four tiers:
| Tier | Concurrent users | Used for |
|---|---|---|
| smoke | 5 | CI sanity check |
| basic | 10 | Local dev validation |
| load | 100 | Pre-release performance gate |
| spike | 500 | Capacity planning experiments |
A semaphore caps simultaneous simulator spawns at 20 to avoid drowning the event loop. The harness collects p50 / p95 / p99 latency per stage, total throughput, and first error per stage — which is how the admission control limit (40 concurrent sessions) was originally calibrated.
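A rough sketch of how such a harness can cap spawns and report percentiles, using only the standard library. The names `run_users` and `fake_ride` are hypothetical, and `fake_ride` is a placeholder for the real register/start/stream flow; the cap of 20 mirrors the figure above.

```python
import asyncio
import statistics
import time


async def run_users(n_users: int, user_flow, max_spawns: int = 20) -> dict[str, float]:
    """Run n_users copies of user_flow, at most max_spawns at a time."""
    gate = asyncio.Semaphore(max_spawns)
    latencies: list[float] = []

    async def one_user(i: int) -> None:
        async with gate:  # caps simultaneous simulated users
            started = time.perf_counter()
            await user_flow(i)
            latencies.append(time.perf_counter() - started)

    await asyncio.gather(*(one_user(i) for i in range(n_users)))
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


async def fake_ride(i: int) -> None:
    # Hypothetical placeholder for "register, start session, stream samples".
    await asyncio.sleep(0.001)
```

Calling `asyncio.run(run_users(100, fake_ride))` would correspond to the load tier. A real harness would record one latency list per stage rather than a single end-to-end number.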
3.5 What I explicitly do NOT mock
Database, broker, and the BLE bridge protocol are all real in E2E. Mocking those would defeat the purpose of testing a distributed system — the bugs I care about live precisely in the seams between transports.
4. Security
Security in an IoT system spans hardware credentials, transport, message authentication, and tenancy. Here is how each layer is handled.
4.1 Transport security
- All public traffic terminates TLS at CloudFront or at the EC2 nginx layer (TLSv1.2 / TLSv1.3 only, modern cipher list).
- The AMQP port is exposed only as AMQPS: the nginx stream {} block terminates TLS using Let's Encrypt certificates and proxies plaintext AMQP to the broker on the loopback network. The broker is never reachable in plaintext from the public internet.
- HTTP → HTTPS redirect is enforced at the CloudFront viewer policy.
4.2 Authentication
- Users: bcrypt password hashing with a dedicated thread pool (bcrypt is CPU-bound and blocks the event loop if mishandled). JWTs are signed with HS256; the secret is loaded from AWS Secrets Manager at boot, and the app refuses to start in production if the secret falls back to the dev default. While Argon2 is the more modern choice, bcrypt was sufficient for this POC, and offloading it to a thread pool was the critical architectural fix.
- Refresh tokens: stored hashed in PostgreSQL with an individually-revocable row per device.
- Rate limiting: login is rate-limited per username (not just per IP) via slowapi, so distributed credential stuffing against a single account is still throttled.
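The thread-pool offload can be sketched as follows. To keep the example dependency-free it swaps bcrypt for the standard library's `hashlib.pbkdf2_hmac`, which is also deliberately slow and CPU-bound. The pool size, iteration count, and function names are assumptions for illustration only.

```python
import asyncio
import hashlib
import hmac
import os
from concurrent.futures import ThreadPoolExecutor

# Dedicated pool so password hashing never competes with the default executor.
_HASH_POOL = ThreadPoolExecutor(max_workers=2, thread_name_prefix="pwhash")


def _hash(password: str, salt: bytes) -> bytes:
    # Stand-in for bcrypt: PBKDF2 is used here only because it ships with Python.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


async def hash_password(password: str) -> tuple[bytes, bytes]:
    salt = os.urandom(16)
    loop = asyncio.get_running_loop()
    # The event loop stays free while a worker thread does the hashing.
    digest = await loop.run_in_executor(_HASH_POOL, _hash, password, salt)
    return salt, digest


async def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    loop = asyncio.get_running_loop()
    digest = await loop.run_in_executor(_HASH_POOL, _hash, password, salt)
    return hmac.compare_digest(digest, expected)  # constant-time comparison
```

The same shape applies with a bcrypt library: wrap the blocking hash and check calls in `run_in_executor` against a small, dedicated pool so a login burst cannot stall WebSocket traffic on the same loop.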
4.3 BLE bridge authentication
This is the most subtle part. Each bridge gets a dedicated RabbitMQ account provisioned through the RabbitMQ HTTP Management API:
- User: client_id (UUID v4).
- Permissions:
  - configure → deny everything (bridge never declares topology).
  - write → ble.events.{client_id} and ble.commands.{client_id}.
  - read → ble.commands.{client_id}.
- The password is generated server-side and returned once to the bridge at registration time.
This means a compromised bridge can only publish to its own queues, and can only read its own commands — it cannot eavesdrop on or impersonate another user's hardware.
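RabbitMQ expresses permissions as three regular expressions (configure, write, read), so the rules above map onto a small payload. The sketch below only builds that payload and checks it locally. The helper names are hypothetical, the exact patterns are my reading of the list above rather than the project's actual values, and `allowed` is a local approximation of the broker's check.

```python
import re
import uuid


def bridge_permissions(client_id: str) -> dict[str, str]:
    """Body for PUT /api/permissions/{vhost}/{user} on the management API."""
    cid = str(uuid.UUID(client_id))  # raises ValueError for anything but a UUID
    return {
        "configure": "^$",  # matches only the empty string: declare nothing
        "write": f"^ble\\.(events|commands)\\.{cid}$",
        "read": f"^ble\\.commands\\.{cid}$",
    }


def allowed(pattern: str, resource: str) -> bool:
    # Local approximation of the broker's regex check, for illustration only.
    return re.search(pattern, resource) is not None
```

Anchoring every pattern with `^` and `$` matters: an unanchored pattern would also match any longer resource name that merely contains the client's queue name.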
4.4 Multi-tenant isolation in the backend
- WorkoutSessionManager instances are keyed by (user_id, athlete_id), not globally. A bug here previously caused cross-user 409s under load; it is now covered by a dedicated stress test.
- The /ws/live and /ws/bridge endpoints verify that the supplied client_id belongs to the authenticated user_id via BleClientRepository.get(client_id=..., user_id=...) before accepting the connection. Unowned client IDs return WS close code 4003.
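A minimal sketch of that ownership check, with a dict-backed repository standing in for the real one. `InMemoryBleClientRepository`, `BleClient`, and `authorize_ws` are hypothetical names; only the `get(client_id=..., user_id=...)` call shape and the 4003 close code come from the text.

```python
from dataclasses import dataclass
from typing import Optional

WS_CLOSE_NOT_OWNER = 4003  # application-defined close code named in the text


@dataclass(frozen=True)
class BleClient:
    client_id: str
    user_id: int


class InMemoryBleClientRepository:
    """Hypothetical stand-in for BleClientRepository, backed by a dict."""

    def __init__(self, clients: list[BleClient]) -> None:
        self._by_id = {c.client_id: c for c in clients}

    def get(self, *, client_id: str, user_id: int) -> Optional[BleClient]:
        client = self._by_id.get(client_id)
        # Filtering on both keys makes "exists but not yours" look like "missing".
        return client if client and client.user_id == user_id else None


def authorize_ws(repo: InMemoryBleClientRepository, client_id: str, user_id: int) -> Optional[int]:
    """Return None to accept the socket, or a close code to reject it."""
    if repo.get(client_id=client_id, user_id=user_id) is None:
        return WS_CLOSE_NOT_OWNER
    return None
```

Returning the same code for "unknown" and "not yours" avoids leaking which client IDs exist.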
4.5 Secrets handling
- Application secrets (JWT_SECRET, DB password, RMQ password) live in AWS Secrets Manager and are pulled at process startup by the API container.
- Terraform stores only placeholder strings; real values are injected out-of-band with aws secretsmanager put-secret-value. The lifecycle { ignore_changes = [secret_string] } block prevents accidental rotation via terraform apply.
- EC2 access is via an SSH key generated by Terraform (tls_private_key ed25519) and stored locally with 0600 permissions. The SSH security group is allow-listed by CIDR, never 0.0.0.0/0.
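The "refuse to start on the dev default" guard mentioned in 4.2 is small enough to sketch. The placeholder value, the environment names, and `resolve_jwt_secret` are assumptions; the real app presumably reads the secret from Secrets Manager before a check like this runs.

```python
from typing import Optional

DEV_DEFAULT_SECRET = "dev-secret-change-me"  # hypothetical placeholder value


def resolve_jwt_secret(environment: str, secret: Optional[str]) -> str:
    """Return the secret to use, or raise if production would run on the default."""
    value = secret or DEV_DEFAULT_SECRET
    if environment == "production" and value == DEV_DEFAULT_SECRET:
        # Fail fast at boot instead of signing tokens with a known key.
        raise RuntimeError("JWT_SECRET is unset or still the dev default")
    return value
```

Raising at startup turns a silent misconfiguration into a failed deploy, which is much easier to notice than forged tokens.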
4.6 Encryption in transit
Every client-facing hop is encrypted; the CloudFront → origin hop and the container-to-container hops are deliberately plaintext. The matrix below tracks each transport hop end-to-end:
| Hop | Protocol | TLS terminator | Cert source |
|---|---|---|---|
| Browser / mobile app → CloudFront | HTTPS (TLS 1.2/1.3) | CloudFront edge | ACM (us-east-1) |
| Browser → EC2 direct (bypass CF) | HTTPS (TLS 1.2/1.3) | nginx :443 | Let's Encrypt |
| Browser / mobile → WebSocket /ws/* | WSS (TLS 1.2/1.3) | CloudFront / nginx | ACM / Let's Encrypt |
| Desktop bridge → broker | AMQPS | nginx stream {} :5671 | Let's Encrypt (dedicated subdomain) |
| Mobile bridge → broker | WSS → app → AMQP loopback | CloudFront → nginx | ACM |
| CloudFront → EC2 origin | HTTP :8080 | terminated at CF; allow-listed by CloudFront managed prefix list | n/a |
| API → RabbitMQ (intra-container) | AMQP | Docker loopback network | n/a |
| API → PostgreSQL (via PgBouncer) | TCP | Docker loopback network | n/a |
| App → AWS (Secrets Manager, SES, S3) | HTTPS | AWS SDK signs + TLS | AWS managed |
Key points worth calling out explicitly:
- AMQP is never exposed in plaintext. The broker listens on :5672 only on the internal Docker network. The public-facing port is :5671, an nginx stream {} block that terminates TLS (TLS 1.2/1.3 only) with a dedicated Let's Encrypt cert before proxying the decrypted AMQP frames to the broker on loopback.
- CloudFront → origin is intentionally HTTP, not HTTPS. The "edge to origin" hop is restricted by an AWS managed prefix list (com.amazonaws.global.cloudfront.origin-facing) at the security group level, so only CloudFront edge IPs can reach :8080. The trade-off (skipping origin TLS to avoid double-termination cost) is acceptable for now because reaching :8080 from anywhere else would require spoofing a source IP inside that prefix list. The prefix list restricts who can connect, not who can observe: this traffic still travels unencrypted between the edge and the origin, which I accept for this validation phase. For production, enabling full end-to-end TLS via an Application Load Balancer or a local origin cert is on the roadmap.
- WebSocket survives TLS termination. Both /ws/live and /ws/bridge are upgraded at the public TLS edge and proxied as plaintext WebSocket to the FastAPI container on the Docker network. The connection stays open via proxy_read_timeout 3600s on both paths.
- Modern protocols only. Every nginx TLS terminator pins TLSv1.2 TLSv1.3 (TLS 1.0/1.1 and SSLv3 are disabled), and CloudFront uses TLSv1.2_2021 as the minimum security policy.
- Cert rotation is automated. Let's Encrypt renews via certbot using Route 53 DNS-01 (IAM policy certbot_dns on the EC2 role). ACM certs renew automatically. No human in the loop.
4.7 Defence in depth
- nginx terminates :443 directly, so a client that bypasses CloudFront still has to speak TLS; the only plaintext listener, :8080, is reachable solely from the CloudFront prefix list.
- Root EBS volume is encrypted at rest (encrypted = true).
- Netdata for performance monitoring is bound to 127.0.0.1 only and reached over an SSH tunnel, never exposed to the internet.
5. Performance Optimisations
A workout-tracking app is, at its core, a time-series workload. Every active ride emits roughly 1 sample per second per metric (heart rate, power, cadence, speed), so a single 90-minute session produces ~5,400 rows per metric, or roughly 21,600 rows across heart_rate_samples and sensor_samples combined if all four metrics are recorded. By my estimate, at 40 concurrent rides that adds up to ~14M rows per month. The workload is write-heavy, append-only, and queried almost exclusively by training_id + time range.
That shape is exactly what TimescaleDB was designed for. Here is the optimisation roadmap, ordered from cheapest to most invasive.
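As a back-of-the-envelope check on the data volume, the per-session row count follows directly from the sampling rate. The sketch assumes exactly one row per metric per second and four recorded metrics; the monthly session count in the usage note is an illustrative assumption, not a measured figure.

```python
SAMPLE_HZ = 1          # samples per second per metric (from the text)
METRICS = 4            # heart rate, power, cadence, speed
SESSION_MINUTES = 90


def rows_per_session(minutes: int = SESSION_MINUTES, metrics: int = METRICS) -> int:
    """Rows written for one ride, counting one row per metric per second."""
    return minutes * 60 * SAMPLE_HZ * metrics


def rows_per_month(sessions_per_month: int) -> int:
    """Monthly row volume for a given number of 90-minute sessions."""
    return sessions_per_month * rows_per_session()
```

Under these assumptions one metric contributes 5,400 rows per ride and four metrics contribute 21,600, so a volume around 14M rows per month corresponds to roughly 650 ninety-minute sessions (`rows_per_month(650)` is 14,040,000).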
5.1 What is already in place
- Server-side batching. WorkoutSessionManager buffers HR and sensor samples in memory and flushes every 60 seconds in one transaction per table (_FLUSH_INTERVAL_SECONDS = 60). That is one commit for ~60 rows instead of 60 commits of 1 row each: roughly 60× fewer commits, and correspondingly fewer WAL flushes.
- PgBouncer. Transaction pooling sits between FastAPI and PostgreSQL, so the asyncpg pool can be small (DB_POOL_SIZE=5) per worker without starving connections at burst.
- Composite indexes. (training_id, timestamp) is indexed on both sample tables, so analytics queries can run as bounded index range scans rather than sequential scans.
- CASCADE deletes with passive_deletes. Training deletion no longer loads child rows into Python; Postgres deletes them in-place.
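The batching behaviour in the first bullet can be sketched independently of any database. `SampleBuffer` is a hypothetical class: the flush callback stands in for "one transaction per table", and the injected clock exists only to make the timing testable.

```python
from typing import Callable

FLUSH_INTERVAL_SECONDS = 60  # mirrors _FLUSH_INTERVAL_SECONDS in the text


class SampleBuffer:
    """Accumulates samples and hands them to `flush` as one batch per interval."""

    def __init__(self, flush: Callable[[list[tuple]], None], now: Callable[[], float]) -> None:
        self._flush = flush
        self._now = now
        self._rows: list[tuple] = []
        self._last_flush = now()

    def add(self, row: tuple) -> None:
        self._rows.append(row)
        if self._now() - self._last_flush >= FLUSH_INTERVAL_SECONDS:
            self.flush()

    def flush(self) -> None:
        if self._rows:
            batch, self._rows = self._rows, []
            self._flush(batch)  # one transaction for the whole batch
        self._last_flush = self._now()
```

The cost of this design is durability: up to one interval of samples lives only in memory, so a process crash mid-ride loses at most the last 60 seconds. A final explicit `flush()` on session stop covers the normal shutdown path.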
5.2 TimescaleDB: the natural next step
TimescaleDB is a PostgreSQL extension (not a separate database) that turns regular tables into hypertables — transparently partitioned by time. It is a drop-in replacement: existing SQL keeps working, ORM code does not change, Alembic migrations still apply.
For our workload it unlocks four wins:
a) Hypertables for sample tables
Convert heart_rate_samples and sensor_samples to hypertables
partitioned by timestamp with a 7-day chunk interval. Effects:
- INSERTs hit the smallest, hottest chunk, so index pages stay in RAM during a ride.
- Range queries on WHERE timestamp BETWEEN ... automatically prune irrelevant chunks (constraint exclusion), often skipping >95% of data without an index lookup.
- VACUUM stays cheap because old chunks are no longer written to.
SELECT create_hypertable('heart_rate_samples', 'timestamp',
chunk_time_interval => INTERVAL '7 days',
migrate_data => TRUE);
SELECT create_hypertable('sensor_samples', 'timestamp',
chunk_time_interval => INTERVAL '7 days',
migrate_data => TRUE);
b) Native compression on cold chunks
Chunks older than ~14 days are read-only history. TimescaleDB can columnar-compress them in-place, typically achieving 10–20× size reduction on this kind of numeric time-series data, with queries still working transparently.
ALTER TABLE sensor_samples
SET (timescaledb.compress,
timescaledb.compress_segmentby = 'training_id, data_type',
timescaledb.compress_orderby = 'timestamp');
SELECT add_compression_policy('sensor_samples', INTERVAL '14 days');
Result: a year of historical data fits comfortably on the existing gp3 volume; backups shrink proportionally; cold queries (yearly trends, fitness analytics) run faster because they read fewer pages.
c) Continuous aggregates for analytics
Today, the fitness-analytics, weekly-summary, and decoupling endpoints all aggregate raw samples on read. That works at 40 sessions but won't at 4,000. Continuous aggregates are materialised views that TimescaleDB keeps incrementally refreshed in the background:
CREATE MATERIALIZED VIEW power_1min
WITH (timescaledb.continuous) AS
SELECT training_id,
time_bucket(INTERVAL '1 minute', timestamp) AS bucket,
AVG(value)::float AS avg_power,
MAX(value)::float AS max_power
FROM sensor_samples
WHERE data_type = 'POWER'
GROUP BY training_id, bucket;
SELECT add_continuous_aggregate_policy('power_1min',
start_offset => INTERVAL '7 days',
end_offset => INTERVAL '1 minute',
schedule_interval => INTERVAL '1 minute');
A weekly-summary query that scans 50,000 raw rows becomes a query against ~800 pre-aggregated rows — orders of magnitude faster, with no application changes beyond pointing the analytics service at the view.
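To make the effect of the view concrete, here is what the 1-minute bucketing computes, written as plain Python over in-memory rows. This only illustrates the aggregation semantics and is not how TimescaleDB executes it; the function name and row shape are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def power_1min(
    samples: list[tuple[int, datetime, float]],
) -> dict[tuple[int, datetime], tuple[float, float]]:
    """Group (training_id, timestamp, watts) rows into 1-minute (avg, max) buckets."""
    buckets: dict[tuple[int, datetime], list[float]] = defaultdict(list)
    for training_id, ts, watts in samples:
        bucket = ts.replace(second=0, microsecond=0)  # floor to the minute
        buckets[(training_id, bucket)].append(watts)
    return {key: (sum(v) / len(v), max(v)) for key, v in buckets.items()}
```

Sixty 1 Hz rows collapse into one bucket row, which matches the reduction from 50,000 raw rows to about 800 aggregated rows described above.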
d) Retention policies
If the product ever wants "free tier keeps 90 days of detail", that is one SQL statement:
SELECT add_retention_policy('sensor_samples', INTERVAL '90 days');
Old chunks are dropped wholesale (a DDL operation), not row-by-row deletes — which means no bloat, no VACUUM storm.
5.3 Migration cost (honest assessment)
TimescaleDB is genuinely low-risk for this codebase:
- Application code does not change. ORM queries on hypertables work identically.
- Alembic migration is a single CREATE EXTENSION + two create_hypertable() calls. Existing rows are migrated in-place with migrate_data => TRUE.
- The official timescale/timescaledb-ha Docker image is a drop-in for postgres:16-alpine: change one line in docker-compose.prod.yml.
- Managed hosting needs care. As far as I know, Amazon RDS for PostgreSQL does not offer the TimescaleDB extension, so adopting it means either staying self-hosted (with shared_preload_libraries = timescaledb set) or using a managed Timescale service instead of RDS.
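The one watch-out is constraints. Foreign keys out of a hypertable into a regular table (e.g. sensor_samples.device_id → devices.id) are supported, but foreign keys from a regular table into a hypertable were unsupported in older TimescaleDB releases, and every unique index or primary key on a hypertable must include the partitioning column. The current schema is largely fine because nothing references individual sample rows, but the autoincrement id primary key on the sample tables would need to become a composite key that includes timestamp. That is worth keeping in mind for future migrations.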
5.4 Other performance wins (no Timescale needed)
These are independent improvements that compound with the above:
- COPY instead of multi-row INSERT for the flush path. asyncpg's copy_records_to_table() is roughly 3–5× faster than INSERT ... VALUES (...), (...) for batches >100 rows. At 60-second flushes with 40 concurrent rides this matters.
- Drop the id PK on sample tables and use (training_id, timestamp) as a composite primary key (plus data_type on sensor_samples, where several metrics share a timestamp). The autoincrement BIGINT is dead weight: nothing references samples by id, and the PK index becomes redundant with the time-range index.
- JSON columns → JSONB for extra_data and rr_intervals_ms. Today they are stored as TEXT and re-parsed in Python on every read. JSONB is a binary format that Postgres can index and query directly.
- Read replicas for analytics endpoints. Fitness-analytics, weekly-summary, and history queries can be routed to a replica while writes stay on the primary. Combined with continuous aggregates, this fully isolates the read path from the live-ride write path.
- EXPLAIN-driven indexing for Training queries. A partial index on trainings(user_id, started_at DESC) covering only status = 'completed' would speed up the history list endpoint without bloating writes during a ride.
5.5 What this changes at the architecture level
Adopting TimescaleDB makes the time-series story explicit instead of implicit. The architecture stays the same — RabbitMQ as the event bus, WorkoutSessionManager as the per-session brain, PostgreSQL as the system of record — but PostgreSQL becomes a proper time-series store for the workload that demands it, without forcing a second database into the stack.
6. Scalability
The current architecture comfortably handles 40 concurrent active
sessions (the value of MAX_ACTIVE_SESSIONS, enforced at the start of
every session-creation endpoint with a 503 system_at_capacity and a
Retry-After header). That number is not a property of RabbitMQ — it is
a property of one EC2 box and the in-memory _sessions dict.
Here is how the system scales today, and how it can scale further without rewriting the application.
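The admission-control check described above reduces to a single comparison. This sketch returns a plain (status, headers, body) tuple rather than a framework response. The Retry-After value and the success payload are assumptions, since the text does not state them.

```python
MAX_ACTIVE_SESSIONS = 40
RETRY_AFTER_SECONDS = 30  # hypothetical value; the text does not state one


def admit(active_sessions: int) -> tuple[int, dict[str, str], dict[str, str]]:
    """Return (status, headers, body) for a session-creation request."""
    if active_sessions >= MAX_ACTIVE_SESSIONS:
        return (
            503,
            {"Retry-After": str(RETRY_AFTER_SECONDS)},
            {"error": "system_at_capacity"},
        )
    return 201, {}, {"status": "created"}
```

Rejecting at the door keeps the sessions already running healthy, instead of letting one extra ride degrade all forty.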
6.1 Where the limits actually live
| Component | Bottleneck today | Hard limit |
|---|---|---|
| RabbitMQ | Memory + file descriptors for per-client queues | Tens of thousands of queues |
| FastAPI workers | One asyncio loop; CPU at samples × users | ~hundreds of sessions/box |
| _sessions dict | In-memory state on one API process | Pinned to a single host |
| PostgreSQL | Write throughput on sample inserts, mitigated by batching | Vertical scaling first |
| PgBouncer | Pool exhaustion under burst | Tunable, low-cost to raise |
6.2 Built-in scalability properties (already in place)
- Total transport decoupling. Core business logic is isolated from connection drops and protocol specifics: the same _sessions[...] manager serves AMQP and WebSocket clients identically.
- Per-client queues. Because every bridge gets its own ble.events.{client_id} and ble.commands.{client_id}, fanout is bounded per user; there is no global "firehose" exchange to saturate.
- Buffering for free. If the API is under heavy load, samples queue up in RMQ rather than getting dropped at the wire.
- Stateless API endpoints (except _sessions). Auth, athletes, templates, history, exports: none of those touch in-memory state.
6.3 The hard problem: _sessions is in-process
The single biggest obstacle to horizontal scaling is that
WorkoutSessionManager instances live inside one Python process's
memory. You cannot run two API replicas and load-balance arbitrarily,
because session-affinity is required for tick scheduling, ERG commands,
and the live WebSocket fan-out.
There are three complementary ways to break this constraint:
Step 1: Sticky routing (cheap)
Add a session-affinity layer at the load balancer (CloudFront does not
do this; an ALB or Envoy does). Hash on user_id or on a session cookie
so all WebSocket connections for one user always hit the same API
replica. This unlocks N replicas immediately — N × 40 sessions — at the
cost of slightly imbalanced load.
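In practice the load balancer does this hashing, but the idea fits in a few lines. The sketch uses simple modulo hashing over a hypothetical replica list; `pick_replica` and the replica names are illustrative.

```python
import hashlib


def pick_replica(user_id: str, replicas: list[str]) -> str:
    """Map a user to one replica deterministically (simple modulo hashing)."""
    # hashlib instead of built-in hash(): str hashes are salted per process,
    # so hash() would route the same user differently on each node.
    digest = hashlib.sha256(user_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]
```

Modulo hashing remaps most users whenever the replica count changes, which would strand in-memory sessions during a scale-out. Consistent hashing, or the registry approach in Step 2, reduces that churn.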
Step 2: Externalise session state (medium)
Move _sessions from a Python dict into Redis (or a dedicated
"session-coordinator" service) keyed by (user_id, athlete_id). The
manager itself still runs in one worker — but which worker becomes
discoverable, so any API node can route the user there. This is the same
pattern as sticky routing, but enforced by the application rather than
the LB, which means it survives LB swaps.
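A sketch of the discovery part of this step, with a dict standing in for Redis. `SessionRegistry` and its methods are hypothetical; in Redis the `claim` operation would map onto a set-if-absent write, so that two nodes racing for the same session agree on one owner.

```python
from typing import Optional

SessionKey = tuple[int, int]  # (user_id, athlete_id)


class SessionRegistry:
    """In-memory stand-in for a shared store such as Redis."""

    def __init__(self) -> None:
        self._owner: dict[SessionKey, str] = {}

    def claim(self, key: SessionKey, worker: str) -> str:
        # Set-if-absent: the first worker to claim a session keeps it.
        return self._owner.setdefault(key, worker)

    def lookup(self, key: SessionKey) -> Optional[str]:
        return self._owner.get(key)

    def release(self, key: SessionKey, worker: str) -> None:
        # Only the owning worker may release, so a stale node cannot evict it.
        if self._owner.get(key) == worker:
            del self._owner[key]
```

Any API node can then call `lookup` and forward the request to the owning worker. A production version would also need expiry on each entry, so that a crashed worker does not hold its sessions forever.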
Step 3: Pull manager logic off the API hot path (deep)
Today, ticks, ERG decisions, and DB flushes all happen inside the
FastAPI process. A natural next step is to spin them out as a small
SessionWorker service that consumes from ble.events.{client_id}
directly. The API would become a thin shell that only:
- accepts HTTP/WS connections,
- streams ble.live to UI,
- publishes commands.
At that point RabbitMQ stops being a buffer and starts being the actual work queue — and SessionWorkers scale horizontally and independently of the API. This is the pattern that lets one cluster handle thousands of concurrent rides without redesigning the BLE bridge protocol.
6.4 Infrastructure-level scaling moves
These don't require touching application code at all, but they do require lifting the $20 budget:
- Vertical first. The current EC2 is a single instance. Moving to a larger instance type (more vCPUs + RAM) is by far the cheapest way to raise the 40-session limit before any sharding is needed.
- Managed broker. Replace the self-hosted RabbitMQ container with Amazon MQ for RabbitMQ — same protocol, but multi-AZ, automatic failover, and managed upgrades. The application connection string is the only change.
- Managed database. Move PostgreSQL to RDS with read replicas. Reads (history, analytics, exports) go to a replica; writes (sample inserts during a ride) stay on the primary. Backups become RDS snapshots instead of cron + S3.
- Multi-AZ EC2 behind ALB. Once the application is replica-safe (Step 1 or 2 above), run two EC2s in two AZs behind an Application Load Balancer. CloudFront keeps its current role; ALB sits between CloudFront and the EC2 fleet.
- Auto-scaling group. Driven by CPU + active-session count, not just CPU — because asyncio CPU usage under-reports real concurrency.
6.5 Roadmap, in priority order
1. Vertical scale the EC2 (zero code change).
2. Sticky routing at ALB + multi-AZ replicas (Step 1).
3. Move RabbitMQ to Amazon MQ + PostgreSQL to RDS Multi-AZ.
4. Externalise _sessions to Redis (Step 2).
5. Extract SessionWorker as an independent consumer (Step 3).
By my rough estimate, roadmap items 1–3 each buy about 10× headroom without touching the bridge protocol. Items 4–5 are what it takes to serve tens of thousands of concurrent rides.
7. Architectural Trade-offs
Every system design balances trade-offs. Here is how this one stacks up.
Pros
- Extreme cost efficiency. Running this entire real-time stack for under $20/month proves that Event-Driven Architecture does not require a massive cloud bill from day one.
- Total decoupling. Core business logic is isolated from connection drops and protocol specifics, making it highly testable and maintainable.
- Platform optimisation. No one-size-fits-all transport. Desktop gets the raw performance of direct AMQP; mobile gets the battery-friendly resilience of WebSockets.
- Horizontal scalability headroom. RabbitMQ acts as a buffer, and the per-client queue topology means future workers can consume any user's stream from any node.
- Strong tenancy story. Per-client RMQ accounts mean a compromised bridge cannot reach another user's data.
Cons
- Mobile latency penalty. Mobile traffic takes an extra hop (Mobile → Proxy → RMQ → Backend), adding minor serialisation overhead.
- State-management complexity. The WS proxy must translate WebSocket connection states into AMQP logic, ensuring graceful teardowns when mobile users drop offline.
- Tracing difficulty. Debugging an end-to-end failure means checking mobile logs, proxy logs, RMQ queues, and backend logs — correlation IDs across all four are non-negotiable.
- Single-host today. The in-memory _sessions dict is the price paid for current simplicity; Section 6 is the exit plan.
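The correlation-ID point in the list above is cheap to act on. A minimal sketch, assuming messages travel as dict envelopes: the function names and the envelope shape are hypothetical, although AMQP itself does define a correlation_id message property that could carry the value between hops.

```python
import uuid


def with_correlation_id(envelope: dict) -> dict:
    """Return a copy of the envelope that is guaranteed to carry a correlation_id."""
    stamped = dict(envelope)
    # Reuse the upstream ID if one exists so every hop logs the same value.
    stamped.setdefault("correlation_id", uuid.uuid4().hex)
    return stamped


def log_line(hop: str, envelope: dict) -> str:
    """Format one log line; grepping for the ID then reconstructs the path."""
    return f"[{envelope['correlation_id']}] {hop}: {envelope.get('type', 'unknown')}"
```

If each hop (mobile, proxy, broker consumer, backend) calls `with_correlation_id` on receipt and logs through `log_line`, a single grep for the ID follows one message end to end.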
What I learned
Building this architecture in a compressed 3-week timeline, under a strict financial constraint, forced me to be pragmatic. It showed that separating the transport protocol from the event bus early prevents massive rewrites down the line. If I were starting over on day one, I would likely introduce an internal correlation ID middleware from the first commit to streamline log tracing across the WebSocket translation layer. However, keeping the system stateful, relying on Docker Compose, and embracing the single-box paradigm allowed me to iterate quickly and build a solid foundation for the cycling analytics core, all for the price of a few cups of coffee.
This document describes the system at a high level.