System Design Lab

Ad tracking architecture changes only when constraints force it.

Start with the simplest click/impression collector. Use the scenarios for a normal evolution path, then adjust the workload to see exactly when one host, one shared database, or one partition stops being a good answer.

Normal evolution scenarios

Click left to right for the intended demo path. Each card changes the workload inputs.

Workload

These are inputs, not preset architecture stages.

Normal event rate: Clicks and impressions arriving each second before traffic spikes.
Peak multiplier: Launches, auctions, bot bursts, and retry storms compress traffic into short windows.
Reporting queries: Dashboards, advertiser reports, billing checks, and analyst queries.
Raw retention: How long raw events must stay queryable or replayable.
Hottest campaign share: Traffic owned by the largest campaign, useful for seeing hot partition risk.
Metric freshness target: How quickly counters, billing candidates, or risk signals should update.
Retry / duplicate rate: Client retries and collector retries create repeated events that need idempotency.
Billing-grade durability: Every accepted event must be replayable and auditable.
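
The inputs above combine into a few back-of-the-envelope numbers. The following is a minimal sketch of that arithmetic, not the lab's actual model: the `Workload` class, its field names, and the 500-byte average payload are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    normal_events_per_sec: float   # clicks + impressions before spikes
    peak_multiplier: float         # burst factor for launches or retry storms
    raw_retention_days: int        # how long raw events stay queryable
    hottest_campaign_share: float  # fraction of traffic on the largest campaign (0..1)
    duplicate_rate: float          # extra events caused by retries, as a fraction (0..1)
    avg_event_bytes: int = 500     # ASSUMPTION: illustrative payload size


def peak_events_per_sec(w: Workload) -> float:
    # Retries arrive at the collector too, so they count toward ingest load.
    return w.normal_events_per_sec * w.peak_multiplier * (1 + w.duplicate_rate)


def raw_storage_bytes(w: Workload) -> float:
    # Upper bound: sized at the normal rate, before any deduplication.
    seconds = w.raw_retention_days * 86_400
    return w.normal_events_per_sec * (1 + w.duplicate_rate) * seconds * w.avg_event_bytes


def hottest_key_events_per_sec(w: Workload) -> float:
    # Load that lands on one partition if the key is the campaign alone.
    return peak_events_per_sec(w) * w.hottest_campaign_share
```

Under these assumptions, 1,000 events per second with a 4x peak and a 25% duplicate rate gives 5,000 events per second at peak, and a campaign owning half the traffic puts 2,500 of those on a single key.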

Recommended shape

Single host collector

Keep ingestion, validation, and storage together while the workload fits one machine.

Current architecture path Ad events -> single collector -> shared database

Clients

Client: ad server emits click/impression events

Edge / API

Load balancer: fanout and health checks once hosts scale out

Collector service: validate, dedupe keys, and accept events
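
The collector's three jobs (validate, dedupe by key, accept) can be sketched as follows. This is an illustrative sketch only: the `AdEvent` fields, the client-generated `event_id` used as the idempotency key, and the in-memory `set` are assumptions. A real collector would likely need a bounded or expiring dedupe store shared across hosts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdEvent:
    event_id: str     # ASSUMPTION: client-generated idempotency key
    campaign_id: str
    kind: str         # "click" or "impression"
    ts_ms: int        # event time in milliseconds


class Collector:
    """Validates events and drops retries that reuse an event_id."""

    VALID_KINDS = ("click", "impression")

    def __init__(self) -> None:
        self._seen_ids = set()   # unbounded here; a sketch, not production-ready
        self.accepted = []

    def accept(self, event: AdEvent) -> str:
        if not event.event_id or event.kind not in self.VALID_KINDS:
            return "rejected"
        if event.event_id in self._seen_ids:
            return "duplicate"   # a retry of an event already accepted
        self._seen_ids.add(event.event_id)
        self.accepted.append(event)
        return "accepted"
```

Returning a distinct `"duplicate"` status lets a retrying client treat the second attempt as a success without the event being counted twice.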

Event backbone

Durable event log: buffer, replay, and decouple consumers

Partition key: campaign or bucket decides ordering and hot spots
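
The choice between a campaign key and a bucketed key can be illustrated with a small hashing sketch. The helper names, the `#` separator, and the SHA-256 hash are assumptions for illustration; this is not Kafka's built-in partitioner.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash so the same key always lands on the same partition.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def campaign_key(campaign_id: str) -> str:
    # All events of a campaign share one partition: per-campaign ordering,
    # but the largest campaign can become a hot partition.
    return campaign_id


def bucketed_key(campaign_id: str, event_id: str, buckets: int) -> str:
    # Spreads one campaign over `buckets` sub-keys: less hot-spot risk,
    # but ordering now holds only within each bucket.
    bucket = partition_for(event_id, buckets)
    return f"{campaign_id}#{bucket}"
```

With `campaign_key`, a consumer can count a campaign's events from one partition. With `bucketed_key`, it has to merge partial counts from several partitions, which is the ordering and aggregation cost paid for spreading the load.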

Processing

Stream workers: windows, dedupe, counters, and late-event policy

Storage + serving

Primary DB: simple raw store and reports while load is small

Serving stores: OLAP, billing, dashboard, and risk views

Warehouse: offline truth, retention, audit, and replay checks

Bottlenecks

Single host ingestion

Shared DB pressure

Raw storage

Hot partition

Freshness pressure

Why this changes

Decision tradeoffs

Multi host collectors

Shared database

Durable event log

Partitioning

Stream aggregation

Serving stores

Source-backed rules

These are the durable system-design claims behind the model. The exact slider thresholds are deliberately labeled as teaching assumptions.

Verified rule

Partitions scale throughput, but ordering is partition-local

Kafka topics are split into partitions spread across brokers, and consumers read events in order only within a single topic-partition; there is no ordering guarantee across partitions. This is why the lab treats the partition key as both a scaling tool and an ordering tradeoff.

Apache Kafka docs

Verified rule

Durable queues and streams decouple failure and back pressure

Queueing and streaming systems isolate producers from consumers and let components scale or fail independently. This supports the event-log step when direct writes start coupling ingestion to reports.
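
The decoupling described above can be sketched with an in-memory append-only log and consumers that each track their own offset. The class and method names are hypothetical, and the sketch omits persistence and replication entirely; it only shows why a slow consumer does not block the producer or other consumers.

```python
class EventLog:
    """Append-only log; records are never removed when a consumer reads them."""

    def __init__(self) -> None:
        self._records = []

    def append(self, record) -> int:
        self._records.append(record)
        return len(self._records) - 1   # offset of the new record

    def read(self, offset: int, max_records: int) -> list:
        return self._records[offset:offset + max_records]

    def end_offset(self) -> int:
        return len(self._records)


class Consumer:
    """Each consumer owns its offset, so it can lag or replay independently."""

    def __init__(self, log: EventLog) -> None:
        self.log = log
        self.offset = 0

    def poll(self, max_records: int = 100) -> list:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

    def lag(self) -> int:
        return self.log.end_offset() - self.offset

    def rewind(self, offset: int = 0) -> None:
        self.offset = offset   # replay from an earlier point
```

A reporting consumer that falls behind only increases its own `lag()`. Appends keep succeeding, and a billing consumer can `rewind()` to re-read the same records for an audit.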

AWS Well-Architected

Verified rule

Realtime windows need event time, watermarks, and late-event policy

Flink uses watermarks to track event-time progress, and late or out-of-order events force latency/correctness tradeoffs. This backs the streaming and freshness parts of the lab.
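
The latency/correctness tradeoff can be made concrete with a plain-Python simulation. This is not Flink's API. It assumes a watermark equal to the largest event time seen minus a fixed out-of-orderness bound, tumbling windows, and a drop policy for events that arrive after their window has closed; the class name and all parameters are illustrative.

```python
from collections import defaultdict


class TumblingCounter:
    """Counts events per event-time window and closes windows by watermark."""

    def __init__(self, window_ms: int, max_out_of_order_ms: int) -> None:
        self.window_ms = window_ms
        self.max_out_of_order_ms = max_out_of_order_ms
        self.max_event_ts = 0
        self.open_counts = defaultdict(int)  # window start -> running count
        self.closed = {}                     # window start -> final count
        self.dropped_late = 0

    @property
    def watermark(self) -> int:
        # "No event older than this is expected any more."
        return self.max_event_ts - self.max_out_of_order_ms

    def add(self, event_ts_ms: int) -> None:
        start = event_ts_ms - event_ts_ms % self.window_ms
        if start + self.window_ms <= self.watermark:
            self.dropped_late += 1           # window already final: drop policy
            return
        self.open_counts[start] += 1
        self.max_event_ts = max(self.max_event_ts, event_ts_ms)
        # Finalize every window whose end is at or behind the watermark.
        for s in sorted(self.open_counts):
            if s + self.window_ms <= self.watermark:
                self.closed[s] = self.open_counts.pop(s)
```

A larger `max_out_of_order_ms` admits more stragglers but delays every final count by the same amount, which is the freshness pressure the lab models. Dropping late events is only one possible policy; another is to emit a correction to an already-published count.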

Apache Flink docs

Verified rule

Realtime analytics is an event-stream serving problem

Realtime analytics systems derive insights from event streams soon after generation. This is why dashboard/risk/billing views should split away from the raw ingest path as read pressure grows.

ClickHouse docs

Teaching assumptions

Capacity numbers in the sliders are teaching thresholds, not production benchmarks.
Real capacity depends on payload size, batching, replication, acks, indexes, query shape, disk, network, and operational SLOs.
The lab is strongest for interview reasoning: show the simplest design first, then name the exact constraint that forces the next component.
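
The "name the exact constraint" habit can be sketched as a function that returns each added component together with the reason that forces it. Every threshold below is a made-up placeholder, in the same spirit as the slider thresholds: a teaching assumption, not a benchmark. The `Limits` class and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Limits:
    # ASSUMPTION: placeholder teaching thresholds, not measured capacities.
    single_host_events_per_sec: float = 5_000
    shared_db_report_queries_per_sec: float = 50
    single_partition_events_per_sec: float = 10_000


def next_components(
    peak_events_per_sec: float,
    report_queries_per_sec: float,
    hottest_key_events_per_sec: float,
    billing_grade: bool,
    limits: Optional[Limits] = None,
) -> List[Tuple[str, str]]:
    """Returns (component, forcing constraint) pairs; empty means stay simple."""
    limits = limits or Limits()
    steps = []
    if peak_events_per_sec > limits.single_host_events_per_sec:
        steps.append(("multi-host collectors", "peak ingest exceeds one host"))
    if billing_grade:
        steps.append(("durable event log", "accepted events must be replayable and auditable"))
    if report_queries_per_sec > limits.shared_db_report_queries_per_sec:
        steps.append(("serving stores", "report reads compete with ingest writes on the shared database"))
    if hottest_key_events_per_sec > limits.single_partition_events_per_sec:
        steps.append(("bucketed partition key", "largest campaign overloads one partition"))
    return steps
```

An empty result corresponds to the starting point of the lab: a single host collector writing to a shared database.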