System Design Lab

Ad tracking architecture changes only when constraints force it.

Start with the simplest click/impression collector. Use the scenarios for a normal evolution path, then adjust the workload to see exactly when one host, one shared database, or one partition stops being a good answer.

Normal evolution scenarios

Click left to right for the intended demo path. Each card changes the workload inputs.

Workload

These are inputs, not preset architecture stages.

Recommended shape

Single host collector

Keep ingestion, validation, and storage together while the workload fits one machine.
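As a minimal sketch of what "together" can mean here, the snippet below validates, dedupes, and stores events in one process backed by SQLite. The schema, the field names, and the `collect` helper are illustrative assumptions, not part of the lab.

```python
import sqlite3

# Hypothetical event schema, chosen only for illustration.
REQUIRED_FIELDS = ("event_id", "campaign_id", "event_type", "ts")
VALID_TYPES = {"click", "impression"}


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the single shared database that holds raw events."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "event_id TEXT PRIMARY KEY, campaign_id TEXT, event_type TEXT, ts INTEGER)"
    )
    return conn


def collect(conn: sqlite3.Connection, event: dict) -> str:
    """Validate, dedupe on event_id, and store one event in the same process."""
    if any(f not in event for f in REQUIRED_FIELDS):
        return "rejected"
    if event["event_type"] not in VALID_TYPES:
        return "rejected"
    cur = conn.execute(
        "INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?)",
        (event["event_id"], event["campaign_id"], event["event_type"], event["ts"]),
    )
    conn.commit()
    # rowcount is 0 when the primary key already existed and the insert was ignored.
    return "stored" if cur.rowcount == 1 else "duplicate"


def report(conn: sqlite3.Connection) -> list:
    """Reports read from the same table that ingestion writes to."""
    return conn.execute(
        "SELECT campaign_id, event_type, COUNT(*) FROM events "
        "GROUP BY campaign_id, event_type ORDER BY campaign_id, event_type"
    ).fetchall()


conn = open_store()
click = {"event_id": "e1", "campaign_id": "c1", "event_type": "click", "ts": 1}
results = [
    collect(conn, click),
    collect(conn, click),  # retried delivery of the same event
    collect(conn, {"event_id": "e2", "campaign_id": "c1", "event_type": "hover", "ts": 2}),
]
```

Because `report` and `collect` share one table, heavy report queries compete directly with writes, which is the coupling the later stages are meant to remove.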

Current architecture path: Ad events -> single collector -> shared database
[Diagram: whiteboard-style architecture for an ad click and impression tracking pipeline. Stages: Clients, Edge / API, Event backbone, Processing, Storage + serving.]
Clients
• Client ad server: emits click/impression events
Edge / API
• Load balancer: fanout and health checks once hosts scale out
• Collector service: validate, dedupe keys, and accept events
Event backbone
• Durable event log: buffer, replay, and decouple consumers
• Partition key: campaign or bucket decides ordering and hot spots
Processing
• Stream workers: windows, dedupe, counters, and late-event policy
Storage + serving
• Primary DB: simple raw store and reports while load is small
• Serving stores: OLAP, billing, dashboard, and risk views
• Warehouse: offline truth, retention, audit, and replay checks

Bottlenecks

Single host ingestion

Shared DB pressure

Raw storage

Hot partition

Freshness pressure

Why this changes

    Decision tradeoffs

    Multi host collectors

    Shared database

    Durable event log

    Partitioning

    Stream aggregation

    Serving stores

    Source-backed rules

    These are the durable system-design claims behind the model. The exact slider thresholds are deliberately labeled as teaching assumptions.

    Verified rule

    Partitions scale throughput, but ordering is partition-local

    Kafka topics are split into partitions spread across brokers. Consumers read events in order within a single partition, but there is no ordering guarantee across the partitions of a topic. This is why the lab treats the partition key as both a scaling tool and an ordering tradeoff.
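The tradeoff can be sketched in a few lines of Python. This is an illustration under stated assumptions: the partition count, the MD5-based hash, and the `campaign#bucket` key format are hypothetical, and the hash is not Kafka's actual partitioner.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical partition count


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (illustrative, not Kafka's)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def campaign_key(campaign_id: str) -> str:
    """Key by campaign: all events of a campaign stay ordered on one partition."""
    return campaign_id


def bucketed_key(campaign_id: str, event_id: str, buckets: int = 4) -> str:
    """Key by campaign plus bucket: spreads a hot campaign over several keys,
    at the cost of ordering across that campaign's buckets."""
    bucket = partition_for(event_id, buckets)
    return f"{campaign_id}#{bucket}"


# Skewed workload: one hot campaign produces 900 of 1000 events.
events = [("camp-hot", f"evt-{i}") for i in range(900)]
events += [(f"camp-{i}", f"evt-x{i}") for i in range(100)]

load_by_campaign = Counter(partition_for(campaign_key(c)) for c, _ in events)
load_by_bucket = Counter(partition_for(bucketed_key(c, e)) for c, e in events)

print("campaign key:", sorted(load_by_campaign.items()))
print("bucketed key:", sorted(load_by_bucket.items()))
```

With the campaign key, every event from the hot campaign lands on the same partition, so adding partitions does not relieve it. With the bucketed key, the hot campaign can spread over up to four keys, but a consumer can no longer rely on one total order for that campaign.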

    Source: Apache Kafka docs

    Verified rule

    Durable queues and streams decouple failure and back pressure

    Queueing and streaming systems isolate producers from consumers and let components scale or fail independently. This supports the event-log step when direct writes start coupling ingestion to reports.
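A toy model of that decoupling is sketched below: producers only append, and each consumer tracks its own offset, so a slow consumer accumulates lag without blocking ingestion or other consumers. The `EventLog` class and the consumer names are hypothetical, and the log lives in memory, so it stands in for a durable log without being durable itself.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Append-only in-memory log with per-consumer offsets (illustrative only)."""
    records: list = field(default_factory=list)
    offsets: dict = field(default_factory=dict)  # consumer name -> next offset

    def append(self, event: dict) -> int:
        """Producers only append; they never wait on a consumer."""
        self.records.append(event)
        return len(self.records) - 1

    def poll(self, consumer: str, max_records: int = 100) -> list:
        """Return the next batch for one consumer and advance only its offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch

    def lag(self, consumer: str) -> int:
        """Number of records this consumer has not read yet."""
        return len(self.records) - self.offsets.get(consumer, 0)

    def rewind(self, consumer: str, offset: int = 0) -> None:
        """Replay: move a consumer back so it re-reads retained records."""
        self.offsets[consumer] = offset


log = EventLog()
for i in range(5):
    log.append({"event_id": f"evt-{i}", "type": "click"})

billing = log.poll("billing")         # fast consumer reads everything
dashboard = log.poll("dashboard", 2)  # slow consumer reads only two records

print(log.lag("billing"), log.lag("dashboard"))
```

The slow dashboard consumer falls behind by three records while billing is fully caught up, and `rewind` lets either consumer replay from any retained offset after a bug fix or a rebuild.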

    Source: AWS Well-Architected

    Verified rule

    Realtime windows need event time, watermarks, and late-event policy

    Flink uses watermarks to track event-time progress, and late or out-of-order events force latency/correctness tradeoffs. This backs the streaming and freshness parts of the lab.
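The mechanics can be sketched in plain Python without the Flink API. The window size, the allowed lateness, and the `WindowedCounter` class are hypothetical; the watermark here is simply the largest event time seen minus a fixed lateness bound, which is one common strategy and an assumption of this sketch.

```python
from collections import defaultdict

WINDOW_SECONDS = 60    # hypothetical tumbling window size
ALLOWED_LATENESS = 30  # hypothetical bound on out-of-order arrival


class WindowedCounter:
    """Count events per event-time window and finalize windows by watermark."""

    def __init__(self) -> None:
        self.open_windows = defaultdict(int)  # window start -> running count
        self.closed = {}                      # window start -> final count
        self.late_events = []                 # arrived after their window closed
        self.watermark = float("-inf")

    def on_event(self, event_time: int) -> None:
        # The watermark only moves forward: max event time seen minus lateness.
        self.watermark = max(self.watermark, event_time - ALLOWED_LATENESS)
        window_start = event_time - (event_time % WINDOW_SECONDS)
        if window_start + WINDOW_SECONDS <= self.watermark:
            # Late-event policy in this sketch: set aside, do not reopen.
            self.late_events.append(event_time)
        else:
            self.open_windows[window_start] += 1
        self._close_ready_windows()

    def _close_ready_windows(self) -> None:
        # A window is final once the watermark has passed its end.
        for start in sorted(self.open_windows):
            if start + WINDOW_SECONDS <= self.watermark:
                self.closed[start] = self.open_windows.pop(start)


counter = WindowedCounter()
for t in [5, 20, 70, 50, 100, 10]:  # event times in arrival order
    counter.on_event(t)

print(counter.closed, dict(counter.open_windows), counter.late_events)
```

The event at time 50 arrives out of order but still counts, because the watermark has not yet passed the end of its window. The event at time 10 arrives after the watermark reached 70, so it is treated as late. A larger `ALLOWED_LATENESS` would accept it but delay every window result, which is the latency/correctness tradeoff.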

    Source: Apache Flink docs

    Verified rule

    Realtime analytics is an event-stream serving problem

    Realtime analytics systems derive insights from event streams soon after generation. This is why dashboard/risk/billing views should split away from the raw ingest path as read pressure grows.

    Source: ClickHouse docs

    Teaching assumptions

    • Capacity numbers in the sliders are teaching thresholds, not production benchmarks.
    • Real capacity depends on payload size, batching, replication, acks, indexes, query shape, disk, network, and operational SLOs.
    • The lab is strongest for interview reasoning: show the simplest design first, then name the exact constraint that forces the next component.
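To name a constraint concretely, a back-of-envelope estimate is often enough. The sketch below covers only raw storage growth from event rate, payload size, and replication; the example inputs (1,000 events per second, 500-byte payloads, a 2 TB disk) are assumptions for illustration, not measurements or thresholds from the lab.

```python
SECONDS_PER_DAY = 86_400


def raw_bytes_per_day(events_per_sec: float, payload_bytes: int,
                      replication: int = 1) -> float:
    """Raw event bytes written per day, ignoring indexes and compression."""
    return events_per_sec * payload_bytes * SECONDS_PER_DAY * replication


def days_until_full(disk_bytes: float, events_per_sec: float,
                    payload_bytes: int, replication: int = 1) -> float:
    """Days until a disk of the given size fills at a constant event rate."""
    return disk_bytes / raw_bytes_per_day(events_per_sec, payload_bytes, replication)


# Assumed inputs: 1,000 events/s, 500-byte payloads, one 2 TB disk.
daily = raw_bytes_per_day(1_000, 500)
runway = days_until_full(2e12, 1_000, 500)

print(f"{daily / 1e9:.1f} GB/day, disk full in about {runway:.0f} days")
```

Under these assumptions the collector writes 43.2 GB of raw events per day and fills the disk in roughly 46 days, so raw storage is the constraint to name well before CPU or network becomes one.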