URL Shortener
from Zero to Staff Engineer
Every concept explained with Why, What, When, Where, How, Drawbacks and Advantages. Built for someone who wants to think like a 30-year veteran, not just memorise answers.
How to Use This Cookbook
💡 Golden Rule: Don't memorise answers. Understand the reasoning. An interviewer can always ask a twist. If you understand WHY, you can answer any variant.
How to push to GitHub Pages: Save this file as index.html. Create a GitHub repo. Go to Settings → Pages → Source: main branch, / (root). Your cookbook will be live at https://yourusername.github.io/reponame within about 2 minutes. It also works fully offline: no internet connection is needed once the page has loaded.
Complete Mind Map
Every major concept in the URL Shortener design and how they connect. Use this as your orientation before going deep into any chapter.
Snowflake ID
Base62 encode
SKIP LOCKED
Circuit breaker
L2 Redis
L3 CDN
LFU eviction
SETNX mutex
Cassandra (read)
CQRS pattern
ClickHouse (analytics)
WAL replication
PACELC
LOCAL_QUORUM
Read-your-own-writes
Eventual consistency
Leader election (etcd)
Split brain prevention
Fencing token
Canary recovery
min.insync.replicas
Consumer groups
MirrorMaker 2
Dead letter queue
Rate limiting
Token bucket
SSRF prevention
301 vs 302
p99 latency
Synthetic testing
Chaos engineering
Predictive alerts
The 5 Questions to Answer in Every Interview
Always start by establishing numbers. DAU, reads/sec, writes/sec, storage. Every design decision follows from scale. A system for 1000 users is completely different from one for 100 million.
At each scale level, identify the bottleneck. DB at 1000 r/s. Cache miss at 10,000 r/s. Network at 100,000 r/s. Hot partition at 1M r/s. Design is about managing bottlenecks.
Choosing between consistency and availability drives 80% of your architecture decisions. For a URL shortener: availability. A stale redirect is OK. A 503 error is not. This justifies Cassandra, eventual consistency, and async replication.
URL shortener is 100:1 read-heavy. This justifies: separate read DB, heavy caching, read replicas, CDN. If it were write-heavy, the entire architecture would differ.
For every component you add, the interviewer WILL ask "what if that fails?". Think in failure modes first. Circuit breaker for pool. Cassandra RF=3 for node failure. Kafka replay for datacenter failure. GeoDNS for region failure. Design the happy path last.
Scale & Numbers
These numbers are not magic: each one is derived. Know the derivation, not just the result. An interviewer will ask "how did you get that?"
Expert move: Always establish numbers in the first 3 minutes. Say "Before I design anything, let me estimate scale." This immediately signals seniority.
Traffic Estimation
    // Full derivation: speak this out loud in the interview
    DAU                   = 100,000,000
    URL creation rate     = DAU × 0.1 = 10,000,000 / day
    Seconds in a day      = 24 × 60 × 60 = 86,400
    Write TPS             = 10,000,000 / 86,400 ≈ 115 writes/sec
    Read:Write ratio      = 100:1 (users click far more than they create)
    Read TPS              = 115 × 100 = 11,500 reads/sec
    Peak traffic (3× avg) = 345 writes/sec, 34,500 reads/sec
Storage Estimation
    // Per-record breakdown
    short_code = 8 bytes    (base62, 6-8 chars)
    long_url   = 200 bytes  (average URL length)
    user_id    = 16 bytes   (UUID = 128-bit)
    created_at = 8 bytes    (TIMESTAMPTZ = 8 bytes in Postgres)
    expires_at = 8 bytes
    is_active  = 1 byte
    metadata   = 60 bytes   (geo, IP, custom alias flag, etc)
    -------------------------
    Total      ≈ 300 bytes per row

    5-year total rows   = 10M/day × 365 × 5 = 18.25 billion rows
    Raw storage         = 18.25B × 300 B ≈ 5.5 TB
    Replication (RF=3)  = 5.5 × 3 = 16.5 TB
    Index overhead 20%  ≈ 20 TB total on disk
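The traffic and storage estimates above are plain arithmetic, so they are easy to keep in a small program and re-run when an assumption changes. The following is a minimal Java sketch; the class and method names are illustrative, and the inputs (100M DAU, 10% of users creating a URL per day, 300 bytes per row, RF=3) are the assumptions stated in this chapter.

```java
// Back-of-envelope capacity numbers from this chapter (all names are illustrative).
final class CapacityEstimate {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    // Average writes per second for a given number of URLs created per day.
    static long writesPerSecond(long urlsPerDay) {
        return urlsPerDay / SECONDS_PER_DAY;
    }

    // Total rows accumulated over the retention window.
    static long totalRows(long urlsPerDay, int years) {
        return urlsPerDay * 365L * years;
    }

    // Raw storage in decimal terabytes, before replication and indexes.
    static double rawTerabytes(long rows, int bytesPerRow) {
        return rows * (double) bytesPerRow / 1e12;
    }

    public static void main(String[] args) {
        long urlsPerDay = 100_000_000L / 10;       // 10% of DAU create a URL
        long writeTps = writesPerSecond(urlsPerDay);
        long readTps = writeTps * 100;             // 100:1 read:write ratio
        long rows = totalRows(urlsPerDay, 5);
        double rawTb = rawTerabytes(rows, 300);
        System.out.println("write TPS = " + writeTps + ", read TPS = " + readTps);
        System.out.println("rows = " + rows + ", raw TB = " + rawTb + ", with RF=3 = " + rawTb * 3);
    }
}
```

Integer division is used deliberately for the TPS figure so the result matches the rounded-down 115 writes/sec quoted in the derivation.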
URL Length: Why 6 Characters?
Characters: a-z (26) + A-Z (26) + 0-9 (10) = 62 characters. All URL-safe. No encoding needed.
Base64 uses + and / which are URL-special characters. They need percent-encoding in URLs. Base62 avoids this entirely.
base62^6 ≈ 56.8 billion combinations. base62^7 ≈ 3.5 trillion. The 18.25 billion records expected over 5 years fit within 6 characters but would occupy about a third of that space; 7 characters leaves room to spare.
Base58 (used by Bitcoin) removes visually confusing chars: 0, O, l, I. Valid choice if URLs will be typed by humans. Minor trade-off: slightly fewer combinations per character.
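To make the alphabet and capacity figures concrete, here is a minimal Base62 sketch in Java. The class name and the digit order (0-9, then A-Z, then a-z) are assumptions for illustration; any fixed ordering of the same 62 characters works as long as encoder and decoder agree.

```java
// Minimal Base62 encoder/decoder sketch. The alphabet order is an assumption.
final class Base62 {
    private static final String ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    private static final int BASE = ALPHABET.length(); // 62

    // Encodes a non-negative number as a Base62 string.
    static String encode(long value) {
        if (value < 0) throw new IllegalArgumentException("value must be non-negative");
        if (value == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.append(ALPHABET.charAt((int) (value % BASE)));
            value /= BASE;
        }
        return sb.reverse().toString();
    }

    // Decodes a Base62 string back to the number (no overflow check in this sketch).
    static long decode(String code) {
        long result = 0;
        for (char c : code.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) throw new IllegalArgumentException("not a Base62 character: " + c);
            result = result * BASE + digit;
        }
        return result;
    }

    // Number of distinct codes of exactly the given length: 62^length.
    static long capacity(int length) {
        long result = 1;
        for (int i = 0; i < length; i++) result = Math.multiplyExact(result, (long) BASE);
        return result;
    }

    public static void main(String[] args) {
        System.out.println("62^6 = " + capacity(6));
        System.out.println("62^7 = " + capacity(7));
        System.out.println("encode(123456789) = " + encode(123456789L));
    }
}
```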
Bandwidth & Latency Targets
    Write bandwidth: 115 w/s × 2 KB (request)           = 230 KB/s inbound
    Read bandwidth:  11,500 r/s × 500 B (302 response)  ≈ 5.7 MB/s outbound
    Peak read:       34,500 r/s × 500 B                 ≈ 17 MB/s outbound

    Latency targets:
      p50 redirect latency: < 10ms  (CDN hit)
      p99 redirect latency: < 50ms  (cache hit)
      p99 redirect latency: < 200ms (DB hit, worst case)

    Cache math:
      If L1 hit rate = 80%, L2 = 18%, L3 = 1.5% (99.5% combined):
      DB reads = 11,500 × 0.005 ≈ 57 reads/sec to Cassandra → trivial!
ID Generation & Hashing
The core question: how do you generate a globally unique 6-8 character short URL across multiple servers in multiple continents without collisions?
⚠️ Common wrong answer: "I'll just hash the long URL with MD5 and take the first 6 characters." This fails at scale due to the birthday paradox. Know exactly WHY it fails before proposing it.
Approach 1: MD5 Truncation (Naive, Wrong)
Hash the long URL with MD5, base62-encode the digest, and keep the first 6 characters.
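Birthday paradox: the fuller the code space, the more often a new code collides with an existing one. With 56.8B combinations and 18.25B stored URLs, about a third of the space is occupied, so roughly one in three new URLs collides on the first attempt; at 50% of space used, ~50% collide. Every collision costs another database round trip, and the retry loop has no fixed upper bound under heavy load.
If you pick randomly from N items, you expect your first collision after approximately √N picks. With base62^6 ≈ 56.8 billion combinations, the first collision is expected at about √(56.8B) ≈ 238,000 URLs. At 10M URLs/day (about 115 per second), that is roughly 35 minutes of traffic. This is catastrophic.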
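The √N rule of thumb comes from the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / 2N), which gives the probability of at least one collision after n random picks from N slots. A small Java sketch, with illustrative names, lets you plug in the numbers from this section:

```java
// Birthday-bound approximation for truncated-hash short codes (names are illustrative).
final class BirthdayBound {
    // Probability of at least one collision after n random picks from a space of the given size.
    static double collisionProbability(double n, double space) {
        return 1.0 - Math.exp(-n * (n - 1) / (2.0 * space));
    }

    public static void main(String[] args) {
        double space = Math.pow(62, 6); // 6-character base62 codes
        System.out.println("p(collision) after 237,000 URLs   = " + collisionProbability(237_000, space));
        System.out.println("p(collision) after 1,000,000 URLs = " + collisionProbability(1_000_000, space));
    }
}
```

At roughly √N picks the collision probability is already close to 40%, and it approaches certainty within the first day's traffic, which is why the retry loop cannot be treated as a rare path.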
Approach 2: Pre-Generated Pool (Recommended)
Pre-generate thousands of unique short codes and store them in a pool table. On write request: atomically pop one. Refill the pool asynchronously when it drops below threshold.
Removes all collision handling from the write path. The pool contains only unique, verified codes. Pop is O(1), no retries, no race conditions within a region.
When write latency must be ultra-low and predictable. When you can afford a background job for pre-generation. Not suitable if you need truly random-looking URLs with no pattern.
SELECT ... FOR UPDATE SKIP LOCKED is the key. Multiple workers can pop from the pool concurrently, and each skips rows already locked by others. No deadlocks, no waits, true parallelism.
O(1) write path. No collisions at runtime. Predictable latency. Easy to monitor pool health. Survives burst traffic (pool absorbs the load).
Cross-region coordination is impossible: two regions cannot share one pool without a central coordinator (which defeats the purpose). Solution: regional pools with prefixes.
    -- Pre-generation job (runs per region, independently)
    -- base62() is assumed to be a user-defined function; PostgreSQL has no built-in base62 encoder
    INSERT INTO url_pool (short_code, region, taken)
    SELECT base62(nextval('url_seq')), 'US-EAST', false
    FROM generate_series(1, 10000);

    -- Atomic pop: O(1), concurrent-safe, no deadlock
    WITH popped AS (
        SELECT short_code
        FROM url_pool
        WHERE taken = false AND region = 'US-EAST'
        LIMIT 1
        FOR UPDATE SKIP LOCKED          -- KEY: skips locked rows instantly
    )
    UPDATE url_pool
    SET taken = true
    WHERE short_code = (SELECT short_code FROM popped)
    RETURNING short_code;

    -- SKIP LOCKED explanation:
    -- Worker A locks row "abc123" → Worker B sees it locked → skips it
    -- Worker B immediately takes "def456" → no wait, no deadlock
    -- 100 concurrent workers can pop simultaneously
⚡ Regional pool problem: US-East and EU cannot share one pool. If both try to pop the same code simultaneously, we get duplicates. Solution: give each region a prefix. US=1xxxx, EU=2xxxx, Asia=3xxxx. Prefixes guarantee global uniqueness. Each region manages its own pool independently.
Approach 3: Snowflake ID (Best Fallback)
A 64-bit integer composed of: timestamp bits + region/node ID bits + sequence bits. No database needed. Each server generates IDs independently. Then base62-encode the integer to get the short code.
Uniqueness is guaranteed by construction: same timestamp + same node can never produce the same sequence number. No central coordinator. No lock. Pure mathematics.
When the pool is empty (circuit breaker fallback). When you want no-dependency ID generation. When you need IDs to be monotonically increasing (good for DB B-tree index locality).
41 bits for timestamp (69 years), 6 bits for region (64 regions), 6 bits for node (64 nodes per region), 11 bits for sequence (2048 IDs per millisecond per node). Total: 64 bits.
No coordinator. No database. No collision. Sortable by time. Good B-tree locality. Survives any infrastructure failure.
Clock skew: if a server's clock drifts backward, you can generate duplicate IDs. Mitigation: wait until clock catches up, or use NTP with tight synchronisation.
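    // Snowflake ID bit layout
    // +------------------+----------+----------+----------+
    // | 41 bits          | 6 bits   | 6 bits   | 11 bits  |
    // | timestamp (ms)   | regionId | nodeId   | sequence |
    // +------------------+----------+----------+----------+
    // 41 bits → 2^41 ms ≈ 69 years from custom EPOCH
    //  6 bits → 64 regions max
    //  6 bits → 64 nodes per region
    // 11 bits → 2048 IDs per millisecond per node
    // Note: 41 + 6 + 6 + 11 = 64, so the top timestamp bit is the sign bit of a Java long.
    //       IDs stay positive for the first 2^40 ms (about 35 years) after EPOCH.
    public class SnowflakeIdGenerator {
        private static final long EPOCH = 1700000000000L; // Nov 2023
        private final long regionId;      // 0..63
        private final long nodeId;        // 0..63
        private long sequence = 0;
        private long lastTimestamp = -1;

        public SnowflakeIdGenerator(long regionId, long nodeId) {
            if (regionId < 0 || regionId > 63 || nodeId < 0 || nodeId > 63) {
                throw new IllegalArgumentException("regionId and nodeId must fit in 6 bits");
            }
            this.regionId = regionId;
            this.nodeId = nodeId;
        }

        public synchronized long nextId() {
            long ts = currentMs();
            while (ts < lastTimestamp) {            // clock moved backwards: wait until it catches up
                ts = currentMs();
            }
            if (ts == lastTimestamp) {
                sequence = (sequence + 1) & 0x7FFL; // 11-bit mask
                if (sequence == 0) ts = waitNextMs(lastTimestamp); // sequence exhausted this ms
            } else {
                sequence = 0;
            }
            lastTimestamp = ts;
            return (ts << 23) | (regionId << 17) | (nodeId << 11) | sequence;
        }

        private static long currentMs() {
            return System.currentTimeMillis() - EPOCH;
        }

        // Spin until the clock advances past the given millisecond
        private static long waitNextMs(long last) {
            long ts = currentMs();
            while (ts <= last) ts = currentMs();
            return ts;
        }

        public String shortCode() {
            // Base62 is a helper assumed to exist elsewhere in the codebase.
            // A full 64-bit ID encodes to up to 11 base62 characters, longer than the 6-8 character pool codes.
            return Base62.encode(nextId());
        }
    }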
Circuit Breaker Pattern: Pool Empty Scenario
A circuit breaker monitors the health of a resource (here: the URL pool). When the pool is empty or the refill job is dead, it "trips" and switches the entire system to a fallback path.
Without a circuit breaker: empty pool → all writes fail → service down. With circuit breaker: empty pool → switch to Snowflake ID generation → service continues at slightly higher latency. Graceful degradation over hard failure.
CLOSED: Normal. Pool requests flow through. OPEN: Pool failed. All requests routed to Snowflake fallback. HALF-OPEN: Probe: try pool once. If succeeds, close. If fails, stay open.
Prevents cascading failures. Automatic recovery. No human intervention needed. System survives a dead refill job at 3 AM.
State management complexity. False positives can unnecessarily switch to fallback. Need tuning: how many failures trigger open state? How long before half-open probe?
Any time you have a primary path and a fallback path. Pool vs Snowflake. Redis vs DB. Primary region vs secondary region.
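The CLOSED → OPEN → HALF-OPEN transitions described above can be sketched as a small state machine. This is a minimal illustration, not a production implementation: the class name, the consecutive-failure threshold, and the injected clock are assumptions, and real systems typically use a library such as Resilience4j instead of hand-rolling this.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Minimal circuit breaker sketch: primary path (pool) with a fallback path (e.g. Snowflake).
final class PoolCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before tripping
    private final long openMillis;        // how long to stay OPEN before probing
    private final LongSupplier clock;     // injected so tests can control time
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    PoolCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    // Returns the current state, moving OPEN to HALF_OPEN once the wait has elapsed.
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    // Runs the primary unless the breaker is OPEN; any primary failure falls back.
    // Holding the lock during the call keeps the sketch simple but serialises callers.
    synchronized <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();
        }
        try {
            T result = primary.get();
            consecutiveFailures = 0;
            state = State.CLOSED;          // a successful probe closes the breaker
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;        // trip (or re-trip after a failed probe)
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }
}
```

A caller would wrap the pool pop as the primary and Snowflake generation as the fallback, for example `breaker.call(() -> pool.pop(), () -> snowflake.shortCode())`, where `pool` and `snowflake` are hypothetical collaborators.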
    // Predictive monitoring: Staff-level signal
    // Don't alert when pool = 0 (already broken)
    // Alert when pool will hit 0 in 10 minutes
    double consumptionRate = getConsumedPerSecond();      // e.g. 115/s
    long remaining = getPoolRemaining();                  // e.g. 50,000
    double timeToEmpty = remaining / consumptionRate;     // seconds

    if (timeToEmpty < 600) {                              // < 10 minutes
        alertOncall("Pool empties in " + timeToEmpty + "s, refill NOW");
    }
    if (timeToEmpty < 120) {                              // < 2 minutes: emergency
        circuitBreaker.trip();                            // switch to Snowflake immediately
    }
Caching β All Layers
Caching is what makes an 11,500 reads/sec system survive with only 57 database reads/sec. The three layers work together: each catches what the previous missed.
Key insight: The goal is 99%+ cache hit rate. At 11,500 r/s with 99.5% hit rate, only 57 r/s reach Cassandra. This is the entire reason caching exists: to make the database irrelevant for 99% of traffic.
L1: Caffeine (In-Process Cache)
A Java in-memory cache running inside the same JVM as your application. No network call. Sub-millisecond access. Uses the W-TinyLFU algorithm for near-optimal eviction.
Guava uses LRU (Least Recently Used). Caffeine uses W-TinyLFU, a window-based tiny least-frequently-used algorithm. W-TinyLFU achieves near-optimal hit rate for skewed (Zipf) distributions like URL access patterns. 10-30% better hit rate than pure LRU.
Uses a Count-Min Sketch (probabilistic frequency counter, O(1) space) to estimate how often each key is accessed. Items admitted to main cache only if their frequency exceeds the victim being evicted. A "window" cache handles newly popular items before they have frequency history.
Any read-heavy Java application. Data that fits in heap memory (10k to 500k entries). When network round-trip to Redis would dominate latency.
Zero network overhead. Sub-microsecond access. No external dependency. Survives Redis outage. Best hit rate for skewed access patterns.
Per-JVM cache β 10 instances = 10 separate caches. Cache staleness between instances (eventual consistency). Consumes JVM heap. Lost on restart. Not shared across services.
    Cache<String, String> l1Cache = Caffeine.newBuilder()
        .maximumSize(10_000)                      // top 10k URLs in memory
        .expireAfterWrite(5, TimeUnit.MINUTES)    // staleness limit
        .recordStats()                            // enables hitRate() metric
        .build();

    // Access: no network call on an L1 hit
    String longUrl = l1Cache.getIfPresent(shortCode);
    if (longUrl == null) {
        longUrl = l2Cache.get(shortCode);         // fall to Redis
        if (longUrl != null) {
            l1Cache.put(shortCode, longUrl);      // populate L1 (Caffeine rejects null values)
        }
    }

    // Count-Min Sketch: how W-TinyLFU estimates frequency
    // 4 hash functions, each maps key to a counter array cell
    // Frequency estimate = minimum of the 4 cells
    // Space: fixed (width × depth counters) regardless of number of distinct keys
    // Error: bounded by ε with probability 1-δ
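To show the frequency-estimation idea in isolation, here is a minimal Count-Min Sketch in Java. It is a teaching sketch, not Caffeine's actual implementation: the class name, the per-row hash mixing, and the counter sizes are assumptions chosen for brevity.

```java
// Minimal Count-Min Sketch: fixed memory, estimates never undercount.
final class CountMinSketch {
    private final int depth;          // number of hash rows
    private final int width;          // counters per row
    private final int[][] counters;

    CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.counters = new int[depth][width];
    }

    // Maps a key to one cell in the given row using a simple seeded mix of hashCode().
    private int cell(String key, int row) {
        int h = key.hashCode() ^ (row * 0x9E3779B9); // per-row seed
        h *= 0x85EBCA6B;                             // multiplicative mix
        h ^= h >>> 16;
        return Math.floorMod(h, width);
    }

    // Records one access to the key.
    void add(String key) {
        for (int row = 0; row < depth; row++) {
            counters[row][cell(key, row)]++;
        }
    }

    // Estimated access count: the minimum across rows limits the effect of hash collisions.
    int estimate(String key) {
        int min = Integer.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counters[row][cell(key, row)]);
        }
        return min;
    }
}
```

Because collisions can only add to a counter, the estimate is always at least the true count; taking the minimum over several rows keeps the overestimate small.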
L2: Redis Cluster (Regional Cache)
An in-memory key-value store running as a separate service, shared across all application instances in a region. Accessed via network (< 1ms within same datacenter).
Redis supports data structures (strings, hashes, sorted sets, counters), persistence, pub/sub, Lua scripting, and cluster mode. Memcached is simpler but has no persistence or advanced structures. Redis is the default choice for most production systems.
With maxmemory-policy allkeys-lfu, Redis uses LFU to decide what to evict. It maintains a frequency counter per key (a probabilistic approximation, the Morris counter). Keys with the lowest frequency are evicted first, regardless of recency. This suits a URL shortener: viral URLs stay cached even if not accessed for hours.
Shared cache across multiple application instances. Data too large for L1 heap. When you need persistence (RDB/AOF) or pub/sub. Cross-service caching.
Shared across all instances (no per-instance staleness). Large capacity (limited by RAM). Rich data structures. TTL per key. Atomic operations. Persistence options.
Network hop (1 to 5 ms). Single point of failure (mitigated by Redis Sentinel or Cluster). Memory cost. Serialisation overhead. Cluster resharding complexity.
The Thundering Herd Problem: Full Deep Dive
A popular URL's cache entry expires. At that exact moment, 10,000 simultaneous requests arrive. All miss the cache. All 10,000 hit Cassandra simultaneously. Cassandra gets overloaded and dies. System cascades to failure.
Viral URLs get massive concurrent traffic. TTL expiry is a simultaneous event. Without protection, the cache miss translates directly to a DB spike proportional to traffic volume.
SETNX = SET if Not eXists. First thread to call it "wins" and fetches from DB. All others see the lock exists and wait. When the winner populates the cache, all others read from cache. DB sees only 1 request instead of 10,000.
DB sees exactly 1 request per cache miss, regardless of concurrent traffic volume. Thundering herd completely eliminated.
Lock holder crashes mid-fetch → lock never released → all waiting threads starve. Mitigation: always set a TTL on the lock (e.g. 500ms). If the TTL expires without the cache being populated, the next thread retries.
Cache stampede, dogpile effect. Same problem, different names. Know all three terms.
    // SETNX mutex: prevents thundering herd / cache stampede
    // (redis is a Jedis client; cassandra.get() is a stand-in for the real DB lookup)
    public String getWithMutex(String shortCode) throws InterruptedException {
        // 1. Try cache first
        String cached = redis.get(shortCode);
        if (cached != null) return cached;

        String lockKey = "lock:" + shortCode;
        String lockVal = UUID.randomUUID().toString();   // unique owner ID

        // 2. Try to acquire lock (NX = only if not exists, PX = expire in 500ms)
        //    Jedis returns "OK" on success and null if the key already exists
        String acquired = redis.set(lockKey, lockVal,
            SetParams.setParams().nx().px(500));

        if (acquired != null) {
            // 3. WE got the lock: fetch from DB
            try {
                String longUrl = cassandra.get(shortCode);
                if (longUrl != null) {
                    redis.setex(shortCode, 86400, longUrl);   // cache 24h
                }
                return longUrl;
            } finally {
                // 4. Release lock ONLY if we still own it (atomic Lua script)
                redis.eval(
                    "if redis.call('get',KEYS[1])==ARGV[1] " +
                    "then return redis.call('del',KEYS[1]) " +
                    "else return 0 end",
                    List.of(lockKey), List.of(lockVal));
            }
        } else {
            // 5. SOMEONE ELSE has the lock: wait, then re-read the cache
            Thread.sleep(50);
            // May still be null if the winner has not finished; callers should retry or fall back
            return redis.get(shortCode);
        }
    }
L3: CDN Edge Cache
A distributed network of edge servers globally (CloudFront, Fastly, Cloudflare). Caches the redirect response (302 + Location header) at the edge closest to the user. Tokyo user gets served from Tokyo edge, not from your US-East datacenter.
Eliminates intercontinental latency for hot URLs. A cache hit at CDN = ~5ms instead of ~200ms (Asia to US round trip). For viral URLs with millions of clicks, CDN handles 99% of load without touching your infrastructure.
CDN cache is hard to invalidate instantly. Purging a URL from all global edge nodes takes 1 to 5 minutes. If you delete or update a URL, users may get the old redirect for minutes. This is why CDN TTL should be short (1 hour max) for mutable URLs.
Call CDN invalidation API (e.g. CloudFront CreateInvalidation) when a URL is updated or deleted. Propagates globally in 1 to 5 minutes. For instant invalidation, use short TTL instead of explicit purge.
The Three Cache Failure Modes
Cache penetration. What: Requests for short codes that don't exist in the system bypass the cache every time (cache returns null → hits DB → DB returns null → nothing to cache → every repeat request hits the DB again).
Fix: Cache negative results ("NULL" with short TTL like 60s). Or use a Bloom filter at the API gateway: if the Bloom filter says "definitely not exists", return 404 immediately without touching cache or DB.
Bloom filter guarantee: Zero false negatives (if a code was added to the filter, the filter never reports it as absent). ~1% false positives (it might say an item exists when it doesn't; this is harmless, just a cache miss).
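The Bloom filter check can be sketched in a few lines of Java using `java.util.BitSet`. This is a minimal illustration under stated assumptions: the class name, the double-hashing scheme (String.hashCode() combined with FNV-1a), and the sizes are illustrative, and a production system would more likely use a library implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Minimal Bloom filter sketch for short codes: no false negatives, some false positives.
final class ShortCodeBloomFilter {
    private final BitSet bits;
    private final int size;        // number of bits
    private final int hashCount;   // bit positions set per key

    ShortCodeBloomFilter(int size, int hashCount) {
        this.size = size;
        this.hashCount = hashCount;
        this.bits = new BitSet(size);
    }

    // Double hashing: position_i = h1 + i * h2 (mod size).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = fnv1a(key);
        return Math.floorMod(h1 + i * h2, size);
    }

    // 32-bit FNV-1a hash over the UTF-8 bytes of the key.
    private static int fnv1a(String key) {
        int hash = 0x811C9DC5;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return hash;
    }

    // Call when a short code is created.
    void add(String shortCode) {
        for (int i = 0; i < hashCount; i++) bits.set(index(shortCode, i));
    }

    // false → the code was definitely never added (safe to return 404 immediately).
    // true  → the code may exist; continue to cache and database.
    boolean mightContain(String shortCode) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(shortCode, i))) return false;
        }
        return true;
    }
}
```

Note that a plain Bloom filter does not support deletion, so deleted or expired codes still report "may exist" and simply fall through to the normal lookup path.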
Cache avalanche. What: Many cache entries expire at the same time → mass DB requests → DB overwhelmed → system down.
Cause: All entries written at the same time with the same TTL (e.g., after a service restart or cold start).
Fix: Add random jitter to TTL: TTL = baseTTL + random(0, baseTTL × 0.2). This spreads expiry events over time instead of bunching them.
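The jitter formula translates directly into code. A minimal Java sketch, with an illustrative class and method name:

```java
import java.util.concurrent.ThreadLocalRandom;

// TTL jitter sketch: spreads expiry times so entries written together do not expire together.
final class TtlJitter {
    // Returns baseTtlSeconds plus a random extra in [0, baseTtlSeconds * maxJitterFraction].
    static long jitteredTtlSeconds(long baseTtlSeconds, double maxJitterFraction) {
        long maxJitter = (long) (baseTtlSeconds * maxJitterFraction);
        if (maxJitter <= 0) {
            return baseTtlSeconds;
        }
        // nextLong(bound) is exclusive of the bound, hence the + 1
        return baseTtlSeconds + ThreadLocalRandom.current().nextLong(maxJitter + 1);
    }

    public static void main(String[] args) {
        // 24h base TTL with up to 20% jitter
        System.out.println("TTL = " + jitteredTtlSeconds(86_400, 0.2) + "s");
    }
}
```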
Cache pollution. What: A scraper or bot accesses millions of unique, low-popularity URLs once each. These fill the cache, evicting frequently-accessed popular URLs. Cache hit rate drops catastrophically.
Fix: LFU eviction (Caffeine W-TinyLFU, Redis allkeys-lfu). Items with frequency=1 (accessed once) are evicted before items with frequency=1000. Scraper traffic cannot pollute the cache with LFU.
If Redis goes down: L1 Caffeine (in-process) still serves 80% of reads with zero Redis involvement. The 20% that miss L1 fall through directly to Cassandra. At 11,500 r/s with 80% L1 hit rate, only 2,300 r/s reach Cassandra, which is well within capacity. A circuit breaker on the Redis connection pool disables the L2 path and routes directly to the DB. p99 latency increases from 20ms to 50ms. Service degrades gracefully; it does not fail. This is why the L1 in-process cache is critical: it's your shield when external dependencies die.
Database Design
Why two databases? Because read and write operations have fundamentally different requirements that cannot be optimally satisfied by a single data store at this scale.
CQRS (Command Query Responsibility Segregation): Separate your write model (commands that change state) from your read model (queries that return data). Each store is independently optimised, scaled, and tuned. This is not over-engineering at 100M DAU; it's necessary.
PostgreSQL: Write Database
A relational database with full ACID guarantees, used exclusively for the write path. All URL creation, update, and deletion goes here. Never touched by the read path.
ACID transactions ensure no duplicate short codes. UNIQUE constraint is a hard database-level guarantee. WAL (Write-Ahead Log) enables reliable async replication to Cassandra and other replicas. Rich SQL for complex write-side queries (user dashboards, bulk operations).
Atomicity: Write succeeds entirely or fails entirely. No partial writes. Consistency: UNIQUE constraint enforced at DB level. Isolation: Concurrent writes don't interfere. Durability: Committed writes survive crashes (WAL + fsync).
Doesn't scale horizontally for reads (hence Cassandra for reads). Single-region primary means intercontinental writes have higher latency. Complex sharding if writes exceed single-machine capacity.
Every data change in PostgreSQL is written to the WAL (a sequential log file) BEFORE being applied to the actual data files. This enables: crash recovery (replay WAL on restart), replication (stream WAL to replicas), and point-in-time recovery (replay WAL to any past moment). Logical decoding of the WAL lets a change-data-capture pipeline stream these changes to Cassandra and other consumers.
    -- PostgreSQL schema: write-optimised
    CREATE TABLE url_mappings (
        id           BIGINT PRIMARY KEY,             -- Snowflake ID
        short_code   VARCHAR(8) NOT NULL UNIQUE,     -- UNIQUE = DB-level guarantee
        long_url     TEXT NOT NULL,
        user_id      UUID,                           -- NULL = anonymous
        created_at   TIMESTAMPTZ DEFAULT NOW(),
        expires_at   TIMESTAMPTZ,                    -- NULL = no expiry
        is_active    BOOLEAN DEFAULT TRUE,
        custom_alias BOOLEAN DEFAULT FALSE
    );

    -- Indexes: each one has a specific purpose
    -- Note: the UNIQUE constraint on short_code already creates a unique index,
    -- so this explicit one is redundant; keep only one of the two.
    CREATE UNIQUE INDEX idx_short_code ON url_mappings(short_code);   -- fast lookup by short code
    CREATE INDEX idx_user_id ON url_mappings(user_id);                -- user's URL list
    CREATE INDEX idx_expires_active ON url_mappings(expires_at)
        WHERE expires_at IS NOT NULL;                                 -- PARTIAL INDEX: only non-null rows

    -- Partial index is smaller and faster than full index
    -- Only indexes the ~10% of rows that have an expiry date
Cassandra: Read Database
A distributed NoSQL database optimised for high-throughput key-value reads. Multi-region active-active. No single point of failure. Scales horizontally by adding nodes.
Our read pattern is simple: given short_code, return long_url. Cassandra excels at exactly this: single-key lookups at massive scale. It distributes data across nodes using consistent hashing, so any node can coordinate a read for any key. Adding more nodes increases throughput roughly linearly.
In Cassandra, the partition key determines which node stores the data. Our partition key = short_code. High cardinality (millions of unique codes) means data is spread evenly across all nodes. No hot partitions.
Cassandra maps each short_code to a token on a ring. Each node owns a range of tokens. When a node is added or removed, only the adjacent token ranges are remapped, not all data. This is why Cassandra scales without downtime.
Active-active multi-region. No primary node (any node can serve reads). Linear scalability. Tunable consistency (ONE, LOCAL_QUORUM, QUORUM). Built-in replication factor. No joins = no lock contention.
No ACID. No joins. No secondary indexes at scale. Schema must be designed for access patterns (not normalisation). Compaction can cause latency spikes. Repair jobs required for consistency.
    -- Cassandra keyspace with multi-region replication
    CREATE KEYSPACE url_shortener
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us_east': 3,     -- 3 replicas in US-East
        'eu_west': 3,     -- 3 replicas in EU-West
        'asia_pac': 3     -- 3 replicas in Asia-Pacific
    };

    -- Primary lookup table
    CREATE TABLE url_mappings (
        short_code TEXT,
        long_url   TEXT,
        created_at TIMESTAMP,
        expires_at TIMESTAMP,
        is_active  BOOLEAN,
        PRIMARY KEY (short_code)    -- short_code IS the partition key
    ) WITH compaction = {'class': 'LeveledCompactionStrategy'};

    -- LeveledCompactionStrategy (LCS) vs SizeTieredCompactionStrategy (STCS)
    -- STCS: better for write-heavy. Large SSTable merges. Higher read amplification.
    -- LCS:  better for read-heavy. Maintains sorted levels. Lower read amplification.
    -- For URL shortener (100:1 read:write) → LCS is the correct choice.
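The token ring described in this section can be illustrated with a sorted map. The following Java sketch is a simplified model, not Cassandra's implementation: the class name, the virtual-node count, and the hash function (a mixed String.hashCode() rather than Cassandra's Murmur3 partitioner) are assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified consistent-hash ring: each node owns several tokens (virtual nodes).
final class TokenRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int vnodesPerNode;

    TokenRing(int vnodesPerNode) {
        this.vnodesPerNode = vnodesPerNode;
    }

    // Places the node's tokens on the ring.
    void addNode(String node) {
        for (int v = 0; v < vnodesPerNode; v++) {
            ring.put(hash(node + "#" + v), node);
        }
    }

    // Removes all of the node's tokens; only keys it owned move to neighbours.
    void removeNode(String node) {
        ring.values().removeIf(node::equals);
    }

    // The owner is the first token at or after the key's hash, wrapping around the ring.
    String ownerOf(String shortCode) {
        if (ring.isEmpty()) throw new IllegalStateException("ring has no nodes");
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(shortCode));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // Illustrative hash; Cassandra itself uses Murmur3.
    static int hash(String s) {
        int h = s.hashCode() * 0x9E3779B9;
        return h ^ (h >>> 16);
    }
}
```

The property worth noticing is that adding a node only moves keys onto the new node; every other key keeps its previous owner, which is what makes scaling out possible without reshuffling all data.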
Hot Partition Problem: Deep Dive
A viral URL (e.g., World Cup score link) gets 10 million clicks in 60 seconds. All clicks → same Cassandra partition key → same node → node CPU at 100% → reads slow → eventually node dies → other nodes can't replicate fast enough → cascade failure.
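Add a bucket_id to the partition key. Bucket = hash(short_code) % N. Since the hash is deterministic, both read and write always go to the same bucket, so no scatter-gather is needed. This spreads different short codes across N partitions. Note that a single viral code still maps to exactly one partition, so the read load for that one code has to be absorbed by the cache layers in front of Cassandra.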
Random bucket on write, then scatter-gather on read (query all N buckets). This turns 1 read into N reads. At 10M r/min with N=100, you get 1 billion Cassandra queries per minute. Worse than the original problem.
Consistent hashing + deterministic bucket = no coordination needed. Write knows exactly which bucket. Read knows exactly which bucket. O(1) lookup, load spread across N nodes.
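A deterministic bucket is a one-line computation. The Java sketch below uses illustrative names and String.hashCode() as the hash; the bucket count N and the composite key format are assumptions for demonstration.

```java
// Deterministic bucket sketch: the same short code always maps to the same bucket.
final class PartitionBucket {
    // Bucket in [0, bucketCount); floorMod keeps the result non-negative for negative hashes.
    static int bucketFor(String shortCode, int bucketCount) {
        return Math.floorMod(shortCode.hashCode(), bucketCount);
    }

    // Illustrative composite key, mirroring a (bucket_id, short_code) partition key.
    static String partitionKey(String shortCode, int bucketCount) {
        return bucketFor(shortCode, bucketCount) + ":" + shortCode;
    }

    public static void main(String[] args) {
        System.out.println(partitionKey("abc123", 100));
        System.out.println(partitionKey("def456", 100));
    }
}
```

Because writer and reader compute the same function, neither needs a lookup table or coordination to find the bucket.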
ClickHouse: Analytics Database
A columnar OLAP (Online Analytical Processing) database designed for high-speed aggregations over large datasets. Stores data column-by-column instead of row-by-row.
Query: "total clicks on URL X per hour for the last 30 days." This reads only the click_count and click_hour columns β ignoring all other columns. Row-based DBs read entire rows even for single-column aggregations. Columnar = 10-100x faster for analytical queries.
Never in the write path. Kafka consumer reads url.clicked events → aggregates → bulk inserts into ClickHouse. Decoupled from user-facing latency. Analytics can be delayed by seconds or minutes, and that's acceptable.
BigQuery (Google's managed columnar DB) when you want zero infrastructure management. ClickHouse when you want self-hosted with more control and lower cost at scale. Both are columnar, append-only, eventual consistency.
| Database | CAP | ACID | Scale pattern | Best for | URL Shortener role |
|---|---|---|---|---|---|
| PostgreSQL | CP | Yes | Vertical + read replicas | Writes, transactions, complex queries | Write primary |
| Cassandra | AP | No | Horizontal (add nodes) | High-throughput key-value reads, multi-region | Read store |
| Redis | AP | No | Cluster sharding | Caching, rate limiting, pub/sub | L2 Cache |
| ClickHouse | AP | No | Horizontal sharding | Analytics, columnar aggregations | Analytics store |
| DynamoDB | AP/CP | Partial | Managed horizontal | Serverless key-value, managed ops | Alternative to Cassandra |
| CockroachDB | CP | Yes | Horizontal (Raft) | Geo-distributed ACID SQL | Alternative if global consistency needed |
Consistency Models
The hardest topic in distributed systems. Most candidates know CAP. Staff-level candidates know PACELC, LOCAL_QUORUM, read-your-own-writes, and how to solve each with concrete mechanisms.
CAP Theorem: What It Actually Means
Eric Brewer's theorem (2000): In the presence of a network Partition, a distributed system must choose between Consistency and Availability. You cannot guarantee both.
Network partitions are not theoretical: they happen in production regularly (switch failure, network congestion, datacenter isolation). When they happen, your design choice determines whether users see stale data or no data at all.
For the URL shortener: AP, availability over consistency. A stale redirect (301 → old URL) is a minor UX issue. A 503 Service Unavailable during a partition is catastrophic. We choose to serve potentially stale data rather than refuse requests.
"CA systems" (consistent AND available, no partition tolerance) only exist as single-node databases. Any networked distributed system MUST tolerate partitions, so the real choice is always C vs A during a partition.
| System | CAP Choice | During Partition Behaviour |
|---|---|---|
| Cassandra | AP | Returns potentially stale data. Continues to accept writes. |
| etcd / ZooKeeper | CP | Refuses reads/writes if quorum lost. Safety over availability. |
| DynamoDB | AP (tunable) | Eventually consistent by default. Strong consistency optional. |
| Spanner | CP (TrueTime) | Globally consistent using atomic clocks. Accepts higher latency. |
| PostgreSQL (single) | CA* | *Only works as a single node: no real partition tolerance. |
PACELC: The Real Model
PACELC extends CAP: "if Partition → choose Availability vs Consistency; Else (normal) → choose Latency vs Consistency." CAP only covers the partition scenario. PACELC covers the normal operation trade-off too. This is the model production engineers actually use.
    URL Shortener PACELC position:
      if Partition → choose Availability (serve stale data, don't refuse)
      else         → choose Latency (ONE consistency level, fast reads)

    Cassandra is PA/EL: Available during partition, Low-latency normally.
    Spanner is PC/EC: Consistent during partition, Consistent (higher latency) normally.
Cassandra Consistency Levels: Complete Reference
| Level | Reads from | Writes to | Latency | When to use |
|---|---|---|---|---|
| ONE | 1 replica | 1 replica | Lowest | Fast reads, stale OK. URL redirect reads. |
| LOCAL_ONE | 1 local DC replica | 1 local DC | Lowest (no cross-DC) | Regional reads only |
| QUORUM | Majority of ALL replicas | Majority global | High (cross-DC) | Strong consistency globally |
| LOCAL_QUORUM | Majority in local DC | Majority local DC | Medium (no cross-DC) | Consistent within region. URL writes. |
| ALL | Every replica | Every replica | Highest | Maximum consistency. Fragile: one node down = failure. |
| EACH_QUORUM | Quorum per DC | Quorum per DC | Highest | Global quorum. Very expensive. |
    // Quorum formula: when do you get strong consistency?
    // R + W > N   (N = RF = replication factor)
    // RF=3: QUORUM reads (R=2) + QUORUM writes (W=2) → 2+2=4 > 3 ✓ strong
    // RF=3: ONE reads (R=1) + ONE writes (W=1)       → 1+1=2 ≤ 3 → eventual

    // Our choices:
    //   Write to Cassandra:  LOCAL_QUORUM (consistent within DC, no cross-DC latency)
    //   Read from Cassandra: ONE (fastest, stale OK for redirects)
    //   ONE reads + LOCAL_QUORUM writes → 1+2=3 ≤ 3 → eventual, which is acceptable for redirects
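The R + W > N rule is simple enough to encode as a helper when reasoning about consistency-level combinations. A minimal Java sketch with illustrative names:

```java
// Quorum arithmetic sketch for a single datacenter with replication factor N.
final class QuorumMath {
    // Smallest majority of the replicas: floor(N / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Reads are guaranteed to overlap the latest write when R + W > N.
    static boolean isStronglyConsistent(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        int q = quorum(rf);
        System.out.println("QUORUM/QUORUM strong? " + isStronglyConsistent(q, q, rf));
        System.out.println("ONE/LOCAL_QUORUM strong? " + isStronglyConsistent(1, q, rf));
    }
}
```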
Read-Your-Own-Writes: The Hardest Problem
Tokyo user creates a short URL → written to US-East primary PostgreSQL → async replicated to Asia Cassandra (200ms lag). 50ms later: same user clicks the URL → Asia Cassandra → not replicated yet → returns 404. User sees their own creation fail immediately. Terrible UX.
After a write, return a token: X-Write-Region: us-east, X-Write-Ts: 1234567890. Client sends this header on next request. Gateway sees it and routes reads to US-East for 5 seconds. After 5s, Asia has the data and normal routing resumes.
If Asia Cassandra returns null: retry the read against US-East PostgreSQL (source of truth). Cache the result locally in Asia Cassandra (async repair). User gets their URL. Slight tail latency increase but correct result.
Use Spanner or CockroachDB with globally synchronous writes. Solves problem perfectly. But: 200ms write latency (US-to-Asia synchronous), 10× cost, operational complexity. Only justified if business requirement explicitly demands it.
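The sticky-routing option (the X-Write-Region and X-Write-Ts headers with a 5-second window) reduces to a small routing decision at the gateway. The Java sketch below is illustrative: the class name and method signature are assumptions, and header parsing is left out so the decision logic stands on its own.

```java
// Read-your-own-writes routing sketch: pin reads to the write region for a short window.
final class ReadRouter {
    private final String localRegion;
    private final long stickyMillis;   // e.g. 5000, chosen to exceed typical replication lag

    ReadRouter(String localRegion, long stickyMillis) {
        this.localRegion = localRegion;
        this.stickyMillis = stickyMillis;
    }

    // writeRegion / writeTsMillis come from the X-Write-Region and X-Write-Ts headers (null if absent).
    String regionFor(String writeRegion, Long writeTsMillis, long nowMillis) {
        if (writeRegion == null || writeTsMillis == null) {
            return localRegion;                       // no recent write: normal routing
        }
        boolean withinWindow = nowMillis - writeTsMillis < stickyMillis;
        return withinWindow ? writeRegion : localRegion;
    }
}
```

For example, `new ReadRouter("asia", 5000).regionFor("us-east", 1000L, 1050L)` routes to the write region, while the same call 5 seconds later routes locally again.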
Replication Modes β Sync vs Async
⚠️ Never do synchronous intercontinental replication for writes. A US to Asia round trip is ~200ms, so waiting for an ACK from all 3 DCs adds at least 200ms to every write. At 115 writes/sec that keeps about 23 writes in flight at all times, and any latency variance grows the backlog until it cascades into timeouts. Use async replication with bounded RPO via Kafka instead.
| Mode | Write latency | Data loss risk | Use case |
|---|---|---|---|
| Synchronous (all DCs) | 200-400ms | Zero | Financial transactions (Spanner) |
| Semi-synchronous (1 DC) | 5-20ms local | Low (1 DC loss max) | MySQL semi-sync, high-value writes |
| LOCAL_QUORUM | 5-10ms | Cross-DC lag only | Our choice: fast and safe within DC |
| Asynchronous | 1-5ms | RPO = replication lag | Analytics, non-critical cross-DC sync |
Kafka & Event Streaming
Kafka is not just a message queue. It is a durable, replayable, distributed commit log. This distinction is what makes it the backbone of both analytics AND disaster recovery.
Staff-level insight: The single most important reason to use Kafka in this system is not analytics; it is disaster recovery replay. Without Kafka, if US-East dies, all in-flight writes are lost permanently. With Kafka, every write event is durably stored and replayable, giving you bounded RPO (Recovery Point Objective).
Kafka vs Traditional Message Queue
| Feature | Kafka | RabbitMQ / SQS | Why it matters |
|---|---|---|---|
| Message retention | Days/weeks (configurable) | Until consumed | Kafka allows replay. Queue does not. |
| Multiple consumers | Yes: consumer groups read independently | Competing consumers only | Kafka fans out to analytics, Cassandra, ML simultaneously |
| Replay | Seek to any past offset | Impossible | Replay = disaster recovery, debugging, backfill |
| Throughput | Millions/sec per cluster (scales with partition count) | Thousands/sec | URL click volume can spike to millions |
| Ordering | Per-partition ordering | Per-queue (usually) | All clicks for one URL ordered = correct analytics |
Key Kafka Configuration: Why Each Setting Matters
acks=all: The producer waits for ALL in-sync replicas (ISR) to acknowledge the message before considering it sent. Maximum durability. If the leader dies after the acknowledgement, at least one other replica has the message (provided min.insync.replicas ≥ 2), so no acknowledged data is lost.
min.insync.replicas: Minimum number of replicas that must be in sync for a produce request with acks=all to succeed. With RF=3 and min.insync.replicas=2: if 2 replicas die, writes fail (rather than risking data loss). This is the safety floor. It is a topic/broker setting, not a producer setting.
enable.idempotence=true: Prevents duplicates caused by producer retries. Each message gets a sequence number. If a retry resends a message that was already written (for example after a network timeout), the broker deduplicates it using the sequence number. This is the building block for exactly-once semantics.
Consumer offsets: Kafka remembers where each consumer group left off (the offset). If a consumer crashes and restarts, it resumes from the last committed offset. No message is lost; messages processed after the last commit are redelivered, which is harmless with idempotent consumers. Offsets are committed to the __consumer_offsets internal topic.
Delivery semantics: At-least-once means messages are delivered one or more times, so the consumer must be idempotent (handle duplicates). Simpler to implement. Exactly-once requires an idempotent producer plus a transactional consumer. Harder, but no duplicates. For analytics, at-least-once + idempotent aggregation is fine.
Dead letter queue (DLQ): When a consumer fails to process a message after max retries (e.g., malformed event), it sends the message to a DLQ topic instead of blocking. The main consumer continues. DLQ messages are inspected manually or by a separate consumer. Never let one bad message block the entire consumer.
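The retry-then-dead-letter behaviour described above can be sketched without any Kafka dependency. In this minimal illustration the messages are plain strings, the DLQ is an in-memory list standing in for a topic such as url.clicked.dlq, and MAX_ATTEMPTS is an assumed value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch: bounded retries per message, then park the message instead of blocking. */
final class DeadLetterDemo {

    static final int MAX_ATTEMPTS = 3; // assumed retry budget

    /** Runs the handler over every message and returns the messages that were dead-lettered. */
    static List<String> consume(List<String> messages, Consumer<String> handler) {
        List<String> deadLetters = new ArrayList<>();
        for (String message : messages) {
            boolean processed = false;
            for (int attempt = 1; attempt <= MAX_ATTEMPTS && !processed; attempt++) {
                try {
                    handler.accept(message);
                    processed = true;
                } catch (RuntimeException e) {
                    // A real consumer would log the failure and back off before retrying.
                }
            }
            if (!processed) {
                deadLetters.add(message); // stand-in for producing to the DLQ topic
            }
        }
        return deadLetters;
    }
}
```

The important property is that the loop always advances: a malformed event costs at most MAX_ATTEMPTS tries and never stalls the messages behind it.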
// Producer configuration: maximum durability
import java.util.Properties;

Properties props = new Properties();
props.put("acks", "all");                  // wait for all ISR replicas
props.put("enable.idempotence", "true");   // broker deduplicates retried sends
props.put("retries", Integer.MAX_VALUE);   // keep retrying until delivery.timeout.ms expires
props.put("max.in.flight.requests.per.connection", "5"); // highest value that preserves ordering with idempotence

// Topic/broker configuration (NOT a producer property):
//   min.insync.replicas=2   → at least 2 replicas in sync, with RF=3

// Topic design
//   url.created      → partition key = short_code (ordering per URL)
//   url.clicked      → partition key = short_code (all clicks ordered per URL)
//   url.expired      → partition key = short_code
//   url.clicked.dlq  → dead letter queue for failed consumers

// Consumer groups: each independently consumes the same events
//   analytics-consumer: url.clicked → aggregates → ClickHouse
//   cassandra-updater:  url.created → writes to Cassandra read store
//   expiry-processor:   url.expired → marks inactive in PostgreSQL
//   ml-pipeline:        url.clicked → trains recommendation model
MirrorMaker 2: Cross-Region Replication
Kafka's built-in cross-cluster replication tool. Mirrors topics from source cluster (US-East) to target clusters (EU, Asia). Both analytics and DR use this.
If the US-East Kafka cluster dies, the EU Kafka cluster has a mirror of every event up to the moment of failure, minus whatever was still in flight. When US-East recovers, it can replay from the EU mirror. This bounds RPO to the MirrorMaker replication lag, typically under 1 second.
Offset translation: the same message has different offsets in source and mirror clusters. MirrorMaker 2 provides offset translation APIs, but consumers must use them correctly when switching clusters during failover.
Confluent's commercial cross-cluster replication. More features, better monitoring, easier offset management. Worth considering for production if budget allows.
Full Architecture
How all components connect. Three regions, two paths (read and write), one global DNS layer, and a durable event backbone.
Component Overview
| Component | Technology | Purpose | Failure behaviour |
|---|---|---|---|
| GeoDNS | Route53 / GSLB | Route users to nearest healthy region | Remove unhealthy region in 30-60s |
| API Gateway | Kong / Envoy / AWS APIGW | Rate limiting, auth, routing, SSL termination | Redundant instances; LB in front |
| Identity Provider | Keycloak / Auth0 / Cognito | JWT issuance and validation | Gateway caches public key; stateless validation |
| Write Service | Java Spring Boot | Pop pool, write PostgreSQL, publish Kafka | Stateless; restart in < 10s |
| Read Service | Java Spring Boot + Caffeine | L1→L2→L3→Cassandra lookup, 302 redirect | Stateless; L1 continues without L2 |
| URL Pool | PostgreSQL table | Pre-generated short codes | Circuit breaker → Snowflake fallback |
| PostgreSQL | RDS PostgreSQL / self-hosted | Source of truth for writes | Read replicas serve reads; primary auto-failover (RDS) |
| Redis Cluster | Redis 7+ Cluster mode | L2 cache, rate limiting, mutex | L1 absorbs 80% of reads; DB serves rest |
| Cassandra | Cassandra 4.x | Read-optimised URL store | RF=3; ONE consistency; 2 nodes can die |
| Kafka | Confluent / MSK / self-hosted | Event log for analytics and DR | RF=3; min.insync=2; MirrorMaker cross-region |
| ClickHouse | ClickHouse / BigQuery | Analytics queries and dashboard | Async; analytics delay acceptable |
| etcd | etcd 3.x | Leader election, distributed locks | Raft consensus; 3 nodes; 1 can die |
API Design
// Write API: create short URL
POST /api/v1/urls
Authorization: Bearer {jwt}
Content-Type: application/json

{
  "long_url": "https://example.com/very/long/path?query=value",
  "custom_alias": "my-brand",   // optional
  "expires_in": 86400           // optional: seconds
}

→ 201 Created
{
  "short_code": "abc123",
  "short_url": "https://short.ly/abc123",
  "long_url": "https://example.com/...",
  "expires_at": "2026-05-25T00:00:00Z"
}
Headers: X-Write-Region: us-east, X-Write-Ts: 1747900000

// Read API: redirect
GET /abc123
→ 302 Found
Location: https://example.com/very/long/path?query=value
X-Served-By: cache-l1   // or cache-l2, cache-l3, db

// Bulk API
POST /api/v1/urls/bulk
[{"long_url": "...", "custom_alias": "..."}, ...]   // max 1000
→ 202 Accepted (async processing)
{"batch_id": "batch-uuid-123"}

GET /api/v1/urls/bulk/{batch_id}
→ 200 OK
{"status": "completed", "results": [...]}
Write Path Deep Dive
Every step a URL creation request takes, with the exact decision and failure mode at each step.
⚡ Kafka publish is on the request path and critical: We publish to Kafka BEFORE returning the response (it's fast: < 3ms with acks=all). This ensures the event is durably stored before we tell the user "success." If Kafka is down, do we fail the write? Design decision: for a URL shortener, yes, because Kafka durability is core to our DR story. The case to understand is a URL already committed to PostgreSQL whose publish then fails: the row is safe, but Cassandra won't be updated until the event is republished. The trade-off is acceptable: brief Kafka downtime causes failed creates and read-your-own-writes misses, but not data loss.
Read Path Deep Dive
Every step a click takes. This path must be under 50ms for p99. The entire caching architecture exists to make this fast.
// Read service: full lookup chain
public String resolveLongUrl(String shortCode, String writeRegion, Long writeTs) {
    // L1: in-process cache
    String url = l1Cache.getIfPresent(shortCode);
    if (url != null) return url;

    // L2: Redis
    url = redis.get(shortCode);
    if (url != null) {
        l1Cache.put(shortCode, url);
        return url;
    }

    // L3: CDN is handled at the infrastructure level, not here

    // DB: Cassandra with ONE consistency
    url = cassandra.get(shortCode, ConsistencyLevel.ONE);
    if (url == null && writeRegion != null && writeTs != null && isRecent(writeTs)) {
        // Read-your-own-writes: this user just created the URL,
        // so fall back to the source of truth
        url = postgresql.get(shortCode);
    }

    if (url != null) {
        redis.setex(shortCode, 86400, url); // repopulate L2 with a 24h TTL
        l1Cache.put(shortCode, url);
        return url;
    }
    return null; // caller maps this to 404
}
Failover & Disaster Recovery
The section that separates Senior from Staff. Designing the happy path is Senior. Designing how the system survives at 3 AM when a datacenter dies is Staff.
Key mindset shift: Don't design for availability. Design for controlled degradation. The question is never "will it fail?" because it will. The question is "when it fails, does it fail gracefully or catastrophically?"
RTO & RPO: The Two DR Metrics
RPO (Recovery Point Objective): Maximum amount of data loss acceptable. "How far back in time can we afford to roll back?" Our target: RPO ≤ 1 second, bounded by Kafka replication lag. Without Kafka: RPO = undefined (could be minutes of lost writes).
RTO (Recovery Time Objective): Maximum acceptable downtime. "How fast must we recover?" Our target: RTO ≤ 60 seconds for automatic failover (GeoDNS TTL + health check time). Full recovery (US-East back online): 30-60 minutes for safe canary ramp.
With Kafka: Every write event is published to Kafka BEFORE we return success to the user (acks=all). MirrorMaker 2 replicates events to EU and Asia. When US-East recovers, it replays Kafka from the last committed offset. RPO = Kafka replication lag at time of failure (typically < 1 second).
Without Kafka: PostgreSQL async replication to EU might have 500ms lag. US-East dies. The last 500ms of writes are gone. No log to replay. Unrecoverable. This is why Kafka is not just analytics infrastructure; it is your DR backbone.
GeoDNS Failover: Exact Mechanism
// Route53 health check configuration
HealthCheckConfig:
  Type: HTTPS
  ResourcePath: /health
  FailureThreshold: 3    // 3 consecutive failures before marking unhealthy
  RequestInterval: 10    // check every 10 seconds
  // → declares unhealthy after 3 × 10 = 30 seconds
DNS TTL: 60 seconds      // well-behaved clients respect this TTL

// Worst-case failover time: 30s (health check) + 60s (TTL expiry) = ~90s
// That exceeds a 60s RTO target; a 30s TTL brings the worst case down to ~60s.

// Anycast alternative (faster failover):
//   Same IP announced from all regions via BGP
//   BGP withdrawal takes ~30s to propagate
//   No DNS TTL dependency, so faster than GeoDNS
//   Used by Cloudflare, Fastly
Leader Election: Preventing Split Brain
US-East dies. EU detects this and promotes itself to write primary. US-East recovers (maybe it was a network blip). Now BOTH think they are primary. Both accept writes. Data diverges. You cannot automatically merge divergent writes. This is the worst failure mode in distributed systems.
Write service holds a lease in etcd with 30s TTL. Must renew every 10s. If US-East dies, lease expires after 30s. EU watches for lease expiry, races to acquire it. Only one region can hold the lease. The lease IS the primary writer token.
Every write includes the etcd lease version (a monotonically increasing number, for example the etcd revision at which the primary key was created). The storage layer (PostgreSQL, Cassandra) rejects writes with a version number lower than the highest seen. US-East comes back as a zombie, tries to write with its old version → rejected → cannot corrupt data.
etcd uses Raft consensus (simpler, better understood). ZooKeeper uses the ZAB protocol. Both work. etcd is lighter, has a cleaner API, and is the Kubernetes default, so most cloud infrastructure teams prefer it now. ZooKeeper has more history and battle-testing. Either is valid in an interview.
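// Leader election with etcd: simplified, jetcd-style client calls
// (verify method names against your client version)
ByteSequence key = ByteSequence.from("/primary-writer", StandardCharsets.UTF_8);
ByteSequence me  = ByteSequence.from("us-east", StandardCharsets.UTF_8);

while (true) {
    // Try to acquire the write-primary role
    long leaseId = etcd.getLeaseClient().grant(30).get().getID(); // 30s TTL

    // Atomic create-if-absent. A plain put would overwrite the current primary's key,
    // so the put runs in a transaction guarded by "key does not exist yet".
    TxnResponse txn = etcd.getKVClient().txn()
        .If(new Cmp(key, Cmp.Op.EQUAL, CmpTarget.createRevision(0)))
        .Then(Op.put(key, me, PutOption.newBuilder().withLeaseId(leaseId).build()))
        .commit().get();

    if (txn.isSucceeded()) {
        // We got it: the key did not exist, so we are primary
        startHeartbeat(leaseId); // renew every 10s
        break;
    } else {
        // Someone else is primary: release our unused lease, then wait for key deletion
        etcd.getLeaseClient().revoke(leaseId).get();
        watchForExpiry("/primary-writer");
    }
}

// Fencing token: every write includes lease version
// Storage layer: if write.version < max_seen_version → REJECT
// This kills zombie primaries that come back after being presumed dead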
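The storage-side fencing rule in the closing comments above (reject any write whose version is lower than the highest seen) can be sketched as follows. This is an in-memory illustration only; the class name is hypothetical, and a real store would persist the highest token alongside the data.

```java
/** Sketch: a store that rejects writes carrying a stale fencing token. */
final class FencedStore {

    private long highestTokenSeen = 0;

    /**
     * Applies the write only if its token is not older than the newest token seen so far.
     * Synchronized so that the check and the update happen as one step.
     */
    synchronized boolean tryWrite(long fencingToken, Runnable write) {
        if (fencingToken < highestTokenSeen) {
            return false; // zombie primary: its token predates the current leader's
        }
        highestTokenSeen = fencingToken;
        write.run();
        return true;
    }
}
```

Once the new primary has written with token 2, a recovered US-East still holding token 1 is rejected no matter how healthy it believes itself to be.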
US-East Dies: Full Runbook
⚡ Canary promotion must be automated with SLO gates: Error rate < 0.01% AND p99 latency within 20% of baseline AND Kafka replication lag = 0ms. These three conditions must ALL be true before auto-promoting to the next tier. Human approval required for the final 100% promotion. This prevents re-introducing a broken node at full blast.
Security
Security is often glossed over in system design interviews. Knowing it in detail signals production experience.
Authentication: JWT & OAuth2
JSON Web Token. Three Base64URL-encoded parts: Header (algorithm), Payload (claims: userId, email, exp, iat), Signature (HMAC or RSA). The signature proves the token was issued by the IdP and hasn't been tampered with.
API Gateway validates JWT by verifying the signature using the IdP's public key (RS256 = RSA). No call to the Auth service per request. The public key is cached at the gateway. Massive throughput: validation is a CPU operation (< 1ms), not a network call.
HS256 uses a shared secret: any party with the secret can forge tokens. RS256 uses asymmetric keys: the IdP signs with its private key, everyone else verifies with the public key. Only the IdP can issue tokens. RS256 is correct for multi-service architectures.
JWTs are stateless, so you cannot "un-issue" one. If a user is banned, their JWT is still valid until expiry. Solutions: short expiry (15 min) + refresh tokens, or maintain a revocation list (sacrifices statelessness), or use opaque tokens with introspection (back to stateful).
Rate Limiting: Token Bucket Deep Dive
Limits requests per user/IP per time window. Prevents abuse, DoS, and resource exhaustion. Different limits per tier (free vs pro vs enterprise).
Token bucket: bucket fills at constant rate. Each request consumes a token. Allows burst (up to bucket capacity). Leaky bucket: requests processed at fixed rate regardless of when they arrive. No burst allowed. Token bucket is more user-friendly.
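The token bucket described above fits in a few lines. The sketch below is a single-process illustration with hypothetical names; the clock is passed in as a parameter so the behaviour is deterministic, whereas a shared limiter would keep this state in Redis.

```java
/** Sketch: token bucket that refills continuously and allows bursts up to its capacity. */
final class TokenBucket {

    private final double capacity;        // maximum burst size
    private final double refillPerSecond; // sustained request rate
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity; // start full so an idle client can burst immediately
        this.lastRefillNanos = nowNanos;
    }

    /** Returns true and consumes one token if the request is allowed. */
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
        if (elapsedSeconds > 0) {
            tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
            lastRefillNanos = nowNanos;
        }
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

Capacity controls how large a burst is tolerated, and the refill rate controls the long-run average; per-tier limits (free vs pro vs enterprise) are just different values for these two parameters.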
Exact algorithm: store the timestamp of every request in a sorted set (Redis ZADD). On each request: remove entries older than 1 minute (ZREMRANGEBYSCORE), count the remaining (ZCARD). If count ≥ limit → reject. Exact but high memory: O(requests) per user.
If the limit is 100/minute and the window resets at :00, a user can send 100 at :59 and 100 at :01 → 200 requests in 2 seconds. The boundary allows 2× the limit. Sliding window solves this.
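The sliding window log described above can be sketched with an in-memory deque in place of the Redis sorted set. The steps match the ZREMRANGEBYSCORE / ZCARD / ZADD sequence; the class name is hypothetical, and one instance represents one user.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch: exact sliding-window limiter that keeps one timestamp per accepted request. */
final class SlidingWindowLog {

    private final int limit;
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    SlidingWindowLog(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    synchronized boolean allow(long nowMillis) {
        // Evict entries that have left the window (ZREMRANGEBYSCORE equivalent).
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= nowMillis - windowMillis) {
            timestamps.pollFirst();
        }
        // Count what remains (ZCARD equivalent) and reject at the limit.
        if (timestamps.size() >= limit) {
            return false;
        }
        timestamps.addLast(nowMillis); // ZADD equivalent
        return true;
    }
}
```

Because the window slides with each request instead of resetting at a fixed boundary, the burst across :59 and :01 is counted as one window and rejected once the limit is reached.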
301 vs 302 Redirect: Full Analysis
302 Found (temporary redirect):
- Browser always calls our server on every click
- We capture every click for analytics
- We can update or expire the URL at any time
- Rate limiting works on every request
- We detect malicious usage patterns
301 Moved Permanently:
- Browser caches redirect permanently
- Subsequent clicks never reach our server
- Analytics broken: we see each URL clicked once per browser
- Cannot update long_url after creation
- Cannot expire/deactivate URLs for cached clients
307 vs 302: 302 may convert POST to GET when following redirect. 307 preserves the HTTP method. For URL shortener, users are redirecting from a GET click, so 302 and 307 behave identically. Know the difference for completeness.
SSRF Prevention
Server-Side Request Forgery: an attacker creates a short URL pointing to an internal service (e.g., http://169.254.169.254/latest/meta-data/, the AWS instance metadata endpoint). When the server "validates" the URL by fetching it, it inadvertently exposes internal infrastructure.
Before storing a URL: resolve the hostname to IP. Check the IP against blocked ranges. Block: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16 (metadata), 127.0.0.0/8 (localhost). Also: allowlist schemes (https only, block file://, ftp://).
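The resolve-then-check step above can be sketched with the JDK alone. This is a minimal illustration with a hypothetical class name, not a complete defence: it covers the IPv4 ranges listed above via `InetAddress` predicates, it does not cover every IPv6 private range, and because DNS answers can change between validation and a later fetch, the same check should be repeated at fetch time.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.UnknownHostException;

/** Sketch: scheme allowlist plus resolved-IP blocklist for submitted URLs. */
final class SsrfGuard {

    static boolean isBlockedAddress(InetAddress address) {
        return address.isLoopbackAddress()   // 127.0.0.0/8
            || address.isSiteLocalAddress()  // 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
            || address.isLinkLocalAddress()  // 169.254.0.0/16 (cloud metadata)
            || address.isAnyLocalAddress();  // 0.0.0.0
    }

    static boolean isAllowed(String url) {
        try {
            URI uri = new URI(url);
            if (!"https".equalsIgnoreCase(uri.getScheme()) || uri.getHost() == null) {
                return false; // blocks file://, ftp://, plain http, and host-less URLs
            }
            // A hostname may resolve to several addresses; every one must be public.
            for (InetAddress address : InetAddress.getAllByName(uri.getHost())) {
                if (isBlockedAddress(address)) {
                    return false;
                }
            }
            return true;
        } catch (URISyntaxException | UnknownHostException e) {
            return false; // unparseable or unresolvable: reject
        }
    }
}
```

Checking the resolved IP rather than the hostname string matters because an attacker can register a public-looking domain whose DNS record points at 10.0.0.5.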
Observability & SLOs
"How do you prove the system is working correctly in production?" This is the final question that separates implementers from owners.
SLO / SLA / SLI: Exact Definitions
Service Level Indicator. A specific metric: redirect success rate (%), p99 redirect latency (ms), URL creation success rate (%). Must be objective and measurable.
Service Level Objective. Internal target derived from SLIs: redirect success rate ≥ 99.9%, p99 redirect latency ≤ 50ms. No contractual obligation. Used to guide engineering decisions.
Service Level Agreement. Contractual promise: usually SLO minus a buffer (99.5% availability). Breaching SLA triggers compensation (credits, refunds). Never set SLA = SLO, or you'll be paying credits constantly.
99.9% SLO = 0.1% allowed errors. Monthly: 0.1% × 30 × 24 × 60 minutes = 43.2 minutes of allowed downtime. Error budget burn rate: if burning at 10× the normal rate → page oncall before the budget runs out. This is Google SRE's core concept.
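The error budget arithmetic above generalises to any SLO and window. A minimal sketch with hypothetical names follows; the burn-rate helper assumes a constant burn rate over the window.

```java
/** Sketch: error budget and burn-rate arithmetic for an availability SLO. */
final class ErrorBudget {

    /** Minutes of downtime allowed in the window, e.g. 99.9% over 30 days. */
    static double allowedDowntimeMinutes(double sloPercent, int windowDays) {
        double totalMinutes = windowDays * 24.0 * 60.0;
        return totalMinutes * (100.0 - sloPercent) / 100.0;
    }

    /** Days until the budget is gone if errors arrive at burnRate times the sustainable rate. */
    static double daysUntilExhausted(int windowDays, double burnRate) {
        return windowDays / burnRate;
    }

    public static void main(String[] args) {
        System.out.printf("Budget: %.1f minutes%n", allowedDowntimeMinutes(99.9, 30));
        System.out.printf("At 10x burn the budget lasts %.1f days%n", daysUntilExhausted(30, 10.0));
    }
}
```

At a 10× burn rate a 30-day budget is gone in 3 days, which is why burn rate, rather than the raw error count, is the paging signal.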
Synthetic Monitoring: Proving Correctness
// Synthetic test: runs every 60 seconds from each region
// post(...) and get(...) are placeholder HTTP helpers, not a specific client library
void syntheticTest() {
    long startNanos = System.nanoTime();
    String unique = "https://test-target.com/" + UUID.randomUUID();

    // Step 1: Create short URL
    Response create = post("/api/v1/urls", Map.of("long_url", unique));
    assert create.status == 201;
    String shortCode = create.body.short_code;

    // Step 2: Resolve short URL (do not follow the redirect)
    Response redirect = get("/" + shortCode, /* followRedirects= */ false);
    assert redirect.status == 302;
    assert redirect.header("Location").equals(unique);

    // Step 3: Assert total latency
    long totalMs = (System.nanoTime() - startNanos) / 1_000_000;
    assert totalMs < 200;

    // Step 4: Record metrics
    metrics.record("synthetic.latency", totalMs);
    metrics.record("synthetic.success", 1);
}

// This catches:
//   - Cache misconfiguration (wrong URL returned)
//   - Replication lag breaking read-your-own-writes
//   - SSL certificate expiry
//   - DNS misconfiguration
//   - Database inconsistency (short_code exists but wrong long_url)
Predictive Monitoring: The Staff Signal
// REACTIVE (Junior): alert when pool = 0, already broken
if (poolRemaining == 0) alert("Pool empty!"); // too late

// PREDICTIVE (Staff): alert when pool WILL hit 0 in N minutes
double rate = metrics.rate("pool.consumed", 5, MINUTES); // codes consumed per second
long remaining = getPoolRemaining();
// Guard the division: an idle pool (rate = 0) never empties, an empty pool is already at 0
double timeToEmpty = remaining == 0 ? 0
                   : rate > 0 ? remaining / rate
                   : Double.POSITIVE_INFINITY; // seconds

if (timeToEmpty < 600) alert(P2, "Pool empties in 10min");
if (timeToEmpty < 120) alert(P0, "Pool empties in 2min");
if (timeToEmpty < 30)  circuitBreaker.trip(); // auto-failover

// PromQL equivalent:
//   url_pool_remaining / rate(url_pool_consumed_total[5m]) < 600
//   → alert: "Pool depletes in less than 10 minutes"
Chaos Engineering
Cross Questions: Interview Traps
Every question an interviewer has ever asked about URL shortener design, with the exact answer that signals senior/staff level thinking.
Category 1: "Why not just use X?"
Category 2: "What happens when X fails?"
Category 3: "How would you change the design if...?"
Cassandra: INSERT INTO url_mappings (...) USING TTL 86400. Cassandra automatically deletes the row after 24h.
PostgreSQL: background job → UPDATE url_mappings SET is_active=false WHERE expires_at < NOW(), runs every 5 minutes. Publish url.expired events to Kafka.
Cache: set cache TTL = min(24h, time remaining until expiry). Calculate at cache-write time.
CDN: CDN invalidation API call when the URL expires. Set Cache-Control max-age to the remaining TTL.
On redirect: check is_active and expires_at before returning 302. Return 410 Gone (not 404): 410 means "permanently removed", which tells search engines and browsers to remove this from their indexes.
Full Glossary: 80+ Terms
Every term you need to know, with a one-liner definition and when to use it in an interview. Sorted by category.
Concurrency & Locking
| Term | One-Line Definition | Interview context |
|---|---|---|
| SKIP LOCKED | SQL hint: skip rows already locked by other transactions instead of waiting. Enables concurrent pool pops without deadlock. | URL pool pop mechanism |
| FOR UPDATE | Lock selected rows within transaction to prevent concurrent modification. | Pool pop + SKIP LOCKED |
| Optimistic locking | Assume no conflict; detect at commit using version column. Fast reads, retries on conflict. | High-read, low-write contention scenarios |
| Pessimistic locking | Lock resource immediately on access. Guaranteed no conflict but blocks others. | High-write contention scenarios |
| Deadlock | Two transactions each wait for a lock the other holds. Neither can proceed. DB detects and kills one. | Why SKIP LOCKED is better than FOR UPDATE alone |
| MVCC | Multi-Version Concurrency Control. Multiple versions of data exist simultaneously. Readers never block writers. | How PostgreSQL achieves high concurrency |
| CAS (Compare-and-Swap) | Atomic: "set value only if current value = expected." Foundation of lock-free data structures and etcd leader election. | Leader election mechanism |
| Idempotency | Operation can be applied multiple times with same result. Essential for retry logic. | Kafka producer, API design |
| Mutex | Mutual exclusion lock. Only one thread can hold it. Redis SETNX implements distributed mutex. | Thundering herd protection |
| Semaphore | Like a mutex but allows N threads simultaneously. Redis can implement with INCR/DECR. | Concurrency control, rate limiting |
Caching
| Term | One-Line Definition | Interview context |
|---|---|---|
| SETNX | Redis SET if Not eXists. Returns 1 if set, 0 if key already existed. Used for distributed mutex. | Thundering herd prevention |
| LRU | Least Recently Used. Evicts item not accessed for longest time. Recency-based. | Compare to LFU; Guava uses this |
| LFU | Least Frequently Used. Evicts item accessed fewest times. Frequency-based. Better for Zipf distributions. | Redis allkeys-lfu, Caffeine uses W-TinyLFU variant |
| ARC | Adaptive Replacement Cache. Balances LRU and LFU dynamically. Used in ZFS, some SSD controllers. | Mention as alternative to LRU/LFU |
| W-TinyLFU | Window Tiny LFU. Caffeine's algorithm. Count-Min Sketch estimates frequency. Near-optimal hit rate. | Why Caffeine beats Guava |
| Count-Min Sketch | Probabilistic frequency counter using multiple hash functions. Fixed, sub-linear space; counts are approximate and can only overestimate. | How W-TinyLFU estimates access frequency |
| Bloom filter | Probabilistic membership test. Zero false negatives, small false positive rate. Fixed-size bit array; O(k) per lookup for k hash functions. | Cache penetration prevention |
| Thundering herd | Many concurrent cache misses on same key → mass DB requests → DB overwhelmed. | Problem; SETNX is the fix |
| Cache stampede | Same as thundering herd. Also called dogpile effect. | Know all three names |
| Cache penetration | Requests for nonexistent keys bypass cache every time. Fix: negative caching or Bloom filter. | Security + performance concern |
| Cache avalanche | Mass simultaneous TTL expiry → mass DB requests. Fix: TTL jitter. | Cold start scenario |
| Cache pollution | Low-frequency items evict high-frequency items. Fix: LFU eviction. | Scraper/bot traffic scenario |
| Write-through | Write to cache + DB synchronously. Strong consistency. Write latency penalty. | Compare cache invalidation strategies |
| Write-back | Write to cache only, async flush to DB. Fast writes, risk of data loss on cache failure. | High-write scenarios |
| Cache-aside | App manages cache: read cache → miss → DB → populate cache. Most common pattern. | Our URL shortener read pattern |
| TTL jitter | Random offset added to TTL to prevent simultaneous expiry: TTL = base + random(0, 20%). | Cache avalanche prevention |
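The TTL jitter formula in the last row of the table above (TTL = base + random(0, 20%)) can be sketched as follows. The class and method names are hypothetical; the jitter fraction is a parameter so it is not tied to 20%.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Sketch: spread cache expiries by adding a random offset to the base TTL. */
final class TtlJitter {

    /** Returns baseTtlSeconds plus a uniform random extra in [0, base * maxJitterFraction]. */
    static long withJitter(long baseTtlSeconds, double maxJitterFraction) {
        long maxExtra = (long) (baseTtlSeconds * maxJitterFraction);
        // nextLong(bound) is exclusive of the bound, hence the + 1.
        long extra = ThreadLocalRandom.current().nextLong(maxExtra + 1);
        return baseTtlSeconds + extra;
    }

    public static void main(String[] args) {
        // Keys written in the same second now expire up to ~4.8 hours apart.
        System.out.println(withJitter(86_400L, 0.20));
    }
}
```

Keys cached together during a cold start then expire over a spread of time instead of in the same second, which is what prevents the avalanche.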
Distributed Systems
| Term | One-Line Definition | Interview context |
|---|---|---|
| CAP theorem | During network Partition: choose Consistency or Availability. Cannot have both. | Justify AP choice for URL shortener |
| PACELC | Extends CAP: if Partition → Availability vs Consistency; Else → Latency vs Consistency. More practical than CAP alone. | Staff-level consistency discussion |
| ACID | Atomicity, Consistency, Isolation, Durability. PostgreSQL guarantees. Strong but slow cross-region. | Why PostgreSQL for writes |
| BASE | Basically Available, Soft state, Eventually consistent. Cassandra's philosophy. | Why Cassandra for reads |
| Eventual consistency | All replicas converge to same value given time and no new updates. | URL redirect reads (ONE consistency) |
| Strong consistency | Every read returns the most recent write. Requires coordination = higher latency. | Write path requirement |
| Linearizability | Strictest consistency: operations appear instantaneous and sequential. Spanner provides this. | Alternative to eventual consistency (costly) |
| Read-your-own-writes | After writing, you always see your own write on subsequent reads. Violated by async replication. | Cross-region replication problem |
| Monotonic reads | Once you've seen data version N, you never see an older version N-1. Time travel prevention. | Consistency guarantee weaker than strong |
| Consistent hashing | Maps keys to nodes on a ring. Adding/removing a node moves minimal keys. Used by Cassandra, Redis Cluster. | How Cassandra distributes data |
| Virtual nodes (vnodes) | Each physical node owns multiple virtual positions on ring. Better load distribution. | Cassandra internals |
| Quorum | Majority (⌊N/2⌋ + 1) must agree. R + W > N = strong consistency. Key formula. | Cassandra consistency levels |
| Raft | Consensus algorithm for leader election and log replication. Used by etcd, CockroachDB. Simpler than Paxos. | etcd leader election |
| Paxos | Original distributed consensus algorithm. Basis for Raft and ZAB. Proven correct but complex. | Historical context for Raft/ZAB |
| ZAB | ZooKeeper Atomic Broadcast. ZooKeeper's consensus protocol. Similar to Raft. | ZooKeeper internals |
| Split brain | Two nodes both believe they are primary. Causes unrecoverable data divergence. | Why fencing tokens are necessary |
| Fencing token | Monotonically increasing token from lock service. Storage rejects writes with old tokens. | Split brain prevention mechanism |
| Two-phase commit (2PC) | Distributed transaction: prepare phase + commit phase. Blocking if coordinator fails. Avoid at scale. | Why we don't use it (ZooKeeper range approach) |
| Saga pattern | Distributed transaction via sequence of local transactions with compensating rollbacks. | Alternative to 2PC for microservices |
| WAL | Write-Ahead Log. All changes logged before applying. Enables replication, recovery, point-in-time restore. | PostgreSQL replication mechanism |
Networking & Infrastructure
| Term | One-Line Definition | Interview context |
|---|---|---|
| Anycast | Same IP announced from multiple locations. BGP routes to nearest. Used by CDNs. | Faster failover than GeoDNS (no TTL wait) |
| GeoDNS | DNS returns different IPs based on requester's geographic location. | Region routing for URL shortener |
| GSLB | Global Server Load Balancer. Routes globally based on health, latency, geography. | Enterprise alternative to Route53 |
| BGP | Border Gateway Protocol. Internet routing protocol. BGP withdrawal = region removed from routing. | Anycast failover mechanism |
| PoP | Point of Presence. CDN edge node in a city. Cloudflare: 300+ PoPs globally. | CDN geography discussion |
| Circuit breaker | Stops calling failing service. CLOSED → OPEN → HALF-OPEN states. Prevents cascade failure. | Pool empty, Redis down, service failure |
| Bulkhead pattern | Isolate failures via separate thread pools per downstream service. One slow service doesn't starve others. | Microservice resilience |
| Sidecar proxy | Service mesh component (Envoy). Handles retries, circuit breaking, mTLS without app code changes. | Istio/Linkerd architecture |
| mTLS | Mutual TLS. Both client and server authenticate with certificates. Service-to-service security. | Internal service security |
| Backpressure | Slow consumer signals producer to slow down. Prevents memory overflow and cascade failure. | Kafka consumer lag management |
Reliability & Performance
| Term | One-Line Definition | Interview context |
|---|---|---|
| p99 latency | 99th percentile: 99% of requests complete faster than this value. More meaningful than average. | SLO definition |
| p99.9 latency | 99.9th percentile: the tail. Often 10-100× worse than p99. Where real user pain lives. | Staff-level latency discussion |
| Zipf distribution | Power-law: small number of items get vast majority of traffic. Top 20% URLs = 80% traffic. | Justifies LFU caching over LRU |
| SLI | Service Level Indicator. What you measure: success rate, p99 latency. | Foundation of SLO |
| SLO | Service Level Objective. Internal target: p99 ≤ 50ms, availability ≥ 99.9%. | Engineering goal, not contractual |
| SLA | Service Level Agreement. Contractual promise. Usually SLO minus buffer. Breach = credits. | Customer-facing guarantee |
| Error budget | Allowed failure quota from SLO: 99.9% = 43.2 min/month downtime allowed. | SRE decision framework |
| RPO | Recovery Point Objective. Max acceptable data loss. Our target: < 1s (Kafka bounded). | DR planning |
| RTO | Recovery Time Objective. Max acceptable downtime. Our target: < 60s (GeoDNS failover). | DR planning |
| MTTR | Mean Time To Recovery. Average time to restore service after failure. | DR metrics |
| Canary deployment | Route small % of traffic to new version. 1% → 5% → 25% → 100%. Automated SLO-gated promotion. | Safe recovery and deployment |
| Blue-green deployment | Two identical environments. Instant traffic switch. Instant rollback. | Zero-downtime deployment |
| Feature flag | Toggle functionality without deployment. Enables gradual rollout, A/B testing, kill switches. | Progressive feature rollout |
| Chaos engineering | Intentionally inject failures in production to find weaknesses before real incidents do. | How to prove the system actually works |
| Birthday paradox | When drawing randomly from M possible values, the first collision is expected after roughly √M picks. | Why MD5 truncation fails for ID generation |
Kafka Specific
| Term | One-Line Definition | Interview context |
|---|---|---|
| acks=all | Producer waits for all in-sync replicas to acknowledge. Max durability. | DR: why data survives leader failure |
| min.insync.replicas | Minimum ISR count for produce to succeed. Set to 2 with RF=3 for safety floor. | Pair with acks=all |
| enable.idempotence | Producer assigns sequence numbers. Broker deduplicates retries. Exactly-once at producer level. | At-least-once vs exactly-once |
| Consumer group | Multiple consumers sharing partitions. Each partition consumed by exactly one member. | Parallel consumption, independent read progress |
| Consumer offset | Position of last read message. Committed to __consumer_offsets topic. Enables crash recovery. | Kafka replay for DR |
| Log compaction | Kafka retains only latest value per key. Enables event sourcing and efficient state rebuild. | URL state as a compacted log |
| DLQ | Dead Letter Queue. Failed messages after max retries sent here for inspection. Never block main consumer. | Consumer error handling |
| MirrorMaker 2 | Kafka's cross-cluster replication. Mirrors topics between DCs. Enables DR and analytics sync. | Cross-region event replication |
| At-least-once delivery | Message delivered one or more times. Consumer must be idempotent. | Default Kafka guarantee |
| Exactly-once semantics | Message delivered exactly once. Requires idempotent producer + transactional consumer. | Critical for financial, dangerous for performance |
Security
| Term | One-Line Definition | Interview context |
|---|---|---|
| JWT | JSON Web Token. Stateless bearer token with signed claims. Verified by signature, no DB lookup. | Stateless authentication at scale |
| RS256 | RSA signature with SHA-256. Asymmetric β only IdP can sign, anyone can verify with public key. | Multi-service JWT verification |
| HS256 | HMAC signature with SHA-256. Symmetric shared secret β any holder can forge tokens. Avoid for public APIs. | Why RS256 is preferred |
| OAuth2 | Authorization framework. Delegates access. Flows: Auth Code (web), Client Credentials (service-to-service). | API authentication for URL shortener |
| SSRF | Server-Side Request Forgery. Attacker tricks server into making requests to internal services. | URL validation requirement |
| Token bucket | Rate limiting: bucket fills at constant rate. Request consumes a token. Allows bursts. | API rate limiting implementation |
| Sliding window log | Exact rate limiting: store request timestamps, count within window. Memory-intensive but precise. | Compare rate limit algorithms |
| CORS | Cross-Origin Resource Sharing. Browser policy for cross-domain requests. Controlled via headers. | Web API security |
| 301 vs 302 | 301 = permanent (browsers may cache it indefinitely, so later clicks can bypass the server). 302 = temporary (not cached by default, so each click reaches the server). Use 302 for analytics. | Redirect type choice and reasoning |
| 410 Gone | HTTP status for a permanently removed resource. Tells clients the removal is intentional and lets search engines drop the URL from their index. Use for expired URLs. | URL expiry handling, more informative than 404 |
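
The token bucket entry above can be sketched as follows. This is a single-process, in-memory illustration under assumed names (`TokenBucket`, `capacity`, `refill_per_sec`); a shared limiter backed by Redis would need the refill and consume steps to run atomically, which this sketch does not attempt.

```python
import time


class TokenBucket:
    """Bucket refills at a constant rate; each request consumes tokens."""

    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity              # maximum burst size
        self.refill_per_sec = refill_per_sec  # sustained request rate
        self.clock = clock                    # injectable for testing
        self.tokens = capacity                # start full so an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request fits, else False."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill lazily based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `capacity=3` and `refill_per_sec=1`, a client can send three requests at once, then is limited to roughly one per second until the bucket refills.
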
1-Page Cheat Sheet
Cover this the night before. If you can answer every item below without looking, you are ready.
Interview opening script: "Before I design anything, let me establish scale. 100M DAU, 10M writes/day ≈ 115 writes/sec, 100:1 read ratio ≈ 11,500 reads/sec, 5-year storage ~20TB. The system is heavily read-biased and intercontinental, and I'll choose availability over consistency because a stale redirect is acceptable but a 503 is not."
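
The arithmetic behind the opening script can be checked with a few lines. The write volume, read ratio and retention come from the script; the 1,000 bytes per record is an assumption made here to reproduce the "~20TB" figure, not a number stated in the script.

```python
SECONDS_PER_DAY = 86_400

writes_per_day = 10_000_000
read_write_ratio = 100
retention_years = 5
bytes_per_record = 1_000  # assumption: URL plus metadata, roughly 1 KB per row

writes_per_sec = writes_per_day / SECONDS_PER_DAY
reads_per_sec = writes_per_sec * read_write_ratio
total_records = writes_per_day * 365 * retention_years
storage_tb = total_records * bytes_per_record / 1e12

print(f"writes/sec: {writes_per_sec:.1f}")
print(f"reads/sec:  {reads_per_sec:.0f}")
print(f"records:    {total_records:,}")
print(f"storage TB: {storage_tb:.2f}")
```

This gives about 115.7 writes/sec, about 11,574 reads/sec and 18.25 TB before replication and indexes, which the script rounds to 115, 11,500 and ~20TB.
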
Components β One-Line Each
| Component | Technology | Why this choice |
|---|---|---|
| ID Generation | Pool (SKIP LOCKED) + Snowflake fallback | O(1) write path, no collisions, circuit breaker protected |
| L1 Cache | Caffeine (W-TinyLFU) | In-process, zero network, best hit rate for Zipf distribution |
| L2 Cache | Redis Cluster (allkeys-lfu) | Shared across instances, LFU for viral URLs, SETNX mutex |
| L3 Cache | CDN (CloudFront/Fastly) | Eliminates intercontinental latency for hot URLs |
| Write DB | PostgreSQL | ACID, UNIQUE constraint, WAL for replication |
| Read DB | Cassandra (LOCAL_QUORUM writes, ONE reads) | Linear scale, multi-region active-active, key-value optimised |
| Analytics | ClickHouse / BigQuery | Columnar, append-only, async from Kafka |
| Event log | Kafka (acks=all, RF=3) | DR replay + analytics fan-out + decoupling |
| Cross-region Kafka | MirrorMaker 2 | DR: replay events after datacenter recovery |
| Leader election | etcd (Raft) | Split brain prevention, fencing token support |
| Traffic routing | GeoDNS (Route53) / Anycast | Route to nearest healthy region, 30-90s failover |
| Auth | JWT (RS256) + OAuth2 | Stateless validation, no per-request Auth service call |
| Rate limiting | Token bucket (Redis) | Allows bursts, per-user, tier-aware |
| Redirect type | 302 Found | Enables analytics, allows URL updates/expiry |
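
The redirect choice in the last row, combined with the 410 Gone rule for expired URLs, can be sketched as one decision function. The record shape (`long_url`, `expires_at`) and the function name are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple


def resolve_redirect(record: Optional[dict], now: datetime) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, Location target) for a short-code lookup."""
    if record is None:
        return 404, None  # short code was never issued
    expires_at = record.get("expires_at")
    if expires_at is not None and now >= expires_at:
        return 410, None  # existed once, now permanently gone
    # 302 keeps every click flowing through the server, so analytics still work
    # and the target can be changed or expired later.
    return 302, record["long_url"]


now = datetime(2024, 1, 2, tzinfo=timezone.utc)
active = {"long_url": "https://example.com/a", "expires_at": None}
expired = {"long_url": "https://example.com/b",
           "expires_at": datetime(2024, 1, 1, tzinfo=timezone.utc)}
```
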
The 8 Terms That Impress Interviewers Most
- W-TinyLFU: "Caffeine uses W-TinyLFU, which maintains a near-optimal hit rate via a Count-Min Sketch frequency estimator, far better than Guava's LRU for skewed access patterns."
- SKIP LOCKED: "The pool pop uses FOR UPDATE SKIP LOCKED, so multiple workers can pop concurrently without blocking each other, which is impossible with plain FOR UPDATE."
- PACELC: "CAP only covers partition scenarios. PACELC is more useful: during a partition we choose Availability; in normal operation we choose Latency over Consistency, hence ONE reads in Cassandra."
- Fencing token: "To prevent split brain, every write includes the etcd lease version. Storage rejects writes with a lower version, killing zombie primaries."
- LeveledCompactionStrategy: "For Cassandra reads I'd use LCS over the default STCS. It maintains sorted SSTables in levels, reducing read amplification for our 100:1 read-heavy workload."
- Predictive depletion monitoring: "Rather than alerting when the pool hits zero, at which point the system is already broken, I'd alert when time-to-empty drops below 10 minutes based on the current consumption rate."
- acks=all + min.insync.replicas: "Kafka producers use acks=all with min.insync.replicas=2. This bounds our RPO to the Kafka replication lag, typically under 1 second, enabling deterministic disaster recovery via replay."
- 410 Gone vs 404: "Expired URLs return 410 Gone, not 404. 410 is semantically permanent: it tells browsers and search engines to remove the URL from their index. 404 implies the resource might come back."
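
The fencing token rule above ("storage rejects writes with a lower version") can be sketched as follows. The integer token stands in for whatever monotonically increasing value the leader obtains from etcd; `FencedStore` and its methods are hypothetical names, and this is an in-memory illustration rather than a real storage engine.

```python
class FencedStore:
    """Accepts a write only if its token is not older than the newest one seen."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False  # stale leader (zombie primary): reject the write
        self.highest_token = token
        self.data[key] = value
        return True


store = FencedStore()
old_leader_token, new_leader_token = 7, 8
first = store.write(old_leader_token, "abc123", "https://example.com/v1")
second = store.write(new_leader_token, "abc123", "https://example.com/v2")
# The old leader wakes up after a pause and retries with its stale token.
zombie = store.write(old_leader_token, "abc123", "https://example.com/stale")
```

The zombie write is rejected and the value written by the new leader is preserved, even though the old leader still believes it holds the lock.
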
You are ready when: You can narrate the entire system design, from requirements to components to failure modes to observability, in 45 minutes without notes, and correctly answer any follow-up on any component without hesitation. Use this cookbook to practice out loud, not just to read.