The System Design Interview Playbook: 2026 Edition with AI Considerations
A comprehensive guide to acing system design interviews in 2026. Covers traditional distributed systems plus AI-native architectures, LLM-powered apps, and vector databases.
System design interviews have evolved. In 2026, you’re not just designing databases and load balancers—you’re architecting LLM-powered systems, vector search pipelines, and AI-native applications.
I’ve conducted 50+ system design interviews at FAANG companies and startups, and I’ve noticed a clear shift: candidates who only know traditional patterns are falling behind. The bar has been raised.
This guide gives you the framework to ace any system design interview, whether it’s a classic “Design Twitter” or a modern “Design an AI Agent Platform.”
The 4S Framework
System design is overwhelming because there’s too much to cover. Use the 4S Framework to stay structured:
- Scope - Understand requirements and constraints
- Sketch - Design high-level architecture
- Scale - Handle millions of users
- Solidify - Deep dive into critical components
Let’s walk through each step with real interview examples.
Phase 1: Scope (5-10 minutes)
Never start designing without understanding what you’re building. Interviewers will penalize you for jumping into solutions prematurely.
Functional Requirements
What does the system do?
Example: Design a URL shortener
✓ Create short URLs from long URLs
✓ Redirect short URLs to original
✓ Custom aliases (optional)
✓ Analytics (optional)
✗ Image uploads (out of scope)
✗ User authentication (clarify)
Key Questions to Ask:
- “Should users be able to create custom short URLs?”
- “Do we need analytics on click-through rates?”
- “Can users delete or update their URLs?”
- “Is there a time limit for short URLs?”
Non-Functional Requirements
How should the system behave?
Priority Matrix:
┌─────────────────┬──────────┬──────────┐
│ Requirement     │ Priority │ Notes    │
├─────────────────┼──────────┼──────────┤
│ Availability    │ High     │ 99.99%   │
│ Latency         │ High     │ <100ms   │
│ Scalability     │ High     │ 100M/day │
│ Durability      │ Medium   │ No loss  │
│ Security        │ Medium   │ HTTPS    │
└─────────────────┴──────────┴──────────┘
Key Questions to Ask:
- “What’s the expected scale? Daily active users, requests per second?”
- “What’s the acceptable latency for URL creation vs redirection?”
- “What’s more important: consistency or availability?”
- “Any compliance requirements (GDPR, etc.)?”
Back-of-the-Envelope Math
Quick calculations show you understand scale:
URL Shortener Math:
- 100 million new URLs per day
- 100:1 read:write ratio
- 10 billion redirects per day
Storage:
- Each URL record: ~500 bytes
- 100M × 500B = 50GB per day
- 5 years: 50GB × 365 × 5 = ~90TB
QPS (Queries Per Second):
- Writes: 100M/day ÷ 86,400s = ~1,200 WPS
- Reads: 10B/day ÷ 86,400s = ~115,000 RPS
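These figures are easy to sanity-check in a few lines of Python (a quick sketch using the traffic assumptions above):

```python
SECONDS_PER_DAY = 86_400

new_urls_per_day = 100_000_000   # 100M writes/day (assumption above)
read_write_ratio = 100           # 100:1 reads to writes
record_bytes = 500               # ~500 bytes per URL record

writes_per_sec = new_urls_per_day / SECONDS_PER_DAY
reads_per_sec = writes_per_sec * read_write_ratio
storage_per_day_gb = new_urls_per_day * record_bytes / 1e9
five_year_tb = storage_per_day_gb * 365 * 5 / 1e3

print(f"{writes_per_sec:,.0f} WPS")        # ~1,157 WPS
print(f"{reads_per_sec:,.0f} RPS")         # ~115,741 RPS
print(f"{five_year_tb:.1f} TB / 5 years")  # ~91.2 TB
```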
Pro Tip: Memorize these numbers:
- 1 million seconds ≈ 11.5 days
- 1 billion seconds ≈ 31.7 years
- 1 KB ≈ 10^3 bytes, 1 MB ≈ 10^6, 1 GB ≈ 10^9, 1 TB ≈ 10^12
- SSD random read: ~0.1ms, HDD seek: ~10ms, network round trip (same DC): ~0.5ms
Phase 2: Sketch (10-15 minutes)
Design the high-level architecture. Keep it simple—details come later.
API Design
Start with the interface:
# URL Shortener API
POST /api/v1/urls
Request:
{
  "long_url": "https://example.com/very/long/path",
  "custom_alias": "mylink"   # optional
}
Response:
{
  "short_url": "https://short.io/abc123",
  "created_at": "2026-03-17T10:00:00Z",
  "expires_at": "2027-03-17T10:00:00Z"
}
GET /{short_code}
Response: 302 Redirect to original URL
GET /api/v1/urls/{short_code}/analytics
Response:
{
"total_clicks": 15000,
"unique_visitors": 12000,
"clicks_by_country": {...}
}
Data Model
Keep it minimal for the sketch phase:
-- URL Shortener Schema
urls table:
- id: BIGINT (primary key)
- short_code: VARCHAR(10) (indexed, unique)
- long_url: VARCHAR(2048)
- user_id: BIGINT (nullable)
- created_at: TIMESTAMP
- expires_at: TIMESTAMP
- click_count: BIGINT
analytics table:
- id: BIGINT
- url_id: BIGINT (foreign key)
- timestamp: TIMESTAMP
- country: VARCHAR(2)
- referrer: VARCHAR(512)
- user_agent: VARCHAR(512)
High-Level Architecture Diagram
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Client    │────▶│Load Balancer │────▶│     API     │
│             │     │(Round-Robin) │     │   Servers   │
└─────────────┘     └──────────────┘     └──────┬──────┘
                                                │
                   ┌────────────────────────────┼──────────────────┐
                   │                            │                  │
             ┌─────▼─────┐               ┌──────▼───────┐   ┌──────▼──────┐
             │   Cache   │               │   Primary    │   │    Read     │
             │  (Redis)  │               │   Database   │   │  Replicas   │
             └───────────┘               │ (PostgreSQL) │   └─────────────┘
                                         └──────────────┘
Phase 3: Scale (10-15 minutes)
Now make it handle millions of users.
Horizontal Scaling
API Layer: Scale web servers horizontally behind load balancer
Before: 1 server handling 1000 QPS
After: 10 servers each handling 100 QPS
Database Layer: Read replicas + Sharding
Read Replicas:
- 1 Primary (writes)
- 5 Replicas (reads)
- Replication lag: <100ms (acceptable for URL shortener)
Sharding Strategy:
- Shard by short_code hash
- 16 shards to start
- Consistent hashing for rebalancing
Caching Strategy
Cache Layers:
1. Browser Cache (client-side):
- 301 redirects are cached permanently by browsers, eliminating repeat requests
- Trade-off: the API above returns 302 so that every click still reaches the server for analytics
2. CDN (CloudFlare/AWS CloudFront):
- Cache popular URLs at edge
- 80% of traffic served from cache
3. Application Cache (Redis):
- Hot URLs in memory
- TTL: 24 hours
- Size: Top 1 million URLs
Cache Hit Rates to Target:
- CDN: 80%+
- Redis: 15-20%
- Database: <5%
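The application-cache layer above follows the classic cache-aside pattern. A minimal sketch (a plain dict stands in for Redis so the example is self-contained):

```python
cache = {}  # stand-in for Redis

def get_long_url(short_code, db_lookup):
    """Check the cache first; on a miss, read the DB and populate the cache."""
    if short_code in cache:
        return cache[short_code]          # cache hit: no DB round trip
    long_url = db_lookup(short_code)
    if long_url is not None:
        cache[short_code] = long_url      # in production: SET with a 24h TTL
    return long_url
```

The second request for a hot URL never touches the database, which is what drives the <5% database-hit target.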
Rate Limiting
Prevent abuse:
# Token bucket algorithm (simplified to a fixed window for brevity)
class RateLimiter:
    def allow_request(self, user_id):
        # 100 requests per minute per user
        key = f"tokens:{user_id}"
        tokens = redis.get(key)
        if tokens is None:
            redis.set(key, 99, ex=60)   # new window: 100 tokens, 60s expiry
            return True
        if int(tokens) > 0:
            redis.decr(key)
            return True
        return False

# Different tiers (requests per minute)
FREE_TIER = 10
PREMIUM_TIER = 1000
ENTERPRISE = 10000
Phase 4: Solidify (15-20 minutes)
Deep dive into the most critical components.
Component 1: Short Code Generation
Approach 1: Hashing (Simple but collisions possible)
import hashlib

def generate_short_code(long_url):
    # MD5 the URL and take the first 7 hex characters
    return hashlib.md5(long_url.encode()).hexdigest()[:7]

# Problem: Collisions!
# Solution: check uniqueness in the DB; re-hash with a salt on collision
Approach 2: Base62 Encoding (Better)
import string

BASE62 = string.ascii_letters + string.digits  # a-zA-Z0-9

def encode_base62(num):
    """Convert a numeric ID to a base62 string"""
    if num == 0:
        return BASE62[0]
    result = []
    while num:
        result.append(BASE62[num % 62])
        num //= 62
    return ''.join(reversed(result))

# Generate a unique auto-increment ID in the database, then encode it
# ID 0 -> "a", ID 1 -> "b", ID 62 -> "ba"
# Supports 62^7 ≈ 3.5 trillion URLs with 7 characters
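For completeness, decoding is the symmetric operation; a short sketch assuming the same BASE62 alphabet:

```python
import string

BASE62 = string.ascii_letters + string.digits  # same alphabet as encode_base62

def decode_base62(code):
    """Convert a base62 short code back to its numeric database ID."""
    num = 0
    for char in code:
        num = num * 62 + BASE62.index(char)
    return num
```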
Approach 3: Pre-generated Keys (Best for scale)
Key Generation Service:
1. Pre-generate billions of short codes
2. Store in Redis "available_codes" set
3. When creating URL: POP one code from set
4. If unused after 24 hours: Return to set
Advantages:
- No collision check needed
- O(1) generation time
- Predictable performance
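The pop-from-pool step can be sketched in a few lines (Python sets stand in for the Redis "available_codes" set; in Redis the operations would be SPOP and SADD):

```python
# Pre-generated key pool (illustrative contents)
available_codes = {"abc123", "xYz789", "q1W2e3"}
used_codes = set()

def claim_code():
    """Pop a pre-generated code in O(1); no collision check is needed."""
    code = available_codes.pop()   # Redis: SPOP available_codes
    used_codes.add(code)
    return code

def release_code(code):
    """Return a code to the pool, e.g. when a claim goes unused for 24 hours."""
    used_codes.discard(code)
    available_codes.add(code)      # Redis: SADD available_codes <code>
```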
Component 2: Database Sharding
Sharding by short_code hash:

shard = hash(short_code) % 16

Benefits:
- Even distribution across the 16 shards
- Easy to add shards (use consistent hashing to limit rebalancing)
Trade-off:
- Hash sharding gives up range queries over codes; analytics are handled by a separate pipeline anyway
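The "consistent hashing for rebalancing" mentioned in the sharding strategy can be sketched as a ring of virtual nodes (an illustrative implementation, not a production one):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Illustrative consistent-hash ring with virtual nodes."""

    def __init__(self, shards, vnodes=100):
        # Each shard owns many points on the ring to smooth the distribution
        self.ring = sorted(
            (self._hash(f"{shard}:{v}"), shard)
            for shard in shards
            for v in range(vnodes)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_shard(self, short_code):
        # Route to the first ring point clockwise of the key's hash
        idx = bisect.bisect(self.ring, (self._hash(short_code),)) % len(self.ring)
        return self.ring[idx][1]
```

Adding a 17th shard only moves the keys that fall between its new ring points and their predecessors, instead of reshuffling everything the way `% 16` would.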
Component 3: Analytics Pipeline
Don’t slow down redirects with analytics writes:
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌───────────┐
│   User   │────▶│   API    │────▶│  Kafka   │────▶│   Spark   │
│ Request  │     │  Server  │     │  Topic   │     │ Streaming │
└──────────┘     └──────────┘     └──────────┘     └─────┬─────┘
                                                         │
                        ┌────────────────────────────────┼────────────────┐
                        │                                │                │
                  ┌─────▼─────┐                   ┌──────▼───────┐   ┌────▼─────┐
                  │   Time    │                   │     Data     │   │  Real-   │
                  │  Series   │                   │  Warehouse   │   │   time   │
                  │    DB     │                   │  (BigQuery)  │   │Dashboard │
                  │(InfluxDB) │                   └──────────────┘   └──────────┘
                  └───────────┘
Flow:
1. API server publishes event to Kafka (async)
2. Spark processes streams
3. Stores aggregates in InfluxDB (fast queries)
4. Stores raw data in BigQuery (analysis)
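The fire-and-forget publish in step 1 is the key idea: the redirect path only enqueues and returns, never waiting on analytics. A sketch with a `queue.Queue` standing in for the Kafka topic:

```python
import json
import queue

events = queue.Queue()  # stand-in for the Kafka topic

def record_click(short_code, country):
    # Redirect path: enqueue and return immediately, never block on analytics
    events.put(json.dumps({"code": short_code, "country": country}))

def aggregate_clicks():
    # Downstream consumer (Spark's role in the diagram): clicks per code
    counts = {}
    while not events.empty():
        event = json.loads(events.get())
        counts[event["code"]] = counts.get(event["code"], 0) + 1
    return counts
```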
2026 Addition: AI-Native System Design
Traditional interviews are being supplemented with AI architecture questions. Here’s how to handle them.
Pattern 1: LLM-Powered Request Handling
Question: “Design a customer support system that uses LLMs”
Key Components:
1. Request Router:
- Classify queries (simple vs complex)
- Route simple to FAQ bot
- Route complex to LLM pipeline
2. Context Assembly:
- Retrieve relevant docs (RAG)
- Fetch user history
- Build prompt with context
3. LLM Service:
- Load balancing across providers (OpenAI, Anthropic, etc.)
- Fallback strategies
- Token usage tracking
4. Response Pipeline:
- Fact-checking layer
- Safety/content filtering
- Confidence scoring
- Human escalation if needed
Scale Considerations:
- LLM API costs: ~$0.01-0.10 per request
- Latency: 500ms-3s (much slower than traditional APIs)
- Caching: Cache similar queries aggressively
- Model selection: Use cheaper models for simple queries
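The router in component 1 can start as a heuristic on query length and user tier (a toy sketch; the model names are placeholders, and real routers often use a small classifier instead):

```python
def pick_model(query, tier="premium"):
    """Toy router: short queries and free-tier users get the cheap model.
    Model names are placeholders, not real product names."""
    is_simple = len(query.split()) <= 15
    if tier == "free" or is_simple:
        return "cheap-model"
    return "frontier-model"
```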
Pattern 2: Vector Database Integration
Question: “Design a semantic search system”
Architecture:
┌──────────┐     ┌──────────────┐     ┌──────────────┐
│   User   │────▶│  Embedding   │────▶│    Vector    │
│  Query   │     │   Service    │     │   Database   │
└──────────┘     └──────────────┘     └──────┬───────┘
                                             │
                ┌────────────────────────────┼─────────────────┐
                │                            │                 │
          ┌─────▼──────┐              ┌─────▼──────┐      ┌────▼─────┐
          │   OpenAI   │              │  Pinecone  │      │ Weaviate │
          │    API     │              │     or     │      │    or    │
          │            │              │  pgvector  │      │  Milvus  │
          └────────────┘              └────────────┘      └──────────┘
Key Metrics:
- Embedding time: ~100-500ms
- Vector search: ~10-50ms
- Index size: ~6KB per vector (1536 float32 dimensions)
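Under the hood, similarity search ranks vectors by cosine similarity. A brute-force sketch (real vector databases use approximate-nearest-neighbor indexes such as HNSW or IVF to hit the latencies above):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, index, k=5):
    # Exhaustive scan over all vectors; fine for a demo, not for 100M chunks
    ranked = sorted(index, key=lambda doc: cosine(query_vec, index[doc]), reverse=True)
    return ranked[:k]
```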
Pattern 3: RAG (Retrieval-Augmented Generation)
Question: “Design a document Q&A system like ChatGPT with file upload”
RAG Pipeline:
Ingestion:
1. Document Upload → PDF/Text parsing
2. Chunking (semantic boundaries, ~500 tokens)
3. Embedding generation (batch)
4. Vector DB storage with metadata
Query:
1. User asks question
2. Embed query
3. Vector similarity search (top-k=5)
4. Assemble context from retrieved chunks
5. Build prompt: "Answer based on: {context}\nQuestion: {query}"
6. Send to LLM
7. Stream response
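Steps 4-5 of the query flow amount to packing retrieved chunks into the prompt under a context budget. A sketch (the 4,000-character budget is an illustrative assumption):

```python
def build_rag_prompt(query, chunks, max_chars=4_000):
    """Pack top-ranked chunks into the prompt until the budget is spent."""
    context = ""
    for chunk in chunks:                      # chunks arrive best-first
        if len(context) + len(chunk) > max_chars:
            break                             # lower-ranked chunks are dropped
        context += chunk + "\n---\n"
    return f"Answer based on: {context}\nQuestion: {query}"
```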
Optimization:
- Re-ranking: Cross-encoder on top-k results
- Hybrid search: Vector + BM25 keyword
- Caching: Store common query embeddings
- Pre-computation: Embed popular documents ahead of time
Scale Math:
- 1M documents
- ~100 chunks per document
- 100M total chunks
- Each chunk: 1536 dimensions × 4 bytes = ~6KB
- Total: 600GB vector storage
- Query: 100 QPS, 50ms latency
Complete Example: Design Twitter/X
Requirements
- Post tweets (280 chars)
- Follow/unfollow users
- View timeline
- Like/retweet
- Scale: 500M DAU, 100M tweets/day
High-Level Design
┌──────────┐     ┌──────────┐     ┌──────────┐
│  Client  │────▶│   CDN    │────▶│   API    │
│          │     │          │     │ Gateway  │
└──────────┘     └──────────┘     └────┬─────┘
                                       │
           ┌───────────────────────────┼───────────────────────────┐
           │                           │                           │
     ┌─────▼──────┐              ┌─────▼──────┐              ┌─────▼──────┐
     │   Tweet    │              │  Timeline  │              │    User    │
     │  Service   │              │  Service   │              │  Service   │
     └─────┬──────┘              └─────┬──────┘              └─────┬──────┘
           │                           │                           │
     ┌─────▼──────┐              ┌─────▼──────┐              ┌─────▼──────┐
     │   Tweet    │              │  Timeline  │              │    User    │
     │     DB     │              │   Cache    │              │     DB     │
     │(Cassandra) │              │  (Redis)   │              │(PostgreSQL)│
     └────────────┘              └────────────┘              └────────────┘
Timeline Generation (Fan-out Approach)
Problem: User with 1M followers posts
Solution: Fan-out on write
1. User posts tweet
2. Store in Tweet DB
3. Push to all followers' timelines (async)
4. Timeline is pre-computed, fast to read
Trade-off:
- Write is expensive (1M writes for 1M followers)
- Read is cheap (O(1) lookup)
- Acceptable for Twitter (reads >> writes)
For celebrities (10M+ followers):
- Don't fan out on write
- Merge on read: combine the follower's precomputed timeline with the celebrity's recent tweets at request time
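The fan-out decision boils down to a single branch on follower count. A sketch (in-memory dicts stand in for the timeline cache and celebrity feed, and the threshold parameter is illustrative):

```python
def post_tweet(author, tweet_id, followers, timelines, celebrity_feed,
               celebrity_threshold=1_000_000):
    """Fan out on write for normal users; celebrities are merged at read time."""
    if len(followers) >= celebrity_threshold:
        # Too many followers to push to: store once, merge on read
        celebrity_feed.setdefault(author, []).append(tweet_id)
    else:
        # Push the tweet onto every follower's precomputed timeline
        for follower in followers:
            timelines.setdefault(follower, []).insert(0, tweet_id)  # newest first
```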
Database Choice
Tweet Storage:
- Cassandra (write-heavy, time-series)
- Shard by user_id
- 3 replicas for durability
Timeline:
- Redis (in-memory, fast reads)
- TTL: 7 days
- Fallback to DB for old tweets
User Relationships:
- Graph DB (Neo4j) or PostgreSQL
- Followers/following as edges
Complete Example: Design ChatGPT
AI-Native Considerations
Components:
1. Request Handler:
- Token usage validation
- Rate limiting (RPM/TPM)
- Model selection (GPT-4 vs GPT-3.5)
2. Context Management:
- Conversation history retrieval
- Token counting (4 chars ≈ 1 token)
- Context window management (8K, 32K, 128K limits)
3. LLM Router:
- Load balancing across providers
- Fallback: OpenAI → Anthropic → Azure
- Circuit breaker for failures
4. Streaming:
- SSE (Server-Sent Events) for token streaming
- Buffer tokens for smooth delivery
- Handle backpressure
5. Cost Tracking:
- Input tokens: $0.01-0.03 per 1K
- Output tokens: $0.03-0.06 per 1K
- Track per user/organization
- Budget alerts
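Context-window management (component 2) usually means trimming history from the oldest message forward, using the rough 4-characters-per-token estimate above. A sketch:

```python
def trim_history(messages, max_tokens=8_000):
    """Keep the most recent messages that fit the context window,
    using the rough 4-characters-per-token estimate."""
    def est_tokens(message):
        return len(message["content"]) // 4 + 4   # +4 for role/format overhead
    kept, total = [], 0
    for message in reversed(messages):            # walk newest-first
        cost = est_tokens(message)
        if total + cost > max_tokens:
            break                                 # oldest messages fall off
        kept.append(message)
        total += cost
    return list(reversed(kept))                   # restore chronological order
```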
Conversation Storage
Conversation Model:
- conversation_id: UUID
- user_id: BIGINT
- model: VARCHAR (gpt-4, claude-3, etc.)
- messages: JSONB [
    {role: "user", content: "...", timestamp: "..."},
    {role: "assistant", content: "...", tokens_used: 150}
  ]
- total_tokens: INT
- created_at: TIMESTAMP
Storage Optimization:
- Compress old conversations (>30 days)
- Archive to S3 after 90 days
- Hot storage: Redis for active conversations
Red Flags Interviewers Watch For
Avoid these mistakes:
Red Flag 1: Ignoring Trade-offs
“I’ll use a distributed database” without explaining why or the trade-offs.
Better: “I’m choosing Cassandra over PostgreSQL because we need to scale writes horizontally, but this means we lose strong consistency and have to handle eventual consistency in the application.”
Red Flag 2: Over-engineering
Designing for 1 billion users when requirements say 1 million.
Better: Start simple, then scale. “I’ll start with a single database, then shard when we hit 10M users.”
Red Flag 3: Single Points of Failure
“Here’s my database server” (singular)
Better: “I have a primary and two replicas. If the primary fails, one replica is promoted.”
Red Flag 4: No Monitoring
No mention of observability, metrics, or alerting.
Better: “I’ll use Prometheus for metrics, Grafana for dashboards, and PagerDuty for critical alerts.”
Red Flag 5: Ignoring Security
No mention of HTTPS, authentication, or data protection.
Better: “All traffic uses HTTPS. Sensitive data is encrypted at rest. API requires authentication via JWT.”
Practice Strategy
Week 1-2: Learn Patterns
- Study the systems above
- Draw architectures on a whiteboard (or paper)
- Time yourself: 35 minutes per design
Week 3-4: Practice Problems
Classic questions to master:
- Design URL Shortener
- Design Twitter
- Design Uber/Lyft
- Design WhatsApp
- Design YouTube
- Design Rate Limiter
- Design Key-Value Store
- Design Web Crawler
Modern questions (2026):
- Design ChatGPT
- Design AI Agent Platform
- Design Semantic Search
- Design Real-time Translation
Week 5-6: Mock Interviews
- Practice with friends
- Use Pramp or interviewing.io
- Record yourself, review timing
Resources and Next Steps
Free Resources
- System Design Primer (GitHub): donnemartin/system-design-primer
- ByteByteGo Newsletter: Weekly system design concepts
- Designing Data-Intensive Applications (book): The bible of system design
Paid Resources
- ByteByteGo Course: $50-100 (comprehensive, worth it)
- DesignGuru: Subscription-based system design course
- Interviewing.io: Mock interviews with FAANG engineers
Mock Interview Services
If you want personalized feedback, I offer mock system design interviews:
- 1-hour session: $150
- Includes detailed feedback and improvement plan
- Focus on your target company (Google, Meta, etc.)
Quick Reference Card
Always Mention:
- Load balancing (distribute traffic)
- Caching (reduce latency)
- Database sharding (scale storage)
- Replication (availability + read scaling)
- CDN (serve static content)
- Monitoring (observability)
Always Ask:
- Scale expectations (DAU, QPS)
- Latency requirements
- Consistency vs availability preference
- Budget constraints
Always Explain:
- Why you chose technology X over Y
- Trade-offs of your decisions
- How you’ll scale from 1K to 1M users
Conclusion
System design interviews test your ability to:
- Scope problems effectively
- Sketch coherent architectures
- Scale to real-world traffic
- Solidify critical components
The 4S Framework keeps you structured. AI-native patterns are now essential. Practice until you can design Twitter in 35 minutes without breaking a sweat.
You’ve got this.
Want personalized feedback on your system design approach? I offer mock interviews and coaching—reach out here.
Last updated: March 17, 2026. System design is evolving—check back for updates on AI-native architectures.
Related posts: System Design: From Zero to Production, Building AI Agents