The System Design Interview Playbook: 2026 Edition with AI Considerations
A comprehensive guide to acing system design interviews in 2026. Covers traditional distributed systems plus AI-native architectures, LLM-powered apps, and vector databases.
System design interviews have evolved. In 2026, you’re not just designing databases and load balancers—you’re architecting LLM-powered systems, vector search pipelines, and AI-native applications.
I’ve conducted 50+ system design interviews at FAANG companies and startups, and I’ve noticed a clear shift: candidates who only know traditional patterns are falling behind. The bar has been raised.
This guide gives you the framework to ace any system design interview, whether it’s a classic “Design Twitter” or a modern “Design an AI Agent Platform.”
The 4S Framework
System design is overwhelming because there’s too much to cover. Use the 4S Framework to stay structured:
- Scope - Understand requirements and constraints
- Sketch - Design high-level architecture
- Scale - Handle millions of users
- Solidify - Deep dive into critical components
Let’s walk through each step with real interview examples.
Phase 1: Scope (5-10 minutes)
Never start designing without understanding what you’re building. Interviewers will penalize you for jumping into solutions prematurely.
Functional Requirements
What does the system do?
Example: Design a URL shortener
✓ Create short URLs from long URLs
✓ Redirect short URLs to original
✓ Custom aliases (optional)
✓ Analytics (optional)
✗ Image uploads (out of scope)
✗ User authentication (clarify)
Key Questions to Ask:
- “Should users be able to create custom short URLs?”
- “Do we need analytics on click-through rates?”
- “Can users delete or update their URLs?”
- “Is there a time limit for short URLs?”
Non-Functional Requirements
How should the system behave?
Priority Matrix:
┌─────────────────┬──────────┬──────────┐
│ Requirement     │ Priority │ Notes    │
├─────────────────┼──────────┼──────────┤
│ Availability    │ High     │ 99.99%   │
│ Latency         │ High     │ <100ms   │
│ Scalability     │ High     │ 100M/day │
│ Durability      │ Medium   │ No loss  │
│ Security        │ Medium   │ HTTPS    │
└─────────────────┴──────────┴──────────┘
Key Questions to Ask:
- “What’s the expected scale? Daily active users, requests per second?”
- “What’s the acceptable latency for URL creation vs redirection?”
- “What’s more important: consistency or availability?”
- “Any compliance requirements (GDPR, etc.)?”
Back-of-the-Envelope Math
Quick calculations show you understand scale:
URL Shortener Math:
- 100 million new URLs per day
- 100:1 read:write ratio
- 10 billion redirects per day
Storage:
- Each URL record: ~500 bytes
- 100M × 500B = 50GB per day
- 5 years: 50GB × 365 × 5 = ~90TB
QPS (Queries Per Second):
- Writes: 100M/day ÷ 86,400s = ~1,200 WPS
- Reads: 10B/day ÷ 86,400s = ~115,000 RPS
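These figures are easy to sanity-check in a few lines of Python (a quick sketch using the traffic assumptions above):

```python
SECONDS_PER_DAY = 86_400

new_urls_per_day = 100_000_000   # 100M writes/day (assumption above)
read_write_ratio = 100           # 100:1 reads to writes
record_bytes = 500               # ~500 bytes per URL record

writes_per_sec = new_urls_per_day / SECONDS_PER_DAY
reads_per_sec = writes_per_sec * read_write_ratio
storage_per_day_gb = new_urls_per_day * record_bytes / 1e9
five_year_tb = storage_per_day_gb * 365 * 5 / 1e3

print(f"{writes_per_sec:,.0f} WPS")        # ~1,157 WPS
print(f"{reads_per_sec:,.0f} RPS")         # ~115,741 RPS
print(f"{five_year_tb:.1f} TB / 5 years")  # ~91.2 TB
```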
Pro Tip: Memorize these numbers:
- 1 million seconds ≈ 11.5 days
- 1 billion seconds ≈ 31.7 years
- 1 KB ≈ 10^3 bytes, 1 MB ≈ 10^6, 1 GB ≈ 10^9, 1 TB ≈ 10^12
- SSD random read: ~0.1ms, HDD seek: ~10ms, network round trip (same DC): ~0.5ms
Phase 2: Sketch (10-15 minutes)
Design the high-level architecture. Keep it simple—details come later.
API Design
Start with the interface:
# URL Shortener API
POST /api/v1/urls
Request:
{
  "long_url": "https://example.com/very/long/path",
  "custom_alias": "mylink"   # optional
}
Response:
{
  "short_url": "https://short.io/abc123",
  "created_at": "2026-03-17T10:00:00Z",
  "expires_at": "2027-03-17T10:00:00Z"
}
GET /{short_code}
Response: 302 Redirect to original URL
GET /api/v1/urls/{short_code}/analytics
Response:
{
"total_clicks": 15000,
"unique_visitors": 12000,
"clicks_by_country": {...}
}
Data Model
Keep it minimal for the sketch phase:
-- URL Shortener Schema
urls table:
- id: BIGINT (primary key)
- short_code: VARCHAR(10) (indexed, unique)
- long_url: VARCHAR(2048)
- user_id: BIGINT (nullable)
- created_at: TIMESTAMP
- expires_at: TIMESTAMP
- click_count: BIGINT
analytics table:
- id: BIGINT
- url_id: BIGINT (foreign key)
- timestamp: TIMESTAMP
- country: VARCHAR(2)
- referrer: VARCHAR(512)
- user_agent: VARCHAR(512)
High-Level Architecture Diagram
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Client    │────▶│Load Balancer │────▶│     API     │
│             │     │(Round-Robin) │     │   Servers   │
└─────────────┘     └──────────────┘     └──────┬──────┘
                                                │
                   ┌────────────────────────────┼──────────────────┐
                   │                            │                  │
             ┌─────▼─────┐               ┌──────▼───────┐   ┌──────▼──────┐
             │   Cache   │               │   Primary    │   │    Read     │
             │  (Redis)  │               │   Database   │   │  Replicas   │
             └───────────┘               │ (PostgreSQL) │   └─────────────┘
                                         └──────────────┘
Phase 3: Scale (10-15 minutes)
Now make it handle millions of users.
Horizontal Scaling
API Layer: Scale web servers horizontally behind load balancer
Before: 1 server handling 1000 QPS
After: 10 servers each handling 100 QPS
Database Layer: Read replicas + Sharding
Read Replicas:
- 1 Primary (writes)
- 5 Replicas (reads)
- Replication lag: <100ms (acceptable for URL shortener)
Sharding Strategy:
- Shard by short_code hash
- 16 shards to start
- Consistent hashing for rebalancing
Caching Strategy
Cache Layers:
1. Browser Cache (client-side):
- 301 redirects are cached permanently by browsers, eliminating repeat requests
- Trade-off: the API above returns 302 so that every click still reaches the server for analytics
2. CDN (CloudFlare/AWS CloudFront):
- Cache popular URLs at edge
- 80% of traffic served from cache
3. Application Cache (Redis):
- Hot URLs in memory
- TTL: 24 hours
- Size: Top 1 million URLs
Cache Hit Rates to Target:
- CDN: 80%+
- Redis: 15-20%
- Database: <5%
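The application-cache layer above follows the classic cache-aside pattern. A minimal sketch (a plain dict stands in for Redis so the example is self-contained):

```python
cache = {}  # stand-in for Redis

def get_long_url(short_code, db_lookup):
    """Check the cache first; on a miss, read the DB and populate the cache."""
    if short_code in cache:
        return cache[short_code]          # cache hit: no DB round trip
    long_url = db_lookup(short_code)
    if long_url is not None:
        cache[short_code] = long_url      # in production: SET with a 24h TTL
    return long_url
```

The second request for a hot URL never touches the database, which is what drives the <5% database-hit target.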
Rate Limiting
Prevent abuse:
# Token bucket algorithm (simplified to a fixed window for brevity)
class RateLimiter:
    def allow_request(self, user_id):
        # 100 requests per minute per user
        key = f"tokens:{user_id}"
        tokens = redis.get(key)
        if tokens is None:
            redis.set(key, 99, ex=60)   # new window: 100 tokens, 60s expiry
            return True
        if int(tokens) > 0:
            redis.decr(key)
            return True
        return False

# Different tiers (requests per minute)
FREE_TIER = 10
PREMIUM_TIER = 1000
ENTERPRISE = 10000
Phase 4: Solidify (15-20 minutes)
Deep dive into the most critical components.
Component 1: Short Code Generation
Approach 1: Hashing (Simple but collisions possible)
import hashlib

def generate_short_code(long_url):
    # MD5 the URL and take the first 7 hex characters
    return hashlib.md5(long_url.encode()).hexdigest()[:7]

# Problem: Collisions!
# Solution: check uniqueness in the DB; re-hash with a salt on collision
Approach 2: Base62 Encoding (Better)
import string

BASE62 = string.ascii_letters + string.digits  # a-zA-Z0-9

def encode_base62(num):
    """Convert a numeric ID to a base62 string"""
    if num == 0:
        return BASE62[0]
    result = []
    while num:
        result.append(BASE62[num % 62])
        num //= 62
    return ''.join(reversed(result))

# Generate a unique auto-increment ID in the database, then encode it
# ID 0 -> "a", ID 1 -> "b", ID 62 -> "ba"
# Supports 62^7 ≈ 3.5 trillion URLs with 7 characters
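For completeness, decoding is the symmetric operation; a short sketch assuming the same BASE62 alphabet:

```python
import string

BASE62 = string.ascii_letters + string.digits  # same alphabet as encode_base62

def decode_base62(code):
    """Convert a base62 short code back to its numeric database ID."""
    num = 0
    for char in code:
        num = num * 62 + BASE62.index(char)
    return num
```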
Approach 3: Pre-generated Keys (Best for scale)
Key Generation Service:
1. Pre-generate billions of short codes
2. Store in Redis "available_codes" set
3. When creating URL: POP one code from set
4. If unused after 24 hours: Return to set
Advantages:
- No collision check needed
- O(1) generation time
- Predictable performance
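The pop-from-pool step can be sketched in a few lines (Python sets stand in for the Redis "available_codes" set; in Redis the operations would be SPOP and SADD):

```python
# Pre-generated key pool (illustrative contents)
available_codes = {"abc123", "xYz789", "q1W2e3"}
used_codes = set()

def claim_code():
    """Pop a pre-generated code in O(1); no collision check is needed."""
    code = available_codes.pop()   # Redis: SPOP available_codes
    used_codes.add(code)
    return code

def release_code(code):
    """Return a code to the pool, e.g. when a claim goes unused for 24 hours."""
    used_codes.discard(code)
    available_codes.add(code)      # Redis: SADD available_codes <code>
```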
Component 2: Database Sharding
Sharding by short_code hash:

shard = hash(short_code) % 16

Benefits:
- Even distribution across the 16 shards
- Easy to add shards (use consistent hashing to limit rebalancing)
Trade-off:
- Hash sharding gives up range queries over codes; analytics are handled by a separate pipeline anyway
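The "consistent hashing for rebalancing" mentioned in the sharding strategy can be sketched as a ring of virtual nodes (an illustrative implementation, not a production one):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Illustrative consistent-hash ring with virtual nodes."""

    def __init__(self, shards, vnodes=100):
        # Each shard owns many points on the ring to smooth the distribution
        self.ring = sorted(
            (self._hash(f"{shard}:{v}"), shard)
            for shard in shards
            for v in range(vnodes)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_shard(self, short_code):
        # Route to the first ring point clockwise of the key's hash
        idx = bisect.bisect(self.ring, (self._hash(short_code),)) % len(self.ring)
        return self.ring[idx][1]
```

Adding a 17th shard only moves the keys that fall between its new ring points and their predecessors, instead of reshuffling everything the way `% 16` would.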
Component 3: Analytics Pipeline
Don’t slow down redirects with analytics writes:
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌───────────┐
│   User   │────▶│   API    │────▶│  Kafka   │────▶│   Spark   │
│ Request  │     │  Server  │     │  Topic   │     │ Streaming │
└──────────┘     └──────────┘     └──────────┘     └─────┬─────┘
                                                         │
                        ┌────────────────────────────────┼────────────────┐
                        │                                │                │
                  ┌─────▼─────┐                   ┌──────▼───────┐   ┌────▼─────┐
                  │   Time    │                   │     Data     │   │  Real-   │
                  │  Series   │                   │  Warehouse   │   │   time   │
                  │    DB     │                   │  (BigQuery)  │   │Dashboard │
                  │(InfluxDB) │                   └──────────────┘   └──────────┘
                  └───────────┘
Flow:
1. API server publishes event to Kafka (async)
2. Spark processes streams
3. Stores aggregates in InfluxDB (fast queries)
4. Stores raw data in BigQuery (analysis)
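The fire-and-forget publish in step 1 is the key idea: the redirect path only enqueues and returns, never waiting on analytics. A sketch with a `queue.Queue` standing in for the Kafka topic:

```python
import json
import queue

events = queue.Queue()  # stand-in for the Kafka topic

def record_click(short_code, country):
    # Redirect path: enqueue and return immediately, never block on analytics
    events.put(json.dumps({"code": short_code, "country": country}))

def aggregate_clicks():
    # Downstream consumer (Spark's role in the diagram): clicks per code
    counts = {}
    while not events.empty():
        event = json.loads(events.get())
        counts[event["code"]] = counts.get(event["code"], 0) + 1
    return counts
```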
2026 Addition: AI-Native System Design
Traditional interviews are being supplemented with AI architecture questions. Here’s how to handle them.
Pattern 1: LLM-Powered Request Handling
Question: “Design a customer support system that uses LLMs”
Key Components:
1. Request Router:
- Classify queries (simple vs complex)
- Route simple to FAQ bot
- Route complex to LLM pipeline
2. Context Assembly:
- Retrieve relevant docs (RAG)
- Fetch user history
- Build prompt with context
3. LLM Service:
- Load balancing across providers (OpenAI, Anthropic, etc.)
- Fallback strategies
- Token usage tracking
4. Response Pipeline:
- Fact-checking layer
- Safety/content filtering
- Confidence scoring
- Human escalation if needed
Scale Considerations:
- LLM API costs: ~$0.01-0.10 per request
- Latency: 500ms-3s (much slower than traditional APIs)
- Caching: Cache similar queries aggressively
- Model selection: Use cheaper models for simple queries
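The router in component 1 can start as a heuristic on query length and user tier (a toy sketch; the model names are placeholders, and real routers often use a small classifier instead):

```python
def pick_model(query, tier="premium"):
    """Toy router: short queries and free-tier users get the cheap model.
    Model names are placeholders, not real product names."""
    is_simple = len(query.split()) <= 15
    if tier == "free" or is_simple:
        return "cheap-model"
    return "frontier-model"
```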
Pattern 2: Vector Database Integration
Question: “Design a semantic search system”
Architecture:
┌──────────┐     ┌──────────────┐     ┌──────────────┐
│   User   │────▶│  Embedding   │────▶│    Vector    │
│  Query   │     │   Service    │     │   Database   │
└──────────┘     └──────────────┘     └──────┬───────┘
                                             │
                ┌────────────────────────────┼─────────────────┐
                │                            │                 │
          ┌─────▼──────┐              ┌─────▼──────┐      ┌────▼─────┐
          │   OpenAI   │              │  Pinecone  │      │ Weaviate │
          │    API     │              │     or     │      │    or    │
          │            │              │  pgvector  │      │  Milvus  │
          └────────────┘              └────────────┘      └──────────┘
Key Metrics:
- Embedding time: ~100-500ms
- Vector search: ~10-50ms
- Index size: ~6KB per vector (1536 float32 dimensions)
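Under the hood, similarity search ranks vectors by cosine similarity. A brute-force sketch (real vector databases use approximate-nearest-neighbor indexes such as HNSW or IVF to hit the latencies above):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, index, k=5):
    # Exhaustive scan over all vectors; fine for a demo, not for 100M chunks
    ranked = sorted(index, key=lambda doc: cosine(query_vec, index[doc]), reverse=True)
    return ranked[:k]
```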
Pattern 3: RAG (Retrieval-Augmented Generation)
Question: “Design a document Q&A system like ChatGPT with file upload”
RAG Pipeline:
Ingestion:
1. Document Upload → PDF/Text parsing
2. Chunking (semantic boundaries, ~500 tokens)
3. Embedding generation (batch)
4. Vector DB storage with metadata
Query:
1. User asks question
2. Embed query
3. Vector similarity search (top-k=5)
4. Assemble context from retrieved chunks
5. Build prompt: "Answer based on: {context}\nQuestion: {query}"
6. Send to LLM
7. Stream response
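Steps 4-5 of the query flow amount to packing retrieved chunks into the prompt under a context budget. A sketch (the 4,000-character budget is an illustrative assumption):

```python
def build_rag_prompt(query, chunks, max_chars=4_000):
    """Pack top-ranked chunks into the prompt until the budget is spent."""
    context = ""
    for chunk in chunks:                      # chunks arrive best-first
        if len(context) + len(chunk) > max_chars:
            break                             # lower-ranked chunks are dropped
        context += chunk + "\n---\n"
    return f"Answer based on: {context}\nQuestion: {query}"
```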
Optimization:
- Re-ranking: Cross-encoder on top-k results
- Hybrid search: Vector + BM25 keyword
- Caching: Store common query embeddings
- Pre-computation: Embed popular documents ahead of time
Scale Math:
- 1M documents
- ~100 chunks per document
- 100M total chunks
- Each chunk: 1536 dimensions × 4 bytes = ~6KB
- Total: 600GB vector storage
- Query: 100 QPS, 50ms latency
Complete Example: Design Twitter/X
Requirements
- Post tweets (280 chars)
- Follow/unfollow users
- View timeline
- Like/retweet
- Scale: 500M DAU, 100M tweets/day
High-Level Design
┌──────────┐     ┌──────────┐     ┌──────────┐
│  Client  │────▶│   CDN    │────▶│   API    │
│          │     │          │     │ Gateway  │
└──────────┘     └──────────┘     └────┬─────┘
                                       │
           ┌───────────────────────────┼───────────────────────────┐
           │                           │                           │
     ┌─────▼──────┐              ┌─────▼──────┐              ┌─────▼──────┐
     │   Tweet    │              │  Timeline  │              │    User    │
     │  Service   │              │  Service   │              │  Service   │
     └─────┬──────┘              └─────┬──────┘              └─────┬──────┘
           │                           │                           │
     ┌─────▼──────┐              ┌─────▼──────┐              ┌─────▼──────┐
     │   Tweet    │              │  Timeline  │              │    User    │
     │     DB     │              │   Cache    │              │     DB     │
     │(Cassandra) │              │  (Redis)   │              │(PostgreSQL)│
     └────────────┘              └────────────┘              └────────────┘
Timeline Generation (Fan-out Approach)
Problem: User with 1M followers posts
Solution: Fan-out on write
1. User posts tweet
2. Store in Tweet DB
3. Push to all followers' timelines (async)
4. Timeline is pre-computed, fast to read
Trade-off:
- Write is expensive (1M writes for 1M followers)
- Read is cheap (O(1) lookup)
- Acceptable for Twitter (reads >> writes)
For celebrities (10M+ followers):
- Don't fan out on write
- Merge on read: combine the follower's precomputed timeline with the celebrity's recent tweets at request time
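The fan-out decision boils down to a single branch on follower count. A sketch (in-memory dicts stand in for the timeline cache and celebrity feed, and the threshold parameter is illustrative):

```python
def post_tweet(author, tweet_id, followers, timelines, celebrity_feed,
               celebrity_threshold=1_000_000):
    """Fan out on write for normal users; celebrities are merged at read time."""
    if len(followers) >= celebrity_threshold:
        # Too many followers to push to: store once, merge on read
        celebrity_feed.setdefault(author, []).append(tweet_id)
    else:
        # Push the tweet onto every follower's precomputed timeline
        for follower in followers:
            timelines.setdefault(follower, []).insert(0, tweet_id)  # newest first
```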
Database Choice
Tweet Storage:
- Cassandra (write-heavy, time-series)
- Shard by user_id
- 3 replicas for durability
Timeline:
- Redis (in-memory, fast reads)
- TTL: 7 days
- Fallback to DB for old tweets
User Relationships:
- Graph DB (Neo4j) or PostgreSQL
- Followers/following as edges
Complete Example: Design ChatGPT
AI-Native Considerations
Components:
1. Request Handler:
- Token usage validation
- Rate limiting (RPM/TPM)
- Model selection (GPT-4 vs GPT-3.5)
2. Context Management:
- Conversation history retrieval
- Token counting (4 chars ≈ 1 token)
- Context window management (8K, 32K, 128K limits)
3. LLM Router:
- Load balancing across providers
- Fallback: OpenAI → Anthropic → Azure
- Circuit breaker for failures
4. Streaming:
- SSE (Server-Sent Events) for token streaming
- Buffer tokens for smooth delivery
- Handle backpressure
5. Cost Tracking:
- Input tokens: $0.01-0.03 per 1K
- Output tokens: $0.03-0.06 per 1K
- Track per user/organization
- Budget alerts
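Context-window management (component 2) usually means trimming history from the oldest message forward, using the rough 4-characters-per-token estimate above. A sketch:

```python
def trim_history(messages, max_tokens=8_000):
    """Keep the most recent messages that fit the context window,
    using the rough 4-characters-per-token estimate."""
    def est_tokens(message):
        return len(message["content"]) // 4 + 4   # +4 for role/format overhead
    kept, total = [], 0
    for message in reversed(messages):            # walk newest-first
        cost = est_tokens(message)
        if total + cost > max_tokens:
            break                                 # oldest messages fall off
        kept.append(message)
        total += cost
    return list(reversed(kept))                   # restore chronological order
```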
Conversation Storage
Conversation Model:
- conversation_id: UUID
- user_id: BIGINT
- model: VARCHAR (gpt-4, claude-3, etc.)
- messages: JSONB [
    {role: "user", content: "...", timestamp: "..."},
    {role: "assistant", content: "...", tokens_used: 150}
  ]
- total_tokens: INT
- created_at: TIMESTAMP
Storage Optimization:
- Compress old conversations (>30 days)
- Archive to S3 after 90 days
- Hot storage: Redis for active conversations
Red Flags Interviewers Watch For
Avoid these mistakes:
Red Flag 1: Ignoring Trade-offs
“I’ll use a distributed database” without explaining why or the trade-offs.
Better: “I’m choosing Cassandra over PostgreSQL because we need to scale writes horizontally, but this means we lose strong consistency and have to handle eventual consistency in the application.”
Red Flag 2: Over-engineering
Designing for 1 billion users when requirements say 1 million.
Better: Start simple, then scale. “I’ll start with a single database, then shard when we hit 10M users.”
Red Flag 3: Single Points of Failure
“Here’s my database server” (singular)
Better: “I have a primary and two replicas. If the primary fails, one replica is promoted.”
Red Flag 4: No Monitoring
No mention of observability, metrics, or alerting.
Better: “I’ll use Prometheus for metrics, Grafana for dashboards, and PagerDuty for critical alerts.”
Red Flag 5: Ignoring Security
No mention of HTTPS, authentication, or data protection.
Better: “All traffic uses HTTPS. Sensitive data is encrypted at rest. API requires authentication via JWT.”
Practice Strategy
Week 1-2: Learn Patterns
- Study the systems above
- Draw architectures on a whiteboard (or paper)
- Time yourself: 35 minutes per design
Week 3-4: Practice Problems
Classic questions to master:
- Design URL Shortener
- Design Twitter
- Design Uber/Lyft
- Design WhatsApp
- Design YouTube
- Design Rate Limiter
- Design Key-Value Store
- Design Web Crawler
Modern questions (2026):
- Design ChatGPT
- Design AI Agent Platform
- Design Semantic Search
- Design Real-time Translation
Week 5-6: Mock Interviews
- Practice with friends
- Use Pramp or interviewing.io
- Record yourself, review timing
Resources and Next Steps
Free Resources
- System Design Primer (GitHub): donnemartin/system-design-primer
- ByteByteGo Newsletter: Weekly system design concepts
- Designing Data-Intensive Applications (book): The bible of system design
Paid Resources
- ByteByteGo Course: $50-100 (comprehensive, worth it)
- DesignGuru: Subscription-based system design course
- Interviewing.io: Mock interviews with FAANG engineers
Mock Interview Services
If you want personalized feedback, I offer mock system design interviews:
- 1-hour session: $150
- Includes detailed feedback and improvement plan
- Focus on your target company (Google, Meta, etc.)
Quick Reference Card
Always Mention:
- Load balancing (distribute traffic)
- Caching (reduce latency)
- Database sharding (scale storage)
- Replication (availability + read scaling)
- CDN (serve static content)
- Monitoring (observability)
Always Ask:
- Scale expectations (DAU, QPS)
- Latency requirements
- Consistency vs availability preference
- Budget constraints
Always Explain:
- Why you chose technology X over Y
- Trade-offs of your decisions
- How you’ll scale from 1K to 1M users
Conclusion
System design interviews test your ability to:
- Scope problems effectively
- Sketch coherent architectures
- Scale to real-world traffic
- Solidify critical components
The 4S Framework keeps you structured. AI-native patterns are now essential. Practice until you can design Twitter in 35 minutes without breaking a sweat.
You’ve got this.
Want personalized feedback on your system design approach? I offer mock interviews and coaching—reach out here.
Last updated: March 17, 2026. System design is evolving—check back for updates on AI-native architectures.
Related posts: System Design: From Zero to Production, Building AI Agents