Theoretical Concepts of Latency in FAANG System Design
Objectives of Latency Optimization in System Design
Latency measures how fast a system responds to a request. For FAANG-level systems, latency isn't just a number; it's a critical factor in both user experience (UX) and infrastructure design.
Key goals:
- Ensure fast, reliable, and consistent user responses.
- Satisfy SLAs/SLOs (e.g., 95% of requests < 100ms).
- Improve user retention and engagement.
- Support high throughput without degrading performance.
- Optimize cost-performance trade-offs with smart infra design.
Theoretical Foundations of Latency
1. Types of Latency
Common contributors along a request path:
- Network latency – time on the wire between client, edge, and services
- Processing latency – time the application spends computing the response
- Queueing latency – time spent waiting in queues, thread pools, or load balancers under load
- Disk I/O latency – time for storage reads and writes
- Database latency – query execution, locks, and contention
2. Latency vs Throughput
- Latency: Time taken for a single request (e.g., 200ms for a photo upload)
- Throughput: Requests handled per second (e.g., 10K uploads/sec)
- Trade-off: Reducing latency might reduce throughput and vice versa. FAANG systems aim to optimize both, often with resource scaling and asynchronous processing.
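A quick back-of-the-envelope way to see the relationship is Little's Law (requests in flight = throughput × latency). A minimal sketch with illustrative numbers, not taken from any real system:

```python
# Little's Law: concurrency (requests in flight) = throughput * latency
# Numbers below are illustrative only.

latency_s = 0.200        # 200 ms per photo upload
throughput_rps = 10_000  # 10K uploads/sec

concurrency = throughput_rps * latency_s
print(f"Requests in flight: {concurrency:.0f}")  # 2000

# Halving latency (e.g., by moving heavy work off the request path) either
# halves the concurrency needed at the same throughput, or lets the same
# worker pool serve roughly twice the throughput.
```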
3. Tail Latency: P95, P99, P999
Averages lie. FAANG engineers care about percentile latency:
- P50 – the median request (half of all requests are faster)
- P95 – 95% of requests are faster; the slowest 5% exceed this
- P99 – the slowest 1% of requests
- P999 – extreme tail: the slowest 1 in 1,000 requests
A few slow components (e.g., a hot DB shard or a cache miss) can kill UX. Design for the worst case, not the average.
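A minimal sketch of how percentile latency can be computed from raw request timings; the synthetic sample data and the use of NumPy are assumptions for illustration:

```python
import numpy as np

# Hypothetical request latencies (ms) collected over a monitoring window.
latencies_ms = np.random.lognormal(mean=3.0, sigma=0.6, size=100_000)

p50, p95, p99, p999 = np.percentile(latencies_ms, [50, 95, 99, 99.9])
print(f"p50={p50:.1f}ms  p95={p95:.1f}ms  p99={p99:.1f}ms  p99.9={p999:.1f}ms")

# The mean can look healthy while p99/p99.9 are far worse, which is
# exactly why averages are misleading for UX.
print(f"mean={latencies_ms.mean():.1f}ms")
```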
4. Latency Budget Breakdown
Start with a goal like end-to-end latency ≤ 200ms, then break it down per component (an illustrative split is sketched below).
Every component must respect its slice of the budget; this helps identify bottlenecks early.
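One possible split of a 200ms budget; the component names and allocations below are hypothetical, for illustration only:

```python
# Hypothetical 200 ms end-to-end latency budget, split per component.
budget_ms = {
    "client + TLS handshake":  20,
    "CDN / edge":              10,
    "load balancer":            5,
    "application service":     60,
    "cache lookup":             5,
    "database query":          70,
    "serialization + network": 30,
}

total = sum(budget_ms.values())
assert total <= 200, f"budget exceeded: {total}ms"
print(f"budget used: {total}ms of 200ms")
```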
5. Geo-Distribution & Latency
Latency increases with distance.
Fiber adds roughly 5ms of one-way latency per 1,000 km.
Global services need:
- CDNs (e.g., Cloudflare, Akamai)
- Edge caching
- Region-based failovers
- Read replicas close to users
✨ Example: Instagram loads images via CDN, minimizing latency worldwide.
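A rough distance-to-latency estimate built on the ~5ms-per-1,000km rule of thumb above; the example routes and distances are assumptions:

```python
MS_PER_1000_KM = 5  # one-way propagation in fiber (rule of thumb above)

def min_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip time from propagation delay alone
    (ignores routing hops, queueing, and processing)."""
    return 2 * distance_km / 1000 * MS_PER_1000_KM

# Hypothetical routes:
print(f"Same region (~200 km):          {min_rtt_ms(200):.0f} ms RTT")
print(f"US East -> US West (~4,000 km): {min_rtt_ms(4_000):.0f} ms RTT")
print(f"US -> India (~13,000 km):       {min_rtt_ms(13_000):.0f} ms RTT")
# Serving from a CDN/edge node near the user removes most of this floor.
```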
6. Latency Amortization
Techniques to reduce per-request latency:
- Batching: Group requests (e.g., write logs in bulk)
- Pipelining: Send multiple requests without waiting
- Async processing: Offload heavy ops (e.g., send email after response)
Example: A user uploads a video → UI responds fast → transcoding runs async.
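A minimal sketch of that upload pattern: acknowledge the request immediately and push the heavy work to a background worker. The queue, worker, and handler names are assumptions, not any specific production API:

```python
import queue
import threading
import time

transcode_queue: "queue.Queue[str]" = queue.Queue()

def transcode_worker() -> None:
    # Drains the queue in the background; heavy work never blocks the request path.
    while True:
        video_id = transcode_queue.get()
        time.sleep(2)  # stand-in for expensive transcoding
        print(f"[worker] finished transcoding {video_id}")
        transcode_queue.task_done()

threading.Thread(target=transcode_worker, daemon=True).start()

def handle_upload(video_id: str) -> dict:
    # Persist metadata, enqueue heavy work, return immediately.
    transcode_queue.put(video_id)
    return {"status": "accepted", "video_id": video_id}

print(handle_upload("vid_123"))  # responds right away, not after transcoding
transcode_queue.join()           # demo only: wait so the worker output prints
```

In production the in-process queue would typically be a durable broker (e.g., Kafka or SQS), but the shape of the latency win is the same.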
7. Caching to Reduce Latency
Latency reduction via caching layers:
- Browser cache → client-side speed
- CDN cache → static assets fast
- Reverse proxy (e.g., NGINX) → dynamic caching
- In-memory cache (Redis/Memcached) → avoid DB hits
Strategy: Cache early, invalidate smartly.
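A cache-aside sketch using Redis via the redis-py client; the key format, TTL, and the load_user_from_db helper are assumptions for illustration:

```python
import json
import redis  # assumes the redis-py client and a local Redis instance

r = redis.Redis(host="localhost", port=6379)

def load_user_from_db(user_id: str) -> dict:
    # Stand-in for a slow database query.
    return {"id": user_id, "name": "example"}

def get_user(user_id: str, ttl_s: int = 300) -> dict:
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:                 # cache hit: no DB round trip
        return json.loads(cached)
    user = load_user_from_db(user_id)      # cache miss: go to the DB
    r.setex(key, ttl_s, json.dumps(user))  # populate with a TTL ("invalidate smartly")
    return user
```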
8. Backpressure, Rate Limiting & Circuit Breaking
Prevent systems from collapsing under load:
- Rate limiting (e.g., 100 req/min)
- Queue size limits
- Circuit breakers (fail fast)
- Retry logic with exponential backoff
Prevents cascading failures & keeps latency consistent under pressure.
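A sketch of one of the techniques listed above, retries with exponential backoff and jitter; the call_service placeholder and timing constants are assumptions:

```python
import random
import time

def call_with_backoff(call_service, max_retries: int = 4,
                      base_delay_s: float = 0.1, max_delay_s: float = 2.0):
    """Retry a flaky call with exponentially growing, jittered delays so
    retries don't pile up and amplify load; fail fast after max_retries."""
    for attempt in range(max_retries + 1):
        try:
            return call_service()
        except Exception:
            if attempt == max_retries:
                raise  # give up; let the caller (or a circuit breaker) handle it
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            time.sleep(delay * random.random())  # "full jitter"
```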
9. Latency-Aware Data Modeling
- Denormalization in NoSQL → reduce joins
- Precomputed data → fast lookup
- Search indexing (e.g., Elasticsearch)
- Materialized views in SQL
Optimize schema for read latency, even if writes get a bit heavier.
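A small illustration of the denormalization idea: the post document embeds the author fields it needs, so a feed read is a single lookup with no join. The field names are hypothetical:

```python
# Normalized: reading a feed item needs a second lookup ("join") for the author.
post_normalized = {"post_id": "p1", "author_id": "u42", "caption": "hello"}

# Denormalized: author fields are copied into the post at write time, so the
# read path is one key lookup. Writes get heavier (the embedded copy must be
# updated if the author renames), but read latency drops.
post_denormalized = {
    "post_id": "p1",
    "caption": "hello",
    "author": {"id": "u42", "username": "alice", "avatar_url": "https://cdn.example.com/u42.jpg"},
}
```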
10. Latency Observability and Monitoring
Track latency in real-time:
- APM tools: Datadog, New Relic
- Traces: Jaeger, Zipkin
- Metrics: Prometheus + Grafana
- Histograms: For percentile tracking
Monitor p95/p99 and correlate with spikes, cache misses, or backend failures.
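A sketch of percentile-friendly instrumentation with the Python prometheus_client library; the metric name and bucket boundaries are assumptions chosen around a ~100–200ms SLO:

```python
import random
import time
from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_latency_seconds",
    "Request latency in seconds",
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)

@REQUEST_LATENCY.time()  # records elapsed time into the histogram
def handle_request() -> None:
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```

p95/p99 are then derived on the Prometheus side (e.g., with histogram_quantile over the bucket series) and graphed in Grafana.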

