# Latency Numbers Every Engineer Should Know (2025)
Jeff Dean's original latency table is over a decade old. Here is an updated version for modern hardware, plus notes on how to use it for ML serving estimates.
## The table
| Operation | Latency |
|---|---|
| L1 cache hit | ~1 ns |
| L2 cache hit | ~4 ns |
| L3 cache hit | ~40 ns |
| DRAM access | ~100 ns |
| NVMe SSD random read (4 KB) | ~100 µs |
| NVMe SSD sequential read (1 MB) | ~250 µs |
| Same-datacenter round trip | ~500 µs |
| TCP handshake (same datacenter) | ~1 ms |
| HDD seek | ~10 ms |
| Cross-region round trip (US → EU) | ~75 ms |
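One way to use the table is to encode it as constants and let code do the arithmetic. A minimal sketch (the variable names are ours; the values are the approximations above, not measurements of your hardware):

```python
# Latency constants from the table, in nanoseconds.
L1_HIT = 1
L2_HIT = 4
L3_HIT = 40
DRAM = 100
NVME_RANDOM_4K = 100_000          # 100 µs
NVME_SEQ_1MB = 250_000            # 250 µs
SAME_DC_RTT = 500_000             # 500 µs
TCP_HANDSHAKE = 1_000_000         # 1 ms
HDD_SEEK = 10_000_000             # 10 ms
CROSS_REGION_RTT = 75_000_000     # 75 ms

# One cross-region round trip costs as much as 750,000 DRAM accesses.
print(CROSS_REGION_RTT / DRAM)    # 750000.0
```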
## Inference latency budget
A 100 ms p99 budget for an LLM serving endpoint breaks down roughly as follows (summed in the sketch after the list):

- network ingress: ~5 ms
- tokenisation: ~1 ms
- model forward pass: ~60–80 ms ← optimise here first
- sampling: ~2 ms
- response serialisation: ~1 ms
- network egress: ~5 ms
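A quick sanity check that the components fit the budget, taking the forward pass at the midpoint of its 60–80 ms range (the split itself is a rough estimate, not a measurement):

```python
# p99 budget components in milliseconds, from the list above;
# the forward pass is taken at 70 ms, the midpoint of 60–80 ms.
budget_ms = {
    "network ingress": 5,
    "tokenisation": 1,
    "model forward pass": 70,
    "sampling": 2,
    "response serialisation": 1,
    "network egress": 5,
}
total = sum(budget_ms.values())
print(f"total: {total} ms, slack: {100 - total} ms")
# total: 84 ms, slack: 16 ms
```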
Speculative decoding, KV-cache pinning, and dynamic batching all attack the forward pass. Reducing cross-region hops attacks the network terms. Optimise in proportion to where the time actually goes.
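The same arithmetic shows why the forward pass is the right place to start: halving the largest term moves the end-to-end number far more than halving a small one. A sketch reusing the hypothetical budget above:

```python
# Effect on the ~84 ms total of halving one component.
total_ms = 84.0
forward_ms = 70.0           # model forward pass
network_ms = 5.0 + 5.0      # ingress + egress

print(total_ms - forward_ms / 2)   # 49.0 — halving the forward pass
print(total_ms - network_ms / 2)   # 79.0 — halving the network terms
```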
## Checking your own system
```bash
# Round-trip latency to a host
ping -c 10 api.example.com

# Memory bandwidth (Linux)
sysbench memory --memory-block-size=1M --memory-total-size=10G run

# Disk read throughput (reads the raw device directly; needs root)
dd if=/dev/nvme0n1 of=/dev/null bs=1M count=1024 iflag=direct
```

## The rule of thumb
If it crosses a network boundary, assume milliseconds. If it stays on-chip, assume nanoseconds. Everything else is in between.
Use this when estimating whether a caching layer is worth the complexity, or whether a synchronous database call inside a hot path is a problem.
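For example, a back-of-envelope sketch for the caching question, with illustrative (not measured) costs: a cache hit priced at ~1 µs of memory work, a miss at the same-datacenter round trip plus an assumed ~2 ms of query time on the database:

```python
# Is an in-memory cache in front of a remote database worth it?
CACHE_HIT_US = 1               # a few DRAM accesses, ~1 µs
DB_CALL_US = 500 + 2_000       # same-DC RTT + assumed ~2 ms query time

def mean_latency_us(hit_rate: float) -> float:
    # A miss pays the cache lookup and then the full database call.
    return hit_rate * CACHE_HIT_US + (1 - hit_rate) * (CACHE_HIT_US + DB_CALL_US)

for hit_rate in (0.0, 0.5, 0.9, 0.99):
    print(f"hit rate {hit_rate:.2f}: {mean_latency_us(hit_rate):.0f} µs")
# hit rate 0.00: 2501 µs
# hit rate 0.50: 1251 µs
# hit rate 0.90: 251 µs
# hit rate 0.99: 26 µs
```

Note that as long as the miss rate stays above 1%, the p99 is still a full database call: caches help averages long before they help tails.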