
Latency Numbers Every Engineer Should Know (2025)


Jeff Dean's original latency table is over a decade old. Here is an updated version for modern hardware, plus notes on how to use it for ML serving estimates.

The table

Operation                            Latency
L1 cache hit                         ~1 ns
L2 cache hit                         ~4 ns
L3 cache hit                         ~40 ns
DRAM access                          ~100 ns
NVMe SSD read (4 KB)                 ~100 µs
Same-datacenter round trip           ~500 µs
SSD sequential read (1 MB)           ~1 ms
TCP handshake (local)                ~1 ms
HDD seek                             ~10 ms
Cross-region round trip (US → EU)    ~75 ms
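To turn the table into intuition, divide one number by another. A same-datacenter round trip costs roughly five thousand DRAM accesses, which is why a chatty per-row network protocol loses so badly to a batched one. A quick sketch in shell arithmetic, using the table's numbers in nanoseconds:

# DRAM accesses per same-DC round trip (500 µs / 100 ns)
echo $(( 500000 / 100 ))         # => 5000

# same-DC round trips per cross-region round trip
echo $(( 75000000 / 500000 ))    # => 150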

Inference latency budget

A 100 ms p99 budget for an LLM serving endpoint breaks down roughly as:

network ingress        ~5 ms
tokenisation           ~1 ms
model forward pass     ~60–80 ms   ← optimise here first
sampling               ~2 ms
response serialisation ~1 ms
network egress         ~5 ms
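The components sum to 74–94 ms, so even the best case leaves little headroom against 100 ms:

# worst case, in ms: ingress + tokenisation + forward + sampling + serialisation + egress
echo $(( 5 + 1 + 80 + 2 + 1 + 5 ))   # => 94, leaving ~6 ms of p99 slack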

Speculative decoding, KV-cache pinning, and dynamic batching all attack the forward pass. Reducing cross-region hops attacks the network terms. Optimise in proportion to where the time actually goes.
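A quick comparison shows why the forward pass is the right first target, taking 70 ms as the midpoint of the budget above:

# halving the forward pass saves ~35 ms
echo $(( 70 / 2 ))        # => 35
# halving both network terms saves only ~5 ms
echo $(( (5 + 5) / 2 ))   # => 5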

Checking your own system

# Round-trip latency to a host
ping -c 10 api.example.com

# Memory bandwidth (Linux; requires sysbench)
sysbench memory --memory-block-size=1M --memory-total-size=10G run

# Disk read throughput (run as root; iflag=direct bypasses the page cache)
dd if=/dev/nvme0n1 of=/dev/null bs=1M count=1024 iflag=direct
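ping reports averages, but budgets are written against tails. A crude way to eyeball p99 over HTTP, assuming curl is installed and substituting your own endpoint for the placeholder URL:

# time 100 requests, then take the 99th-smallest as a rough p99 (seconds)
for i in $(seq 100); do
  curl -s -o /dev/null -w '%{time_total}\n' https://api.example.com/health
done | sort -n | awk 'NR == 99'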

The rule of thumb

If it crosses a network boundary, assume milliseconds. If it stays on-chip, assume nanoseconds. Everything else is in between.

Use this when estimating whether a caching layer is worth the complexity, or whether a synchronous database call inside a hot path is a problem.
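As a sketch of the caching question: take the table's ~500 µs same-DC round trip as the cost of a remote cache hit, and assume, purely for illustration, a ~5 ms database query and a 90% hit rate:

# expected hot-path latency with the cache, in µs
# 0.9 * 500 + 0.1 * (500 + 5000)
echo $(( (9 * 500 + (500 + 5000)) / 10 ))   # => 1000 µs ≈ 1 ms, vs ~5 ms uncached

Whether a 5x win on that path is worth the extra moving part is the actual decision.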