
Latency Numbers Every Engineer Should Know (2025)


Jeff Dean's original latency table is over a decade old. Here is an updated version for modern hardware, plus notes on how to use it for ML serving estimates.

The table

Operation                            Latency
L1 cache hit                         ~1 ns
L2 cache hit                         ~4 ns
L3 cache hit                         ~40 ns
DRAM access                          ~100 ns
NVMe SSD read (4 KB)                 ~100 µs
Same-datacenter round trip           ~500 µs
SSD sequential read (1 MB)           ~1 ms
TCP handshake (local)                ~1 ms
HDD seek                             ~10 ms
Cross-region round trip (US → EU)    ~75 ms
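To turn the table into intuition, divide one number by another. A same-datacenter round trip costs roughly five thousand DRAM accesses, which is why a chatty per-row network protocol loses so badly to a batched one. A quick sketch in shell arithmetic, using the table's numbers in nanoseconds:

# DRAM accesses per same-DC round trip (500 µs / 100 ns)
echo $(( 500000 / 100 ))         # => 5000

# same-DC round trips per cross-region round trip
echo $(( 75000000 / 500000 ))    # => 150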

Inference latency budget

A 100 ms p99 budget for an LLM serving endpoint breaks down roughly as:

network ingress        ~5 ms
tokenisation           ~1 ms
model forward pass     ~60–80 ms   ← optimise here first
sampling               ~2 ms
response serialisation ~1 ms
network egress         ~5 ms
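The components sum to 74–94 ms, so even the best case leaves little headroom against 100 ms:

# worst case, in ms: ingress + tokenisation + forward + sampling + serialisation + egress
echo $(( 5 + 1 + 80 + 2 + 1 + 5 ))   # => 94, leaving ~6 ms of p99 slack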

Speculative decoding, KV-cache pinning, and dynamic batching all attack the forward pass. Reducing cross-region hops attacks the network terms. Optimise in proportion to where the time actually goes.
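A quick comparison shows why the forward pass is the right first target, taking 70 ms as the midpoint of the budget above:

# halving the forward pass saves ~35 ms
echo $(( 70 / 2 ))        # => 35
# halving both network terms saves only ~5 ms
echo $(( (5 + 5) / 2 ))   # => 5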

Checking your own system

# Round-trip latency to a host
ping -c 10 api.example.com

# Memory bandwidth (Linux; requires sysbench)
sysbench memory --memory-block-size=1M --memory-total-size=10G run

# Disk read throughput (run as root; iflag=direct bypasses the page cache)
dd if=/dev/nvme0n1 of=/dev/null bs=1M count=1024 iflag=direct
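ping reports averages, but budgets are written against tails. A crude way to eyeball p99 over HTTP, assuming curl is installed and substituting your own endpoint for the placeholder URL:

# time 100 requests, then take the 99th-smallest as a rough p99 (seconds)
for i in $(seq 100); do
  curl -s -o /dev/null -w '%{time_total}\n' https://api.example.com/health
done | sort -n | awk 'NR == 99'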

The rule of thumb

If it crosses a network boundary, assume milliseconds. If it stays on-chip, assume nanoseconds. Everything else is in between.

Use this when estimating whether a caching layer is worth the complexity, or whether a synchronous database call inside a hot path is a problem.
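As a sketch of the caching question: take the table's ~500 µs same-DC round trip as the cost of a remote cache hit, and assume, purely for illustration, a ~5 ms database query and a 90% hit rate:

# expected hot-path latency with the cache, in µs
# 0.9 * 500 + 0.1 * (500 + 5000)
echo $(( (9 * 500 + (500 + 5000)) / 10 ))   # => 1000 µs ≈ 1 ms, vs ~5 ms uncached

Whether a 5x win on that path is worth the extra moving part is the actual decision.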