Hardware & Topology
The reference hardware for HttpArena results is a single-socket AMD Threadripper PRO 3995WX workstation. Every benchmark run uses the same machine, so numbers are directly comparable framework-to-framework. This page explains the CPU topology — NUMA, cache hierarchy, SMT — and how the benchmark harness pins resources against it.
Processor
| Spec | Value |
|---|---|
| Model | AMD Ryzen Threadripper PRO 3995WX |
| Architecture | Zen 2 |
| Physical cores | 64 |
| Logical threads | 128 (SMT2) |
| Sockets | 1 |
| Base / boost clock | 2.7 GHz / 4.2 GHz |
| TDP | 280 W |
| Memory channels | 8 × DDR4-3200 |
| Aggregate DRAM bandwidth | ~205 GB/s |
| Total L3 | 256 MB (16 × 16 MB) |
| Total L2 | 32 MB (64 × 512 KB) |
| Total L1d / L1i | 2 MB each (64 × 32 KB) |
NUMA layout
Confirmed with numactl --hardware:
available: 1 nodes (0)
node 0 cpus: 0-127
node 0 size: 257 GB
node distances:
node 0
0: 10One NUMA node, all 128 logical threads, all DRAM channels. This is NPS1 mode (Nodes Per Socket = 1), configured in BIOS. All memory is symmetrically accessible to every core with identical reported latency, so the kernel scheduler has full placement freedom and software doesn’t need NUMA policies.
Could NPS4 help?
No, and it could hurt. NPS4 would split the chip into 4 NUMA nodes (16 cores + 2 DRAM channels each), surfacing physical proximity asymmetries to the kernel. That’s useful for shared-nothing workloads — multiple independent DBs, per-node JVMs, etc. HttpArena’s server workloads are the opposite: heavily shared mutable state (thread-pool queues, IMemoryCache, Npgsql multiplexer state, Postgres shared buffers). Under NPS4 that shared state gets a “home node” and cross-node access becomes measurably slower. Also, at our throughput (~1 KB memory traffic per request × 350K rps ≈ 350 MB/s) we use 0.2% of aggregate memory bandwidth — the resource NPS partitions isn’t one we’re constrained on.
Cache hierarchy
| Level | Size | Shared by |
|---|---|---|
| L1i + L1d | 32 KB each | One physical core (plus its SMT sibling) |
| L2 | 512 KB | One physical core (plus its SMT sibling) |
| L3 | 16 MB | One CCX — 4 physical cores / 8 threads |
Zen 2 arranges cores into CCXs (core complexes) of 4 physical cores. Each CCX has its own exclusive L3 slice. Physical chiplets on the 3995WX contain 2 CCXs each, and there are 8 chiplets → 16 CCXs total, 16 MB L3 per CCX.
Within a CCX, L3 access is ~40 cycles. Crossing CCXs (Infinity Fabric) is ~110 cycles. So CCX boundaries are the real locality boundaries on this chip, more so than NUMA.
CCX-to-CPU mapping
Verified from /sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list:
| CCX | Physical cores (and SMT siblings) |
|---|---|
| 0 | 0–3, 64–67 |
| 1 | 4–7, 68–71 |
| 2 | 8–11, 72–75 |
| 3 | 12–15, 76–79 |
| 4 | 16–19, 80–83 |
| 5 | 20–23, 84–87 |
| 6 | 24–27, 88–91 |
| 7 | 28–31, 92–95 |
| 8 | 32–35, 96–99 |
| 9 | 36–39, 100–103 |
| 10 | 40–43, 104–107 |
| 11 | 44–47, 108–111 |
| 12 | 48–51, 112–115 |
| 13 | 52–55, 116–119 |
| 14 | 56–59, 120–123 |
| 15 | 60–63, 124–127 |
SMT (Simultaneous Multithreading)
Each physical core has two hardware threads. The sibling of CPU N is CPU N+64:
cpu0 → 0,64
cpu1 → 1,65
...
cpu63 → 63,127Verified from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.
Why this matters for pinning: SMT siblings share L1, L2, execution units, and decode bandwidth. When you cpuset-pin a consumer, always include both threads of a pair — splitting an SMT pair across two different consumers creates pathological cache thrashing. When the harness assigns, say, 0,64 to Redis and 1-31,65-95 to the server, both sides get coherent pairs.
SMT2 roughly yields +30% throughput on our workload over single-thread-per-core — useful for latency-bound async handlers where one logical thread is usually idle waiting on I/O while its sibling can execute.
How HttpArena pins against this topology
Most profiles use a 64-thread server cpuset: 0-31,64-95. That’s 32 physical cores spanning CCX 0–7 (8 CCXs = 128 MB L3 budget). The load generator (gcannon) gets 32-63,96-127 — symmetric split, CCX 8–15. One half of the chip drives the test, the other half serves it. Same NUMA node either way.
For profiles that need a sidecar (Postgres, Redis), the harness reshuffles:
crud profile (uses Redis):
| consumer | phys | threads | cpuset | L3 reach |
|---|---|---|---|---|
| Redis | 1 | 2 | 0,64 | CCX 0 (shared with 3 server cores) |
| Server | 31 | 62 | 1-31,65-95 | CCX 0–7 (112 MB L3 exclusive + 16 MB shared with Redis) |
| Gcannon | 32 | 64 | 32-63,96-127 | CCX 8–15 (128 MB L3) |
| Postgres | unpinned | — | — | kernel-scheduled, typically lands on server CCXs for request-path L3 locality |
Redis sharing a CCX with a few server cores is beneficial — data the server just wrote to Redis (on cache miss) stays in the shared L3 when it reads back (on hit). Moving Redis to a non-server CCX would introduce a ~70-cycle inter-CCX coherence hop per read.
production-stack profile (explicit multi-service pinning):
| Service | cpuset |
|---|---|
| edge (nginx) | 4-15,68-79 (12 phys) |
| authsvc (JWT verifier) | 16-19,80-83 (4 phys) |
| Redis (cache) | 15,79 (1 phys) |
| Postgres (unpinned) | — |
| server (framework) | 0-3,20-31,64-67,84-95 (16 phys) |
| gcannon | 32-63,96-127 (32 phys) |
This split is empirically tuned — see the CHANGELOG entry for 2026-04-16 for the calibration history and why an edge-heavy allocation works at the given rps.
Kernel tuning applied per run
scripts/lib/system.sh runs before each benchmark:
- CPU governor →
performance(no DVFS ramp delays) net.core.somaxconn→ 65535 (accept queue)net.ipv4.tcp_max_syn_backlog→ 65535net.core.netdev_max_backlog→ 65535net.ipv4.ip_local_port_range→1024 65535(avoid ephemeral port exhaustion under-rreconnect storms)net.ipv4.tcp_tw_reuse→ 1net.ipv4.tcp_max_tw_buckets→ 131072net.core.rmem_max/wmem_max→ 7.5 MB (UDP buffer for QUIC)- Loopback MTU → 1500 (realistic Ethernet; the default 65536 hides kernel segmentation cost)
- Docker daemon restart to pick up the new limits
Post-run system_restore reverts governor and loopback MTU to defaults.
Practical takeaways
- CCXs are the locality unit, not NUMA nodes. Pin consumers to contiguous 4-core groups when you can; avoid splitting a CCX across two consumers unless you explicitly want them to share L3 (like Redis + server).
- Keep SMT pairs together. Every cpuset in the harness respects
(N, N+64)pairing — preserved automatically if you specify cpusets likea-b,(a+64)-(b+64). - NUMA is a non-issue on this chip. Don’t waste time with
numactl --membindor NPS subdivision. - Memory bandwidth is 99.8% idle at our rps. Memory-side optimizations only help workloads with much higher per-request data movement (analytics, streaming) than REST-API-style request handling.