Hardware & Topology

The reference hardware for HttpArena results is a single-socket AMD Threadripper PRO 3995WX workstation. Every benchmark run uses the same machine, so numbers are directly comparable framework-to-framework. This page explains the CPU topology — NUMA, cache hierarchy, SMT — and how the benchmark harness pins resources against it.

Processor

| Spec | Value |
| --- | --- |
| Model | AMD Ryzen Threadripper PRO 3995WX |
| Architecture | Zen 2 |
| Physical cores | 64 |
| Logical threads | 128 (SMT2) |
| Sockets | 1 |
| Base / boost clock | 2.7 GHz / 4.2 GHz |
| TDP | 280 W |
| Memory channels | 8 × DDR4-3200 |
| Aggregate DRAM bandwidth | ~205 GB/s |
| Total L3 | 256 MB (16 × 16 MB) |
| Total L2 | 32 MB (64 × 512 KB) |
| Total L1d / L1i | 2 MB each (64 × 32 KB) |

NUMA layout

Confirmed with numactl --hardware:

available: 1 nodes (0)
node 0 cpus: 0-127
node 0 size: 257 GB
node distances:
node   0
  0:  10

One NUMA node, all 128 logical threads, all DRAM channels. This is NPS1 mode (Nodes Per Socket = 1), configured in BIOS. All memory is symmetrically accessible to every core with identical reported latency, so the kernel scheduler has full placement freedom and software doesn’t need NUMA policies.

Could NPS4 help?

No, and it could hurt. NPS4 would split the chip into 4 NUMA nodes (16 cores + 2 DRAM channels each), exposing physical-proximity asymmetries to the kernel. That’s useful for shared-nothing workloads such as multiple independent databases or per-node JVMs. HttpArena’s server workloads are the opposite: they rely on heavily shared mutable state (thread-pool queues, IMemoryCache, Npgsql multiplexer state, Postgres shared buffers). Under NPS4 that shared state gets a “home node”, and access from the other nodes becomes measurably slower. Bandwidth is not a constraint either: at our throughput (~1 KB of memory traffic per request × 350K rps ≈ 350 MB/s) we use about 0.2% of the ~205 GB/s aggregate memory bandwidth, so the resource NPS partitions is not one we are short of.
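
The bandwidth figures above can be reproduced with a few lines of arithmetic. This is a sketch using the page's round numbers (~1 KB per request, 350K rps) as assumptions, not measurements, and the theoretical peak of 8 DDR4-3200 channels with a 64-bit data bus each.

```python
# Theoretical peak DRAM bandwidth: channels x transfers/s x bytes per transfer.
CHANNELS = 8
TRANSFERS_PER_SEC = 3_200_000_000   # DDR4-3200 = 3200 MT/s
BYTES_PER_TRANSFER = 8              # 64-bit data bus per channel

peak_bytes_per_sec = CHANNELS * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER

# Request-path memory traffic, using the round figures quoted on this page
# (assumed values, not measured by this snippet).
BYTES_PER_REQUEST = 1_000           # ~1 KB
REQUESTS_PER_SEC = 350_000

used_bytes_per_sec = BYTES_PER_REQUEST * REQUESTS_PER_SEC
utilization = used_bytes_per_sec / peak_bytes_per_sec

print(f"peak:        {peak_bytes_per_sec / 1e9:.1f} GB/s")   # 204.8 GB/s
print(f"used:        {used_bytes_per_sec / 1e6:.0f} MB/s")   # 350 MB/s
print(f"utilization: {utilization:.2%}")                     # 0.17%
```

The result, about 0.17%, is what the text rounds to 0.2%.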

Cache hierarchy

| Level | Size | Shared by |
| --- | --- | --- |
| L1i + L1d | 32 KB each | One physical core (plus its SMT sibling) |
| L2 | 512 KB | One physical core (plus its SMT sibling) |
| L3 | 16 MB | One CCX: 4 physical cores / 8 threads |

Zen 2 arranges cores into CCXs (core complexes) of 4 physical cores. Each CCX has its own exclusive L3 slice. Physical chiplets on the 3995WX contain 2 CCXs each, and there are 8 chiplets → 16 CCXs total, 16 MB L3 per CCX.

Within a CCX, L3 access is ~40 cycles. Crossing CCXs (Infinity Fabric) is ~110 cycles. So CCX boundaries are the real locality boundaries on this chip, more so than NUMA.

CCX-to-CPU mapping

Verified from /sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list:

| CCX | Physical cores (and SMT siblings) |
| --- | --- |
| 0 | 0–3, 64–67 |
| 1 | 4–7, 68–71 |
| 2 | 8–11, 72–75 |
| 3 | 12–15, 76–79 |
| 4 | 16–19, 80–83 |
| 5 | 20–23, 84–87 |
| 6 | 24–27, 88–91 |
| 7 | 28–31, 92–95 |
| 8 | 32–35, 96–99 |
| 9 | 36–39, 100–103 |
| 10 | 40–43, 104–107 |
| 11 | 44–47, 108–111 |
| 12 | 48–51, 112–115 |
| 13 | 52–55, 116–119 |
| 14 | 56–59, 120–123 |
| 15 | 60–63, 124–127 |
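
The mapping in this table follows a simple rule, sketched below: strip the SMT offset, then divide by the CCX size. The function names are illustrative, and the constants hold only for this machine's CPU enumeration; on other hardware, read shared_cpu_list rather than assuming the formula.

```python
CORES = 64          # physical cores on the 3995WX
CORES_PER_CCX = 4   # Zen 2 CCX size

def ccx_of(cpu: int) -> int:
    """Return the CCX index for a logical CPU id (0-127) on this machine."""
    if not 0 <= cpu < 2 * CORES:
        raise ValueError(f"cpu {cpu} out of range")
    physical = cpu % CORES              # CPUs 64-127 are siblings of 0-63
    return physical // CORES_PER_CCX

def cpus_in_ccx(ccx: int) -> list[int]:
    """Return the logical CPUs of one CCX: 4 cores followed by their siblings."""
    first = ccx * CORES_PER_CCX
    cores = list(range(first, first + CORES_PER_CCX))
    return cores + [c + CORES for c in cores]

print(ccx_of(40), cpus_in_ccx(10))
```

For example, CPU 40 and CPU 104 both map to CCX 10, matching the table row for CCX 10.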

SMT (Simultaneous Multithreading)

Each physical core has two hardware threads. For N in 0–63, the SMT sibling of CPU N is CPU N+64:

cpu0  → 0,64
cpu1  → 1,65
...
cpu63 → 63,127

Verified from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.
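
A minimal sketch of how that verification could be scripted, assuming a Linux host with the standard sysfs layout. The helper names are hypothetical and not part of the harness; `expected_sibling` encodes the N+64 rule for this machine only.

```python
from pathlib import Path

def parse_cpu_list(text: str) -> list[int]:
    """Parse a kernel CPU list such as '0,64' or '0-3,64-67'."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def thread_siblings(cpu: int, sysfs_root: str = "/sys/devices/system/cpu") -> list[int]:
    """Read the SMT siblings of one CPU from sysfs (Linux only)."""
    path = Path(sysfs_root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    return parse_cpu_list(path.read_text())

def expected_sibling(cpu: int, cores: int = 64) -> int:
    """Sibling under the N / N+64 enumeration described on this page."""
    return cpu + cores if cpu < cores else cpu - cores
```

On the benchmark host, `thread_siblings(0)` would be expected to return `[0, 64]`, and comparing it against `expected_sibling` for every CPU confirms the pattern.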

Why this matters for pinning: SMT siblings share L1, L2, execution units, and decode bandwidth. When you cpuset-pin a consumer, always include both threads of a pair — splitting an SMT pair across two different consumers creates pathological cache thrashing. When the harness assigns, say, 0,64 to Redis and 1-31,65-95 to the server, both sides get coherent pairs.

SMT2 roughly yields +30% throughput on our workload over single-thread-per-core — useful for latency-bound async handlers where one logical thread is usually idle waiting on I/O while its sibling can execute.
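
The pairing rule can be checked mechanically before a cpuset is handed to a container. The sketch below is illustrative, not the harness's own code: it assumes the N / N+64 sibling layout of this machine and reports every physical core whose pair is only half-included.

```python
def parse_cpuset(spec: str) -> set[int]:
    """Parse a cpuset string such as '1-31,65-95' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def split_pairs(spec: str, cores: int = 64) -> list[int]:
    """Return physical core ids for which only one SMT thread is in the cpuset."""
    cpus = parse_cpuset(spec)
    broken = set()
    for cpu in cpus:
        sibling = cpu + cores if cpu < cores else cpu - cores
        if sibling not in cpus:
            broken.add(cpu % cores)
    return sorted(broken)

print(split_pairs("0,64"))        # []  (Redis example from the text)
print(split_pairs("1-31,65-95"))  # []  (server example from the text)
print(split_pairs("0-3,64-66"))   # [3] (core 3 is missing its sibling 67)
```

An empty list means every core in the cpuset carries both of its hardware threads.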

How HttpArena pins against this topology

Most profiles use a 64-thread server cpuset: 0-31,64-95. That’s 32 physical cores spanning CCX 0–7 (8 CCXs = 128 MB L3 budget). The load generator (gcannon) gets 32-63,96-127 — symmetric split, CCX 8–15. One half of the chip drives the test, the other half serves it. Same NUMA node either way.
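
The two cpusets above follow directly from the CCX layout. As a sketch, a hypothetical helper (not from the harness) can derive a cpuset string and its L3 budget from a contiguous CCX range, assuming 4 cores and 16 MB of L3 per CCX and the N+64 sibling offset.

```python
CORES = 64
CORES_PER_CCX = 4
L3_PER_CCX_MB = 16

def cpuset_for_ccxs(first_ccx: int, last_ccx: int) -> str:
    """Build a cpuset covering whole CCXs, including both SMT threads per core."""
    lo = first_ccx * CORES_PER_CCX
    hi = (last_ccx + 1) * CORES_PER_CCX - 1
    return f"{lo}-{hi},{lo + CORES}-{hi + CORES}"

def l3_budget_mb(first_ccx: int, last_ccx: int) -> int:
    """Total L3 reachable from a contiguous CCX range."""
    return (last_ccx - first_ccx + 1) * L3_PER_CCX_MB

print(cpuset_for_ccxs(0, 7), l3_budget_mb(0, 7))    # 0-31,64-95 128
print(cpuset_for_ccxs(8, 15), l3_budget_mb(8, 15))  # 32-63,96-127 128
```

Building cpusets from whole CCXs keeps both the SMT pairing and the L3 boundaries intact by construction.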

For profiles that need a sidecar (Postgres, Redis), the harness reshuffles:

crud profile (uses Redis):

| consumer | phys | threads | cpuset | L3 reach |
| --- | --- | --- | --- | --- |
| Redis | 1 | 2 | 0,64 | CCX 0 (shared with 3 server cores) |
| Server | 31 | 62 | 1-31,65-95 | CCX 0–7 (112 MB L3 exclusive + 16 MB shared with Redis) |
| Gcannon | 32 | 64 | 32-63,96-127 | CCX 8–15 (128 MB L3) |
| Postgres | | | unpinned | kernel-scheduled, typically lands on server CCXs for request-path L3 locality |

Redis sharing a CCX with a few server cores is beneficial: data the server just wrote to Redis (on a cache miss) is still in the shared L3 when the server reads it back (on a hit). Moving Redis to a non-server CCX would add a ~70-cycle inter-CCX coherence hop per read (the gap between ~110 cycles across CCXs and ~40 cycles within one).
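
The core counts and L3 figures for the crud profile can be recomputed from the cpusets alone. This is a sketch with hypothetical names, assuming the layout described on this page (siblings at N+64, 4 cores and 16 MB of L3 per CCX); the cpuset strings are the ones listed for the pinned consumers.

```python
CORES, CORES_PER_CCX, L3_PER_CCX_MB = 64, 4, 16

CRUD = {"Redis": "0,64", "Server": "1-31,65-95", "Gcannon": "32-63,96-127"}

def parse_cpuset(spec: str) -> set[int]:
    """Parse a cpuset string such as '1-31,65-95' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in spec.split(","):
        lo, _, hi = part.strip().partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def summarize(spec: str) -> dict:
    """Count physical cores, threads, and the CCXs a cpuset touches."""
    cpus = parse_cpuset(spec)
    phys = {c % CORES for c in cpus}
    return {"phys": len(phys), "threads": len(cpus),
            "ccxs": {p // CORES_PER_CCX for p in phys}}

summaries = {name: summarize(spec) for name, spec in CRUD.items()}

# CCXs that Redis and the server both touch, and the server's exclusive L3.
shared = summaries["Redis"]["ccxs"] & summaries["Server"]["ccxs"]
exclusive_mb = (len(summaries["Server"]["ccxs"]) - len(shared)) * L3_PER_CCX_MB

for name, s in summaries.items():
    print(name, s["phys"], s["threads"], sorted(s["ccxs"]))
print("shared CCXs:", sorted(shared), "server-exclusive L3 MB:", exclusive_mb)
```

The only shared CCX is CCX 0, which leaves the server 112 MB of exclusive L3 plus the 16 MB it shares with Redis.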

production-stack profile (explicit multi-service pinning):

| Service | cpuset |
| --- | --- |
| edge (nginx) | 4-15,68-79 (12 phys) |
| authsvc (JWT verifier) | 16-19,80-83 (4 phys) |
| Redis (cache) | 15,79 (1 phys) |
| Postgres | (unpinned) |
| server (framework) | 0-3,20-31,64-67,84-95 (16 phys) |
| gcannon | 32-63,96-127 (32 phys) |

This split is empirically tuned — see the CHANGELOG entry for 2026-04-16 for the calibration history and why an edge-heavy allocation works at the given rps.

Kernel tuning applied per run

scripts/lib/system.sh runs before each benchmark:

  • CPU governor → performance (no DVFS ramp delays)
  • net.core.somaxconn → 65535 (accept queue)
  • net.ipv4.tcp_max_syn_backlog → 65535
  • net.core.netdev_max_backlog → 65535
  • net.ipv4.ip_local_port_range → 1024 65535 (avoid ephemeral port exhaustion under -r reconnect storms)
  • net.ipv4.tcp_tw_reuse → 1
  • net.ipv4.tcp_max_tw_buckets → 131072
  • net.core.rmem_max / net.core.wmem_max → 7.5 MB (UDP buffer for QUIC)
  • Loopback MTU → 1500 (realistic Ethernet; the default 65536 hides kernel segmentation cost)
  • Docker daemon restart to pick up the new limits

Post-run system_restore reverts governor and loopback MTU to defaults.
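
To confirm that the sysctl settings are in effect during a run, the values can be read back from /proc/sys. The sketch below is not part of scripts/lib/system.sh; the function names are hypothetical, and it covers only the settings whose exact values are listed above (the buffer sizes are omitted because the page gives them only as 7.5 MB).

```python
from pathlib import Path

EXPECTED = {
    "net.core.somaxconn": "65535",
    "net.ipv4.tcp_max_syn_backlog": "65535",
    "net.core.netdev_max_backlog": "65535",
    "net.ipv4.ip_local_port_range": "1024 65535",
    "net.ipv4.tcp_tw_reuse": "1",
    "net.ipv4.tcp_max_tw_buckets": "131072",
}

def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root, *name.split("."))

def read_sysctl(name: str) -> str:
    """Read a sysctl value, collapsing tabs and newlines to single spaces."""
    return " ".join(sysctl_path(name).read_text().split())

def mismatches(expected: dict, read=read_sysctl) -> dict:
    """Return {name: (wanted, actual)} for every setting that differs."""
    diff = {}
    for name, want in expected.items():
        got = read(name)
        if got != want:
            diff[name] = (want, got)
    return diff
```

On a Linux host, `mismatches(EXPECTED)` returns an empty dict when every listed setting matches. Inside a container, network sysctls are per network namespace, so the values seen there may differ from the host's.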

Practical takeaways

  1. CCXs are the locality unit, not NUMA nodes. Pin consumers to contiguous 4-core groups when you can; avoid splitting a CCX across two consumers unless you explicitly want them to share L3 (like Redis + server).
  2. Keep SMT pairs together. Every cpuset in the harness respects (N, N+64) pairing — preserved automatically if you specify cpusets like a-b,(a+64)-(b+64).
  3. NUMA is a non-issue on this chip. Don’t waste time with numactl --membind or NPS subdivision.
  4. Memory bandwidth is 99.8% idle at our rps. Memory-side optimizations only help workloads with much higher per-request data movement (analytics, streaming) than REST-API-style request handling.