Perf vs PCIe speed

More info
topo
========================= ROCm System Management Interface =========================
============================= Weight between two GPUs ==============================
GPU0 GPU1 GPU2 GPU3
GPU0 0 40 40 40
GPU1 40 0 40 40
GPU2 40 40 0 40
GPU3 40 40 40 0

============================== Hops between two GPUs ===============================
GPU0 GPU1 GPU2 GPU3
GPU0 0 2 2 2
GPU1 2 0 2 2
GPU2 2 2 0 2
GPU3 2 2 2 0

============================ Link Type between two GPUs ============================
GPU0 GPU1 GPU2 GPU3
GPU0 0 PCIE PCIE PCIE
GPU1 PCIE 0 PCIE PCIE
GPU2 PCIE PCIE 0 PCIE
GPU3 PCIE PCIE PCIE 0

==================================== Numa Nodes ====================================
GPU[0] : (Topology) Numa Node: 0
GPU[0] : (Topology) Numa Affinity: 0
GPU[1] : (Topology) Numa Node: 0
GPU[1] : (Topology) Numa Affinity: 0
GPU[2] : (Topology) Numa Node: 0
GPU[2] : (Topology) Numa Affinity: 0
GPU[3] : (Topology) Numa Node: 0
GPU[3] : (Topology) Numa Affinity: 0
=============================== End of ROCm SMI Log ================================
llama.cpp ver
ggml_cuda_init: found 4 ROCm devices (Total VRAM: 131008 MiB):
Device 0: AMD Instinct MI60 / MI50, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
Device 1: AMD Instinct MI60 / MI50, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
Device 2: AMD Instinct MI60 / MI50, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
Device 3: AMD Instinct MI60 / MI50, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
load_backend: loaded ROCm backend from /app/libggml-hip.so
load_backend: loaded CPU backend from /app/libggml-cpu-icelake.so
build: ef93e98 (1)
PCIe 1.0
31:00.0
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 2.5GT/s (downgraded), Width x8 (downgraded)
34:00.0
LnkCap: Port #2, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 2.5GT/s (downgraded), Width x8 (downgraded)
4b:00.0
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 2.5GT/s (downgraded), Width x8 (downgraded)
4e:00.0
LnkCap: Port #2, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 2.5GT/s (downgraded), Width x8 (downgraded)
PCIe 2.0
31:00.0
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 5GT/s (downgraded), Width x8 (downgraded)
34:00.0
LnkCap: Port #2, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 5GT/s (downgraded), Width x8 (downgraded)
4b:00.0
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 5GT/s (downgraded), Width x8 (downgraded)
4e:00.0
LnkCap: Port #2, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 5GT/s (downgraded), Width x8 (downgraded)
PCIe 3.0
31:00.0
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 8GT/s (downgraded), Width x8 (downgraded)
34:00.0
LnkCap: Port #2, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 8GT/s (downgraded), Width x8 (downgraded)
4b:00.0
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 8GT/s (downgraded), Width x8 (downgraded)
4e:00.0
LnkCap: Port #2, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 8GT/s (downgraded), Width x8 (downgraded)
PCIe 4.0
31:00.0
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 16GT/s, Width x8 (downgraded)
34:00.0
LnkCap: Port #2, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 16GT/s, Width x8 (downgraded)
4b:00.0
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 16GT/s, Width x8 (downgraded)
4e:00.0
LnkCap: Port #2, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 16GT/s, Width x8 (downgraded)
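For context on the LnkSta values above, here is a rough sketch of the theoretical one-direction bandwidth of an x8 link at each generation. It uses the standard line-encoding figures (8b/10b for Gen 1 and 2, 128b/130b for Gen 3 and 4) and ignores packet and protocol overhead, so real throughput will be lower. The function name `link_bandwidth_gb_s` is made up for this example.

```python
# Payload bits per transferred bit for each PCIe generation's line encoding.
ENCODING = {1: 8 / 10, 2: 8 / 10, 3: 128 / 130, 4: 128 / 130}
# Raw signalling rate per lane in GT/s, matching the LnkSta "Speed" field.
RATE_GT_S = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0}


def link_bandwidth_gb_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s (no protocol overhead)."""
    return RATE_GT_S[gen] * ENCODING[gen] * lanes / 8  # divide by 8: bits -> bytes


for gen in sorted(RATE_GT_S):
    print(f"PCIe {gen}.0 x8: {link_bandwidth_gb_s(gen, 8):.2f} GB/s")
```

For the x8 links shown here this works out to roughly 2, 4, 7.9 and 15.8 GB/s for Gen 1 through Gen 4.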

Max bw

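| Link | Workload | BW (mb/s) |
| --- | --- | --- |
| x8 1.0 | pp | 589 |
| x8 1.0 | tg | 209 |
| x8 2.0 | pp | 728 |
| x8 2.0 | tg | 228 |
| x8 3.0 | pp | 1104 |
| x8 3.0 | tg | 219 |
| x8 4.0 | pp | 1135 |
| x8 4.0 | tg | 222 |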

Bench

MODEL=unsloth/gemma-4-31B-it-GGUF:Q8_0
cpupower -c 0-37 frequency-set -g performance
numactl --membind=0 --cpunodebind=0 \
./llama-bench \
--hf-repo $MODEL \
--split-mode tensor --flash-attn 1 \
--n-prompt 2048 --ubatch-size 2048 \
--n-gen 256 \
--n-depth 0,16384
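| Link | model | size | params | backend | ngl | n_ubatch | sm | fa | test | t/s |
| --- | --- | ---: | ---: | --- | ---: | ---: | --- | ---: | ---: | ---: |
| x8 1.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | pp2048 | 285.30 ± 0.03 |
| x8 1.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | tg256 | 28.90 ± 0.02 |
| x8 1.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | pp2048 @ d16384 | 248.42 ± 0.44 |
| x8 1.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | tg256 @ d16384 | 27.47 ± 0.09 |
| x8 2.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | pp2048 | 360.82 ± 0.08 |
| x8 2.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | tg256 | 31.39 ± 0.16 |
| x8 2.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | pp2048 @ d16384 | 311.40 ± 0.63 |
| x8 2.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | tg256 @ d16384 | 29.59 ± 0.11 |
| x8 3.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | pp2048 | 414.26 ± 0.12 |
| x8 3.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | tg256 | 31.95 ± 0.02 |
| x8 3.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | pp2048 @ d16384 | 355.50 ± 0.75 |
| x8 3.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | tg256 @ d16384 | 30.27 ± 0.14 |
| x8 4.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | pp2048 | 447.61 ± 0.08 |
| x8 4.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | tg256 | 32.52 ± 0.06 |
| x8 4.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | pp2048 @ d16384 | 382.58 ± 0.97 |
| x8 4.0 | gemma4 31B Q8_0 | 30.38 GiB | 30.70 B | ROCm | 99 | 2048 | tensor | 1 | tg256 @ d16384 | 30.67 ± 0.11 |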
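
To make the scaling easier to read, the t/s means from the results above can be turned into a percentage gain over the x8 1.0 baseline. This is just a quick sketch: the numbers are copied from the table (the ± spread is dropped), and `RESULTS` and `gain_over_baseline` are names invented for this example.

```python
# Mean t/s per test and PCIe generation, copied from the llama-bench results.
RESULTS = {
    "pp2048":          {"1.0": 285.30, "2.0": 360.82, "3.0": 414.26, "4.0": 447.61},
    "tg256":           {"1.0": 28.90,  "2.0": 31.39,  "3.0": 31.95,  "4.0": 32.52},
    "pp2048 @ d16384": {"1.0": 248.42, "2.0": 311.40, "3.0": 355.50, "4.0": 382.58},
    "tg256 @ d16384":  {"1.0": 27.47,  "2.0": 29.59,  "3.0": 30.27,  "4.0": 30.67},
}


def gain_over_baseline(test: str, gen: str, baseline: str = "1.0") -> float:
    """Percentage change in t/s for `gen` relative to `baseline` on one test."""
    rates = RESULTS[test]
    return (rates[gen] / rates[baseline] - 1.0) * 100.0


for test, rates in RESULTS.items():
    gains = ", ".join(f"{gen}: {gain_over_baseline(test, gen):+.1f}%" for gen in rates)
    print(f"{test:<16} {gains}")
```

On these numbers, prompt processing gains roughly 57% going from x8 1.0 to x8 4.0, while token generation gains about 12.5%.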