The Strix Halo platform finally makes local inference of 70B-class models viable on a mini-PC.

Why unified memory changes everything

Discrete-GPU inference forces the model to fit in VRAM, or to shuttle weights across the PCIe bus on every pass. Strix Halo's unified memory pool, up to 128 GB shared between CPU and GPU, removes that constraint: the GPU addresses the whole pool directly, so model data never crosses PCIe. With 96 GB allocated to the GPU, Llama 3 70B at Q4 (roughly 40 GB of weights) fits entirely in memory with room for a large KV cache.
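
To see why 96 GB is comfortable, here is a back-of-the-envelope sketch. The layer and head counts are the published Llama 3 70B figures; the bytes-per-weight for Q4 and the fp16 KV cache are simplifying assumptions, and Ollama's real allocation also includes compute buffers:

```python
# Rough memory budget for Llama 3 70B (Q4) on a 96 GB unified pool.
# Assumptions: ~4.5 bits/weight for a Q4_K-style quant, fp16 KV cache.
PARAMS = 70e9
BYTES_PER_WEIGHT = 0.57            # ~4.5 bits per weight (assumed)
N_LAYERS, N_KV_HEADS, HEAD_DIM = 80, 8, 128  # published Llama 3 70B config
KV_BYTES = 2                       # fp16 entries

weights_gb = PARAMS * BYTES_PER_WEIGHT / 1e9
kv_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES  # K + V
for ctx in (8_192, 32_768, 131_072):
    kv_gb = ctx * kv_per_token / 1e9
    print(f"ctx={ctx:>7}: weights {weights_gb:.0f} GB + KV {kv_gb:.1f} GB "
          f"= {weights_gb + kv_gb:.0f} GB of 96 GB")
```

Even at a 128K-token context, the total lands around 83 GB, which is where the "context to spare" claim comes from.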

LLM throughput

ASUS NUC Pro 14, Ollama throughput

Model              tokens/sec
Llama 3 8B (Q4)          68.4
Llama 3 70B (Q4)         13.2
Phi-3 Mini              142.1
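
If you want to reproduce these figures, Ollama's local HTTP API reports eval_count (generated tokens) and eval_duration (nanoseconds) in each /api/generate response, which is enough to compute throughput. A minimal sketch follows; the prompt and the exact model tags are assumptions, not the harness used for the table:

```python
# Measure generation throughput against a local Ollama server.
import json
import urllib.request

def tokens_per_sec(model: str, prompt: str = "Explain unified memory.") -> float:
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # eval_duration is in nanoseconds and covers generation only,
    # so model load time does not skew the result.
    return body["eval_count"] / (body["eval_duration"] / 1e9)

for model in ("llama3:8b", "llama3:70b", "phi3:mini"):
    print(f"{model}: {tokens_per_sec(model):.1f} tok/s")
```

A single prompt is noisy; numbers like those in the table would come from averaging several runs after a warm-up request has loaded the model.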

Verdict

If you want to run 70B-class models locally without a dual-RTX rig, the Strix Halo platform is the most pragmatic answer available in 2026.