The VRAM constraint dominates every build decision
Local LLM inference is a VRAM problem first. A 70B model at 4-bit quantization needs roughly 40 GB of VRAM once weights and KV cache are counted; a 7B model at the same quantization fits in about 6 GB. Everything else (CPU, RAM, storage) is secondary to whether your GPU can hold the model.
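As a rough sanity check, the sketch below estimates VRAM from parameter count and bits per weight. The 20% overhead factor for KV cache and runtime buffers is an assumed rule of thumb, not a measured number; real usage grows with context length and batch size.

    def estimate_vram_gib(params_billion: float, bits_per_weight: float,
                          overhead: float = 0.20) -> float:
        """Approximate VRAM: quantized weights plus a fudge factor for
        KV cache and runtime buffers (the 20% default is an assumption)."""
        weight_gib = params_billion * 1e9 * bits_per_weight / 8 / 1024**3
        return weight_gib * (1 + overhead)

    # 70B at 4-bit lands near 39 GiB, consistent with the ~40 GB figure above;
    # 7B at 4-bit is ~4 GiB before long contexts push it toward ~6 GB.
    print(f"70B @ 4-bit: ~{estimate_vram_gib(70, 4):.0f} GiB")
    print(f" 7B @ 4-bit: ~{estimate_vram_gib(7, 4):.1f} GiB")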
Build tiers
Entry — $1,500: Single RX 9070 XT (16 GB)
Runs 7B–13B models well. A 70B is possible via CPU offloading but slow (around 8 tok/s); see the offloading sketch after this list.
Mid — $2,500: Single RTX 5080 (16 GB)
Faster throughput on 7B–13B. 70B with offload: 18 tok/s. Best price-to-performance for most users.
High — $4,500: Dual RTX 5080 (32 GB combined by splitting the model across both cards, plus CPU offload; RTX 5080s have no NVLink)
Runs a 70B at Q4 with nearly all layers in VRAM and the rest offloaded: 38 tok/s. Gets into the same 70B-capable territory as a Strix Halo mini-PC, though with none of its portability.
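Offloading is what makes the lower tiers 70B-capable at all: as many transformer layers as fit stay on the GPU, and the remainder run on the CPU. Below is a minimal sketch using llama-cpp-python; the model path and layer count are placeholders, and the right n_gpu_layers value depends on how much VRAM is actually free.

    from llama_cpp import Llama

    # Keep as many layers on the GPU as fit; the rest execute on the CPU.
    # Path and layer count are illustrative only: raise n_gpu_layers until
    # you hit out-of-memory, then back off (-1 offloads every layer).
    llm = Llama(
        model_path="models/llama-70b.Q4_K_M.gguf",  # hypothetical local GGUF file
        n_gpu_layers=40,   # layers resident in VRAM
        n_ctx=4096,        # context window; longer contexts cost more VRAM
    )

    out = llm("Explain the KV cache in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])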
Throughput comparison

Tier  | GPUs           | 70B Q4 throughput
Entry | 1× RX 9070 XT  | ~8 tok/s (heavy CPU offload)
Mid   | 1× RTX 5080    | ~18 tok/s (CPU offload)
High  | 2× RTX 5080    | ~38 tok/s (nearly all in VRAM)
Software stack
Ollama is the simplest starting point. llama.cpp with GGUF models gives more control over quantization level and GPU offloading. vLLM (Linux only) delivers production-grade throughput behind an OpenAI-compatible API.
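Since vLLM exposes an OpenAI-compatible endpoint, the standard OpenAI Python client can talk to it directly. A minimal sketch assuming a server is already running locally; the base URL, model name, and prompt are placeholders (vLLM defaults to port 8000 and ignores the API key).

    from openai import OpenAI

    # Point the stock OpenAI client at the local server instead of the cloud.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="llama-70b-q4",  # placeholder: use the model name the server reports
        messages=[{"role": "user", "content": "Summarize the VRAM rule of thumb."}],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)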