The VRAM constraint dominates every build decision
Local LLM inference is a VRAM problem first. A 70B model at 4-bit quantization needs roughly 40 GB of VRAM once weights and KV cache are counted; a 7B model at the same quantization fits in about 6 GB. Everything else (CPU, RAM, storage) is secondary to whether your GPU can hold the model.
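As a rough sanity check, the sketch below estimates VRAM from parameter count and bits per weight. The 20% overhead factor for KV cache and runtime buffers is an assumed rule of thumb, not a measured number; real usage grows with context length and batch size.

    def estimate_vram_gib(params_billion: float, bits_per_weight: float,
                          overhead: float = 0.20) -> float:
        """Approximate VRAM: quantized weights plus a fudge factor for
        KV cache and runtime buffers (the 20% default is an assumption)."""
        weight_gib = params_billion * 1e9 * bits_per_weight / 8 / 1024**3
        return weight_gib * (1 + overhead)

    # 70B at 4-bit lands near 39 GiB, consistent with the ~40 GB figure above;
    # 7B at 4-bit is ~4 GiB before long contexts push it toward ~6 GB.
    print(f"70B @ 4-bit: ~{estimate_vram_gib(70, 4):.0f} GiB")
    print(f" 7B @ 4-bit: ~{estimate_vram_gib(7, 4):.1f} GiB")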
Build tiers
Entry — $1,500: Single RX 9070 XT (16 GB)
Runs 7B–13B models well. A 70B is possible via CPU offloading but slow (around 8 tok/s); see the offloading sketch after this list.
Mid — $2,500: Single RTX 5080 (16 GB)
Faster throughput on 7B–13B. 70B with offload: 18 tok/s. Best price-to-performance for most users.
High — $4,500: Dual RTX 5080 (32 GB combined by splitting the model across both cards, plus CPU offload; RTX 5080s have no NVLink)
Runs a 70B at Q4 with nearly all layers in VRAM and the rest offloaded: 38 tok/s. Gets into the same 70B-capable territory as a Strix Halo mini-PC, though with none of its portability.
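Offloading is what makes the lower tiers 70B-capable at all: as many transformer layers as fit stay on the GPU, and the remainder run on the CPU. Below is a minimal sketch using llama-cpp-python; the model path and layer count are placeholders, and the right n_gpu_layers value depends on how much VRAM is actually free.

    from llama_cpp import Llama

    # Keep as many layers on the GPU as fit; the rest execute on the CPU.
    # Path and layer count are illustrative only: raise n_gpu_layers until
    # you hit out-of-memory, then back off (-1 offloads every layer).
    llm = Llama(
        model_path="models/llama-70b.Q4_K_M.gguf",  # hypothetical local GGUF file
        n_gpu_layers=40,   # layers resident in VRAM
        n_ctx=4096,        # context window; longer contexts cost more VRAM
    )

    out = llm("Explain the KV cache in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])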
Throughput comparison

Tier  | GPUs           | 70B Q4 throughput
Entry | 1× RX 9070 XT  | ~8 tok/s (heavy CPU offload)
Mid   | 1× RTX 5080    | ~18 tok/s (CPU offload)
High  | 2× RTX 5080    | ~38 tok/s (nearly all in VRAM)
Software stack
Ollama is the simplest starting point. llama.cpp with GGUF models gives more control over quantization level and GPU offloading. vLLM (Linux only) delivers production-grade throughput behind an OpenAI-compatible API.
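Since vLLM exposes an OpenAI-compatible endpoint, the standard OpenAI Python client can talk to it directly. A minimal sketch assuming a server is already running locally; the base URL, model name, and prompt are placeholders (vLLM defaults to port 8000 and ignores the API key).

    from openai import OpenAI

    # Point the stock OpenAI client at the local server instead of the cloud.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="llama-70b-q4",  # placeholder: use the model name the server reports
        messages=[{"role": "user", "content": "Summarize the VRAM rule of thumb."}],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)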