Methodology v2.5

How We Test & Publish

RunAIatHome is written by the RunAIatHome Team. Every hardware recommendation is backed by measured VRAM headroom, reproducible inference benchmarks, and compatibility data we validate against published model cards.

Editorial process

  1. Collect benchmark and compatibility data from maintained datasets and fresh validations.
  2. Compare real fit thresholds by VRAM, quantization, and offloading behavior.
  3. Write recommendations that privilege the cheapest hardware that still fits the workload.
  4. Add disclosure where monetization exists and keep explanatory context around CTAs.
  5. Rebuild the site and verify static output before shipping.

GPU benchmark methodology

Our GPU scoring compares raw compute and memory bandwidth against the VRAM a workload actually needs. We combine vendor specs with reproducible inference runs rather than synthetic 3D benchmarks.

  • Compute baseline. FP16 / BF16 TFLOPS from vendor datasheets, cross-checked against published MLPerf Inference results for the same silicon.
  • Memory bandwidth. Effective GB/s, because token-generation latency on LLMs is bandwidth-bound, not compute-bound, for most single-user workloads (see the back-of-envelope sketch after this list).
  • Driver baseline. NVIDIA 550+ with CUDA 12.x on Linux; AMD ROCm 6.x on supported RDNA3 cards. We note when a GPU needs a newer stack.
  • Thermals & power. Sustained clocks after a 10-minute inference load, not peak boost. Blower cards and 2-slot designs can lose up to 8% of their sustained clocks.
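
To make the bandwidth-bound point concrete, here is a minimal back-of-envelope sketch in Python. It assumes the usual simplification that, at batch size 1, each generated token streams roughly the full set of quantized weights from VRAM, so bandwidth divided by model size gives a ceiling on decode throughput. The bandwidth and model-size numbers plugged in are illustrative placeholders, not measurements from our test bench.

```python
# Back-of-envelope decode ceiling for a bandwidth-bound, batch-1 workload.
# Assumption: each generated token reads roughly the whole quantized weight
# set from VRAM, so throughput is capped at bandwidth / model size.

def decode_ceiling_tok_s(mem_bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens per second for single-user generation."""
    return mem_bandwidth_gb_s / model_size_gb

# Illustrative inputs: a ~1,000 GB/s card and a ~4.5 GB Q4_K_M 8B model.
print(decode_ceiling_tok_s(mem_bandwidth_gb_s=1008, model_size_gb=4.5))  # ~224 tok/s ceiling
```

Measured throughput always lands below this ceiling because of KV-cache reads, kernel launch overhead, and sustained-clock behavior, which is why we pair the estimate with real inference runs.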

LLM inference tests

Inference numbers are produced with open-source runtimes on a fixed prompt set. We report tokens-per-second at a stable context window so results stay comparable across GPUs.

  • Runtimes. llama.cpp (GGUF), Ollama, and vLLM where the GPU has enough VRAM. The runtime is always noted next to any published number.
  • Model set. Llama 3.1 8B, Mistral 7B, Gemma 2 9B, Qwen 2.5 14B, and Llama 3.3 70B at Q4_K_M. Larger models run with partial CPU offload on consumer cards.
  • Context & batch. 4,096-token context, batch size 1, greedy decoding, 512 generated tokens. Reported figure is the median of three runs.
  • Warm vs. cold. We discard the first run (model load into VRAM) and report warm throughput, which is what users actually experience. A minimal harness sketch follows this list.
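
Below is a minimal sketch of that warm-throughput measurement, assuming the llama-cpp-python bindings and a locally downloaded GGUF file; the model path and prompt are placeholders standing in for our fixed prompt set. It follows the protocol above: 4,096-token context, batch size 1, greedy decoding, 512 generated tokens, first pass discarded, median of three warm passes reported.

```python
# Sketch of a warm tokens/sec measurement (llama-cpp-python bindings assumed).
import time
from statistics import median

from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # fixed 4,096-token context window
    n_gpu_layers=-1,   # offload all layers to the GPU when VRAM allows
)

PROMPT = "Summarise the trade-offs of 4-bit quantization for local inference."  # placeholder prompt

def one_pass() -> float:
    """Tokens/sec for a single pass; prefill time is included, which slightly understates the figure."""
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=512, temperature=0.0)  # greedy decoding, 512 tokens
    elapsed = time.perf_counter() - start
    return out["usage"]["completion_tokens"] / elapsed

one_pass()                                # first pass: warm-up, discarded
warm = [one_pass() for _ in range(3)]     # three warm passes
print(f"median warm throughput: {median(warm):.1f} tok/s")
```

The runtime and its version are logged alongside the result, since the published figure is always attributed to a specific runtime as noted above.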

VRAM measurement

VRAM is the single biggest constraint for local AI, so we measure it rather than estimate. Every compatibility verdict starts from a real footprint on a real GPU.

  • Footprint formula. params × bytes-per-weight + KV-cache + activation overhead (a worked sketch follows this list). KV-cache scales with context length; activation overhead is typically 1-2 GB on consumer GPUs.
  • Quantization table. FP16 = 2 bytes, Q8 = 1 byte, Q5_K_M ≈ 0.7, Q4_K_M ≈ 0.55, Q3_K_S ≈ 0.4 bytes per parameter. We publish the lowest quantization that preserves benchmark accuracy within 2%.
  • Measurement tool. nvidia-smi / rocm-smi sampled every second during a loaded inference pass. We take peak memory, not average.
  • Headroom rule. A GPU is marked compatible only if at least 10% of its VRAM remains free at peak, so users can raise the context length without immediately hitting OOM.
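
Here is a minimal sketch of the footprint arithmetic and the headroom rule, using the per-parameter byte costs from the table above. The parameter count, KV-cache size, and activation overhead are per-model inputs you measure or look up; the helper at the end shows the kind of once-per-second nvidia-smi sampling we use to capture peak usage.

```python
# Sketch: footprint formula (params × bytes-per-weight + KV-cache + activations)
# and the 10% free-VRAM headroom rule described above.
import subprocess

BYTES_PER_PARAM = {   # per-parameter cost from the quantization table above
    "FP16": 2.0,
    "Q8": 1.0,
    "Q5_K_M": 0.7,
    "Q4_K_M": 0.55,
    "Q3_K_S": 0.4,
}

def footprint_gb(params_b: float, quant: str, kv_cache_gb: float, activation_gb: float = 1.5) -> float:
    """Estimated VRAM footprint in GB for a model with params_b billion parameters."""
    weights_gb = params_b * BYTES_PER_PARAM[quant]   # billions of params × bytes/weight ≈ GB
    return weights_gb + kv_cache_gb + activation_gb

def fits_with_headroom(footprint: float, vram_gb: float, headroom: float = 0.10) -> bool:
    """Compatible only if at least `headroom` of total VRAM stays free at peak."""
    return footprint <= vram_gb * (1 - headroom)

# Example: an 8B model at Q4_K_M with a ~1 GB KV-cache on a 12 GB card.
fp = footprint_gb(params_b=8, quant="Q4_K_M", kv_cache_gb=1.0)
print(f"{fp:.1f} GB estimated; fits a 12 GB card: {fits_with_headroom(fp, 12)}")

def vram_used_mib() -> int:
    """One nvidia-smi sample of used VRAM in MiB; we poll every second during a loaded pass and keep the peak."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip().splitlines()[0])   # first GPU
```

The estimate is a planning tool; the published verdict still comes from the measured peak on real hardware.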

Data sources & verification

  • Model cards and official quantization releases on Hugging Face.
  • Vendor datasheets (NVIDIA, AMD, Intel) for TFLOPS, memory size, and bandwidth.
  • MLPerf Inference v4.x results for cross-silicon sanity checks.
  • Community benchmarks from r/LocalLLaMA, cross-verified on our own hardware where possible.
  • Reader corrections: whenever published numbers turn out to be off, we fix them and date the update.

Updates & corrections

Pages are re-reviewed whenever a driver release, runtime version, or new quantization meaningfully changes compatibility. Updates are stamped with a Last tested date on hardware reviews and a Last reviewed date on editorial guides. Spotted an error? Email [email protected] and we will publish a correction.