March 31, 2026
I got a Mac mini M4 and wanted to know exactly how fast it can run local LLMs — not "fast enough" fast, but numbers-on-the-table fast. I benchmarked four Qwen 3.5 model sizes across three runtimes and three quantization levels. Here's what I found.
Three runtimes:
- llama.cpp, built from source with Metal enabled (-DGGML_METAL=ON), benchmarked via llama-bench
- mlx_lm, using pre-converted models from mlx-community on HuggingFace
- Ollama, timed via its HTTP API

Two metrics:
- PP (prompt processing): how fast the prompt is ingested, in tokens per second
- TG (text generation): how fast new tokens are produced, in tokens per second
| Model | Quant | Runtime | PP (t/s) | TG (t/s) |
|---|---|---|---|---|
| 0.8B | Q4_K_M | llama.cpp | 2085 | 103 |
| 0.8B | Q4_K_M | mlx_lm | 553 | **192** |
| 0.8B | Q4_K_M | Ollama | 611 | 51 |
| 0.8B | Q8_0 | llama.cpp | **2174** | 81 |
| 0.8B | Q8_0 | Ollama | 870 | 51 |
| 0.8B | IQ2_XXS | llama.cpp | 2166 | 115 |
| 2B | Q4_K_M | llama.cpp | **978** | **59** |
| 2B | Q4_K_M | Ollama | 258 | 31 |
| 4B | Q4_K_M | llama.cpp | **385** | 28 |
| 4B | Q4_K_M | mlx_lm | 112 | **39** |
| 4B | Q4_K_M | Ollama | 133 | 19 |
| 9B | Q4_K_M | llama.cpp | **214** | 16 |
| 9B | Q4_K_M | mlx_lm | 22 | **22** |
| 9B | Q4_K_M | Ollama | 114 | 14 |
Bold = winner for that metric at that model size. All llama.cpp runs use -ngl 99 (full GPU offload via Metal).
One surprise: the 0.8B IQ2_XXS (2-bit importance-aware quantization, 323 MB on disk) generates at 115 t/s on llama.cpp — faster than Q8_0 at 81 t/s. Fewer bits means less memory bandwidth pressure. For tasks where you just need fast completions from a tiny model, this is genuinely useful.
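A rough sanity check on that file size, assuming the nominal ~2.06 bits per weight that llama.cpp lists for IQ2_XXS (approximate; the exact bpw and the parameter count here are back-of-envelope values, not measured):

```python
def approx_size_mb(params_billion, bpw):
    """Naive on-disk size estimate: parameters x bits-per-weight / 8."""
    return params_billion * 1e9 * bpw / 8 / 1e6

size = approx_size_mb(0.8, 2.06)  # 0.8B params at IQ2_XXS's nominal bpw
print(round(size), "MB")
```

This lands around 206 MB, noticeably below the actual 323 MB, because real GGUF files keep the embedding and output tensors at higher precision than the headline quant level.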
The 9B mlx_lm result is also interesting: PP and TG are both ~22 t/s. That means the model is fully memory-bandwidth-limited — it doesn't matter whether it's processing the prompt or generating; it's just reading weights as fast as unified memory allows.
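You can sanity-check that claim with back-of-envelope math, assuming the base M4's unified memory bandwidth is roughly 120 GB/s (an estimate, not something I measured) and that each generated token requires one full pass over the quantized weights:

```python
# Bandwidth-bound generation speed = bandwidth / bytes read per token.
bandwidth_gb_s = 120   # assumed unified memory bandwidth for the base M4
model_size_gb = 5.5    # rough 9B Q4_K_M weight size

tokens_per_sec = bandwidth_gb_s / model_size_gb
print(f"{tokens_per_sec:.0f} t/s")
```

That comes out to about 22 t/s, which is right where both the PP and TG numbers sit.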
llama.cpp — build with Metal, then use llama-bench:
```sh
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j 10
./build/bin/llama-bench -m model.gguf -p 512 -n 200 -r 3 -o json -ngl 99
```
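The JSON output is easy to post-process. A minimal sketch, assuming the `n_prompt`/`n_gen`/`avg_ts` field names recent llama.cpp builds emit (older builds may differ, and the two entries here are illustrative, not real results):

```python
import json

# llama-bench -o json emits one entry per test; n_prompt > 0 marks a
# prompt-processing run, n_gen > 0 a generation run.
raw = '''[
  {"n_prompt": 512, "n_gen": 0, "avg_ts": 385.2},
  {"n_prompt": 0, "n_gen": 200, "avg_ts": 28.1}
]'''

for r in json.loads(raw):
    kind = "PP" if r["n_prompt"] > 0 else "TG"
    print(f'{kind}: {r["avg_ts"]:.1f} t/s')
```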
mlx_lm — install and run directly:
```sh
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3.5-4B-MLX-4bit \
  --prompt "..." --max-tokens 200
# Prompt: 83 tokens, 111.976 tokens-per-sec
# Generation: 200 tokens, 39.191 tokens-per-sec
```
Ollama — pull and time via API:
```sh
ollama pull qwen3.5:4b
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5:4b",
  "prompt": "...",
  "stream": false,
  "options": {"num_predict": 200}
}'
# Response includes prompt_eval_duration and eval_duration in nanoseconds
```
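Converting those nanosecond fields into t/s, comparable to the llama-bench numbers, looks like this (the field names come from Ollama's `/api/generate` response; the counts and durations below are made-up example values):

```python
import json

response = json.loads('''{
  "prompt_eval_count": 83,
  "prompt_eval_duration": 150000000,
  "eval_count": 200,
  "eval_duration": 10500000000
}''')

# Durations are in nanoseconds, so divide by 1e9 to get seconds.
pp = response["prompt_eval_count"] / (response["prompt_eval_duration"] / 1e9)
tg = response["eval_count"] / (response["eval_duration"] / 1e9)
print(f"PP: {pp:.0f} t/s, TG: {tg:.1f} t/s")
```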
The benchmarks above used a 512-token prompt. Real workloads are longer. I ran a second benchmark at 8k, 16k, and 32k context to see how each runtime handles KV cache pressure.
Prompt processing (PP) — t/s across context lengths:
| Model | Runtime | 8k | 16k | 32k | Drop 8→32k |
|---|---|---|---|---|---|
| 0.8B | llama.cpp | 1808 | 1514 | 1117 | −38% |
| 0.8B | mlx_lm | 2300 | 2107 | 1814 | −21% |
| 0.8B | Ollama | 1156 | 1052 | 883 | −24% |
| 4B | llama.cpp | 339 | 305 | 257 | −24% |
| 4B | mlx_lm | 388 | 368 | 340 | −12% |
| 4B | Ollama | 276 | 254 | 226 | −18% |
| 9B | llama.cpp | 193 | — | — | OOM |
| 9B | mlx_lm | — | — | 185 | partial |
| 9B | Ollama | 175 | 165 | 150 | −14% |
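The "Drop 8→32k" column is just the relative change between the two endpoints; a one-liner for reproducing it from the table values:

```python
def pct_drop(pp_8k, pp_32k):
    """Percentage change in PP throughput from 8k to 32k context."""
    return round((pp_32k - pp_8k) / pp_8k * 100)

print(pct_drop(1808, 1117))  # 0.8B llama.cpp: -38
print(pct_drop(2300, 1814))  # 0.8B mlx_lm: -21
```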
Text generation (TG) — t/s across context lengths:
| Model | Runtime | 8k | 16k | 32k |
|---|---|---|---|---|
| 0.8B | llama.cpp | 104.7 | 104.9 | 104.8 |
| 0.8B | mlx_lm | 175.0 | 154.8 | 124.4 |
| 0.8B | Ollama | 48.8 | 46.4 | 41.9 |
| 4B | llama.cpp | 28.0 | 27.9 | 28.0 |
| 4B | mlx_lm | 37.3 | 34.5 | 30.2 |
| 4B | Ollama | 18.5 | 17.8 | 16.3 |
| 9B | llama.cpp | 17.4 | — | — |
| 9B | mlx_lm | — | — | 17.8 |
| 9B | Ollama | 13.3 | 12.8 | 12.0 |
- **llama.cpp** if you're processing short-to-medium prompts or running batch jobs at scale. It wins PP at 512 tokens by a wide margin, and its TG is competitive. At long context (8k+), its PP advantage shrinks and mlx_lm pulls ahead.
- **mlx_lm** if you're doing interactive chat or processing long documents (8k+ tokens). It wins TG for small models and — surprisingly — beats llama.cpp on PP at realistic context lengths. The 0.8B at 175 t/s TG is faster than you can read.
- **Ollama** if you want something running in five minutes with no build step. The speed tax is real but the ergonomics are unmatched.
My current daily driver: Qwen 3.5 4B Q4_K_M on llama.cpp. 28 t/s generation feels instant, 2.6 GB leaves plenty of RAM headroom, and the quality is surprisingly good for its size.