
Benchmarking Qwen 3.5 on Apple Silicon

March 31, 2026

I got a Mac mini M4 and wanted to know exactly how fast it can run local LLMs — not "fast enough" fast, but numbers-on-the-table fast. I benchmarked four Qwen 3.5 model sizes across three runtimes and three quantization levels. Here's what I found.

Setup

Machine: Mac mini M4
Memory: 16 GB unified
Models: Qwen 3.5
Sizes: 0.8B · 2B · 4B · 9B

Three runtimes:

- llama.cpp (built with Metal, full GPU offload)
- mlx_lm (Apple's MLX framework)
- Ollama

Two metrics:

- PP — prompt processing throughput, tokens per second (t/s)
- TG — text generation throughput, t/s
Results

| Model | Quant | Runtime | PP (t/s) | TG (t/s) |
|---|---|---|---|---|
| 0.8B | Q4_K_M | llama.cpp | 2085 | 103 |
| 0.8B | Q4_K_M | mlx_lm | 553 | **192** |
| 0.8B | Q4_K_M | Ollama | 611 | 51 |
| 0.8B | Q8_0 | llama.cpp | **2174** | 81 |
| 0.8B | Q8_0 | Ollama | 870 | 51 |
| 0.8B | IQ2_XXS | llama.cpp | 2166 | 115 |
| 2B | Q4_K_M | llama.cpp | **978** | **59** |
| 2B | Q4_K_M | Ollama | 258 | 31 |
| 4B | Q4_K_M | llama.cpp | **385** | 28 |
| 4B | Q4_K_M | mlx_lm | 112 | **39** |
| 4B | Q4_K_M | Ollama | 133 | 19 |
| 9B | Q4_K_M | llama.cpp | **214** | 16 |
| 9B | Q4_K_M | mlx_lm | 22 | **22** |
| 9B | Q4_K_M | Ollama | 114 | 14 |

Bold = winner for that metric at that model size. All llama.cpp runs use -ngl 99 (full GPU offload via Metal).

What the numbers say

llama.cpp dominates prompt processing. At 0.8B it hits 2085 t/s PP — 3.4× Ollama and 3.8× mlx_lm. The gap is most extreme at 9B: 214 vs 22 t/s for mlx_lm. If you're building RAG pipelines or processing long documents, there's no contest.

mlx_lm wins generation. The 0.8B model generates at 192 t/s on mlx_lm vs 103 t/s on llama.cpp — nearly 2× faster for the actual output phase. The lead narrows at 4B (39 vs 28 t/s) and again at 9B (22 vs 16 t/s), where memory bandwidth becomes the bottleneck for everyone.

Ollama is consistently the slowest. It wraps llama.cpp internally, but the abstraction overhead is real — 2–4× slower than bare llama.cpp on PP, roughly 40% slower on TG. Worth it for the convenience; not worth it if throughput matters.

One surprise: the 0.8B IQ2_XXS (2-bit importance-aware quantization, 323 MB on disk) generates at 115 t/s on llama.cpp — faster than Q8_0 at 81 t/s. Fewer bits means less memory bandwidth pressure. For tasks where you just need fast completions from a tiny model, this is genuinely useful.

The 9B mlx_lm result is also interesting: PP and TG are both ~22 t/s. That means the model is fully memory-bandwidth-limited — it doesn't matter whether it's processing the prompt or generating; it's just reading weights as fast as unified memory allows.
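A quick roofline estimate makes the bandwidth argument concrete. Assuming the M4's 120 GB/s unified-memory bandwidth (Apple's published spec) and that generating one token requires reading the full ~5.5 GB of Q4_K_M weights once, the ceiling lands right on the measured number. A sketch — the bandwidth and weight-size figures are the assumptions here:

```python
# Roofline estimate for memory-bandwidth-bound token generation.
# Assumption: each generated token streams every weight once — no cache
# reuse, no compute overhead, no KV cache traffic.
BANDWIDTH_GBS = 120.0  # M4 unified memory bandwidth (Apple spec), GB/s
WEIGHTS_GB = 5.5       # approximate in-memory size of the 9B Q4_K_M model

ceiling_tps = BANDWIDTH_GBS / WEIGHTS_GB
print(f"Upper bound: {ceiling_tps:.1f} tokens/sec")  # ~21.8 t/s
```

That ~21.8 t/s ceiling matching the observed ~22 t/s for both PP and TG is what "fully memory-bandwidth-limited" means in practice.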

How to run this yourself

llama.cpp — build with Metal, then use llama-bench:

```shell
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j 10
./build/bin/llama-bench -m model.gguf -p 512 -n 200 -r 3 -o json -ngl 99
```
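If you sweep several models, the JSON output is easier to tabulate than the console table. A minimal parsing sketch — the field names (`n_prompt`, `n_gen`, `avg_ts`) match recent llama-bench builds, but the schema is not a stable interface, so verify against your version:

```python
import json

def summarize(bench_json: str) -> dict:
    """Map each llama-bench test (pp512, tg200, ...) to its average t/s.

    Assumes the JSON fields n_prompt, n_gen, and avg_ts emitted by
    recent llama.cpp builds; check your version's actual output.
    """
    out = {}
    for r in json.loads(bench_json):
        phase = "PP" if r.get("n_prompt", 0) > 0 else "TG"
        out[phase] = r["avg_ts"]
    return out

# Example with the 4B numbers from the table above:
sample = ('[{"n_prompt": 512, "n_gen": 0, "avg_ts": 385.2},'
          ' {"n_prompt": 0, "n_gen": 200, "avg_ts": 28.0}]')
print(summarize(sample))  # {'PP': 385.2, 'TG': 28.0}
```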

mlx_lm — install and run directly:

```shell
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3.5-4B-MLX-4bit \
  --prompt "..." --max-tokens 200
# Prompt: 83 tokens, 111.976 tokens-per-sec
# Generation: 200 tokens, 39.191 tokens-per-sec
```

Ollama — pull and time via API:

```shell
ollama pull qwen3.5:4b
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5:4b",
  "prompt": "...",
  "stream": false,
  "options": {"num_predict": 200}
}'
# Response includes prompt_eval_duration and eval_duration in nanoseconds
```
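Since the durations come back in nanoseconds, a little arithmetic turns the response into t/s. A sketch using the documented response fields (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`); the example durations are made up for illustration:

```python
def tokens_per_sec(resp: dict) -> tuple[float, float]:
    """Convert an Ollama /api/generate response into (PP, TG) tokens/sec.

    Durations are reported in nanoseconds per the Ollama API docs.
    """
    pp = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    tg = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    return pp, tg

# Illustrative numbers: an 83-token prompt in 0.62 s, 200 tokens in 10.5 s.
pp, tg = tokens_per_sec({
    "prompt_eval_count": 83, "prompt_eval_duration": 620_000_000,
    "eval_count": 200, "eval_duration": 10_500_000_000,
})
print(f"PP {pp:.1f} t/s, TG {tg:.1f} t/s")  # PP 133.9 t/s, TG 19.0 t/s
```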

Context Length: How Speed Degrades at 8k, 16k, and 32k

The benchmarks above used a 512-token prompt. Real workloads are longer. I ran a second benchmark at 8k, 16k, and 32k context to see how each runtime handles KV cache pressure.

Prompt processing (PP) — t/s across context lengths:

| Model | Runtime | 8k | 16k | 32k | Drop 8k→32k |
|---|---|---|---|---|---|
| 0.8B | llama.cpp | 1808 | 1514 | 1117 | −38% |
| 0.8B | mlx_lm | 2300 | 2107 | 1814 | −21% |
| 0.8B | Ollama | 1156 | 1052 | 883 | −24% |
| 4B | llama.cpp | 339 | 305 | 257 | −24% |
| 4B | mlx_lm | 388 | 368 | 340 | −12% |
| 4B | Ollama | 276 | 254 | 226 | −18% |
| 9B | llama.cpp | 193 | OOM | OOM | — |
| 9B | mlx_lm | 185 | partial | partial | — |
| 9B | Ollama | 175 | 165 | 150 | −14% |

Text generation (TG) — t/s across context lengths:

| Model | Runtime | 8k | 16k | 32k |
|---|---|---|---|---|
| 0.8B | llama.cpp | 104.7 | 104.9 | 104.8 |
| 0.8B | mlx_lm | 175.0 | 154.8 | 124.4 |
| 0.8B | Ollama | 48.8 | 46.4 | 41.9 |
| 4B | llama.cpp | 28.0 | 27.9 | 28.0 |
| 4B | mlx_lm | 37.3 | 34.5 | 30.2 |
| 4B | Ollama | 18.5 | 17.8 | 16.3 |
| 9B | llama.cpp | 17.4 | OOM | OOM |
| 9B | mlx_lm | 17.8 | — | — |
| 9B | Ollama | 13.3 | 12.8 | 12.0 |

mlx_lm reverses the PP result at scale. At 512 tokens, llama.cpp dominated PP (2085 vs 553 t/s for 0.8B). At 8k+ tokens, mlx_lm pulls ahead — 2300 vs 1808 t/s — and degrades far more gracefully (−21% vs −38% from 8k to 32k). Apple's unified memory architecture appears to handle KV cache pressure better at longer contexts.

TG barely moves with context length. llama.cpp 0.8B holds 104.7 → 104.9 → 104.8 t/s across 8k/16k/32k. Generation is purely memory-bandwidth-bound per token — the KV cache size doesn't change the per-token cost much at these sizes.

9B hits memory limits at long context on 16 GB. llama.cpp 9B failed at 16k and 32k (the 5.5 GB model + 32k KV cache exceeds available unified memory), and mlx_lm 9B only completed the 8k run cleanly. Ollama managed all three, likely due to more aggressive memory management. For 9B at long context, you need more than 16 GB.
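A back-of-the-envelope KV cache calculation shows why 16 GB runs out. The hyperparameters below are illustrative guesses for a ~9B-class model — layer count, KV heads, and head dim are my assumptions, not Qwen 3.5's published config:

```python
# Rough fp16 KV cache size for a hypothetical ~9B config at 32k context.
# All hyperparameters here are illustrative assumptions, not published
# Qwen 3.5 values — plug in the real config from the model card.
n_layers, n_kv_heads, head_dim = 40, 8, 128
bytes_per_elem, ctx = 2, 32_768  # fp16 cache, 32k tokens

# 2 = one K and one V tensor per layer.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx
kv_gb = kv_bytes / 1024**3
print(f"KV cache at 32k: {kv_gb:.1f} GB")  # 5.0 GB
```

Under those assumptions, ~5 GB of cache on top of 5.5 GB of weights — plus macOS and runtime overhead — leaves a 16 GB machine with nothing to spare, which is consistent with the OOMs in the table.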

Which should you use?

llama.cpp if you're processing short-to-medium prompts or running batch jobs at scale. It wins PP at 512 tokens by a wide margin, and its TG is competitive. At long context (8k+), its PP advantage shrinks and mlx_lm pulls ahead.

mlx_lm if you're doing interactive chat or processing long documents (8k+ tokens). It wins TG for small models and — surprisingly — beats llama.cpp on PP at realistic context lengths. The 0.8B at 175 t/s TG is faster than you can read.

Ollama if you want something running in five minutes with no build step. The speed tax is real but the ergonomics are unmatched.

My current daily driver: Qwen 3.5 4B Q4_K_M on llama.cpp. 28 t/s generation feels instant, 2.6 GB leaves plenty of RAM headroom, and the quality is surprisingly good for its size.