
Benchmarking Qwen 3.5 on Apple Silicon

March 31, 2026

I got a Mac mini M4 and wanted to know exactly how fast it can run local LLMs — not "fast enough" fast, but numbers-on-the-table fast. I benchmarked four Qwen 3.5 model sizes across three runtimes and three quantization levels. Here's what I found.

Setup

Machine: Mac mini M4
Memory: 16 GB unified
Models: Qwen 3.5
Sizes: 0.8B · 2B · 4B · 9B

Three runtimes:

- llama.cpp (built with Metal, full GPU offload)
- mlx_lm (Apple's MLX framework)
- Ollama

Two metrics:

- PP — prompt processing throughput, tokens per second (t/s)
- TG — text generation throughput, t/s
Results

| Model | Quant | Runtime | PP (t/s) | TG (t/s) |
|---|---|---|---|---|
| 0.8B | Q4_K_M | llama.cpp | 2085 | 103 |
| 0.8B | Q4_K_M | mlx_lm | 553 | **192** |
| 0.8B | Q4_K_M | Ollama | 611 | 51 |
| 0.8B | Q8_0 | llama.cpp | **2174** | 81 |
| 0.8B | Q8_0 | Ollama | 870 | 51 |
| 0.8B | IQ2_XXS | llama.cpp | 2166 | 115 |
| 2B | Q4_K_M | llama.cpp | **978** | **59** |
| 2B | Q4_K_M | Ollama | 258 | 31 |
| 4B | Q4_K_M | llama.cpp | **385** | 28 |
| 4B | Q4_K_M | mlx_lm | 112 | **39** |
| 4B | Q4_K_M | Ollama | 133 | 19 |
| 9B | Q4_K_M | llama.cpp | **214** | 16 |
| 9B | Q4_K_M | mlx_lm | 22 | **22** |
| 9B | Q4_K_M | Ollama | 114 | 14 |

Bold = winner for that metric at that model size. All llama.cpp runs use -ngl 99 (full GPU offload via Metal).

What the numbers say

llama.cpp dominates prompt processing. At 0.8B it hits 2085 t/s PP — 3.4× Ollama and 3.8× mlx_lm. The gap is most extreme at 9B: 214 vs 22 t/s for mlx_lm. If you're building RAG pipelines or processing long documents, there's no contest.

mlx_lm wins generation. The 0.8B model generates at 192 t/s on mlx_lm vs 103 t/s on llama.cpp — nearly 2× faster for the actual output phase. The lead narrows at 4B (39 vs 28 t/s) and again at 9B (22 vs 16 t/s), where memory bandwidth becomes the bottleneck for everyone.

Ollama is consistently the slowest. It wraps llama.cpp internally, but the abstraction overhead is real — 2–4× slower than bare llama.cpp on PP, roughly 40% slower on TG. Worth it for the convenience; not worth it if throughput matters.

One surprise: the 0.8B IQ2_XXS (2-bit importance-aware quantization, 323 MB on disk) generates at 115 t/s on llama.cpp — faster than Q8_0 at 81 t/s. Fewer bits means less memory bandwidth pressure. For tasks where you just need fast completions from a tiny model, this is genuinely useful.

The 9B mlx_lm result is also interesting: PP and TG are both ~22 t/s. That means the model is fully memory-bandwidth-limited — it doesn't matter whether it's processing the prompt or generating; it's just reading weights as fast as unified memory allows.
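A quick roofline estimate makes the bandwidth argument concrete. Assuming the M4's 120 GB/s unified-memory bandwidth (Apple's published spec) and that generating one token requires reading the full ~5.5 GB of Q4_K_M weights once, the ceiling lands right on the measured number. A sketch — the bandwidth and weight-size figures are the assumptions here:

```python
# Roofline estimate for memory-bandwidth-bound token generation.
# Assumption: each generated token streams every weight once — no cache
# reuse, no compute overhead, no KV cache traffic.
BANDWIDTH_GBS = 120.0  # M4 unified memory bandwidth (Apple spec), GB/s
WEIGHTS_GB = 5.5       # approximate in-memory size of the 9B Q4_K_M model

ceiling_tps = BANDWIDTH_GBS / WEIGHTS_GB
print(f"Upper bound: {ceiling_tps:.1f} tokens/sec")  # ~21.8 t/s
```

That ~21.8 t/s ceiling matching the observed ~22 t/s for both PP and TG is what "fully memory-bandwidth-limited" means in practice.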

How to run this yourself

llama.cpp — build with Metal, then use llama-bench:

```shell
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j 10
./build/bin/llama-bench -m model.gguf -p 512 -n 200 -r 3 -o json -ngl 99
```
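If you sweep several models, the JSON output is easier to tabulate than the console table. A minimal parsing sketch — the field names (`n_prompt`, `n_gen`, `avg_ts`) match recent llama-bench builds, but the schema is not a stable interface, so verify against your version:

```python
import json

def summarize(bench_json: str) -> dict:
    """Map each llama-bench test (pp512, tg200, ...) to its average t/s.

    Assumes the JSON fields n_prompt, n_gen, and avg_ts emitted by
    recent llama.cpp builds; check your version's actual output.
    """
    out = {}
    for r in json.loads(bench_json):
        phase = "PP" if r.get("n_prompt", 0) > 0 else "TG"
        out[phase] = r["avg_ts"]
    return out

# Example with the 4B numbers from the table above:
sample = ('[{"n_prompt": 512, "n_gen": 0, "avg_ts": 385.2},'
          ' {"n_prompt": 0, "n_gen": 200, "avg_ts": 28.0}]')
print(summarize(sample))  # {'PP': 385.2, 'TG': 28.0}
```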

mlx_lm — install and run directly:

```shell
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3.5-4B-MLX-4bit \
  --prompt "..." --max-tokens 200
# Prompt: 83 tokens, 111.976 tokens-per-sec
# Generation: 200 tokens, 39.191 tokens-per-sec
```

Ollama — pull and time via API:

```shell
ollama pull qwen3.5:4b
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5:4b",
  "prompt": "...",
  "stream": false,
  "options": {"num_predict": 200}
}'
# Response includes prompt_eval_duration and eval_duration in nanoseconds
```
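Since the durations come back in nanoseconds, a little arithmetic turns the response into t/s. A sketch using the documented response fields (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`); the example durations are made up for illustration:

```python
def tokens_per_sec(resp: dict) -> tuple[float, float]:
    """Convert an Ollama /api/generate response into (PP, TG) tokens/sec.

    Durations are reported in nanoseconds per the Ollama API docs.
    """
    pp = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    tg = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    return pp, tg

# Illustrative numbers: an 83-token prompt in 0.62 s, 200 tokens in 10.5 s.
pp, tg = tokens_per_sec({
    "prompt_eval_count": 83, "prompt_eval_duration": 620_000_000,
    "eval_count": 200, "eval_duration": 10_500_000_000,
})
print(f"PP {pp:.1f} t/s, TG {tg:.1f} t/s")  # PP 133.9 t/s, TG 19.0 t/s
```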

Context Length: How Speed Degrades at 8k, 16k, and 32k

The benchmarks above used a 512-token prompt. Real workloads are longer. I ran a second benchmark at 8k, 16k, and 32k context to see how each runtime handles KV cache pressure.

Prompt processing (PP) — t/s across context lengths:

| Model | Runtime | 8k | 16k | 32k | Drop 8k→32k |
|---|---|---|---|---|---|
| 0.8B | llama.cpp | 1808 | 1514 | 1117 | −38% |
| 0.8B | mlx_lm | 2300 | 2107 | 1814 | −21% |
| 0.8B | Ollama | 1156 | 1052 | 883 | −24% |
| 4B | llama.cpp | 339 | 305 | 257 | −24% |
| 4B | mlx_lm | 388 | 368 | 340 | −12% |
| 4B | Ollama | 276 | 254 | 226 | −18% |
| 9B | llama.cpp | 193 | OOM | OOM | — |
| 9B | mlx_lm | 185 | partial | partial | — |
| 9B | Ollama | 175 | 165 | 150 | −14% |

Text generation (TG) — t/s across context lengths:

| Model | Runtime | 8k | 16k | 32k |
|---|---|---|---|---|
| 0.8B | llama.cpp | 104.7 | 104.9 | 104.8 |
| 0.8B | mlx_lm | 175.0 | 154.8 | 124.4 |
| 0.8B | Ollama | 48.8 | 46.4 | 41.9 |
| 4B | llama.cpp | 28.0 | 27.9 | 28.0 |
| 4B | mlx_lm | 37.3 | 34.5 | 30.2 |
| 4B | Ollama | 18.5 | 17.8 | 16.3 |
| 9B | llama.cpp | 17.4 | OOM | OOM |
| 9B | mlx_lm | 17.8 | — | — |
| 9B | Ollama | 13.3 | 12.8 | 12.0 |

mlx_lm reverses the PP result at scale. At 512 tokens, llama.cpp dominated PP (2085 vs 553 t/s for 0.8B). At 8k+ tokens, mlx_lm pulls ahead — 2300 vs 1808 t/s — and degrades far more gracefully (−21% vs −38% from 8k to 32k). Apple's unified memory architecture appears to handle KV cache pressure better at longer contexts.

TG barely moves with context length. llama.cpp 0.8B holds 104.7 → 104.9 → 104.8 t/s across 8k/16k/32k. Generation is purely memory-bandwidth-bound per token — the KV cache size doesn't change the per-token cost much at these sizes.

9B hits memory limits at long context on 16 GB. llama.cpp 9B failed at 16k and 32k (the 5.5 GB model + 32k KV cache exceeds available unified memory), and mlx_lm 9B only completed the 8k run cleanly. Ollama managed all three, likely due to more aggressive memory management. For 9B at long context, you need more than 16 GB.
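A back-of-the-envelope KV cache calculation shows why 16 GB runs out. The hyperparameters below are illustrative guesses for a ~9B-class model — layer count, KV heads, and head dim are my assumptions, not Qwen 3.5's published config:

```python
# Rough fp16 KV cache size for a hypothetical ~9B config at 32k context.
# All hyperparameters here are illustrative assumptions, not published
# Qwen 3.5 values — plug in the real config from the model card.
n_layers, n_kv_heads, head_dim = 40, 8, 128
bytes_per_elem, ctx = 2, 32_768  # fp16 cache, 32k tokens

# 2 = one K and one V tensor per layer.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx
kv_gb = kv_bytes / 1024**3
print(f"KV cache at 32k: {kv_gb:.1f} GB")  # 5.0 GB
```

Under those assumptions, ~5 GB of cache on top of 5.5 GB of weights — plus macOS and runtime overhead — leaves a 16 GB machine with nothing to spare, which is consistent with the OOMs in the table.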

Which should you use?

llama.cpp if you're processing short-to-medium prompts or running batch jobs at scale. It wins PP at 512 tokens by a wide margin, and its TG is competitive. At long context (8k+), its PP advantage shrinks and mlx_lm pulls ahead.

mlx_lm if you're doing interactive chat or processing long documents (8k+ tokens). It wins TG for small models and — surprisingly — beats llama.cpp on PP at realistic context lengths. The 0.8B at 175 t/s TG is faster than you can read.

Ollama if you want something running in five minutes with no build step. The speed tax is real but the ergonomics are unmatched.

My current daily driver: Qwen 3.5 4B Q4_K_M on llama.cpp. 28 t/s generation feels instant, 2.6 GB leaves plenty of RAM headroom, and the quality is surprisingly good for its size.