Choosing a Local LLM for CPU-Only Inference
This is a follow-on from the Hermes setup post. The gateway is running, Ollama is installed — the question is which model to actually use. The default llama3.2:3b doesn't reliably use tools. Here's how to think through the alternatives.
The Constraints
Before comparing models it helps to be precise about what we need:
- No GPU. The OptiPlex 7070 has no discrete graphics card, so everything runs on the CPU via llama.cpp (which Ollama uses under the hood). Speed is measured in single-digit tokens per second, not tokens per millisecond.
- 16 GB RAM total. This has to cover the OS, Ollama, Hermes, and the model itself. In practice the OS takes 1.5–2 GB at idle, leaving around 13–14 GB for the model stack.
- 64k context window. Hermes requires this to be configured explicitly. A model must support at least 64k tokens.
- Reliable tool use. The model needs to call tools rather than describe what tools to run. This is a hard requirement — a model that narrates instead of acts is useless for an agent.
The last two constraints interact in a way that's easy to underestimate.
Why 64k Context Costs More Than Just Context
When a model processes a long conversation, it stores intermediate attention data for every token it has seen. This is called the KV cache (key-value cache). It grows with context length and lives entirely in RAM alongside the model weights.
The formula for KV cache size at a given context length:
KV cache = 2 × layers × kv_heads × head_dim × context_tokens × 2 bytes (fp16)The critical variable here is kv_heads. Models with grouped query attention (GQA) use far fewer KV heads than attention heads. A 7B model might have 32 attention heads but only 4 or 8 KV heads — which reduces the KV cache by 4× to 8× compared to a model with no GQA.
For a 64k token context this matters enormously:
| Model | Weights | KV heads | KV cache @ 64k | Total in RAM | Est. tok/s | Tool use |
|---|---|---|---|---|---|---|
llama3.2:3b |
~2.0 GB | 8 | ~3.7 GB | ~5.7 GB | 12–18 | ✗ unreliable |
qwen2.5:7b |
~4.7 GB | 4 | ~1.9 GB | ~6.6 GB | 5–8 | ✓ strong |
llama3.1:8b |
~4.9 GB | 8 | ~4.3 GB | ~9.2 GB | 4–7 | ✓ strong |
mistral-nemo:12b |
~7.1 GB | 8 | ~5.4 GB | ~12.5 GB | 2–4 | ✓ good |
qwen2.5:14b |
~8.9 GB | 8 | ~7.5 GB | ~16.4 GB | 1–3 | ✓ strong |
Weights are Q4_K_M quantised (typical Ollama default). KV cache figures assume fp16 precision. Speed estimates are for the OptiPlex 7070's DDR4-2666 dual-channel memory (~40 GB/s bandwidth) — see the speed section below.
qwen2.5:14b lands at 16.4 GB before OS overhead — it doesn't fit. mistral-nemo:12b at 12.5 GB plus OS gives 14+ GB, which is on the edge and leaves no headroom. Both are effectively eliminated.
qwen2.5:7b is the standout: Qwen2.5's GQA design drops the KV cache to 1.9 GB, making it comfortably the most RAM-efficient model at long contexts.
Speed on CPU-Only Hardware
CPU inference speed is almost entirely limited by memory bandwidth, not compute. Each generated token requires reading the full model weights from RAM into the CPU cache. The OptiPlex 7070 with DDR4-2666 in dual-channel configuration has roughly 40 GB/s of bandwidth — that's the ceiling everything runs against.
A rough estimate of generation speed:
tok/s ≈ memory_bandwidth / model_size_on_disk
≈ 40 GB/s / 4.7 GB ≈ 8 tok/s (for qwen2.5:7b, theoretical)In practice, attention computation and other overhead bring this down by 20–40%, giving the ranges in the table above.
What these speeds feel like:
| Speed | What it means |
|---|---|
| 1–3 tok/s | A 100-word response takes over a minute. Too slow for interactive use. |
| 4–8 tok/s | A 100-word response in 15–30 seconds. Acceptable for background tasks. |
| 10+ tok/s | A 100-word response in under 10 seconds. Feels reasonably responsive. |
For a personal assistant running tool calls, response quality matters more than raw speed — a correct answer in 20 seconds is better than a wrong one in 5. But if Hermes is doing interactive back-and-forth, 4–8 tok/s starts to feel slow for longer responses.
Two types of latency to be aware of:
Generation speed (the numbers above) is how fast tokens appear once the model starts responding. Prefill speed is how long the model takes to process the input before generating anything — this matters when the conversation history is long. At short context (a few hundred tokens), prefill is near-instant. At 32k+ tokens of history it can add several seconds of waiting before the first word appears. On CPU this is noticeably slower than on GPU.
Thermal throttling is a real concern on the OptiPlex 7070 Micro. The small chassis has limited airflow, and under sustained inference load the CPU will eventually throttle. In practice this means the first response is near peak speed and subsequent responses in a long session may be 20–30% slower. The T-series CPUs (lower TDP) throttle earlier; the full-power i5-9500 or i7-9700 sustain better under load but run hotter.
The 3B model's speed advantage (12–18 tok/s) sounds appealing but is irrelevant if it won't reliably use tools — you get fast wrong answers instead of slower correct ones.
Candidate Models
llama3.2:3b — skip it
This was the obvious starting point (smallest, fastest) but it fails on tool use. The 3B model recognises what tool to use but doesn't reliably follow the function-calling schema. An agent that narrates instead of acts isn't useful.
It does run at 12–18 tok/s — noticeably snappier than anything larger — but speed is irrelevant when the output is wrong. Small models trained on function calling have improved significantly at the 7–8B scale. Below that the behaviour is too inconsistent for a personal assistant that's meant to run unattended.
qwen2.5:7b — the recommendation
Qwen2.5 was trained with heavy emphasis on instruction following and function calling. The 7B variant:
- Fits comfortably in 16 GB even at full 64k context (~6.6 GB, leaving 7+ GB for OS and Hermes)
- Tool use is reliable — it follows function call schemas rather than describing them
- Supports 128k context natively (configured to 64k for Hermes)
- Runs at roughly 5–8 tok/s on the full-power OptiPlex i5-9500/i7-9700, or 3–5 tok/s on the T-series variants — a 150-word response takes 20–30 seconds or 30–50 seconds respectively
This is the model to start with.
ollama pull qwen2.5:7bllama3.1:8b — solid alternative
Meta specifically trained Llama 3.1 for agentic and function-calling use cases. It performs well on tool use and the 128k context support is real.
The downside is RAM: 9.2 GB for model + KV cache at 64k, plus the OS, gets close to 11–12 GB. It fits in 16 GB but leaves less headroom than Qwen2.5:7b. Speed is similar to Qwen2.5:7b at 4–7 tok/s — the slightly larger model size costs a little, though the difference in practice is hard to notice.
Worth trying if Qwen2.5 has any compatibility issues with Hermes, or if you want a second opinion on a query.
ollama pull llama3.1:8bqwen2.5:14b and mistral-nemo:12b — over budget at 64k
Both exceed the available RAM once you account for OS overhead and a 64k KV cache. They're viable on machines with 32 GB of RAM, but not here. Even if they fit, mistral-nemo:12b at 2–4 tok/s and qwen2.5:14b at 1–3 tok/s make for a noticeably sluggish assistant — every tool call and follow-up adds up.
Configuring the Context Window in Ollama
Ollama defaults to a short context (typically 2k–4k tokens depending on the model) to save memory. Hermes needs 64k configured explicitly.
The cleanest approach is a Modelfile:
cat << 'EOF' > Modelfile
FROM qwen2.5:7b
PARAMETER num_ctx 65536
EOF
ollama create qwen2.5:7b-64k -f ModelfileThen point Hermes at qwen2.5:7b-64k instead of the base model name. The -64k suffix makes it obvious what's been configured.
You can verify the context is set:
ollama show qwen2.5:7b-64k --modelinfo | grep contextFor llama3.1:8b:
cat << 'EOF' > Modelfile
FROM llama3.1:8b
PARAMETER num_ctx 65536
EOF
ollama create llama3.1:8b-64k -f ModelfileQuantising the KV Cache
The KV cache figures in the table above assume fp16 precision — the Ollama default. Ollama also supports storing the cache in int8 (q8_0), which halves its size with negligible quality impact.
| Model | KV cache (fp16) | KV cache (q8_0) | Saving |
|---|---|---|---|
qwen2.5:7b-64k |
~1.9 GB | ~950 MB | ~950 MB |
llama3.1:8b-64k |
~4.3 GB | ~2.2 GB | ~2.1 GB |
For qwen2.5:7b the saving is modest — GQA already keeps its cache small. For llama3.1:8b it's substantial, dropping total RAM from ~9.2 GB to ~7.1 GB and meaningfully reducing the prefill pause before each response.
Enable it by setting an environment variable before starting Ollama:
OLLAMA_KV_CACHE_TYPE=q8_0 ollama serveTo make it permanent, add it to the Ollama systemd service override:
sudo systemctl edit ollamaIn the editor that opens, add:
[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"Save, then restart Ollama:
sudo systemctl restart ollamaVerify it's active:
sudo systemctl show ollama --property=EnvironmentWhy q8_0 is safe to enable: KV cache quantisation is not the same as weight quantisation. The model weights (Q4_K_M) encode the model's knowledge and reasoning — quantising them aggressively causes visible quality degradation. The KV cache stores intermediate attention state for the current conversation only. Rounding that to int8 introduces tiny errors in attention patterns, but the effect on output quality is below the noise floor for conversational use. It's safe to enable by default.
CPU Performance Tuning
KV cache quantisation is one lever. There are several others worth knowing about, each targeting a different bottleneck.
Flash attention
OLLAMA_FLASH_ATTN=1 restructures how the model computes attention to avoid materialising the full attention matrix in memory. On CPU the main benefit is prefill speed — the silent pause before the first token while the model processes the conversation history. At 32k+ tokens of context this pause is measurable in seconds; FlashAttention typically reduces it by 30–50%.
OLLAMA_FLASH_ATTN=1 ollama serveRequires Ollama 0.2+. Check your version: ollama --version.
Thread count
By default Ollama may use all logical CPUs, including hyperthreaded cores. For inference, using more threads than physical cores often hurts — hyperthreaded "cores" share execution units and compete rather than help.
Find your physical core count:
lscpu | grep "Core(s) per socket"Set Ollama to match:
OLLAMA_NUM_THREADS=6 ollama serve # adjust to your core countThe i5-9500 has 6 physical cores; the i7-9700 has 8. Neither has hyperthreading, so the default is probably already correct for those CPUs — but it's worth checking on other hardware.
Memory locking
By default the OS can page model weights to swap if RAM gets tight. Even a brief swap access stalls inference. Lock the weights in RAM:
OLLAMA_MLOCK=1 ollama serveThis requires that the model actually fits — if it doesn't, Ollama will refuse to load rather than silently falling back to swap. With qwen2.5:7b-64k using ~6.6 GB on a 16 GB machine there's plenty of headroom.
CPU frequency governor
Linux power management defaults to a conservative or schedutil governor that scales clock speed with load. There's a lag — the CPU ramps up over a few hundred milliseconds, meaning the first tokens of each response may run at a lower frequency.
Set the governor to performance to stay at full speed:
# Check what's currently active
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Set to performance (resets on reboot)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governorTo make it permanent:
sudo apt install cpufrequtils
echo 'GOVERNOR="performance"' | sudo tee /etc/default/cpufrequtils
sudo systemctl restart cpufrequtilsCaveat for the OptiPlex 7070 Micro: the performance governor keeps the CPU at full clock speed, which means more heat in an already tight chassis. If the system is already throttling under load, the governor gain may be offset by earlier thermal throttling — watch temperatures with sensors or watch -n1 "cat /sys/class/thermal/thermal_zone*/temp".
Model keep-alive
Ollama unloads the model from RAM after 5 minutes of inactivity by default. The next request then pays a cold-start penalty of several seconds while the model is reloaded from disk.
For a personal assistant that gets used in bursts throughout the day, keep the model loaded indefinitely:
OLLAMA_KEEP_ALIVE=-1 ollama serveOr set a longer timeout if you'd rather reclaim RAM overnight:
OLLAMA_KEEP_ALIVE=4h ollama servePutting it all together
Add all of the above to the Ollama systemd override so they apply on every start:
sudo systemctl edit ollama[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_FLASH_ATTN=1"
Environment="OLLAMA_NUM_THREADS=6"
Environment="OLLAMA_MLOCK=1"
Environment="OLLAMA_KEEP_ALIVE=-1"Adjust OLLAMA_NUM_THREADS to your physical core count. Then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollamaVerify the settings are active:
sudo systemctl show ollama --property=EnvironmentTesting Tool Use
A quick smoke test after switching models — ask something that requires a tool call rather than a knowledge answer:
How much disk space do I have left?
What's the current CPU temperature?
List the files in my home directory.The correct behaviour is Hermes invoking a tool and returning the actual result from the system. If the model responds with instructions on how to check those things yourself, it's not using tools — try the other model or check the Hermes gateway logs for errors.
Recommended Setup
# Pull both candidates
ollama pull qwen2.5:7b
ollama pull llama3.1:8b
# Create 64k-context variants
cat << 'EOF' > Modelfile.qwen
FROM qwen2.5:7b
PARAMETER num_ctx 65536
EOF
ollama create qwen2.5:7b-64k -f Modelfile.qwen
cat << 'EOF' > Modelfile.llama
FROM llama3.1:8b
PARAMETER num_ctx 65536
EOF
ollama create llama3.1:8b-64k -f Modelfile.llama
# Point Hermes at the primary model
# In Hermes config: model = qwen2.5:7b-64kStart with qwen2.5:7b-64k. It has the best RAM efficiency at 64k context, proven tool use, and runs at 5–8 tok/s — the fastest you'll get while still having reliable tool use on this hardware. Switch to llama3.1:8b-64k if you hit any issues — the two models complement each other well and it's useful to have both available.
The 14B and larger models are for machines with more RAM or a GPU. On 16 GB CPU-only hardware, the 7–8B range is the practical sweet spot: small enough to fit comfortably, fast enough for assistant use, and large enough for reliable tool calling.
Expected Response Times
Each Hermes request involves at least two LLM calls — one to decide which tool to use, one to synthesise the result — so even the simplest query has a meaningful floor. On top of that, Hermes injects a system prompt containing all registered tool definitions before every request, which adds prefill overhead even in a fresh session.
The OptiPlex 7070 ships with either a full-power CPU (i5-9500, i7-9700) or a T-series low-TDP variant (i5-9500T, i7-9700T). Both use the same DDR4-2666 dual-channel memory (~40 GB/s bandwidth), which is the primary bottleneck for inference. The difference shows up under sustained load: the T-series has a 35W TDP vs 65W, a lower base clock (2.2 GHz vs 3.0 GHz), and throttles earlier under heat. Generation speed is roughly 3–5 tok/s on the T-series vs 5–8 tok/s on the full-power variants, with the gap widening as the session runs longer.
Session length matters. Configuring num_ctx 65536 pre-allocates the KV cache for up to 64k tokens, but each request only pays prefill cost for the tokens actually in the conversation at that moment. A short session prefills quickly; a session with tens of thousands of tokens of history adds significant silent delay before each response — before a single output token is generated.
Rule of thumb: if responses are getting progressively slower across a session, prefill growth is the cause. Start a fresh session to reset it.
Diagnosing Slow Responses
Work through these checks in order.
1. Check if the model is loaded
ollama psIf qwen2.5:7b-64k is not listed, it was unloaded and is being reloaded from disk on every request. Fix: set OLLAMA_KEEP_ALIVE=-1 in the systemd override.
2. Check for swap usage
free -hLook at the Swap used column. Any non-zero swap while Ollama is running means model weights are being paged to disk — this alone can turn a 20-second response into 5+ minutes. Fix: set OLLAMA_MLOCK=1 and identify what else is consuming RAM with htop.
3. Check CPU frequency
watch -n1 "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"A healthy i5-9500 or i7-9700 should hold 3,000,000–4,000,000 (3–4 GHz) under load. If you're seeing 800,000–1,200,000 (800 MHz–1.2 GHz), the CPU is severely throttling. Let it cool down before testing again.
4. Check CPU temperature
sensorsInstall if needed: sudo apt install lm-sensors && sudo sensors-detect. On the OptiPlex 7070 Micro, sustained temperatures above 90°C will cause throttling. The small chassis has limited airflow — ensure the vents are clear and consider a small external fan if this is a recurring problem.
5. Count the tool calls
Check the Hermes logs to see how many tool calls a given request is triggering. A simple question that generates 5+ tool calls indicates the model is over-planning — try rephrasing as a more specific, single-intent request.
journalctl -u hermes --since "5 minutes ago"6. Reset the session
If responses have been getting progressively slower across a long session, the prefill overhead from accumulated conversation history is the cause. Start a fresh Hermes session to reset it. With OLLAMA_FLASH_ATTN=1 enabled this degrades more slowly, but there is no avoiding it entirely on CPU.
Quick-reference fixes
| Symptom | Likely cause | Fix |
|---|---|---|
| First request slow, subsequent fast | Cold model load | OLLAMA_KEEP_ALIVE=-1 |
| All requests extremely slow (5+ min) | Swap usage | OLLAMA_MLOCK=1, free RAM |
| Responses getting slower across session | Prefill growth | Start a fresh session |
| Later responses slower than earlier ones | Thermal throttling | Let cool, check airflow |
| Simple queries triggering many tool calls | Model over-planning | Rephrase request more specifically |
Summary
On 16 GB CPU-only hardware with a 64k context requirement, qwen2.5:7b-64k is the right model. Its grouped query attention design keeps the KV cache at ~1.9 GB — less than half of llama3.1:8b's ~4.3 GB — which is the decisive factor at long context lengths. It fits comfortably, uses tools reliably, and runs at 5–8 tok/s.
llama3.1:8b-64k is a solid backup. It uses more RAM and runs slightly slower, but tool use is strong and it's worth having pulled for comparison.
Anything 12B or larger doesn't fit at 64k context on 16 GB. Save those for machines with more RAM or a GPU.
Once the model is chosen, the tuning matters as much as the choice. In rough order of impact:
OLLAMA_KV_CACHE_TYPE=q8_0— halves KV cache RAM and reduces prefill timeOLLAMA_FLASH_ATTN=1— reduces the prefill pause at long context by 30–50%OLLAMA_MLOCK=1— prevents swap, which is the most common cause of extreme slowdownsOLLAMA_KEEP_ALIVE=-1— keeps the model loaded so every request doesn't pay a cold-start penalty- CPU governor → performance — eliminates clock ramp-up lag at the start of each response
Response times vary significantly with session history length, CPU variant, and how many tool calls a given request triggers — the T-series throttles earlier under sustained load and runs at roughly 3–5 tok/s vs 5–8 tok/s on the full-power i5-9500/i7-9700. If responses are dramatically slower than a fresh session baseline, work through the diagnostics above — swap usage, thermal throttling, and a cold model load are the most common fixable causes.
Comments