Hermes on CPU: Diagnosing a 39-Minute Response Time
This follows on from the Hermes setup post and the model selection and tuning post. The model is installed, Ollama is tuned — this post is about the layer above: Hermes itself.
The Problem
After installing Hermes with qwen2.5:7b-64k and applying all the Ollama tuning from the model selection post — flash attention, KV cache quantisation, memory locking, performance governor — a simple disk usage query took 39 minutes to respond.
The model was loaded. No swap. No disk I/O. The CPU was at 100% the entire time. Something else was wrong.
Establish a Baseline: Test Ollama Directly
Before blaming Hermes, confirm what the model itself can do by bypassing the agent entirely:
echo '{"model":"qwen2.5:7b-64k","prompt":"How much disk space does the system have?","stream":false}' > /tmp/test.json
curl -s http://localhost:11434/api/generate -d @/tmp/test.json > /tmp/result.json
python3 -c "import json; d=json.load(open('/tmp/result.json')); [print(k, d[k]) for k in ['prompt_eval_count','prompt_eval_duration','eval_count','eval_duration'] if k in d]"
On the OptiPlex 7070 with i5-9500T, this returned:
prompt_eval_count 39
prompt_eval_duration 2010984604
eval_count 304
eval_duration 74526395591
- Prefill: 39 tokens in 2 seconds = 19.4 tok/s
- Generation: 304 tokens in 74.5 seconds = 4.1 tok/s
- Total direct to Ollama: 76 seconds
The model is working correctly. At 4.1 tok/s, 39 minutes of pure generation would correspond to roughly 9,600 tokens, about 30× more than the direct test produced. Hermes was doing something the direct test was not.
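The baseline arithmetic above can be reproduced with a minimal sketch. It assumes the duration fields in the Ollama response are in nanoseconds, which matches the figures quoted; the helper name `tokens_per_second` is illustrative and not part of any tool.

```python
def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tokens per second."""
    return count / (duration_ns / 1e9)


# Values copied from the direct Ollama test above.
result = {
    "prompt_eval_count": 39,
    "prompt_eval_duration": 2010984604,
    "eval_count": 304,
    "eval_duration": 74526395591,
}

prefill_tps = tokens_per_second(result["prompt_eval_count"], result["prompt_eval_duration"])
generation_tps = tokens_per_second(result["eval_count"], result["eval_duration"])
total_seconds = (result["prompt_eval_duration"] + result["eval_duration"]) / 1e9

print(f"prefill:    {prefill_tps:.1f} tok/s")
print(f"generation: {generation_tps:.1f} tok/s")
print(f"total:      {total_seconds:.1f} s")
```

To run it against your own measurement, replace the dictionary with `json.load(open('/tmp/result.json'))`.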
How Hermes Multiplies Latency
A single Hermes request involves multiple LLM calls:
- Tool selection — the model reads the full context and decides which tool to invoke
- Tool execution — the actual command runs (fast)
- Synthesis — the model reads the tool result and generates the final response
More complex tasks chain additional calls. And critically: every LLM call receives the full system prompt — all skill and tool definitions — before your message. On a GPU this is fast enough to ignore. On a CPU at 19.4 tok/s prefill, it dominates everything.
Finding the System Prompt Size
Hermes writes a snapshot of the compiled skills prompt on startup:
wc -c ~/.hermes/.skills_prompt_snapshot.json
With the default Hermes install and all skills active, this was 45,273 bytes, approximately 11,300 tokens (at roughly 4 bytes per token) injected into every LLM call.
At 19.4 tok/s prefill, that's:
11,300 tokens ÷ 19.4 tok/s ≈ 583 seconds ≈ 9.7 minutes per call
Multiply by three calls for a simple tool-using query:
3 × 9.7 min prefill + ~2 min generation ≈ 31+ minutes
That accounts for most of the 39 minutes, and one more call in the chain would add another 9.7 minutes of prefill. The model wasn't slow: it was processing an 11,300-token system prompt on every single LLM call.
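The estimate above generalises to any snapshot size and call count. The following is a rough sketch, not a measurement tool: the 4-bytes-per-token ratio is an assumption implied by the 45,273 bytes ≈ 11,300 tokens figure, and the function name and parameters are hypothetical.

```python
BYTES_PER_TOKEN = 4.0  # assumption: rough average implied by 45,273 bytes ≈ 11,300 tokens


def estimate_query_seconds(snapshot_bytes, prefill_tps, llm_calls, generation_seconds=0.0):
    """Estimate total latency when every LLM call re-reads the full system prompt."""
    prompt_tokens = snapshot_bytes / BYTES_PER_TOKEN
    prefill_per_call = prompt_tokens / prefill_tps
    return llm_calls * prefill_per_call + generation_seconds


# Default install: 45,273-byte snapshot, 19.4 tok/s prefill, three calls, ~2 min generation.
estimate = estimate_query_seconds(45_273, 19.4, 3, 120.0)
print(f"{estimate / 60:.1f} minutes")
```

Under the same assumptions, changing `llm_calls` to 4 gives roughly 41 minutes, which shows how sensitive the total is to a single extra call.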
The Two Sources of Prompt Weight
Hermes has two separate systems that contribute to the system prompt:
Skills: optional capability bundles (GitHub integration, email, productivity tools). Each enabled skill adds its tool definitions to every prompt.
hermes skills list
Toolsets: built-in capabilities (terminal, web, browser, delegation). Always loaded but individually togglable.
hermes tools list
Both need to be audited.
Reducing Skills
With a default install, Hermes enables a wide range of skills designed for a cloud-hosted assistant. Most are irrelevant for a homelab setup.
Check the prompt size, disable what you don't need, restart the gateway, and check again:
wc -c ~/.hermes/.skills_prompt_snapshot.json
hermes skill disable <name>
hermes gateway stop && hermes gateway start
wc -c ~/.hermes/.skills_prompt_snapshot.json
For a homelab personal assistant, the clear removals are anything in categories you don't use: sub-agent orchestration, development tooling for languages you don't run through Hermes, productivity apps you don't have accounts for, media and social tools.
The improvement is proportional. Cutting the skills prompt from 45KB to 14KB reduces prefill time per call by 70% — which for a three-call query is the difference between 30 minutes and under 10.
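As a quick check on the 45KB to 14KB comparison, here is a small sketch of the before/after prefill cost. It assumes 1KB = 1,000 bytes, about 4 bytes per token, the 19.4 tok/s prefill rate measured earlier, and three LLM calls per query; `prefill_minutes` is an illustrative name.

```python
def prefill_minutes(snapshot_kb, prefill_tps=19.4, llm_calls=3, bytes_per_token=4.0):
    """Total prefill time in minutes for one query, ignoring generation."""
    tokens = snapshot_kb * 1000 / bytes_per_token  # assumption: 1KB = 1,000 bytes
    return llm_calls * tokens / prefill_tps / 60


before = prefill_minutes(45)
after = prefill_minutes(14)
reduction = 1 - after / before

print(f"before: {before:.1f} min, after: {after:.1f} min, reduction: {reduction:.0%}")
```

Because prefill time scales linearly with prompt size, the reduction depends only on the ratio of the two snapshot sizes, not on the prefill rate or the call count.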
Toolsets: Two Specific Offenders
delegation allows Hermes to spawn sub-agents to handle parts of a task. On a local 7B model this is counterproductive — for a simple disk usage query the model may delegate to a sub-agent, which means additional full LLM round-trips, each with the full system prompt. Disable it:
hermes tools disable delegation
clarify makes the model ask clarifying questions before acting when it's uncertain. This adds a full LLM call before any tool is invoked. On CPU hardware, one extra call is 5–15 minutes. Disable it:
hermes tools disable clarify
Other toolsets worth reviewing depending on your use case:
hermes tools disable browser # if you only need terminal access
hermes tools disable code_execution # if you don't need sandboxed code execution
hermes tools disable session_search # not needed for most homelab queries
Verifying the Improvement
After each round of changes, restart the gateway and measure:
hermes gateway stop && hermes gateway start
# Check the new prompt size
wc -c ~/.hermes/.skills_prompt_snapshot.json
# Time an actual query
time hermes chat -q "how much disk space do I have?"
The new prompt size tells you what to expect. Each 4KB removed from the snapshot is roughly 1,000 fewer tokens to prefill on every LLM call; at 19.4 tok/s that saves about 52 seconds per call.
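If you repeat this measurement after every change, a small wrapper keeps the numbers comparable. This is a sketch only: it reuses the snapshot path and the `hermes chat -q` command shown above, the helper names are hypothetical, and it discards the command's output so only the timing is reported.

```python
import os
import shutil
import subprocess
import time

SNAPSHOT = os.path.expanduser("~/.hermes/.skills_prompt_snapshot.json")


def snapshot_bytes(path=SNAPSHOT):
    """Return the snapshot size in bytes, or None if the file does not exist."""
    return os.path.getsize(path) if os.path.exists(path) else None


def time_command(argv):
    """Run a command to completion and return (exit code, wall-clock seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return completed.returncode, time.perf_counter() - start


# Only runs the real query when the hermes binary is on PATH.
if __name__ == "__main__" and shutil.which("hermes"):
    print(f"snapshot bytes: {snapshot_bytes()}")
    code, seconds = time_command(["hermes", "chat", "-q", "how much disk space do I have?"])
    print(f"exit code {code}, {seconds / 60:.1f} minutes")
```

Logging the snapshot size next to the elapsed time after each round makes it easy to see whether the response time is tracking the prompt size as expected.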
Minimum Viable Configuration for a Homelab Assistant
What to keep enabled by default:
| Toolset | What it does |
|---|---|
| terminal | Shell command execution (runs df, sensors, etc.) |
| file | File read/write operations |
| homeassistant | Home Assistant device control |
| memory | Persistent memory between sessions |
| todo | Task tracking |
Add skills back individually when you have a concrete use for them. Every enabled skill is a permanent cost on every LLM call, paid regardless of whether that skill is relevant to the query.
Summary
A 39-minute disk usage query does not mean the model or the hardware is misbehaving: the cause is the system prompt size, paid at CPU prefill speed on every LLM call.
| Check | Command | What to look for |
|---|---|---|
| Model loaded | ollama ps | Model listed, Until: Forever |
| No swap | free -h | Swap used: ~0 |
| No disk I/O | iostat -x 1 3 | %util near 0 during inference |
| Prompt size | wc -c ~/.hermes/.skills_prompt_snapshot.json | Target under 15KB |
| Baseline speed | Direct Ollama test | Calculate actual tok/s |
Fix in order: disable unused skills, disable delegation, disable clarify. Restart the gateway after each change and measure the improvement.