Hermes on CPU: Diagnosing a 39-Minute Response Time

This follows on from the Hermes setup post and the model selection and tuning post. The model is installed, Ollama is tuned — this post is about the layer above: Hermes itself.


The Problem

After installing Hermes with qwen2.5:7b-64k and applying all the Ollama tuning from the model selection post — flash attention, KV cache quantisation, memory locking, performance governor — a simple disk usage query took 39 minutes to respond.

The model was loaded. No swap. No disk I/O. The CPU was at 100% the entire time. Something else was wrong.


Establish a Baseline: Test Ollama Directly

Before blaming Hermes, confirm what the model itself can do by bypassing the agent entirely:

echo '{"model":"qwen2.5:7b-64k","prompt":"How much disk space does the system have?","stream":false}' > /tmp/test.json
curl -s http://localhost:11434/api/generate -d @/tmp/test.json > /tmp/result.json
python3 -c "import json; d=json.load(open('/tmp/result.json')); [print(k, d[k]) for k in ['prompt_eval_count','prompt_eval_duration','eval_count','eval_duration'] if k in d]"

On the OptiPlex 7070 with i5-9500T, this returned:

prompt_eval_count 39
prompt_eval_duration 2010984604
eval_count 304
eval_duration 74526395591

Ollama reports durations in nanoseconds, so:

  • Prefill: 39 tokens in 2.0 seconds = 19.4 tok/s
  • Generation: 304 tokens in 74.5 seconds = 4.1 tok/s
  • Total direct to Ollama: about 76.5 seconds
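The conversion from Ollama's raw counters to tok/s can be sketched in a few lines of Python. This is a minimal illustration, not part of Hermes or Ollama: the function names are made up for this post, and the only thing taken from the API is the four field names shown in the output above.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000

def tokens_per_second(count, duration_ns):
    """Convert a token count and a nanosecond duration into tok/s."""
    return count / (duration_ns / NANOSECONDS_PER_SECOND)

def summarize(result):
    """Derive prefill rate, generation rate and total time from a response dict."""
    return {
        "prefill_tok_s": tokens_per_second(
            result["prompt_eval_count"], result["prompt_eval_duration"]
        ),
        "generation_tok_s": tokens_per_second(
            result["eval_count"], result["eval_duration"]
        ),
        "total_s": (result["prompt_eval_duration"] + result["eval_duration"])
        / NANOSECONDS_PER_SECOND,
    }

# Values copied from the direct test on the OptiPlex 7070.
sample = {
    "prompt_eval_count": 39,
    "prompt_eval_duration": 2010984604,
    "eval_count": 304,
    "eval_duration": 74526395591,
}

stats = summarize(sample)
print(f"prefill:    {stats['prefill_tok_s']:.1f} tok/s")
print(f"generation: {stats['generation_tok_s']:.1f} tok/s")
print(f"total:      {stats['total_s']:.1f} s")
```

To run it against a fresh measurement, replace sample with json.load(open('/tmp/result.json')).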

The model is working correctly. If the 39 minutes through Hermes had been pure generation at 4.1 tok/s, that would be roughly 9,600 tokens, about 30× the output of the direct test. Hermes was doing something the direct test was not.


How Hermes Multiplies Latency

A single Hermes request involves several steps, and at least two of them are full LLM calls:

  1. Tool selection — the model reads the full context and decides which tool to invoke
  2. Tool execution — the actual command runs (fast)
  3. Synthesis — the model reads the tool result and generates the final response

More complex tasks chain additional calls. And critically: every LLM call receives the full system prompt — all skill and tool definitions — before your message. On a GPU this is fast enough to ignore. On a CPU at 19.4 tok/s prefill, it dominates everything.


Finding the System Prompt Size

Hermes writes a snapshot of the compiled skills prompt on startup:

wc -c ~/.hermes/.skills_prompt_snapshot.json

With the default Hermes install and all skills active, this was 45,273 bytes — approximately 11,300 tokens injected into every LLM call.

At 19.4 tok/s prefill, that's:

11,300 tokens ÷ 19.4 tok/s ≈ 583 seconds ≈ 9.7 minutes per call

Multiply by three calls for a simple tool-using query:

3 × 9.7 min prefill + ~2 min generation ≈ 31+ minutes

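That accounts for most of the 39 minutes. The estimate counts only the skills prompt, so anything else in the context, such as conversation history and tool output, adds to it. The model wasn't slow; it was processing an 11,000-token system prompt on every single LLM call.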
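The estimate above can be written as a small function so it is easy to rerun with your own numbers. This is a rough sketch under stated assumptions: about 4 bytes per token (the ratio implied by 45,273 bytes ≈ 11,300 tokens), a constant prefill rate, and a lump-sum generation time. The function name and parameters are illustrative, not anything Hermes provides.

```python
def estimate_minutes(prompt_bytes, llm_calls, prefill_tok_s,
                     generation_minutes=0.0, bytes_per_token=4.0):
    """Estimate total response time in minutes for a multi-call request.

    Assumes every LLM call prefills the whole system prompt from scratch
    and that bytes_per_token is a reasonable average for the prompt text.
    """
    prompt_tokens = prompt_bytes / bytes_per_token
    prefill_seconds_per_call = prompt_tokens / prefill_tok_s
    return llm_calls * prefill_seconds_per_call / 60 + generation_minutes

# Numbers from this post: 45,273-byte snapshot, three calls, 19.4 tok/s prefill.
total = estimate_minutes(45_273, llm_calls=3, prefill_tok_s=19.4,
                         generation_minutes=2.0)
print(f"estimated response time: {total:.1f} minutes")
```

The bytes-per-token ratio is the weakest assumption here; JSON-heavy tool definitions may tokenize less efficiently than prose, which would push the real figure higher.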


The Two Sources of Prompt Weight

Hermes has two separate systems that contribute to the system prompt:

Skills — optional capability bundles (GitHub integration, email, productivity tools). Each enabled skill adds its tool definitions to every prompt.

hermes skills list

Toolsets — built-in capabilities (terminal, web, browser, delegation). Always loaded but individually togglable.

hermes tools list

Both need to be audited.


Reducing Skills

With a default install, Hermes enables a wide range of skills designed for a cloud-hosted assistant. Most are irrelevant for a homelab setup.

Check the prompt size, disable what you don't need, restart the gateway, and check again:

wc -c ~/.hermes/.skills_prompt_snapshot.json
hermes skill disable <name>
hermes gateway stop && hermes gateway start
wc -c ~/.hermes/.skills_prompt_snapshot.json

For a homelab personal assistant, the clear removals are anything in categories you don't use: sub-agent orchestration, development tooling for languages you don't run through Hermes, productivity apps you don't have accounts for, media and social tools.

The improvement is proportional. Cutting the skills prompt from 45KB to 14KB reduces prefill time per call by about 70%, which for a three-call query is the difference between roughly 30 minutes and under 10.
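To see how the saving scales before disabling anything, the same arithmetic can be tabulated for a few candidate prompt sizes. The sketch below assumes roughly 4 bytes per token, the 19.4 tok/s prefill rate measured earlier, and three LLM calls per query; all three constants are assumptions to adjust for your own machine.

```python
BYTES_PER_TOKEN = 4.0   # assumption: rough average for the prompt text
PREFILL_TOK_S = 19.4    # prefill rate from the direct Ollama test
LLM_CALLS = 3           # call count used in this post for a simple query

def prefill_minutes(prompt_kb, calls=LLM_CALLS):
    """Total prefill time in minutes for a query, given the prompt size in KB."""
    tokens = prompt_kb * 1000 / BYTES_PER_TOKEN
    return calls * tokens / PREFILL_TOK_S / 60

def reduction(before_kb, after_kb):
    """Fractional drop in prefill work when the prompt shrinks."""
    return 1 - after_kb / before_kb

for kb in (45, 30, 20, 14):
    print(f"{kb:>3} KB -> {prefill_minutes(kb):5.1f} min of prefill")

print(f"45 KB -> 14 KB cuts prefill by {reduction(45, 14):.0%}")
```

Because prefill time is linear in prompt size under these assumptions, the percentage saved depends only on the before and after sizes, not on the hardware.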


Toolsets: Two Specific Offenders

delegation allows Hermes to spawn sub-agents to handle parts of a task. On a local 7B model this is counterproductive — for a simple disk usage query the model may delegate to a sub-agent, which means additional full LLM round-trips, each with the full system prompt. Disable it:

hermes tools disable delegation

clarify makes the model ask clarifying questions before acting when it's uncertain. This adds a full LLM call before any tool is invoked. On CPU hardware, one extra call is 5–15 minutes. Disable it:

hermes tools disable clarify

Other toolsets worth reviewing depending on your use case:

hermes tools disable browser        # if you only need terminal access
hermes tools disable code_execution # if you don't need sandboxed code execution
hermes tools disable session_search # not needed for most homelab queries


Verifying the Improvement

After each round of changes, restart the gateway and measure:

hermes gateway stop && hermes gateway start

# Check the new prompt size
wc -c ~/.hermes/.skills_prompt_snapshot.json

# Time an actual query
time hermes chat -q "how much disk space do I have?"

The new prompt size gives a good estimate of what to expect. Each 4KB removed from the snapshot is roughly 1,000 fewer tokens to prefill on every LLM call. At 19.4 tok/s that is about 50 seconds saved per call, or around two and a half minutes on a three-call query.
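If you are doing several rounds of pruning, it can help to record the snapshot size and the wall-clock time together. The helper below is a hypothetical sketch: measure is not a Hermes command, it just wraps whatever command you pass it and reads the size of the file you point it at.

```python
import os
import subprocess
import time

def measure(command, snapshot_path):
    """Run a command once and return the snapshot size and elapsed seconds.

    command is an argument list such as
    ["hermes", "chat", "-q", "how much disk space do I have?"].
    """
    size = os.path.getsize(os.path.expanduser(snapshot_path))
    start = time.perf_counter()
    completed = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return {
        "snapshot_bytes": size,
        "elapsed_s": elapsed,
        "returncode": completed.returncode,
    }

# Example call (commented out so the sketch runs without Hermes installed):
# print(measure(["hermes", "chat", "-q", "how much disk space do I have?"],
#               "~/.hermes/.skills_prompt_snapshot.json"))
```

Appending each result to a list or a CSV file after every gateway restart gives a simple before-and-after record of which removals actually paid off.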


Minimum Viable Configuration for a Homelab Assistant

What to keep enabled by default:

Component               What it does
terminal toolset        Shell command execution (runs df, sensors, etc.)
file toolset            File read/write operations
homeassistant toolset   Home Assistant device control
memory toolset          Persistent memory between sessions
todo toolset            Task tracking

Add skills back individually when you have a concrete use for them. Every enabled skill is a permanent cost on every LLM call, paid regardless of whether that skill is relevant to the query.


Summary

A 39-minute disk usage query is not a fault in the model or in the Ollama tuning. It is the size of the system prompt, prefilled at CPU speed on every LLM call.

Check            Command                                        What to look for
Model loaded     ollama ps                                      Model listed, Until: Forever
No swap          free -h                                        Swap used: ~0
No disk I/O      iostat -x 1 3                                  %util near 0 during inference
Prompt size      wc -c ~/.hermes/.skills_prompt_snapshot.json   Target under 15KB
Baseline speed   Direct Ollama test                             Calculate actual tok/s

Fix in order: disable unused skills, disable delegation, disable clarify. Restart the gateway after each change and measure the improvement.