Local Model or API: Choosing the Right Backend for Hermes
This follows on from the Hermes setup post and the CPU optimisation post. The agent is running — this post is about what to do when CPU-only inference still isn't fast enough.
The Problem: CPU-Only Inference Is a Bottleneck
A Dell OptiPlex 7070 with a 9th-gen i5 or i7 pushing a 7B parameter model at Q4 quantisation will produce somewhere between 2–6 tokens per second. That's enough to see the model thinking, but too slow for anything interactive — and in an agentic setup where multiple tool calls chain together, the latency compounds badly.
The CPU optimisation post covers squeezing the most out of that hardware. But there's a ceiling. You have three realistic paths beyond it:
- Move inference to a cloud API
- Add a second-hand GPU for local inference
- A hybrid approach — local GPU for fast/private tasks, cloud for heavy ones
Option 1: Cloud APIs
Cloud inference solves the speed problem instantly. You trade a monthly bill for zero hardware effort, and most providers bill by token usage — if your agent is idle, you pay nothing. Several have free tiers that cover light personal use entirely.
All prices below are in NZD (approximately 1 USD = 1.70 NZD), per million tokens.
Groq — Best Speed on Open-Source Models
Groq runs open-source models on custom LPU silicon. It is by far the fastest cloud inference available, which makes it well suited to agentic workloads where latency compounds across many tool calls.
| Model | Input (NZD/M) | Output (NZD/M) | Speed |
|---|---|---|---|
| Llama 3.3 70B | ~$1.00 | ~$1.34 | ~280 tok/s |
| Llama 3.1 8B | ~$0.10 | ~$0.10 | ~750 tok/s |
| Gemma 2 9B | ~$0.34 | ~$0.34 | ~500 tok/s |
Groq offers a generous free tier with rate limits, which may be enough for light personal use.
Hermes models are fine-tuned from Llama base weights. Running Llama 3.x on Groq gives you a capable, fast alternative — not identical to a Hermes checkpoint but excellent for tool use.
Best for: Frequent tool calls where response latency is the main constraint.
Gemini 2.0 Flash — Best Value for Money
Gemini 2.0 Flash is the standout budget option. Fast, capable, supports function calling natively, and aggressively priced.
| Model | Input (NZD/M) | Output (NZD/M) | Notes |
|---|---|---|---|
| Gemini 2.0 Flash | ~$0.13 | ~$0.51 | Excellent tool use |
| Gemini 1.5 Flash | ~$0.13 | ~$0.51 | Slightly older |
| Gemini 2.0 Flash | Free | Free | Rate-limited |
Free tier available via Google AI Studio.
Best for: Lowest possible cost at moderate usage volumes.
OpenAI — Most Reliable for Tool Use
OpenAI's tool-calling ecosystem is the most mature. If your Hermes setup uses an OpenAI-compatible API format, GPT-4o mini is a near drop-in replacement.
| Model | Input (NZD/M) | Output (NZD/M) | Notes |
|---|---|---|---|
| GPT-4o mini | ~$0.26 | ~$1.02 | Best budget option |
| GPT-4o | ~$4.25 | ~$17.00 | Top-tier reasoning |
| o4-mini | ~$1.87 | ~$7.48 | Strong reasoning |
Best for: Broad compatibility and reliable function calling without configuration pain.
Anthropic Claude — Best for Complex Agents
Claude handles ambiguous agentic tasks and multi-step reasoning well — useful if your agent does complex chained tool calls rather than simple lookups.
| Model | Input (NZD/M) | Output (NZD/M) | Notes |
|---|---|---|---|
| Claude Haiku 3.5 | ~$1.36 | ~$6.80 | Fastest Claude |
| Claude Sonnet 4 | ~$5.10 | ~$15.30 | Best balance |
More expensive than the alternatives. The quality premium is meaningful for complex workflows; less so for simple queries.
Best for: Multi-step reasoning, ambiguous instructions, or tasks where you're actively waiting for a thoughtful response.
Together AI and OpenRouter — Open-Source Aggregators
Both platforms let you run open-source models via API at low cost. OpenRouter in particular aggregates dozens of providers and lets you compare prices dynamically — and you can often find Hermes-specific model checkpoints here.
| Platform | Example Model | Input (NZD/M) | Output (NZD/M) |
|---|---|---|---|
| Together AI | Llama 3.1 70B | ~$1.50 | ~$1.50 |
| Together AI | Llama 3.2 3B | ~$0.10 | ~$0.10 |
| OpenRouter | Llama 3.1 8B | ~$0.10 | ~$0.10 |
| OpenRouter | Mistral 7B | ~$0.10 | ~$0.10 |
Best for: Flexibility across open-source models, or specifically wanting Hermes model compatibility in the cloud.
Cloud Cost Estimate for Real Usage
Assuming a moderately active agent (roughly 500K tokens/day — dozens of agentic tasks with tool calls):
| Provider and model | Est. daily cost (NZD) | Est. monthly cost (NZD) |
|---|---|---|
| Groq Llama 3.1 8B | ~$0.05 | ~$1.50 |
| Gemini 2.0 Flash | ~$0.16 | ~$5 |
| GPT-4o mini | ~$0.30 | ~$9 |
| Groq Llama 3.3 70B | ~$0.58 | ~$17 |
| Claude Haiku 3.5 | ~$0.85 | ~$25 |
For light personal use (under 100K tokens/day), several providers' free tiers may cover your needs entirely.
Option 2: Second-Hand GPU (~$200 NZD)
At $200 NZD you're looking at two realistic options: an NVIDIA GTX 1080 or an AMD RX 580. Both have 8GB of VRAM, which is the critical number for local LLM inference.
NVIDIA GTX 1080
The 1080 is the stronger choice for AI workloads. CUDA support is mature and every local inference tool — llama.cpp, Ollama, LM Studio — supports it out of the box.
- VRAM: 8GB GDDR5X
- Architecture: Pascal (2016)
- Inference speed: ~35–55 tok/s for a 7B Q4_K_M model
- Max model size: 7B comfortably; 13B at aggressive quantisation (Q3/Q2)
- Setup: Low effort — install driver, install Ollama, run model
- Power draw: ~180W under load
Key limitation: 8GB VRAM means 7B parameter models. Hermes 7B runs well; 13B is marginal.
AMD RX 580
Tempting on price, but significantly more complicated for AI workloads.
- VRAM: 8GB GDDR5
- Architecture: Polaris (2016)
- Inference speed: ~15–25 tok/s (ROCm on Linux); slower on Windows via Vulkan
- Software support: Poor on Windows; better on Linux with ROCm — but ROCm setup is non-trivial and the RX 580 is not officially supported by ROCm's current release
- Setup: High effort — specific Linux kernel, specific driver versions, manual configuration
Verdict: Not recommended for this use case unless you enjoy Linux driver archaeology.
The Real Difference
| Setup | Tok/s (7B Q4) | Time for a 500-token response |
|---|---|---|
| OptiPlex CPU only | 2–5 | ~2–4 minutes |
| + GTX 1080 | 35–55 | ~10–14 seconds |
| Cloud (Groq) | 300–800 | ~1 second |
The 1080 is a 10–20× improvement over CPU-only. That's the difference between barely usable and genuinely practical.
GPU: Things to Check Before Buying
PSU capacity: The OptiPlex 7070 SFF (small form factor) ships with only a 180W PSU, which may not safely support a 1080 under load. The MT (mini tower) has 260W and is fine. This is the single biggest risk — check which variant you have before buying.
Physical fit: The 7070 SFF will only fit a low-profile GPU. A full-size 1080 will not fit the SFF chassis. MT is fine.
Second-hand risk: Ex-mining cards may have degraded memory. Ask for a stress test or thermal image.
Electricity: A 1080 adds roughly $15–25 NZD/month to your power bill at several hours of daily use.
SFF alternative: If you have the SFF chassis, a GTX 1650 LP (~$150–200 NZD) fits and delivers ~20–30 tok/s — less than the 1080 but a substantial improvement over CPU-only.
Head-to-Head Summary
| Option | Upfront cost | Monthly cost | Speed | Max model | Setup effort |
|---|---|---|---|---|---|
| CPU only (current) | $0 | ~$0 | 2–5 tok/s | 7B (slow) | Done |
| GTX 1080 (MT) | ~$200 NZD | ~$15–25 power | 35–55 tok/s | 7B–13B | Low |
| GTX 1650 LP (SFF) | ~$150–200 NZD | ~$10–15 power | 20–30 tok/s | 7B | Low |
| RX 580 | ~$200 NZD | ~$15–25 power | 15–25 tok/s | 7B | High |
| Groq Llama 8B | $0 | ~$1.50–5 | 500–800 tok/s | Any | Minimal |
| Gemini 2.0 Flash | $0 | ~$5–15 | Fast | Any | Minimal |
| GPT-4o mini | $0 | ~$9–30 | Fast | Any | Minimal |
A Practical Framework
Beyond raw speed, the right choice also depends on the nature of the task.
Think of it as doing versus thinking.
Doing: shell commands, device control, file operations, system queries, scheduled automations. These are execution tasks — the model needs to follow a schema reliably, not reason deeply. A local model handles these well once it's fast enough, and the privacy argument is strong.
Thinking: code review, writing, multi-step reasoning, analysing a complex situation. These are tasks where the quality gap between a 7B model and a larger cloud model is meaningful, and where you're actively waiting for a useful answer. Cloud APIs have a permanent advantage here — not just in speed but in capability.
The homelab sweet spot: local inference handles the automation and system management layer; cloud handles anything interactive where you want a thoughtful response.
Switching Backends in Hermes
Hermes supports multiple model backends. The exact flag varies by version — check with:
hermes chat --help | grep -i modelTypically:
# Default — uses whatever backend is configured in config.yaml
hermes chat -q "how much disk space do I have?"
# Explicitly request a cloud model for a task that needs quality or speed
hermes chat --model claude -q "review this and suggest improvements: ..."Setting the local model as the default and reaching for the API flag explicitly keeps the common case fast and cheap — or local and private, depending on what you've configured.
Recommendation
For most users: start with Groq or Gemini Flash.
Zero upfront cost, near-instant responses, and free tiers make cloud APIs the obvious first step. Groq's Llama 3.3 70B is a substantially more capable model than Hermes 7B running locally and responds in under a second. You can have this working in under an hour.
If you want to stay local and private: the GTX 1080 — but check your chassis first.
If your OptiPlex 7070 is the MT (mini tower) variant with a 260W PSU, the 1080 is a good upgrade. It turns the agent from frustratingly slow to genuinely practical. If you have the SFF, you need a low-profile card instead.
Skip the RX 580 for this workload. The driver pain is not worth the savings.
Hybrid: Many people run a local GPU for quick, private tasks and fall back to a cloud API for heavy lifting. With Ollama and OpenRouter both configured in your agent, this is easy to set up and covers most scenarios cleanly.
Quick Decision Guide
Do you need privacy or offline access?
├── Yes → Buy GTX 1080
│ ├── MT chassis + 260W PSU → standard 1080 (~$200 NZD)
│ └── SFF chassis → low-profile 1650 (~$150–200 NZD)
└── No → Use cloud API
├── Lowest cost → Groq Llama 3.1 8B (often free)
├── Best value → Gemini 2.0 Flash (~$5/month)
├── Best compat → GPT-4o mini (~$9/month)
└── Best quality → Claude Sonnet 4 (~$30+/month)Pricing current as of April 2026. USD/NZD conversion at approximately 1.70. Provider pricing changes frequently — check the pricing pages directly before committing.
Comments