Local Model or API: Choosing the Right Backend for Hermes

This follows on from the Hermes setup post and the CPU optimisation post. The agent is running — this post is about what to do when CPU-only inference still isn't fast enough.


The Problem: CPU-Only Inference Is a Bottleneck

A Dell OptiPlex 7070 with a 9th-gen i5 or i7 pushing a 7B parameter model at Q4 quantisation will produce somewhere between 2 and 5 tokens per second. That's enough to watch the model think, but too slow for anything interactive, and in an agentic setup where multiple tool calls chain together, the latency compounds badly.

The CPU optimisation post covers squeezing the most out of that hardware. But there's a ceiling. You have three realistic paths beyond it:

  1. Move inference to a cloud API
  2. Add a second-hand GPU for local inference
  3. Take a hybrid approach: local GPU for fast or private tasks, cloud for heavy ones

Option 1: Cloud APIs

Cloud inference solves the speed problem instantly. You trade a monthly bill for zero hardware effort, and most providers bill by token usage — if your agent is idle, you pay nothing. Several have free tiers that cover light personal use entirely.

All prices below are in NZD (approximately 1 USD = 1.70 NZD), per million tokens.


Groq — Best Speed on Open-Source Models

Groq runs open-source models on custom LPU silicon. It is among the fastest cloud inference services available, which makes it well suited to agentic workloads where latency compounds across many tool calls.

Model           Input (NZD/M)   Output (NZD/M)   Speed
Llama 3.3 70B   ~$1.00          ~$1.34           ~280 tok/s
Llama 3.1 8B    ~$0.10          ~$0.10           ~750 tok/s
Gemma 2 9B      ~$0.34          ~$0.34           ~500 tok/s

Groq offers a generous free tier with rate limits, which may be enough for light personal use.

Many Hermes models are fine-tuned from Llama base weights. Running Llama 3.x on Groq gives you a capable, fast alternative: not identical to a Hermes checkpoint, but excellent for tool use.

Best for: Frequent tool calls where response latency is the main constraint.


Gemini 2.0 Flash — Best Value for Money

Gemini 2.0 Flash is the standout budget option. Fast, capable, supports function calling natively, and aggressively priced.

Model                          Input (NZD/M)   Output (NZD/M)   Notes
Gemini 2.0 Flash               ~$0.13          ~$0.51           Excellent tool use
Gemini 1.5 Flash               ~$0.13          ~$0.51           Slightly older
Gemini 2.0 Flash (free tier)   Free            Free             Rate-limited

Free tier available via Google AI Studio.

Best for: Lowest possible cost at moderate usage volumes.


OpenAI — Most Reliable for Tool Use

OpenAI's tool-calling ecosystem is the most mature. If your Hermes setup uses an OpenAI-compatible API format, GPT-4o mini is a near drop-in replacement.

Model         Input (NZD/M)   Output (NZD/M)   Notes
GPT-4o mini   ~$0.26          ~$1.02           Best budget option
GPT-4o        ~$4.25          ~$17.00          Top-tier reasoning
o4-mini       ~$1.87          ~$7.48           Strong reasoning

Best for: Broad compatibility and reliable function calling without configuration pain.


Anthropic Claude — Best for Complex Agents

Claude handles ambiguous agentic tasks and multi-step reasoning well — useful if your agent does complex chained tool calls rather than simple lookups.

Model              Input (NZD/M)   Output (NZD/M)   Notes
Claude Haiku 3.5   ~$1.36          ~$6.80           Fastest Claude
Claude Sonnet 4    ~$5.10          ~$25.50          Best balance

More expensive than the alternatives. The quality premium is meaningful for complex workflows; less so for simple queries.

Best for: Multi-step reasoning, ambiguous instructions, or tasks where you're actively waiting for a thoughtful response.


Together AI and OpenRouter — Open-Source Aggregators

Both platforms let you run open-source models via API at low cost. OpenRouter in particular aggregates dozens of providers and lets you compare prices dynamically — and you can often find Hermes-specific model checkpoints here.

Platform      Example Model   Input (NZD/M)   Output (NZD/M)
Together AI   Llama 3.1 70B   ~$1.50          ~$1.50
Together AI   Llama 3.2 3B    ~$0.10          ~$0.10
OpenRouter    Llama 3.1 8B    ~$0.10          ~$0.10
OpenRouter    Mistral 7B      ~$0.10          ~$0.10

Best for: Flexibility across open-source models, or specifically wanting Hermes model compatibility in the cloud.


Cloud Cost Estimate for Real Usage

Assuming a moderately active agent (roughly 500K tokens/day, split evenly between input and output; that covers dozens of agentic tasks with tool calls):

Provider and model   Est. daily cost (NZD)   Est. monthly cost (NZD)
Groq Llama 3.1 8B    ~$0.05                  ~$1.50
Gemini 2.0 Flash     ~$0.16                  ~$5
GPT-4o mini          ~$0.30                  ~$9
Groq Llama 3.3 70B   ~$0.58                  ~$17
Claude Haiku 3.5     ~$2.04                  ~$61

For light personal use (under 100K tokens/day), several providers' free tiers may cover your needs entirely.
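
The arithmetic behind these estimates can be sketched in a few lines of Python. This is a rough sketch, not a billing tool: the prices are copied from the tables above, the function names are illustrative, and the 50/50 input/output split (input_share=0.5) is an assumption. Agentic workloads often send far more input than they generate, so adjust it to match your own logs.

```python
# Prices in NZD per million tokens (input, output), copied from the tables above.
PRICES_NZD_PER_M = {
    "Groq Llama 3.1 8B": (0.10, 0.10),
    "Gemini 2.0 Flash": (0.13, 0.51),
    "GPT-4o mini": (0.26, 1.02),
    "Groq Llama 3.3 70B": (1.00, 1.34),
    "Claude Haiku 3.5": (1.36, 6.80),
}


def daily_cost(model: str, tokens_per_day: int, input_share: float = 0.5) -> float:
    """Estimated NZD per day; input_share is the assumed fraction of input tokens."""
    in_price, out_price = PRICES_NZD_PER_M[model]
    blended_price = input_share * in_price + (1 - input_share) * out_price
    return tokens_per_day / 1_000_000 * blended_price


def monthly_cost(model: str, tokens_per_day: int, input_share: float = 0.5,
                 days: int = 30) -> float:
    """Estimated NZD per month, assuming the same usage every day."""
    return daily_cost(model, tokens_per_day, input_share) * days


if __name__ == "__main__":
    for name in PRICES_NZD_PER_M:
        print(f"{name}: ~${daily_cost(name, 500_000):.2f}/day, "
              f"~${monthly_cost(name, 500_000):.2f}/month")
```

With an even split, 500K tokens/day on Gemini 2.0 Flash works out to roughly $0.16 NZD per day. Lowering input_share shifts the estimate toward the output price, which matters most for models with a large gap between the two.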


Option 2: Second-Hand GPU (~$200 NZD)

At $200 NZD you're looking at two realistic options: an NVIDIA GTX 1080 or an AMD RX 580. Both have 8GB of VRAM, which is the critical number for local LLM inference.


NVIDIA GTX 1080

The 1080 is the stronger choice for AI workloads. CUDA support is mature and every local inference tool — llama.cpp, Ollama, LM Studio — supports it out of the box.

  • VRAM: 8GB GDDR5X
  • Architecture: Pascal (2016)
  • Inference speed: ~35–55 tok/s for a 7B Q4_K_M model
  • Max model size: 7B comfortably; 13B at aggressive quantisation (Q3/Q2)
  • Setup: Low effort — install driver, install Ollama, run model
  • Power draw: ~180W under load

Key limitation: 8GB of VRAM limits you to 7B-class models in practice. Hermes 7B runs well; 13B is marginal.
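
A back-of-envelope check shows why 8GB draws the line where it does. The sketch below is an approximation under stated assumptions: the bits-per-weight figures are ballpark values for llama.cpp quantisation types, and the 1.5GB allowance for KV cache and runtime overhead is a guess that grows with context length. None of these numbers come from a measurement on this hardware.

```python
# Ballpark bits per weight for common llama.cpp quantisation types (assumed values).
APPROX_BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 3.4}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the model weights in GB."""
    return params_billion * APPROX_BITS_PER_WEIGHT[quant] / 8


def fits_in_vram(params_billion: float, quant: str,
                 vram_gb: float = 8.0, overhead_gb: float = 1.5) -> bool:
    """True if weights plus an assumed overhead allowance fit in VRAM."""
    return weights_gb(params_billion, quant) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for params, quant in [(7, "Q4_K_M"), (13, "Q4_K_M"), (13, "Q3_K_M"), (13, "Q2_K")]:
        size = weights_gb(params, quant)
        print(f"{params}B {quant}: ~{size:.1f} GB weights, fits in 8GB: "
              f"{fits_in_vram(params, quant)}")
```

Under these assumptions a 7B model at Q4_K_M leaves a couple of gigabytes spare, a 13B model at Q4_K_M does not fit, and 13B only squeezes in at Q3 or Q2 with very little room for context.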


AMD RX 580

Tempting on price, but significantly more complicated for AI workloads.

  • VRAM: 8GB GDDR5
  • Architecture: Polaris (2016)
  • Inference speed: ~15–25 tok/s (ROCm on Linux); slower on Windows via Vulkan
  • Software support: Poor on Windows; better on Linux with ROCm — but ROCm setup is non-trivial and the RX 580 is not officially supported by ROCm's current release
  • Setup: High effort — specific Linux kernel, specific driver versions, manual configuration

Verdict: Not recommended for this use case unless you enjoy Linux driver archaeology.


The Real Difference

Setup               Tok/s (7B Q4)   Time for a 500-token response
OptiPlex CPU only   2–5             ~2–4 minutes
+ GTX 1080          35–55           ~10–14 seconds
Cloud (Groq)        300–800         ~1 second

The 1080 is a 10–20× improvement over CPU-only. That's the difference between barely usable and genuinely practical.
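
The response times in the table are simply token count divided by generation speed, and the same arithmetic shows how latency compounds across a chain of tool calls. The sketch below makes two simplifying assumptions: it ignores prompt-processing time and network latency, and the five-step, 200-token chain is a made-up example rather than a measured workload.

```python
def response_seconds(tokens: int, tok_per_s: float) -> float:
    """Time to generate a response, ignoring prompt processing and network latency."""
    return tokens / tok_per_s


def chain_seconds(steps: int, tokens_per_step: int, tok_per_s: float) -> float:
    """Total generation time for a chain of sequential tool calls."""
    return steps * response_seconds(tokens_per_step, tok_per_s)


if __name__ == "__main__":
    # Hypothetical agent task: 5 sequential steps of 200 generated tokens each.
    for label, speed in [("CPU only", 3.5), ("GTX 1080", 45), ("Groq", 500)]:
        print(f"{label}: 500-token reply {response_seconds(500, speed):.0f}s, "
              f"5-step chain {chain_seconds(5, 200, speed):.0f}s")
```

At the low end of the CPU range (2 tok/s) a single 500-token reply takes 250 seconds, which is why a multi-step agent task on CPU alone can run for many minutes.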


GPU: Things to Check Before Buying

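PSU capacity: The OptiPlex 7070 SFF (small form factor) ships with only a 180W PSU, which may not safely support a 1080 under load. The MT (mini tower) has 260W, which is workable but leaves little headroom once the 1080's ~180W draw is added to the rest of the system, so also confirm the PSU has the PCIe power connector the card needs. This is the single biggest risk, so check which variant you have before buying.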

Physical fit: The 7070 SFF will only fit a low-profile GPU. A full-size 1080 will not fit the SFF chassis. MT is fine.

Second-hand risk: Ex-mining cards may have degraded memory. Ask for a stress test or thermal image.

Electricity: A 1080 adds roughly $15–25 NZD/month to your power bill at several hours of daily use.

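SFF alternative: If you have the SFF chassis, a GTX 1650 LP (~$150–200 NZD) fits and delivers ~20–30 tok/s, less than the 1080 but a substantial improvement over CPU-only. Note that the 1650 has 4GB of VRAM rather than 8GB, so a 7B Q4 model may not fit entirely on the card.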
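
The electricity figure above depends heavily on how many hours the card runs under load and on your tariff, so it is worth redoing with your own numbers. In this sketch the 180W draw comes from the 1080 spec list earlier in the post; the 8 hours/day and the $0.35 NZD/kWh tariff are hypothetical values chosen for illustration.

```python
def monthly_power_cost(watts: float, hours_per_day: float,
                       nzd_per_kwh: float, days: int = 30) -> float:
    """Estimated monthly electricity cost in NZD for a device under load."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * nzd_per_kwh


if __name__ == "__main__":
    # Assumed: 180W under load, 8 hours/day, $0.35 NZD per kWh.
    print(f"~${monthly_power_cost(180, 8, 0.35):.2f} NZD/month")
```

Halving the hours halves the cost, and an idle card draws far less than 180W, so light users should expect a figure below the range quoted above.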


Head-to-Head Summary

Option               Upfront cost    Monthly cost    Speed           Max model   Setup effort
CPU only (current)   $0              ~$0             2–5 tok/s       7B (slow)   Done
GTX 1080 (MT)        ~$200 NZD       ~$15–25 power   35–55 tok/s     7B–13B      Low
GTX 1650 LP (SFF)    ~$150–200 NZD   ~$10–15 power   20–30 tok/s     7B          Low
RX 580               ~$200 NZD       ~$15–25 power   15–25 tok/s     7B          High
Groq Llama 8B        $0              ~$1.50–5        500–800 tok/s   Any         Minimal
Gemini 2.0 Flash     $0              ~$5–15          Fast            Any         Minimal
GPT-4o mini          $0              ~$9–30          Fast            Any         Minimal
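
To compare options on cost alone, add the upfront price to the running cost over a fixed period. The sketch below uses the upfront and monthly figures from the table; the 12-month horizon is an arbitrary assumption, and the calculation ignores resale value and any price changes.

```python
def total_cost(upfront: float, monthly: float, months: int) -> float:
    """Upfront cost plus running cost over a period, in NZD."""
    return upfront + monthly * months


if __name__ == "__main__":
    months = 12  # assumed comparison horizon
    # (upfront, monthly low, monthly high), taken from the summary table.
    options = {
        "GTX 1080 (MT)": (200, 15, 25),
        "Gemini 2.0 Flash": (0, 5, 15),
        "GPT-4o mini": (0, 9, 30),
    }
    for name, (upfront, low, high) in options.items():
        print(f"{name}: ${total_cost(upfront, low, months):.0f}"
              f"–${total_cost(upfront, high, months):.0f} NZD over {months} months")
```

On these figures the GPU does not pay for itself against the cheaper cloud options within a year, so the case for buying one rests on privacy and offline access rather than on cost.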

A Practical Framework

Beyond raw speed, the right choice also depends on the nature of the task.

Think of it as doing versus thinking.

Doing: shell commands, device control, file operations, system queries, scheduled automations. These are execution tasks — the model needs to follow a schema reliably, not reason deeply. A local model handles these well once it's fast enough, and the privacy argument is strong.

Thinking: code review, writing, multi-step reasoning, analysing a complex situation. These are tasks where the quality gap between a 7B model and a larger cloud model is meaningful, and where you're actively waiting for a useful answer. Cloud APIs have a permanent advantage here — not just in speed but in capability.

The homelab sweet spot: local inference handles the automation and system management layer; cloud handles anything interactive where you want a thoughtful response.
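
The doing/thinking split can be expressed as a small routing rule. This is an illustrative sketch only: the task labels, the Backend enum, and choose_backend are hypothetical names, not part of Hermes, and a real agent would need its own way of classifying tasks.

```python
from enum import Enum


class Backend(Enum):
    LOCAL = "local"
    CLOUD = "cloud"


# Hypothetical task labels, grouped along the doing/thinking split.
DOING = {"shell", "device_control", "file_ops", "system_query", "automation"}
THINKING = {"code_review", "writing", "reasoning", "analysis"}


def choose_backend(task_kind: str, private: bool = False) -> Backend:
    """Route execution tasks and anything private to the local model."""
    if private or task_kind in DOING:
        return Backend.LOCAL
    if task_kind in THINKING:
        return Backend.CLOUD
    # Unknown task kinds stay local: no data leaves the machine and nothing is billed.
    return Backend.LOCAL


if __name__ == "__main__":
    print(choose_backend("file_ops"))                   # Backend.LOCAL
    print(choose_backend("code_review"))                # Backend.CLOUD
    print(choose_backend("code_review", private=True))  # Backend.LOCAL
```

Defaulting unknown tasks to the local backend is a design choice that favours privacy over quality; reversing it would suit a setup where response quality matters more.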


Switching Backends in Hermes

Hermes supports multiple model backends. The exact flag varies by version — check with:

hermes chat --help | grep -i model

Typically:

# Default — uses whatever backend is configured in config.yaml
hermes chat -q "how much disk space do I have?"

# Explicitly request a cloud model for a task that needs quality or speed
hermes chat --model claude -q "review this and suggest improvements: ..."

Setting the local model as the default and passing the cloud flag only when a task needs it keeps the common case local, private, and free of per-token charges.
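
If you drive the agent from a script, the same pattern can be wrapped in a helper that builds the command line. This sketch assumes the --model and -q flags behave as in the examples above; as noted, the exact flag varies by version, so confirm it with the --help check first. The function name is illustrative.

```python
import shlex
from typing import List, Optional


def build_hermes_command(prompt: str, model: Optional[str] = None) -> List[str]:
    """Build an argument list for `hermes chat`; flags assumed from the examples above."""
    cmd = ["hermes", "chat"]
    if model is not None:
        # Only override the configured default when a model is explicitly requested.
        cmd += ["--model", model]
    cmd += ["-q", prompt]
    return cmd


if __name__ == "__main__":
    # Print the commands in shell-quoted form rather than running them.
    print(shlex.join(build_hermes_command("how much disk space do I have?")))
    print(shlex.join(build_hermes_command("review this and suggest improvements: ...",
                                          model="claude")))
    # To execute for real: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a formatted string avoids shell-quoting problems when the prompt contains quotes or other special characters.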


Recommendation

For most users: start with Groq or Gemini Flash.

Zero upfront cost, near-instant responses, and free tiers make cloud APIs the obvious first step. Groq's Llama 3.3 70B is a substantially more capable model than Hermes 7B running locally and typically responds within a second or two. You can have this working in under an hour.

If you want to stay local and private: the GTX 1080 — but check your chassis first.

If your OptiPlex 7070 is the MT (mini tower) variant with a 260W PSU, the 1080 is a good upgrade. It turns the agent from frustratingly slow to genuinely practical. If you have the SFF, you need a low-profile card instead.

Skip the RX 580 for this workload. The driver pain is not worth the savings.

Hybrid: Many people run a local GPU for quick, private tasks and fall back to a cloud API for heavy lifting. With Ollama and OpenRouter both configured in your agent, this is easy to set up and covers most scenarios cleanly.


Quick Decision Guide

Do you need privacy or offline access?
├── Yes → Buy a second-hand GPU
│         ├── MT chassis + 260W PSU → standard 1080 (~$200 NZD)
│         └── SFF chassis → low-profile 1650 (~$150–200 NZD)
└── No  → Use cloud API
           ├── Lowest cost  → Groq Llama 3.1 8B (often free)
           ├── Best value   → Gemini 2.0 Flash (~$5/month)
           ├── Best compat  → GPT-4o mini (~$9/month)
           └── Best quality → Claude Sonnet 4 (~$30+/month)

Pricing current as of April 2026. USD/NZD conversion at approximately 1.70. Provider pricing changes frequently — check the pricing pages directly before committing.