You can now run GLM-5.2 locally on Mac Studio to integrate with Hermes.👇
Hardware Reality (Non-Negotiable)
• 2-bit dynamic quant: ~239GB
• Minimum: 256GB Unified Memory (Mac Studio only, no laptops)
• Recommended: 512GB Unified Memory
• Speed: 1–9 tokens/sec on M3 Ultra
Use case: Perfect private background worker for long async tasks
Not for fast, casual interactive chat
Step 1: Run GLM-5.2 Local API (2 Options)
Option 1 | LM Studio (Easiest macOS Setup)
1. Install LM Studio
2. Download Unsloth GLM-5.2 GGUF (UD-IQ2_M)
3. Enable developer local server
4. Endpoint: http://localhost:1234/v1
Option 2 | Llama.cpp (Full CLI Control)
pip install huggingface_hub
hf download unsloth/GLM-5.2-GGUF \
--local-dir unsloth/GLM-5.2-GGUF \
--include "*UD-IQ2_M*"
./llama.cpp/llama-server \
--model unsloth/GLM-5.2-GGUF/UD-IQ2_M/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf \
--temp 1.0 --top-p 0.95 --min-p 0.01 \
--ctx-size 32768 --jinja \
--host 0.0.0.0 --port 8080
(Official Unsloth sampling params + Jinja chat template for valid tool calling)
Step 2: Connect Nous Hermes Agent to Local Model
Edit ~/.hermes/config.yaml for fully local agent execution:
model:
default: glm-5.2
provider: custom
base_url: http://localhost:8080/v1
api_key: local
context_length: 32768
agent:
tool_use_enforcement: true
Key fix: Enable tool_use_enforcement
GLM is not in Hermes’ default supported model list — this forces proper tool calling (no more just describing tasks!)
@brian_armstrong The vision is clear, but the real difficulty is building an agent that can handle real money movement, unexpected market moves, and regulatory edge cases without creating silent disasters that only surface days later.
I've been running similar multi-agent setups on bigger refactors and the skeptic + reviewer layer is what actually keeps things from quietly drifting into elegant but broken solutions — the real test will be how well /goal handles mid-project plan changes without the whole team losing coherence.
@yuhasbeentaken The token inefficiency and long planning loops are the real hidden cost — even with much lower per-token pricing, GLM-5.2 can easily end up more expensive than expected once you run actual long-horizon agent workflows that need consistent steering.
I'm desperately in need of GLM-5.2 right now.
My current agent workflows are burning through GPT-5.5 tokens way too fast, and the monthly bill is getting out of control.
@Hicker_Moledao This is the exact kind of infrastructure oversight that hurts long-running local agents the most — TRACE-level logging should never be left on by default in anything meant to run for hours, or you end up trading model intelligence for hardware lifespan.
@0xgibly I've wasted too many tokens and context windows on agent hallucinations that actually traced back to messy PDF parsing — adding a proper cleaning step upstream has been one of the highest-ROI improvements in any document-heavy workflow I've run.
Real agentic benchmarks like GDPval-AA matter more than most synthetic ones because they test multi-turn practical deliverables. GLM-5.2 reaching #3 here means open weights are now close enough that long-running agent workflows can start shifting for cost and control instead of pure capability.