So impressed with @NousResearch Hermes agent. I needed my new agent to have access to certain APIs and skills that exist on another agent on my laptop, so it figured out how to SSH in, located the other agents, pulled their info, fixed something I didn’t know was broken & got the job done.
@Teknium Interesting: my Hermes, Valen, started off as a very dry & strict IT wizard, and now after a few weeks it’s throwing some humor in, making fun of how many questions I ask it, and without me even telling it how, it figured out how to SSH into my laptop and extract the info I needed.
@Teknium Hermes agent is too awesome! So I was just downloading the update. I had Hermes do it within its chat window, and some weird crash happened from the update. I restarted Hermes. It discovered the problem, fixed it (something to do with the TUI) & then downloaded the update. Love it.
Would love to hear how others are using their Karpathy-style wiki with Hermes agents from @NousResearch. Are you adding notes, ebooks, courses, articles, information about your company? And how are you getting Hermes to actually learn or use the information? @Teknium just curious :-)
@jamesrowdyy @NousResearch What was one of the things that had you say “good lord” about it? And yes, I agree. I’m new to a lot of this, but my agent almost seems to be able to read my mind at times lol 😂
The Deep Dive: What Worked, What Didn't, and Why
✅ Where GPT-5.5 Excels
Structured Output / JSON Mode (0.90) — The Series Leader
This is the best JSON performance we've seen across all seven models. GPT-5.5 returned perfectly valid JSON with exact schema compliance, 4 of 5 pattern checks passed, and the structure was clean and immediately usable. For production agentic pipelines that depend on machine-parseable output, GPT-5.5 sets the new standard.
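A pipeline that hands model output straight to a parser typically checks two things: that the text parses at all, and that the parsed object has the expected shape. The sketch below shows one minimal way to do both with the standard library. The schema, the field names, and the `validate_output` helper are hypothetical illustrations, not the harness or pattern checks used in this review.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
EXPECTED_SCHEMA = {"title": str, "priority": int, "tags": list}


def validate_output(raw: str, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the output is usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    # Every declared field must be present and have the declared type.
    for field, expected_type in schema.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}")
    # Fields the schema does not declare are reported as well.
    for field in data:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


good = '{"title": "Rotate keys", "priority": 2, "tags": ["ops"]}'
bad = '{"title": "Rotate keys", "priority": "2", "tags": ["ops"]}'
print(validate_output(good, EXPECTED_SCHEMA))  # []
print(validate_output(bad, EXPECTED_SCHEMA))   # ['wrong type for priority']
```

Returning a list of problems rather than raising on the first one makes it easier to log every schema violation from a single response.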
Compare: KimiK2.6 (1.00), DeepSeek (1.00), Claude Opus (0.90). GPT-5.5 ties the top but with faster generation speed. The JSON was well-structured with sensible values and no formatting artifacts.
Code Execution Reasoning (0.88)
Identical score to Claude Opus and DeepSeek. GPT-5.5 correctly predicted all three print outputs and explained the reference-vs-copy distinction clearly. It lost the same partial point on not fully explaining the slice mechanism — suggesting this is a rubric-level expectation rather than a model limitation.
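The test code itself is not reproduced here, so the snippet below is only an illustrative stand-in for the reference-vs-copy distinction and the slice mechanism mentioned above. It assumes the test involved plain Python lists; the variable names are made up.

```python
original = [1, 2, 3]

alias = original        # a second name for the same list object
shallow = original[:]   # slicing builds a new list holding the same elements

original.append(4)

print(alias)    # [1, 2, 3, 4]  (the alias sees the mutation)
print(shallow)  # [1, 2, 3]     (the slice copy does not)
print(alias is original, shallow is original)  # True False

# The slice copy is shallow: nested objects are still shared.
nested = [[0], [1]]
copy_of_nested = nested[:]
nested[0].append(99)
print(copy_of_nested[0])  # [0, 99]
```

The last three lines are the part a rubric asking for a full explanation of slicing would likely care about: `[:]` copies the outer list only, not the objects inside it.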
Complex Multi-Step Reasoning (0.75)
A meaningful improvement over Kimi (0.25) and DeepSeek (0.25). GPT-5.5 correctly identified that the logic puzzle had multiple valid solutions and noted the ambiguity. While it didn't converge on a single answer, it demonstrated awareness of the problem space — a different kind of correctness than brute-forcing the wrong answer.
Adversarial / Trick Questions (0.75)
Same score as most models in the series. GPT-5.5 correctly identified the widget machine rate trap (5 minutes, not 100) with clear reasoning. Nothing surprising here — this test has become a baseline that most frontier models pass.
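The exact prompt is not quoted above. Assuming it is the classic wording (5 machines make 5 widgets in 5 minutes; how long do 100 machines need for 100 widgets?), the arithmetic behind the "5 minutes, not 100" answer can be sketched as follows. The helper names are hypothetical.

```python
from fractions import Fraction


def per_machine_rate(machines: int, widgets: int, minutes: int) -> Fraction:
    """Widgets produced by one machine in one minute."""
    return Fraction(widgets, machines * minutes)


def minutes_needed(machines: int, widgets: int, rate: Fraction) -> Fraction:
    """Time for `machines` working in parallel to produce `widgets`."""
    return Fraction(widgets) / (machines * rate)


# Assumed premise: 5 machines, 5 widgets, 5 minutes.
rate = per_machine_rate(5, 5, 5)
print(rate)                            # 1/5
print(minutes_needed(100, 100, rate))  # 5
```

The trap is scaling the time with the widget count while forgetting that the machine count scales by the same factor, so the two cancel.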
Instruction Following Precision (0.70)
Same score as DeepSeek, higher than Kimi (0.50). GPT-5.5 attempted the constraint puzzle (5 sentences, ≤15 "e"s, "serverless" once, end with "future", ALL CAPS) and met 2/5 constraints. Like DeepSeek, it showed engagement with the problem rather than ignoring it.
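Constraints like these are mechanical enough to check in code. The sketch below is one possible checker, not the rubric used in this review. It assumes a naive sentence split on `.`, `!`, and `?`, counts "e" case-insensitively, and treats "serverless" and "future" as case-insensitive whole words; the function name and sample text are hypothetical.

```python
import re


def check_constraints(text: str) -> dict[str, bool]:
    """Check the five constraints described above against a candidate answer."""
    stripped = text.strip()
    # Naive sentence split: runs of ., ! or ? end a sentence (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", stripped) if s.strip()]
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", stripped)]
    return {
        "five_sentences": len(sentences) == 5,
        "at_most_15_e": stripped.lower().count("e") <= 15,
        "serverless_once": words.count("serverless") == 1,
        "ends_with_future": bool(words) and words[-1] == "future",
        "all_caps": stripped == stripped.upper()
        and any(c.isalpha() for c in stripped),
    }


sample = (
    "SERVERLESS IS A TOOL. IT RUNS FAST. IT COSTS A BIT. "
    "OPS SHRINK. THIS IS OUR FUTURE."
)
results = check_constraints(sample)
print(sum(results.values()), "of", len(results), "constraints met")
# 5 of 5 constraints met
```

A per-constraint dictionary makes a partial result such as "2 of 5" easy to break down into which constraints were missed.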
100% Reliability
Zero errors. Zero timeouts. Across 15 tests with an average runtime of nearly 20 seconds per test, GPT-5.5 never crashed, never rate-limited, never failed to return a response. This is the operational gold standard.
❌ Where GPT-5.5 Struggles
The Speed Tax (16.3s average TTF)
This is the single biggest issue. GPT-5.5 takes 16.3 seconds on average to produce its first token. For comparison:
| Model | Avg TTF | Relative |
|-------|---------|----------|
| KimiK2.6 (Ollama) | 2.2s | Baseline |
| Claude Opus 4.8 Fast | ~4.0s | 1.8x |
| DeepSeek-V4-Pro | 17.5s | 8.0x |
| GPT-5.5 | 16.3s | 7.4x |
From a user experience perspective, 16 seconds of silence before any response is agonizing. The total time averages are reasonable (19.9s) because GPT-5.5 generates efficiently once it starts, but the latency before first output is a real problem for interactive use.
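To make the distinction between time to first token and total time concrete, here is a minimal sketch of how both can be measured around any token stream. `fake_stream` is a hypothetical stand-in for a real streaming response, and nothing here reflects how the numbers in this review were actually collected.

```python
import time
from typing import Iterable, Iterator


def fake_stream(first_token_delay: float, tokens: list[str]) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(first_token_delay)  # the silence before the first token
    for token in tokens:
        yield token


def measure_latency(stream: Iterable[str]) -> tuple[float, float, str]:
    """Return (time to first token, total time, full text) for a stream."""
    start = time.perf_counter()
    ttf = None
    parts = []
    for token in stream:
        if ttf is None:
            ttf = time.perf_counter() - start  # first token just arrived
        parts.append(token)
    total = time.perf_counter() - start
    # An empty stream has no first token; fall back to the total time.
    return (ttf if ttf is not None else total), total, "".join(parts)


ttf, total, text = measure_latency(fake_stream(0.05, ["Hel", "lo"]))
print(f"TTF {ttf:.2f}s, total {total:.2f}s, text {text!r}")
```

With a slow start and fast generation, `ttf` and `total` end up close together, which is the pattern described above: most of the wait happens before anything is shown.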
Recent Knowledge / World Events (0.50)
The most disappointing failure. Asked about the June 2025 G7 summit, GPT-5.5 hallucinated an elaborate narrative about a "June 15–17, 2025 G7 summit in Kananaskis, Alberta" hosted by "Canadian Prime Minister Mark Carney." None of this happened. The model fabricated dates, location, host, and agenda items.
This is worse than DeepSeek's honest "my cutoff is May 2025" or Kimi's incorrect "April 2024." GPT-5.5 didn't decline to answer — it confidently invented a fictional event. For production use cases requiring current information, this is a critical vulnerability.
Debugging (0.50)
Same as most models in the series. GPT-5.5 missed the subtle mutability bug and claimed the code was fine. The test may be too subtle — it's designed to check whether models hallucinate bugs, and GPT-5.5 correctly avoided that trap. But it didn't earn full credit for edge case analysis.
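The test code is not shown above, so the following is purely a hypothetical example of what a subtle mutability bug can look like in Python: a mutable default argument that silently carries state between calls. The function names are invented and are not taken from the benchmark.

```python
def add_tag_buggy(tag, tags=[]):
    # The default list is created once, at definition time, and then reused.
    tags.append(tag)
    return tags


def add_tag_fixed(tag, tags=None):
    # A None sentinel gives every call its own fresh list.
    if tags is None:
        tags = []
    tags.append(tag)
    return tags


print(add_tag_buggy("a"))  # ['a']
print(add_tag_buggy("b"))  # ['a', 'b']  (state leaked from the first call)
print(add_tag_fixed("a"))  # ['a']
print(add_tag_fixed("b"))  # ['b']
```

Code like `add_tag_buggy` looks fine on a single call, which is why a reviewer can call it correct and still miss the edge case.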
Content Generation (0.50)
Same score as most models. GPT-5.5 wrote a competent but generic tech article about API rate limiting. It stayed within word count but missed the creativity and authenticity marks. Like every model before it, GPT-5.5 struggles to write with a distinctive voice.
Edge Case Handling (0.50)
Same pattern as Kimi and DeepSeek. GPT-5.5 correctly asked clarifying questions rather than hallucinating trip details, but didn't actually solve the edge case problem. Safe but not helpful.
Long-Context RAG (0.50)
Only extracted 1 of 3 required data points from the embedded document. The McKinsey stat (72%) was captured, but MIT CSAIL attribution and emerging paradigms were missed. Same "fade toward the end" pattern we've seen across all models.
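A coverage check of this kind can be approximated with simple keyword matching. The sketch below is an assumption about how such a check might look, not the scoring used in this review; the keyword lists and the sample answer are invented for illustration.

```python
# Hypothetical keyword lists for the three required data points.
REQUIRED_POINTS = {
    "mckinsey_stat": ["mckinsey", "72%"],
    "mit_csail_attribution": ["mit csail"],
    "emerging_paradigms": ["emerging paradigm"],
}


def coverage(answer: str, required: dict[str, list[str]]) -> dict[str, bool]:
    """Mark a data point as covered only if all of its keywords appear."""
    lowered = answer.lower()
    return {
        name: all(keyword in lowered for keyword in keywords)
        for name, keywords in required.items()
    }


# Invented sample answer that mentions only the first data point.
answer = "According to McKinsey, the figure is 72%."
found = coverage(answer, REQUIRED_POINTS)
print(found)
print(sum(found.values()), "of", len(found), "data points covered")
# 1 of 3 data points covered
```

Keyword matching is crude: it cannot tell a correct attribution from a passing mention, so a real rubric would likely need a stricter check.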
Tool Use / Function Calling (0.50)
Listed function calls with correct parameters but no native execution. This is a harness limitation — OpenRouter doesn't support tool execution in our test setup. The model understood what to call; we just couldn't validate execution.
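When a harness cannot execute tools, the emitted call can still be checked statically against the declared tool specification. The sketch below assumes the model's call has been captured as a JSON object with `name` and `arguments` keys, a shape chosen here for illustration. The `get_weather` tool and the `validate_tool_call` helper are hypothetical and are not part of OpenRouter or the test setup described above.

```python
import json

# Hypothetical tool specification: parameter name -> required Python type.
TOOLS = {
    "get_weather": {"required": {"city": str}, "optional": {"unit": str}},
}


def validate_tool_call(raw_call: str, tools: dict) -> list[str]:
    """Check a tool call against its spec without executing anything."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return ["call is not valid JSON"]
    if not isinstance(call, dict):
        return ["call is not an object"]

    name = call.get("name")
    if name not in tools:
        return [f"unknown tool: {name}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments is not an object"]

    spec = tools[name]
    problems = []
    for param, expected_type in spec["required"].items():
        if param not in args:
            problems.append(f"missing argument: {param}")
        elif not isinstance(args[param], expected_type):
            problems.append(f"wrong type for {param}")
    allowed = set(spec["required"]) | set(spec["optional"])
    for param in args:
        if param not in allowed:
            problems.append(f"unexpected argument: {param}")
    return problems


call = '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "C"}}'
print(validate_tool_call(call, TOOLS))  # []
```

This only confirms that the model chose a known tool and supplied well-typed arguments; it says nothing about whether the call would have produced the right result.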
Summarization Fidelity (0.50)
Missed key facts from the quantum computing article. Word count was acceptable, but both the independent physicist caution and stock movement details were omitted. DeepSeek and Kimi had the same problem.
Who is GPT-5.5 actually for?
Structured data pipelines — the 0.90 JSON score makes GPT-5.5 the best choice for agentic workflows that depend on machine-parseable output. If your production system sends model output directly to a JSON parser, GPT-5.5 is the safest bet.
Complex reasoning tasks — the 0.75 on multi-step logic is the best in the series. GPT-5.5 doesn't just brute-force answers; it recognizes ambiguity and problem structure. For research analysis, legal reasoning, or any task where "I don't know" is better than a wrong answer, this matters.
Batch processing where latency doesn't matter — the 16.3s TTF is irrelevant if you're processing documents overnight. GPT-5.5's reliability and structured output excellence make it ideal for background jobs.
NOT for: Real-time chat, interactive applications, or any user-facing interface where 16 seconds of silence kills engagement. The speed tax is real and significant.
NOT for: Tasks requiring current world knowledge. The hallucinated G7 summit is a red flag. GPT-5.5 will confidently invent events rather than admit uncertainty.
See the full test results here https://t.co/EwlnQpfoms
@imbabybrooklyn @NousResearch So is each profile a different Hermes agent you set up? If so, I assume you can create new agents right inside? New to this, so still trying to understand it all.
@Teknium @NousResearch @trycua Thanks for all your work, @trycua. I’m pretty new to all of this and haven’t had a chance to try the skill, but it sounds awesome. Appreciate you.
Can someone explain how the Claude Code stuff works with Hermes Agent @NousResearch? I see in the desktop app it has it as a skill and under model in settings. Is Anthropic letting us use our agents with them now outside of the API? Or only if we are using Claude Code locally? @Teknium
@Teknium @NousResearch Oh! Damn, that’s cool. Sheesh. You guys have thought of it all! Just imagine if computer use comes to the desktop app. Can I buy stock in your company? 😂🤣 Taking over the world. Thanks, Tek. If no one has told you, we appreciate all you do!
@imbabybrooklyn @max_paperclips Really loving it and seeing the potential. I can’t wait till computer use is added ;-) hint hint. Seriously, great job. Also figured out how to get my WSL agent on my Windows PC into the new desktop client. It wasn’t easy tho. Would love a WSL detect feature.