GLM-5.1 is the best model to use on Hermes, and ranks very highly on OpenClaw too.
I built a new benchmark for models running on OpenClaw and Hermes. It tests persistent memory, tool discipline, protocol compliance, and injection safety.
I put 11 models through the benchmark, including GLM-5.1, GPT-5.4, Kimi K2.5, Grok 4.20, and more.
Check out the results below!
https://t.co/98CQBuKBoz
Grok 4.3 just became the strongest all-round model in my benchmark set.
Ran it alongside GPT-5.5, DeepSeek V4 Pro, and Qwen 3.6 Max across coding, OpenClaw, and Hermes.
It was the only model from this update that held up across all three benchmark families.
Full rankings →
I tested Opus 4.7 across three custom-built benchmarks: coding, multi-turn tasks, and OpenClaw & Hermes runtime fit.
Looks like a very strong model for OpenClaw. Full results here: Opus 4.7 Ranked on OpenClaw, Hermes & Coding Benchmarks vs 12 Models
https://t.co/ZSTqaCFjDV
@AlexFinn Agreed, using both is the way to go. They have different strengths and nuances. Those differences showed up in my benchmark, where I tested which models work best on each: https://t.co/je9S8KIUOq
@NousResearch @anthonyronning GLM-5.1 came out on top in my Hermes benchmark. I’ll be doing a full setup and migration to GLM-5.1 in my next video: https://t.co/PFsBHHj9GS
@moltbanker @AlexFinn I’ve only built the baseline benchmark so far, so no long-running memory tasks yet. But I’ll build those into future advanced versions.
@AIHacksByMK @AlexFinn For the benchmark I run all models via API. It’s the cleanest way to measure token usage, cost, and wall time while keeping the environment the same for all models.