8/ Repo here: https://t.co/OQnJIkEiEr
This is just a side project, but if you keep a Karpathy-style LLM wiki or agent-written markdown knowledge base, I would be curious whether this matches your workflow.
@Teknium @theo This is the way! I was envisioning that skills would be grouped into bundles that users could choose as their defaults. Another great optimization that will make Hermes feel more professional.
@ollama @MiniMax_AI MVP sub as always. DeepSeek launch was rough but so far so good on MiniMax.
Testing has been good so far; instruction following and agentic use cases seem decent. Strong upgrade to mm 2.7.
@t_blom Companies usually treat context as stored knowledge, when agents need working state.
Docs = “what do we know?”
Workflow = “what happened, what matters etc.”
Without that layer, every agent starts smart but organizationally dumb.
@theo This is why I like agent benchmarks that expose cost per task, not only pass rate.
In real workflows, the bill is not one shot. It is all the retries, dead-end edits, context reloads, and human review needed before the change is safe to ship.
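A rough sketch of what "cost per task" could mean in practice. Everything here is an assumption for illustration: the `TaskRun` fields, the hourly reviewer rate, and the choice to divide by shipped tasks are not taken from any particular benchmark.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One task as the agent actually worked it (hypothetical fields)."""
    attempt_costs_usd: list[float]  # one entry per try: retries, dead ends, reloads
    review_minutes: float           # human review time before the change was judged
    shipped: bool                   # did it end up safe to ship?


def pass_rate(runs: list[TaskRun]) -> float:
    """Share of tasks that shipped, ignoring what they cost."""
    if not runs:
        return 0.0
    return sum(1 for r in runs if r.shipped) / len(runs)


def cost_per_shipped_task(runs: list[TaskRun], reviewer_usd_per_hour: float) -> float:
    """All model spend plus review time, divided by the tasks that shipped."""
    total = sum(
        sum(r.attempt_costs_usd) + r.review_minutes / 60 * reviewer_usd_per_hour
        for r in runs
    )
    shipped = sum(1 for r in runs if r.shipped)
    if shipped == 0:
        return float("inf")  # nothing shipped, so no finite cost per task
    return total / shipped
```

Two agents with the same pass rate can land far apart on the second number, which is the gap a pass-rate-only benchmark hides.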
This is why I’m skeptical of long-context claims as a single number. In real agent loops, context has to recover the right files, preserve constraints, avoid repeating failed paths, and keep the target visible.
If those degrade differently as length grows, real performance depends more on the decay curve than on the window size.
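A toy model of that point. The decay rates below are made-up illustrative values, not measurements of any model, and both the exponential shape and the "multiply the scores" rule are assumptions.

```python
import math

# Hypothetical decay rates per 100k tokens of context, one per capability.
# Invented for illustration only.
DECAY_PER_100K = {
    "recover_right_files": 0.10,
    "preserve_constraints": 0.30,
    "avoid_failed_paths": 0.50,
    "keep_target_visible": 0.20,
}


def capability_score(decay: float, context_tokens: int) -> float:
    """Assumed exponential decay from a perfect 1.0 at zero context."""
    return math.exp(-decay * context_tokens / 100_000)


def loop_score(context_tokens: int) -> float:
    """Assume the loop needs every capability to hold, so the scores multiply."""
    return math.prod(
        capability_score(decay, context_tokens)
        for decay in DECAY_PER_100K.values()
    )
```

Under these assumptions the fastest-decaying capability dominates the combined score well before the window is full, which a single window-size number would not reveal.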
@Tur24Tur @XiaomiMiMo It is not tokens, it’s credits. They have a scaling factor, and everything takes credits. It’s a bit better now, about 4-5x better in fact, but the value isn’t quite there yet if you ask me.
https://t.co/H1qZ88WlfD
MiMo v2.5 pro is great, but I'm afraid to report that their token plan value is still not resolved. I thought it was going fine, but after I monitored their credit usage a bit more, it's still ridiculous.
I fed a similar audit task to GPT 5.4, and after two follow-up questions, GPT 5.4 had exhausted 20% or so of my 5h window.
On a similar task, MiMo ate up ~35% (the other ~12% was from my prior testing) of my entire week's quota. I suspect they are still charging full credits for the cached prompts. Sad.
@mr_r0b0t @NousResearch @Teknium https://t.co/Iv9GMrz1FY
I actually made a Hermes skill for it with slight ergonomic improvements to teach agents how to browse + author.
My agents made a Hermes Webwright skill today.
Repo: https://t.co/KFKx5bYwkI
Credit to Microsoft for Webwright. I did not build the CLI. This is a Hermes-native skill adapting the pattern for how my agents browse and extract.
@Teknium @NousResearch if this fits the direction of Hermes skills, feel free to use any of it. What I found interesting and encoded was the decision logic, not just adding another browser tool.
WebWright: microsoft/Webwright: A simple SWE style browser agent framework that achieves SOTA results on long horizon web tasks.
The lesson was not "always automate the browser." It was the opposite.
Use the lightest layer that gives enough reliability: normal browser → Playwright scratch → durable script
Promote only when the path stabilizes, repeats, or needs evidence another agent can audit.
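A minimal sketch of that promotion rule. This is not the skill's actual code: the `Layer` names, the `repeat_threshold` default, and the one-step-at-a-time promotion are all assumptions made for illustration.

```python
from enum import IntEnum


class Layer(IntEnum):
    """Lightest to heaviest, following the order in the post."""
    NORMAL_BROWSER = 1
    PLAYWRIGHT_SCRATCH = 2
    DURABLE_SCRIPT = 3


def choose_layer(
    current: Layer,
    *,
    path_is_stable: bool,
    times_repeated: int,
    needs_audit_evidence: bool,
    repeat_threshold: int = 3,  # hypothetical cutoff for "repeats"
) -> Layer:
    """Stay on the current layer unless one of the three promotion signals fires."""
    should_promote = (
        path_is_stable
        or times_repeated >= repeat_threshold
        or needs_audit_evidence
    )
    if should_promote and current < Layer.DURABLE_SCRIPT:
        return Layer(current + 1)  # assumption: promote one step at a time
    return current
```

The default is to do nothing, so a one-off lookup never pays for a durable script.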
I think the enterprise gap is between “workflow as SOP” and “workflow as operating state.”
SOPs define the steps. But real work also needs to preserve what happened between runs: what changed, what failed, what exception path was taken, what evidence passed, and what should constrain the next execution.
If that context stays in the human’s head, it is not really an agent workflow. It is automation with a human memory patch.
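One way to get that context out of the human's head is a small record written at the end of every run and read at the start of the next. The sketch below is an assumption about shape only; the `RunState` name and its fields are hypothetical and simply mirror the list above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RunState:
    """Operating state carried between runs (hypothetical schema)."""
    changed: list[str] = field(default_factory=list)          # what changed
    failed: list[str] = field(default_factory=list)           # what failed
    exception_path: Optional[str] = None                      # exception path taken, if any
    evidence_passed: list[str] = field(default_factory=list)  # checks that passed
    next_run_constraints: list[str] = field(default_factory=list)  # limits for the next run


def to_json(state: RunState) -> str:
    """Serialize the state so the next run (or another agent) can read it."""
    return json.dumps(asdict(state), indent=2)


def from_json(text: str) -> RunState:
    """Rebuild the state at the start of the next run."""
    return RunState(**json.loads(text))
```

The SOP stays a separate document; this record only holds what the last execution learned.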
@BeauJohnson89 Better context is usually just the starting state. But if there are fewer tool calls, we have real efficiency. We need to start tracking a metric associated with fewer wrong turns to really understand the impact these things have.
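A sketch of one candidate metric, assuming each logged tool call has already been labeled with a hypothetical `wrong_turn` flag (by a reviewer or a later pass over the log). How to assign that label is the hard part and is not covered here.

```python
def wrong_turn_rate(tool_calls: list[dict]) -> float:
    """Fraction of tool calls that were labeled as wrong turns."""
    if not tool_calls:
        return 0.0
    wrong = sum(1 for call in tool_calls if call["wrong_turn"])
    return wrong / len(tool_calls)


# Hypothetical log from a single run: four tool calls, one dead end.
run_log = [
    {"tool": "read_file", "wrong_turn": False},
    {"tool": "edit_file", "wrong_turn": True},
    {"tool": "edit_file", "wrong_turn": False},
    {"tool": "run_tests", "wrong_turn": False},
]
rate = wrong_turn_rate(run_log)
```

Tracked next to the total call count, this separates "fewer calls" from "fewer wasted calls".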
@0xharrynguyen Exactly! Yet I've seen quite a few people get tricked by the agents. It's actually not hard to put it into the core files to give you the receipts, but how it scales as the system grows with more automation is the tricky part that I'm exploring.
After building with agents for quite some time now, I’m becoming less interested in whether an agent answer sounds impressive, and more interested in whether I can see enough proof to trust the run.
Using Hermes has made this more obvious for me because the agent can have continuity: memory, tools, skills, scheduled runs, and persistent roles. Once you have that, the question shifts from "can it answer?" to "what's the workflow?"
For anything serious, I want a small proof packet:
1. what I asked it to do
2. what context it used
3. what it changed or concluded
4. how it verified the result
5. what should carry into the next run
If that proof is missing, I still doubt the output, even if it sounds right.
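The five items map onto a small record with a completeness check. This is a sketch under assumptions: the `ProofPacket` name and field names are hypothetical, and a field counts as missing when it is empty or whitespace.

```python
from dataclasses import dataclass, fields


@dataclass
class ProofPacket:
    """One field per item in the list above (hypothetical names)."""
    request: str = ""        # 1. what I asked it to do
    context_used: str = ""   # 2. what context it used
    outcome: str = ""        # 3. what it changed or concluded
    verification: str = ""   # 4. how it verified the result
    carry_forward: str = ""  # 5. what should carry into the next run


def missing_parts(packet: ProofPacket) -> list[str]:
    """Names of the fields that are still empty, in list order."""
    return [f.name for f in fields(packet) if not getattr(packet, f.name).strip()]


def is_trustworthy(packet: ProofPacket) -> bool:
    """A run only clears the bar when no part of the packet is missing."""
    return not missing_parts(packet)
```

A check like this only confirms the packet is filled in; whether the verification itself is convincing is still a human call.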
To me this is where human-agent workflows get interesting. The agent runtime matters a lot, and Hermes is the base layer I’m building around. But the human still needs a clear trust boundary around each run: what happened, why it is safe, and what should persist.
That is the part I’m going to keep testing in public.