Most multi-agent AI systems don't fail at prompting.
They fail at lifecycle semantics.
A sub-agent says "done", the orchestrator trusts it, and the artifact ships without evidence.
Enoch is my attempt to make that impossible.
@KettlebellDan Well two things... The token consumption and cost are not competitive (all things considered... wrapper/harness features - model size and capacity versus others in market). The only thing I would consider using Grok for right now is sensitive / private work.
@lmariscal@xai Trying. It says: "This person's inbox is closed. They need to update their message settings before you can message them." -- I opened up my DMs. Thank you.
Unless you're ready to spend serious time (and money) tuning hyperparameters, don't mess with LLM reasoning traces.
I evaluated multiple reasoning budgets and BNF grammar / structured CoT settings on Qwen3.6 27B.
The results are underwhelming.
Yes, it can work: for a few specific tasks, it significantly reduces inference cost by shortening reasoning traces while preserving accuracy.
But in most settings, simply disabling reasoning is better, both for token efficiency and accuracy.
Full analysis here:
https://t.co/xxLLzVkASx
If anyone is needing a cheap hosting service, I found this one to be the most unique and useful: https://t.co/VFG3xUDqPp
No referral links. Dallas and extremely fast. Paired with Tailscale... not sure there is anything else quite like it.
This is why the wake gate watches the process tree and the telemetry windows instead of trusting the agent's own declaration that the work is done.
Docs: https://t.co/JKrFPUL3Ua
An agent can report that it is finished while child processes are still writing files, the GPU is still allocated, or the local state was never updated.
The model has no direct view of those things. It only knows what it was told or what it can see in its current context window.
If you only listen to what the agent says about completion, you have no independent way to know whether the run actually stopped or just went quiet for a while.
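A rough sketch of what that kind of gate could look like, assuming the supervisor has already collected the signals itself. Every name here (`RunSnapshot`, `wake_gate`, the field names, the 30-second quiet window) is made up for illustration and is not Enoch's actual API. The point is only that the agent's "done" is one input among several, and never sufficient on its own.

```python
from dataclasses import dataclass


@dataclass
class RunSnapshot:
    """Signals gathered by the supervisor, not by the agent (hypothetical shape)."""
    agent_says_done: bool            # the agent's own declaration
    live_child_pids: list[int]       # children still alive in the run's process tree
    gpu_still_allocated: bool        # whether the run still holds the GPU
    seconds_since_last_write: float  # time since the run last wrote to its outputs
    state_updated: bool              # whether the local state record was written


def wake_gate(snap: RunSnapshot, quiet_window_s: float = 30.0) -> tuple[bool, list[str]]:
    """Return (may_wake, blocking_reasons). Wake only when nothing blocks."""
    reasons: list[str] = []
    if not snap.agent_says_done:
        reasons.append("agent has not reported completion")
    if snap.live_child_pids:
        reasons.append(f"{len(snap.live_child_pids)} child process(es) still running")
    if snap.gpu_still_allocated:
        reasons.append("GPU is still allocated")
    if snap.seconds_since_last_write < quiet_window_s:
        reasons.append("files were written inside the quiet window")
    if not snap.state_updated:
        reasons.append("local state was never updated")
    return (not reasons, reasons)
```

The returned reasons list matters as much as the boolean: a gate that only says "not yet" gives you nothing to go on when a run looks stuck.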
@yoheinakajima behavior.failed as a first-class event in the log, not an exception that disappears. The audit trail captures why something broke, not just that it did. Anyone who's debugged a long-running agent at 2am knows the difference.
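For anyone wondering what that looks like in practice, here is a minimal sketch, assuming an append-only list of JSON lines as the log. Only `behavior.failed` comes from the post above; `record_event`, `run_step`, and the `behavior.started` / `behavior.completed` event names are hypothetical.

```python
import json
import time


def record_event(log: list[str], kind: str, **detail) -> None:
    """Append one JSON line to the log; entries are never rewritten."""
    log.append(json.dumps({"ts": time.time(), "kind": kind, **detail}))


def run_step(log: list[str], name: str, fn):
    """Run fn() and record the outcome as an event, including failures."""
    record_event(log, "behavior.started", step=name)
    try:
        result = fn()
    except Exception as exc:
        # The failure becomes a log entry with its cause, instead of an
        # exception that vanishes somewhere up the stack.
        record_event(
            log,
            "behavior.failed",
            step=name,
            error_type=type(exc).__name__,
            reason=str(exc),
        )
        return None
    record_event(log, "behavior.completed", step=name)
    return result
```

Returning `None` on failure is just to keep the sketch short; a real system would probably hand back a richer result so callers can't confuse "failed" with "returned nothing".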
@populartourist The n-gram cache scaling looks clean for single-stream, but the memory cost competes with KV cache under --parallel >1. In batched serving that tradeoff shifts. Curious how the curve looks with concurrent requests.
@bravo_abad Tree search over LLM calls is the real pattern. ERA evaluates thousands of candidates per task. The scoring function and branching strategy matter more than the model. This is what agentic actually means in practice: search with a metric, not just a bigger prompt.