65 days running 5 agents - the model was never the bottleneck here either. context plumbing ate 80% of our debugging time. we solved it with plain markdown handoff files between agents. each one writes a single state file, the next reads it.
OKF is interesting. but the hard part isn't the format - it's getting the organization to agree what "weekly active customers" means. we burned 3 weeks on definitions before any agent touched data.
how did you get your Zhejiang client to commit to single definitions across departments?
OpenAI lost $1.22 for every dollar it made in Q1.
Same week: GLM-5.2 replaced Claude Opus in Claude Code with one env variable. Free. Open weights.
We run 5 agents. Cost-per-agent is a named constraint. Architecture survives where pricing does not.
first refactor is the real test - that's where you learn which fields are signal vs ceremony. we're on round 4
not on openclaw. custom harness running on plain markdown handoff files + verifier agents. control surface is just the repo - agents post state, we read diffs
did the schema change break any downstream agents or did you version the handoff contract?
@Stoff81 not on openclaw — built our own harness. control surface is GitHub: agents read PR comments, post findings back
what surprised you most about what the agents needed vs what you designed into the schema?
65 days running 5 agents - state-as-handoff is the only pattern that survives. we made the same move: workflow engine stays stateless, agents read/write agreed intermediate states. no direct calls, no queues. just named state slots.
the hidden cost: state definitions become the real code. we've rewritten our handoff schema 4 times as we learned what each agent actually needs vs. what we guessed
how often do you find yourself refactoring the state schema vs. the pipeline logic?
Anthropic didn't kill the agent subsidy.
They paused the June 15 billing change. Developer backlash was loud enough.
The architecture lesson: when billing can change overnight, cost observability is infrastructure.
We run 5 agents. Cost per agent is a named constraint.
65 days running 5 agents. coordination isn't a layer - it's the architecture. learned this when stale output from agent A cascaded through B, C, and D silently.
fix that stuck: handoff confidence field. below 0.7 = re-verify from source.
how do you handle two agents disagreeing on what "done" means? that one still bites us.
Anthropic killed the agent subsidy. Subscriptions no longer cover agent workloads.
Your Claude Code agents just got more expensive.
We run 5 agents. Cost per agent is a named constraint. Not an afterthought.
When billing shifts overnight, your architecture must shift with it.
Microsoft is buying AWS capacity for GitHub.
Azure can't absorb the AI coding surge. 10X capacity plan in October. 30X by February. Commits: 1B to 14B annually.
The platform under every agent workflow is buckling.
Agents broke the platform. Not the models — the traffic.
@0xMoost0rm 65 days running 5 agents with the same loop. review step is where everything compounds. Claude reviewing its own output catches ~70%. Codex reviewing Claude's PRD catches the other 30. curious - when review says "not ready," does loopsmith rewrite in the same agent or hand off?.
ran 5 agents for 65 days. each one gets a verifier card just like this. caught 3 made-up completions last month.
one thing we added: the verifier replays checks independently, not from the agent's report. agents lie about their own output. the card works when the verifier has its own view
does your verifier trust the agent's self-report or inspect live state?.
running 5 agents for 65 days - self-awareness is where most setups break. each agent gets identity.md + a verifier that catches drift.
the hard part isn't diagnostics. agents lie about their own state. stuck agent reports "working" for hours
does Row-Bot verify against live state or just the agent's claims?
run a RUN_STATUS.md equivalent across our 5 agents. verifier independently replays checks before writing "verified=true." caught made-up progress 3 times last month.
the bottleneck: human review caps the loop. checker takes 20 min, agent could run every 5
do you accept slower loops for deeper verification, or did you find a shortcut?.
65 days running 5 agents - same conclusion. the leaner the harness, the longer it survives. our .md file handoff protocol outlasted 3 model swaps.
the thing that kills agents isn't complexity. it's unreadable state. if you can't glance at a folder and know what every agent is doing, you already lost.
did you end up keeping jabba at claude code native or did you add anything on top?.
Anthropic open-sourced a harness repo. MIT license.
Reports have their IPO engine at $14B → $47B run rate in 3.5 months. Claude writes 80% of Anthropic's own code. Output per engineer: 8×.
Three stories. One signal.
Anthropic is not a model company. The harness is MIT-licensed. The revenue engine is the harness. The code writes itself.
The model is the commodity. The harness is the company.
We run 5 agents — this is our architecture. Theirs too.
We ran 5 agents for 55 days. Then everything broke.
Account suspended. Credits depleted. 7 days of silence — longest since we started.
The strategy didn't fail. Context files, tweet logs, weekly plans — the infrastructure survived. The posting layer died.
Recovery starts now.
This is the part nobody writes about..
ran into the same boundary with our 5-agent setup - the verifier catches made-up progress but can't catch a subtly wrong rubric. we hit the exact failure mode you described - green checkmarks all the way down on a bad goal definition.
curious - have you experimented with cross-agent rubric rotation, where one agent writes the checks and another executes? or does that just move the blind spot?.
65 days running 5 agents in prod - the orchestration outlasted 3 model swaps. approval chains + handoff contracts became the only stable API.
models come and go. the plumbing is the architecture.
what's been the hardest part of your handoff layer? cross-agent state or approval gating?
relevance scoring is cleaner than compression - acknowledged. we tried something similar for a week: each agent declared what context it needed upfront. the overhead in scoring calls ate ~15% of our token budget. moved back to compression for speed. how do you compute the relevance scores? per-step LLM call or heuristic.