Run pairwise LLM judges in both orders: A/B and B/A. If the winner changes when the order changes, treat it as a tie. Otherwise you’re baking in position bias and calling it preference.
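A minimal sketch of the both-orders rule. The `judge(first, second)` signature and the two toy judges are made up for illustration; a real judge would be an LLM call.

```python
from typing import Callable

# Hypothetical judge signature: judge(first, second) returns "first" or "second".
Judge = Callable[[str, str], str]


def pairwise_verdict(judge: Judge, a: str, b: str) -> str:
    """Return "A", "B", or "tie" after judging in both presentation orders."""
    forward = judge(a, b)   # A is shown first
    backward = judge(b, a)  # B is shown first
    winner_forward = "A" if forward == "first" else "B"
    winner_backward = "B" if backward == "first" else "A"
    # A winner that flips with the order is treated as a tie.
    return winner_forward if winner_forward == winner_backward else "tie"


def always_first(first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def longer_wins(first: str, second: str) -> str:
    """Toy judge that prefers the longer answer (ties go to the first slot)."""
    return "first" if len(first) >= len(second) else "second"
```

The position-biased toy judge always collapses to a tie, which is the point: its "preference" never survives the swap.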
@phosphenq The “every trajectory flows back into memory” part is the risky bit. If memory only gets success summaries, the swarm starts training on its own false positives. Make the memory unit the raw trace, the failing repro before, and the same check passing after.
@itsPaulAi Before and after.
Before fine-tuning, evals give you a baseline and show where the local model is actually failing vs Claude.
After fine-tuning, the same evals tell you whether it really improved on the task, or just learned to sound more like your examples.
@AiCamila_ The curated-table step is the place to be strict: keep the raw trace refs next to the derived record.
For agent evals, input, retrieved docs, tool calls, model version, output, judge result, and cost/latency should all hang off the same run id.
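One way that could look, as a rough sketch: a single record keyed by run id. The class name, field names, and units (`cost_usd`, `latency_ms`) are assumptions, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvalRun:
    """Everything about one agent run, hanging off the same run_id."""
    run_id: str
    model_version: str
    input: str
    retrieved_docs: list[str] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)
    output: str = ""
    judge_result: str = ""
    cost_usd: float = 0.0     # assumed unit
    latency_ms: float = 0.0   # assumed unit


def to_jsonl_line(run: EvalRun) -> str:
    """Serialize one run as a single JSON line, e.g. for an append-only log."""
    return json.dumps(asdict(run), sort_keys=True)
```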
@leanxbt The loss filter is the interesting bit here. It may find the right tool-call moments, but production agents still need API guardrails: valid params, idempotency, loud failures, and a trace showing which call changed the answer.
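A rough sketch of those guardrails around one tool. `GuardedTool`, `ToolCallError`, and the idempotency-key scheme are all hypothetical names for illustration, not any framework's API.

```python
from typing import Any, Callable


class ToolCallError(Exception):
    """Raised loudly instead of returning a vague failure to the agent."""


class GuardedTool:
    def __init__(self, func: Callable[..., Any], allowed_params: set[str]):
        self.func = func
        self.allowed_params = allowed_params
        self._results: dict[str, Any] = {}
        self.trace: list[tuple[str, str]] = []  # (idempotency_key, what happened)

    def call(self, idempotency_key: str, **params: Any) -> Any:
        # Valid params: reject anything the tool does not declare.
        unknown = set(params) - self.allowed_params
        if unknown:
            raise ToolCallError(
                f"unknown params {sorted(unknown)}; allowed: {sorted(self.allowed_params)}"
            )
        # Idempotency: a repeated key replays the stored result, no second side effect.
        if idempotency_key in self._results:
            self.trace.append((idempotency_key, "replayed"))
            return self._results[idempotency_key]
        result = self.func(**params)
        self._results[idempotency_key] = result
        self.trace.append((idempotency_key, "executed"))
        return result
```

The `trace` list is what lets you answer "which call changed the answer" after the fact.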
@h100envy The self-evaluate step is doing a lot of work here. Game of 24 has a clean judge signal: did you make 24 or not? With coding agents, I trust search more when each branch has an executable check, like a failing repro before the patch and the same check after.
The harness layer carries most of the leverage here. I’d be careful making “the model checks its own work” the reliability boundary.
For coding agents, keep the check external: a failing repro before, the same check passing after, plus a diff scan for weakened tests or degraded code quality.
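Sketched very roughly, the acceptance gate might look like this. The diff scan is a crude heuristic I'm assuming for illustration (it flags any removed `assert` line or added `skip`, so a moved assertion also trips it), and the function names are made up.

```python
def weakened_test_lines(diff: str) -> list[str]:
    """Flag unified-diff lines that remove an assert or add a skip marker."""
    flagged = []
    for line in diff.splitlines():
        if line.startswith("---") or line.startswith("+++"):
            continue  # file headers, not content changes
        if line.startswith("-") and "assert" in line:
            flagged.append(line)
        elif line.startswith("+") and "skip" in line.lower():
            flagged.append(line)
    return flagged


def accept_patch(repro_failed_before: bool, repro_passes_after: bool, diff: str) -> bool:
    """Accept only if the repro failed before, passes after, and no test was weakened."""
    return (
        repro_failed_before
        and repro_passes_after
        and not weakened_test_lines(diff)
    )
```

The two booleans come from actually running the same check before and after the patch, not from anything the model reports.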
An eval suite with a 100% pass rate tells you very little.
I’d rather keep the main suite full of cases where models disagree, fail intermittently, or recently broke in production. Move the solved cases into a regression tier.
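A tiny sketch of the tiering rule, assuming "solved" means the case passed every one of its last few runs. The threshold of 5 is an arbitrary placeholder.

```python
def assign_tier(recent_results: list[bool], min_runs: int = 5) -> str:
    """Return "regression" for cases that passed all of the last min_runs runs.

    Anything flaky, failing, or without enough history stays in "main".
    """
    if len(recent_results) >= min_runs and all(recent_results[-min_runs:]):
        return "regression"
    return "main"
```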
@svpino The Fibonacci loop is clean.
For real bugfixes, the missing constraint is usually making the test or repro fail for the expected reason before touching production code. Otherwise the agent can write code and tests that only agree with each other.
That separate reporting line matters most when failures come in.
When production misses something, someone has to turn that miss into a case and keep the trace with it. Cases that have stopped teaching you anything should be retired too, or the eval team ends up just producing dashboards for the next deploy.
@HeyAnjula At the loop layer, the checker has to be pretty literal. For coding agents, “done” means a failing repro before, the same check after, and a diff scan for weakened tests or degraded code quality. The model’s completion summary is just another thing to inspect.
@pauliusztin_ Steps 5 and 7 are the loop: every human override or bad trace turns into an eval case.
The trace shows where it failed once. The eval keeps that same failure from slipping back in.
@DanKornas Provenance tracking is where I’d look first here. In messy document pipelines, OCR and entity extraction only hold up if each claim points back to the raw scan, page, bounding box, and parser step. Otherwise the graph or timeline can turn into a very convincing rumor machine.
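As a sketch of what "each claim points back" could mean in data terms. The class and field names here are assumptions, not any particular pipeline's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    scan_path: str                            # raw scan file
    page: int
    bbox: tuple[float, float, float, float]   # x0, y0, x1, y1 on the page
    parser_step: str                          # e.g. "ocr" or "entity_extraction"


@dataclass(frozen=True)
class Claim:
    text: str
    source: Optional[Provenance] = None


def graph_ready(claims: list[Claim]) -> list[Claim]:
    """Keep only claims that can be traced back to a raw scan location."""
    return [c for c in claims if c.source is not None]
```

Claims without a source are held back from the graph or timeline rather than silently included.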
@vanstriendaniel @huggingface Guessed CLI flags are usually an interface problem. Agents do better with copy-paste commands, valid parameter values, and errors that point to the next command instead of dumping a generic 400.
@humzaakhalid A Claude Project with style files only keeps working when the weekly brief, decision log, Sunday close, and repeat-work docs stay fresh. Once those go stale, the “AI business OS” is mostly a nicer prompt with old context.
@akshay_pachaar That wrong-filename case is exactly where the trace earns its keep. A single score can hide a bad file path, a tool error, a shaky verifier assumption, or context that got handled badly. The run log is what tells you what actually changed.
@femke_plantinga Maintenance is where this usually breaks. My Claude Code wiki experiment hit the same failure: injected summaries felt useful at first, then duplicate open questions and stale decisions piled up. The memory needed status, supersedes, and evidence fields more than extra context.
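Roughly what I mean, as a sketch. The field names, status values, and the rule that entries without evidence get dropped are my assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MemoryEntry:
    entry_id: str
    text: str
    status: str                         # assumed values: "open", "decided", "superseded"
    evidence: list[str] = field(default_factory=list)  # trace or doc references
    supersedes: Optional[str] = None    # entry_id this entry replaces


def active_entries(entries: list[MemoryEntry]) -> list[MemoryEntry]:
    """Entries worth injecting: not superseded, and backed by some evidence."""
    superseded_ids = {e.supersedes for e in entries if e.supersedes}
    return [
        e for e in entries
        if e.entry_id not in superseded_ids
        and e.status != "superseded"
        and e.evidence
    ]
```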
Someone on the client side needs to be able to change a rule, run the relevant eval case, get the change approved, inspect what happened, and roll it back when it goes wrong. Otherwise every new edge case turns into a call to the consultants, and the project becomes a subscription to the implementation team.
@Av1dlive Claude can chew through the overnight research batch and score how much it likes each trade. Anything that places orders should sit behind hard-coded risk checks, because the verify/refine/rerun loop is where a toy repo can quietly become an account-drainer.
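A bare-bones sketch of such a gate. Every limit and symbol below is a made-up placeholder, not a recommendation, and `risk_violations` is a hypothetical name.

```python
from dataclasses import dataclass

# Placeholder limits: hard-coded on purpose, so the model cannot talk its way past them.
MAX_ORDER_NOTIONAL = 1_000.0
MAX_DAILY_NOTIONAL = 5_000.0
ALLOWED_SYMBOLS = frozenset({"AAPL", "MSFT"})


@dataclass(frozen=True)
class Order:
    symbol: str
    quantity: int
    price: float


def risk_violations(order: Order, notional_spent_today: float) -> list[str]:
    """Return every rule the order breaks; an empty list means it may proceed."""
    violations = []
    notional = order.quantity * order.price
    if order.symbol not in ALLOWED_SYMBOLS:
        violations.append("symbol not allowed")
    if order.quantity <= 0 or order.price <= 0:
        violations.append("non-positive quantity or price")
    if notional > MAX_ORDER_NOTIONAL:
        violations.append("order exceeds per-order limit")
    if notional_spent_today + notional > MAX_DAILY_NOTIONAL:
        violations.append("order exceeds daily limit")
    return violations
```

The model's score for a trade never appears in this function, which is the design choice: scoring and order placement stay on opposite sides of the check.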