Eriks Briedis

@eriks_b

Software engineer shipping LLM systems. Currently going deep on evals, RAG, and agentic workflows. Notes from the work.

Joined August 2009

49 Following

64 Followers

83 Posts

Eriks Briedis

@eriks_b

about 14 hours ago

Run pairwise LLM judges in both orders: A/B and B/A. If the winner changes when the order changes, treat it as a tie. Otherwise you’re baking in position bias and calling it preference.
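
A minimal sketch of the both-orders rule, assuming a hypothetical judge callable that takes the prompt plus two responses and answers "first" or "second". The judge signature and the two toy judges are illustrative, not any particular library's API.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_response, second_response)
# -> "first" or "second", naming the response it prefers.
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "A", "B", or "tie" after judging in both presentation orders."""
    forward = judge(prompt, a, b)  # A is shown first
    reverse = judge(prompt, b, a)  # B is shown first
    winner_forward = "A" if forward == "first" else "B"
    winner_reverse = "B" if reverse == "first" else "A"
    # A winner that flips with the order is position bias, so score it as a tie.
    return winner_forward if winner_forward == winner_reverse else "tie"


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Toy judge with a real (if silly) content preference."""
    return "first" if len(first) >= len(second) else "second"
```

With these toy judges, the position-biased one collapses to a tie, while the content-based one picks the same response in both orders.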

Eriks Briedis

@eriks_b

about 14 hours ago

@phosphenq The “every trajectory flows back into memory” part is the risky bit. If memory only gets success summaries, the swarm starts training on its own false positives. Make the memory unit the raw trace, the failing repro before, and the same check passing after.

Eriks Briedis

@eriks_b

about 15 hours ago

@itsPaulAi Before and after. Before fine-tuning, evals give you a baseline and show where the local model is actually failing vs Claude. After fine-tuning, the same evals tell you whether it really improved on the task, or just learned to sound more like your examples.

Eriks Briedis

@eriks_b

about 15 hours ago

@AiCamila_ The curated-table step is the place to be strict: keep the raw trace refs next to the derived record. For agent evals, input, retrieved docs, tool calls, model version, output, judge result, and cost/latency should all hang off the same run id.
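
One way that record could look, as a sketch: every field name, the trace path, and the derived columns are assumptions made for illustration. The point is only that the curated row is derived from the run record and still carries the run id and raw trace reference.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AgentRunRecord:
    """One agent eval run; everything hangs off run_id (hypothetical schema)."""
    run_id: str
    raw_trace_ref: str  # pointer to the unmodified trace, e.g. a file path
    model_version: str
    input: str
    output: str
    retrieved_doc_ids: tuple = ()
    tool_calls: tuple = ()
    judge_result: str = ""
    cost_usd: float = 0.0
    latency_ms: float = 0.0


def to_curated_row(record: AgentRunRecord) -> dict:
    """Derive a curated-table row without dropping the raw trace reference."""
    row = asdict(record)
    row["n_tool_calls"] = len(record.tool_calls)
    row["n_retrieved_docs"] = len(record.retrieved_doc_ids)
    return row


record = AgentRunRecord(
    run_id="run-001",
    raw_trace_ref="traces/run-001.jsonl",
    model_version="model-v1",
    input="What is the refund window?",
    output="30 days.",
    retrieved_doc_ids=("doc-7", "doc-12"),
    tool_calls=("search_docs",),
    judge_result="pass",
    cost_usd=0.004,
    latency_ms=820.0,
)
row = to_curated_row(record)
```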

Eriks Briedis

@eriks_b

about 15 hours ago

@leanxbt The loss filter is the interesting bit here. It may find the right tool-call moments, but production agents still need API guardrails: valid params, idempotency, loud failures, and a trace showing which call changed the answer.
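
A rough sketch of those four guardrails around a single tool, assuming a hypothetical wrapper class and a made-up ticket-status tool. Nothing here is a real framework's API.

```python
class ToolCallError(Exception):
    """Raised loudly instead of returning a silent or partial result."""


class GuardedTool:
    """Hypothetical wrapper: validates params, dedupes by key, records a trace."""

    def __init__(self, fn, valid_params):
        self.fn = fn
        self.valid_params = valid_params  # parameter name -> set of allowed values
        self._completed = {}              # idempotency key -> earlier result
        self.trace = []                   # (key, params, outcome) for each call

    def call(self, idempotency_key, **params):
        for name, value in params.items():
            if name not in self.valid_params:
                raise ToolCallError(
                    f"unknown parameter {name!r}; valid: {sorted(self.valid_params)}"
                )
            if value not in self.valid_params[name]:
                raise ToolCallError(
                    f"invalid {name}={value!r}; valid: {sorted(self.valid_params[name])}"
                )
        if idempotency_key in self._completed:
            # Same key again: replay the stored result, do not re-run the side effect.
            self.trace.append((idempotency_key, params, "replayed"))
            return self._completed[idempotency_key]
        result = self.fn(**params)  # any exception from fn propagates to the caller
        self._completed[idempotency_key] = result
        self.trace.append((idempotency_key, params, "executed"))
        return result


calls = []


def set_ticket_status(status):
    """Made-up side-effecting tool used only for this example."""
    calls.append(status)
    return f"status set to {status}"


tool = GuardedTool(set_ticket_status, {"status": {"open", "closed"}})
```

Calling `tool.call("k1", status="closed")` twice runs the side effect once, and `tool.trace` shows which call executed and which was replayed.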

Eriks Briedis

@eriks_b

about 15 hours ago

@h100envy The self-evaluate step is doing a lot of work here. Game of 24 has a clean judge signal: did you make 24 or not? With coding agents, I trust search more when each branch has an executable check, like a failing repro before the patch and the same check after.

Eriks Briedis

@eriks_b

about 15 hours ago

The harness layer carries most of the leverage here. I’d be careful making “the model checks its own work” the reliability boundary. For coding agents, keep the check external: a failing repro before, the same check after, plus a diff scan for weakened tests or code quality degradation.

Eriks Briedis

@eriks_b

2 days ago

An eval suite with a 100% pass rate tells you very little. I’d rather keep the main suite full of cases where models disagree, fail intermittently, or recently broke in production. Move the solved cases into a regression tier.
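
A small sketch of that split, assuming pass/fail history is stored per case id, oldest result first. The window of five runs and the function name are arbitrary choices for illustration, and the "recently broke in production" signal would need its own input.

```python
def split_tiers(history: dict, window: int = 5):
    """Split case ids into (main, regression) from recent pass/fail history.

    history maps case_id -> list of bools, oldest first. A case counts as
    solved only if it has a full window of results and all of them passed.
    """
    main, regression = [], []
    for case_id, results in history.items():
        recent = results[-window:]
        solved = len(recent) == window and all(recent)
        (regression if solved else main).append(case_id)
    return main, regression


history = {
    "always-passes": [True, True, True, True, True],
    "flaky": [True, False, True, True, True],
    "new-case": [True, True],
}
main, regression = split_tiers(history)
```

Here the flaky case and the case without enough history stay in the main suite, and only the consistently passing case moves to the regression tier.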

Eriks Briedis

@eriks_b

2 days ago

@svpino The Fibonacci loop is clean. For real bugfixes, the missing constraint is usually making the test or repro fail for the expected reason before touching production code. Otherwise the agent can write code and tests that only agree with each other.

Eriks Briedis

@eriks_b

2 days ago

That separate reporting line matters most when failures come in. When production misses something, someone has to turn that miss into a case and keep the trace with it. Cases that have stopped teaching you anything should disappear too, or the eval team ends up making dashboards for the next deploy.

Eriks Briedis

@eriks_b

2 days ago

@HeyAnjula At the loop layer, the checker has to be pretty literal. For coding agents, “done” means a failing repro before, the same check after, and a diff scan for weakened tests or degraded code quality. The model’s completion summary is just another thing to inspect.

Eriks Briedis

@eriks_b

2 days ago

@pauliusztin_ Steps 5 and 7 are the loop: every human override or bad trace turns into an eval case. The trace shows where it failed once. The eval keeps that same failure from slipping back in.

Eriks Briedis

@eriks_b

2 days ago

@DanKornas Provenance tracking is where I’d look first here. In messy document pipelines, OCR and entity extraction only hold up if each claim points back to the raw scan, page, bounding box, and parser step. Otherwise the graph or timeline can turn into a very convincing rumor machine.

Eriks Briedis

@eriks_b

2 days ago

@vanstriendaniel @huggingface Guessed CLI flags are usually an interface problem. Agents do better with copy-paste commands, valid parameter values, and errors that point to the next command instead of dumping a generic 400.

Eriks Briedis

@eriks_b

2 days ago

@humzaakhalid A Claude Project with style files only keeps working when the weekly brief, decision log, Sunday close, and repeat-work docs stay fresh. Once those go stale, the “AI business OS” is mostly a nicer prompt with old context.

Eriks Briedis

@eriks_b

2 days ago

@akshay_pachaar That wrong-filename case is exactly where the trace earns its keep. A single score can hide a bad file path, a tool error, a shaky verifier assumption, or context that got handled badly. The run log is what tells you what actually changed.

Eriks Briedis

@eriks_b

2 days ago

@femke_plantinga Maintenance is where this usually breaks. My Claude Code wiki experiment hit the same failure: injected summaries felt useful at first, then duplicate open questions and stale decisions piled up. The memory needed status/supersedes/evidence more than extra context.
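
As a sketch of what status/supersedes/evidence could mean in practice: the field names, the status values, and the rule "only inject entries that are not superseded and have evidence" are my assumptions for illustration, not the schema from that experiment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryEntry:
    entry_id: str
    text: str
    status: str = "open"               # hypothetical values: open, decided, superseded
    supersedes: Optional[str] = None   # entry_id this entry replaces, if any
    evidence: tuple = ()               # refs to traces, commits, or docs


def active_entries(entries):
    """Keep entries that are not superseded and cite at least one piece of evidence."""
    superseded_ids = {e.supersedes for e in entries if e.supersedes}
    return [
        e for e in entries
        if e.entry_id not in superseded_ids
        and e.status != "superseded"
        and e.evidence
    ]


entries = [
    MemoryEntry("d1", "Use SQLite for the cache", "decided", evidence=("notes/d1.md",)),
    MemoryEntry("d2", "Use Redis for the cache", "decided", supersedes="d1",
                evidence=("notes/d2.md",)),
    MemoryEntry("q1", "Should we shard the index?"),  # no evidence yet
]
active = active_entries(entries)
```

Only the newer cache decision survives the filter: the older one is superseded and the open question has nothing backing it yet.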

Eriks Briedis

@eriks_b

2 days ago

Someone on the client side needs to be able to change a rule, run the relevant eval case, get the change approved, inspect what happened, and roll it back when it goes wrong. Otherwise every new edge case turns into a call to the consultants, and the project becomes a subscription to the implementation team.

Eriks Briedis

@eriks_b

3 days ago

@Av1dlive Claude can chew through the overnight research batch and score how much it likes each trade. Anything that places orders should sit behind hard-coded risk checks, because the verify/refine/rerun loop is where a toy repo can quietly become an account-drainer.
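
A toy sketch of what "hard-coded" can mean here: limits live in plain constants that the model's score never touches, and the order function refuses loudly. The limits, symbols, and function names are all made up for illustration.

```python
# Hypothetical limits; the values are placeholders, not recommendations.
MAX_ORDER_NOTIONAL_USD = 500.0
MAX_ORDERS_PER_DAY = 3
ALLOWED_SYMBOLS = frozenset({"AAPL", "MSFT"})


def risk_violations(symbol: str, notional_usd: float, orders_placed_today: int) -> list:
    """Return every rule the proposed order breaks (empty list means allowed)."""
    violations = []
    if symbol not in ALLOWED_SYMBOLS:
        violations.append(f"symbol {symbol!r} not on allowlist")
    if notional_usd <= 0 or notional_usd > MAX_ORDER_NOTIONAL_USD:
        violations.append(f"notional {notional_usd} outside (0, {MAX_ORDER_NOTIONAL_USD}]")
    if orders_placed_today >= MAX_ORDERS_PER_DAY:
        violations.append(f"daily order limit of {MAX_ORDERS_PER_DAY} reached")
    return violations


def place_order_if_allowed(symbol, notional_usd, orders_placed_today, send_order):
    """Call send_order only when no hard-coded rule is violated."""
    violations = risk_violations(symbol, notional_usd, orders_placed_today)
    if violations:
        raise PermissionError("; ".join(violations))
    return send_order(symbol, notional_usd)
```

The model's preference score can decide which orders get proposed, but it is never an input to `risk_violations`.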
