When an agent can’t hit my code-quality bar after a few iterations, I lower the LOC limits per file/function and reduce the allowed number of files.
It starts optimizing for simplicity, which is often the clearest instruction.
The same works for ChatGPT: “1 paragraph max.”
Best solutions appear under pressure :)
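A minimal sketch of what such a line budget could look like as an automated check. It assumes Python sources and uses the standard `ast` module; the function name `loc_violations` and the default limits are hypothetical, chosen only for illustration. The same idea applies to any language that has a parser.

```python
import ast


def loc_violations(source: str, max_file_lines: int = 300, max_func_lines: int = 40) -> list[str]:
    """Return human-readable violations of the file and function line budgets."""
    violations = []

    # File-level budget: count physical lines in the source text.
    total = len(source.splitlines())
    if total > max_file_lines:
        violations.append(f"file has {total} lines (limit {max_file_lines})")

    # Function-level budget: measure each function from its first to its last line.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            if length > max_func_lines:
                violations.append(
                    f"function {node.name} has {length} lines (limit {max_func_lines})"
                )
    return violations
```

Tightening the budget is then just a matter of lowering the two limits between iterations and feeding the violations back to the agent.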
FrontierCode is basically the eval every serious codebase will eventually need.
We can’t keep writing specs, guides and “rules for agents” and hope the model figures it out.
First, perfect specs don’t exist.
Second, projects are not static. We add features, refactor architecture, shift product priorities, discover edge cases, and constantly change what “good code” means in this repo.
So we need a golden set for our project: a small, stable set of tasks that we run our agent against every time we change the model, prompts, tools, context strategy, or coding rules.
Then you inspect the output like a maintainer:
Would I merge this PR?
Did it solve the actual problem?
Did it introduce regressions?
Is the code idiomatic for this repo?
Are the tests meaningful?
Did it stay within scope?
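The checklist above can be recorded as structured data so that reviews stay comparable across runs. This is only a sketch: the class name `MaintainerReview`, the field names, and the equal weighting of the six questions are assumptions, not something the post prescribes.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MaintainerReview:
    """One yes/no answer per question on the maintainer checklist."""
    would_merge: bool
    solves_problem: bool
    no_regressions: bool
    idiomatic: bool
    meaningful_tests: bool
    in_scope: bool

    def failed_checks(self) -> list[str]:
        # Names of the questions answered "no", in checklist order.
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def score(self) -> float:
        # Fraction of questions answered "yes" (equal weights are an assumption).
        total = len(fields(self))
        return (total - len(self.failed_checks())) / total
```

The list of failed checks is usually more useful than the number itself, because it points at what to change in the agent's instructions.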
Automation starts when the output is still bad and hand-tuning instructions stops working.
My usual loop:
One agent implements every task from the golden set from scratch, in parallel and in isolation (I just use git worktrees for this).
Another agent acts as a judge. It knows the ideal solution, the repo conventions, and the review rubric. It scores the output, finds failure patterns, and proposes changes to the first agent’s instructions.
Then repeat.
But the judge is not magic. You still want deterministic checks: tests, lint, typecheck, benchmarks, and human review for calibration.
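One possible way to wire this together, sketched under assumptions: each task gets its own checkout via `git worktree add`, and the judge is consulted only after the deterministic checks pass. The branch and path naming (`golden/<task>`, `../wt-<task>`), the function names, and the "checks first, judge second" ordering are illustrative choices, not the author's exact setup. The checks and the judge are passed in as plain callables.

```python
from typing import Callable


def worktree_add_cmd(task_id: str, base_ref: str = "main") -> list[str]:
    """Build the git command that creates an isolated checkout for one task.

    `git worktree add -b <branch> <path> <commit-ish>` creates a new branch
    and checks it out in a separate directory.
    """
    return ["git", "worktree", "add", "-b", f"golden/{task_id}", f"../wt-{task_id}", base_ref]


def evaluate(
    workdir: str,
    checks: dict[str, Callable[[str], bool]],
    judge: Callable[[str], float],
) -> dict:
    """Run deterministic checks first; ask the judge only if all of them pass."""
    failed = [name for name, check in checks.items() if not check(workdir)]
    if failed:
        # No point spending judge tokens on output that fails tests or lint.
        return {"checks_passed": False, "failed_checks": failed, "judge_score": None}
    return {"checks_passed": True, "failed_checks": [], "judge_score": judge(workdir)}
```

In practice each check would wrap a real command (test runner, linter, type checker) and the command list would be handed to `subprocess.run`; both are left out here to keep the sketch self-contained.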
The golden set should not change every day. Version it. Keep it stable enough to measure progress, but add new cases when your architecture or product surface changes. I have around 13 tasks at most in the golden set for one of my projects.
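A small sketch of what "version it" might mean in practice: attach a fingerprint to the golden set so that scores are compared only between runs against the same set. The task shape (dicts with an `id` key) and the function name `golden_set_fingerprint` are assumptions for illustration.

```python
import hashlib
import json


def golden_set_fingerprint(version: str, tasks: list[dict]) -> str:
    """Return a short, order-independent fingerprint of a golden set."""
    # Sort tasks by id and sort keys so the same set always serializes identically.
    payload = json.dumps(
        {"version": version, "tasks": sorted(tasks, key=lambda t: t["id"])},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Storing this fingerprint next to every eval result makes it obvious when a score change comes from the agent and when it comes from an edited task.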
Cold start is the harder problem.
For a fresh project, I usually bootstrap the agent on a curated list of high-quality repos in my stack. For me that’s Go repos with clean architecture (like Consul from HashiCorp, or the stdlib), strong tests, boring abstractions, and maintainable code.
Repo-specific evals are the future: golden tasks, calibrated judges, and continuous regression testing for engineering taste.
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
Looking for companies hiring many remote senior developers and struggling to assess candidates.
We built https://t.co/xvrVwOmjj6: AI-resistant take-home and live-coding tasks based on your codebase.
Even if candidates use AI, you can still evaluate their engineering judgment.
DM me for details. Paid pilot, flexible terms.
It's just a paradigm shift:
Yes, for a while you may write less code by hand.
But the upside is bigger:
You still need strong engineering judgment and intuition — and now you can use them to solve harder problems faster.
For educators, the move is not to ban LLMs.
It’s to redesign programming tasks around their weaknesses.
We went through this ourselves.
Happy to talk — details here:
https://t.co/Y0KNuNywAX
@ParmarShantun What’s your interview process?
We solved the AI take-home problem in our own hiring after several iterations of tuning the tasks.
Now looking for companies hiring at scale to run a pilot with: https://t.co/MTrEHLzMOn
@HaVy69 @LyTran1510 In reality, nothing changed.
You still need to test real engineering judgment.
The only difference: your tasks must now be designed around the weak spots of LLMs, not the strengths.
I described my approach to this problem:
https://t.co/MTrEHLzMOn
@file_mutex @zeeg OSS maintainers have supported million-line codebases with hundreds of contributors for years.
How is that so different from “AI generated the code”?
What it needs is the next abstraction layer (optimized for “compaction”):
https://t.co/PCYiwvYKGB
Following the essay’s advice: what I want is the next abstraction layer for “code” — something that makes huge amounts of AI-generated code easy to read.
Machine code → Assembly → C → high-level languages → …
What’s next, and who’s building it?
My bet: we’ll still be reading code 100 years from now.
@drnafizhamid @zeeg Yes, but Claude feels like a case study in what not to do. They lost momentum after Opus 4.5, and Harness is a bug monster. Wouldn’t be surprised if a postmortem drops soon: “Why we went back to reading code.” lol
After several iterations, the “prompt” gets long too.
But I think the loop is the point: it filters out the noise and keeps only what matters for this project, at this moment.
So the right feedback loop is much more important than the initial prompt — or any prompt-writing skill.
A prompt is like a map. The loop is the coastline being redrawn as you sail.
I guess you get the analogy :)
My take on AI agent loops changed while trying to improve an agent on SlopCodeBench — the closest benchmark I’ve seen to real software work: you write code, then requirements keep changing, and the original architecture either survives or collapses.
At first I tried the obvious thing: better specs, longer coding style guides, DRY/KISS/etc.
It barely helped. The specs became so detailed that writing the code myself was faster.
What worked was looping the agent environment itself: prompts, linters, evals.
My loop now:
Split a task into iterations. Each iteration adds a feature that creates real architectural pressure.
Run 3 agents: one writes the task, one writes code, one judges code quality.
The next feature is locked until the quality score is above 0.95.
If the score is 0.95 or below, the judge updates AGENTS.md, rolls back the feature, and makes the coding agent rewrite it.
Over time, AGENTS.md + linters + style guides become the project’s memory.
That’s where agent loops make sense to me: not “let the agent code forever,” but “make every failure improve the system that produces the next attempt.”
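The gated loop described above can be sketched roughly as follows. The agents are passed in as plain callables, so nothing here depends on a particular model or framework. The function name `run_iterations`, the callable signatures, and the `max_attempts` cap are assumptions added for illustration; only the 0.95 threshold and the update, roll back, rewrite order come from the post.

```python
from typing import Callable

QUALITY_THRESHOLD = 0.95  # the gate from the post: the next feature unlocks above this


def run_iterations(
    features: list[str],
    implement: Callable[[str], None],
    judge: Callable[[str], float],
    update_instructions: Callable[[str, float], None],
    rollback: Callable[[str], None],
    max_attempts: int = 5,  # assumption: a cap so a stuck feature cannot loop forever
) -> list[tuple[str, int, float]]:
    """Implement features one by one; a feature must pass the gate before the next starts."""
    history = []
    for feature in features:
        for attempt in range(1, max_attempts + 1):
            implement(feature)
            score = judge(feature)
            history.append((feature, attempt, score))
            if score > QUALITY_THRESHOLD:
                break  # gate passed, unlock the next feature
            # Gate failed: record the lesson (e.g. in AGENTS.md), undo the work, retry.
            update_instructions(feature, score)
            rollback(feature)
        else:
            raise RuntimeError(f"{feature!r} never passed the quality gate")
    return history
```

The important property is that every failed attempt leaves a trace in the instructions before the code is thrown away, so the next attempt starts from a better environment rather than from the same prompt.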
3 things I wanted to understand about "agentic loops":
1. What are they actually?
2. Is it hype?
3. What are the real use cases?
This is the most practical, clearly explained video on "agentic loops" on the internet (thx @Rasmic)
https://t.co/BhZSTfM4uf
@BVeiseh Same feeling. I was really hoping this would make me switch back to Claude from GPT-5.5, but it’s not there yet.
It still can’t get through my CLI test:
I tested Fable 5 on my small AI coding-agent eval.
Same task I use for GPT-5.5 and Opus 4.8.
Not a hard algorithm.
Not a giant repo.
Just a small CLI project with requirements added one by one.
The goal is to see whether the initial design can survive iterative product changes.
Result from Fable 5, default settings:
https://t.co/Kw6V6oARi8
TL;DR: I don’t see a real jump here since the Opus 4.5 / GPT-5.5 moment. On this type of iterative engineering task, agents still tend to produce slop unless you actively control the architecture.
Here is the exact process:
This feels like hype and a huge simplification of the actual problem — basically a new framing of the good old RALPH loop.
The hard part is missing: models still struggle with multi-step tasks, especially in software engineering, where they start producing slop as requirements evolve.
Right now every team seems to solve this in isolation: tweak the system prompt, build internal RL/eval loops, and try to catch where the model drifts from the spec.
A simple example from a small real coding task, tested on Fable 5:
https://t.co/wSGwrclJmI