luka @lukatechme - Twitter Profile

1 day ago

When an agent can’t hit my code-quality bar after a few iterations, I lower the LOC limits per file/function and reduce the allowed number of files. It starts optimizing for simplicity, which is often the clearest instruction. The same works for ChatGPT: “1 paragraph max.” The best solutions appear under pressure :)
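A minimal sketch of how such LOC limits could be enforced as a deterministic check. The function name `find_violations` and both limit parameters are hypothetical, and it inspects Python source via the standard `ast` module purely for illustration; the post does not say which tool or language is used for this, and blank lines and comments are counted as lines here.

```python
import ast


def find_violations(source: str, max_file_loc: int, max_func_loc: int) -> list[str]:
    """Return readable messages for every LOC limit the source exceeds."""
    violations = []

    # File-level limit: count every physical line, including blanks and comments.
    file_loc = len(source.splitlines())
    if file_loc > max_file_loc:
        violations.append(f"file: {file_loc} lines > {max_file_loc}")

    # Function-level limit: span from the `def` line to the last line of the body.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            func_loc = node.end_lineno - node.lineno + 1
            if func_loc > max_func_loc:
                violations.append(f"{node.name}: {func_loc} lines > {max_func_loc}")
    return violations


SAMPLE = "def add(a, b):\n    total = a + b\n    return total\n"
print(find_violations(SAMPLE, max_file_loc=100, max_func_loc=2))
```

Tightening `max_func_loc` or `max_file_loc` between iterations is one way to apply the pressure described above without rewriting the instructions.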

luka

@lukatechme

1 day ago

FrontierCode is basically the eval every serious codebase will eventually need.

We can’t keep writing specs, guides and “rules for agents” and hope the model figures it out. First, perfect specs don’t exist. Second, projects are not static. We add features, refactor architecture, shift product priorities, discover edge cases, and constantly change what “good code” means in this repo.

So we need a golden set for our project: a small, stable set of tasks that we run our agent against every time we change the model, prompts, tools, context strategy, or coding rules. Then you inspect the output like a maintainer: Would I merge this PR? Did it solve the actual problem? Did it introduce regressions? Is the code idiomatic for this repo? Are the tests meaningful? Did it stay within scope?

Automation starts when the output is still bad and hand-tuning instructions stops working. My usual loop:

One agent implements every task from the golden set from scratch, in parallel and in isolation (I just use git worktrees for this).
Another agent acts as a judge. It knows the ideal solution, the repo conventions, and the review rubric. It scores the output, finds failure patterns, and proposes changes to the first agent’s instructions.
Then repeat.

But the judge is not magic. You still want deterministic checks (tests, lint, typecheck, benchmarks) and human review for calibration.

The golden set should not change every day. Version it. Keep it stable enough to measure progress, but add new cases when your architecture or product surface changes. I have at most around 13 tasks in the golden set for one of my projects.

Cold start is the harder problem. For a fresh project, I usually bootstrap the agent on a curated list of high-quality repos in my stack. For me that’s Go repos with clean architecture (like Consul from HashiCorp, or the stdlib), strong tests, boring abstractions, and maintainable code.

Repo-specific evals are the future: golden tasks, calibrated judges, and continuous regression testing for engineering taste.
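A rough sketch of how one golden-set run might be scored, assuming the deterministic checks and the judge's score have already been collected for each task. Every name here (`GoldenTask`, `RunResult`, `would_merge`, the `v1` label, the 0.8 threshold, the task ids) is hypothetical and not taken from the post; the only idea borrowed from it is that deterministic checks gate the result and the judge's score comes on top.

```python
from dataclasses import dataclass

GOLDEN_SET_VERSION = "v1"  # hypothetical label; the post only says "version it"
MERGE_THRESHOLD = 0.8      # hypothetical cut-off for the judge's score


@dataclass(frozen=True)
class GoldenTask:
    task_id: str
    prompt: str


@dataclass
class RunResult:
    task_id: str
    checks: dict[str, bool]  # deterministic checks, e.g. tests, lint, typecheck
    judge_score: float       # 0.0 to 1.0, produced by the judge agent


def would_merge(result: RunResult, threshold: float = MERGE_THRESHOLD) -> bool:
    """A failed deterministic check is a veto; the judge only decides the rest."""
    return all(result.checks.values()) and result.judge_score >= threshold


def summarize(results: list[RunResult]) -> dict:
    """Collapse one run of the golden set into a comparable summary."""
    failed = [r.task_id for r in results if not would_merge(r)]
    return {
        "version": GOLDEN_SET_VERSION,
        "merged": len(results) - len(failed),
        "total": len(results),
        "failed": failed,
    }


results = [
    RunResult("add-retry", {"tests": True, "lint": True}, 0.9),
    RunResult("refactor-config", {"tests": False, "lint": True}, 0.95),
    RunResult("cli-flag", {"tests": True, "lint": True}, 0.6),
]
summary = summarize(results)
print(summary)
```

Comparing summaries that share the same version label before and after a change to the model, prompts, or rules gives a simple regression signal.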

Cognition @cognition

9 days ago

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers. Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

luka

@lukatechme

2 days ago

@CliftonSellers Interested. Just sent a DM.

luka

@lukatechme

2 days ago

@pie6k If you’re OK with hiring through take-home tasks, we can help. Details here:

luka

@lukatechme

2 days ago

Looking for companies hiring many remote senior developers and struggling to assess candidates. We built https://t.co/xvrVwOmjj6: AI-resistant take-home and live-coding tasks based on your codebase. Even if candidates use AI, you can still evaluate their engineering judgment. DM me for details. Paid pilot, flexible terms.

luka

@lukatechme

2 days ago

It's just a paradigm shift. Yes, for a while you may write less code by hand. But the upside is bigger: you still need strong engineering judgment and intuition, and now you can use them to solve harder problems faster. For educators, the move is not to ban LLMs. It’s to redesign programming tasks around their weaknesses. We went through this ourselves. Happy to talk. Details here: https://t.co/Y0KNuNywAX

luka

@lukatechme

2 days ago

@ParmarShantun What’s your interview process? We solved the AI take-home problem in our own hiring after several iterations of tuning the tasks. Now looking for companies hiring at scale to run a pilot with: https://t.co/MTrEHLzMOn

luka

@lukatechme

2 days ago

@HaVy69 @LyTran1510 In reality, nothing changed. You still need to test real engineering judgment. The only difference: your tasks must now be designed around the weak spots of LLMs, not the strengths. I described my approach to this problem here: https://t.co/MTrEHLzMOn

luka

@lukatechme

3 days ago

@file_mutex @zeeg OSS maintainers have supported million-line codebases for years, with hundreds of contributors. How is that so different from “AI generated the code”? What’s needed is the next abstraction layer (optimized for “compaction”): https://t.co/PCYiwvYKGB

luka

@lukatechme

3 days ago

Following the essay’s advice: what I want is the next abstraction layer for “code” — something that makes huge amounts of AI-generated code easy to read. Machine code → Assembly → C → high-level languages → … What’s next, and who’s building it? My bet: we’ll still be reading code 100 years from now.

luka

@lukatechme

3 days ago

@star_lytt @paulg What’s your idea about?

luka

@lukatechme

3 days ago

@paulg Sending a signal to the universe :)) Looking for a friend who’s working on this: https://t.co/PCYiwvYKGB

Paul Graham

@paulg

3 days ago

How to Earn a Billion Dollars: https://t.co/WeWBUkKym6

luka

@lukatechme

3 days ago

Can’t wait for this vibe to go mainstream.

David Cramer

@zeeg

4 days ago

@file_mutex people who don’t read the code are not serious people and it takes a serious person to ship production software

luka

@lukatechme

3 days ago

@drnafizhamid @zeeg Yes, but Claude feels like a case study in what not to do. They lost momentum after Opus 4.5, and Harness is a bug monster. Wouldn’t be surprised if a postmortem drops soon: “Why we went back to reading code.” lol

luka

@lukatechme

7 days ago

After several iterations, the “prompt” gets long too. But I think the loop is the point: it filters out the noise and keeps only what matters for this project, at this moment. So the right feedback loop is much more important than the initial prompt — or any prompt-writing skill. A prompt is like a map. The loop is the coastline being redrawn as you sail. I guess you get the analogy :)

luka

@lukatechme

7 days ago

My take on AI agent loops changed while trying to improve an agent on SlopCodeBench, the closest benchmark I’ve seen to real software work: you write code, then requirements keep changing, and the original architecture either survives or collapses.

At first I tried the obvious thing: better specs, longer coding style guides, DRY/KISS/etc. It barely helped. The specs became so detailed that writing the code myself was faster.

What worked was looping the agent environment itself: prompts, linters, evals. My loop now:

Split a task into iterations. Each iteration adds a feature that creates real architectural pressure.
Run 3 agents: one writes the task, one writes code, one judges code quality.
The next feature is locked until quality score > 0.95.
If score < 0.95, the judge updates AGENTS.md, rolls back the feature, and makes the coding agent rewrite it.

Over time, AGENTS.md + linters + style guides become the project’s memory. That’s where agent loops make sense to me: not “let the agent code forever,” but “make every failure improve the system that produces the next attempt.”
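The gating step of this loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's implementation: `run_feature`, `fake_coder`, `fake_judge`, the `max_attempts` cap, and the example lesson text are all hypothetical, the agents are plain callables, and the list `rules` stands in for AGENTS.md. Only the 0.95 threshold comes from the post.

```python
from typing import Callable, Optional

QUALITY_GATE = 0.95  # threshold quoted in the post


def run_feature(
    feature: str,
    write_code: Callable[[str, list[str]], str],
    judge: Callable[[str], tuple[float, str]],
    rules: list[str],
    max_attempts: int = 5,  # hypothetical cap so the loop always terminates
) -> tuple[Optional[str], int]:
    """Retry one feature until the judge's score clears the gate."""
    for attempt in range(1, max_attempts + 1):
        code = write_code(feature, rules)
        score, lesson = judge(code)
        if score > QUALITY_GATE:
            return code, attempt  # gate passed: the next feature is unlocked
        # Gate failed (a score of exactly 0.95 is treated as a failure here,
        # which is an assumption): record the lesson and drop the rejected
        # code, standing in for "update AGENTS.md and roll back the feature".
        rules.append(lesson)
    return None, max_attempts


# Stub agents for demonstration only; real ones would call a model.
def fake_coder(feature: str, rules: list[str]) -> str:
    return f"{feature} written with {len(rules)} rules"


scores = iter([0.7, 0.9, 0.97])


def fake_judge(code: str) -> tuple[float, str]:
    return next(scores), "keep command handlers small"


rules: list[str] = []
code, attempts = run_feature("add --json flag", fake_coder, fake_judge, rules)
print(attempts, len(rules), code)
```

The point the sketch tries to capture is that each rejected attempt leaves something behind in `rules`, so the next attempt starts from a better environment rather than from the same prompt.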

GREG ISENBERG

@gregisenberg

7 days ago

3 things I wanted to understand about "agentic loops": 1. What are they actually? 2. Is it hype? 3. What are the real use cases? This is the most practical, clearly explained video on "agentic loops" on the internet (thx @Rasmic) https://t.co/BhZSTfM4uf

luka

@lukatechme

7 days ago

@gregisenberg @Rasmic Another way to think about loops: https://t.co/OpECHw356k

luka

@lukatechme

7 days ago

@jakehalloran1 How did you test it?

luka

@lukatechme

7 days ago

@BVeiseh Same feeling. I was really hoping this would make me switch back to Claude from GPT-5.5, but it’s not there yet. It still can’t get through my CLI test:

luka

@lukatechme

7 days ago

I tested Fable 5 on my small AI coding-agent eval. It’s the same task I use for GPT-5.5 and Opus 4.8. Not a hard algorithm. Not a giant repo. Just a small CLI project with requirements added one by one. The goal is to see whether the initial design can survive iterative product changes. Result from Fable 5, default settings: https://t.co/Kw6V6oARi8 TL;DR: I don’t see a real jump here since the Opus 4.5 / GPT-5.5 moment. On this type of iterative engineering task, agents still tend to produce slop unless you actively control the architecture. Here is the exact process:

luka

@lukatechme

7 days ago

This feels like hype and a huge simplification of the actual problem — basically a new framing of the good old RALPH loop. The hard part is missing: models still struggle with multi-step tasks, especially in software engineering, where they start producing slop as requirements evolve. Right now every team seems to solve this in isolation: tweak the system prompt, build internal RL/eval loops, and try to catch where the model drifts from the spec. A simple example from a small real coding task, tested on Fable 5: https://t.co/wSGwrclJmI
