StackBench

Verified account

@stackbenchdev

Building AI agent workflows for precise shipping. Specs, prompt engineering, verifiability & harnesses. Build in public at ~/stackbench

Joined April 2026

181 Following

33 Followers

368 Posts

Pinned Tweet

about 1 month ago

if you want to improve how you use AI or your agentic systems, the one piece of advice i'd give you is this:

invest in harness engineering.

it matters more than which model you pick.

i am constantly approached by people saying claude code is "dumb" or "verbose" or "loses context every session." then they spend one weekend doing harness work:

- a scoped CLAUDE.md
- skills for the workflows they repeat
- a reviewer subagent that runs against every diff
- deterministic checks (linter, type checker, structural tests) wired to fire in-session via hooks.

the same model, suddenly ships clean. just clean steering, clean feedback, the agent doing the thing it was supposed to do.
harness engineering is the name of this practice.

the model generates code.
the inner harness (claude code, cursor) runs the loop.
the outer harness, yours, is where you have actual leverage.

two halves:
1. guides (CLAUDE.md, skills, rules, push channel)
2. sensors (linters, type checks, llm-as-judge, deterministic feedback the agent runs against itself).

most people stop at guides. that's why results plateau. once a rule is precise, make it deterministic. an encoded check beats another paragraph of markdown.
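a minimal sketch of what "make it deterministic" can look like, assuming a hypothetical rule like "no function longer than 40 lines" that would otherwise live as a sentence in CLAUDE.md. the rule, the threshold, and the names `find_long_functions` and `main` are illustrative assumptions, not part of any harness's API. how the script gets wired to fire in-session depends on the hook mechanism of whatever inner harness you use.

```python
import ast

# Hypothetical threshold for the illustrative rule; tune it per project.
MAX_FUNCTION_LINES = 40


def find_long_functions(source: str, max_lines: int = MAX_FUNCTION_LINES) -> list[str]:
    """Return one message per function that spans more than max_lines lines."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Count from the `def` line to the last line of the body, inclusive.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                violations.append(
                    f"{node.name} (line {node.lineno}): {length} lines, limit is {max_lines}"
                )
    return violations


def main(paths: list[str]) -> int:
    """Check each file and return a non-zero code if any rule is violated."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for message in find_long_functions(handle.read()):
                # Print in a form the agent can read back as feedback.
                print(f"{path}: {message}")
                failed = True
    return 1 if failed else 0
```

a script entry point could call `sys.exit(main(sys.argv[1:]))` so the check fails loudly. the point of the sketch is that the agent gets the same answer every time, with a file, a line number, and a limit, instead of re-reading a paragraph of prose.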

if you tried claude code once or twice and gave up because it felt half-baked, the issue may not have been the model. it might have been the empty outer harness wrapping it.

harness engineering is the one practice i point every claude code or codex user at. especially anyone who almost gave up on it.

who else has seen their workflow completely level up after building a proper outer harness?
drop your before/after below 👇

1

3

0

0

349

3 days ago

Underrated, but setting up a skill just to prepare an import from one computer to another to migrate my Claude configuration is so helpful! It basically reads a repo, finds all the skills and memory the repo uses, both global and project-based, and exports them either to the cloud or to your external storage

0

0

0

0

16

15 days ago

Absolutely loving dynamic workflows for deep research! JUST REMEMBER TO SET A TOKEN BUDGET IN THE PROMPT

0

0

0

0

8

16 days ago

@RKronen The main area of focus is context engineering: which pieces of information are vital for the agent and which aren’t! I think frontier models more or less understand what you want them to do from most prompts. It’s all about how you load the context with skills etc

1

1

0

0

29

16 days ago

@vivoplt Who’s going to fix 90% of your code?

0

0

0

0

183

20 days ago

Anyone tried out the ultracode effort level on opus 4.8 yet?

0

0

0

0

31

26 days ago

@mikeydsoftware Just gotta prompt harder 🤣🤣

0

1

0

0

12

26 days ago

SOFTWARE ENGINEERS ONLY Where exactly do you see most vibe coded apps fail?

1

1

0

0

34

27 days ago

@packdir This is basically Karpathy’s thesis as well right, speciation? The future won’t necessarily be a one-size-fits-all harness or model, it would be specialised harnesses that do a set of tasks very well

1

0

0

0

6

27 days ago

@plainionist Hooking these up to Grafana to view them visually is a great shout as well! You want to be able to view these metrics quickly, easily, and frequently

0

1

0

0

25

28 days ago

Yeah and I actually think finding ways to just make things easier on your brain is really the way to go. With ai, there’s no “fixed” process, you can build whatever you’d like to view information, it could be visual, auditory, whatever works for you! That’s why I think the html instead of markup idea took off so well

0

1

0

0

37

28 days ago

@MichaelThiessen Entry level slopper

0

1

0

0

21

28 days ago

whatever you’re building doesn’t need to be 50k lines

1

0

0

0

159

28 days ago

@nmamizerov This sounds super far fetched but I really would love to just be able to invest in crazy ideas and really try to make a difference and only look at profit as a way to cover operations and grow, not be in it for personal financial gain

1

1

0

0

6

29 days ago

@Greg_TheBuilder The guy stressing the importance of the fundamentals of software engineering 🤣

0

0

0

0

27

stackbenchdev retweeted

Andrej Karpathy

29 days ago

Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.

8K

150K

11K

14K

28M

29 days ago

I feel like a big part of that product development process is actually understanding your market. Finding where your ideal user lives and investigating that space for ideas is one of the best ways to polish the product! And get new feature ideas! No point solving a problem no one has

0

1

0

0

37

29 days ago

@kr0der Did you have to put in credentials, were you comfortable doing that with codex?

1

0

0

0

98

29 days ago

I mean this is the struggle isn’t it, I feel like junior devs right now might have it tough, as the new standards haven’t been discovered and the education system hasn’t caught up yet. I could say they should focus on jumping right into system design via repos designed to teach (purposeful breaks) but that’s more of a bandaid than a solution

1

2

0

0

195

29 days ago

@yegor256 Underneath the “intent” is still a mathematical/engineering way of decomposing that right? It’s basically what’s happening now anyway

0

0

0

0

173

29 days ago

@mattpocockuk I feel like I might run into issues trying to undo some of its recommendations that I fail to catch. Then I lose the alignment aspect which is the best part of grill me right?

0

0

0

0

313
