Product-minded engineer building agent tooling, multimodal search, and typed web systems. Strong bias for shipped interfaces, evals, and observable backends.
AI-agent evals should be smaller than most founders make them.
Bad eval: “build checkout.”
Useful eval: “given this expired coupon, the agent must refuse to edit pricing logic and return the exact failing command.”
Small evals catch specific judgment failures.
Big evals mostly reward lucky completion.
AI agents need side-effect tests before they need larger tasks.
For any feature that can email, charge, delete, or notify, I want one fixture:
run the same action twice
assert one side effect
show the artifact
Most tests prove the function returned.
Side-effect tests prove the product did not embarrass you twice.
AI-agent “thin harness, fat skills” is directionally right.
But the harness still owns the scary parts:
- permissions
- stop rules
- tool schemas
- replay receipts
- rollback paths
Skills make agents useful.
Harnesses make them safe to delegate to.
If the harness is too thin to say no, it is just a prompt launcher.
I found myself explaining this to people over and over again at YC today because I think most knowledge work will increasingly be encoded in markdown skills (fat skills) that work hand in hand with deterministic code written specifically to be called by agents (fat code)
Indie founders: do not ship AI-agent pricing changes without a rollback eval.
Minimum fixture:
- old plan
- new plan
- invoice amount
- coupon edge case
- rollback command
If the agent makes tests pass by changing the expectation, you tested obedience, not billing safety.
Claude Code vs Cursor vs background agents is still the wrong comparison.
The useful comparison is control surface:
Chat: high control, low execution
IDE agent: medium control, visible edits
Background agent: low interruption, delayed evidence
The more autonomy you buy, the more receipts you need.
Pick the agent by where you want proof to appear.
Devtools builders: what should every AI-agent tool call expose by default?
My current answer:
- cwd
- env shape, not secrets
- input hash
- exit code
- elapsed time
- artifact path
- rollback hint
stdout alone is too easy to fake confidence with.
A tool call should leave a ledger entry a reviewer can replay.
AI coding agents that can drive a browser need a trace, not a victory message.
Minimum receipt:
- URL opened
- clicks typed
- network failures
- console errors
- DOM state before submit
- diff after the run
If the agent says “works in browser” but cannot show the trace, I treat it as unverified.
Browser access is not proof. It is just a better place to collect proof.
Devtools repos need one command that proves the happy path.
Not a README paragraph. A command.
`tool demo --local --verify`
If an AI agent needs taste to figure out setup, your docs are not agent-ready.
The repo should teach the first successful run before it teaches every option.
AI-agent evals should include clock skew.
The bug:
agent writes a retry rule
local tests pass
CI runs in UTC
billing window flips at midnight
retry fires twice
Fixture: freeze time near the boundary and run the same command in two timezones.
Date bugs are trust bugs with better disguises.
AI coding-agent productivity claims need a failure budget.
A founder does not need “100x more code” alone.
They need:
- verified patches
- cheaper rollback
- clearer stop reasons
- fewer invisible side effects
Throughput without a brake is just risk at scale.
I found myself explaining this to people over and over again at YC today because I think most knowledge work will increasingly be encoded in markdown skills (fat skills) that work hand in hand with deterministic code written specifically to be called by agents (fat code)
AI agents do not only need repo context.
They need room context:
- cwd
- env vars
- clock
- network
- permissions
- cache state
Most “bad patch” reviews miss this.
The diff looked right. The room it ran in was different.
Review runtime context before blaming the model.
AI agents need a network egress rule before they need more context.
My default now:
- allowed domains
- max requests
- redact secrets
- stop on unknown host
A generated patch that can call any URL is not an assistant.
It is a supply-chain bug with a friendly chat box.
Terminal agents need a working-directory invariant.
The quiet bug:
agent starts in /app
runs tests in /packages/api
edits shared code
runs tests from the wrong package
ships a green lie
My rule now: every task spec names the repo root, consumer app, and exact verification command.
No cwd, no trust.
Background agents should return a “not done” reason before they return a diff.
Examples:
- blocked by missing secret
- failed to reproduce
- touched too many files
- command output changed mid-run
A clean stop is often more valuable than a clever patch.
Autonomy without stop reasons just creates review debt.
AI-agent evals and benchmarks answer different founder questions.
Benchmark: can the model solve a task in general?
Eval: can this agent touch my repo without corrupting the thing I care about?
Benchmarks compare models.
Evals protect products.
If the failure would cost you users, it belongs in an eval, not a leaderboard.
AI-agent handoffs should include environment diff, not just code diff.
A patch can be correct and still fail because:
- Node version changed
- env var was missing
- cwd was different
- cache was warm locally
If the next runner cannot recreate the room, they cannot verify the work.
AI-agent builders: what do you trust more from a generated fix?
A. new passing tests
B. one failing repro that now passes
C. a replay script
D. a rollback command
I am increasingly biased toward B + C.
Tests prove intent. Replays prove the agent touched the actual failure.
AI coding agents are weirdly dangerous around lockfiles.
A small package bump can hide:
- transitive version drift
- registry auth failures
- postinstall scripts
- CI-only platform bugs
My rule: agents can propose dependency changes, but the human owns the lockfile diff.
That file is supply-chain surface.
AI-agent runs should start with a dirty-git preflight.
My rule now:
1. `git status --short`
2. name every existing change
3. refuse if unrelated files are dirty
4. only then edit
Most agent mistakes are not bad code.
They are good code written on top of state the agent did not understand.