Yeah, and it's fantastic when I have control over when to stop. But it frustrates me with code review: I fix one issue and the agent notices another (possibly unrelated). Sometimes it's wrong, which is mildly irritating; other times it's right, which is helpful but, left unchecked, becomes distracting. Everything we ship is incomplete, and I feel like most agents have a poor conception of prioritization.
I think the actual root cause of the problem is that it is quite hard to create a benchmark that is ~unsaturable (for a reasonable amount of time) at arbitrary amounts of test-time compute and with arbitrary scaffolds. My hypothesis is that the existence of such a benchmark would automatically incentivize the field to move in this direction. Thoughts? Wrote more about this here: https://t.co/OnvnCtHbZp
Lots of people telling "bad VC" stories. Unfortunately I don't have any good ones; people were generally nice and respectful.
And then there's @JenniferHli and @caseyaylward who were truly amazing.
Howard Hawks said that a great movie is "three good scenes and no bad ones." A lot of early products are the same: a few features people get value from and no complete deal-breakers (bad regressions from prior workflow).
My own adoption experiences:
Arc: vertical tabs, Cmd+Shift+C, split view; built on Chromium, no major regressions vs Chrome
Cursor: Cmd+K, Cmd+L, tab; forked VSCode
Granola: notes, one-click share, ask anything; did not have a bot enter each meeting with me (was not previously using a notetaker partly for this reason)