I might be one of the few people who are this bearish on human research taste and this bullish on automated research:
- "AIs can only do hyperparameter search" is mainly a skill issue with bad automated research setups.
- human taste is overrated, e.g. frontier labs / neolabs are doing pretty similar things.
- human taste might win in a low-compute world, but not in the high-compute world we're entering.
anyway. the anthropic team (@bcherny, @trq212 and everyone else) is clearly working hard on this. just wanted to share concrete data instead of vibes. it's easy to criticize those in the arena, but please remember they are humans working their butts off for your benefit <3
been data mining my claude code transcripts to figure out why conversations "feel worse" lately. pulled out 264 "tilt" incidents across 127 sessions, ~9k tool calls and this is what i found...
in contrast to others i don't think the fix is more thinking tokens. tool selection and "thrash" detection are two places to start: make the model verify against actual source material before it starts claiming things, and detect when the model starts swirling and second-guessing
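to make "thrash" detection a bit more concrete, here's a rough sketch of one way it could work: flag the point where the same tool call keeps repeating inside a short window. everything here is an assumption for illustration (the `ToolCall` shape, the `find_thrash` name, the window and threshold values). it is not how claude code does it and not the method behind the numbers above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical, simplified record of one tool call in a transcript."""
    name: str    # e.g. "Edit", "Read", "Bash" (illustrative values)
    target: str  # e.g. a file path or a command string


def find_thrash(calls, window=6, max_repeats=3):
    """Return indices where a call has occurred at least `max_repeats`
    times within the last `window` calls (including itself).

    Both thresholds are arbitrary assumptions, not tuned values.
    """
    flagged = []
    for i, call in enumerate(calls):
        # Look only at the most recent `window` calls ending at index i.
        recent = calls[max(0, i - window + 1): i + 1]
        if Counter(recent)[call] >= max_repeats:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    session = [
        ToolCall("Edit", "a.py"),
        ToolCall("Bash", "pytest"),
        ToolCall("Edit", "a.py"),
        ToolCall("Bash", "pytest"),
        ToolCall("Edit", "a.py"),  # third edit of a.py within the window
        ToolCall("Read", "b.py"),
    ]
    print(find_thrash(session))  # prints [4]
```

a real detector would probably need fuzzier matching (near-identical edits, reverted changes), but exact repeats in a sliding window are a cheap first signal.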