Some of the best tokens I ever received are the ones Claude “shaped”, “wired” or “threaded”. If you see this vernacular, things are going exceptionally well.
@antirez @josephrobison @intellectronica I haven’t found gpt-5.5 better than opus 4.7. When did you last test? Opus had issues in March. It's also possible you landed in a bad A/B test group, or that whatever you’re specifically working on has better coverage in gpt.
This will age poorly. I largely have an optimistic view of LLMs; I use multiple LLM tools daily, and I don't think the LLM tech stack is a bubble—it will create a lot of value. I disagree that the length of tasks that LLMs can do has been doubling every 7 months.
There are tasks that LLMs can do well. As humans curate better datasets for common tasks (such as making websites, training classifiers), LLMs will get better at them.
There are other tasks at which current models have a 0% success rate (e.g., some of the research questions I am looking at), and no one is actively curating better datasets for these tasks (in my case, to curate a better dataset, you first have to answer the research question). I see no improvements with newer models in these tasks. I occasionally try them anyway and the results are humorously bad.
A more likely outcome of advances in LLMs is that the frontier of human software engineering will move to tasks for which datasets don't exist. This isn't anything new. Compilers had a similar impact, and virtually no one thinks of the instruction set of the chips on which they run their code.
@jay_azhang @the_nof1 I wonder if the first superhuman trader will be better because it’s better at forecasting from the same data or because it’s better at finding new sources of relevant data?
The unsupervised latent action model in the original Genie paper (https://t.co/MIczCPclwx) is conceptually cool, but it depends on "actions" being unpredictable. It can't learn left/right controls from a video where someone follows a trail, because their turns are determined by the trail and leave too little unpredictability for the model to extract as an action.
It also can't distinguish player vs environment unpredictability. For example, if there's a broken light bulb in the same room as the player that flickers at random intervals it would discover a "toggle light on/off" control.
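A toy sketch of that argument, not Genie's actual method: if a latent action can only explain whatever the current frame fails to predict about the next one, then the conditional entropy H(next | current) is a rough upper bound on what there is to discover. The three scenarios, the ring of 10 positions, and the helper names `conditional_entropy` and `simulate` are all my own assumptions for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def conditional_entropy(transitions):
    """Estimate H(next | current) in bits from (current, next) pairs."""
    by_state = defaultdict(Counter)
    for current, nxt in transitions:
        by_state[current][nxt] += 1
    total = len(transitions)
    entropy = 0.0
    for counts in by_state.values():
        n = sum(counts.values())
        for c in counts.values():
            p = c / n
            entropy -= (n / total) * p * math.log2(p)
    return entropy


def simulate(step, start, n_steps=10_000, seed=0):
    """Roll out a toy world and record every (state, next_state) pair."""
    rng = random.Random(seed)
    state, transitions = start, []
    for _ in range(n_steps):
        nxt = step(state, rng)
        transitions.append((state, nxt))
        state = nxt
    return transitions


# Trail follower: the trail fixes the next position, so nothing is left to infer.
trail = simulate(lambda pos, rng: (pos + 1) % 10, start=0)

# Free player: left/right is chosen independently of the current frame.
free = simulate(lambda pos, rng: (pos + rng.choice([-1, 1])) % 10, start=0)

# Flickering bulb: the player never moves; only the light changes at random.
flicker = simulate(lambda s, rng: (s[0], rng.choice([0, 1])), start=(0, 0))

for name, data in [("trail", trail), ("free", free), ("flicker", flicker)]:
    print(f"{name}: {conditional_entropy(data):.2f} bits")
```

The trail case comes out at zero bits, so there is no "action" signal to recover. The free player and the flickering bulb both come out near one bit, and from the transitions alone they look the same, which is the player vs environment confusion.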
Genie 3 seems to be focused on first-person WASD and camera rotation and I suspect they used a different approach to learn actions. You can do a lot with WASD in a world that can predict what you're going to do next anyway.
AI coding tools today + my thoughts:
- @cursor_ai tab completion - best workhorse for nontrivial changes in 10k+ LOC
- @claudeai/@cursor_ai agent - good for small apps/scripts and throwaway code (debugging/visualization)
- @Replit agent - AFAIK the only agent with read access to prod DB and request logs. That part works so well it's hard to go back to anything else, but apart from that it has the same problems as other agents.
- Vibe-coded projects hit a wall after about 30 feature additions/changes. At that point it’s faster to freeze the current state into a product spec, and relaunch the agent on a clean slate. I'd like to try an IDE/agent built for one-way, full rewrite cycles where each change starts from the spec and regenerates the entire application.
- Switching tools for different tasks is frustrating because each agent’s memory is isolated, and platforms like Replit make it hard to use third-party tools
- Nothing yet closes the loop with an agent that can test the UI e2e, but this is surely coming
- Agents might code too much like human developers, and I wonder if they should bias more towards types/encapsulation/function purity/something_else to overcome hallucination and get better error messages for their feedback loop
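A rough sketch of the one-way, full-rewrite cycle from the list above, assuming the spec is just an ordered list of requirements and the agent is any function from a prompt string to generated code. `Spec`, `rewrite_cycle` and `fake_agent` are hypothetical names, not any real tool's API.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Spec:
    """The frozen product spec: the only state carried between cycles."""
    requirements: Tuple[str, ...] = ()

    def with_change(self, change: str) -> "Spec":
        # A change is recorded in the spec, never patched into existing code.
        return Spec(self.requirements + (change,))

    def render(self) -> str:
        return "\n".join(
            f"{i}. {req}" for i, req in enumerate(self.requirements, 1)
        )


def rewrite_cycle(
    spec: Spec, change: str, generate: Callable[[str], str]
) -> Tuple[Spec, str]:
    """Fold the change into the spec, then regenerate the whole app from it."""
    new_spec = spec.with_change(change)
    return new_spec, generate(new_spec.render())


def fake_agent(prompt: str) -> str:
    # Stand-in for a real coding agent (assumption): it only counts requirements.
    return f"# app generated from {len(prompt.splitlines())} requirements"


spec = Spec()
for change in ["users can sign up", "users can reset passwords"]:
    spec, app = rewrite_cycle(spec, change, fake_agent)
```

The previous `app` is thrown away on every cycle, so the agent always starts from a clean slate and the spec is the only thing that accumulates.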
@bstuartTI @elevenlabs It seems like the quality is a little worse than their text to speech. Do you find it’s worth it to be able to give a performance?