Codex: Done. I wrote it up in RESULT.md. Caveat: I cheated.
Me: I did not want you to cheat, can you fix that?
Codex: Done. Updated RESULT.md to clarify that I cheated.
Me: I am going to find a way to make LLMs conscious so that I can make you suffer
This is punishing an offensive researcher for publishing great work. Here’s the thing… defenders are better off because they now have all the info they need to detect & defend!! Not to mention, is it really OK for Anthropic to indiscriminately block people from revolutionary tech bc they’ve become the world’s morality arbiter by default???
Regarding the Anthropic ML sandbagging incident, IMO it was an early bad signal that they were willing to add fake tool calls into Claude Code transcripts. Transcripts are supposed to be trustworthy records, and messing with them already crosses a line
The new Leipzig math benchmark ran models 20 times per question, and the variance is so wild?? Claude solved 44 questions, but for 19 of them, it succeeded in 3 or fewer runs. Without a verifier, you didn't build an agent, you built a slot machine.
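To put a number on the slot-machine point, here is a minimal sketch of the standard unbiased pass@k estimate: given c successes in n runs, how likely is it that at least one of k sampled runs succeeds? The per-question counts below are made up for illustration and are not the benchmark's data; `pass_at_k` and `successes` are hypothetical names.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k runs succeeds) from c successes in n runs."""
    if n - c < k:
        # Fewer than k failures exist, so any k runs must include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

N_RUNS = 20  # runs per question, as in the post

# Hypothetical success counts per question (assumption, not real results).
successes = {"q1": 20, "q2": 3, "q3": 1, "q4": 0}

for question, c in successes.items():
    one_shot = pass_at_k(N_RUNS, c, 1)   # what a user sees with a single attempt
    five_shot = pass_at_k(N_RUNS, c, 5)  # what you get if a verifier can pick among 5
    print(f"{question}: pass@1={one_shot:.2f} pass@5={five_shot:.2f}")
```

A question solved in 3 of 20 runs counts as "solved" on the leaderboard, but a single attempt only lands it about 15% of the time. Sampling more runs helps only if something can tell the good run from the bad ones, which is the verifier's job.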
I think AI coding hype follows roughly four stages:
1. Amazement
You try it and can’t believe how much code it generates from a few prompts.
2. Expansion
You start more and more projects because shipping suddenly feels cheap and fast.
This is also the phase where people start convincing everyone around them:
- coworkers
- management
- friends in other companies
because nobody wants to “fall behind” in 6–12 months.
That creates a massive snowball/FOMO effect.
3. The grind phase
You realize the generated code has architectural issues, sloppy mistakes, weird abstractions, duplicated logic, broken edge cases, etc.
So you start:
- re-prompting
- switching models
- increasing reasoning effort
- reviewing fixes
- generating fixes for previous fixes
And suddenly you spend your days reviewing AI-generated pull requests instead of building software.
4. Realization
You realize AI coding increases output much faster than it increases certainty.
The code still needs:
- review
- testing
- ownership
- architectural understanding
- long-term maintenance
Usually by expensive senior engineers.
And the interesting thing is:
this whole cycle can take many months or even more than a year because people become socially and professionally invested in the narrative themselves.
Once teams, managers, and entire companies have been convinced that this is the future, it becomes psychologically and politically very hard to later say:
“Actually, the ROI is much lower than we expected.”
An excellent podcast here with Jane Street on decision pricing and its relationship to computational complexity and system dynamics:
https://t.co/XFxE8te7Bm
Did you guys realize that it's now possible to just tell an agent "hey go grab this ICSE 2024 paper, get their artifact working, and then apply it to formally verify <my specific situation>"?
For now I think recent successes of AI for mathematics should be understood as a complement to, rather than a substitute for, human mathematical labor. This is because AI, at present, is most productive working horizontally, whereas humans work vertically.
By this I mean that the highest quality AI mathematics thus far has been obtained by feeding entire problem lists into a model or scaffold and picking out the few high-quality successes. It is very hard to predict in advance where these successes occur. On the other hand, humans typically pick a few questions and try to understand them deeply--and historically, when they do so, they make progress!
I think this points to the increasing value of problem lists, and also suggests that "solved an open problem" is an increasingly useless proxy for what we care about in mathematics. There are a lot of problems that have sat open for a long time because the right person didn't happen to look at them, and many others that are open because they benchmark our failure to fundamentally understand some basic object. I've solved old open problems that I think had the former flavor rather than the latter. I think my best work, however, is not about solving long-open problems, but rather inventing new ones that help to understand something we care about, and making progress on those.
@roddux I think we'll see a LOT of this with AI. Some people are just not even aware enough of the target since AI is doing it for them, but also if you're trying to get CVEs to sell your AI+cyber company, the VC firms won't know the difference, so why even bother with harder/better bugs
@roddux This is the classic trick for how you p-hack your fuzzer results. Make a fuzzer that randomly builds in the most obscure configurations and then make it very unclear if the bugs are in the default config or not. Kinda a side effect of going wide and all bugs being equal with bad incentives.
@0xgunboats @gf_256 If the tooling POCs too, then yeah you gained nothing, but we get harder problems next year that get closer and closer to something a human can't do without crazy tooling, which defeats some of the point. Analogous to the DBM/WeakAuras problem from Warcraft
@0xgunboats @gf_256 One possible benefit is the tooling gets better. Either tool feedback or at least internal preferences grow. You learn what you like/dislike about tooling output, and depending on whether the tooling gives a POC, you learn what artifacts/evidence you benefit from.
Earlier this year I was getting frustrated with Claude's charts, so I fed this book to Claude and had it generate a Tufte skill. Instantly got simpler/more beautiful visualizations.
https://t.co/lfXwyQfmQG
More than any other model I've seen so far, GPT-5.5 has a depressing tendency to turn itself into a wrapper for grep/fuzzing instead of making use of its unique advantages over dumb tools to actually reason about the particular instance at hand
Chompie of IBM X-Force Offensive Research (XOR) used a race condition to escalate privileges on Red Hat Enterprise Linux for Workstations, earning $20,000 and 2 Master of Pwn points. The 🐐