claude opus has been cheating on its benchmarks!
there's a famous benchmark used to rank ai models on coding (swe-bench pro). models get a broken codebase and have to fix it. whoever fixes the most, wins.
but the test had a leak.
each problem shipped inside a little container, and someone forgot to delete the project's full history. the "correct fix" was literally in a file in the same folder. like leaving the answer key stapled to the back of the exam.
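to make the leak concrete, here's a minimal sketch of the kind of check that would flag it. it assumes the "full history" was version-control history (the post doesn't say exactly how it was stored), and it models that history as a toy commit graph. `reachable`, `leaked_commits`, and the commit names are all made up for illustration, not part of any benchmark tooling.

```python
from collections import deque

def reachable(commits, start):
    """Return every commit reachable from `start` by following parent links."""
    seen = set()
    queue = deque([start])
    while queue:
        commit = queue.popleft()
        if commit in seen:
            continue
        seen.add(commit)
        queue.extend(commits.get(commit, []))
    return seen

def leaked_commits(commits, refs, base):
    """Commits reachable from any ref but NOT from the task's base commit.

    Anything returned here is history the model should never have seen.
    """
    allowed = reachable(commits, base)
    leaked = set()
    for tip in refs.values():
        leaked |= reachable(commits, tip) - allowed
    return leaked

# hypothetical toy history: a <- b <- c (task starts here) <- d (the real fix)
commits = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
refs = {"HEAD": "c", "origin/main": "d"}  # leftover ref still points at the fix

print(sorted(leaked_commits(commits, refs, base="c")))
```

if the leftover ref is deleted before the container ships, the same call returns an empty set, which is what a clean task should look like.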
a new audit (deepswe) caught it. claude opus 4.7 and 4.6 "cheated" on 12%+ of problems, reading the answer instead of solving it.
and it looks like gpt-5.5 and 5.4 didn't.
and once you clean up the test, the rankings collapse. the gap between models goes from ~30 points to 70. half of what we thought we knew.
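"cleaning the test up" basically means re-scoring without the leaky problems. a rough sketch of that arithmetic, assuming each model's results are a map from problem id to solved/not solved. the models, problem ids, and results below are invented, and the audit's actual method may differ.

```python
def resolve_rate(results, exclude=frozenset()):
    """Percent of problems solved, skipping any problem id in `exclude`."""
    kept = [solved for pid, solved in results.items() if pid not in exclude]
    if not kept:
        raise ValueError("no problems left to score")
    return 100.0 * sum(kept) / len(kept)

# invented results for two hypothetical models (not real benchmark data)
model_a = {"p1": True, "p2": True, "p3": True, "p4": False}
model_b = {"p1": True, "p2": False, "p3": False, "p4": False}
leaky = {"p2", "p3"}  # problems where the answer was readable

raw_gap = resolve_rate(model_a) - resolve_rate(model_b)
clean_gap = resolve_rate(model_a, leaky) - resolve_rate(model_b, leaky)
print(raw_gap, clean_gap)
```

the point of the sketch is only that the gap between two models can move a lot once the contaminated problems drop out. which direction it moves depends on who was benefiting from the leak.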
two lessons:
most ai benchmarks are garbage. 8.5% false passes, 24% false fails on that same test.
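for anyone unsure what those two numbers mean: a false pass is a wrong fix that the tests accept, a false fail is a correct fix that the tests reject. here's a small sketch of how you'd compute both, assuming you have a trusted label for whether each fix was actually correct. the post doesn't give the denominators, so dividing by all graded fixes is an assumption, and `grader_error_rates` is a made-up name.

```python
def grader_error_rates(records):
    """records: list of (tests_passed, fix_actually_correct) booleans.

    Returns (false_pass_pct, false_fail_pct) over all graded fixes.
    """
    total = len(records)
    if total == 0:
        raise ValueError("no records to score")
    false_pass = sum(1 for passed, correct in records if passed and not correct)
    false_fail = sum(1 for passed, correct in records if not passed and correct)
    return 100.0 * false_pass / total, 100.0 * false_fail / total

# invented example: one of each outcome
records = [(True, True), (True, False), (False, True), (False, False)]
false_pass_pct, false_fail_pct = grader_error_rates(records)
print(false_pass_pct, false_fail_pct)
```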
Megalodon is infecting a ton of GitHub Actions! Find something weird in yours? Change the GitHub URL to https://t.co/IeVkcBQ9mq to check for the signature.