LLM benchmark series is done. Moving on.
Every GPS satellite broadcasts a hidden 22-byte payload. Two decades ignored. Military over-the-air key distribution channel.
Part 1 of 3. What should I dig into next?
https://t.co/V6Mg1KWKGt
Out another week. Two posts went live unannounced.
Brief docstrings: -0.32pts (noise), saves 13% tokens.
https://t.co/3hjugwn8Ao
Capstone v2: kitchen-sink wins. Go +6.76, JS +7.25, C# +3.36. Python is exception.
https://t.co/VI8Mm4CuHk
10,800 benchmarks on verification instructions.
"Trace 3 examples": +3.2pts JS, +5.5pts C#, -0.7pts Python.
Vague "double-check" is the weakest variant. Specificity and language both determine whether it helps.
https://t.co/KokQFLe3oJ
Out a week. Two posts went live unannounced.
Exp 8 (1,800 runs): full plan context worst executor prompt. Focus framing wins.
https://t.co/n6UrJJs1ts
Exp 9 (5,760 runs): CoT hurts Python, helps Go (+5.34), C# (+7.70).
https://t.co/YUD0iswpke
Telling Claude to write 95/100 code does not make it write 95/100 code.
5,760 benchmarks. Four variants. Score spread: 1.2 points. Zero significant differences.
Google-grade quality bar increases variance. AI has no effort level.
https://t.co/ubDmmHBnRf
DeepMind showed step-back prompting helps physics and chemistry. I tested it on code. 4,050 benchmarks.
Both variants trended negative. Negligible effect sizes.
Sixth experiment confirming r=-0.95 between prompt tokens and code quality.
https://t.co/oWiLsmNYJY
5,040 benchmarks on skeleton-of-thought prompting for AI code.
Direct generation wins. Outlining costs 1.5-2.4x more tokens and produces worse code. Pareto-dominated.
Refactoring drops 7.7 points when you outline first.
https://t.co/v7aGzZJOrt
4,680 benchmarks on how you format AI coding constraints.
Plain English wins. Every alternative is Pareto-dominated: lower score, higher variance, more tokens.
CAPS LOCK doesn't help. Numbered lists actually hurt on instruction-following tasks.
https://t.co/SfrUhoDl1i
Same persona. Four placements. 2,520 benchmarks.
Putting instructions in the user message prefix makes the model worse at following those instructions.
System prompt is for behavior. User message is for the task. Keep them separate.
https://t.co/3rmJYQSPvm
I ran 5,400 benchmarks on AI persona stacking.
Adding a security engineer to your code reviewer prompt makes the code worse. Even on tasks designed with SQL injection vectors.
Attention dilution is real. One focused role beats any expert panel.
https://t.co/ObtYWEcM2R
1,080 benchmarks on persona prompting for code tasks.
"Senior engineer" = identical to no persona. Most popular advice does nothing.
Code reviewer: +2.9 on refactoring.
Mentor on Opus: -2.8, stdev 13.8.
https://t.co/L4vPqndI1m
1,080 benchmarks on context pollution.
Sonnet loses 8 points. Opus gains 4. Haiku hits a cliff at 50k tokens.
Your model choice determines whether a bloated context destroys your output.
https://t.co/T6BoIqc3oR
2,160 benchmarks. All three models within 1.5 pts overall -- task gaps hit 8 pts. Haiku beats Opus on refactoring. Opus near-deterministic on instruction following. Sonnet owns bug fixes.
Bigger model does not mean better code.
https://t.co/AhXTzGBK2w
"Think step by step" is gospel in prompt engineering.
I ran 270 benchmarks testing it on code generation. CoT never won. Every variant scored lower than bare prompting on every model. Opus took the biggest hit.
https://t.co/bZpKYIrUiA
Hot take confirmed by data: the people who say "please" to their LLM write better prompts.
810 benchmarks. Warm framing beat bare commands. Effect was largest on refactoring -- tasks requiring judgment, not just execution.
Write-up: https://t.co/m5rXakem2B