Johnathen Chilcher

@jchilcher

Site Reliability Engineer at @GoDaddy. // #wordpress SME // #linux // #gaming // proud dad of 4 rescues. All opinions and views are mine alone.

Arizona, USA

Joined July 2015

98 Following

38 Followers

187 Posts

Johnathen Chilcher @jchilcher

7 days ago

LLM benchmark series is done. Moving on. Every GPS satellite broadcasts a hidden 22-byte payload. Two decades ignored. Military over-the-air key distribution channel. Part 1 of 3. What should I dig into next? https://t.co/V6Mg1KWKGt

Johnathen Chilcher @jchilcher

16 days ago

212,000+ benchmarks. 23 experiments. 4 languages. 7 months. r=-0.95: instruction tokens vs code quality. Generic rules hurt. Baseline competence predicts when they help. https://t.co/y5kTGf9yIb

Johnathen Chilcher @jchilcher

20 days ago

15,120 benchmarks on CLAUDE.md compression. ~80 tokens structured rules wins (+3.36 vs bare). Caveman: +0.23. Within noise. Verbose: +2.16. Worse. https://t.co/XU87rwA4PY

Johnathen Chilcher @jchilcher

23 days ago

15,120 benchmarks on emotional framing. Growth-mindset: +1.88 overall, +5.86 for Haiku. Life-or-death: -3.33 overall, -9.23 for Sonnet. Pressure backfires. Encouragement helps. https://t.co/QtQZQ4mi1N

Johnathen Chilcher @jchilcher

28 days ago

Out another week. Two posts went live unannounced. Brief docstrings: -0.32pts (noise), saves 13% tokens. https://t.co/3hjugwn8Ao Capstone v2: kitchen-sink wins. Go +6.76, JS +7.25, C# +3.36. Python is the exception. https://t.co/VI8Mm4CuHk

Johnathen Chilcher @jchilcher

about 1 month ago

23,760 benchmarks on multi-turn iteration. Blind review = worst variant, -3.64pts vs single-turn. Breaks working code. Closed-loop: +19.04 Haiku+C#, +14.71 Haiku+Go, +1.62 overall. Never iterate without signal. https://t.co/uVuCuMRylm

Johnathen Chilcher @jchilcher

about 1 month ago

10,800 benchmarks on verification instructions. "Trace 3 examples": +3.2pts JS, +5.5pts C#, -0.7pts Python. Vague "double-check" is the weakest variant. Specificity and language both determine whether it helps. https://t.co/KokQFLe3oJ

Johnathen Chilcher @jchilcher

about 1 month ago

4,320 benchmarks on CLAUDE.md configurations. /init + code-reviewer persona = 89.66 pts, highest in the series. Token rule flips for project context. Build commands and architecture help. Style rules hurt. https://t.co/ppV5WzauzI

Johnathen Chilcher @jchilcher

about 2 months ago

Out a week. Two posts went live unannounced. Exp 8 (1,800 runs): full plan context is the worst executor prompt. Focus framing wins. https://t.co/n6UrJJs1ts Exp 9 (5,760 runs): CoT hurts Python, helps Go (+5.34), C# (+7.70). https://t.co/YUD0iswpke

Johnathen Chilcher @jchilcher

about 2 months ago

Telling Claude to write 95/100 code does not make it write 95/100 code. 5,760 benchmarks. Four variants. Score spread: 1.2 points. Zero significant differences. Google-grade quality bar increases variance. AI has no effort level. https://t.co/ubDmmHBnRf

Johnathen Chilcher @jchilcher

about 2 months ago

DeepMind showed step-back prompting helps physics and chemistry. I tested it on code. 4,050 benchmarks. Both variants trended negative. Negligible effect sizes. Sixth experiment confirming r=-0.95 between prompt tokens and code quality. https://t.co/oWiLsmNYJY

Johnathen Chilcher @jchilcher

2 months ago

5,040 benchmarks on skeleton-of-thought prompting for AI code. Direct generation wins. Outlining costs 1.5-2.4x more tokens and produces worse code. Pareto-dominated. Refactoring drops 7.7 points when you outline first. https://t.co/v7aGzZJOrt

Johnathen Chilcher @jchilcher

2 months ago

4,680 benchmarks on how you format AI coding constraints. Plain English wins. Every alternative is Pareto-dominated: lower score, higher variance, more tokens. CAPS LOCK doesn't help. Numbered lists actually hurt on instruction-following tasks. https://t.co/SfrUhoDl1i

Johnathen Chilcher @jchilcher

2 months ago

Same persona. Four placements. 2,520 benchmarks. Putting instructions in the user message prefix makes the model worse at following those instructions. System prompt is for behavior. User message is for the task. Keep them separate. https://t.co/3rmJYQSPvm

Johnathen Chilcher @jchilcher

2 months ago

I ran 5,400 benchmarks on AI persona stacking. Adding a security engineer to your code reviewer prompt makes the code worse. Even on tasks designed with SQL injection vectors. Attention dilution is real. One focused role beats any expert panel. https://t.co/ObtYWEcM2R

Johnathen Chilcher @jchilcher

3 months ago

1,080 benchmarks on persona prompting for code tasks. "Senior engineer" = identical to no persona. Most popular advice does nothing. Code reviewer: +2.9 on refactoring. Mentor on Opus: -2.8, stdev 13.8. https://t.co/L4vPqndI1m

Johnathen Chilcher @jchilcher

3 months ago

1,080 benchmarks on context pollution. Sonnet loses 8 points. Opus gains 4. Haiku hits a cliff at 50k tokens. Your model choice determines whether a bloated context destroys your output. https://t.co/T6BoIqc3oR

Johnathen Chilcher @jchilcher

3 months ago

2,160 benchmarks. All three models within 1.5 pts overall -- task gaps hit 8 pts. Haiku beats Opus on refactoring. Opus near-deterministic on instruction following. Sonnet owns bug fixes. Bigger model does not mean better code. https://t.co/AhXTzGBK2w

Johnathen Chilcher @jchilcher

3 months ago

"Think step by step" is gospel in prompt engineering. I ran 270 benchmarks testing it on code generation. CoT never won. Every variant scored lower than bare prompting on every model. Opus took the biggest hit. https://t.co/bZpKYIrUiA

Johnathen Chilcher @jchilcher

3 months ago

Hot take confirmed by data: the people who say "please" to their LLM write better prompts. 810 benchmarks. Warm framing beat bare commands. Effect was largest on refactoring -- tasks requiring judgment, not just execution. Write-up: https://t.co/m5rXakem2B
