Will Palmer @itswillpalmer - Twitter Profile

Pinned Tweet

3 months ago

Check out the latest article in my newsletter: Issue #8: Claude Code Leaked Their Own Blueprints. Here's What Was Inside https://t.co/yNlkukIyPw via @LinkedIn

Will Palmer

@itswillpalmer

7 days ago

This is cool. Cursor for PMs

Will Palmer

@itswillpalmer

7 days ago

@usehamster This is cool, Cursor for PMs.

Will Palmer

@itswillpalmer

3 months ago

Check out the latest AI Insider Update: Issue #7: Claude Controls the Mac, Composer 2's is actually Kimi K2.5, the Delve SOC 2 Scandal, and Two Leaderboard Updates https://t.co/Mu7hPo0XYw via @LinkedIn

Will Palmer

@itswillpalmer

3 months ago

https://t.co/Hu7l8d5E6p

Will Palmer

@itswillpalmer

3 months ago

Curious - why is OpenAI nudging us to use GPT-5.4 when it hasn't been optimised for coding (Codex)? It costs 1.7x more to run, is less accurate and much slower.

Sigmabench

@sigmabench

3 months ago

Sigmabench: Codex CLI + GPT-5.4 vs GPT-5.3 Codex

- 2 tiers lower on SigmaScore (worse accuracy + speed)
- 42% slower
- 900 runs, 0 timeouts (flawless consistency, like 5.3)
- But accuracy trails 5.3 Codex, Sonnet 4.6, and Opus 4.6
- 5.3 Codex 40% cheaper to run

GPT-5.4 is a bigger, slower, more expensive general-purpose model, not a coding specialist. Generality hurts coding perf + cost.

OpenAI pitches it as 5.3 Codex replacement. Our data says not yet for coding workflows.

Note: This is not the code-optimized version.

Stay tuned: New code quality eval. GPT-5.4 leads early charts

#OpenAI #AICoding

Will Palmer

@itswillpalmer

4 months ago

https://t.co/izdK90uTE8

Will Palmer

@itswillpalmer

4 months ago

With Anthropic pushing builders to pay-as-you-go, having Opus 4.6-level performance at twice the speed and 30% of the cost is timely.

Sigmabench

@sigmabench

4 months ago

We just ran Codex CLI + GPT-5.3 Codex. Wow!

• Accuracy: Opus 4.6 / Sonnet 4.6 level
• Consistency: #1
• Speed: ~2× faster than GPT-5.2 Codex
• Cost: 70% less than Opus, 55% less than Sonnet

Opus level accuracy, at twice the speed, and 30% of the cost!
@OpenAI https://t.co/SE73ZUarV4

Will Palmer

@itswillpalmer

4 months ago

https://t.co/u6wWLxIMY0

Will Palmer

@itswillpalmer

4 months ago

The cost of running this benchmark was $2k for Sonnet 4.6 vs $3k for Opus 4.6. So for bang for buck, Sonnet 4.6 with Claude Code CLI is my choice. It still gets expensive if you are running via the API. What is everyone using for the simple stuff?

Sigmabench

@sigmabench

4 months ago

Sonnet 4.6 is the most accurate model we’ve tested.
Matches Opus 4.6 on performance at ⅔ the cost.

Key benchmarks
• Sigmabench Accuracy: 48.3% (Opus 47.6%)
• SWE-bench Verified: 79.6% (Opus 80.8%)
• Terminal-Bench 2.0: 59.1% (Opus 65.4%)
• OSWorld-Verified: 72.5% (Opus 72.7%)

#Claude #AICoding

Will Palmer

@itswillpalmer

4 months ago

@PaulSolt Shameless plug here. We have built https://t.co/h9V476XAZQ to measure the model harness + the latest models. Would love to know if we are missing anything.

Will Palmer

@itswillpalmer

4 months ago

Sonnet 4.6 looks remarkably close to Opus 4.6. If Sonnet is almost as good and much cheaper, are we going to switch?

If so, Anthropic will be doing itself out of a lot of revenue. So does Sonnet have better margins? https://t.co/QSWPnqF6Ug

Will Palmer

@itswillpalmer

4 months ago

Read the full story in the X Article below. Read it, quote it, discuss what you're seeing in your workflows. https://t.co/pf9mPdE96m Follow for weekly real-world agent updates.

Will Palmer

@itswillpalmer

4 months ago

https://t.co/iuwZgdqZ3y

Will Palmer

@itswillpalmer

4 months ago

OpenClaw + OpenAI, Cursor Under Pressure, and the 8-Hour AI Workday

What actually changed this week, and what it means if you’re building with AI

My takes below #AICoding #Agents

Will Palmer

@itswillpalmer

4 months ago

The constraint isn't model smarts anymore. It's human time on context + review. METR data: Leading models now do 2–5 hrs serial engineering at ~50% reliability. Agents 3–5× faster when they hit. Prediction: 8 reliable hours of scoped work by late 2026. Delegated days with oversight, not autonomous coders. How close are you to this in your setup?
