Spreadsheets have entered the arena! ⚔️
Announcing Spreadsheet Arena, the first research platform for human preference rankings on LLM-generated spreadsheets.
The results? @AnthropicAI Claude Opus is on top, but the gap is tighter than you’d think.
w/ @LTIatCMU, @Cornell, and @scale_ai. 🧵
TL;DR: Spreadsheet generation is multi-dimensional.
Human preference data captures what users actually value, but different dimensions matter across domains, and some signals surface more clearly than others.
Spreadsheet Arena gives us a powerful foundation for evaluation, and a new lens for improving post-training.
Start a battle at https://t.co/MHGdWX6xxm
Read the paper at https://t.co/6GX4ify8CS
@srkundurthy @claranahhh @Zachkirshner @calvincbzhang @ManasiSharma_ @jhnling
Feature effects don’t generalize across domains.
Finance color coding conventions (e.g., blue inputs, black formulas) don't significantly affect model rankings arena-wide.
But zoom into Finance prompts and they're the single strongest predictor of winning.
Even then, expert raters disagree with crowd preferences nearly half the time.
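A rough sketch of how a feature's effect can look different arena-wide vs. within one domain. This is not the paper's analysis: it swaps in a plain win-rate comparison, and the record fields (`domain`, `a_has_feature`, `b_has_feature`, `winner`) and the toy battles are hypothetical, invented only for illustration.

```python
from collections import defaultdict


def feature_win_rates(battles):
    """Win rate of the side that has the feature, per domain and under "all".

    Only battles where exactly one side has the feature are counted.
    """
    wins = defaultdict(int)
    total = defaultdict(int)
    for b in battles:
        if b["a_has_feature"] == b["b_has_feature"]:
            continue  # both or neither side has the feature: no signal
        feature_side = "a" if b["a_has_feature"] else "b"
        for key in (b["domain"], "all"):
            total[key] += 1
            wins[key] += int(b["winner"] == feature_side)
    return {key: wins[key] / total[key] for key in total}


# Toy records, made up for this sketch (NOT data from Spreadsheet Arena).
toy_battles = [
    {"domain": "finance", "a_has_feature": True, "b_has_feature": False, "winner": "a"},
    {"domain": "finance", "a_has_feature": False, "b_has_feature": True, "winner": "b"},
    {"domain": "other", "a_has_feature": True, "b_has_feature": False, "winner": "b"},
    {"domain": "other", "a_has_feature": False, "b_has_feature": True, "winner": "a"},
    {"domain": "other", "a_has_feature": True, "b_has_feature": True, "winner": "a"},
]

rates = feature_win_rates(toy_battles)
for key in sorted(rates):
    print(key, rates[key])
```

In the toy data the feature looks like a coin flip under "all" but decisive inside "finance", which is the kind of pattern a pooled ranking can hide.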