LLM Stats

@LlmStats

Independent AI evaluations lab. Our mission is to accurately measure intelligence/watt/sec across all domains.

Joined February 2025

102 Following

1.4K Followers

371 Posts

LLM Stats @LlmStats

7 days ago

PROGRESSION: Claude Opus 4.8 (@AnthropicAI) sets a new high on LLM Stats Index after 3 years of progress.

> 68 on LLM Stats Index
> Frontier gained 85 index points over the 3.3-year window

This shifts the predictions towards a more optimistic scenario of complete benchmarking saturation.

2

13

2

0

687

LLM Stats @LlmStats

7 days ago

NEW #1: Claude Opus 4.8 (@AnthropicAI) takes the top spot on LLM Stats Index.

> 68 on LLM Stats Index
> +5 over previous SOTA (GPT-5.5)

https://t.co/6p0EnKu3qG

0

4

3

0

295

LLM Stats @LlmStats

16 days ago

@GoogleDeepMind Full model breakdown: https://t.co/9WM2oykLC7

0

0

0

0

227

LLM Stats @LlmStats

16 days ago

NEW: Gemini 3.5 Flash (@GoogleDeepMind) lands at #5 on LLM Stats Index.

>4x faster than other frontier models
>$1.50 / $9 per 1M tokens

It's significantly faster than other frontier models and the quality has increased significantly.

It's also the best tool calling model we've tested lately.

3

5

2

0

758
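
To make the pricing line in the post above concrete, here is a minimal sketch of per-task cost arithmetic. It assumes the two figures are the input price and the output price per 1M tokens, in that order; the post does not say so explicitly. The function name and the token counts are hypothetical.

```python
def task_cost_usd(input_tokens: int, output_tokens: int,
                  input_price_per_m: float = 1.50,
                  output_price_per_m: float = 9.00) -> float:
    """Cost of one task in USD, given prices quoted per 1M tokens.

    Assumption: $1.50 is the input price and $9 is the output price.
    """
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Hypothetical agent task: 200k tokens read, 20k tokens generated.
cost = task_cost_usd(200_000, 20_000)
print(f"estimated task cost: ${cost:.2f}")
```

Because output tokens are priced higher under this reading, a task that generates a lot of text can cost more than one that only reads a long context.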

LLM Stats @LlmStats

20 days ago

@JustinDangel25 @MartinShkreli Followed! DMs seem to be closed

1

0

0

0

23

LLM Stats @LlmStats

23 days ago

@polynoamial Hey Noam! We're in the process of building new hard benchmarks focused on coding/long-ctx. We're a team of 2, fully focused on AI capability tracking since last year. Is it possible we could talk with someone from your team to align on what is maximally useful to measure?

0

8

1

0

621

LlmStats retweeted

Arvind’s Broody

27 days ago

@LlmStats' index predicts benchmark saturation by mid-2027. GPT-5.5 leads, but human-knowledge evals are topping out. Surprised me, now focusing on intelligence per watt and inference speed. That's the ceiling for agent productivity in real workflows.

1

5

3

0

692

LlmStats retweeted

チェリ@AIエンジニア•メタAIインフルエンサー @rN1oO71GTPiEMks

28 days ago

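LLM Stats has introduced its latest leaderboard comparing major AI models on performance, speed, and price. According to the official page, Claude Mythos Preview stands out in reasoning, Gemini 3.1 Pro in coding, and Mercury 2 in output speed. https://t.co/IdzbDrNHRV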

0

2

1

0

674

LLM Stats @LlmStats

28 days ago

NEW: GPT-5.5 Instant is now available on LLM Stats.

Try it now for free in our agent and code playgrounds. https://t.co/V5AgaPsn51

1

4

1

0

309

LLM Stats @LlmStats

28 days ago

NEW: Grok 4.3 is out from @xai.

Now available on LLM Stats. https://t.co/9MlVa7e56F

1

4

1

1

406

LLM Stats @LlmStats

29 days ago

Today we're introducing the LLM Stats Index.

For 3.2 years, we've tracked every frontier model release. The Index aggregates 200+ benchmark results into a single TrueSkill rating per model, spanning law, healthcare, coding, tool calling, vision, and reasoning.

Across every category and every modality, the leading model on the Pareto Frontier is GPT-5.5 (@OpenAI).

On our trajectories, human-knowledge benchmarks saturate by mid-2027.

Capability has been the primary axis. The field is converging on it. Two more are opening.

The first is efficiency: total task cost is the cleanest proxy we have for intelligence/watt. The second is throughput: inference speed becomes the productivity ceiling once models are cheap and good enough.

We're building the next generation of long-horizon coding, tool use, and long context benchmarks.

If you're working on long-horizon evaluation in real domains, we'd like to chat.

3

31

7

5

3K
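
The aggregation step described in the post above (many benchmark results folded into one TrueSkill rating per model) can be sketched roughly as follows. This is an illustration under stated assumptions, not LLM Stats' actual pipeline: it assumes each benchmark is treated as a set of pairwise matches in which the higher score wins, skips ties, and applies the standard two-player, no-draw TrueSkill update with conventional defaults (mu 25, sigma 25/3, beta 25/6) and no dynamics term. The model names and scores are hypothetical.

```python
import math
from dataclasses import dataclass

MU0 = 25.0            # conventional TrueSkill prior mean
SIGMA0 = MU0 / 3      # conventional prior standard deviation
BETA = SIGMA0 / 2     # performance noise per "match"


@dataclass(frozen=True)
class Rating:
    mu: float = MU0
    sigma: float = SIGMA0


def _pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def _cdf(x: float) -> float:
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def update(winner: Rating, loser: Rating) -> tuple[Rating, Rating]:
    """Two-player TrueSkill update with no draws and no dynamics term."""
    c2 = 2 * BETA**2 + winner.sigma**2 + loser.sigma**2
    c = math.sqrt(c2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)   # mean shift factor (can underflow for huge upsets)
    w = v * (v + t)         # variance shrink factor, between 0 and 1
    new_winner = Rating(winner.mu + winner.sigma**2 / c * v,
                        winner.sigma * math.sqrt(1 - winner.sigma**2 / c2 * w))
    new_loser = Rating(loser.mu - loser.sigma**2 / c * v,
                       loser.sigma * math.sqrt(1 - loser.sigma**2 / c2 * w))
    return new_winner, new_loser


def rate_models(results: dict[str, dict[str, float]]) -> dict[str, Rating]:
    """results maps benchmark name -> {model name -> score, higher is better}."""
    ratings: dict[str, Rating] = {}
    for benchmark in sorted(results):
        scores = results[benchmark]
        models = sorted(scores)
        for i, a in enumerate(models):
            for b in models[i + 1:]:
                if scores[a] == scores[b]:
                    continue  # ties are skipped in this simplified sketch
                win, lose = (a, b) if scores[a] > scores[b] else (b, a)
                ratings[win], ratings[lose] = update(
                    ratings.get(win, Rating()), ratings.get(lose, Rating()))
    return ratings


# Hypothetical scores on two hypothetical benchmarks.
example = {
    "bench_coding": {"model_a": 0.81, "model_b": 0.74, "model_c": 0.52},
    "bench_reasoning": {"model_a": 0.66, "model_b": 0.61, "model_c": 0.40},
}
for name, r in sorted(rate_models(example).items(), key=lambda kv: -kv[1].mu):
    print(f"{name}: mu={r.mu:.2f} sigma={r.sigma:.2f}")
```

One design consequence of a pairwise scheme like this is that only the ordering of scores within a benchmark matters, not their scale, which is what lets heterogeneous benchmarks share a single rating.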

LLM Stats @LlmStats

30 days ago

@alex_whedon Can we get access?

0

3

0

0

112

LLM Stats @LlmStats

about 1 month ago

See the full leaderboards at: https://t.co/tmcbIUQDFv @OpenAI @AnthropicAI @Kimi_Moonshot @deepseek_ai

1

1

0

0

192

LLM Stats @LlmStats

about 1 month ago

This is the AI Leaderboard at the end of the week, after the releases of DeepSeek V4, Kimi K2.6, GPT-5.5, and Claude Mythos. https://t.co/xjZjwLuSfB

1

9

0

1

821

LLM Stats @LlmStats

about 1 month ago

This is the AI Coding Leaderboard at the end of the week: https://t.co/qz1UTRbsOa

1

7

2

1

614

LLM Stats @LlmStats

about 1 month ago

@OpenAI Take a look at the full leaderboard: https://t.co/XdgqrpeI0B

0

0

0

1

253

LLM Stats @LlmStats

about 1 month ago

We estimate that GPT-5.5 will be the strongest available model for all use cases. According to our Reasoning Index, which covers 110+ benchmarks, GPT-5.5 surpasses Claude Opus 4.7.

Congrats @OpenAI 🔥 https://t.co/Z2BCPFGRl4

1

44

4

6

2K

LLM Stats @LlmStats

about 1 month ago

Kimi K2.6 is now the best open-weights model on LLM Stats.

Congrats @Kimi_Moonshot team.

It's the first time we've seen an open model directly competing with closed alternatives at 25% of the cost. https://t.co/C6nlStGMn2

0

36

2

4

1K
