Maxim Saplin @msmxm - Twitter Profile

4 months ago

GH Copilot (VSCode Insider Preview) has added the context window stats... Eventually. And what a discovery: GPT-5.2 has just a 128K context window (out of the 272K allowed by the model) https://t.co/GqKsC66rVk

msmxm retweeted

Chenguang Wang (hiring)

@ChenguangWang

6 months ago

♟️ Excited to share our work, LLM Chess! It’s a clean, scalable benchmark showing that even today’s top LLMs still struggle with strategic reasoning and instruction-following in dynamic environments.
📄 Paper: https://t.co/WNZUlFJC7E
🏆 Leaderboard: https://t.co/2L4Nixezpc
💻 Code: https://t.co/lpJSRcumgS

🎯Why Chess? Chess is the original AI challenge: strategic, long-horizon, and grounded. It’s also a clean test for LLMs: no contamination, no memorization, and difficulty scales with progress.

🔑• 50+ models including GPT-o3 @OpenAI, Gemini @Google, Claude @AnthropicAI, DeepSeek @deepseek_ai, Llama @Meta, @Alibaba_Qwen evaluated via agentic gameplay.
• Reasoning models do much better than non-reasoning, yet many still can’t beat random play.
• Top models reach ~758 Elo: good, but nowhere near strong humans.

🧑‍🤝‍🧑 Thank you amazing collaborators @msmxm, @SaiKolasani1, @nrcrispino, @kylepmont, @matei_zaharia, @jaredq_, @Chi_Wang_!

📍The work will also be presented at the NeurIPS FoRLM Workshop on Sun, Dec 7, 3:00–4:15pm PT in Upper Level Room 33ABC. Come chat with us and check out the live leaderboard!

msmxm retweeted

Nicholas Crispino

@NRCrispino

6 months ago

Excited to share our latest work, now on arXiv and at FoRLM @ NeurIPS'25! 🎉

Introducing **LLM Chess**: a benchmark for evaluating reasoning and instruction-following in LLMs through chess.

LLMs now reach experts in math & coding, but can they *reason* in dynamic, multi-step strategic environments? We tested 50+ models. The results? Many models struggle to beat an opponent making *random* moves, and even powerful reasoning models cannot beat a *weak skilled opponent*.

Why chess? It's been the "drosophila of AI" since the 1950s, used as a measuring stick for AI progress and a testbed for planning, strategy, and long-horizon decision-making.

Unlike static benchmarks that get contaminated or saturated, chess offers:
✅ Dynamic, stochastic gameplay
✅ Adjustable difficulty via engine skill
✅ Resistance to memorization

Our setup: LLMs play in an agentic environment, making moves through tool calls.

**Phase 1:** 50+ models play 30 games each vs a random agent, a simple test that many models *fail* due to instruction-following failures or poor performance.

**Phase 2:** Top reasoning models face the Komodo Dragon engine at various Elo scores from 250 to 1375 for performance estimation grounded in the real world (tied to chess.com Elo).

Key findings for Phase 1:

♟️ Reasoning models crush non-reasoning: **45.4% vs 0.7%** win rate, with many models struggling to reach even 50% Win/Loss vs a random player
♟️ Instruction failures **3× higher** in non-reasoning models (71.9% vs 24.4%)
♟️ Test-time scaling for reasoning effort boosts performance up to **+20%**

Key findings for Phase 2:

📉 The best LLM we tested (o3-low) peaks at only **~758 Elo**.

While LLMs match experts in math & coding, they play chess around the average online player (~611 Elo on chess.com) and far below human masters (~2800 Elo).
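
To put the quoted ratings in perspective, here is a minimal sketch using the standard Elo expected-score formula. The thread does not say how its Elo estimates were derived, and chess.com ratings may not follow this formula exactly, so treat the numbers as a rough illustration only. The function name `expected_score` is made up for this example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B.

    The score counts a win as 1, a draw as 0.5 and a loss as 0.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in the thread (assumed here to be comparable on one scale).
llm_elo = 758       # best LLM tested (o3-low)
average_elo = 611   # average online player figure
master_elo = 2800   # human master figure

print(f"LLM vs average online player: {expected_score(llm_elo, average_elo):.2f}")
print(f"LLM vs 2800-rated player:     {expected_score(llm_elo, master_elo):.6f}")
```

Under these assumptions, a 147-point edge over the average online player corresponds to an expected score of roughly 0.70 per game, while the gap of about 2,000 points to a 2800-rated player leaves an expected score very close to zero.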

🔄LLM Chess is extensible. As models improve, we scale difficulty. No saturation, no contamination.

Check it out and let us know what you think! We are continually evaluating more models on the benchmark.

Come and see us at the FoRLM workshop at 3:00-4:15pm on Sunday December 7th, 2025 @ Upper Level Room 33ABC at NeurIPS!

📄 Paper: https://t.co/NgKlPyzP6h
🏆 Leaderboard: https://t.co/fgSKij6SQd
💻 Code: https://t.co/ngUdQGjLf0

Huge thanks to @msmxm, @SaiKolasani1, @nrcrispino, @kylepmont, @matei_zaharia, @jaredq, @Chi_Wang_, @ChenguangWang 🙏
