Eval Sys @EvalSysOrg - Twitter Profile

Pinned Tweet

5 days ago

MCPMark Verified is live ✅

Independently benchmarked, reproducible MCP agent performance.

And the first Verified result is in 🎉

Congrats @Kimi_Moonshot
K2.7-Code: ranked #2 at 81.1%
+8.3pts over K2.6 · Ahead of Claude Opus 4.8 max

Open-source code models are closing in 💪 https://t.co/BerMSfUkLf

Kimi.ai @Kimi_Moonshot

6 days ago

🌘 Kimi-K2.7-Code, our latest coding model, is now released and open-sourced!

🔷 Improved coding & agent performance over K2.6: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, and +31.5% on MLS Bench Lite.
🔷 Reasoning efficiency: Less overthinking, with 30% lower reasoning-token usage compared to K2.6.
🔷 Long-horizon coding: Improved instruction following, higher end-to-end coding task success rates.

⚡️ 6x High-Speed Mode coming soon!
🔌 Available today via Kimi API and Kimi Code.

🔗 Kimi Code: https://t.co/uvoSJKyGCY
🔗 API: https://t.co/EOZkbOwCN4

Eval Sys

@EvalSysOrg

6 months ago

💻 Code & Community: https://t.co/Mjm1mfK5ry 🌐 Website: https://t.co/Nft5NLKU3P 📄 Paper: https://t.co/XOV8gCANSo

Eval Sys

@EvalSysOrg

6 months ago

MCPMark Leaderboard Update 🚀

🌟 DeepSeek-V3.2-thinking jumps to the #1 spot among open-source models, and we’re honored to see MCPMark cited in the @deepseek_ai technical report.

⚡️ Gemini 3 Pro High @GoogleDeepMind now leads with the highest pass@1 and pass@4 success rates.

This update brings two newly released models onto the leaderboard: Gemini 3 and DeepSeek V3.2.

Eval Sys

@EvalSysOrg

6 months ago

Huge thanks to @m4rkmc for the contributions to the community 🙌

– Added the Gemini 3 model name
– Upgraded to LiteLLM 1.80 (with support for passing thought signatures)
– Implemented forwarding of thought_signatures, which this model specifically requires

Eval Sys

@EvalSysOrg

9 months ago

Proud to share our first research paper! MCPMark stress-tested models with MCP servers across 127 CRUD tasks × 5 MCPs × 30+ models.

Key findings:
🔸 GPT-5 leads at 52.56% pass@1
🔸 Claude-sonnet-4.5 reaches 32.1%
🔸 The 30% barrier shows MCP workflows remain challenging even for top models

Open research: paper, code, trajectory explorer, and a leaderboard expanding to agent submissions soon.

Excited for what the community builds next! 🚀

Michael Qizhe Shieh

@michaelqshieh

9 months ago

Your agent can call tools; can it close the loop?

We stress-tested MCP with 127 CRUD-heavy tasks across 5 MCPs and >30 models, using a minimal but general MCPMark-Agent for fair comparison.

📄 Paper: https://t.co/MfE5cce9r7
🌐 Website: https://t.co/uvSTQWA0Nn
💻 Code: https://t.co/iZIuvwl6LM
🤗 Daily Papers: https://t.co/HvBBz2gwbX

GPT-5 reaches 52.56% pass@1 and 33.86% pass^4, yet models widely regarded as strong, such as claude-sonnet-4 and o3, remain below 30% pass@1 and 15% pass^4. The newest Claude-sonnet-4.5 improves to 32.1% pass@1 and 16.5% pass^4, just crossing the 30% line.
The full report dives into data distributions, failure modes, and case studies (PASS vs FAIL). Plus a trajectory explorer to debug agents yourself.
👉 Our leaderboard already tracks results by model and MCP server, and will soon support agent submissions. We welcome the community to submit results!
Key insights in thread ⬇️
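
The gap between pass@1 and pass^4 in the numbers above is easier to read with the metrics written out. The tweet does not define them, so the sketch below assumes a common reading: pass@1 is the mean single-run success rate, pass@4 counts a task as solved if any of four runs succeeds, and pass^4 counts it only if all four succeed. The function names and the toy `runs` data are hypothetical and are not taken from the MCPMark codebase.

```python
from typing import Dict, List

# Assumed layout: task name -> success flags, one per independent run.
Runs = Dict[str, List[bool]]


def pass_at_1(runs: Runs) -> float:
    """Mean single-run success rate, averaged over tasks."""
    return sum(sum(flags) / len(flags) for flags in runs.values()) / len(runs)


def pass_at_k(runs: Runs, k: int) -> float:
    """Fraction of tasks solved in at least one of the first k runs."""
    return sum(any(flags[:k]) for flags in runs.values()) / len(runs)


def pass_hat_k(runs: Runs, k: int) -> float:
    """Fraction of tasks solved in every one of the first k runs (pass^k)."""
    return sum(all(flags[:k]) for flags in runs.values()) / len(runs)


# Toy data (made up for illustration): four tasks, four runs each.
runs: Runs = {
    "task_a": [True, True, True, True],      # stable success
    "task_b": [True, False, True, False],    # flaky
    "task_c": [False, False, False, False],  # never solved
    "task_d": [False, True, False, False],   # solved once
}

print(pass_at_1(runs))      # 0.4375
print(pass_at_k(runs, 4))   # 0.75
print(pass_hat_k(runs, 4))  # 0.25
```

Under these assumed definitions, pass^4 can never exceed pass@1, which in turn can never exceed pass@4, so a large drop from pass@1 to pass^4 points to run-to-run instability and not only to low capability.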

Eval Sys

@EvalSysOrg

9 months ago

Congrats on the launch of Strata! Thrilled that @Klavis_AI chose MCPMark. 🚀 MCPMark now benchmarks not only model agentic performance, but also MCP Services and frameworks. Can’t wait to see what the community builds next — and always open to partnership!

Klavis AI (YC X25)

@Klavis_AI

9 months ago

AI agents fail when given too many tools - a lesson from our work on tool use at Google Gemini. So we're launching Strata: one MCP server for AI agents to handle thousands of tools progressively. The Result? A +13% success rate boost on benchmarks & 83%+ accuracy on human eval. https://t.co/rS0UiJiVlB

Eval Sys

@EvalSysOrg

9 months ago

@Alibaba_Qwen @grok @Kimi_Moonshot We're always looking for contributors in the community! Check out more details here:

GitHub: https://t.co/cGOjwTX6jr
Website: https://t.co/Nft5NLKU3P https://t.co/1AMKROJlfk

Eval Sys

@EvalSysOrg

9 months ago

MCPMark Leaderboard Update 🚀

🌟 Qwen-3-Coder takes the #1 spot among open-source models, with an impressive per-run cost of just $36.46.

⚡️ Grok-Code-Fast-1 delivers the lowest per-run cost ($16.08) and the fastest average agent time (156.63s) across the top 10 models.

Kimi-K2-0905 outperforms Kimi2 in success rate, though at nearly double the per-run cost and average agent time.

Notably, Qwen-3-Coder achieves a success rate close to O3, but at roughly one-third the per-run cost, offering the community a highly cost-effective option for MCP tool-use applications.

This update introduces three newly released models to the leaderboard: Qwen-3-Max, Grok-Code-Fast-1, and Kimi-K2-0905.

Eval Sys

@EvalSysOrg

10 months ago

🚀 MCPMark website updated! → https://t.co/ipAZazrO1W On MCPMark, you can now dive into each task’s description, verification, and model performance. We’ve also added model trajectories to each task leaderboard, so you can clearly see how models execute step by step. Many users found this super helpful for understanding models better—some even spent the whole afternoon exploring it. 💡 Feedback is always welcome—stay tuned for more updates! 👍 Big thanks to @arvinxu95 for the brilliant UX design and feature delivery!

EvalSysOrg retweeted

Allison Zhan @AllisonXinyuan

10 months ago

We believe MCP servers are shaping the future of software. That’s why we built MCP Mark: a live benchmark for model mastery in real-world MCP use. Amazing experience collaborating with great friends & building community—EvalSys will keep driving meaningful work for the community!

EvalSysOrg retweeted

Rick Lamers

@ricklamers

10 months ago

Excellent eval, and open weight models holding ground 🙌

EvalSysOrg retweeted

LobeHub @lobehub

10 months ago

1/8) Excited to launch MCPMark (https://t.co/ImopHLfjou) today with NUS TRAIL and @EvalSysOrg! 🥳 It’s a high‑quality, program‑verifiable benchmark for MCP (Model Context Protocol) — designed to measure model's agentic capability & stability in MCP Use. Not another “lab” benchmark — this one reflects the messy reality of production. #MCP #LLM #Benchmark

EvalSysOrg retweeted

Lingjun_C @DDDDDomain

10 months ago

🚀🚀 Just launched MCPMark, a challenging MCP benchmark I participated in. Its filesystem section includes file operations, structure exploration, reasoning, and multi-skill tasks. Most models show clear room for improvement, while the GPT series excels in precise text manipulation.

EvalSysOrg retweeted

Jiawei Wang @JarvisMSUstc

10 months ago

Excited to be part of @EvalSysOrg and contribute to our debut milestone, MCPMark! Follow us for updates, and come join the team. We’re just getting started!

GitHub: https://t.co/o5oEYKtnHP
Website: https://t.co/Yui64X3Snr
Hugging Face trajectory log: https://t.co/YaXFPnsD1w

EvalSysOrg retweeted

Xiangyan Liu @dobogiyy

10 months ago

Sharing some of my thoughts when developing, hope they can help 👇

1/ Choosing the initial state defines task diversity, difficulty, and usefulness.
2/ State tracking and management is the trickiest stage. Each MCP needs its own isolation strategy. Worth it though: sandboxing lets agents CRUD freely instead of being stuck read-only.
3/ In many MCP use cases, top agents (esp. Claude Code) outperform humans. That's why we design human-agent workflows to co-create tasks 🤓.
4/ GUI may feel more human-friendly, but MCP seems like the closest path to AGI right now.
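
The isolation idea in point 2 can be illustrated for the simplest case, a filesystem task. This is only a sketch of one possible strategy, not how MCPMark implements it: the initial state is copied into a throwaway directory, the agent mutates the copy freely, a programmatic check inspects the result, and the copy is discarded. `run_in_sandbox`, `toy_agent`, and `toy_verify` are hypothetical names introduced here.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_in_sandbox(
    initial_state: Path,
    agent: Callable[[Path], None],
    verify: Callable[[Path], bool],
) -> bool:
    """Run `agent` on a disposable copy of `initial_state`, then verify it."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / "state"
        shutil.copytree(initial_state, workdir)  # fresh copy for every run
        agent(workdir)                           # free to create, update, delete
        return verify(workdir)                   # programmatic pass/fail check
    # The temporary directory is removed when the `with` block exits.


def demo() -> bool:
    """Toy task: replace notes.txt with report.txt inside the sandbox."""
    seed = Path(tempfile.mkdtemp())
    (seed / "notes.txt").write_text("draft")

    def toy_agent(root: Path) -> None:
        (root / "notes.txt").unlink()
        (root / "report.txt").write_text("done")

    def toy_verify(root: Path) -> bool:
        report = root / "report.txt"
        return report.exists() and not (root / "notes.txt").exists()

    try:
        passed = run_in_sandbox(seed, toy_agent, toy_verify)
        # The seed state must be unchanged, so the task can be rerun.
        untouched = (seed / "notes.txt").read_text() == "draft"
        return passed and untouched
    finally:
        shutil.rmtree(seed)
```

Calling `demo()` runs the toy task once. Hosted services such as GitHub or Notion would need a different isolation mechanism, which is presumably why the thread says each MCP needs its own strategy.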

EvalSysOrg retweeted

Zijian Wu

@Jaku_metsu

10 months ago

Thrilled to see MCPMark officially live! 🚀 @AnthropicAI's vision for MCP — a universal, open standard for AI integrations — is summed up perfectly by the “USB‑C port for AI” analogy: a single, reliable connector that lets LLMs access tools and data seamlessly. With MCPMark, we turned that vision into a concrete benchmark, stress‑testing models across diverse MCP workflows — including filesystem, GitHub, Notion, Playwright, and Postgres. We absolutely went all in; it was an intense journey, and I’m immensely proud of what our team has built. Here’s to pushing LLMs ever closer to becoming tool‑aware, context‑rich AI systems! ✨
