Sigmabench @sigmabench - Twitter Profile

Composer 2 is Kimi K2.5 + RL. Cursor's co-founder called not disclosing it "a miss." On Sigmabench: accuracy at finding the right files has dropped across all three Composer releases. Code quality has improved. Better code. Wrong files.

sigmabench's tweet photo. Composer 2 is Kimi K2.5 + RL. Cursor's co-founder called not disclosing it "a miss."

On Sigmabench: accuracy at finding the right files has dropped across all three Composer releases. Code quality has improved.

Better code. Wrong files. https://t.co/Huq0bHDCCK

1

2

0

65

sigmabench retweeted

Sigmabench

@sigmabench

3 months ago

@OpenAI GPT-5.4 Mini just dropped. We ran it with Codex CLI on Sigmabench. Vs full GPT-5.4: - Accuracy rank: 6th → 16th - Speed rank: 12th → 4th - 31% cheaper to run - 40% faster inference - Tier 1 consistency locked in, along with the other OpenAI frontier models. When cost + latency > max accuracy, Mini dominates. #sigmabench #GPT54Mini

0

3

2

0

134

Sigmabench

@sigmabench

3 months ago

@OpenAI GPT-5.4 Mini just dropped. We ran it with Codex CLI on Sigmabench. Vs full GPT-5.4: - Accuracy rank: 6th → 16th - Speed rank: 12th → 4th - 31% cheaper to run - 40% faster inference - Tier 1 consistency locked in, along with the other OpenAI frontier models. When cost + latency > max accuracy, Mini dominates. #sigmabench #GPT54Mini

0

3

2

0

134

Sigmabench

@sigmabench

3 months ago

Sigmabench: Codex CLI + GPT-5.4 vs GPT-5.3 Codex - 2 tiers lower on SigmaScore (worse accuracy + speed) 42% slower - 900 runs, 0 timeouts (flawless consistency, like 5.3) - But accuracy trails 5.3 Codex, Sonnet 4.6, and Opus 4.6 - 5.3 Codex 40% cheaper to run GPT-5.4 is a bigger, slower, more expensive general-purpose model, not a coding specialist. Generality hurts coding perf + cost. OpenAI pitches it as 5.3 Codex replacement. Our data says not yet for coding workflows. Note: This is not the code-optimized version. Stay tuned: New code quality eval. GPT-5.4 leads early charts #OpenAI #AICoding

0

3

1

3

554

Sigmabench

@sigmabench

4 months ago

We just ran Codex CLI + GPT-5.3 Codex. Wow! • Accuracy: Opus 4.6 / Sonnet 4.6 level • Consistency: #1 • Speed: ~2× faster than GPT-5.2 Codex • Cost: 70% less than Opus, 55% less than Sonnet Opus level accuracy, at twice the speed, and 30% of the cost! @OpenAI

sigmabench's tweet photo. We just ran Codex CLI + GPT-5.3 Codex. Wow!

• Accuracy: Opus 4.6 / Sonnet 4.6 level
• Consistency: #1
• Speed: ~2× faster than GPT-5.2 Codex
• Cost: 70% less than Opus, 55% less than Sonnet

Opus level accuracy, at twice the speed, and 30% of the cost!
@OpenAI https://t.co/SE73ZUarV4

0

3

1

0

306

Sigmabench

@sigmabench

4 months ago

We ran Gemini 3.1 Pro Preview on OpenCode. The result: performance is very close to Gemini CLI (which is not what we observed with Gemini 3 Pro) The main gap is timeouts: • Gemini CLI: 104 timeouts • OpenCode: 71 timeouts What’s interesting: • Accuracy is effectively tied • Consistency is a full tier higher on OpenCode Same model. Different harness

sigmabench's tweet photo. We ran Gemini 3.1 Pro Preview on OpenCode.

The result: performance is very close to Gemini CLI (which is not what we observed with Gemini 3 Pro)

The main gap is timeouts:
• Gemini CLI: 104 timeouts
• OpenCode: 71 timeouts

What’s interesting:
• Accuracy is effectively tied
• Consistency is a full tier higher on OpenCode

Same model. Different harness

0

1

0

177

Sigmabench

@sigmabench

4 months ago

Gemini 3.1 Pro Preview is out... and with Gemini CLI, the Sigmabench results are honestly disappointing. - Accuracy: 40.1% (Tier 3) - Consistency: 45.6% (Tier 3) - Speed: Tier 9 Lands 16th overall, below even Gemini 3 Pro in the same harness. Accuracy edges up vs 3 Pro, but speed craters hard (likely launch-day load + heavier thinking?) That said, we've seen previous Gemini 3 models leap when run through OpenCode instead of Gemini CLI. Harness makes a massive difference on Sigmabench. Next: testing 3.1 Pro across other agents (OpenCode, Cursor, etc.) to see its real ceiling. The agent still matters a lot.

sigmabench's tweet photo. Gemini 3.1 Pro Preview is out... and with Gemini CLI, the Sigmabench results are honestly disappointing.

- Accuracy: 40.1% (Tier 3)
- Consistency: 45.6% (Tier 3)
- Speed: Tier 9

Lands 16th overall, below even Gemini 3 Pro in the same harness.
Accuracy edges up vs 3 Pro, but speed craters hard (likely launch-day load + heavier thinking?)

That said, we've seen previous Gemini 3 models leap when run through OpenCode instead of Gemini CLI.

Harness makes a massive difference on Sigmabench.

Next: testing 3.1 Pro across other agents (OpenCode, Cursor, etc.) to see its real ceiling.

The agent still matters a lot.

0

2

1

0

229

Sigmabench

@sigmabench

4 months ago

It's a dead heat between Sonnet 4.6 and Opus 4.6 Speed, accuracy and consistency are all within the margin of error. What are we seeing in practice? @claudeai

0

2

0

113

Sigmabench

@sigmabench

4 months ago

Sonnet 4.6 is the most accurate model we’ve tested. Matches Opus 4.6 on performance at ⅔ the cost. Key benchmarks • Sigmabench Accuracy: 48.3% (Opus 47.6%) • SWE-bench Verified: 79.6% (Opus 80.8%) • Terminal-Bench 2.0: 59.1% (Opus 65.4%) • OSWorld-Verified: 72.5% (Opus 72.7%) #Claude #AICoding

sigmabench's tweet photo. Sonnet 4.6 is the most accurate model we’ve tested.
Matches Opus 4.6 on performance at ⅔ the cost.

Key benchmarks
• Sigmabench Accuracy: 48.3% (Opus 47.6%)
• SWE-bench Verified: 79.6% (Opus 80.8%)
• Terminal-Bench 2.0: 59.1% (Opus 65.4%)
• OSWorld-Verified: 72.5% (Opus 72.7%)

#Claude #AICoding

0

4

0

304

Sigmabench

@sigmabench

4 months ago

Claude Sonnet 4.6 just dropped. Not sure why we need a less capable model than Opus, unless it is MUCH cheaper and faster. Sigmabench results will be live later today, evaluating the Claude CLI + Sonnet 4.6. Follow for the update.

Claude

@claudeai

4 months ago

This is Claude Sonnet 4.6: our most capable Sonnet model yet. It’s a full upgrade across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. It also features a 1M token context window in beta.

1K

22K

2K

5K

8M

0

3

1

135

Sigmabench

@sigmabench

4 months ago

Open-source battle Kimi K2.5 v MiniMax M2.5 #MiniMaxM25 #MiniMaxM2_5

1

2

0

237

Sigmabench

@sigmabench

4 months ago

@cline With the agent harness being responsible for a lot of the performance, we benchmark agents + models together. Worked with OpenCode extensively. If you are interested in privately evaluating Cline with any model, please reach out.

0

114

Sigmabench

@sigmabench

4 months ago

Latest release: OpenCode + MiniMax M2.5 - Enters the Sigmabench leaderboard #16 overall - Comparable performance to Codex CLI + GPT-5.1 Codex Mini - #17 for Accuracy - #17 for Consistency - #13 for Speed From the 900 runs, we had 140 timeouts and 2 failures Follow for the direct comparison with Kimi K2.5 later today. #SWEbench #Sigmabench #MiniMaxM25 #MiniMaxM2_5

0

1

0

215

Sigmabench

@sigmabench

4 months ago

@MiniMax_AI @opencode @thdxr @fanjiewang Thanks for the quick response. We are rerunning the benchmark now. Stand by. Follow for today's update

0

2

0

899

Sigmabench

@sigmabench

4 months ago

@cline @MiniMax_AI We will post the benchmark for OpenCode + MiniMax M2.5 today. How does this open-source combo compare on Accuracy, Consistency, Speed and Cost? Follow for updates

0

1

0

87

Sigmabench

@sigmabench

4 months ago

@askOkara We will post the benchmark for OpenCode + MiniMax M2.5 today. How does this open-source combo compare on Accuracy, Consistency, Speed and Cost? Follow for updates

0

2

0

1

2K

Sigmabench

@sigmabench

4 months ago

@cgtwts We will post the benchmark for OpenCode + MiniMax M2.5 today. How does this open-source combo compare on Accuracy, Consistency, Speed and Cost? Follow for updates

0

2

0

1

961

Sigmabench

@sigmabench

4 months ago

@bindureddy Sigmabench will post the benchmark for OpenCode + MiniMax M2.5 today. How does this open-source agent + model combo compare on Accuracy, Consistency, Speed and Cost? Follow for today's update

0

114

Sigmabench

@sigmabench

4 months ago

@MiniMax_AI We will post the benchmark for OpenCode + MiniMax M2.5 today. How does this open-source agent + model combo compare on Accuracy, Consistency, Speed and Cost? Follow for today's update

1

12

0

4

4K

Sigmabench

@sigmabench

Last Seen Users on Sotwe

Trends for you

Most Popular Users