SWE-bench

6 months ago

SWEbench's tweet photo. https://t.co/QffIiejSP6

CLS

@ChengleiSi

6 months ago

@jyangballin @KLieret @_carlosejimenez @OfirPress how do I join SWE-bench slack John

4 months ago

Join us in SWE-bench slack if you're interested in contributing and using these new datasets! (bottom left of https://t.co/ArdUAzkdde) Expect a lot more to come in the following weeks :)

4 months ago

More SWE-bench environments, tasks, trajectories, and training recipes for all!

Kevin Li

@kevin_x_li

4 months ago

SWE-smith is going multilingual! We have expanded our task synthesis pipeline to JavaScript!

This release includes:
• 6,099 new JS tasks
• Coverage across 34 popular repos
• End-to-end Modal pipeline for fast task synthesis

Scaling agentic training data just got easier.

kevin_x_li's tweet photo. https://t.co/toMOLtYvGe

4 months ago

🚀🚀🚀

John Yang

@jyangballin

4 months ago

PyPI downloads last month
- swebench: 3.1 Million (10M Total)
- swesmith: 1.9M (2.8M Total)
- mini-swe-agent: 164k (636k Total)

We're incredibly grateful ❤️ to the worldwide SWE-* community who continue to build on our work!

New releases on all fronts coming soon

jyangballin's tweet photo. https://t.co/0rG9oMboar

6 months ago

@ChengleiSi @jyangballin @KLieret @_carlosejimenez @OfirPress 🫰 https://t.co/MOIhHmiYnu

6 months ago

Chunyang Chen @chun_yang_chen

6 months ago

SWE-bench blog site launched! Check out our content + expect more SWE-bench/agent/smith content soon!

SWEbench retweeted

John Yang

@jyangballin

7 months ago

New eval! Code duels for LMs ⚔️

Current evals test LMs on *tasks*: "fix this bug," "write a test"

But we code to achieve *goals*: maximize revenue, cut costs, win users

Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals

SWEbench retweeted

7 months ago

🏆 Glad to know that our #ASE25 paper about automated bug repair using MMLM just got the ACM SIGSOFT Distinguished Paper Award 🎉 And it is still ranked #1 on the @SWEbench Multimodal track! Thanks to Kai, Xiaofei @xfxie312, and Jian for the great work!

SWEbench retweeted

9 months ago

Congrats to @Zai_org GLM-4.5 on getting the 7th spot on our SWE-bench Verified [Bash Only] leaderboard!

w/ @KLieret @_carlosejimenez @jyangballin

OfirPress's tweet photo. https://t.co/EuL3BkFUCr

SWEbench retweeted

9 months ago

Super excited to have @anyscalecompute use mini-swe-agent for their large scale runs!

w/ @KLieret @_carlosejimenez @jyangballin

OfirPress's tweet photo. https://t.co/h7yPOgFD7A

SWEbench retweeted

9 months ago

3 out of the top 6 most downloaded datasets on @huggingface are SWE-bench related. Thanks!!! ♥️

SWEbench retweeted

carlos @_carlosejimenez

10 months ago

Recent open model scores on SWE-bench Bash Only:
🥇 Qwen3-Coder 480B/A35B Instruct - 55.40%
🥈 Kimi-K2-Instruct - 43.80%
🥉 gpt-oss-120b - 26.00%

See the full leaderboard below! 👇

SWEbench retweeted

Kilian Lieret @KLieret

10 months ago

What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵

KLieret's tweet photo. https://t.co/6mzfAUbcYn

SWEbench retweeted

Kilian Lieret @KLieret

10 months ago

Deepseek v3.1 chat scores 53.8% on SWE-bench verified with mini-SWE-agent. Tends to take more steps to solve problems than others (flattens out after some 125 steps). As a result effective cost is somewhere near GPT-5 mini. Details in 🧵

KLieret's tweet photo. https://t.co/iybEEAS076

SWEbench retweeted

10 months ago

GPT-5 gets 74.9 on SWE-bench. Wonder what the budget per task is.

SWEbench retweeted

carlos @_carlosejimenez

10 months ago

What happens if you compare LMs on SWE-bench without the fancy scaffolds?
Our new leaderboard “SWE-bench (bash only)” shows you which LMs are the best at getting the job done with just bash.
More on why this is important 👇

_carlosejimenez's tweet photo. https://t.co/IhxqjKX6Aj

SWEbench retweeted

10 months ago

Super exciting to have 3 new open-weight models that all obtain more than 60 on SWE-bench Verified! Looking forward to the results on SWE-bench Multimodal when these models obtain vision capabilities :)

OfirPress's tweet photo. https://t.co/Xh1Nlsrmzv

SWEbench retweeted

Kilian Lieret @KLieret

10 months ago

Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified!
Made for benchmarking, fine-tuning, RL, or just for use from your terminal.
It’s open source, simple to hack, and compatible with any LM! Link in 🧵

KLieret's tweet photo. https://t.co/eKPO4c269d

11 months ago

@Alibaba_Qwen Congratulations on amazing SWE-bench Verified + Multilingual performance!
