Justus Mattern

Verified account

@MatternJustus

Co-Founder @ProximalHQ | prev. research @PrimeIntellect, @MPI_IS and built revideo

San Francisco, CA

Joined March 2021

836 Following

7.9K Followers

1.2K Posts

Pinned Tweet

about 2 months ago

Introducing FrontierSWE, an ultra-long horizon coding benchmark. We test agents on some of the hardest technical tasks like optimizing a video rendering library or training a model to predict the quantum properties of molecules. Despite having 20 hours, they rarely succeed

https://t.co/xbqHJRZiPZ

78

1K

140

522

267K

7 days ago

@aryaman2020 Incredibly cool idea

0

0

0

1

2K

MatternJustus retweeted

7 days ago

new paper 🫡 we made serving many different finetunes surprisingly efficient by just… not intervening at decode steps!

3

59

3

55

14K

7 days ago

Opus 4.8 fixes all the issues we observed with previous generations of Opus models. It is much more token-efficient, better calibrated and it attempts to cheat much less than previous generations. Very impressive release!

Proximal @ProximalHQ

7 days ago

We evaluated Claude Opus 4.8 on FrontierSWE ahead of today's release. It is now the best-performing model on FrontierSWE.

https://t.co/FKzHGibG7Y

6

127

13

10

19K

6

105

7

6

7K

MatternJustus retweeted

7 days ago

Anthropic says Opus 4.8 ranks 1st on FrontierSWE

https://t.co/8sWK8Yq3xO

5

112

3

7

38K

9 days ago

Composer 2.5 outperforms all open source models and clearly beats its base model Kimi 2.5 as well as Kimi 2.6. It is roughly on par with Gemini 3.1 Pro, with a slight edge. We still see a large gap between models from Anthropic / OpenAI and other labs.

Proximal @ProximalHQ

9 days ago

Composer 2.5 is ranked #5 on FrontierSWE.

The model is broadly on par with Gemini 3.1 Pro, with a slight edge in our evaluation, and it beats all open source models. We still observe a significant performance gap between Composer and models from Anthropic and OpenAI.

https://t.co/ZqiYoPr67p

10

236

13

42

36K

5

144

3

22

14K

11 days ago

@rronak_ Congrats!

1

1

0

0

1K

13 days ago

@turtlesoupy congrats Thomas! Excited to see what's next

0

3

0

0

1K

19 days ago

This went surprisingly well for our first event - heard great talks and had very interesting conversations about post-training and evals! A special thanks to our speakers @jyangballin, @rawsh0, @rishiiyer01 and @evan_j_chu, and looking forward to the next one :)

https://t.co/8Baa3955Cj

24 days ago

Hosting a research meetup in our North Beach office on Thursday! Come by for food, drinks and talks:

@jyangballin (MSL) will present ProgramBench

@rawsh0 & @rishiiyer01 (Zyphra) will talk about ZAYA-8B

@evan_j_chu and I will speak about FrontierSWE and our research bets!

https://t.co/V0W7uI2ial

6

146

10

51

35K

3

85

5

9

8K

23 days ago

@minney_cat thanks Minn!

0

1

0

0

243

24 days ago

Sign up here! https://t.co/JOwcyRTiwt

0

9

0

1

2K

24 days ago

Hosting a research meetup in our North Beach office on Thursday! Come by for food, drinks and talks:

@jyangballin (MSL) will present ProgramBench

@rawsh0 & @rishiiyer01 (Zyphra) will talk about ZAYA-8B

@evan_j_chu and I will speak about FrontierSWE and our research bets!

https://t.co/V0W7uI2ial

6

146

10

51

35K

26 days ago

People from top universities are great on average but nothing gets me more excited than talking to someone who went to a no-name uni (possibly in another country) and ended up at an org with a very high bar

25

2K

77

329

76K

26 days ago

This is a great opportunity for engineers or students who want to get into research and contribute to a high-impact publication! Check out the application form below: https://t.co/zfVW1pbeRE

1

14

1

34

2K

26 days ago

We are hiring research fellows to help us improve FrontierSWE! If you want to help build the hardest real-world coding benchmark, reach out! Fellows can work with us for anywhere from a few weeks to months and will be supported with compute and a generous stipend. https://t.co/KL5va5ydAe

about 2 months ago

Introducing FrontierSWE, an ultra-long horizon coding benchmark. We test agents on some of the hardest technical tasks like optimizing a video rendering library or training a model to predict the quantum properties of molecules. Despite having 20 hours, they rarely succeed

https://t.co/xbqHJRZiPZ

78

1K

140

522

267K

6

309

19

194

41K

28 days ago

@samsja19 yes that was absolutely heinous

0

2

0

0

263

28 days ago

Incredible to see how far prime-rl has come! Initially, decoupling training and inference as much as possible and making asynchronous RL a first-class citizen was done out of necessity to support decentralized training. Later, it turned out that these happened to be exactly the right design choices for agentic RL with extremely long rollouts. Really cool work from @PrimeIntellect and @RampLabs!

28 days ago

https://t.co/2nYGsxMltD

19

594

51

1K

383K

5

123

4

51

12K

29 days ago

@rawsh0 @rishiiyer01 congrats! Super impressive

1

2

0

0

128

29 days ago

Really interesting work! It's super impressive how many of the agentic coding research artifacts (evals, datasets, harnesses, etc.) the community relies on come from @jyangballin, @OfirPress, @KLieret, @18jeffreyma et al.!

about 1 month ago

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

https://t.co/8ayeDJLXaJ

103

2K

246

656

723K

5

49

1

10

6K

MatternJustus retweeted

@brendanigraham

30 days ago

@ProximalHQ @sama oai is mogging

0

9

1

0

1K

30 days ago

GPT-5.5 is an extremely good model! It outperforms all other models by a wide margin

Proximal @ProximalHQ

30 days ago

GPT-5.5 is the best-performing model on FrontierSWE. The model substantially outperforms Opus 4.7 in both mean@5 and best@5 rankings while working faster.

https://t.co/NznJoDnuNg

5

356

34

52

30K

2

55

0

6

3K
