Naman Jain @StringChaos - Twitter Profile

Pinned Tweet

3 months ago

New post: how we do evals at @cursor_ai. Takeaways: 1. Online metrics from real Cursor requests provide construct validity 2. CursorBench: a dynamic offline suite distilled from online learnings 3. Multi-axes evals -- correctness, efficiency, agent interaction behavior

Cursor @cursor_ai

3 months ago

We're sharing a new method for scoring models on agentic coding tasks.

Here's how models in Cursor compare on intelligence and efficiency: https://t.co/VItnifMh55

Naman Jain

@StringChaos

6 days ago

@novasarc01 That's why we built CursorBench! https://t.co/ClmF0ZhqnO

StringChaos retweeted

elie

@eliebakouch

15 days ago

correlation between CursorBench and Artificial Analysis reported scores

benchmarks like IFBench or tau2 show ~0 correlation with CursorBench. opus 4.7 (max effort) performs relatively better on CursorBench than on other benchmarks, gpt 5.5 shows the opposite pattern https://t.co/5Z8yA8ZNsW

StringChaos retweeted

Michael Truell

@mntruell

15 days ago

Gemini Flash 3.5 is now on CursorBench, our main coding agent eval. We’ll keep updating the leaderboard as new models come out. https://t.co/67u5JEXoM9

Naman Jain

@StringChaos

16 days ago

Check out Composer 2.5, our new model pushing the Pareto frontier

Cursor @cursor_ai

17 days ago

Composer 2.5 is exceptionally intelligent and up to 10x more efficient than similarly capable models.

StringChaos retweeted

Hao Wang

@MogicianTony

about 2 months ago

SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits.

Our agent scored 100% on both. It solved 0 tasks.

Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵

StringChaos retweeted

Cursor @cursor_ai

2 months ago

Earlier this week, we published our technical report on Composer 2.

We're sharing additional research on how we train new checkpoints. With real-time RL, we can ship improved versions of the model every five hours. https://t.co/f75l7Qa4fr

Naman Jain

@StringChaos

2 months ago

@koushik77 For the first, probably almost no model considers the esbuild transpiler issue! For the second, agents can actually tune approximate algorithms quite well.

Naman Jain

@StringChaos

2 months ago

Check out the tech report detailing our continued pre-training and RL setup behind Composer 2! Also sharing some example CursorBench problems by popular demand https://t.co/Ki9dDLcFX7

Cursor @cursor_ai

2 months ago

We're releasing a technical report describing how Composer 2 was trained.

StringChaos retweeted

Sasha Rush

@srush_nlp

2 months ago

It's really neat to see all the interest in the Composer 2 technical report, from training to kernel design to inference. If you have any questions about why we did things, feel free to ask. I'll run around the office and bug people.

Naman Jain

@StringChaos

3 months ago

Excited to share Composer-2 with everyone. It has come a long way since Composer-1, still lots more to go! Hope you like it!

Cursor @cursor_ai

3 months ago

Composer 2 is now available in Cursor.

StringChaos retweeted

Cursor @cursor_ai

3 months ago

We trained Composer to self-summarize through RL instead of a prompt.

This reduces the error from compaction by 50% and allows Composer to succeed on challenging coding tasks requiring hundreds of actions. https://t.co/ryfalZHLZS

Naman Jain

@StringChaos

3 months ago

Check out full post at: https://t.co/ClmF0ZhqnO

Naman Jain

@StringChaos

3 months ago

Lots more details in the post: 1. Pareto frontier across different metrics 2. How CursorBench has shifted as agent capabilities changed 3. CursorBench vs public evals: what’s missing and future work directions 4. CursorBench vs online: how online metrics shape offline evals

StringChaos retweeted

Manish Shetty

@slimshetty_

3 months ago

GSO Update.

gpt-5.4 (xhigh) scores 31.4%

with reasoning_effort=high, gpt-5.4 slightly lower than gpt-5.2. a quick thought on why below: https://t.co/SMSx7Ne39v

StringChaos retweeted

Manish Shetty

@slimshetty_

4 months ago

https://t.co/iXXbCssFXY

StringChaos retweeted

Cursor @cursor_ai

4 months ago

Long-running agents are now available at https://t.co/3PT8c7azU3 for Ultra, Teams, and Enterprise plans.

With our new harness, agents can complete much larger tasks.

https://t.co/7p57WeR04t https://t.co/pGePEFRPTT

StringChaos retweeted

Cursor @cursor_ai

4 months ago

Composer 1.5 is now available. We’ve found it to strike a strong balance between intelligence and speed.

StringChaos retweeted

Michael Truell

@mntruell

5 months ago

We built a browser with GPT-5.2 in Cursor. It ran uninterrupted for one week.

It's 3M+ lines of code across thousands of files. The rendering engine is from-scratch in Rust with HTML parsing, CSS cascade, layout, text shaping, paint, and a custom JS VM.

It *kind of* works! It still has issues and is of course very far from Webkit/Chromium parity, but we were astonished that simple websites render quickly and largely correctly.
