nate

12 days ago

https://t.co/dxbOh1QXcZ now includes steps and output tokens as well! These are additional signals our team uses to eval models.

12 days ago

Claude Fable 5 is now available in Cursor. It sets a new state of the art on CursorBench at 72.9%, 8 points above the previous best. https://t.co/L3Wm8mSYq9

12 days ago

this was an exciting model to eval!

12 days ago

See how Claude Fable 5 compares across every model: https://t.co/61FZktIGzz

_nateschmidt retweeted

Charlie Holtz

@charlieholtz

13 days ago

We've added a new harness! Cursor Composer 2.5 is live in Conductor. It's fast, precise, and cost-efficient. And when I say fast I mean _really_ fast. Excited to hear your thoughts!

17 days ago

@JoshuaPachter @jediahkatz no

_nateschmidt retweeted

17 days ago

With canvases, Cursor can create apps like dashboards, reports, and internal tools. Now you can publish a canvas and share it with your team via URL.

17 days ago

https://t.co/z74iIHebKu

eric zakariasson

@ericzakariasson

17 days ago

introducing cursor profiles! go claim your handle at https://t.co/6t5lg2jqvg

18 days ago

@dwarkesh_sp @srush_nlp !!!

_nateschmidt retweeted

Lee Robinson

@leerob

19 days ago

Quick rant on AI model benchmarks:

- Some of the most popular ones are no longer helpful (SWE-bench¹)
- It can be very hard to reproduce reported results (so lots of variance)
- Take them with a grain of salt, look at the average across many

We need some creative new ideas for AI model marketing. Supportive of a Survivor spin-off (who is the AI Jeff Probst!?).

I get why every model release shows benchmark scores as the headline. It's actually pretty hard to describe how a model has improved without it sounding like fluff. And also it sounds boring to say the same thing over and over ("it's better at following instructions" repeat x10). Benchmarks make it very clear there is a number, which likely started bad, and is now going up. Yay!

The reality is that benchmarks are most useful to those *training* the model so they know where to improve. Model labs use these benchmarks to measure progress, which is why having non-saturated benchmarks is extremely helpful. If you see models getting 90% on an eval, it's probably time to make a harder version.

I do think there's a word of caution for everyone interpreting benchmarks. It's very hard to get exactly the same scores, which is why some benches show error bars and do the average over multiple runs. But even further, the hardware and GPUs the evals are running on really matter! Small differences there, or minor tweaks to the prompt, can swing scores by multiple percentage points².

All of that to say, it's important to look at many different benchmarks, and then actually use the model to make your own opinion. For example, there's recently been a lot of debate on here about Opus 4.8 not benchmarking as well as other models. But personally I've found the model really good from my own usage. Your mileage may vary!

There aren't many high-quality public benchmarks that measure things like the UX of the model responses, the style of the messages, the warmth or directness of the "personality". These things matter *a lot* for the day-to-day usage. How the model performs in the real world is often different from very specific benches.

In summary, benchmarks matter but they are not a substitute for extensively testing the model yourself with real work.

¹: https://t.co/Zs3R7Ep2d6
²: https://t.co/58dvc78FDo
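
The point about error bars and averaging over multiple runs can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the scores are made-up numbers, and `summarize_runs` and `is_clear_gap` are hypothetical helpers, not part of any benchmark harness mentioned in the thread. The rule of thumb (a gap larger than roughly two combined standard errors) is only a rough heuristic, not a formal significance test.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, standard error) for repeated runs of one benchmark."""
    mean = statistics.mean(scores)
    # The sample standard deviation needs at least two runs.
    if len(scores) < 2:
        return mean, float("nan")
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


def is_clear_gap(runs_a, runs_b, z=2.0):
    """Rough check: do the means differ by more than z combined standard errors?"""
    mean_a, se_a = summarize_runs(runs_a)
    mean_b, se_b = summarize_runs(runs_b)
    combined = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(mean_a - mean_b) > z * combined


# Hypothetical scores (percent) from five runs of the same eval per model.
model_a = [61.0, 63.5, 59.8, 62.2, 60.5]
model_b = [62.0, 60.1, 64.0, 61.3, 62.6]

if __name__ == "__main__":
    for name, runs in (("model_a", model_a), ("model_b", model_b)):
        mean, stderr = summarize_runs(runs)
        print(f"{name}: {mean:.1f} +/- {stderr:.2f}")
    print("clear gap:", is_clear_gap(model_a, model_b))
```

With run-to-run swings of a few points, a sub-point difference between two averages sits inside the noise, which is the reason to report several runs rather than a single headline number.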

22 days ago

@eliebakouch very nice

29 days ago

@RayFernando1337 @mikeysee @convex we’re gonna keep improving rules / skills adherence - any interesting examples come to mind @RayFernando1337 ?

_nateschmidt retweeted

about 1 month ago

With the Cursor SDK, you can build your own agents with Composer 2.5. It's now available in Python and TypeScript. This long weekend, Composer usage is 90% off in the SDK. We're excited to see what you build!

about 1 month ago

@_ChrisCovington @icanvardar tysm

about 1 month ago

@_ChrisCovington @icanvardar tell me more, we’ve made recent cli improvements and will continue to do so

about 1 month ago

@EnTr0pY_88 @elonmusk Composer lives in the Cursor agent, but you can call it via our SDK https://t.co/0c4vpLZqFO

_nateschmidt retweeted

Michael Truell

@mntruell

about 1 month ago

Gemini Flash 3.5 is now on CursorBench, our main coding agent eval. We’ll keep updating the leaderboard as new models come out. https://t.co/67u5JEXoM9

_nateschmidt retweeted

Michael Truell

@mntruell

about 1 month ago

Composer 2.5 is now the most-chosen model in Cursor. We're giving everyone 10x usage for the rest of the day. Enjoy!

about 1 month ago

Composer is now more resourceful! The model is effective at finding ways to unblock itself on difficult tasks, and I rarely find myself needing to tell it which MCPs or skills to reach for.