Scott Wu @scottwu46 - Twitter Profile

Pinned Tweet

18 days ago

SWE-Bench style grading has been the standard for years now - you ask the agent to solve an issue and then run its code on a pre-constructed unit test. The problem is that passing a unit test is only one part of writing production-ready code. You also want to evaluate agents on a number of other axes, including scope, coding style, and unintended side effects. The result is our new benchmark FrontierCode - which has ~80% fewer false positives and for which the best model (Opus 4.8) only scores 13%! "Where others grade like a CI, FrontierCode grades like a tech lead."

Cognition @cognition

18 days ago

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers. Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

cognition's tweet photo. Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.

Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

243

4K

317

2K

3M

42

626

41

131

88K

ScottWu46 retweeted

Engram

@EngramLab

3 days ago

https://t.co/CGIef5lIBI

58

435

66

169

162K

Scott Wu

@ScottWu46

3 days ago

@EngramLab Congrats guys!

0

2

0

104

ScottWu46 retweeted

Sahin Olut

@sahinolut

4 days ago

first day of work at @cognition excited to build the future here

17

114

5

6

15K

Who to follow

Zihan Tan

@zihantan09

Assistant Professor @UMNComputerSci. Research interest: theoretical computer science.

Ruizhe Zhang

@zrz96zrz

Assistant Professor at Purdue University

ScottWu46 retweeted

6 days ago

In 1958 Ian Donald published what is now the foundational paper on medical ultrasound for obstetrics. He was so widely ridiculed by his colleagues at the time that they nicknamed him Mad Donald [1], and one said ultrasound would be useful only to "a gynecologist who was blind and had lost the use of both hands" [2]. Another noted that he'd invented a £10,000 device to undertake a task that could be accomplished with a £0.02 rubber glove. [3] Last week, my wife and I welcomed our first child into the world. She had a rare pregnancy complication that until recently would have meant only a 28% intact survival rate for our newborn. But in 2013 US guidance was updated to add regular preventative screening for her condition at the 20-week ultrasound, and with early detection the survival rate is ~99%. (In the UK, preventative screening is still not recommended, for reasons like "it is not known how accurate screening tests are" [4].) The entire history of radiology is people expressing skepticism about the work done by innovators. I for one am grateful for folks like @DavidSHolz building new classes of devices that can help us see things in new ways, and I'll be rooting for their success. Hand in hand with my wife and our healthy baby boy.

russelljkaplan's tweet photo. In 1958 Ian Donald published what is now the foundational paper on medical ultrasound for obstetrics. He was so widely ridiculed by his colleagues at the time that they nicknamed him Mad Donald [1], and one said ultrasound would be useful only to "a gynecologist who was blind and had lost the use of both hands" [2]. Another noted that he'd invented a £10,000 device to undertake a task that could be accomplished with a £0.02 rubber glove. [3]

Last week, my wife and I welcomed our first child into the world. She had a rare pregnancy complication that until recently would have meant only a 28% intact survival rate for our newborn. But in 2013 US guidance was updated to add regular preventative screening for her condition at the 20-week ultrasound, and with early detection the survival rate is ~99%. (In the UK, preventative screening is still not recommended, for reasons like "it is not known how accurate screening tests are" [4].)

The entire history of radiology is people expressing skepticism about the work done by innovators. I for one am grateful for folks like @DavidSHolz building new classes of devices that can help us see things in new ways, and I'll be rooting for their success. Hand in hand with my wife and our healthy baby boy.

101

1K

49

110

82K

Scott Wu

@ScottWu46

14 days ago

huge shoutout to @ryanbai who spearheaded our AI Productivity Guarantee work!

Ryan Bai

@ryanbai

14 days ago

https://t.co/DG9viAfhVN

14

168

22

186

128K

3

76

4

27

15K

ScottWu46 retweeted

Cognition @cognition

15 days ago

Congratulations to @axliu42 on receiving the ACM Doctoral Dissertation Award for his work on learning-theoretic foundations for understanding quantum systems. Allen joined Cognition's research team this week while continuing as a professor at NYU. We’re honored and excited to work with him. https://t.co/W1u9dAefnF

6

110

10

11

27K

Scott Wu

@ScottWu46

16 days ago

we are going to need up our own human orchestration as much as we're upping our agent orchestration:

Walden

@walden_yan

16 days ago

My take 24 hours after Fable 5: Your organization will likely not scale with the exponential curve of AI. I'l just come out to say: This should be a wakeup call for engineering teams. Set up your cloud software factories. Now. Models can now fix impossible bugs, UI-test the hardest flows, writing extremely good code, etc. I have't opened Datadog manually as far as I can remember. AI should be the first-line defense for bugs and feedback. Humans should only look at PRs after an AI has already reviewed it. AI should generate screen recordings of any PR before a human eye even reaches it. The agent should just prompt itself most of the time. Ex. (pictured) our ui feedback channel manages itself, creates tickets, assigns itself automatically You might also be worried about cost. Anthropic, OpenAI, and other labs will likely continue to put out bigger and more expensive models. But, we will also continue to get more capable small models. Not everything will need the smartest models. It's about having the organizational harness in place to continue taking advantage of this rising tide. Moreover, if you use Devin, we've already optimized our harness a bit, and Fable is actually only ~40% more expensive in practice (vs the 2x people assume). I'm honestly pleasantly surprised - it might be higher ROI than you think. Anyway, if you take anything away, engineers shouldn't be manually picking up tickets, humans shouldn't be digging into logs themselves, rethink what you do with your time that shouldn't just be an AI. We need to rethink what humans spend their time going.

walden_yan's tweet photo. My take 24 hours after Fable 5:

Your organization will likely not scale with the exponential curve of AI.

I'l just come out to say: This should be a wakeup call for engineering teams.

Set up your cloud software factories. Now.

Models can now fix impossible bugs, UI-test the hardest flows, writing extremely good code, etc. I have't opened Datadog manually as far as I can remember.

AI should be the first-line defense for bugs and feedback. Humans should only look at PRs after an AI has already reviewed it. AI should generate screen recordings of any PR before a human eye even reaches it. The agent should just prompt itself most of the time.

Ex. (pictured) our ui feedback channel manages itself, creates tickets, assigns itself automatically

You might also be worried about cost. Anthropic, OpenAI, and other labs will likely continue to put out bigger and more expensive models. But, we will also continue to get more capable small models. Not everything will need the smartest models. It's about having the organizational harness in place to continue taking advantage of this rising tide.

Moreover, if you use Devin, we've already optimized our harness a bit, and Fable is actually only ~40% more expensive in practice (vs the 2x people assume). I'm honestly pleasantly surprised - it might be higher ROI than you think.

Anyway, if you take anything away, engineers shouldn't be manually picking up tickets, humans shouldn't be digging into logs themselves, rethink what you do with your time that shouldn't just be an AI. We need to rethink what humans spend their time going.

42

926

57

796

215K

6

163

6

64

37K

Scott Wu

@ScottWu46

17 days ago

Really thoughtful piece by @saranormous !

sarah guo

@saranormous

17 days ago

https://t.co/Hw02laH9yp

100

2K

197

5K

1M

4

371

7

552

154K

Scott Wu

@ScottWu46

17 days ago

A new top scorer just one day after our benchmark released! Especially strong on the hardest tasks: 13.4% -> 29.3% on FrontierCode Diamond compared to Opus 4.8.

Cognition @cognition

17 days ago

Claude Fable 5 is now available in Devin. Fable 5 earns the #1 spot on FrontierCode, our benchmark for real-world engineering tasks that grades mergeability and quality:

cognition's tweet photo. Claude Fable 5 is now available in Devin.

Fable 5 earns the #1 spot on FrontierCode, our benchmark for real-world engineering tasks that grades mergeability and quality:

43

1K

101

179

313K

11

223

10

13

22K

Scott Wu

@ScottWu46

18 days ago

@NickADobos I hope that it gets solved in the next 6 months and then we can move on to even more challenging tasks!

0

53

0

34K

Scott Wu

@ScottWu46

18 days ago

SWE-Bench style grading has been the standard for years now - you ask the agent to solve an issue and then run its code on a pre-constructed unit test. The problem is that passing a unit test is only one part of writing production-ready code. You also want to evaluate agents on a number of other axes, including scope, coding style, and unintended side effects. The result is our new benchmark FrontierCode - which has ~80% fewer false positives and for which the best model (Opus 4.8) only scores 13%! "Where others grade like a CI, FrontierCode grades like a tech lead."

Cognition @cognition

18 days ago

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers. Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

243

4K

317

2K

3M

42

626

41

131

88K

Scott Wu

@ScottWu46

22 days ago

Agree - the thing is that for agent work folks often don't know how to measure the productivity they're getting even within an order of magnitude! So there will be a point of diminishing returns on measurement eventually but we're not yet too close to the 2-5% micromanaging territory.

0

3

0

1K

Scott Wu

@ScottWu46

22 days ago

Measuring someone's productivity by their token usage is a horrible idea. Giving everyone the same fixed token budget isn't much better. So what's the right way to roll out AI across your org? We built a system to measure how many productive engineering hours every Devin task is worth, validated against a dataset of real engineers’ times estimates. The goal is to answer the fundamental question that companies are grappling with: how much real value are you getting from each of your agent sessions? On top of that, we're giving an AI productivity guarantee! Now if Devin delivers less engineering value than you're paying for, we fund your usage until it does. The whole industry needs to move from measuring activity to measuring output. We hope to see more AI companies taking this approach.

Cognition @cognition

22 days ago

AI should earn its keep. Introducing the AI Productivity Guarantee. If Devin delivers less engineering value than you’re paying for, Cognition will fund your usage until it does, up to $10 million. It’s time for the AI industry to stop maximizing tokens and start maximizing productive output.

cognition's tweet photo. AI should earn its keep. Introducing the AI Productivity Guarantee.

If Devin delivers less engineering value than you’re paying for, Cognition will fund your usage until it does, up to $10 million.

It’s time for the AI industry to stop maximizing tokens and start maximizing productive output.

73

1K

100

434

432K

59

895

58

320

161K

Scott Wu

@ScottWu46

23 days ago

Devin pushing the frontier on ECDSA circuits!

gautham

@bbuddha_xyz

23 days ago

Looks like @DevinAI the best research agent out there? Nikhil from Devin is moving the frontier forward by reducing our the number of qubits our circuit requires.

bbuddha_xyz's tweet photo. Looks like @DevinAI the best research agent out there?

Nikhil from Devin is moving the frontier forward by reducing our the number of qubits our circuit requires. https://t.co/Y5GhaMqQEC

8

61

6

13

32K

5

122

9

11

17K

ScottWu46 retweeted

Cognition @cognition

23 days ago

At @harvey, the engineering team integrated Spectre — their internal background agent — into Devin Desktop. Now Spectre's organizational context can live on every engineer's laptop and flow across their favorite agents.

cognition's tweet photo. At @harvey, the engineering team integrated Spectre — their internal background agent — into Devin Desktop.

Now Spectre's organizational context can live on every engineer's laptop and flow across their favorite agents. https://t.co/G2OqkM9Ncj

14

158

23

44

38K

Scott Wu

@ScottWu46

23 days ago

Excited to partner with @Carahsoft! An important step for more of the public sector to have access to the same tools and technology that the private sector is adopting en masse today.

Carahsoft

@Carahsoft

24 days ago

We have partnered with @Cognition to bring their AI coding platform, Devin, to the Public Sector. Together, we’re helping organizations modernize software development and accelerate secure AI adoption. Learn more: https://t.co/f2AQbyDnqh

Carahsoft's tweet photo. We have partnered with @Cognition to bring their AI coding platform, Devin, to the Public Sector. Together, we’re helping organizations modernize software development and accelerate secure AI adoption. Learn more: https://t.co/f2AQbyDnqh https://t.co/CCtmOvu1FA

4

42

2

3

18K

1

70

1

9

8K

Scott Wu

@ScottWu46

24 days ago

Standalone IDEs have about 6 months left to live. An interface for manually editing and refactoring doesn’t need to exist if you're not manually editing and refactoring anymore. So what's the right interface for a dev to be working in for 8h / day? Some parts are obvious: you want to be able to spin up agents (either local or cloud agents) and to have a clean interface to keep up with all of your parallel running agents. Then you want to be able to get into the weeds whenever needed for last-mile fixes and review. But as software engineering continues to evolve we will see more and more of the lifecycle get reinvented. How do you build a single surface that allows you to plan, spec, prototype, debug, review, QA? Bringing Devin and Windsurf together has been our vision ever since the acquisition. Devin Desktop is our first shot at what this looks like. Excited to make this a reality today!

Devin Desktop

@devindesktop

24 days ago

Introducing Devin Desktop: the next generation of Windsurf Manage fleets of local and cloud agents from one surface Support for any ACP-compatible agent With a full IDE for when you need to jump into the code

105

953

110

352

498K

69

812

50

444

249K

ScottWu46 retweeted

Devin Desktop

@devindesktop

25 days ago

This one doesn't fit in a wave either! See you tomorrow

28

232

9

20

39K

Scott Wu

@ScottWu46

28 days ago

from the goat himself

Ido Pesok

@ido_pesok

28 days ago

https://t.co/QYMZzNRa0N

20

444

33

916

261K

7

495

31

622

124K

ScottWu46 retweeted

Ryan Carson

@ryancarson

30 days ago

Word on the street is that everyone is going to be switching back to Opus when the new model drops. This is exactly why I use an independent agent lab like Devin for my main software factory. They're going to deal with that headache for me. There's no way you can move fast and reliably if you're constantly switching between Claude Code, Codex and insert-other-shiny-object-here.

59

206

18

50

34K

Scott Wu

@ScottWu46

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users