Charley Lee @charleyslee - Twitter Profile

Charley Lee

@charleyslee

5 days ago

A very impressive open model

Datacurve @datacurve

5 days ago

GLM 5.2 is now on DeepSWE as the top open-source model on our leaderboard. With a pass@1 score of 44% at max effort, GLM 5.2 is the indisputable #1 open-source model, besting Kimi K2.7 Code by 17%.

charleyslee retweeted

Datacurve @datacurve

7 days ago

Claude Fable 5 debuts at #1 on DeepSWE. It outscores the previous best by 3% and sets a new state-of-the-art on our long-horizon coding benchmark.

charleyslee retweeted

Artificial Analysis

@ArtificialAnlys

13 days ago

We've updated the Artificial Analysis Coding Agent Index, replacing SWE-Bench Pro with Datacurve's DeepSWE benchmark - the swap lifts Codex with GPT-5.5 (xhigh) above Claude Code with Opus 4.8 (max), while the newly released Claude Fable 5 (max) in Claude Code debuts at the top.

DeepSWE, built by @datacurve, writes its tasks from scratch rather than adapting them from public GitHub issues or pull requests, so no model has seen the solutions during training. That matters because SWE-Bench Pro, the benchmark it replaces in our Coding Agent Index, had grown gameable, with some models recovering the fix from the repository's commit history instead of solving the task.

The swap reorders the index: Codex with GPT-5.5 (xhigh) rises from 65 to 76, overtaking Claude Code with Opus 4.8 (max) at 73. Claude Code with Fable 5 (max), which enters directly on the refreshed index, leads at 77. SWE-Bench Pro had been flattering some combinations and penalizing others.

More below.

Charley Lee

@charleyslee

16 days ago

@elithrar Looks super interesting for our use case, dropped you a dm :)

Charley Lee

@charleyslee

24 days ago

@chribjel @MatthewBerman @datacurve 👀

charleyslee retweeted

Theo - t3.gg

@theo

24 days ago

swe-bench is kind of a shitshow, and it makes evaluating LLMs hard. DeepSWE is the first agentic code bench that makes sense.

Charley Lee

@charleyslee

26 days ago

Our results are out for Opus 4.8 on DeepSWE – better overall performance while being more efficient than Opus 4.7

Datacurve @datacurve

26 days ago

Opus 4.8 is now on DeepSWE. On the default high thinking effort, it scores 6% higher than Opus 4.7 xhigh, while also lowering average cost per task.

charleyslee retweeted

Theo - t3.gg

@theo

30 days ago

This is the first code bench that actually aligns with how it feels to use these models coding.

Charley Lee

@charleyslee

30 days ago

Excited to launch this as a measure of performance on truly uncontaminated SWE tasks

Serena Ge (Datacurve)

@serenaa_ge

about 1 month ago

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks.

On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work. https://t.co/HCDcjNuTFK

Charley Lee

@charleyslee

about 1 month ago

So many coding agent interfaces treat great code review UX as an afterthought (understandable for CLIs, but even GUIs??) when it’s such a critical part of the process

Charley Lee

@charleyslee

about 1 month ago

@bentlegen Being able to comment inline like revdiff would be amazing 🥹

Charley Lee

@charleyslee

about 2 months ago

@nicole_clash right but that's still an intelligence bottleneck..

charleyslee retweeted

TBPN

@tbpn

8 months ago

NEWS: @datacurve raises $15M Series A, led by Chemistry

Charley Lee

@charleyslee

9 months ago

@MelkeyDev 🤩

Charley Lee

@charleyslee

9 months ago

It's been a bit since we started as UncleGPT

Y Combinator

@ycombinator

9 months ago

Congrats to @datacurve on their $15M Series A! Datacurve provides frontier training data to the world’s leading foundation-model labs, helping to push the boundaries of what AI can do. https://t.co/8SCyvZ6OEt

charleyslee retweeted

Serena Ge (Datacurve)

@serenaa_ge

9 months ago

Today we’re announcing we’ve raised $17.7 million in funding across a $15M Series A led by Chemistry and a $2.7M Seed to accelerate foundation model progress through providing frontier training data for LLMs.

When we first started Datacurve, it came from a simple realization: foundation model progress is limited not just by compute, but by data quality and complexity. The right data unlocks new capabilities, especially in coding, where accuracy and reasoning matter most.

We’re now proud to partner with the world’s leading foundation-model labs, providing them with high-quality, complex training data that helps push the boundaries of what AI can do.

This is still just the start. Come build the future of technology with us in San Francisco: https://t.co/6vVsjUmj4t

Huge thanks to our incredible team and investors who’ve believed in us since day one and beyond: @garrytan at @ycombinator, @1vnzh from @cohere, @Mark_Goldberg_ from @chemistry_fund, @TheDerrickLi from @AforeVC, @forwarddeploy, @SoheilK, and @shyamalanadkat.