Jeff Da @_jeffda - Twitter Profile

Jeff Da @_jeffda

about 1 month ago

@yannis__he Let’s go 🔥

Jeff Da @_jeffda

about 2 months ago

@alexfabbri4 Congrats on the launch!

_jeffda retweeted

Cognition @cognition

3 months ago

We are sharing an early preview of our ongoing SWE-1.6 training run.

It significantly improves upon SWE-1.5 while being post-trained on the same pre-trained model - and it runs equally as fast at 950 tok/s. On SWE-Bench Pro it exceeds top open-source models.

The preview model still exhibits some undesirable behaviors like overthinking and excessive self-verification, which we aim to improve. We are rolling out early access to a small subset of users in Windsurf.

_jeffda retweeted

Bing Liu

@vbingliu

3 months ago

OpenAI is moving away from SWE-Bench Verified, citing challenges on underspecified tasks, misaligned tests, and contamination.

We agree. These were exactly the motivations behind SWE-Bench Pro (https://t.co/ctvRFXzWu6).

What we changed:

→ Underspecified tasks: structured, executable problem definitions
→ Contamination: strict curation + private / commercial codebases

But this is just step one.

Where we’re pushing frontier coding evals next:

→ Beyond unit tests: rubric-based evaluation (https://t.co/9oifP5FQJO)
→ From static tasks to real-world agentic environments

Modern coding systems are not solving isolated problems. They operate as agents over repos, tools, and long-horizon workflows. Our evals need to reflect that.

SWE-Bench Pro is one step toward more realistic and reliable evaluation for coding agents.

We’ll keep pushing the frontier.

_jeffda retweeted

OpenAI Developers

@OpenAIDevs

3 months ago

The standard for frontier coding evals is changing with model maturity. We now recommend reporting SWE-bench Pro and are sharing more detail on why we’re no longer reporting SWE-bench Verified as we work with the industry to establish stronger coding eval standards. SWE-bench Verified was a strong benchmark, but we’ve found evidence it is now saturated due to test-design issues and contamination from public repositories. https://t.co/3GeAsnUHdC

_jeffda retweeted

Logan Kilpatrick

@OfficialLoganK

4 months ago

Introducing Gemini 3.1 Pro, our new SOTA model across most reasoning, coding, and STEM use cases!

_jeffda retweeted

MiniMax (official) @MiniMax_AI

4 months ago

Introducing M2.5, an open-source frontier model designed for real-world productivity.

- SOTA performance at coding (SWE-Bench Verified 80.2%), search (BrowseComp 76.3%), agentic tool-calling (BFCL 76.8%) & office work.

- Optimized for efficient execution, 37% faster at complex tasks.

- At $1 per hour with 100 tps, infinite scaling of long-horizon agents now economically possible

MiniMax Agent: https://t.co/aIzrFYcfUz
API: https://t.co/fHRdSV7BwZ
CodingPlan: https://t.co/FDhZBBjQrX

_jeffda retweeted

Noam Brown

@polynoamial

4 months ago

GPT-5.3-Codex's much better token efficiency *AND* faster inference is the biggest story of this release. Folks at @OpenAI worked hard to improve this and it will only get better from here. https://t.co/vlOmyxIJmv

_jeffda retweeted

Sam Altman

@sama

4 months ago

GPT-5.3-Codex is here!

*Best coding performance (57% SWE-Bench Pro, 76% TerminalBench 2.0, 64% OSWorld).
*Mid-task steerability and live updates during tasks.
*Faster! Less than half the tokens of 5.2-Codex for same tasks, and >25% faster per token!
*Good computer use.

_jeffda retweeted

Wenting Zhao

@wzhao_nlp

4 months ago

This release is an emotional one for me because I had stayed up so much for it 🥹 It has been truly amazing to see this model become better bit by bit through every change we make, and we have come a long way. Since I did mid-training for this model, I wanted to share a little anecdote about this part. We really made this model with user experience as a first-class consideration. We want people to actually use it, period. We took it so seriously that we redid midtraining because we saw cases where models failed to follow instructions on out-of-distribution scaffolds. We decided straight-up that we would fix this in a fundamental way instead of surface-level patching. The resulting base model, which we also release, is thus a healthy base. We find that, compared to other base models, this one better learns new tasks. Try fine-tuning our base and lmk what you think 🥳 https://t.co/KSvowSEdTu

Jeff Da @_jeffda

4 months ago

A strong and fast open-source coding model, and a tech report 😍

Qwen

@Alibaba_Qwen

4 months ago

🚀 Introducing Qwen3-Coder-Next, an open-weight LM built for coding agents & local development.
What’s new:
🤖 Scaling agentic training: 800K verifiable tasks + executable envs
📈 Efficiency–Performance Tradeoff: achieves strong results on SWE-Bench Pro with 80B total params and 3B active
✨ Supports OpenClaw, Qwen Code, Claude Code, web dev, browser use, Cline, etc
🤗 Hugging Face: https://t.co/rZoW4vRJpr
🤖 ModelScope: https://t.co/P0vT5zILBZ
📝 Blog: https://t.co/hFfFDYcwvd
📄 Tech report: https://t.co/Qx83PWS3oi

_jeffda retweeted

MiniMax (official) @MiniMax_AI

5 months ago

#1 open source on SWE-Bench Pro. Ahead of Gemini 3 Flash. Level with Haiku 4.5. Thanks @scale_AI for the solid benchmark. Let's keep pushing forward 💪

Jeff Da @_jeffda

5 months ago

Rubrics are effective verifiers for SWE-Agents!

Mohit Raghavendra @mohit_r9a

5 months ago

🚀New @scale_AI research:

Verifiers for SWE Agents have traditionally used unit tests or simple, execution-free classifiers. But can we get verifiers that are more expressive, repository-grounded, and still execution-free at scoring time?

We explore Agentic Rubrics to fill this gap 💡

Agentic Rubrics are repo-grounded, execution-free verifiers for SWE agents. We generate a checklist of concrete, codebase-specific criteria using an Agentic Harness, and then score patches against it. 🧑‍💻

_jeffda retweeted

Yuxiang Wei

@YuxiangWei9

6 months ago

Results:
- self-improvement on SWE-bench Verified (+10.4) and Pro (+7.8)
- better than the baseline RL using human issue data over the course of training

_jeffda retweeted

Scale AI

@scale_AI

6 months ago

New Scale research: Do AI models actually reason in ways humans can trust for real-world decisions?

Introducing MoReBench, the first benchmark for procedural moral reasoning in LLMs, measuring not just what models decide, but how they reason through moral ambiguity. https://t.co/TKpddknD99

Jeff Da @_jeffda

6 months ago

@scale_AI Check out the paper and dataset:
Paper: https://t.co/2q8KJEicaD
GitHub: https://t.co/JqshJFKsSb
Dataset: https://t.co/yOW8B2QLfn
Leaderboard: https://t.co/p8bL3cWjZG

Jeff Da @_jeffda

6 months ago

New open-source benchmark from @scale_AI: MCP-Atlas

MCP-Atlas is a large-scale benchmark for evaluating tool-use competency using 36 real MCP servers and 220 tools. The benchmark was featured in recent model cards (GPT, Claude, Gemini), and now it's open-source! https://t.co/Dl2NYCDPcZ

Bing Liu

@vbingliu

6 months ago

🚀 Today we’re open-sourcing MCP Atlas, a large-scale, real-server benchmark for agentic tool use, which has been used in the recent GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash model releases!

🧠 Key insight: realistic agentic tool use is not a function-calling problem. It requires tool discovery, orchestration, and recovery in real environments.

🔧 MCP Atlas evaluates agents on real MCP servers (36 servers, 220 tools, 1K human-written tasks). Models must find the right tools, call them correctly, chain them together, and handle failures.

📉 What we found:
• Agents fail more often at tool interaction than at reasoning
• Performance drops sharply with real-world tool friction
• Scaling models helps unevenly, robustness remains hard
• Claims-based eval reveals how agents fail, not just if they finish

Check it out!
📄 Paper: https://t.co/WlUkPxHbUS
🌍 Environment: https://t.co/48QAhFiiZU
📂 Dataset: https://t.co/eCH9tbPDgm
📊 Leaderboard: https://t.co/dbFPkpqAFk

#AgenticAI #ToolUse #LLMEval #Benchmarks #MCP

_jeffda retweeted

Scale AI

@scale_AI

6 months ago

We recently introduced MCP-Atlas, a benchmark for evaluating how well LLMs handle tool use via the Model Context Protocol. Even top models failed nearly half of realistic multi-tool tasks.

Today, we’re open-sourcing the benchmark so you can measure performance yourself. https://t.co/3OvnB2Jaeb
