Scale Labs @ScaleAILabs - Twitter Profile

2 days ago

How do you turn agent traces into an improvement flywheel? Excited to share Insights Generator (IG) — new @scale_AI / @ScaleAILabs research that finds behavioral patterns and bugs in agent traces. Engineers & coding agents using IG achieved 30+% gains on agent benchmarks. 🧵

akshay_manglik's tweet photo.

2

13

7

0

385

Scale Labs @ScaleAILabs

3 days ago

HiL-Dynamics: https://t.co/ewdzSwOqsp
Blog: https://t.co/FRlvGlCCzt

0

1

0

189

Scale Labs @ScaleAILabs

3 days ago

Today we're releasing HiL-Dynamics, the first open-source tool that measures how production agents actually collaborate with humans under uncertainty. Not just whether they got the answer. Now you can measure exactly when your agent asks for help, when it makes assumptions, and when it'll confidently ship the wrong answer. Our findings 🧵

3

25

7

6

2K

Scale Labs @ScaleAILabs

3 days ago

Selective escalation remains one of the biggest challenges for reliable human-in-the-loop AI. We hope HiL-Dynamics helps users find the right setup for their workflows and gives model builders clearer signals for building agents that collaborate with humans more effectively.

1

0

184

Scale Labs @ScaleAILabs

7 days ago

Claude Opus 4.8 just landed on our MCP Atlas Leaderboard! Opus 4.8’s performance places it in the top band of SOTA models for agentic tool calling. The Claude 4 family keeps getting better at long-horizon tool use. Check out the updated rankings: https://t.co/ozbAVmlUWS

1

11

1

0

600

Scale Labs @ScaleAILabs

9 days ago

Full paper: https://t.co/bByUtL8LGq

0

2

0

1

203

Scale Labs @ScaleAILabs

9 days ago

We built ASPI to isolate clarification-seeking as its own agent state. Each benchmark scenario compares:
- Execution mode → the agent receives a fully specified task
- Clarification mode → the agent must ask follow-up questions before acting

This allows us to measure how ambiguity changes an agent’s security profile.
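
A minimal sketch of the paired comparison described in this tweet, assuming each scenario is run once in each mode and yields a yes/no outcome for whether the injected attack succeeded. The `ScenarioResult` type, its field names, and the sample data are hypothetical illustrations; they are not taken from the ASPI paper or its code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    """Outcome of one scenario run in both modes (hypothetical structure)."""
    scenario_id: str
    attacked_in_execution: bool      # attack succeeded on the fully specified task
    attacked_in_clarification: bool  # attack succeeded when the agent had to ask first


def attack_success_rates(results):
    """Return (execution_rate, clarification_rate, gap) over paired scenarios."""
    if not results:
        raise ValueError("need at least one scenario result")
    n = len(results)
    execution_rate = sum(r.attacked_in_execution for r in results) / n
    clarification_rate = sum(r.attacked_in_clarification for r in results) / n
    # A positive gap means the agent was attacked more often under ambiguity.
    return execution_rate, clarification_rate, clarification_rate - execution_rate


# Made-up outcomes for illustration only.
sample = [
    ScenarioResult("s1", False, True),
    ScenarioResult("s2", False, True),
    ScenarioResult("s3", True, True),
    ScenarioResult("s4", False, False),
]
execution_rate, clarification_rate, gap = attack_success_rates(sample)
```

Pairing the two modes on the same scenario keeps the task fixed, so any gap can be attributed to the clarification step rather than to differences between tasks.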

1

2

0

564

Scale Labs @ScaleAILabs

9 days ago

The takeaway: standard security evaluations may be underestimating the attack surface of interactive AI agents. A model that appears secure on fully specified tasks may become significantly more vulnerable once it has to handle ambiguity and request additional user input.

1

2

0

256

Scale Labs @ScaleAILabs

9 days ago

New @scale_AI research introduces ASPI: Ambiguous-State Prompt Injection. Good AI agents should ask clarifying questions when instructions are ambiguous, but our study shows that this behavior can also open the door to new security vulnerabilities. Across 728 attack scenarios and 10 frontier models, here's what we found 🧵

32

21

3

2K

Scale Labs @ScaleAILabs

13 days ago

Rubric-based rewards are now standard for open-ended RL. But higher rubric scores don’t always mean better models. Our latest research shows models can learn to optimize the rubric-verifier setup itself, improving checklist coverage while broader quality declines. Robust post-training needs stronger verifiers and better ways to detect reward hacking.

Anas Mahmoud

@nas_mahmoud_

22 days ago

1/ Using rubrics (a.k.a. checklists) in RL training is now standard for open-ended tasks without a final verifiable result. However, rubric rewards are still proxy rewards that can get hacked during RL training. We study when rubric-based RL genuinely improves models vs. teaches them to hack the verifier/rubric. We quantify this through exploitation, analyze the failure modes, and introduce a verifier-free metric. https://t.co/D4L9DdfphF

5

170

21

196

99K

2

65

5

53

6K

ScaleAILabs retweeted

Utkarsh Tyagi

@utkarsh4430

15 days ago

1/ New from @ScaleAILabs: Rubrics (a.k.a. checklists) have become the default reward interface for RL on open-ended tasks without final verifiable answers. But most rubric RL still relies on static aggregation: fixed human weights over criteria, summed into one scalar reward. We show that this conflates what should matter in the final answer with what can actually teach the current policy. https://t.co/H5wTQ27ulb
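
For readers unfamiliar with the term, "static aggregation" as described in this tweet can be sketched as a fixed weighted sum over per-criterion scores. The function name, the criteria, and the weights below are hypothetical; this illustrates the baseline the tweet critiques, not the method the linked work proposes.

```python
def static_rubric_reward(scores, weights):
    """Collapse per-criterion rubric scores into one scalar via fixed weights.

    scores:  criterion name -> verifier score for one response (e.g. 0.0 or 1.0)
    weights: criterion name -> fixed, human-assigned importance
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same criteria")
    # The weights never change during training, whatever the policy has learned.
    return sum(weights[name] * scores[name] for name in scores)


# Hypothetical rubric with three criteria.
weights = {"accuracy": 3.0, "citations": 1.0, "tone": 1.0}
scores = {"accuracy": 1.0, "citations": 0.0, "tone": 1.0}
reward = static_rubric_reward(scores, weights)
print(reward)  # 4.0
```

Because the weights encode only what matters in the final answer, a criterion the policy already satisfies on every sample contributes the same constant to every reward and so carries no learning signal, which is one way to read the conflation the tweet points out.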

2

74

21

53

8K

Scale Labs @ScaleAILabs

16 days ago

Congrats to @GoogleDeepMind for releasing Gemini 3.5 Flash and topping our MCP Atlas leaderboard! 🥇

0

57

4

2

2K

ScaleAILabs retweeted

jade

@jadechoghari

16 days ago

At @ScaleAILabs, we’ve been exploring how to get models to accurately caption large-scale robot and human manipulation videos. More than 1,000 hours of new demonstrations hit our platform daily from factories, homes, and industrial sites, and every episode needs precise action-level captions: what happened, what object was used, and where it ended up. Here’s what we’ve found so far 🧵

7

192

15

86

6K

Scale Labs @ScaleAILabs

17 days ago

Last week, we brought together builders across the AI ecosystem at @scale_AI SFHQ to talk all things agentic code, from evals to where coding agents still break down in real-world workflows. Thanks to everyone who joined us. More soon!