Cleanlab @cleanlabai - Twitter Profile

Pinned Tweet

7 months ago

🚀 New from Cleanlab: Expert Guidance

AI agents running multi-step workflows can fail in tiny, trust-breaking ways. Expert Guidance lets teams fix these behaviors with simple human feedback, instantly.

✈️ In one airline workflow: 76% → 90% after only 13 guidance entries.

1

15

3

5

9K

Cleanlab @CleanlabAI

4 months ago

We're thrilled to join forces with @joinHandshake, where we'll be able to scale our team's pioneering work to inflect change with the world's leading AI labs. Hear more from our CEO and Co-founder, @cgnorthcutt, to learn about our next chapter.

Curtis G. Northcutt

@cgnorthcutt

4 months ago

News: @joinHandshake acquires @CleanlabAI! This "ten-year old job marketplace" has quietly become a top human data lab for AI--building an AI research org, acquiring top AI talent, and advancing Cleanlab tech and research to lead data foundations for frontier AI. 1 of 4

2

22

4

5

4K

1

2

0

1K

CleanlabAI retweeted

Kevin Madura

@kmad

6 months ago

Achieving 20%+ improvement in structured extraction tasks using @DSPyOSS and GEPA

Building on a blog post from @CleanlabAI I wanted to see how quickly I could optimize a structured extraction task with DSPy + GEPA

In about 3 hours (mostly me getting in the way of claude code):
- +22 percentage points over vanilla structured outputs
- Ran 4 experiments in total
- ~$3 total cost

I tested 5 approaches incrementally:
• OpenAI Baseline: 32.1% exact match
• DSPy Baseline: 39.8%
• DSPy + BAML: 42.7%
• DSPy + GEPA: 53.8%
• DSPy + BAML + GEPA: 54.4%

2

92

16

87

18K
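
The exact-match figures in the post above can be related to its "+22 percentage points" claim with a little arithmetic. The following is a minimal sketch, not the code used in the post: the `exact_match_rate` helper is hypothetical, and it assumes "exact match" means the whole predicted record equals the reference record. The scores in `reported` are copied from the post.

```python
def exact_match_rate(predictions, references):
    """Percentage of predictions that equal their reference exactly.

    Assumption: a prediction only counts when the whole record matches.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(pred == ref for pred, ref in zip(predictions, references))
    return 100.0 * matches / len(references)


# Exact-match scores (in percent) as reported in the post.
reported = {
    "OpenAI Baseline": 32.1,
    "DSPy Baseline": 39.8,
    "DSPy + BAML": 42.7,
    "DSPy + GEPA": 53.8,
    "DSPy + BAML + GEPA": 54.4,
}

# Gain of each approach over the OpenAI baseline, in percentage points.
baseline = reported["OpenAI Baseline"]
gains = {name: round(score - baseline, 1) for name, score in reported.items()}

if __name__ == "__main__":
    for name, gain in gains.items():
        print(f"{name}: +{gain} points over baseline")
```

The gain is a difference in percentage points, not a relative improvement, which is why the best configuration comes out at roughly +22 rather than a larger relative figure.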

CleanlabAI retweeted

Prashanth Rao

@tech_optimist

6 months ago

For anyone who cares about structured output benchmarks as much as I do, here's an early Christmas present 🎁 ! Pretty well thought out from the folks @CleanlabAI. Seems like I'll def be using it to compare LLMs using BAML and DSPy! https://t.co/clQ0BuaX9l

4

60

11

43

4K

CleanlabAI retweeted

Menlo Ventures @MenloVentures

6 months ago

Where Did $37B in Enterprise AI Spending Go?

$19B → Applications (51%)
$18B → Infrastructure (49%)

Our report includes a snapshot of the Enterprise AI ecosystem, mapped across departmental, vertical AI, and infrastructure.

Although coding captures more than half of departmental AI spend at $4 billion, the technology is gaining traction across many enterprise departments: IT operations tools ($700M), marketing platforms ($660M), customer success tools ($630M).

AI-native startups are rapidly emerging across every job function, capturing a meaningful share of the $7.3B spent on departmental AI in 2025. https://t.co/v1RT23RP2n

2

14

2

10

2K

CleanlabAI retweeted

Jonas Mueller

@jomulr

6 months ago

Which LLM is better for Structured Outputs / Data Extraction: Gemini-3-Pro or GPT-5?

We ran popular benchmarks, but found their "ground truth" is full of errors.

To enable reliable benchmarking, we've open-sourced 4 new Structured Outputs benchmarks with *verified* ground-truth https://t.co/3HQSxh4NS5

3

34

9

24

24K

Cleanlab @CleanlabAI

6 months ago

@karanjagtiani04 One example could be: if there is an ambiguous context shift and the agent's original LLM message wrongly assumes something about the context, this can be auto-detected via a low trust score and the auto-revised message can be a follow-up question to clarify instead of assuming

1

0

19

Cleanlab @CleanlabAI

6 months ago

We discovered how to cut the failure rate of any AI agent on Tau²-Bench, the #1 benchmark for customer service AI.

Agents often fail in multi-turn, tool-use tasks due to a single bad LLM output (reasoning slip, hallucinated fact, misunderstanding, wrong tool call, etc). We introduce an automated LLM trust scoring + message revision pipeline that mitigates this brittleness and keeps agents on the rails.

Benchmarks show that our approach remains effective across all Tau²-Bench domains (Telecom, Retail, Airline) and different LLMs -- cutting agent failure rates up to 50%.

2

4

1

0

212
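
The trust scoring + message revision idea described above, together with the clarifying-question example from the earlier reply, can be sketched as a simple gate on each agent message. This is a minimal illustration under stated assumptions and not Cleanlab's actual pipeline or API: `AgentMessage`, `guard_message`, `ask_for_clarification`, the `[0, 1]` score range, and the `0.7` threshold are all hypothetical, and the trust score is assumed to be computed elsewhere.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentMessage:
    text: str
    # Hypothetical trust score in [0, 1]; higher means more trustworthy.
    trust_score: float


def guard_message(
    message: AgentMessage,
    revise: Callable[[AgentMessage], str],
    threshold: float = 0.7,  # assumed cutoff, chosen only for illustration
) -> str:
    """Pass trusted messages through; send low-trust ones to a revision step."""
    if message.trust_score >= threshold:
        return message.text
    return revise(message)


def ask_for_clarification(message: AgentMessage) -> str:
    """One possible revision: ask a follow-up question instead of assuming."""
    return "Before I continue, could you confirm what you would like me to do?"


if __name__ == "__main__":
    confident = AgentMessage("Your flight has been rebooked.", trust_score=0.95)
    shaky = AgentMessage("I cancelled your other booking too.", trust_score=0.20)
    print(guard_message(confident, ask_for_clarification))
    print(guard_message(shaky, ask_for_clarification))
```

Keeping the revision step as a pluggable function means the same gate could hand a low-trust message to a different fallback, such as regenerating the message or escalating to a human.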

Cleanlab @CleanlabAI

6 months ago

This pipeline can be used to automatically make any agent more reliable. Extensive benchmarks here: https://t.co/xXBHgq09cO

0

1

0

124

Cleanlab @CleanlabAI

7 months ago

👉 Full announcement here: https://t.co/9BFWXQtb7Z

0

1

2K

Cleanlab @CleanlabAI

7 months ago

The reality: We’re moving from hype to hardening, building the reliability layer AI needs.

🔍 Read the full Cleanlab report → https://t.co/pQRAlTujqj
📰 @Computerworld feature → https://t.co/T4OicSPeSb

0

1

0

2K

Cleanlab @CleanlabAI

7 months ago

The “Year of the Agent” just got pushed back.

Out of 1,837 enterprise leaders, most are struggling with stack churn + reliability.
⚙️ 70% rebuild every 90 days
😬 Less than 35% are happy with their infrastructure
🤖 Most “agents” still aren’t really acting yet https://t.co/2renQZAvcf

5

24

7

17

15K

Cleanlab @CleanlabAI

7 months ago

🚧 Even the best AI models still hallucinate. OpenAI’s recent paper on Why Language Models Hallucinate shows why this problem persists, especially in domain-specific settings. For teams implementing guardrails, we put together a short walkthrough: https://t.co/enIWYlYY3J

0

3

1

4

2K

Cleanlab @CleanlabAI

8 months ago

AI pilots prove intelligence, but AI in production demands reliability.

The best teams separate their stack early:
🧠 Core = how AI thinks
🛡️ Reliability = how it stays safe

That’s how prototypes become products.

👉 https://t.co/JtOO6rpKhV https://t.co/ZWlIxm68vz

2

22

7

14

13K

Cleanlab @CleanlabAI

8 months ago

@rajistics Love this!

0

43

Cleanlab @CleanlabAI

8 months ago

AI agents won’t replace humans. Their real power comes when humans guide them.

We just added Expert Answers to our platform:
👩‍🏫 SMEs fix AI mistakes right away
🔁 Fixes are reused across future queries
📈 Accuracy improves, “IDK” drops 10x

Full blog: https://t.co/iLq78qcUhg https://t.co/CT5ERFCoPj

0

192

Cleanlab @CleanlabAI

8 months ago

Launching an AI agent without human oversight is basically launching a rocket without mission control 🚀

Cool for a few minutes… until something breaks.

🕹️ It’s not the rocket that makes the mission succeed. It’s the control center.

https://t.co/ZZKaXQzl5v https://t.co/m0kBko7Gbj

9

78

22

47

20K

Cleanlab @CleanlabAI

9 months ago

📍 Live at @AIconference 2025 in San Francisco!

Tomorrow, @cgnorthcutt is sharing practical strategies for building trustworthy customer-facing AI systems, and our team is around all day to connect.

👋 Stop by and geek out with us! https://t.co/4JkUVfyOqV

0

3

0

198

Cleanlab @CleanlabAI

9 months ago

Most AI pilots in financial services never make it to production.

The reason is simple: they can’t be trusted.

Today, Cleanlab + @CorridorAI are fixing that by combining governance with real-time remediation so AI is finally safe to deploy at scale.

🔗 https://t.co/PxxZOuW3LG https://t.co/t5Woz1TcVX

0

4

0

415
