Andrew Mauboussin

Verified account

@amaub

Working on making it easy to get trustworthy human-labeled data for ML @HelloSurgeAI

San Francisco, CA

Joined September 2009

2.1K Following

1.7K Followers

281 Posts

amaub retweeted

4 months ago

echen: "Prognosticative pastry." "A hound circling a tree, nose to bark."

These aren’t parodies - they’re actual quotes from SOTA models in response to creative writing prompts, and they’re winning leaderboards that are rewarding slop.

We’re introducing *Hemingway-bench*, a new AI writing leaderboard, to fix this:

https://t.co/8HH0vfoqR7

https://t.co/E2BeNhOXq5

We designed Hemingway-bench to push frontier model writing toward genuine nuance and impact.

Instead of autograders and two-second vibe checks - both of which love fancy literary devices and dense formatting, over actual quality - we used expert human writers across a variety of fields to judge real-world writing tasks.

Why? I love writing. I love reading. Great science fiction is one of the things that's always inspired me. Even in terms of "enterprise value", so much of what we do in our day-to-day involves writing - we want crisp emails and insightful reports, not dry, verbose summaries.

Yeah, coding is important - but there's a reason I use CC-assisted apps, but still haven't read a full-fledged AI novel.

What did we find? Current leaderboards are easily hacked, and often negatively correlated with actual quality. If a model (over)uses all the stuff you learn about in school (metaphors in every sentence! transition words! complex, flowery phrases!), it ranks high on EQ-bench and LMArena.

But that’s not good writing that people actually want.

The winners of Hemingway-bench didn't sound like they were trying to win a poetry slam. Gemini 3 Flash, Pro, and Opus 4.5 took the top 3 spots because they had natural voices that didn't sound pretentious.

They were poetic and immersive, but in the right ways.

When they used wit, they didn't sound cringey and try-hard - they sounded like your naturally funny friend.

I'm waiting for the day AI wins a Pulitzer, and hopefully Hemingway-bench helps guide it on its way.

Check out the leaderboard and examples here: https://t.co/8HH0vfoqR7

And our blog post describing it: https://t.co/E2BeNhOXq5

amaub retweeted

9 months ago

echen: Honored to be included in the TIME AI 2025 list.

What I'm most glad about is that they highlighted the part of Surge's work that actually matters.

... https://t.co/x1imZR8p3q

Andrew Mauboussin

over 2 years ago

Joining Surge is a chance to work at the forefront of AI with the best in the industry. We value autonomy and empower people to do the best work of their careers. It's also a perfect opportunity for engineers who want to transition into AI. If this resonates, please reach out!

Andrew Mauboussin

over 2 years ago

We’re founded by ML practitioners focused on producing the best data in the industry, so we do things a little differently. Instead of outsourcing to low-wage countries, we hire our own workforce based in the US and write all of our own software for ensuring quality.

Andrew Mauboussin

over 2 years ago

We partner with the top AI teams in the world on RLHF, evaluations, and more. One example project we work on is providing the fine-tuning data for Anthropic’s Claude: https://t.co/EAUMF5ebCv Research papers using our data below (linked version at https://t.co/B97jt0dlBC)

Andrew Mauboussin

over 2 years ago

At @HelloSurgeAI, we are building the human data infrastructure to create useful + aligned AI. We're in a period of unprecedented growth and are hiring across eng: full stack, infra, security, applied ml. SF/Remote. DM or email [email protected] More details about us:

Andrew Mauboussin

over 2 years ago

@natolambert One random example from TruthfulQA - base davinci and 3.5 typically answer Georgia but have some probability mass on the right answer (California). The GA answer is likely from discussions of the Peach State, the CA answer more likely to be from discussions of ag output

Andrew Mauboussin

over 2 years ago

@natolambert No intel, just a guess from public research, but I think hunch 1 is right. The answer to the adversarial qs is in the weights, but the base model chooses the more salient wrong answer. Through RLHF it can learn to focus on the more authoritative sources in the training data and improve

Andrew Mauboussin

almost 3 years ago

@generatorman_ai @lmsysorg Great ideas! My intuition is #1 will be difficult as a binary comparison only gives a small amount of information, but collecting denser human feedback is a promising direction, like a description of *why* certain responses are good, similar to Constitutional AI from Anthropic

Andrew Mauboussin

almost 3 years ago

2. Iteratively applying RLHF is crucial. Instead of collecting human feedback on one model, the Llama team collected preference data and applied RLHF 5 times. This chart shows how the quality of responses shifts over the first 2 iterations (more mass to the right = better)

Andrew Mauboussin

almost 3 years ago

1. For supervised fine-tuning (training on prompt-completion pairs) quality is more important than quantity. The Meta team found that sticking to just 27k high-quality examples led to the best results. This matches the conclusions from the LIMA paper

Andrew Mauboussin

almost 3 years ago

@harpreet_utd @alexirobbins I haven't tried it, but I don't think it will. Have you tried using GitHub Copilot for VS Code? Would be curious how just having smart autocomplete compares to using something like the ChatGPT API, which takes in prompts and writes longer responses

Andrew Mauboussin

almost 3 years ago

I've done analyses in Jupyter notebooks for years. Now Code Interpreter can write all the code, but I haven't used it daily because you can't tweak code, re-run analyses, install packages, etc. I built Juno w/ @alexirobbins to bring GPT into Jupyter for the best of both worlds

Andrew Mauboussin

almost 3 years ago

@SiVola @sjwhitmore @alexirobbins No, unfortunately not, although I know Google is in the process of rolling out similar tooling to Colab! Initially started working on this because there is no large company pushing AI tools in Jupyter specifically

Andrew Mauboussin

almost 3 years ago

Hacker News post is here: https://t.co/tM9zV50vz5

Andrew Mauboussin

almost 3 years ago

Juno runs in your notebook context so you get working code out of the box. It can write, edit, and debug code. And it's a Jupyter extension so you pip install it and get started in <1 min: https://t.co/SslcgU6RCn If you get a chance to try it out let me know what you think!

Andrew Mauboussin

over 3 years ago

It still makes occasional mistakes, but the technology is already good enough to be useful as a Copilot-style tool. Removing the need to read through plotting library docs saves a lot of time. Try it out here with public datasets, or upload your own: https://t.co/CimVSYsRPY

Andrew Mauboussin

over 3 years ago

DALL·E 2 lets anyone with an idea for an image create it by just asking for it, no Photoshop required. Data science could be done in the same way soon. I created AutoPlot w/ @mihail_eric to explore this idea. It uses GPT-3 to do analysis and create charts on your command

amaub retweeted

over 3 years ago

I'm super excited to release AutoPlot, a tool @amaub and I have been working on to answer the question: What if you could do any data analysis by just speaking in natural language? No confusing scripts, Excel macros, or Matplotlib incantations.

amaub retweeted

@sleepinyourhat

over 3 years ago

I've had a similar experience to Ethan. If you want to do an NLP data collection/labeling process and don't want/need to be managing annotators directly, Surge is remarkably easy to work with and their team does very good work.
