"Prognosticative pastry." "A hound circling a tree, nose to bark."
These aren’t parodies - they’re actual quotes from SOTA models in response to creative writing prompts, and they’re winning leaderboards that are rewarding slop.
We’re introducing *Hemingway-bench*, a new AI writing leaderboard, to fix this:
https://t.co/8HH0vfoqR7
https://t.co/E2BeNhOXq5
We designed Hemingway-bench to push frontier model writing toward genuine nuance and impact.
Instead of autograders and two-second vibe checks - both of which love fancy literary devices and dense formatting, over actual quality - we used expert human writers across a variety of fields to judge real-world writing tasks.
Why? I love writing. I love reading. Great science fiction is one of the things that's always inspired me. Even in terms of "enterprise value", so much of what we do in our day-to-day involves writing - we want crisp emails and insightful reports, not dry, verbose summaries.
Yeah, coding is important - but there's a reason I use CC-assisted apps, but still haven't read a full-fledged AI novel.
What did we find? Current leaderboards are easily hacked, and often negatively correlated with actual quality. If a model (over)uses all the stuff you learn about in school (metaphors in every sentence! transition words! complex, flowery phrases!), it ranks high on EQ-bench and LMArena.
But that’s not good writing that people actually want.
The winners of Hemingway-bench didn't sound like they were trying to win a poetry slam. Gemini 3 Flash, Pro, and Opus 4.5 took the top 3 spots because they had natural voices that didn't sound pretentious.
They were poetic and immersive, but in the right ways.
When they used wit, they didn't sound cringey and try-hard - they sounded like your naturally funny friend.
I'm waiting for the day AI wins a Pulitzer, and hopefully Hemingway-bench helps guide it on its way.
Check out the leaderboard and examples here: https://t.co/8HH0vfoqR7
And our blog post describing it: https://t.co/E2BeNhOXq5
Joining Surge is a chance to work at the forefront of AI with the best in the industry. We value autonomy and empower people to do the best work of their careers.
It's also a perfect opportunity for engineers who want to transition into AI.
If this resonates, please reach out!
We’re founded by ML practitioners focused on producing the best data in the industry, so we do things a little differently. Instead of outsourcing to low wage countries, we hire our own workforce based in the US and write all of our own software for ensuring quality.
We partner with the top AI teams in the world on RLHF, evaluations, and more. One example project we work on is providing the fine-tuning data for Anthropic’s Claude: https://t.co/EAUMF5ebCv
Research papers using our data below (linked version at https://t.co/B97jt0dlBC)
At @HelloSurgeAI, we are building the human data infrastructure to create useful + aligned AI.
We're in a period of unprecedented growth and are hiring across eng: full stack, infra, security, applied ml. SF/Remote. DM or email [email protected]
More details about us:
@natolambert One random example from TruthfulQA - base davinci and 3.5 typically answer Georgia but have some probability mass on the right answer (California). The GA answer is likely from discussions of the peach state, CA answer more like to be from discussions of ag output
@natolambert No intel, just a guess public research, but I think hunch 1 is right. The answer to the adversarial qs is in the weights, but base model choses the more salient wrong answer. Through RLHF it can learn to focus on the more authoritative sources in the training data and improves
@generatorman_ai@lmsysorg Great ideas! My intuition is #1 will be difficult as a binary comparison only gives a small amount of information, but collecting denser human feedback is a promising direction. like a description of *why* certain responses are good similar to constitutional ai from Anthropic
2. Iteratively applying RLHF is crucial. Instead of collecting human feedback on one model, the Llama team collected preference data and applied RLHF 5 times. This chart shows how the quality of responses shifts over the first 2 iterations (more mass to the right = better)
1. For supervised fine-tuning (training on prompt-completion pairs) quality is more important than quantity. The Meta team found that sticking to just 27k high-quality examples led to the best results. This matches the conclusions from the LIMA paper
@harpreet_utd @alexirobbins I haven't tried it, but I don't think it will. Have you tried using Github Copilot for VSCode? Would be curious how just having smart autocomplete compares to using something like the ChatGPT API, which takes in prompts and writes longer responses
I've done analyses in Jupyter notebook for years. Now Code Interpreter can write all the code, but I haven't used it daily because you can't tweak code, re-run analyses, install packages, etc.
I built Juno w/ @alexirobbins to bring GPT into Jupyter for the best of the worlds
@SiVola@sjwhitmore@alexirobbins no, unfortunately not, although I know Google is in the process of rolling out similar tooling to Collab!
Initially started working on this because there is no large company pushing AI tools in Jupyter specifically
Juno runs in your notebook context so you get working code out of the box. It can write, edit, and debug code. And it's a Jupyter extension so you pip install it and get started in <1 min: https://t.co/SslcgU6RCn
If you get a chance to try it out let me know what you think!
It still makes occasional mistakes, but the technology is already good enough to be useful as a Copilot style tool. Removing the need to read through plotting library docs saves a lot of time.
Try it out here with public datasets, or upload your own: https://t.co/CimVSYsRPY
DALL·E 2 let anyone with an idea for an image to create it by just asking for it, no Photoshop required.
Data science could be done in the same way soon.
I created AutoPlot w/ @mihail_eric to explore this idea. It uses GPT-3 to do analysis and create charts on your command
I'm super excited to release AutoPlot, a tool @amaub and I have been working on to answer the question:
What if you could do any data analysis by just speaking in natural language?
No confusing scripts, Excel macros, or Matplotlib incantations.
I've had a similar experience to Ethan. If you want to do an NLP data collection/labeling process and don't want/need to be managing annotators directly, Surge is remarkably easy to work with and their team does very good work.