We’re open sourcing the first document OCR benchmark for the agentic era, ParseBench.
Document parsing is the foundation of every AI agent that works with real-world files. ParseBench is a benchmark that measures parsing quality specifically for agent knowledge work:
✅ It optimizes for semantic correctness (instead of exact similarity)
✅ It has the most comprehensive distribution of real-world enterprise documents
It contains ~2,000 human-verified enterprise document pages with 167,000+ test rules across five dimensions that matter most: tables, charts, content faithfulness, semantic formatting, and visual grounding.
We benchmarked 14 known document parsers on ParseBench, from frontier/OSS VLMs to specialized parsers to LlamaParse. Here are some of our findings:
💡 Increasing compute budget yields diminishing returns: Gemini/GPT-5-mini/Haiku gain 3-5 points from minimal to high thinking, at 4x the cost.
💡 Charts are the most polarizing dimension for evaluation. Most specialized parsers score below 6%, while some VLM-based parsers do a bit better.
💡 VLMs are great at visual understanding but terrible at layout extraction. GPT-5-mini/Haiku score below 10% on our visual grounding task, while all specialized parsers do much better.
💡 No method crushes all 5 dimensions at once, but LlamaParse achieves the highest overall score at 84.9%, and is the leader in 4 out of the 5 dimensions.
This is by far the deepest technical work that we’ve published as a company. I would encourage you to start with our blog and explore our links to Hugging Face and GitHub. All the details are in our full 35-page (!!) arXiv whitepaper.
🌐 Blog: https://t.co/57OHkx0pQW
📄 Paper: https://t.co/Ho2oH2xEAM
💻 Code: https://t.co/6P7UxqOZYA
📊 Dataset: https://t.co/YguIXWm41j
🎥 YouTube: https://t.co/6Fh1Nsk9ei
Okay, @gdb is team CLI all the way. @garrytan thinks MCPs suck.
So we hit the streets of SF to see if the city agreed.
We posed a simple question: MCP or CLI?
- Basically everyone under the age of 35 said CLI
- One person said MCP was as bloated as Java
- & unsurprisingly, numerous people told us to touch grass
Final score: MCP 3, CLI 17
SF has spoken, and @composio listened.
Our universal CLI is now live!
Drop your best CLI vs MCP hot take in the comments and we'll send the best ones some very sick gear 👀
Link to try our CLI in the next thread ⬇️
To celebrate the launch of @ElevenCreative on X, we’re giving away 111k credits to 3 lucky creators.
To enter: Like + follow @ElevenCreative
Winners announced on May 6 at 4 PM GMT
We just launched ElevenMusic.
We've paid out over $11M to voice creators.
Now the same model comes to music.
Like this post to get the link in your DMs.
https://t.co/2U1kN1JdUd