ZeroEval

@ZeroEval

The self-improving layer for agents.

NYC

Joined July 2025

8 Following

129 Followers

15 Posts

ZeroEval retweeted

LLM Stats @LlmStats

29 days ago

Today we're introducing the LLM Stats Index. For 3.2 years, we've tracked every frontier model release. The Index aggregates 200+ benchmark results into a single TrueSkill rating per model, spanning law, healthcare, coding, tool calling, vision, and reasoning. Across every category and every modality, the leading model on the Pareto Frontier is GPT-5.5 (@OpenAI). On our trajectories, human-knowledge benchmarks saturate by mid-2027. Capability has been the primary axis. The field is converging on it. Two more are opening. The first is efficiency: total task cost is the cleanest proxy we have for intelligence/watt. The second is throughput: inference speed becomes the productivity ceiling once models are cheap and good enough. We're building the next generation of long-horizon coding, tool use, and long context benchmarks. If you're working on long-horizon evaluation in real domains, we'd like to chat.

ZeroEval @ZeroEval

2 months ago

If you have agents in production, let's chat: https://t.co/zJ2fQ40qMx → https://t.co/fOLHqoWMKX

124

ZeroEval @ZeroEval

2 months ago

The companies that win the next decade of AI won’t be those that build the best agents. They’ll be the ones whose agents get better over time. Agents should grow, not just get shipped and maintained. We're so early. We're building for the long run.

seb

@sebcrossa

2 months ago

what if your agents could learn from their mistakes, and get better over time? companies are shipping agents to production at a higher rate than ever, and teams keep running into the same issues: incorrect tool calls, low prompt adherence, hallucinations, etc. we're closing this loop with @ZeroEval.

834

417

ZeroEval retweeted

LLM Stats @LlmStats

4 months ago

A Failure-Focused Evaluation of Frontier Models Benchmark scores tell you which model is "best on average", but not where they fail. We reproduced a set of difficult evaluations on seven frontier models to investigate two signals: consistent failures and task-specific advantages. Our findings: → 85.2% average failure rate on Humanity’s Last Exam across all seven models evaluated. → 46.2% of Humanity’s Last Exam questions were failed by all seven models under these evaluation conditions. → Nearly 80% of engineering problems, including structural analysis, thermodynamics, and control systems, remained unsolved by all models. Let’s dig deeper (1/8)

714

ZeroEval retweeted

LLM Stats @LlmStats

5 months ago

How does the Veo 3 family stack up in video generation? 🎬 I ran a series of tests to understand the capabilities of this model lineup. To my surprise, despite being part of the same family, there are significant differences in how each version approaches and solves the same prompt. Tested 4 different versions of Google's Veo to see which one handles video generation best: ✅ Veo 3.1 ✅ Veo 3.1 Fast ✅ Veo 3.0 ✅ Veo 3.0 Fast

333

ZeroEval retweeted

LLM Stats @LlmStats

6 months ago

🟩 Nemotron 3 Nano is out: → Hybrid Mamba-Transformer architecture: longer context that stays fast and cheap. → 31.6B params but only 3.6B active per token: frontier-adjacent performance at a fraction of the compute. → 4x faster inference than Nemotron 2 Nano → Open weights available through HF Info: https://t.co/v4riTy9Dgz Blog: https://t.co/T2WA0EF5Wv

356

ZeroEval retweeted

seb

@sebcrossa

6 months ago

what if you could teach the ai that powers your products on what's good and what's bad? after chatting with hundreds of AI co's about prompt engineering, the same thing comes up again and again: 95% of them are purely vibe prompting and hate the process. we just built a new feature for @ZeroEval that lets you improve your prompts through human feedback, powered by @DSPyOSS. plug into our sdk, give feedback (ui, sdk or api) and generate prompt improvements. as easy as that. let me show you how it works

585

ZeroEval retweeted

@hi_ventures_

7 months ago

🚀 AI 100 — Latin America’s Early AI Startups Map by Country Following our first edition of the AI 100 Map (by sector), we’re excited to share a new perspective — this time highlighting where innovation is happening across Latin America. This updated version showcases the country of origin for startups that: • Are building core AI products or applying AI in transformative ways • Are VC-backed and have raised no more than $10M • Represent the ambition, creativity, and technical depth that we love at Hi Ventures This geographic view gives us a glimpse into the emerging AI hubs driving the region’s tech revolution — from Mexico City to São Paulo, Buenos Aires, Bogotá, and beyond. Find the download link in the comments. If there’s a startup we missed, tag them below or DM us — we’re always discovering new talent shaping Latin America’s AI future. @mappa_ai, @getdarwinai, @Winclap, @territoriumlife, @JelouAI, @oimagie, @UpFluxPM, @neuralmedai, @start_carreiras, @Leadsales_io, @WeKallco, @heyyatlass, @yana_oficial, @ViewMind_, @Fintalk_ai, @ZapiaAI , @ArkhamInc, @inner_ai_, @joingaus, @Allie_Systems, @Saptiva_AI, @CedalioTech, @TimeToHire_Ai, @kapso_ai, @chambasai, @Leona_health, @pathpilotAI, @ZeroEval, @PicaioAI, @VerveMarketCo, @BircleAI, @SaludNowMX, @instacrops

694

ZeroEval retweeted

LLM Stats @LlmStats

7 months ago

LLM Stats is live on Product Hunt 🥳🎉 We're doubling down on independent benchmarking for AI models and bringing transparency and reproducibility to model performance. Are there any benchmarks you'd like to see or wish existed? Reply below. https://t.co/ABNNBCrmZd

779

ZeroEval @ZeroEval

9 months ago

@ollieforsyth @agentmail @dedaluslabs @DeepAwareAI @deepgrove_ai @Jerr_Wu @onkernel @luminal_ai @manufact @modelencecom @nuntiusai @LilacML @agenthublabs @vibeflowai @try_channel3 @monarcha_ai @nottecore Thank you, Ollie <3

ZeroEval retweeted

The AI Colony R&D @TheAIColonyRD

9 months ago

➡️ ZeroEval / @ZeroEval If you want AI agents that actually get smarter, this is it! ZeroEval builds agents that learn from their mistakes. It runs evaluations that train your models to improve over time, no retraining needed.

592

ZeroEval @ZeroEval

10 months ago

Video by @AustinPeirson

353

ZeroEval @ZeroEval

10 months ago

ZeroEval is a tool to help you evaluate and optimize your agents with human feedback. Learn more at https://t.co/XqCqkLm1KK

Y Combinator

@ycombinator

10 months ago

𝜃 @ZeroEval helps you build reliable AI agents through evaluations that learn from their mistakes and get better over time. https://t.co/aYaAVMBvZf Congrats on the launch, @sebcrossa and @pirchavez!

163

24K

ZeroEval @ZeroEval

10 months ago

The new GPQA Diamond Ranking: GPT-5 is now the leader.

663
