Excited to share a preview of @SnorkelAI's new Agentic Coding benchmark, which tests models on realistic, multi-step software engineering tasks in fully sandboxed execution environments across a calibrated range of task domains and difficulties, inspired by our work with the @terminalbench team!
With a top pass@5 score of 58% (Opus 4.5), this new benchmark challenges the notion running wild on X right now that LLMs have "solved" software engineering.
And, with unit tests plus both final-output and trajectory-level rubrics, it's already giving us & our partners insights into where coding agents fail. Excited to share more here shortly!
Link to benchmark & release post in 🧵👇
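For anyone new to pass@k: one common way to estimate it from n sampled attempts per task, c of which pass, is 1 - C(n-c, k) / C(n, k), averaged over tasks. Below is a minimal sketch assuming that standard estimator. The post does not say how the benchmark computes its pass@5, and the function names and sample numbers here are purely illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts, c of which passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n attempts has no pass;
    # comb returns 0 when fewer than k failing attempts exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each item is (attempts, passes)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no task results given")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Made-up results for three tasks, five attempts each.
    made_up = [(5, 0), (5, 2), (5, 5)]
    print(benchmark_pass_at_k(made_up, k=5))
```

With n equal to k, as in the made-up run above, this reduces to "did any of the 5 attempts pass", which is the simplest reading of pass@5.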
Lots of chatter about agentic/RL simulation environments recently!
Some key misconceptions (slightly caricatured):
>> Building RL envs is easy, because you just code up a verifier quickly, and let the model do the tough data generation on its own!
- Usually, this boils down to over-indexing on environments where verification is easy.
- For example: you might need a chess expert to generate realistic expert gameplay traces, but anyone with a basic chess rulebook could verify a win easily.
- However: there are many, many settings where verification is not at all trivial. The simplest examples are settings with nuanced, domain-specific evaluation rubrics (e.g. most real-world enterprise settings). An extreme example: verifying whether a program will halt :)
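To make the "easy verification" case concrete, here is a minimal sketch that uses tic-tac-toe as a small stand-in for the chess example: checking a win takes a few lines, even though playing well is a separate problem. The board encoding and function names are assumptions made for illustration, not anything from a real environment.

```python
# All eight winning lines on a 3x3 board, as index triples into a 9-char string.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def verify_win(board: str, player: str) -> bool:
    """Return True if `player` ('X' or 'O') holds a full line on `board`.

    `board` is 9 characters, row by row, each 'X', 'O', or '.'.
    """
    if len(board) != 9:
        raise ValueError("board must have exactly 9 cells")
    return any(all(board[i] == player for i in line) for line in WIN_LINES)


def reward(board: str, player: str) -> float:
    """Binary reward signal derived directly from the verifier."""
    return 1.0 if verify_win(board, player) else 0.0


if __name__ == "__main__":
    print(reward("XXX.OO...", "X"))  # top row is all X
```

A rubric-graded enterprise task has no equivalent of `WIN_LINES` to enumerate, which is where the verifier itself becomes the hard part.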
>> Building RL envs will get commoditized as the "standard" environments get rapidly solved.
- RL environments effectively encode a complete product spec - including unique tools, data resources, constraints, rubrics/verifiers, and human/agent simulators - and as such, are as diverse as the space of all possible AI products.
- Yes, certain generic RL envs will rapidly commoditize ('web browsing', 'computer OS') - but these are not the useful ones anyway!
- The useful RL envs will be deeply domain- and product-specific – and will require corresponding human expertise and customization to build and evolve over time.
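One way to picture "an RL env encodes a product spec" is as a data structure with one field per component listed above. This is a hypothetical sketch: `EnvSpec`, its field names, and the averaging in `score` are illustrative assumptions, not the API of any real framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EnvSpec:
    """Hypothetical container mirroring the components named in the post."""
    name: str
    tools: Dict[str, Callable[..., object]] = field(default_factory=dict)
    data_resources: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)
    # Each rubric/verifier maps a final output to a score in [0, 1].
    verifiers: List[Callable[[str], float]] = field(default_factory=list)
    # Human/agent simulators map an agent message to a simulated reply.
    simulators: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def score(self, final_output: str) -> float:
        """Average the verifier scores (an assumed, simplistic aggregation)."""
        if not self.verifiers:
            raise ValueError("spec has no verifiers")
        return sum(v(final_output) for v in self.verifiers) / len(self.verifiers)


if __name__ == "__main__":
    spec = EnvSpec(
        name="claims-triage-demo",  # made-up product
        constraints=["never reveal policyholder PII"],
        verifiers=[lambda out: 1.0 if "approved" in out else 0.0],
    )
    print(spec.score("claim approved"))
```

Every field is product-specific, which is the point: two products rarely share the same tools, constraints, and rubrics.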
>> RL (and RL envs) will be all that you need!
- Current evidence suggests that RL / RL envs will be one part of the overall AI development loop, which will continue to require golden human annotations/traces for initial SFT, ongoing human evals, and more.
- Just like trial-and-error based learning is only one part of human learning, RL will likely be one tool/phase of many.
In summary:
- (1) Building the components of an RL environment is usually highly non-trivial.
- (2) RL envs effectively describe a product spec - there will be a wide range of unique ones, requiring deep product/domain expertise.
- (3) RL (and RL envs) will be one component of a rich ecosystem of tools for model learning, including human data, rubrics, evals, and more.
If you're interested in some of the work the @SnorkelAI team is doing here in partnership with leading LLM developers, shoot us a note!
It's an exciting time to build in this space :)
Scale alone is not enough for AI data. Quality and complexity are equally critical. Excited to support all of these for LLM developers with @SnorkelAI Data-as-a-Service, and to share our new leaderboard!
—
Our decade-plus of research and work in AI data has a simple point: scale alone is not enough. AI success is all about the quality, complexity, and distribution of data—in addition to volume. We’re excited to be powering leading LLM developers with @SnorkelAI Expert Data-as-a-Service, our white glove service for custom, expert-level AI datasets—and to now preview some of what we’re building via our new Expert Data Leaderboard (🔗 in 🧵) + upcoming OSS dataset releases!
Snorkel Expert Data-as-a-Service is built to meet the rapidly evolving data needs of the agentic AI world—where success is built on the quality, complexity, and distribution of datasets, in addition to size and scale.
This kind of high-quality, frontier AI data can only come from a union of technology and human expertise. With Snorkel Expert Data-as-a-Service, we’re powering frontier LLM developers across agentic, expert knowledge, reasoning, coding, multi-modal, and other task types via the combination of these two key components:
- (1) The Snorkel Expert Network: A global team of subject matter experts focused wholly on specialized knowledge–spanning thousands of topics in STEM/academic, vertical/professional, and consumer/lifestyle domains.
- (2) @SnorkelAI Data Development Platform: Our unique programmatic data curation and quality control platform, accelerating and improving expert authoring and review through principled techniques developed over the last decade of R&D.
Now: we’re incredibly excited to showcase some of the power of Snorkel Expert Data-as-a-Service via the new Snorkel Leaderboard—putting frontier models to the test in complex, agentic, and reasoning settings inspired by real industry scenarios (not esoteric puzzles)!
We’ll be releasing new leaderboards and accompanying expert-verified open source datasets (coming soon!) regularly. To start, we’re sharing three initial ones in preview:
- SnorkelFinance: Q&A over financial documents requiring agentic tool-calling and reasoning
- SnorkelUnderwrite: Agentic insurance tasks requiring industry-specific reasoning and tool use
- SnorkelSequences: Mathematical tasks requiring compositional multi-step reasoning
Agentic AI will transform every enterprise–but only if agents are trusted experts.
The key: Evaluation & tuning on specialized, expert data.
I’m excited to announce two new products to support this–@SnorkelAI Evaluate & Expert Data-as-a-Service–along w/ our $100M Series D!
---
Snorkel Evaluate is our new data-centric agentic AI evaluation platform for specialized, mission-critical enterprise settings where vibe checks and out-of-the-box metrics driven by simple LLM prompts are not enough.
Snorkel Expert Data-as-a-Service is our white glove service for expert-level AI datasets, powering frontier LLM developers in areas like expert knowledge, reasoning, agentic action and tool use, and more!
Both are built on top of @SnorkelAI’s Data Development Platform, using our programmatic technology to produce higher-quality expert data, faster, and get specialized AI to real production value.
If you’re building enterprise AI and want to partner around the key ingredient in AI today–the data–book a demo and let's talk! https://t.co/w0J8izpn8p
Finally, see thread for details on 🧵👇
- 📽️ A walkthrough of Snorkel Evaluate and Expert Data-as-a-Service on an agentic AI enterprise task
- 📅 An upcoming event on Enterprise Agentic AI with innovators from @Accenture @BNY @Comcast @Stanford @QBE & others
- 📊 An upcoming series of benchmark datasets and model artifact releases
👀 Want early access to the full agentic AI dataset? Retweet this post and we'll send you the link!
Tanner Beason and Amir Bashti collected the first All-America accolades of their careers on Thursday night. It's the fifth consecutive season Stanford has had multiple players honored.
#GoStanford
https://t.co/NVHnVXBqPa
Cardinal win 5-2 over QPR. Goals from Bashti, Bulut, Joshua, Panchot, and Beason. Quick shower then off to Wembley to watch England vs Italy. #EnglandTour2018