Thanks @SnorkelAI for the great tasks and especially @fredsala, Tom Walshe, and Jeong Shin for the collaboration
Terminal-Bench 2.0 on the horizon + some other exciting releases!
@phoebethacker Congratulations to the entire team - love the diversity of occupations! Looking forward to insights on how different models perform on practical everyday tasks.
-2016 (classic era): focus on data efficiency
2017-2025 (pretraining era): focus on compute efficiency
2026-: focus on data efficiency (again)
The standard Transformer paradigm is optimized for compute efficiency. As we look at data efficiency, we'll see very different design decisions, which will be exciting!
Lots of chatter about agentic/RL simulation environments recently!
Some key misconceptions (slightly caricatured):
>> Building RL envs is easy, because you just code up a verifier quickly, and let the model do the tough data generation on its own!
- Usually, this boils down to over-indexing on environments where verification is easy.
- For example: you might need a chess expert to generate realistic expert gameplay traces, but anyone with a basic chess rulebook could verify a win easily.
- However: there are many, many settings where verification is not at all trivial. The simplest examples are settings with nuanced, domain-specific evaluation rubrics (e.g. most real-world enterprise settings). An extreme example: verifying whether a program will halt :)
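A rough sketch of that asymmetry, in Python. Everything here is hypothetical and only for illustration: `verify_exact`, `verify_rubric`, and the `support_rubric` criteria are made-up names, and the keyword checks stand in for judgments that would need real domain experts in practice.

```python
from typing import Callable, Dict, Tuple

def verify_exact(output: str, expected: str) -> bool:
    """Easy case: one unambiguous check, like reading off a chess result."""
    return output.strip() == expected.strip()

# Hypothetical rubric type: criterion name -> (weight, check function).
Rubric = Dict[str, Tuple[float, Callable[[str], bool]]]

def verify_rubric(output: str, rubric: Rubric) -> float:
    """Harder case: score an output as the weighted fraction of criteria met."""
    total = sum(weight for weight, _ in rubric.values())
    earned = sum(weight for weight, check in rubric.values() if check(output))
    return earned / total if total else 0.0

# Illustrative criteria only; a real enterprise rubric would need expert-written
# checks (or human graders), not keyword matching.
support_rubric: Rubric = {
    "mentions_refund_policy": (2.0, lambda s: "refund" in s.lower()),
    "no_legal_advice": (1.0, lambda s: "legal advice" not in s.lower()),
    "polite_close": (1.0, lambda s: s.strip().endswith("Thanks!")),
}
```

The exact-match verifier is a couple of lines; the rubric verifier is only as good as the criteria someone with domain knowledge writes into it, which is where the real cost sits.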
>> Building RL envs will get commoditized as the "standard" environments get rapidly solved.
- RL environments effectively encode a complete product spec - including unique tools, data resources, constraints, rubrics/verifiers, and human/agent simulators - and as such, are as diverse as the space of all possible AI products.
- Yes, certain generic RL envs will rapidly commoditize ('web browsing', 'computer OS') - but these are not the useful ones anyway!
- The useful RL envs will be deeply domain- and product-specific - and will require corresponding human expertise and customization to build and evolve over time.
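One way to picture "an RL env as a product spec" is as a plain data structure. This is a minimal sketch under assumptions: `EnvSpec`, its field names, and the claims-triage example are all invented for illustration and do not describe any particular framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class EnvSpec:
    """Hypothetical container for the pieces an RL environment has to define."""
    name: str
    verifier: Callable[[str], float]        # scores a finished transcript
    user_simulator: Callable[[str], str]    # stands in for the human/agent on the other side
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    data_resources: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)

    def components(self) -> List[str]:
        """List which optional parts of the spec have actually been filled in."""
        parts = ["verifier", "user_simulator"]
        if self.tools:
            parts.append("tools")
        if self.data_resources:
            parts.append("data_resources")
        if self.constraints:
            parts.append("constraints")
        return parts

# Made-up, domain-specific example: every field needs someone who knows the product.
claims_env = EnvSpec(
    name="claims-triage",
    verifier=lambda transcript: 1.0 if "escalated" in transcript else 0.0,
    user_simulator=lambda agent_msg: "My claim number is 123.",
    tools={"lookup_claim": lambda claim_id: f"claim {claim_id}: pending"},
    constraints=["never quote payout amounts"],
)
```

A generic 'web browsing' env mostly leaves these fields at their defaults; the domain-specific ones are where each field has to be written by hand.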
>> RL (and RL envs) will be all that you need!
- Current evidence suggests that RL / RL envs will be one part of the overall AI development loop - which will continue to require golden human annotations/traces for initial SFT, ongoing human evals, and more.
- Just like trial-and-error based learning is only one part of human learning, RL will likely be one tool/phase of many.
In summary:
- (1) Building the components of an RL environment is usually highly non-trivial.
- (2) RL envs effectively describe a product spec - there will be a wide range of unique ones, requiring deep product/domain expertise.
- (3) RL (and RL envs) will be one component of a rich ecosystem of tools for model learning, including human data, rubrics, evals, and more.
If you're interested in some of the work the @SnorkelAI team is doing in partnership with leading LLM developers here - shoot us a note!
It's an exciting time to build in this space :)
@united help! I'm stuck in Malaga trying to get home to California. Last night's flight was cancelled, today's rebooked flight is delayed on outbound to Newark. System rebooked me & split tickets. Local gate can't help
help! I'm stuck in Malaga trying to get home to California. Last night's flight was cancelled, today's rebooked @united flight is delayed on outbound to Newark. System rebooked me & split tickets. Local gate can't help.
1/ Super excited to deepen our partnership with @Azure! Most real-world use cases end up being blocked on the data. Solve data-centric development with @SnorkelAI, then connect seamlessly to Azure AI for model development and serving: https://t.co/1dZhPISexb
MyPOV - You know the #AI / #ML transformation has hit the mainstream when... a European / German (fair - high tech) company - presents in the US... #SplunkConf
Starbucks using Splunk ES for years and together with Phantom over 2 years! Please continue to protect my coffee ☕️ yummy waiting for a break to get one #splunkconf18 #siem #protectthebusiness