@ajratner's talk from Scale at Meta covering how benchmarks must evolve, the data layer underneath, and the collaborations driving this work forward.
https://t.co/1fEUhVuexK
Join us for late-afternoon boba and research. RSVP: https://t.co/boWl1s0gH6 🧋
Next up in the Snorkel AI Reading Group: Russell Yang (@StanfordLaw) on “JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment,” stemming from a collaboration with @harvey and Snorkel AI, with contributions from Charles Dickens.
When evaluating AI systems in high-judgment domains like law, should we use rubrics or pairwise preference rankings? JudgmentBench is a new benchmark that enables direct comparison of these evaluation methods using expert legal judgments.
We’re building the data and environments behind the world’s most advanced AI systems. If you want to work on hard problems that matter, alongside people who hold a high bar and move fast without ego, Snorkel is the place.
Learn more about open roles: https://t.co/luQ4eOnT4T
Join us for late-afternoon 🧋 boba and research. Details/RSVP: https://t.co/GK5lswnZbY
Next up in the Snorkel AI Reading Group: @EchoShao8899 (@stanfordnlp) on “Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration,” recently presented at @iclr_conf.
🚀 We're #hiring a Research Scientist – RL Training @SnorkelAI.
We need someone who's actually RLFT'd agents using complex environments (e.g. SWE-Bench/Terminal-Bench).
Deep hands-on experience with GRPO, RLHF, DPO, reward modeling & frameworks like verl/SkyRL. 30B+ scale and deep expertise in RL algorithms.
Come build SOTA coding agents with us!
📍 RWC / SF / NYC / Remote - US
#ML #ReinforcementLearning #PostTraining
The question has moved past whether coding agents can produce code that works. The harder question is whether they can complete real software work safely, measurably, and repeatedly — and how we supervise it.
Great piece by @realjustinbauer on the evolution of coding agents and why they need better data, evals, and environments: https://t.co/RBHhWfp6Nn
One major factor distorting our perception of AI capabilities: benchmark development now lags behind model development for the first time in AI history.
In traditional AI/ML: The rate of benchmark advancement (i.e. labeling a small-to-mid-sized dataset) exceeded that of model development - and so benchmarks gave a pretty useful view of frontier capabilities. This made them canonical measures of AI progress.
Today: it's very difficult to create benchmarks that properly measure *real-world* environments, scenarios, and tasks at the jagged frontier of AI capabilities - which itself has become an exponentially bigger space to measure - and that are robust to rapid overfitting. Benchmarks show near-saturated performance - even though models still have real capability gaps in practice.
One more reason why accelerating the pace of benchmark development - and doing so with the full power of open, academic communities - is so important!
@marklevinshow Carter's biggest foreign policy failure on full display. We traded a genuine ally in the Shah's Iran for a regime that chants "Death to America" and now we're dependent on countries that are allies only when it's convenient for them!
If you are an RL expert with a track record in LLM post-training, reach out directly with your resume!
*** Please share with your network ***
Our #research team is looking for an #RL expert to join the team that's working on building a Coding Agent.
Ping me if you have experience with RL for agents at scale and want to build something exciting (or apply through the links below)
#Hiring
Exciting release!
And we are looking for researchers with deep RL and LLM post-training expertise to help us build something amazing!
#hiring #rl #post_training
Our #MLSys2026 paper is live on arXiv 📄
We ran a systematic study of RLVR in low-data regimes across 3 procedurally generated benchmarks (counting, graph, spatial reasoning).
Key finding: dataset composition matters more than dataset size.
https://t.co/Z7ZuG1fLMD
Web agents are getting increasingly capable. Excited that our team @SnorkelAI partnered with @allen_ai on MolmoWeb where we managed the human trajectory annotations, verified correctness, and ensured quality control for the training data.
Blog: https://t.co/7uWdB1qmVq
Your eval rubric might be the weakest link in your entire training pipeline and you'd never know it.
We cataloged 8 failure modes across reliability, validity, and downstream impact along with automated diagnostics to catch them.
Read our latest paper RIFT for more details: https://t.co/r14Hq5otn8
Will be presented at DATA-FM @iclr_conf
Another thing that may go unnoticed precisely because it is so obvious: if every country were to fight the way the bloodthirsty occupying regime of Iran does, essentially any country could fight America. It is no feat to put everything ninety million people have on the line, hide behind them, and start throwing stones at the homes, hotels, refineries, and power plants of America's allies. The reason others don't do this is not that they can't; it's that, unlike you, they are not animals.
To the Commanders of the Islamic Revolutionary Guard Corps (IRGC):
Today, few doubt that little remains of the Velayat-e Faqih dictatorship but a half-dead body. As a result of five decades of warmongering and crimes, you are the real decision-makers of this collapsing system.
Your misguided regional policies and apocalyptic madness have brought the theater of war to Iran. Because you militarized our country's economic infrastructure instead of using it for our people, it is now in the crosshairs of two powers that have been roaming Iran's skies for weeks. This infrastructure was built with the national wealth of Iran and is vital for the country's reconstruction.
The corrupt regime of the Islamic Republic is on its way out. Your choice is not between survival and collapse; it is about how you collapse. The end of the current path is the delivery of a scorched earth to the Iranian nation following your inevitable downfall.
For the sake of Iran, for yourselves, and for your children, abandon this adventurism and warmongering. Do not leave Iran more bloodstained and wounded than it already is.
Allow the country’s infrastructure to be preserved for the Iranian nation. Stop your crimes. Step down from power.
To the commanders of the Islamic Revolutionary Guard Corps!
Today few doubt that nothing remains of the Velayat-e Faqih system but a half-dead body, and that, as a result of five decades of adventurism and crime, you are the real decision-makers of this collapsing structure.
Your misguided regional policies and apocalyptic madness have turned Iran into the stage of this war. The economic infrastructure that you deliberately militarized is now within range of two powers that have been roaming Iran's skies for weeks. This infrastructure was built with Iran's national wealth and is vital to the country's reconstruction.
The corrupt regime of the Islamic Republic is on its way out. Your choice is not between survival and fall; it is about how you fall. The current path ends with handing a scorched land to the Iranian nation after your certain fall.
For Iran, for yourselves, for your children, abandon this adventurism. Do not leave Iran more bloodstained and wounded than it already is. Let the country's infrastructure be preserved for the Iranian nation. End your crimes. Step down from power.