GPT-5 shows remarkable robustness for production instruction-following. On IFScale (our benchmark testing 100s of simultaneous constraints), it maintains >90% accuracy* through 500 instructions. Huge leap over previous bests o3 & gemini-2.5-pro (~69% @ 500).
*run on 1 seed, 5 ongoing
How many instructions can your LLM follow at once?
Production LLM systems juggle 10-100s of instructions: policies, style, safety rules, tool use--but when do they overload?
We introduce IFScale, a new benchmark measuring how instruction following degrades as instructions scale🧵
@data_beth @alex_gude I’m not really following closely this year but I find the extended highlights on YouTube to be the right amount (30-45 mins) of action/context
@data_pat Trying to come up with a silly response to this has led me down a sad Google path that started from “Jupyter notebooks in production” and has mostly left me shaken
"New State of the Art AI Optimizer: Rectified Adam (RAdam). Improve your AI accuracy instantly versus Adam, & why it works"
It's been a long time since we've seen a new optimizer reliably beat the old favorites; this looks like a very encouraging approach!
https://t.co/1MZmTbmFjn
This is a super cool resource: Papers With Code now includes 950+ ML tasks, 500+ evaluation tables (including SOTA results) and 8500+ papers with code. Probably the largest collection of NLP tasks I've seen, including 140+ tasks and 100 datasets.
https://t.co/lTAGE7LGZY
Monaural Sound Separation (input: song with vocals and instruments, output: only vocals) using MaD TwinNet architecture, from Drossos et al.
Online demo, PyTorch models and arXiv paper available:
https://t.co/0pvxrHlJc1
Last 4 years in MT: more parallel data, more recurrence!
Last 8 days in MT: no parallel data, no recurrence!
Impressive work from Facebook on unsupervised MT (https://t.co/65GNUl6Ldo) and Salesforce on non-autoregressive MT (https://t.co/0bE9B6eJLE).