@kalomaze We show a wider range of models and some experiments with thinking enabled for Claude (small performance bump) in our paper: https://t.co/QIIBTortWr
GPT-5 shows remarkable robustness for production instruction-following. On IFScale, our benchmark testing 100s of simultaneous constraints, it maintains >90% accuracy* through 500 instructions. A huge leap over the previous bests, o3 & gemini-2.5-pro (~69% at 500).
*run on 1 seed, 5 ongoing
@kalomaze Closer to GPT-4.1. Anthropic model performance in general tracks with the sentiment that Claude 3.7 was special in some ways and not superseded by Sonnet/Opus-4. Opus-4.1 does seem distinctly better at instruction following at scale though:
@mahaoo_ASI That's part of what makes the results so interesting! We see these steep degradation curves even for simple directive instructions. Performance drops are likely even more severe as you introduce greater instruction complexity and variety (conditional, hierarchical, etc.)
How many instructions can your LLM follow at once?
Production LLM systems juggle tens to hundreds of instructions: policies, style, safety rules, tool use. But when do they overload?
We introduce IFScale, a new benchmark measuring how instruction following degrades as instructions scale 🧵
Interested in this work and want to advance the frontier of what LLMs can do in real-world applications? Come join us at Distyl AI! https://t.co/jkPWgz1uZv
Many more insights in our paper that can inform the design of instruction-dense prompts, increasingly relevant for emerging agentic applications that must juggle various tool-use instructions and collected context simultaneously https://t.co/2vrVi0fHOw