Daniel J @djarosai - Twitter Profile

10 months ago

@kalomaze We show a wider range of models and some experiments with thinking enabled for Claude (small performance bump) in our paper: https://t.co/QIIBTortWr

0

1

0

49

Daniel J

@djarosai

10 months ago

GPT-5 shows remarkable robustness for production instruction-following. On IFScale, our benchmark testing hundreds of simultaneous constraints, it maintains >90% accuracy* through 500 instructions. Huge leap over the previous bests, o3 & gemini-2.5-pro (~69% at 500). *Run on 1 seed, 5 ongoing. https://t.co/UnybJD4Cph

2

51

15

5

6K

Daniel J

@djarosai

10 months ago

@kalomaze Closer to GPT-4.1. The Anthropic model performance in general tracks with the sentiment that Claude 3.7 was special in some ways and not superseded by sonnet/opus-4. Opus-4.1 does seem distinctly better at instruction following at scale though: https://t.co/66pL9QemIM

1

0

111

Daniel J

@djarosai

11 months ago

@mahaoo_ASI That's part of what makes the results so interesting! We see these steep degradation curves even for simple directive instructions. Performance drops are likely even more severe as you introduce greater instruction complexity and variety (conditional, hierarchical, etc.)

1

0

43

Daniel J

@djarosai

11 months ago

How many instructions can your LLM follow at once? Production LLM systems juggle tens to hundreds of instructions (policies, style, safety rules, tool use), but when do they overload? We introduce IFScale, a new benchmark measuring how instruction following degrades as the number of instructions scales 🧵

2

23

5

13

9K

Daniel J

@djarosai

11 months ago

Interested in this work and want to advance the frontier of what LLMs can do in real-world applications? Come join us at Distyl AI! https://t.co/jkPWgz1uZv

0

3

0

216

Daniel J

@djarosai

11 months ago

Many more insights in our paper that can inform the design of instruction-dense prompts, which are increasingly relevant for emerging agentic applications that must juggle various tool-use instructions and collected context simultaneously: https://t.co/2vrVi0fHOw

1

6

0

1

260

