Saurabh Tiwari @ronin_codex - Twitter Profile

@dair_ai

about 2 months ago

Agent evals are drifting away from production reality.

Most benchmarks use clean tasks, well-specified requirements, deterministic metrics, and retrospective curation. Production work is messier, with implicit constraints, fragmented multimodal inputs, undeclared domain knowledge, long-horizon deliverables, and expert judgment that evolves over time.

This paper introduces AlphaEval, a production-grounded benchmark for evaluating agents as complete products.

AlphaEval contains 94 tasks sourced from seven companies deploying AI agents in core business workflows, spanning six O*NET domains. It evaluates systems like Claude Code and Codex as commercial agent products, not just model APIs.

The benchmark combines multiple evaluation paradigms: LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, and domain-specific checks.

Why it matters: organizations need benchmarks that start from real production requirements, then become executable evals with minimal friction.

Paper: https://t.co/cbTGgTWoNl

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

12

214

39

240

18K

ronin_codex retweeted

Quincy Larson

@ossia

about 2 months ago

https://t.co/nuIS7xJw5C

2

247

51

330

50K

ronin_codex retweeted

BURKOV

@burkov

about 2 months ago

In this ICLR 2026 paper, researchers from Google DeepMind and Johns Hopkins University demonstrate that current neural embedding models possess inherent architectural constraints that prevent them from accurately representing complex logical combinations of documents, highlighting a critical failure point that requires shifting toward more expressive retrieval designs.

ChapterPal for learners: https://t.co/tiCXc1DKLt

PDF: https://t.co/WMCqddeGKk

2

243

41

237

20K

ronin_codex retweeted

Vivek Galatage

@vivekgalatage

about 2 months ago

An amazing write-up on "A memory allocator" by Doug Lea https://t.co/E7H2zN1CpG

2

153

12

120

8K

ronin_codex retweeted

Kirk Borne

@KirkDBorne

3 months ago

[Download 698-page PDF eBook]

Everything You Always Wanted To Know About #Mathematics* (*But didn’t even know to ask)

A Guided Journey Into the World of Abstract Mathematics, Theorems, and the Writing of Proofs: https://t.co/JLsDOmpP1q https://t.co/f7JEHmYkdO

26

3K

477

4K

181K

ronin_codex retweeted

Leo Gao

@nabla_theta

7 months ago

Excited to share our latest work on untangling language models by training them with extremely sparse weights! We can isolate tiny circuits inside the model responsible for various simple behaviors and understand them unprecedentedly well. https://t.co/Isw1KYfdnA

40

494

59

285

249K

ronin_codex retweeted

Gergely Orosz

@GergelyOrosz

8 months ago

Today I learned that Pinterest sends out interview preparation advice to engineers interviewing with them.

One of the resources they recommend is this one I wrote, about preparing for the systems design and coding interviews:
https://t.co/OgLf9gBB6w https://t.co/RtG0XpGon8

16

3K

298

4K

156K

ronin_codex retweeted

ℏεsam

@Hesamation

8 months ago

Stanford’s CS336 "Language Modeling from Scratch" is available for free on YouTube.

Instead of wasting time on background, it goes straight into practical topics you rarely see explained elsewhere — PyTorch, MoE, Triton, parallelism, eval, scaling laws, alignment, and more. https://t.co/f5BW390w1r

8

440

55

392

17K

ronin_codex retweeted

Vivek Galatage

@vivekgalatage

8 months ago

📚 A Complete Guide to Standard C++ Algorithms by Šimon Tóth https://t.co/yroJ8v1H7P

4

1K

163

1K

42K

ronin_codex retweeted

ℏεsam

@Hesamation

9 months ago

a senior engineer at google just dropped a 400-page free book on docs for review: agentic design patterns.

the table of contents looks like everything you need to know about agents + code:
> advanced prompt techniques
> multi-agent patterns
> tool use and MCP
> you name it https://t.co/DIIaDOpdGj

62

9K

1K

22K

1M

ronin_codex retweeted

Vivek Galatage

@vivekgalatage

9 months ago

Learning compilers? Check "A Compiler Writing Journey".

https://t.co/fMZZ5Sxt9R

Building a compiler from the ground up, documented with each step in detail. https://t.co/mA8DQHnf5r

7

1K

135

1K

87K

ronin_codex retweeted

Georgia Channing

@cgeorgiaw

9 months ago

Way too many people think that AlphaFold "solved" ML for proteins.

It didn't.
It did revolutionize protein structure prediction, but that’s just one part of a much bigger puzzle.

This is Part 1 of a series on what AlphaFold did (and didn’t) solve—and what comes next. ⬇️ https://t.co/EGvEPdRnqG

18

1K

135

863

87K

ronin_codex retweeted

Orit Peleg

@oritpeleg

10 months ago

More on collective behavior: Our new Annual Review of Biophysics piece - with the stellar Danielle Chase - explores how animals sense, share information, and make group decisions. In honeybees and beyond 🐝

https://t.co/UcuG35gUu5 https://t.co/qzwVCFyZkF

13

2K

437

2K

130K

ronin_codex retweeted

Dorsa

@dorsa_rohani

10 months ago

New fastest shortest-path algorithm in 41 years!
Tsinghua researchers broke Dijkstra’s 1984 “sorting barrier,” achieving O(m log^(2/3) n) time. This means faster route planning, less traffic, cheaper deliveries, and more efficient networks - and a CS curriculum revamp =) https://t.co/MMuK1x8jRH

333

29K

3K

14K

2M

ronin_codex retweeted

Maryam Miradi, PhD

@MaryamMiradi

10 months ago

How to 𝗦𝗽𝗲𝗰𝗶𝗮𝗹𝗶𝘇𝗲 Your LLM with Semantic Graphs, No RAG.

I found this Unconventional 𝗦𝗲𝗺𝗮𝗻𝘁𝗶𝗰 𝗚𝗿𝗮𝗽𝗵𝘀 approach inside PromptQL’s 8-step blueprint ⬇️

In short, they rebuilt the LLM stack with:
→ Planning
→ Execution
→ Explanation
→ Powered by Semantic Graphs, Domain-Specific Languages, and Deterministic Runtimes.

》WHERE THINGS START TO BREAK
Most AI agents collapse once you move beyond demos:
✗ Business logic is buried in unversioned prompts
✗ Retrieval is semantically close, not schema-correct
✗ Execution happens inside the LLM → no determinism
✗ Retry loops patch failures instead of solving them
✗ Schema drift silently breaks everything

》WHY THIS FAILURE PERSISTS
Many teams still chase better prompts, or try Agentic RAG, memory chains, or “reasoning LLMs.” But they fail when asked to execute multi-step logic over real enterprise data with policies, joins, and constraints. Why?
✗ Agentic RAG ≠ schema-aware
✗ Reasoning LLMs can’t version or debug logic
✗ Tool-calls mid-prompt cause retries and errors

Real reasoning needs:
▪ Semantic map of models, relationships, and rules
▪ Plans in DSL, not prompt tokens
▪ Execution outside the LLM, with guardrails
▪ Outputs that are explainable and reusable

》WHAT THIS ARCHITECTURE CHANGES
✸ Step 1️⃣: Build semantic metadata (models, commands, relationships, permissions)
✸ Step 2️⃣: LLM interprets user query into a structured plan
✸ Step 3️⃣: Plan is expressed in a Planning DSL (e.g. YAML + GraphQL-style)
✸ Step 4️⃣: Plan is parsed and executed by a runtime engine
✸ Step 5️⃣: Execution uses versioned APIs, with no tool-calling inside prompts
✸ Step 6️⃣: Business logic enforced with guardrails + access policies
✸ Step 7️⃣: Full trace of all intermediate steps + logs
✸ Step 8️⃣: Plans become reusable → LLMs become domain-specific software generators

》REAL-WORLD RESULTS, NOT DEMOS
✸ Sales Drop Analysis
▪ User: “Why are Monday delivery sales down in Munich?”
▪ PromptQL: Joins across POS, delivery, weather, and region mappings
→ Generates one structured plan
→ Executes deterministically
→ No dashboards, no code

✸ Inventory Risk
▪ User: “Which SKUs are low due to returns?”
▪ PromptQL: Merges data from Snowflake, MySQL, and ML outputs
→ Detects schema drift
→ Validates joins
→ Automates analyst workflow

✸ CRM systems
▪ User: “Which deals haven’t moved in 30 days?”
▪ PromptQL: Builds a unified semantic graph across CRM systems
→ Makes it queryable in English
→ Answers in seconds, not weeks

》THE DEVELOPER ADVANTAGE
✸ DSL plans are versioned, testable, modular
✸ Execution is isolated from LLM randomness
✸ Logic is explainable and observable
✸ Semantic metadata evolves, no retraining
✸ Outputs are structured, validated, reusable
✸ Architecture forms a continuous learning layer

~~
🙌 Huge thanks to @PromptQL for this incredible collaboration.

See how Semantic Graphs + DSLs can turn your LLM into a domain expert: https://t.co/giZ4axlPWF
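
The plan-then-execute split described in Steps 2 to 5 can be illustrated with a minimal Python sketch. This is not PromptQL's actual DSL or runtime: the plan schema, the command names (`load_table`, `filter_rows`, `count_rows`), and the in-memory `TABLES` data are all hypothetical. It only shows the general idea that a model emits a structured plan, and a separate deterministic runtime validates and executes it while recording a trace.

```python
"""Hypothetical sketch: a structured plan run by a deterministic runtime.

Nothing here is PromptQL's real DSL or API; all names and data are invented.
"""
from typing import Any, Callable

# Hypothetical in-memory data standing in for enterprise tables.
TABLES: dict[str, list[dict[str, Any]]] = {
    "deals": [
        {"id": 1, "days_since_update": 45},
        {"id": 2, "days_since_update": 3},
        {"id": 3, "days_since_update": 31},
    ]
}


def load_table(_prev: Any, name: str) -> list[dict[str, Any]]:
    """Start a pipeline from a named table (the previous result is ignored)."""
    return list(TABLES[name])


def filter_rows(prev: list[dict[str, Any]], field: str, min_value: int) -> list[dict[str, Any]]:
    """Keep rows whose field is at least min_value."""
    return [row for row in prev if row[field] >= min_value]


def count_rows(prev: list[dict[str, Any]]) -> int:
    """Reduce the current rows to a count."""
    return len(prev)


# The runtime only knows these commands; anything else is rejected.
COMMANDS: dict[str, Callable[..., Any]] = {
    "load_table": load_table,
    "filter_rows": filter_rows,
    "count_rows": count_rows,
}


def run_plan(plan: list[dict[str, Any]]) -> tuple[Any, list[str]]:
    """Validate every step first, then execute in order and return (result, trace)."""
    for step in plan:
        if step["command"] not in COMMANDS:
            raise ValueError(f"unknown command: {step['command']}")
    result: Any = None
    trace: list[str] = []
    for step in plan:
        result = COMMANDS[step["command"]](result, **step.get("args", {}))
        trace.append(step["command"])  # record each executed step
    return result, trace


# A plan a model might emit for "Which deals haven't moved in 30 days?"
PLAN = [
    {"command": "load_table", "args": {"name": "deals"}},
    {"command": "filter_rows", "args": {"field": "days_since_update", "min_value": 30}},
    {"command": "count_rows"},
]

if __name__ == "__main__":
    result, trace = run_plan(PLAN)
    print(result, trace)
```

Because the plan is plain data, it can be stored, diffed, and re-run without calling the model again, which is the property the thread attributes to DSL-based plans.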

6

216

39

316

18K

ronin_codex retweeted

Vivek Galatage

@vivekgalatage

10 months ago

The paper, "C++ Design Patterns for Low-Latency Applications, Including High-Frequency Trading," is an excellent read for those interested in performance engineering.

https://t.co/vOpNKg5KNq https://t.co/zTMOBtxIuc

1

1K

170

1K

58K

ronin_codex retweeted

Udara

@TGUPJ

10 months ago

Research v0.85 is out with new tabs 🗂️

47

2K

125

2K

228K

ronin_codex retweeted

Nouha Dziri

@nouhadziri

12 months ago

📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies?

Remember DeepSeek R1, o1 have impressed us on Olympiad-level math but also they were failing at simple arithmetic 😬

We built a benchmark to find out → OMEGA Ω 📐

💥 We found that although very powerful, RL struggles to compose skills and to innovate new strategies that were not seen during training. 👇

work w. @UCBerkeley @allen_ai

A thread on what we learned 🧵

23

715

152

673

184K

ronin_codex retweeted

Nouha Dziri

@nouhadziri

about 3 years ago

🚀📢 GPT models have blown our minds with their astonishing capabilities. But, do they truly acquire the ability to perform reasoning tasks that humans find easy to execute? NO⛔️

We investigate the limits of Transformers *empirically* and *theoretically* on compositional tasks🔥 https://t.co/8caCE8zTf3

37

1K

328

877

503K

ronin_codex retweeted

Mehrdad Farajtabar

@MFarajtabar

12 months ago

🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute towards pattern matching?

The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks, but we found their fundamental limitations are more severe than expected.

In our latest work, we compared each “thinking” LRM with its “non-thinking” LLM twin. Unlike most prior works that only measure the final performance, we analyzed their actual reasoning traces—looking inside their long "thoughts". Our analysis reveals several interesting results ⬇️
📄 https://t.co/PjnYpVRdX3

Work led by @ParshinShojaee and @i_mirzadeh, and with @KeivanAlizadeh2, @mchorton1991, Samy Bengio.

110

3K

567

4K

908K
