Agent evals are drifting away from production reality.
Most benchmarks use clean tasks, well-specified requirements, deterministic metrics, and retrospective curation. Production work is messier, with implicit constraints, fragmented multimodal inputs, undeclared domain knowledge, long-horizon deliverables, and expert judgment that evolves over time.
This paper introduces AlphaEval, a production-grounded benchmark for evaluating agents as complete products.
AlphaEval contains 94 tasks sourced from seven companies deploying AI agents in core business workflows, spanning six O*NET domains. It evaluates systems like Claude Code and Codex as commercial agent products, not just model APIs.
The benchmark combines multiple evaluation paradigms: LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, and domain-specific checks.
Why it matters: organizations need benchmarks that start from real production requirements, then become executable evals with minimal friction.
Paper: https://t.co/cbTGgTWoNl
Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
In this ICLR 2026 paper, researchers from Google DeepMind and Johns Hopkins University demonstrate that current neural embedding models possess inherent architectural constraints that prevent them from accurately representing complex logical combinations of documents, highlighting a critical failure point that requires shifting toward more expressive retrieval designs.
ChapterPal for learners: https://t.co/tiCXc1DKLt
PDF: https://t.co/WMCqddeGKk
[Download 698-page PDF eBook]
Everything You Always Wanted To Know About #Mathematics* (*But didn’t even know to ask)
A Guided Journey Into the World of Abstract Mathematics, Theorems, and the Writing of Proofs: https://t.co/JLsDOmpP1q
Excited to share our latest work on untangling language models by training them with extremely sparse weights! We can isolate tiny circuits inside the model responsible for various simple behaviors and understand them unprecedentedly well. https://t.co/Isw1KYfdnA
Today I learned that Pinterest sends out interview preparation advice to engineers interviewing with them.
One of the resources they recommend is this one I wrote, about preparing for the systems design and coding interviews:
https://t.co/OgLf9gBB6w
Stanford’s CS336 "Language Modeling from Scratch" is available for free on YouTube.
Instead of wasting time on background, it goes straight into practical topics you rarely see explained elsewhere — PyTorch, MoE, Triton, parallelism, eval, scaling laws, alignment, and more.
a senior engineer at google just dropped a 400-page free book on docs for review: agentic design patterns.
the table of contents looks like everything you need to know about agents + code:
> advanced prompt techniques
> multi-agent patterns
> tool use and MCP
> you name it
Learning compilers? Check "A Compiler Writing Journey".
https://t.co/fMZZ5Sxt9R
Building a compiler from the ground up, documented with each step in detail.
Way too many people think that AlphaFold "solved" ML for proteins.
It didn't.
It did revolutionize protein structure prediction, but that’s just one part of a much bigger puzzle.
This is Part 1 of a series on what AlphaFold did (and didn’t) solve—and what comes next. ⬇️
More on collective behavior: Our new Annual Review of Biophysics piece - with the stellar Danielle Chase - explores how animals sense, share information, and make group decisions. In honeybees and beyond 🐝
https://t.co/UcuG35gUu5
New fastest shortest-path algorithm in 41 years!
Tsinghua researchers broke Dijkstra’s 1984 “sorting barrier,” achieving O(m log^(2/3) n) time. This means faster route planning, less traffic, cheaper deliveries, and more efficient networks - and a CS curriculum revamp =)
How to 𝗦𝗽𝗲𝗰𝗶𝗮𝗹𝗶𝘇𝗲 Your LLM — with Semantic Graphs — No RAG.
I found this Unconventional 𝗦𝗲𝗺𝗮𝗻𝘁𝗶𝗰 𝗚𝗿𝗮𝗽𝗵𝘀 approach inside PromptQL’s 8-step blueprint ⬇️
In short, they rebuilt the LLM stack with:
→ Planning → Execution → Explanation
→ Powered by Semantic Graphs, Domain-Specific Languages, and Deterministic Runtimes.
》WHERE THINGS START TO BREAK
Most AI agents collapse once you move beyond demos:
✗ Business logic is buried in unversioned prompts
✗ Retrieval is semantically close, not schema-correct
✗ Execution happens inside the LLM → no determinism
✗ Retry loops patch failures instead of solving them
✗ Schema drift silently breaks everything
》WHY THIS FAILURE PERSISTS
Many teams still chase better prompts — or try Agentic RAG, memory chains, or “reasoning LLMs.”
But they fail when asked to execute multi-step logic over real enterprise data with policies, joins, and constraints.
Why?
✗ Agentic RAG ≠ schema-aware
✗ Reasoning LLMs can’t version or debug logic
✗ Tool-calls mid-prompt cause retries and errors
Real reasoning needs:
▪ Semantic map of models, relationships, and rules
▪ Plans in DSL — not prompt tokens
▪ Execution outside the LLM, with guardrails
▪ Outputs that are explainable and reusable
》WHAT THIS ARCHITECTURE CHANGES
✸ Step 1️⃣: Build semantic metadata (models, commands, relationships, permissions)
✸ Step 2️⃣: LLM interprets user query into a structured plan
✸ Step 3️⃣: Plan is expressed in a Planning DSL (e.g. YAML + GraphQL-style)
✸ Step 4️⃣: Plan is parsed and executed by a runtime engine
✸ Step 5️⃣: Execution uses versioned APIs — no tool-calling inside prompts
✸ Step 6️⃣: Business logic enforced with guardrails + access policies
✸ Step 7️⃣: Full trace of all intermediate steps + logs
✸ Step 8️⃣: Plans become reusable → LLMs become domain-specific software generators
》REAL-WORLD RESULTS, NOT DEMOS
✸ Sales Drop Analysis
▪ User: “Why are Monday delivery sales down in Munich?”
▪ PromptQL: Joins across POS, delivery, weather, and region mappings
→ Generates one structured plan → Executes deterministically → No dashboards, no code
✸ Inventory Risk
▪ User: “Which SKUs are low due to returns?”
▪ PromptQL: Merges data from Snowflake, MySQL, and ML outputs
→ Detects schema drift → Validates joins → Automates analyst workflow
✸ CRM systems
▪ User: “Which deals haven’t moved in 30 days?”
▪ PromptQL: Builds a unified semantic graph across CRM systems
→ Makes it queryable in English → Answers in seconds, not weeks
》THE DEVELOPER ADVANTAGE
✸ DSL plans are versioned, testable, modular
✸ Execution is isolated from LLM randomness
✸ Logic is explainable and observable
✸ Semantic metadata evolves — no retraining
✸ Outputs are structured, validated, reusable
✸ Architecture forms a continuous learning layer
~~
🙌 Huge thanks to @PromptQL for for this incredible collaboration.
See how Semantic Graphs + DSLs can turn your LLM into a domain expert: https://t.co/giZ4axlPWF
The paper, "C++ Design Patterns for Low-Latency Applications, Including High-Frequency Trading," is an excellent read for those interested in performance engineering.
https://t.co/vOpNKg5KNq
📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies?
Remember DeepSeek R1, o1 have impressed us on Olympiad-level math but also they were failing at simple arithmetic 😬
We built a benchmark to find out → OMEGA Ω 📐
💥 We found that although very powerful, RL struggles to compose skills and to innovate new strategies that were not seen during training. 👇
work w. @UCBerkeley@allen_ai
A thread on what we learned 🧵
🚀📢 GPT models have blown our minds with their astonishing capabilities. But, do they truly acquire the ability to perform reasoning tasks that humans find easy to execute? NO⛔️
We investigate the limits of Transformers *empirically* and *theoretically* on compositional tasks🔥
🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute towards pattern matching?
The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks, but we found their fundamental limitations are more severe than expected.
In our latest work, we compared each “thinking” LRM with its “non-thinking” LLM twin. Unlike most prior works that only measure the final performance, we analyzed their actual reasoning traces—looking inside their long "thoughts". Our analysis reveals several interesting results ⬇️
📄 https://t.co/PjnYpVRdX3
Work led by @ParshinShojaee and @i_mirzadeh, and with @KeivanAlizadeh2, @mchorton1991, Samy Bengio.