Excited to share our #ACL2026 paper: "How Retrieved Context Shapes Internal Representations in RAG" π§΅
w/ @Samuel861025
RAG is everywhere, but I am often puzzled by what happens inside the model when we stuff retrieved documents into the context. Did it integrate the evidence? Ignore it? Get confused by it? To answer these questions, we need to look at what's happening inside.
We built a controlled framework to study how the hidden states of LLMs change under different retrieval conditions: relevant, distracting, and random documents.
π€ The counterintuitive finding: Relevant documents barely move the needle on internal representations. They largely reinforce what the model already knows rather than injecting decisive new information. Meanwhile, random, irrelevant documents cause large representation drift -- far larger than relevant or even distracting ones.
Why? Because LLMs internally recognize uninformative context and shift into a refusal mode. That representation drift is tightly linked to abstention behavior. The model is essentially saying: "I see this context, and I know it's useless."
Other key insights:
β Later transformer layers increasingly prioritize parametric knowledge over retrieved evidence, limiting how much external context can influence generation.
β In multi-document settings, a single relevant document can anchor the entire representation and suppress surrounding noise. This is good news for imperfect retrieval in practice
β Distracting documents (semantically similar but unhelpful) are the silent threat: they don't trigger the model's refusal mechanism the way random docs do, but they quietly degrade quality
The takeaway for practitioners: understanding RAG can't stop at output accuracy. The internal dynamics tell a richer and sometimes alarming story about when your system is actually integrating evidence vs. when it's ignoring it, fighting it, or shutting down because of it.
We hope our analysis offers mechanistic insights for more principled and reliable RAG system design.
π https://t.co/EYEYAo1CcK
Outcome reward models: cheap, but vulnerable to spurious shortcuts π£
Process reward models (PRMs): robust, but too expensive to build from scratch π«
What if you could get a ready-to-use PRM right after any RL post-training?
Introducing 'Progress Advantage' π§΅
Agent RL training can be fragile and far less stable than reasoning RL. In our latest work, we identify and explain a phenomenon called Cyclical Entropy Eruption: a recurring instability unique to agent RL where entropy erupts, recovers, and erupts again throughout training. π§΅
π https://t.co/ww3M5RHO7s (led by @Wendi_Li_ and @shawnim00)
We decompose it into three phases:
Phase 1 Entropy Descent. The model first learns the basics: how to call tools, satisfy schema constraints, and use the right format tokens. Probability mass shifts rapidly from invalid outputs to valid trajectories. Entropy drops fast.
Phase 2 Entropy Eruption. The core instability. Correct and incorrect agent trajectories end up extremely close in representation space, far more overlapping than in non-agent tasks. This causes gradient interference: when RL suppresses bad trajectories, it accidentally drags down the likelihood of good ones too. The policy flattens, entropy spikes, and degenerate patterns like sentence duplication and hallucination emerge.
Phase 3 Entropy Subsidence. The flatter distribution leads to more diverse sampling, which reduces representation similarity and eases interference. Training recovers... but then reconverges, similarity rises again, and the next eruption begins. A self-perpetuating cycle.
The damage compounds. Degenerate patterns acquired during eruption persist and accumulate across cycles. In the worst case (Llama3.2-1B on WebShop), a single eruption triggers complete training collapse.
Motivated by our analysis, we propose SEAL (Separation-Enhanced Agent Learning)--a lightweight auxiliary loss that pushes correct and incorrect trajectories apart in representation space, directly targeting the root cause.
SEAL stabilizes training and improves performance across AlfWorld, WebShop, and search-augmented QA, on both Qwen and Llama backbones, with GRPO and GIGPO. On a Llama run where vanilla GRPO completely collapsed (0% success), adding SEAL recovered performance to ~80%.
π‘The broader takeaway: agent RL is a fundamentally different optimization problem than reasoning RL. The multi-turn structure, tool interactions, and validity constraints create training dynamics that deserve their own analysis. We hope this work helps the community build more stable agent post-training pipelines.
Code is available in the paper!
Best-of-N sampling is often used to boost LLM performance, but the selection relies on external evaluators, adding cost and bias. What if you could select the best output without any external scoring at all?
Introducing our #ACL2026 paper ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation! (led by Hyeong Kyu Choi @HyeonggyuC)
π‘ Our key insight: among multiple LLM generations, high-quality outputs tend to cluster together semantically. The best answer is the modal one: the generation that captures the dominant consensus.
How ModeX works:
1β£ Build a similarity graph over N candidate generations
2β£ Recursively apply spectral clustering via the Fiedler vector to isolate the dominant semantic cluster
3β£ Select the centroid of that cluster as the final output
No reward models. No external evaluators. No auxiliary inference. Just the texts themselves.
π Results across text summarization (CNN/DailyMail), code generation (HumanEval), and math reasoning (Math-500) show ModeX consistently outperforms single-path and multi-path baselines, achieving state-of-the-art among evaluator-free methods.
We also provide theoretical justifications connecting our graph-based mode selection to kernel density estimation, grounding the approach with principled foundations.
π Paper: https://t.co/QfPXgqmUHF
π» Code: https://t.co/YdCpupdnMp
Sometimes the best signal is already hiding in the samples; you just need to find the mode. π―
π€ How do you detect VLA failures during execution with only trajectory-level labels?
Excited to share our new paper:
"Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring"
I will present two papers at #ICLR this week!
1. HalluEntity: Benchmarking and Understanding Entity-Level Hallucination Detection
4/24 Fri 10:00-13:00 P4-#3901
2. LUMINA: Detecting Hallucinations in RAG System with Context-Knowledge Signals
4/25 Sat 15:15-17:45 P4-#3816
Your LLM agent just mass-deleted a production database because it was confident it understood the task. It didn't. Avoiding these irreversible mistakes requires uncertainty quantification, a pressing open problem in the era of LLM agents.
Check out our #ACL2026 paper:
"Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities"
π Why this matters:
LLM agents now book flights, modify databases, and execute code autonomously. Yet most UQ research still measures a single-turn QA setup. In contrast, agents follow multi-turn trajectories in which they interact with users, call tools, and receive environmental feedback. The gap between how we study UQ and how agents actually operate is enormous.
βοΈ A unified formulation:
We present the first unified formulation of Agent UQ. It models the full trajectory (actions, observations, states) and decomposes uncertainty per turn via the chain rule. Under this formulation, single-step LLM UQ and multi-step reasoning UQ fall out as special cases.
π§ Challenges:
We identify four core challenges: from selecting the right UQ estimator when existing methods all break down in agentic settings to handling heterogeneous uncertainty sources (user, tools, environment) to the near-total lack of fine-grained agent benchmarks (we survey 44 and find that turn-level evaluation is extremely rare).
π Implications and open problems:
Agent UQ is the missing safety layer for healthcare agents triaging patients, SWE agents pushing code to prod, and agents controlling cyber-physical systems. We also surface open problems around solution multiplicity, multi-agent UQ, and self-evolving systems.
We release code and data to help the community build on this.
π Paper: https://t.co/IrvVUa3JgY
π Project: https://t.co/LkDaCZCZep
π» Code: https://t.co/6juIEPaLqJ
Huge shoutout to @changdaeoh, who spearheaded this effort. When we started the work, agent UQ was a loosely defined space with scattered ideas; Changdae brought the clarity, structure, and rigor that the field needed to move forward.
Also thanks to all the collaborators: @seongheon_96 , To Eun Kim, @JiatongLi0418, @Wendi_Li_ , @Samuel861025@xuefeng_du, Hamed Hassani, Paul Bogdan, Dawn Song
We've been in GRPO-tweaking mode for months (entropy bonuses, clipping hacks, length penalties). But what if the entire objective is wrong?
Today, we're releasing LAD (Learning Advantage Distributions), the most elegant rethink of RL for LLM reasoning I've seen this year. #ACL2026
Here's the idea, how it works, and why we think it changes things. π§΅
The problem we kept hitting
GRPO, DAPO, RLOO, and many other variants do the same thing at their core: maximize expected reward. And when you do that, your policy can collapse onto a single dominant reasoning path. Entrop regularization can act as a bolt onto the framework, but it doesn't fundamentally fix it from the ground up.
The key insight
π‘Stop maximizing. Start matching.
We reframe the policy update as a distribution matching problem. Instead of pushing toward the single best response, we make the policy's output distribution match the full advantage-weighted target distribution by minimizing an f-divergence between the two (see our theory in Section 3.1).
When you match the full advantage distribution, you naturally preserve probability mass across multiple valid reasoning paths. High-advantage responses get upweighted, yes, but the objective also suppresses overconfident probability growth on any single mode.
Collapse prevention isn't an afterthought.
What validated the theory
We tested six divergence families. The result that convinced us we were on the right track:
- Strict divergences (Total Variation, Hellinger, Jensen-Shannon) that enforce exact distributional matching consistently outperform weaker ones (such as KL).
- The more faithfully you learn the full advantage distribution, the better the reasoning. This is exactly what the framework predicts.
The results
- In a controlled bandit setting. LAD recovers multiple-mode advantage distributions (see plot below). GRPO fundamentally cannot. This is the clearest demonstration that the paradigm difference is real, not just theoretical
- In math and code reasoning tasks across multiple LLM backbones. LAD consistently outperforms GRPO on both accuracy AND generative diversity across benchmarks.
Why this matters beyond benchmarks
Pass@k scaling: If your model knows 5 valid reasoning paths instead of 1, sampling at inference becomes massively more effective.
Simplicity: Instead of stacking "GRPO + entropy hack," you get one principled objective. Diversity preservation comes by design.
Paper: https://t.co/Vs8TpzjiGH
Code is available; link in the paper.
Huge credit to my amazing student @Wendi_Li_, who drove this work, thinks boldly, and made things happen.
In our paper "Towards a Science of AI Agent Reliability" we put numbers on the capability-reliability gap. Now we're showing what's behind them!
We conducted an extensive analysis of failures on GAIA across Claude Opus 4.5, Gemini 2.5 Pro, and GPT 5.4.
Here's what we found β¬οΈ
Introducing π¨ππππππππ πΉππππ ππππ: Rethinking depth-wise aggregation.
Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
πΉ Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
πΉ Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
πΉ Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
πΉ Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
πFull report:
https://t.co/u3EHICG05h
Excited to release PostTrainBench v1.0!
This benchmark evaluates the ability of frontier AI agents to post-train language models in a simplified setting.
We believe this is a first step toward tracking progress in recursive self-improvement π§΅:
In Agent RL, models suffer from Template Collapse.
They generate vast, diverse outputs (High Entropy) that lose all meaningful connection to the input prompt (Low Mutual Information).
In other words, agent learn different ways to say nothing.
π Introducing RAGEN-v2 -- Here's how we define and fix such silent failure modes in Agent RL. π§΅
π§΅ Excited to share our new paper at #ICLR2026: LUMINA: Detecting Hallucinations in RAG System with ContextβKnowledge Signals
With @tanwimallick and @SharonYixuanLi
Paper: https://t.co/wQ0xLHI1nI
Code: https://t.co/by25osWkfL
1/N
β We also statistically validate that our scores actually measure what they claim to measure β not just correlate with hallucination. All 4 hypothesis tests pass across Llama2, Llama3, and Mistral.
Check out the paper & code, and let us know what you think! π
7/N
π‘οΈ Robustness:
β’ LUMINA works even with noisy retrieved documents
β’ LUMINA doesn't require the same LLM for generation and detection (proxy LLM setting works!)
6/N