🚀Announcing MuRGAt!
MLLMs are improving at reasoning over complex multimodal inputs, but does that translate to faithful grounding to multimodal sources (video, audio, charts, etc.)? We find that even strong MLLMs often hallucinate citations despite getting the answer correct!🤯
We introduce a benchmark for Fact-Level Multimodal Attribution featuring:
✅ High-quality Human Annotations for validation.
✅ MuRGAt-SCORE: A decomposed metric that highly correlates with human judgment.
✅ Methods to improve citations, showing that Programmatic Grounding boosts attribution.
🧵👇
Looking forward to giving a keynote at the Midwest Machine Learning Symposium (MMLS) 2026 (being held at Purdue University this year) & meeting folks from all the strong universities in the midwest, with their inspiring, long tradition of these exciting symposiums! 🙂
👇👇
🥳 Excited to share that MuRGAt is accepted to #ICML2026!
Even strong MLLMs hallucinate citations to multimodal sources (video, audio, charts). Our new Fact-Level Multimodal Attribution benchmark tackles this by:
🕐 Requiring fine-grained temporal & per-modality citations (vs. just source-level)
🔍 Distinguishing verifiable claims from reasoning steps to evaluate multi-step responses
We also introduce MuRGAt-SCORE, a reference-free, decomposed metric aligned with human judgment, and show that Programmatic Grounding substantially boosts attribution!
🎉 Excited to share that our work on intrinsic dimensionality of reasoning has been accepted to #ICML2026 as a ✨spotlight✨ (top 2.2%)!
We analyze the effectiveness of teaching a model how to reason via the lens of intrinsic dimensionality (the minimum effective capacity a model needs to solve the task) and find that effective reasoning chains are inherently compressive!
Across Gemma-3 1B and 4B, lower intrinsic dimensionality strongly predicts not only in-distribution accuracy (GSM8K), but also robustness on OOD benchmarks (GSM-Hard, GSM-Symbolic, GSM-IC) -- outperforming reasoning length, token perplexity, and KL divergence.
Stay tuned for more results and exciting updates in the camera-ready! 🚀
🎉 Happy to share our paper introducing GCMs for confidence estimation based on historical predictions has been accepted to #ICML2026!
We find that models are no better at learning to predict their own correctness than others', i.e., they don't have privileged self-access (given training).
Training smaller models to predict the correctness of many models generalizes and leads to better calibration than self-reported confidence from much larger models!
Check out 🧵 for more
Glad that GCMs for analyzing confidence estimation from historical predictions was accepted to #ICML2026!
We examine whether models have an advantage when predicting their own correctness and confidence, and find that little usable privileged information exists for confidence prediction. This leads us to train Generalized Correctness Models to predict the calibrated confidence and correctness of models, outperforming the logit and verbalized confidences of much larger models! Thanks to @vaidehi_patil_, @hyunji_amy_lee, @EliasEskin, and @mohitban47!
See more in 🧵below!
🎉 Excited to share EPiC is accepted to #ICML2026!
We show that learning precise camera control for video diffusion doesn't need expensive 3D supervision or large-scale data. No camera or point cloud processing — just mask source videos based on visibility to construct precise training anchor videos, and learn a SoTA camera controller with only 30M params, trained >100× faster on >100× less data than prior work, while generalizing across both I2V and V2V camera control tasks.
Excited to share that Symbolic-MoE has been accepted to #ICML2026! 🎉
We find that adaptive instance-level "mixture-of-experts" yields a +8.15% average gain over the best multi-agent baseline, while being almost 2x faster and generalizing to unseen tasks!
Existing multi-LLM/multi-agent setups can improve reasoning, but they use a fixed set of LLMs, making them sensitive to the choice of model + hard to scale due to the high cost of multi-round agent discussion.
Symbolic MoE instead adaptively recruits the most relevant expert for each instance based on the skills needed and the strengths of each model. In addition to beating baselines in performance, Symbolic-MoE is also more efficient: it skips the expensive multi-round discussion, and our novel batching method allows us to integrate 16 experts on a single GPU!
🧵👇
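The per-instance recruitment idea described above can be illustrated with a toy sketch. This is not the Symbolic-MoE implementation: the function name `recruit_experts`, the skill-profile dictionaries, and the additive scoring rule are all assumptions made for illustration only.

```python
from typing import Dict, List


def recruit_experts(
    required_skills: List[str],
    model_profiles: Dict[str, Dict[str, float]],
    k: int = 1,
) -> List[str]:
    """Pick the k models whose (hypothetical) skill profile best matches an instance.

    required_skills: skills the instance is assumed to need.
    model_profiles: hypothetical per-model strength score for each skill.
    """
    # Score each model by summing its strength on the skills this instance needs.
    scores = {
        model: sum(profile.get(skill, 0.0) for skill in required_skills)
        for model, profile in model_profiles.items()
    }
    # Highest score first; ties are broken by model name for determinism.
    ranked = sorted(scores, key=lambda m: (-scores[m], m))
    return ranked[:k]


# Hypothetical profiles, made up for this example.
profiles = {
    "math_model": {"algebra": 0.9, "biology": 0.2},
    "bio_model": {"algebra": 0.3, "biology": 0.8},
}
print(recruit_experts(["algebra"], profiles))
print(recruit_experts(["biology"], profiles))
```

Because the selection depends only on the instance's skills, a different expert can be chosen for every question instead of fixing one set of models up front.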
🚨 Excited to announce SAS: Stabilizing Efficient Reasoning with Step-Level Advantage Selection is accepted to #ACL2026 Findings.
LLMs can compress reasoning during post-training with short context alone, but this silently destabilizes training. We propose SAS to fix it by zeroing advantages at the step level: suppressing low-confidence steps in correct rollouts + shielding high-confidence ones in failed rollouts.
✨ Result: >30% shorter reasoning traces, higher accuracy, and stable training – all from a simple zero-advantage operation at the step level.
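A minimal sketch of what step-level advantage zeroing could look like, assuming each reasoning step comes with a scalar confidence and each rollout has a single advantage value. The function name `select_step_advantages` and both thresholds are hypothetical and not taken from the paper.

```python
from typing import List


def select_step_advantages(
    rollout_advantage: float,
    is_correct: bool,
    step_confidences: List[float],
    low_threshold: float = 0.3,   # hypothetical cutoff for "low confidence"
    high_threshold: float = 0.7,  # hypothetical cutoff for "high confidence"
) -> List[float]:
    """Broadcast a rollout-level advantage to steps, zeroing selected steps."""
    step_advantages = []
    for confidence in step_confidences:
        if is_correct and confidence < low_threshold:
            # Correct rollout: do not reinforce a low-confidence step.
            step_advantages.append(0.0)
        elif not is_correct and confidence > high_threshold:
            # Failed rollout: shield a high-confidence step from the penalty.
            step_advantages.append(0.0)
        else:
            step_advantages.append(rollout_advantage)
    return step_advantages


# Correct rollout: the low-confidence middle step gets zero advantage.
print(select_step_advantages(1.0, True, [0.9, 0.1, 0.5]))   # [1.0, 0.0, 1.0]
# Failed rollout: the high-confidence first step is shielded.
print(select_step_advantages(-1.0, False, [0.9, 0.1, 0.5]))  # [0.0, -1.0, -1.0]
```

Steps with a zero advantage contribute nothing to a policy-gradient update, so the selection acts as a mask rather than a new reward signal.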
Our #ICLR2026 paper on inferring an executable, symbolic world model offline from one life in a hostile environment (One Life to Learn) will be presented by @EliasEskin (I couldn't make it to #ICLR2026)!
If you're interested in world modeling, open-ended exploration, or neurosymbolic methods, DM / email me or stop by and chat with Elias!
📅 Sat, Apr 25, 10:30AM — 1PM BRT
📍Pavilion 4 P4-#4916
https://t.co/FajRXXqALQ
Honored to receive the @NSF Graduate Research Fellowship! 🎉 A huge thank you to my mentors @mohitban47 and @EliasEskin for guiding me on my research journey, and to the brilliant MURGe Lab graduate students and postdocs who have supported me along the way during my time in @unccs.
My prior work has studied enhancing multi-agent reasoning through more advanced tool recruitment (DART) and also increasing model faithfulness (REMuL). I’ve been fortunate to apply these multi-agent ideas to impactful applications across education and healthcare. I hope to continue working on improving multimodal understanding and collaborative AI agent systems in the future!
I won’t be attending #ICLR2026, but @LucasPCaccia and @EliasEskin will be presenting our work, Gistify!
We study whether coding agents can truly understand a repository by extracting its gist: generating a single, self-contained, executable file that reproduces the behavior of a target command (e.g., a test or entrypoint). It is a lightweight, broadly applicable evaluation of codebase-level reasoning!
I’d also love to connect online. Feel free to reach out!
📅 3:15 PM-5:45 PM Thu, Apr 23, 2026
📍 Pavilion 3, Poster 1020
Not able to attend #ICLR2026 in person, but would love to discuss NuRL (Nudging the Boundaries of LLM Reasoning) online!
If you're interested in "nudging" the upper bound of LLM reasoning using self-generated hints, feel free to ask any questions via the ICLR online chat / send me emails!
📅 Thu, Apr 23, 3:15 PM – 5:45 PM
📍 Pavilion 4 P4-#4604
📰 https://t.co/yglcjzTqNT
Honored to receive the @NSF Graduate Research Fellowship (GRFP) 🎉. I’m deeply grateful to have conducted research @unccs in the MURGe-Lab under the guidance of @mohitban47 and @EliasEskin, and for the support of the many PhD students and Postdocs who shaped my experience at UNC and helped me grow through my projects.
My previous work leveraged a scientific understanding / interpretability perspective to compress LLMs via quantization (Task-Circuit Quantization) and also to predict their correctness (General Correctness Models). More recently, I used RL in long-horizon tasks to study ToM (AI Double Agents). These works have inspired me to focus on developing a better scientific understanding of, and improved memory for, foundation models in the future!
Had a great time visiting the University of Virginia last week for their CS Distinguished Speaker Series -- talked about skill discovery + improvement as well as some newer work in mixture-of-agents' skill profiling + matching + disagreement resolution, in the context of building trustworthy (calibrated, controllable, collaborative) multimodal AI agents (incl. education and healthcare applications)!
Thanks again for the kind invitation and exciting discussions Yangfeng, Haiying, Aidong, Tom, Yen-Ling, Jundong, Rohan, et al. (and for the nice photos) 🙏
PS. Also found many similarities between UVA and UNC campuses w.r.t. the beautiful hilly terrain and drives, the big central quad/lawn, the traditional brick architecture (and the amazing farmers markets in both places) 🙂
Thanks @_akhaliq for sharing our work!
If you are interested in multimodal agents on open-web search, please see our thread for more details: https://t.co/uh9rr6hFcZ
🚨 Check out MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments --> diagnoses that even strong search agents e.g. Gemini 3/3.1, GPT-5.4 struggle (avg 22%, best 40%) -- and compared to humans, they check 50x more irrelevant pages, over-explore with repeated searches, and heavily favor text modality (87%)!
✅ Covers text, image, video, and audio
✅ Reflects real web noise and conflicts (not synthetic)
✅ Queries have NO explicit references to specific modality sources
✅ Multi-hop reasoning required
✅ Human-annotated with open-web search
🧵👇
Real-world web search 🔎 requires handling messy, multimodal information. It turns out current models – even ones that do well on text-based search – struggle on the complex, expert-vetted, multi-hop queries we introduce in MERRIN!
Agent search behavior differs substantially from humans: even the strongest agents perform poorly and search inefficiently, visiting 50x more sites than human searchers. MERRIN represents a new front for both search agent accuracy and efficiency, with plenty of room for improvement!
🧵👇
The web is full of noisy, multimodal information, and many user questions require multi-hop reasoning across various modalities without explicit modality cues.
Can models choose the right modalities and follow the correct reasoning path? To study this, we introduce MERRIN.
We find that even strong search agents struggle (avg 22.3%, best 40.1%). Compared to humans, they check ~50× more irrelevant pages, over-explore with repeated searches, and heavily favor text modality (87%).
This highlights the need for search agents that can reason efficiently and use the right modalities in a noisy web environment, and MERRIN is a strong benchmark to evaluate them!
More details in the thread below ⬇️