🚨Introducing: Ego-MC-Bench (Mistake Corrections) benchmark and Ego-CoMist (Counterfactual Mistakes) dataset.
🎯Ego-MC-Bench: Where AI assistants need to intervene at the right time (when) and with the right feedback (what) to prevent mistakes.
👉https://t.co/zFNEZWAMpt
1/4
For more details,
📝Paper: https://t.co/HF8oYLEQci
🌐Project page: https://t.co/TcGmcSKizg (code coming soon!)
Work done in collab with @apratimbh, Daniel Dijkman, and @RolandMemisevic
n/n
📢Excited to be presenting our work on memory + VLAs at ICRA'26 this Thursday morning (poster 224).
We found that a super simple language-based scratchpad with spatial and temporal grounding goes a long way in imparting memory to VLAs.
1/n
We also extend our prior ClevrSkills benchmark with a memory-dependent split, ClevrSkills-Mem, which includes 5 non-Markovian tasks to evaluate memory-augmented VLAs.
3/n
🚨Blog: Why is interactive task guidance one of the cleanest benchmarks for real-world multimodal intelligence: https://t.co/EomcvTGLcB
TL;DR it combines many of the core challenges of real-world intelligence in a single setting.
🏆 @CVPR 2026 Challenges: https://t.co/SXnVkeivVs
How much data do we need to unlock compositional generalization? This remains a key question. 🤔
Meanwhile, we have released an easy-to-use benchmark to test compositional generalization here [NeurIPS 2024]:
🎯ClevrSkills: https://t.co/RhDJLdXJRk
🚨Have work in progress or an accepted @CVPR 2026 paper? Submit to the 2nd VAR Workshop!
🎯Topics include:
• Streaming VLMs
• Real-time activity understanding
• VLM grounding
• Egocentric video understanding
• Language & robot learning
👉https://t.co/jve18jCqUc
🚨We are presenting the Qualcomm Live Cooking Dataset, accepted to NeurIPS 2025, at the ICBINB and MMIntelligence workshops @iclr_conf 2026.
🎯We are organizing the AI Coach: Cooking competition at the VAR Workshop @CVPR 2026.
👉Win exciting prizes: https://t.co/jve18jCqUc ❗️
🚨🚨🚨 Introducing the AI Coach Challenge at the 2nd VAR Workshop @CVPR 2026
👉 Answers are passive; guidance is active. Don't just build a model that watches; build one that intervenes.
Details: https://t.co/2ZlUrh8N1E
Transformers are data‑hungry in sequential tasks because they lack the right inductive bias.
It’s well known that for many sequential problems (from adding numbers to step‑by‑step agentic execution and multi‑hop reasoning), transformers fail to generalize to longer sequences than they were trained on. “Train short, test long” often fails.
The usual workaround is to "just train on whatever length you’ll need at test time".
---------
📉 But we show the consequence of this is data inefficiency:
• Transformers can learn tasks for a single fixed sequence length fairly efficiently, but learning across multiple lengths requires much more data.
• More importantly, transformers tend not to share mechanisms across tasks of different lengths; instead, they often learn isolated, length‑specific solutions.
---------
🧪 A simple way to test this:
Consider modular addition (with and without CoT). Train a model to add 2, 3, …, L numbers at once and measure the data needed. Then train separate models for each length (2, 3, …, L) and sum their data requirements.
💡The intuition:
If a model truly shares mechanisms across lengths, learning a distribution of lengths should require far fewer samples than learning each length separately.
This comes from amortizing the learning cost: data for length n also helps the model learn length n+k.
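A minimal sketch of how training data for this test could be generated, assuming a small modulus and uniformly sampled operands. The modulus of 7, the function names, and the sampling scheme are illustrative assumptions, not the paper's actual setup:

```python
import random

def make_example(length, modulus, rng):
    """Sample `length` operands in [0, modulus) and return them with their sum mod `modulus`."""
    operands = [rng.randrange(modulus) for _ in range(length)]
    return operands, sum(operands) % modulus

def make_dataset(lengths, n_per_length, modulus=7, seed=0):
    """Build a dataset with `n_per_length` examples for every length in `lengths`.

    Passing a single length gives the "separate model" setting;
    passing all lengths 2..L gives the "joint" setting.
    """
    rng = random.Random(seed)
    return [
        make_example(length, modulus, rng)
        for length in lengths
        for _ in range(n_per_length)
    ]

# Joint setting: lengths 2..5 mixed in one dataset (hypothetical sizes).
joint = make_dataset(lengths=range(2, 6), n_per_length=3)
# Separate setting: one dataset per length.
separate = {n: make_dataset(lengths=[n], n_per_length=3) for n in range(2, 6)}
```

The data needed in each setting would then be measured by training on growing subsets until the model reaches a chosen accuracy threshold; that threshold is also a choice the tweet does not specify.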
---------
📊 Results:
Sharing Factor κ = (sum of samples to learn each length separately) ÷ (samples to learn all lengths jointly)
- κ > 1: mechanism sharing and amortized learning.
- κ ≈ 1: learning length-specific solutions in isolation.
- κ < 1: destructive interference; length-specific solutions compete for model capacity.
Transformers showed low sharing factors, and even destructive interference with CoT.
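The definition of κ above can be sketched in a few lines. The sample counts below are made-up placeholders for illustration, not results from the paper, and the tolerance band around κ ≈ 1 is an assumption:

```python
def sharing_factor(separate_samples, joint_samples):
    """kappa = (sum of samples to learn each length separately) / (samples to learn all lengths jointly)."""
    if joint_samples <= 0:
        raise ValueError("joint_samples must be positive")
    return sum(separate_samples.values()) / joint_samples

def interpret(kappa, tol=0.1):
    """Map kappa to the three regimes; `tol` (assumed) sets how close to 1 counts as 'about 1'."""
    if kappa > 1 + tol:
        return "mechanism sharing"
    if kappa < 1 - tol:
        return "destructive interference"
    return "isolated length-specific solutions"

# Hypothetical sample counts per length (placeholders, not measured values).
separate = {2: 1000, 3: 1000, 4: 1000}
kappa = sharing_factor(separate, joint_samples=1500)
print(kappa, interpret(kappa))
```

With these placeholder numbers the joint model needs half the total data of the separate models, so κ = 2 and the regime reads as mechanism sharing.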
---------
✨ Implications:
This suggests that end-to-end learning in applied agentic settings, like robotics or GUI control, could be even more challenging.
If data requirements grow unfavorably with sequence length, that might also help explain the persistent issues we see at large context lengths (e.g., context rot).
The standard attention mechanism appears inefficient for step-by-step tasks, and we may ultimately be better off with recurrent agents.
Excited to share our #ICLR2026 paper: "Enhancing Hallucination Detection through Noise Injection."
We turn SOTA LLMs into Bayesian models—without training. By injecting noise, we capture aleatoric & epistemic uncertainty for reliable hallucination detection with minimal cost.
🚨"Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?" #NeurIPS2025
Check out the Qualcomm Interactive Cooking Dataset for proactive mistake-aware task guidance.
📅Wed, Dec 3 11:00 AM – 2:00 PM PST
📌Exhibit Hall C,D,E #5403
Project page: https://t.co/MYLv1808Oa
🚨 “Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?” #NeurIPS2025 🚨
We explore three core capabilities for step-by-step task guidance: delivering correct instructions, recognizing successful completions, and providing corrective feedback when errors occur.
1/5
📢📢📢Our team at Qualcomm AI Research is hiring Research Interns for Summer 2026 in Toronto 🇨🇦 to work on multi-modal LLMs and embodied AI.
👉Apply here:
1) Embodied AI: https://t.co/JjjbQTVkva
2) Multi-modal LLMs:
https://t.co/itVz4WdTPS
📢Our team at Qualcomm AI Research is also hiring a Research Intern for Summer 2026 in Toronto 🇨🇦 to work on end-to-end Embodied AI.
Details: https://t.co/BnYgyfT281
Which data is best for training few-shot imitation policies for robot manipulation?
Some think it’s the data that looks similar, or has similar motion, or comes with related language labels. They are all right AND wrong: depending on the task, sometimes this similarity helps but sometimes it is detrimental.
Presenting our #CoRL2025 work, COLLAGE 🎨, which adaptively and efficiently combines data subsets to learn effective policies on target tasks. 🧵
Behind every great conference is a team of dedicated reviewers. Congratulations to this year’s #CVPR2025 Outstanding Reviewers!
https://t.co/z8w4YJKTep
Call for Participation @CVPR : Multi-Modal LLMs - prepare to engage in a dynamic, face-to-face conversation with a real human user!
Details: https://t.co/SXnVkej3L0
🚨🚨🚨The winning teams will receive a prize and a contributed talk.
P.S. GPT-4o does not do too well.
🚀 How can we create interactive Physical Digital Twins from videos?
Thrilled to share our latest work: PhysTwin! 🌟 Using inverse physics optimization, we generate photo-realistic, physically accurate, and real-time interactive virtual replicas. 🔥
🔗https://t.co/Mmdm66fQ3s