Sanjay Haresh

about 1 month ago

For more details, 📝Paper: https://t.co/HF8oYLEQci 🌐Project page: https://t.co/TcGmcSKizg (code coming soon!) Work done in collab with @apratimbh, Daniel Dijkman, and @RolandMemisevic n/n

about 1 month ago

📢Excited to be presenting our work on memory + VLAs at ICRA'26 this Thursday morning (poster 224). We found that a super simple language-based scratchpad with spatial and temporal grounding goes a long way in imparting memory to VLAs. 1/n

AI4Engineering, Prometheus, Autodesk, SFU, CMU

about 1 month ago

We also extend our prior ClevrSkills benchmark with a memory-dependent split, ClevrSkills-Mem, which includes 5 non-Markovian tasks for evaluating memory-augmented VLAs. 3/n

SanjayHaresh retweeted

2 months ago

🚨Blog: Why is interactive task guidance one of the cleanest benchmarks for real-world multimodal intelligence? https://t.co/EomcvTGLcB TL;DR: it combines many of the core challenges of real-world intelligence in a single setting. 🏆 @CVPR 2026 Challenges: https://t.co/SXnVkeivVs

SanjayHaresh retweeted

3 months ago

How much data we need to unlock compositional generalization remains a key question. 🤔 Meanwhile, we have released an easy-to-use benchmark to test compositional generalization here [NeurIPS 2024]: 🎯ClevrSkills: https://t.co/RhDJLdXJRk

SanjayHaresh retweeted

3 months ago

🚨Have work in progress or an accepted @CVPR 2026 paper? Submit to the 2nd VAR Workshop!

🎯Topics include:
• Streaming VLMs
• Real-time activity understanding
• VLM grounding
• Egocentric video understanding
• Language & robot learning

👉https://t.co/jve18jCqUc

SanjayHaresh retweeted

3 months ago

🚨We are presenting the Qualcomm Live Cooking Dataset accepted to NeurIPS 2025 at the ICBINB and MMIntelligence workshops @iclr_conf 2026.

🎯We are organizing the AI Coach: Cooking competition at the VAR Workshop @CVPR 2026.

👉Win exciting prizes: https://t.co/jve18jCqUc❗️

SanjayHaresh retweeted

5 months ago

🚨🚨🚨 Introducing the AI Coach Challenge at the 2nd VAR Workshop @CVPR 2026 👉 Answers are passive; guidance is active. Don't just build a model that watches but one that intervenes. Details: https://t.co/2ZlUrh8N1E

SanjayHaresh retweeted

Reza Ebrahimi

@rzebrahimi

4 months ago

Transformers are data‑hungry in sequential tasks because they lack the right inductive bias.

It’s well known that for many sequential problems (from adding numbers to step‑by‑step agentic execution and multi‑hop reasoning), transformers fail to generalize to longer sequences than they were trained on. “Train short, test long” often fails.

The usual workaround is to "just train on whatever length you’ll need at test time".

---------
📉 But we show the consequence of this is data inefficiency:

• Transformers can learn tasks for a single fixed sequence length fairly efficiently, but learning across multiple lengths requires much more data.

• More importantly, transformers tend not to share mechanisms across tasks of different lengths; instead, they often learn isolated, length‑specific solutions.

---------
🧪 A simple way to test this:
Consider modular addition (with and without CoT). Train a model to add 2, 3, …, L numbers at once and measure the data needed. Then train separate models for each length (2, 3, …, L) and sum their data requirements.

💡The intuition:
If a model truly shares mechanisms across lengths, learning a distribution of lengths should require far fewer samples than learning each length separately.

This comes from amortizing the learning cost: data for length n also helps the model learn length n+k.

---------
📊 Results:

Sharing Factor κ = (sum of samples to learn each length separately) ÷ (samples to learn all lengths jointly)

- κ > 1: mechanism sharing and amortized learning.
- κ ≈ 1: learning length-specific solutions in isolation.
- κ < 1: destructive interference; length-specific solutions compete for model capacity.

Transformers showed low sharing factors, and even destructive interference with CoT.

---------
✨ Implications:
This suggests that end-to-end learning in applied agentic settings, like robotics or GUI control, could be even more challenging.

If data requirements grow unfavorably with sequence length, that might also help explain the persistent issues we see at large context lengths (e.g., context rot).

The standard attention mechanism appears inefficient for step-by-step tasks, and we may ultimately be better off with recurrent agents.
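
For readers who want to compute the sharing factor themselves, here is a minimal Python sketch of the κ definition from the post. The function names, the tolerance band around κ ≈ 1, and the sample counts are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Mapping


def sharing_factor(separate_samples: Mapping[int, float], joint_samples: float) -> float:
    """kappa = (sum of samples to learn each length separately)
    / (samples to learn all lengths jointly)."""
    if joint_samples <= 0:
        raise ValueError("joint_samples must be positive")
    return sum(separate_samples.values()) / joint_samples


def interpret(kappa: float, tol: float = 0.1) -> str:
    """Map kappa to the three regimes in the post.
    The tolerance `tol` for 'kappa is about 1' is an assumption."""
    if kappa > 1 + tol:
        return "mechanism sharing"
    if kappa < 1 - tol:
        return "destructive interference"
    return "isolated length-specific solutions"


# Hypothetical sample counts per sequence length (not real results).
separate = {2: 1000, 3: 1500, 4: 2500}
kappa = sharing_factor(separate, joint_samples=4000)
print(kappa, interpret(kappa))
```

With these made-up numbers, the separate models need 5000 samples in total against 4000 for the joint model, so κ = 1.25 and the sketch reports mechanism sharing.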

SanjayHaresh retweeted

Litian Liu @litianliuphd

5 months ago

Excited to share our #ICLR2026 paper: "Enhancing Hallucination Detection through Noise Injection." We turn SOTA LLMs into Bayesian models—without training. By injecting noise, we capture aleatoric & epistemic uncertainty for reliable hallucination detection with minimal cost.

SanjayHaresh retweeted

7 months ago

🚨"Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?" #NeurIPS2025 Check out the Qualcomm Interactive Cooking Dataset for proactive mistake-aware task guidance. 📅Wed, Dec 3 11:00 AM – 2:00 PM PST 📌Exhibit Hall C,D,E #5403 Project page: https://t.co/MYLv1808Oa

7 months ago

Super excited to be presenting this work at #NeurIPS2025! Come visit our poster on Wednesday if you want to talk about live, situated assistants.

7 months ago

🚨 “Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?” #NeurIPS2025 🚨 We explore three core capabilities for step-by-step task guidance: delivering correct instructions, recognizing successful completions, and providing corrective feedback when errors occur. 1/5

SanjayHaresh retweeted

9 months ago

📢📢📢Our team at Qualcomm AI Research is hiring Research Interns for Summer 2026 in Toronto 🇨🇦 to work on multi-modal LLMs and embodied AI. 👉Apply here: 1) Embodied AI: https://t.co/JjjbQTVkva 2) Multi-modal LLMs: https://t.co/itVz4WdTPS

SanjayHaresh retweeted

Sateesh Kumar @sateeshk21

9 months ago

📢Our team at Qualcomm AI Research is also hiring a Research Intern for Summer 2026 in Toronto 🇨🇦 to work on end-to-end Embodied AI. Details: https://t.co/BnYgyfT281

SanjayHaresh retweeted

9 months ago

Which data is best for training few-shot imitation policies for robot manipulation? Some think it’s the data that looks similar, or has similar motion, or comes with related language labels. They are all right AND wrong: depending on the task, sometimes this similarity helps but sometimes it is detrimental. Presenting our #CoRL2025 work, COLLAGE 🎨, which adaptively and efficiently combines data subsets to learn effective policies on target tasks. 🧵

about 1 year ago

Okay this is nice. Happy to be recognized among the outstanding reviewers! :)

#CVPR2025 @CVPR

about 1 year ago

Behind every great conference is a team of dedicated reviewers. Congratulations to this year’s #CVPR2025 Outstanding Reviewers! https://t.co/z8w4YJKTep

SanjayHaresh retweeted