Mohamed @mohammad2012191 - Twitter Profile

Pinned Tweet

@mohammad2012191

26 days ago

🚨 New preprint: GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs

🔵TL;DR: a new sub-quadratic, training-free inference method for video VLMs.
➡️3.36× less compute, no accuracy loss, fully interpretable at the cell level, adaptive frame selection using VLM reasoning rather than CLIP-like tricks!

🔥Spoiler alert: Surprisingly, question difficulty alone tells you how many frames a long-video VLM needs to answer it! We turn this into a closed-form rule!!

mohammad2012191 retweeted

Xiuyu Li

@sheriyuo

1 day ago

https://t.co/aJEXibSlmN

mohammad2012191 retweeted

Vivek

@itsreallyvivek

2 days ago

https://t.co/HV0irzdHax

Mohamed

@mohammad2012191

2 days ago

If you're attending #CVPR2026 in Denver, don't forget to stop by and see our teammate Ali Habibullah presenting our work on video retrieval. He'd love to chat about it!

📌 Vote-in-Context: VLMs as Explainable Zero-Shot Rank Fusers
📆 When: Saturday Morning, June 6 | 7:30 – 9:00 AM
📍 Location: Poster Session 2 — Exhibit Hall A | Poster #309
📄 Paper Link: https://t.co/M49wQGkCT4

Co-authors: Mohamed Eltahir, Ali Habibullah, Lama A., Tanveer Hussain, Naeemullah Khan.

Focus: using Vision-Language Models as explainable, zero-shot rank fusers for video retrieval, combining ranked lists from multiple retrieval systems without any additional training. It saturates most existing video retrieval benchmarks!

mohammad2012191 retweeted

Dwarkesh Patel

@dwarkesh_sp

4 days ago

Recently met @srush_nlp and he started giving me an impromptu lecture on how targeted on-policy self-distillation works. I asked him if I could record it on my iPhone.

The basic idea is this: if the model made a mistake at some point in the rollout (for example, calling a tool that doesn't exist), we want to discourage this specific error, but we don't want to just learn from the final reward, because it's a very noisy signal spread out over the whole trajectory.

So we have another model read this trajectory and figure out where the error was made. It simply inserts some hint tokens into the part of the trajectory right above where the mistake was made.

Now, with these injected hint tokens, have the model run a forward pass. You don't have to regenerate a new rollout, so no new decode is required. The hint causes the model to assign lower probabilities to the error tokens. You then train the original model to match these new probabilities, teaching it to downweight that specific mistake.
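
A minimal sketch of the matching step described in the tweet above, using toy logits in place of real model outputs. The function names are illustrative and do not come from the lecture, and the choice of KL divergence with the hinted distribution as the fixed target is an assumption about how "match these new probabilities" might be implemented.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted):
    """KL(target || predicted) for two discrete distributions."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)

def hint_distillation_loss(student_logits, hinted_logits):
    """Average per-position KL between hinted and unhinted predictions.

    student_logits: logits from the plain forward pass, one list per position.
    hinted_logits:  logits for the same positions from a forward pass with
                    hint tokens inserted before the mistake. They are treated
                    as a fixed target, so no new rollout is decoded.
    """
    losses = [
        kl_divergence(softmax(h), softmax(s))
        for s, h in zip(student_logits, hinted_logits)
    ]
    return sum(losses) / len(losses)

# Toy vocabulary of 3 tokens; token 0 stands for the erroneous tool call.
student = [[2.0, 0.5, 0.1]]   # the model currently prefers the error token
hinted = [[-1.0, 1.5, 0.3]]   # with the hint, the error token is downweighted
loss = hint_distillation_loss(student, hinted)
print(f"distillation loss: {loss:.4f}")
```

Minimizing this loss with respect to the student's parameters would push probability mass away from the error token only at the positions that were scored, which is the localized signal the tweet contrasts with a single trajectory-level reward.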

mohammad2012191 retweeted

swyx

@swyx

5 days ago

probably the best reward function for reasoning efficiency i've seen

mohammad2012191 retweeted

Benhao Huang

@huskydogewoof

4 days ago

Insightful work! ARC-AGI is really more of a vision or representation problem than a reasoning problem that requires step-by-step thinking.

The success of HRM and TRM depends heavily on the 500M learned puzzle embeddings stored in `nn.Buffer` for each separate task. These embeddings essentially store task-specific representations, without which the 7M/27M backbones fail completely.

Further evidence comes from the dynamics of the learned computation steps assigned by the Adaptive Computation head. ARC and Sudoku exhibit very different behaviors.
- On ARC, the number of steps smoothly collapses to nearly 1
- whereas on Sudoku, the drop in steps shows a transition between two stages, with different slopes.

mohammad2012191 retweeted

Grant Stenger (hiring)

@GrantStenger

8 days ago

Local minima are rare in high dimensions because a strict local minimum has to curve upward in every direction, so all Hessian eigenvalues must be positive. In a D-dimensional toy model where eigenvalue signs are independent, that’s a 2^(-D) event. In GOE-like random matrix models, positive definiteness is even rarer, roughly exp(-cD^2). So as dimension grows, random critical points are much more likely to be saddles than minima. This is one reason high-dimensional optimization is often a saddle-escape problem, not a bad-local-minimum problem. Wrote up some of the math here: https://t.co/vkaVqVD64N
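
The two rates in the tweet above can be checked empirically with a small Monte Carlo sketch. It assumes a Hessian drawn as a symmetrized Gaussian matrix (GOE up to overall scale) and tests positive definiteness with a hand-written Cholesky factorization; the function names, trial count, and seed are illustrative choices, not taken from the linked write-up.

```python
import math
import random

def is_positive_definite(matrix):
    """Return True if a symmetric matrix admits a Cholesky factorization."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    return False  # a non-positive pivot rules out a strict minimum
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True

def sample_goe(dim, rng):
    """Sample a symmetric Gaussian matrix (GOE up to overall scale)."""
    g = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]
    return [[(g[i][j] + g[j][i]) / 2.0 for j in range(dim)] for i in range(dim)]

def goe_minimum_fraction(dim, trials=4000, seed=0):
    """Fraction of sampled Hessians whose eigenvalues are all positive."""
    rng = random.Random(seed)
    hits = sum(is_positive_definite(sample_goe(dim, rng)) for _ in range(trials))
    return hits / trials

def independent_sign_probability(dim):
    """Toy model: each eigenvalue is positive independently with probability 1/2."""
    return 0.5 ** dim

for dim in (1, 2, 3, 4):
    print(dim, independent_sign_probability(dim), goe_minimum_fraction(dim))
```

For dimensions above 1, the sampled fraction should fall below the toy 2^(-D) value, because eigenvalue repulsion in the Gaussian ensemble makes it harder for every eigenvalue to land on the same side of zero.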

mohammad2012191 retweeted

kalomaze

@kalomaze

8 days ago

the broader implication is that there's abandoned architecture research from before Muon that failed because the empirical optimizers that worked in practice were, both literally and conceptually, stuck in element-wise local minima

mohammad2012191 retweeted

Silvio Martinico

@SilvioMartinico

11 days ago

Late-interaction retrieval is incredibly powerful, but scaling it is computationally challenging. k-means is a huge bottleneck. Our new architecture, TACHIOM, is fully open-source and tackles this problem: up to 247x faster clustering and 9.8x faster retrieval. ⚡ 🧵👇

mohammad2012191 retweeted

Lucas Maes

@lucasmaes_

10 days ago

Would you like to join the research effort on JEPA and World Models easily?

After a full year of hard work, we’re excited to finally release stable-worldmodel:

an open-source, scalable platform built to accelerate JEPA & World Model research!

📄: https://t.co/gnxGvens5A

mohammad2012191 retweeted

Sattam

@SattamAltwaim

10 days ago

🚨 [New Paper] The Adam optimizer is a zombie algorithm...

It senses and adapts the learning rate, sure. But the update rule itself? Fixed, frozen. Decided before training even starts. It works in some regions of the loss landscape and fails in others.

What if the optimizer itself were an agent, free to learn its own trajectory through the landscape and adjust its own update rule at every step? And maybe transfer its learned policy to train models on unseen datasets!

Introducing: PILOT (Policy-Informed Learned OpTimizer)
📄Preprint: https://t.co/vRljBd0AF8
🧵TLDR 👇

mohammad2012191 retweeted

François Chollet

@fchollet

15 days ago

If you can learn one thing that's genuinely novel to you, you can learn anything.

mohammad2012191 retweeted

Sungjin Ahn

@SungjinAhn_

18 days ago

🧠We introduce "Generative Recursive Reasoning"!

Recursive Reasoning Models like HRM, TRM, and Looped Transformers are deterministic — same input, same reasoning, every time. They collapse the entire space of plausible reasoning paths into a single attractor.

Our model GRAM (Generative Recursive reAsoning Models) turns recursion itself into a stochastic latent trajectory. Multiple hypotheses, alternative solution strategies, and inference-time scaling not just by depth, but by width — parallel trajectory sampling.

And here's the kicker: the same formulation that gives us conditional reasoning p(y|x) also makes GRAM a general generative model p(x).

With only 10M params:
• Sudoku-Extreme: 97.0% (TRM 87.4%)
• ARC-AGI-1: 52.0%
• ARC-AGI-2: 11.1%
• N-Queens coverage: 90%+

📄 Paper: https://t.co/JC7EyXYc9Y
🌐 Project page: https://t.co/LRT1dQiWLZ

w/
Junyeob Baek @JunyeobB (KAIST),
Mingyu Jo @pyross0000 (KAIST),
Minsu Kim @minsuuukim (KAIST & Mila),
Mengye Ren @mengyer (NYU),
Yoshua Bengio @Yoshua_Bengio (Mila),
Sungjin Ahn @SungjinAhn_ (KAIST)

mohammad2012191 retweeted

Ben Clavié

@bclavie

17 days ago

Information Retrieval is about making knowledge accessible. Late Interaction is the best way to do that today. But now that we have a new kind of user, it's time to zoom out so we can plan the future of retrieval. I gave a talk about this at @ir_tsukuba https://t.co/q0kd4WSeqA

mohammad2012191 retweeted

Benhao Huang

@huskydogewoof

19 days ago

HRM-Text paper is here:
https://t.co/P5PqSFTYXY

Just finished reading it as a deeper dive. I went in with a connected set of researcher-style questions:

Is the gain really from HRM? To answer that, we first need to separate out the objective: how much comes from computing loss only on response tokens? Then, how much comes from PrefixLM? Finally, if we remove PrefixLM, how strong is causal-only HRM?

What I appreciate is that the paper gives enough ablations to answer this chain pretty directly.

1/ First, there is real architectural signal.
-------------------------------------------------

Table 3 compares model architectures, objectives, and attention masks under the same FLOPs budget. The average scores are:

Transformer, P(x), causal 41.9
HRM, P(x), causal 51.5

Transformer, P(a|q), causal 57.6
HRM, P(a|q), causal 62.3

Transformer, P(a|q), PrefixLM 65.3
HRM, P(a|q), PrefixLM 73.4

So even under causal attention, HRM still wins over the matched Transformer.

2/ That said, I would read the final headline number carefully. It is not “HRM architecture alone.” It is:
-------------------------------------------------

HRM architecture
+ response-only objective
+ PrefixLM
+ instruction/reasoning-heavy data

2.1/ PrefixLM is a big piece. In PrefixLM, prompt tokens can attend bidirectionally, while answer tokens are still generated autoregressively.

So the prompt side becomes somewhat encoder-like, while the answer side stays decoder-style.
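
The attention pattern just described can be sketched as a boolean mask. This is an illustration of the general PrefixLM idea, not code from the paper; `prefix_lm_mask` and its parameters are hypothetical names.

```python
def prefix_lm_mask(prefix_len, total_len):
    """Build a mask where mask[i][j] is True if position i may attend to j.

    Prompt positions (index < prefix_len) are visible to every position,
    so the prompt attends bidirectionally. Answer positions are visible
    only to themselves and to later positions, as in a causal mask.
    """
    return [
        [j < prefix_len or j <= i for j in range(total_len)]
        for i in range(total_len)
    ]

def causal_mask(total_len):
    """A standard causal mask is the special case of an empty prefix."""
    return prefix_lm_mask(0, total_len)

mask = prefix_lm_mask(prefix_len=3, total_len=5)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
# 111..
# 111..
# 111..
# 1111.
# 11111
```

The first three rows show the encoder-like prompt block, and the last two rows show the decoder-style answer tokens, which see the whole prompt plus earlier answer tokens.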

Empirically:

Transformer:
causal response-only -> PrefixLM
57.6 -> 65.3 (+7.7)

HRM:
causal response-only -> PrefixLM
62.3 -> 73.4 (+11.1)

This is a strong improvement, but it also raised my first deployment concern.

In multi-turn chat, bidirectional prompt attention means you need special mask / KV-cache handling. You cannot simply treat it as the usual append-only causal cache in AR models. To their credit, the paper discusses this explicitly in Sec. 5.3.

2.2/ The objective also matters a lot.

> Standard LM objective: learn P(x)

> Task-completion objective: learn P(answer | question)

In practice, this means: do not spend loss predicting the prompt. Train on response tokens.
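
A small sketch of the difference between the two objectives, assuming the loss is an average negative log-likelihood and that a 0/1 mask marks response tokens. The probabilities below are made-up illustration values, and `mean_nll` is a hypothetical helper, not the paper's training code.

```python
import math

def mean_nll(token_probs, loss_mask):
    """Average negative log-likelihood over positions where loss_mask is 1."""
    picked = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    return sum(picked) / len(picked)

# Probability the model assigned to each correct next token (made-up values).
token_probs = [0.10, 0.20, 0.05, 0.60, 0.70, 0.80]
is_response = [0, 0, 0, 1, 1, 1]  # the first three tokens belong to the prompt

# Standard LM objective, P(x): every token contributes to the loss.
full_lm_loss = mean_nll(token_probs, [1] * len(token_probs))

# Task-completion objective, P(answer | question): prompt tokens are masked out.
response_only_loss = mean_nll(token_probs, is_response)

print(f"P(x) loss: {full_lm_loss:.3f}")
print(f"P(answer | question) loss: {response_only_loss:.3f}")
```

With the mask applied, gradient signal comes only from the answer tokens, so no capacity is spent on predicting the prompt.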

This alone moves the average:

Transformer: 41.9 -> 57.6 (+15.7)
HRM: 51.5 -> 62.3 (+10.8)

3/ Then, the next sharp question for me was: What if we take causal-only HRM from Table 3 and compare it to the open models in Table 4?
-------------------------------------------------

Not the final PrefixLM HRM. Just causal-only HRM. That gives a less flashy, but more informative comparison.

Against Table 4 models, causal-only HRM roughly looks like this:

HRM causal avg: 62.3

vs Llama3.2 3B: +3.1
vs Gemma3 4B: +5.7
vs Qwen3.5 2B: +4.1
vs Huginn 3.5B: +21.2

vs Ouro 1.4B: -1.4
vs OLMo3 7B: -7.3

So causal-only HRM is still quite competitive for a 1B low-budget model (actually impressive if you look at FLOPs used compared to others in Table 4!).
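
The deltas above can be turned back into the averages they imply for each Table 4 model. This is only arithmetic on the numbers quoted in this thread, assuming each delta is the causal-only HRM average minus the other model's average; the results are approximate and are not read directly from the paper.

```python
hrm_causal_avg = 62.3

# Delta = causal-only HRM average minus the other model's average (from the thread).
deltas = {
    "Llama3.2 3B": 3.1,
    "Gemma3 4B": 5.7,
    "Qwen3.5 2B": 4.1,
    "Huginn 3.5B": 21.2,
    "Ouro 1.4B": -1.4,
    "OLMo3 7B": -7.3,
}

# Implied average for each model, rounded to one decimal like the quoted scores.
implied = {name: round(hrm_causal_avg - delta, 1) for name, delta in deltas.items()}

for name, score in implied.items():
    print(f"{name}: {score}")
```

Under that assumption the implied averages are 59.2, 56.6, 58.2, 41.1, 63.7, and 69.6, in the order listed.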

4/ FLOPs fairness is important, and the paper makes a serious attempt here.
-------------------------------------------------

For the internal Transformer, TRM, and HRM comparisons, they match estimated training FLOPs, not just token count.

Because HRM spends more computation per token, the Transformer gets more training tokens under the same FLOPs budget.

I think this is a serious attempt at fairness, though still estimate-level.

-/ In summary, HRM-Text is solid work to me. The ablations show real architectural signal, in addition to recipe choices separate from arch.

That is more interesting than just a one-line architecture claim, and more useful for researchers to follow up on.

Congratulations to the team!

mohammad2012191 retweeted

Ben Cohen

@blc_16

20 days ago

MIT just released a new RL method called Pedagogical RL.

The main lesson -> correct reasoning traces can still be bad training data.

It is a similar concept to teaching someone backprop.

Say you have a tiny computation graph:

z = wx + b
a = ReLU(z)
L = (a - y)²

If you already understand backprop, you can jump straight to the gradient:

dL/dw = 2(a - y) · 1[z > 0] · x

The answer is correct but it skips the reasoning process.

To get there, you need to break the computation into local pieces:

dL/da = 2(a - y)
da/dz = 1[z > 0]
dz/dw = x

Then backprop is just composing those local derivatives backward through the graph:

dL/dw = dL/da · da/dz · dz/dw = 2(a - y) · 1[z > 0] · x
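
The composed gradient above can be checked against a finite-difference estimate. The sketch below uses scalar inputs and arbitrary example values for w, x, b, and y; the function names are illustrative.

```python
def forward(w, x, b, y):
    """Run the tiny graph: z = wx + b, a = ReLU(z), L = (a - y)^2."""
    z = w * x + b
    a = max(z, 0.0)
    loss = (a - y) ** 2
    return z, a, loss

def grad_w(w, x, b, y):
    """Compose the three local derivatives backward through the graph."""
    z, a, _ = forward(w, x, b, y)
    dL_da = 2.0 * (a - y)
    da_dz = 1.0 if z > 0 else 0.0
    dz_dw = x
    return dL_da * da_dz * dz_dw

def numeric_grad_w(w, x, b, y, eps=1e-6):
    """Central finite difference of the loss with respect to w."""
    _, _, up = forward(w + eps, x, b, y)
    _, _, down = forward(w - eps, x, b, y)
    return (up - down) / (2 * eps)

w, x, b, y = 0.5, 2.0, 0.1, 3.0
analytic = grad_w(w, x, b, y)
numeric = numeric_grad_w(w, x, b, y)
print(analytic, numeric)
```

Here z = 1.1 is positive, so the ReLU passes the gradient through and the analytic value is 2(1.1 - 3) · 1 · 2 = -7.6. With a negative z the indicator term zeroes the gradient.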

Showing a student the final gradient does not teach them how to find gradients on new graphs.

Even telling them “just use the chain rule” may be too large of a jump if they do not understand how to decompose the computation into intermediate nodes and local derivatives.

Reasoning RL has the same failure mode.

A rollout can pass the verifier while containing one step the student model basically never would have taken.

The trajectory gets the answer right, but the learning signal is brittle because the path is too far from the student’s current policy.

Pedagogical RL trains a privileged teacher that knows the answer, then rewards it for producing trajectories that stay learnable for the student.

The trick is to use a spike-aware reward. It penalizes single huge surprise gaps in the trajectory, even when the average likelihood of the trajectory looks fine.

Then the student learns with surprisal-gated imitation, where teacher tokens that are still too surprising get downweighted.
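
To make the two ideas above concrete, here is a rough sketch. Both formulas are assumptions chosen for illustration (a mean-plus-max surprisal penalty and a threshold-based gate); the tweet does not give the actual reward or gating function, so these should not be read as the method's definitions.

```python
import math

def surprisals(student_probs):
    """Per-token surprisal, -log p, under the student's current policy."""
    return [-math.log(p) for p in student_probs]

def spike_aware_reward(student_probs, spike_weight=1.0):
    """Hypothetical reward: penalize mean surprisal plus the single worst spike."""
    s = surprisals(student_probs)
    return -(sum(s) / len(s)) - spike_weight * max(s)

def gated_weights(student_probs, threshold=3.0):
    """Hypothetical gate: full weight below the threshold, reduced weight above."""
    return [
        1.0 if s <= threshold else threshold / s
        for s in surprisals(student_probs)
    ]

# Two teacher trajectories with the same mean surprisal (1.0) for the student.
smooth = [math.exp(-1.0)] * 4
spiky = [math.exp(-0.2), math.exp(-0.2), math.exp(-0.2), math.exp(-3.4)]

print(spike_aware_reward(smooth), spike_aware_reward(spiky))
print(gated_weights(spiky))
```

Although both trajectories have the same average likelihood, the spiky one scores lower because of its single large surprise, and its last token is the only one downweighted during imitation.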

The teacher is learning how to teach at the student’s current level.

Pedagogical RL makes RL more efficient by selecting trajectories the student is most ready to learn from.

Less waiting for the model to get lucky rollouts. More training signal from examples that meet the student where it is.

Full blog in comments

Mohamed

@mohammad2012191

22 days ago

@cwolferesearch Check out VideoAtlas. While it looks like a video understanding framework, I believe it is an excellent multimodal agentic benchmark (check the end of the thread particularly, there is another thread with interesting details) https://t.co/1izjZyMhlq

Mohamed

@mohammad2012191

3 months ago

What if understanding a video was more like navigating a map?🤔

And what if that made compute scale logarithmically (not linearly) with video length?!

New preprint🎉:
🗺️VideoAtlas: Navigating Long-Form Video in Logarithmic Compute https://t.co/Vk5B6bqSr7

Mohamed

@mohammad2012191

22 days ago

@Lama_s1 @KAUST_Academy @Dr_S_Albarakati Some additional experiments that show even more interesting things :) https://t.co/FzFbKeGmh7

Mohamed

@mohammad2012191

27 days ago

Been running extra experiments on VideoAtlas over the past few days, and the headline finding turned out wilder than I expected:
🔥@googlegemma Gemma 4, which is officially limited to 60 seconds of video, is now matching long-context models on 30-60 min videos!
Thread 🧵 https://t.co/P6l0RGjNBb

Mohamed

@mohammad2012191

3 months ago

This wouldn't have been possible without an incredible team: Ali Habibullah, Yazan Alshuaibi, and @Lama_s1, and the guidance of our amazing supervisors Dr. Tanveer Hussain and Prof. Naeemullah Khan. Special thanks to @KAUST_Academy and @Dr_S_Albarakati for their support.

Mohamed

@mohammad2012191

22 days ago

@Rafa_Schwinger Guess what https://t.co/1izjZyMhlq
