Harris Zhang @HyperStorm9682 - Twitter Profile

9 days ago

Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation https://t.co/c9AvsRKybj

What if we didn’t have to hold an entire neural network in memory to train it?

Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network. In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance.

With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block.

How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.

We validated this across five different architectures:
• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers

In each case, performance is competitive with end-to-end training while using a fraction of the memory.

This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.

Read our paper and code, to learn more.
Paper: https://t.co/CRj96VGYQn
GitHub: https://t.co/eNW0K9Xh8E 🐟

55

2K

365

2K

852K

HyperStorm9682 retweeted

Yuyin Zhou

@yuyinzhou_cs

8 days ago

🧠Your VLM didn't fail because it didn't think long enough. It failed because it looked wrong: We found #Qwen3-VL-8B's wrong answers trace back to a perception error — not a reasoning one 📉. 💡Our fix: a capability curriculum — a brand-new curriculum dimension that trains perception before reasoning. 🔍➡️🤔 Excited to share our new @icmlconf paper: From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models 🌐 Project: https://t.co/bG15lh0pGv 📄 Paper: https://t.co/B3avcAl8Tb 💻 Code: https://t.co/NTy3s8vtaw

5

85

23

64

13K

Harris Zhang @HyperStorm9682

10 days ago

💻 Stop discarding the fine-grained local evidence in your token sequences! SMART gives you the efficiency of a single-vector retriever with the richness of multi-vector. Code and weights are fully open-sourced: https://t.co/Saw18HlaF6 https://t.co/KJJMJmp46p

0

3

1

246

Harris Zhang @HyperStorm9682

10 days ago

🚨 Your Embedding Model is SMARTer Than You Think! Single-vector models actually hide powerful multi-vector capabilities in their frozen hidden states. We introduce SMART, a framework that unlocks this ability for SoTA multimodal retrieval. 🧵👇 🔗 https://t.co/UBpQ2y4sXU

1

79

18

57

17K

Harris Zhang @HyperStorm9682

10 days ago

📉 3. LoRA Finetune: Full multi-vector training is expensive. SMART acts as a highly efficient finetuning technique. By leveraging LoRA, you can convert ANY single-vector model into a multi-vector variant while saving at least 20% of compute! 🏆

1

3

1

0

286

HyperStorm9682 retweeted

Jaden Park

@_jadenpark

about 2 months ago

We all knew LLM agents struggle to explore, but we had to eyeball it 👀. We couldn't measure exploration errors. Until now. 🗺️🤖 We built a policy-agnostic metric to quantify exploration and exploitation errors in LLM agents. Spoiler: Exploration error is what kills📉 agent performance in our setting 👇🧵(1/8)

1

31

17

5

2K

Harris Zhang @HyperStorm9682

2 months ago

@baifeng_shi Great paper, Baifeng! I also have a recent paper, Spatio-Temporal Token Scoring https://t.co/WUCMriEXAj, where we prune tokens in both the ViT and the LLM. I'm astounded by how much you can save in the number of tokens! I've learned a lot from this work.

1

2

1

0

150

Harris Zhang @HyperStorm9682

3 months ago

Paper link: https://t.co/WUCMriEXAj Huge thanks to the people of the PRIOR team at Ai2! This paper would not have been possible without you all!

0

1

0

1

163

Harris Zhang @HyperStorm9682

3 months ago

New paper out! 🚨 Introducing STTS: Unified Spatio-Temporal Token Scoring for Efficient Video VLMs. We tackle the massive token bottleneck in video models by jointly identifying the tokens that actually matter. The overall figure below breaks down the core problem! 🧵👇

1

18

4

11

5K

Harris Zhang @HyperStorm9682

3 months ago

The final pruning figure shows the result—static, redundant background tokens are dropped, while key actions are perfectly preserved. ✂️ By filtering out the noise, STTS significantly speeds up inference while maintaining high performance. Code is open-sourced! 🔥

1

0

248

Harris Zhang @HyperStorm9682

6 months ago

Super glad to be a part of the Molmo2 project! Was able to train a couple of variants and experiment with modeling along the way. What a great effort from our team!

Ai2 @allen_ai

6 months ago

Molmo 2 doesn't just answer questions about clips—it searches & points. The model returns coordinates & timestamps over videos + images, powering QA, counting, dense captioning, artifact detection, & subtitle-aware analysis. You can see exactly how it reasoned.

4

112

18

57

68K

0

4

0

297

HyperStorm9682 retweeted

Zhengzhong Tu

@_vztu

9 months ago

Dear @NeurIPSConf PCs, I don't understand why we still need reviewers and area chairs if PCs are finally going to take over and overturn the AC decision without providing any reason, whereby our weeks of effort spent on rebuttals (both authors and reviewers) have been ignored.

7

223

25

24

31K

HyperStorm9682 retweeted

Yong Jae Lee @yong_jae_lee

9 months ago

Here is the final decision for one of our NeurIPS D&B ACs-accepted-but-PCs-rejected papers, with the vague message mentioning some kind of ranking. Why was the ranking necessary? Venue capacity? If so, this sets a concerning precedent. @NeurIPSConf

1

46

5

8K

HyperStorm9682 retweeted

Mu Cai

@MuCai7

over 1 year ago

1/N) Are current large multimodal models like #GPT4o really good at video understanding? 🚀 We are thrilled to introduce TemporalBench to examine temporal dynamics understanding for LMMs! Our TemporalBench reveals that even the SOTA LMM #GPT4o achieves only 38.5, far from the human performance of 67.9.

With high-quality human annotations, TemporalBench investigates:
(1) Action order (change the order);
(2) Action frequency (one time vs. two times);
(3) Action type (put vs. pull);
(4) Motion magnitude (slightly vs. intensively);
(5) Motion direction/orientation (forward vs. backward, circular vs. back-and-forth);
(6) Action effector (cutting with left hand vs. cutting with right hand).

Explore TemporalBench: https://t.co/Jv4iZ29gj8

1

59

15

22

25K

HyperStorm9682 retweeted

Mu Cai

@MuCai7

over 1 year ago

1/N) All current video models understand videos poorly, even when the videos are less than 10 seconds long! The best model, GPT4o, achieves 35.0 while humans get 90.0 in group score. Existing LMMs severely struggle to distinguish temporal differences in Vinoground https://t.co/AHa87DZkd2

2

128

27

73

17K
