Zhiheng Liu @__Johanan - Twitter Profile

Pinned Tweet

6 months ago

Huge thanks to @_akhaliq for sharing our work! We introduce TUNA, a unified multimodal model that handles both image/video understanding and generation/editing. The key is a unified, end-to-end learned visual representation.

AK

@_akhaliq

6 months ago

Meta presents TUNA

Taming Unified Visual Representations for Native Unified Multimodal Models

2

166

20

88

33K

2

41

4

16

13K

__Johanan retweeted

Jiatao Gu@CVPR2026

@thoma_gu

23 days ago

Excited to share STARFlow2 from Apple MLR : 🥨Bridging Language Models and Normalizing Flows for Unified Multimodal Generation. One model to understand, reason, and generate continuous images with a single unified autoregressive mechanism? Paper: https://t.co/IA1pJ5AtOX 1/9

12

271

48

93

518K

__Johanan retweeted

YUCHAO GU @YuchaoGu

21 days ago

🚀 We are excited to announce the release of AnyFlow, the first any-step video diffusion on-policy distillation (OPD) framework. By leveraging Flow Map distillation, AnyFlow significantly enhances model inference efficiency by reducing sample steps. (Code, models, and demos are now open-source!)

Key Highlights:
⚡ Any-Step Generation: Unlike traditional distilled models tied to fixed step budgets, AnyFlow enables a single model to adapt to arbitrary inference budgets. It achieves high-quality few-step generation while providing stable improvements as more sampling steps are added.
🔀 Multiple Architectures: AnyFlow supports any-step distillation for both causal and bidirectional video diffusion models.
🎬 Multiple Tasks: AnyFlow supports Text-to-Video, Image-to-Video, and Video-to-Video generation within one causal video diffusion model.
📈 Scalable Performance: AnyFlow is validated from 1.3B up to 14B parameters.

📄 Paper: https://t.co/Qqik8l29oB
💻 Code: https://t.co/KOMv9RtuWu
🎨 Pre-trained Models: https://t.co/Br1MNllUu8
🎬 Demo: https://t.co/hxbl56lPFU

4

175

33

104

23K

Zhiheng Liu @__Johanan

20 days ago

@HanshengCh Excellent work!

0

2

0

467

__Johanan retweeted

Hansheng Chen @HanshengCh

20 days ago

New paper: AsymFlow🔥

JiT x0-prediction is not enough for pixel generation. Better keep velocity in a low-rank subspace:

- 1.57 FID on ImageNet (best pixel flow model)
- Finetunes FLUX.2 klein into pixel space, beats the original on HPSv3/DPG/GenEval (#1 overall on HPSv3)

1/7 https://t.co/FSz46hrJHj

20

278

54

197

53K

Zhiheng Liu @__Johanan

23 days ago

Why is 5t5 always the one getting beaten up?🥲

🚨 AI News | TestingCatalog

@testingcatalog

23 days ago

GOOGLE 🔥: An upcoming Gemini Omni video model from Google is expected to be much more advanced in video editing, capable of completing tasks like removing watermarks, replacing objects in the video, and more. It is also likely that Google will release 2 versions of this model, including a Pro variant. And I assume what we see isn't Pro? Anime sample 👀

96

545

32

177

178K

0

45

__Johanan retweeted

BURKOV

@burkov

24 days ago

Most current models that handle both image understanding and image generation rely on separate pretrained components: a vision encoder (a network like CLIP that turns images into feature vectors) for understanding, and a VAE — a compression network mapping images into a smaller latent representation — for generation. The authors of this paper remove both and feed raw pixel patches directly into a single transformer trained end to end. The results show that pretrained vision encoders are not necessary for multimodal modeling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception. Read with an AI tutor: https://t.co/FCC0qeeHWV PDF: https://t.co/M2866qdBNS

6

103

19

54

5K
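
The tweet above describes feeding raw pixel patches directly into a single transformer instead of going through a pretrained vision encoder or VAE. As a rough illustration only, the sketch below shows what "raw pixel patches" plus a linear patch embedding could look like in plain Python. The names `patchify` and `embed`, the toy image, and the weights are all hypothetical; this is not the paper's code, and the real model's patch size, normalization, and embedding layer are not stated in the tweet.

```python
from typing import List

Image = List[List[List[float]]]  # height x width x channels


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W x C image into non-overlapping patch x patch tiles and
    flatten each tile into one vector (row-major, channels last)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat: List[float] = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])
            tokens.append(flat)
    return tokens


def embed(tokens: List[List[float]],
          weight: List[List[float]],
          bias: List[float]) -> List[List[float]]:
    """Linear patch embedding: project each flattened patch to the model width.
    weight has shape (token_dim, model_dim); no pretrained encoder is involved."""
    return [
        [sum(x * w_row[j] for x, w_row in zip(tok, weight)) + bias[j]
         for j in range(len(bias))]
        for tok in tokens
    ]


# Toy 4x4 RGB image whose pixel value encodes its position (assumed data).
image = [[[float(r * 4 + c)] * 3 for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)  # 4 tokens, each of length 2 * 2 * 3 = 12
```

In a real model the projected tokens would be concatenated with text tokens and passed to the transformer; here the projection weights are left to the caller to keep the sketch dependency-free.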

__Johanan retweeted

Thinking Machines

@thinkymachines

23 days ago

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. https://t.co/AFJZ5kH7Ku

461

16K

2K

12K

8M

__Johanan retweeted

Sander Dieleman

@sedielem

26 days ago

Great thread about multimodal models. Just because a single Transformer can do it all, that doesn't mean doing things this way makes the most sense economically. (Although it probably will, eventually!)

4

123

6

105

19K

__Johanan retweeted

Gabriele Berton

@gabriberton

27 days ago

Cool paper from Meta suggesting that future MLLMs will be Native Multimodal Models (NMM), hence no vision encoders anymore

But I disagree

I actually think we'll go in the other direction (what? more encoders? yes! read on...)

All you need to know about the future of MLLMs 🧵 https://t.co/eX6tmANJGp

10

190

24

200

69K

__Johanan retweeted

Alexander Whedon

@alex_whedon

29 days ago

Introducing SubQ - a major breakthrough in LLM intelligence.

It is the first model built on a fully sub-quadratic sparse-attention architecture (SSA),
And the first frontier model with a 12 million token context window which is:
- 52x faster than FlashAttention at 1MM tokens
- Less than 5% the cost of Opus

Transformer-based LLMs waste compute by processing every possible relationship between words (standard attention). Only a small fraction actually matter. @subquadratic finds and focuses only on the ones that do.

That's nearly 1,000x less compute and a new way for LLMs to scale.

1K

23K

3K

19K

13M

__Johanan retweeted

Jiawei Yang

@JiaweiYang118

about 1 month ago

Two months ago, I vaguely posted a number: 0.9 FID, one-step, pixel space. Now it is 0.75, and can be even lower. Many wonder how. I thought it might end as a small FID prank: simple and deliberate. It started with one question: can FID be optimized directly, and what does it reveal? Introducing FD-loss.

55

925

157

591

214K
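
For context on the numbers in the tweet above: FID is conventionally the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The symbols below are the usual ones and are not taken from the tweet: $\mu_r, \Sigma_r$ are the feature mean and covariance of real images, and $\mu_g, \Sigma_g$ those of generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Every term is a function of the generated-feature statistics, which is what makes the question of optimizing it directly meaningful. The tweet does not say how FD-loss does this, so no loss is sketched here.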

Zhiheng Liu @__Johanan

about 1 month ago

@bdsqlsz Thanks for sharing!

0

171

Zhiheng Liu @__Johanan

about 1 month ago

@felixudr @_akhaliq @liuziwei7 Thanks for sharing!🫡

0

1

0

66

Zhiheng Liu @__Johanan

about 1 month ago

@JiaweiYang118 Thanks Jiawei! I learned a lot from your work!

1

0

197

__Johanan retweeted

AK

@_akhaliq

about 1 month ago

Meta presents Tuna-2

Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

paper: https://t.co/OonewX4iAg https://t.co/n1sR30JULW

14

297

40

176

64K

__Johanan retweeted

Yuren Cong

@CongYuren

about 1 month ago

1/🚀 Excited to announce Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation!

We built an omni model that uses direct patch embedding layers for raw image inputs and achieves SOTA in multimodal understanding AND generation.

Paper: https://t.co/rk0tIB4tbt
Code: https://t.co/OSAos8k33x

Thanks to all the co-authors! @__Johanan, @wmren993, @xiaoke_shawn_h, @ShoufaChen, @TianhongLi6, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, @WenhuChen, Ping Luo, @LukeZettlemoyer!

11

88

11

43

85K

Zhiheng Liu @__Johanan

about 1 month ago

@rosinality Thanks for sharing our work! I remember that you also shared tuna 1. Thank you for your attention to our work!🥳

1

3

0

609

__Johanan retweeted

Rosinality @rosinality

about 1 month ago

Pixel-based unified understanding and generation model using JiT. Uses MAE for representation learning.

4

337

53

264

23K

__Johanan retweeted

Tianbao Xie

@TianbaoX

about 2 months ago

Nicely done. I would also shamelessly recommend my collection on test-time training, which overlaps with the topic of online learning. https://t.co/DxiqczlDyh

0

36

3

28

5K

__Johanan retweeted

Yuwei Niu

@purshow04

about 2 months ago

https://t.co/y4GLO4WmOh

1

66

15

42

11K
