Thanh Tran Van Trong @jonnyjackk - Twitter Profile

13 days ago

I'm joining OpenAI next week!🥹 The job search turned out to be really challenging but also super rewarding, so I wrote a small blog to share what I learned along the way and hopefully make the process a little less mysterious for the next person. https://t.co/6FigSBdenD

507

14K

1K

19K

5M

jonnyjackk retweeted

Xuanchi Ren

@xuanchi13

about 1 month ago

The latent-vs-pixel debate misses the point. GPT Image 2 shows what users notice: pixel-level fidelity. Latent models show what scales: compact semantic structure. We connect them by replacing VAE/RAE decoders with a Pixel Diffusion Decoder. Code and Model available: https://t.co/JjtecJzF0W 🧵(1/N)

16

408

63

305

669K

jonnyjackk retweeted

Jia-Bin Huang

@jbhuang0604

about 2 months ago

Modern Transformer - Complete Guide Interested in learning the recent advances in transformers? After 13 videos, I've finally completed this series! 🥳🥳🥳 Check out the course here: https://t.co/CsujxlWigC

11

1K

159

944

47K

jonnyjackk retweeted

Floor Eijkelboom (@ICML2026 🇰🇷)

@FEijkelboom

2 months ago

Flow-LLM Blogpost :D https://t.co/0HiyNPJHsk In the last few weeks, a bunch of work on flows for language came out 🌊 That is exciting, because it makes truly parallel text generation feel real: generation where models can keep refining the whole response during inference, instead of committing token by token. I wrote an intuitive and animated introduction to the area — why autoregression has a structural ceiling, why discrete diffusion only partly escapes it, and why flows may be the first genuinely parallel alternative. Here's an overview of the key parts of the blog - and let's chat at #ICLR2026 :)

5

355

62

319

46K

jonnyjackk retweeted

Yuwei Niu ✈️ ICML

@purshow04

3 months ago

https://t.co/y4GLO4WmOh

1

69

15

42

12K

jonnyjackk retweeted

Tommie Kerssies

@tommiekerssies

3 months ago

World models are heavy. They don't need to be. Each frame is encoded as 1024 spatial tokens. What if it were just 1? In our #CVPR2026 Highlight from Amazon FAR, we compress frames into "delta" tokens for efficient generative world modeling. Paper, code & models below ↓ (1/7)

12

601

76

459

56K

jonnyjackk retweeted

Manu Gaur

@gaur_manu

3 months ago

Pretrained ViTs like DINOv2 or CLIP are great, but they produce fixed, generic representations that encode the most salient visual concepts (e.g., "cat"). In human vision, prior priming with language changes how people parse an image. We believe visual encoders should do the same 🚨 Introducing Steerable Visual Representations, a new family of visual features you can steer with text towards specific visual concepts.

13

903

135

666

150K

jonnyjackk retweeted

Sander Dieleman

@sedielem

3 months ago

"Diffusability" is all about the spectrum. https://t.co/nb4i8tDJl3 If you enjoyed my blog post about diffusion as spectral autoregression, and are wondering how this relates to latent diffusion, give this paper a read!

7

455

70

375

23K

jonnyjackk retweeted

Baifeng

@baifeng_shi

3 months ago

Humans can see in high-res, high-FPS in real-time. Why can't VLMs? Introducing AutoGaze: ViTs/VLMs "gaze" only at key video regions! Up to 4-100x token savings, 19x speedup, and enables scaling to 4K-res 1K-frame videos. 📄 https://t.co/GhbWZwMAg7 🌐 https://t.co/mEJ991MAIR 🤗 https://t.co/FOfc2QRThi (1/n)🧵

47

2K

202

1K

160K

jonnyjackk retweeted

alphaXiv

@askalphaxiv

3 months ago

Yann LeCun and his team can't stop cooking "LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels" One of the biggest bottlenecks of JEPA is they are hard to train, and this new research changes that. They propose LeWorldModel, which shows that a small model can learn a usable world model directly from raw pixels end-to-end. Sitting at 15M parameters, they made it without needing heuristics and avoiding anti-collapse hacks while staying competitive and planning up to 48x faster. Making JEPA based modeling much more accessible, cheaper, and stabler.

41

2K

238

1K

197K

jonnyjackk retweeted

alphaXiv

@askalphaxiv

3 months ago

"Exclusive Self Attention" This paper proposed Exclusive Self-Attention (XSA), which is a tiny two-line change that stops attention from looking at itself. This forces it to focus on the rest of the sequence, and can make transformers more effective! This improves the performance at long context at almost no extra cost.

15

827

137

535

44K

jonnyjackk retweeted

Peter Holderrieth

@peholderrieth

4 months ago

We are also releasing self-contained lecture notes that explain flow matching and diffusion models from scratch. This goes from "zero" to the state-of-the-art in modern Generative AI. 📖 Read the notes here: https://t.co/RULWDgn9pm Joint work with @EErives40101.

38

6K

642

7K

475K

jonnyjackk retweeted

Ethan Weber

@ethanjohnweber

4 months ago

I made a Claude Code skill that generates conference posters 🛠️ Instead of a static PDF, it outputs a single HTML file — drag to resize columns, swap sections, adjust fonts, then give your layout back to Claude. 🔁 🔗 Skill 👉 https://t.co/KhYV8anbxL

30

2K

329

3K

187K

jonnyjackk retweeted

Kimi.ai

@Kimi_Moonshot

4 months ago

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.

🔗 Full report: https://t.co/u3EHICG05h

334

13K

2K

10K

5M

jonnyjackk retweeted

Alif Munim (d/acc)

@alifmunim

4 months ago

Since @karpathy kicked off recursive self-improvement a few days ago, I've been thinking about how we can automate interpretability research. I asked Claude to train a sparse autoencoder on Gemma3-1B. It recovered 96% of Gemma's behaviors from interpretable features overnight.

17

446

38

351

42K

jonnyjackk retweeted

Black Forest Labs

@bfl_ai

4 months ago

We present a research preview of Self-Flow: a scalable approach for training multi-modal generative models.

Multi-modal generation requires end-to-end learning across modalities: image, video, audio, text - without being limited by external models for representation learning. Self-Flow addresses this with self-supervised flow matching that scales efficiently across modalities.

Results:
• Up to 2.8x faster convergence across modalities.
• Improved temporal consistency in video
• Sharper text rendering and typography

This is foundational research for our path towards multimodal visual intelligence.

15

902

136

514

147K

jonnyjackk retweeted

Robin Rombach

@robrombach

4 months ago

New paper out! We present a training method for multimodal generative models, called Self-Flow, which combines classic flow matching and representation learning.

Why? Unlike most representation alignment methods, our new approach does not require external, pretrained models and thus scales gracefully to joint multimodal training on images, videos and audio.

How? It combines per-timestep flow matching with dual-timestep representation learning, improving the models' internal representations.

This approach outperforms prior methods and shows promising scaling behavior in multimodal pretraining. It also enables downstream applications such as action prediction for embodied AI.

webpage+paper: https://t.co/qzGQGj8JYk
code: https://t.co/edhfdVEqSf

Credit to @hila_chefer, @pess_r, Dominik, @dustin_podell, Vikash, @Vinh_Suhi and Antonio.

If you enjoy doing open research like this, come and join BFL! We are actively hiring🌲

5

309

36

145

28K

jonnyjackk retweeted

Vuk Rosić 武克

@VukRosic99

4 months ago

2/2 Full Muon optimizer guide - https://t.co/97EfJaiOyV

1

37

7

46

4K

jonnyjackk retweeted

Max Zhaoshuo Li 李赵硕

@mli0603

4 months ago

I've been debugging RoPE recently and kept getting tripped up by details that most explanations gloss over. So I wrote a deep dive.

"Understanding RoPE: From Rotary Embeddings to Context Extension" https://t.co/yDZqzcqSk5

The blog covers:
• Full RoPE derivation from rotation matrices
• A clean proof of why RoPE's attention decays with distance (and when it breaks)
• The π boundary (RoPE's Nyquist limit)
• NTK-aware scaling derivation
• Dynamic NTK
• YaRN's frequency ramp + attention scaling
• Reference PyTorch code

Hope it helps! Feedback welcome!

8

537

58

753

61K

jonnyjackk retweeted

Kawin Ethayarajh

@ethayarajh

4 months ago

I would recommend reading this whole blogpost. It makes some very good points: https://t.co/Rzd8647aUP

1

61

6

63

4K
