I’m honoured to share that a research paper from my time at @Ubisoft has been accepted to the AI for Creative Visual Content Generation, Editing, and Understanding workshop at CVPR 2025!
1/ A 13B model in FP32 takes 52 GB of VRAM. The same model in INT4 takes 7 GB.
That's the difference between renting an H100 and running it on the 4090 sitting under your desk.
Wrote a practical guide on post-training memory reduction for PyTorch inference.
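The numbers above come from simple arithmetic: parameter count times bits per parameter. Here is a minimal sketch, assuming weights only (no activations, KV cache, or quantization metadata) and decimal gigabytes. The names `BITS_PER_PARAM` and `weight_memory_gb` are made up for illustration and are not from the guide.

```python
# Back-of-the-envelope weight memory: parameters x bits per parameter.
# Assumption: weights only; activations, KV cache, and quantization
# metadata (scales, zero points) are ignored.
BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(n_params: int, dtype: str) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BITS_PER_PARAM[dtype] / 8 / 1e9


if __name__ == "__main__":
    # Print the estimate for a 13B-parameter model at each precision.
    for dtype in BITS_PER_PARAM:
        print(f"{dtype}: {weight_memory_gb(13_000_000_000, dtype):.1f} GB")
```

For 13B parameters this gives 52.0 GB in FP32 and 6.5 GB in INT4. The 7 GB quoted above is presumably that figure rounded up, possibly with some quantization overhead included.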
6/ The post also covers applying these to a custom nn.Module:
- in-place swap of nn.Linear → bnb.nn.Linear4bit
- FX graph-mode quantization for CPU / ARM targets
- ONNX export with custom symbolic ops
- torch.compile for memory-efficient inference
I’m excited to share what our team has been building at @NVIDIAAI since I joined: Cosmos 3, an omnimodal world model for Physical AI.
Project: https://t.co/HTCR8JSzdW
HF: https://t.co/19p3c6pfZ0
Code: https://t.co/G6fuUOWFNk
If you are fine-tuning Qwen3-VL and copying your config from a Qwen2.5-VL tutorial, stop.
The patch arithmetic changed. Qwen3-VL uses a 32 × 32 token grid, not 28 × 28. Training will run. Images will be silently mis-sized.
Full guide → https://t.co/qbzLS3cuSP
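To see why a copied config mis-sizes images without raising an error, here is a rough sketch of the grid arithmetic. It assumes the 28 and 32 above are the pixel side lengths that each image dimension gets rounded to, with one visual token per grid cell. The helpers `snap_to_grid`, `visual_tokens`, and `budget_in_tokens` are hypothetical, not part of any Qwen library, and the real preprocessing may round differently.

```python
# Hypothetical helpers illustrating the 28 vs 32 grid arithmetic.
# Assumption: each image side is rounded to a multiple of `factor`
# pixels, and every factor x factor cell becomes one visual token.

def snap_to_grid(height: int, width: int, factor: int) -> tuple[int, int]:
    """Round each side to the nearest multiple of `factor` (at least one cell)."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    return h, w


def visual_tokens(height: int, width: int, factor: int) -> int:
    """Number of grid cells (visual tokens) after snapping."""
    h, w = snap_to_grid(height, width, factor)
    return (h // factor) * (w // factor)


def budget_in_tokens(max_pixels: int, factor: int) -> int:
    """How many grid cells a pixel budget actually buys."""
    return max_pixels // (factor * factor)


if __name__ == "__main__":
    # Same 1024 x 768 image, two different grids.
    print(snap_to_grid(1024, 768, 28), visual_tokens(1024, 768, 28))
    print(snap_to_grid(1024, 768, 32), visual_tokens(1024, 768, 32))
    # A budget written as "1280 tokens" in 28-pixel units...
    old_budget = 1280 * 28 * 28
    # ...buys fewer tokens once the grid is 32 pixels.
    print(budget_in_tokens(old_budget, 28), budget_in_tokens(old_budget, 32))
```

Under these assumptions, a pixel budget written as 1280 × 28 × 28 only covers 980 cells on a 32-pixel grid. Nothing crashes. The images just come out at a different size than the config intended.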
(6/6)
The takeaway: for prediction and planning, generating pixels is wasted compute. Read my full breakdown of the JEPA family, from the core recipe to open problems, in my new Medium post: https://t.co/bChKbvDnJR #MachineLearning #AI #ComputerVision
(1/6)
Are we wasting compute predicting pixels that don't matter? 🤔 Most self-supervised vision models reconstruct every pixel or use heavy contrastive augmentations. Enter JEPA: predicting the future in latent space! A thread on my latest deep dive 🧵👇
(5/6)
Hate complex heuristics? LeJEPA proves you might not need EMA or stop-gradients at all. It introduces a single regularizer (SIGReg) to enforce an isotropic Gaussian distribution, replacing the collapse-prevention machinery in ~50 lines of code.
I just tried https://t.co/LxKqY8Db1r built by Alibaba Group's Token Hub (ATH) group.
It is a streaming generative world model with native multimodal input. Text, images, and roaming instructions are condition variables injected online at any node during generation, without a reset.
- (2) Directing Mode: Directing is the more interesting one. You describe a scenario, generation starts, and mid-rollout you inject a new prompt to redirect the story. The model responds at the current node without restarting. Online conditioning with mid-rollout intervention.