Vikas Chandra

@vikasc

Senior Director of #AI Research @Meta | CMU Ph.D. | Ex visiting faculty at Stanford

Menlo Park, CA

Joined April 2009

183 Following

624 Followers

347 Posts

Vikas Chandra @vikasc

3 days ago

Vision Language Models are native 3D learners!

Zhipeng Cai

@cai_zhipeng

3 days ago

🎇Thrilled to release VLM^3! Most 3D vision papers nowadays still spend months/years designing complex archs/losses/augmentations for different tasks. Are they necessary? VLM^3 shows that most designs that you think are important for 3D vision are [not] important at all!

221

724

10M

Vikas Chandra @vikasc

6 days ago

We just released MobileMoE, the first sub-billion-active-parameter MoE language model family. MoE isn't just for 100B+ parameter models on servers. At sub-billion scale, sparse expert routing lets you match dense models at 2-4x fewer FLOPs while fitting in mobile DRAM. https://t.co/ZvwJBVs778

177

vikasc retweeted

Kelly Greer

@kellyjgreer

10 days ago

2/ the work by @vikasc and others at Meta AI shows that Trajectory Reduction Policy Optimization (dTRPO) significantly reduces the training cost of diffusion LLMs (dLLMs), a key step forward as scaled training has been a hurdle for dLLMs to date. watch out for what comes from Meta here https://t.co/l1GjjhmM3C

Vikas Chandra @vikasc

15 days ago

Grateful to @sallywf and @EETimes for the thoughtful writeup of my Embedded Vision Summit keynote. The thesis in one line: the next decade of AI won't be won by the biggest model, but by the smartest, most efficient one that lives on the devices you wear! https://t.co/JBKwfRI6jZ

405

Vikas Chandra @vikasc

about 1 month ago

Audio is the most ignored perception modality in on-device AI. Every smart glass, robot, and drone has a mic. Almost none fuse audio + vision at perception time. Vision-only is the vibe-coded version of multimodal perception.

240

vikasc retweeted

Kelly Greer

@kellyjgreer

about 1 month ago

the market loves to talk hardware because stock prices go brrr but the most interesting thing happening right now is in alternative model architectures

Vikas Chandra @vikasc

about 1 month ago

3/ Even Qwen2.5 Omni hits just 27.3% on foreground and 39.5% on background sound accuracy. Audio hallucination is widespread in today's AV-LLMs, and robust evaluation has to be a first-class metric for AR/wearable use cases. 📄 https://t.co/KrAKvZOGHA

122

Vikas Chandra @vikasc

about 1 month ago

1/ New @ieeeICASSP 2026 (Oral): "Exploring Audio Hallucination in Egocentric Video Understanding." Audio-visual LLMs often "hear" things they didn't, inferring sounds from visual cues alone. We built a benchmark to quantify it.

658

Vikas Chandra @vikasc

about 1 month ago

2/ Setup: 300 egocentric videos, 1,000 sound-focused Q/As, with a taxonomy that separates foreground action sounds (from the user's activity) and background ambient sounds.

Vikas Chandra @vikasc

about 1 month ago

@MingchenZhuge @AilingZeng81332 @tikgiau @shirleyrz_ @sherryyangML @sthuyan @_yunzhong @SchmidhuberAI @Wenyi_AI_Wang @dmitrii_tech @PiotrPiekosAI Thanks @MingchenZhuge for all your hard work in both organizing and running this awesome workshop!

648

Vikas Chandra @vikasc

about 1 month ago

Long post pulling the field together, leaning on my group's work (EUPE, EfficientSAM, Efficient Track Anything, EdgeTAM, LongVU, EgoAVU, VideoAuto-R1, DepthLM, ParetoQ) placed against the parallel work in each section. https://t.co/hg2gsiATke

Vikas Chandra @vikasc

about 1 month ago

Efficient Video Intelligence in 2026 🧵 Five years ago video understanding meant action recognition on Kinetics-400. Now VLMs reason over hour-long footage, foundation-grade tracking runs at 16 FPS on a phone, and one sub-100M backbone replaces four specialized encoders.

279

Vikas Chandra @vikasc

about 1 month ago

What's still hard is mostly deployment: streaming at hour-plus durations, sub-watt AR glasses, open-set anomaly detection, cross-camera reasoning, spatial grounding through cuts, closed-loop eval. The bottleneck moved from models to the stack around them.

115

Vikas Chandra @vikasc

about 2 months ago

Diffusion models couldn't reason because RL was too expensive, not because the architecture was wrong. dTRPO collapses trajectory computation to one forward pass. On a 7B model: +9.6% GPQA, +4.3% HumanEval+. The architecture question is open again. Paper: https://t.co/lOYHzWu2ck

Vikas Chandra @vikasc

about 2 months ago

Standard approach to long video: more frames, bigger context. Tempo flips it. A small VLM reads the question first, then compresses the video around it. 6B params. 8K visual tokens. Outperforms GPT-4o and Gemini 1.5 Pro on hour-long videos. https://t.co/RSU3e81yJb

189

vikasc retweeted

AVB

@neural_avb

about 2 months ago

Been thinking about what this paper really means. "Video diffusion" and "World Models" are becoming synonymous. Neural Computers are basically video diffusion world models for terminal envs and GUIs. Lots of talk last week about automating Manim videos. In theory, we should be able to train these world models on 10,000 hours of diverse Manim videos and "see where it goes". If an NC can generate outputs to terminal commands, it should be able to generate videos like this directly from a prompt too. Without writing code.

680

850

120K

Vikas Chandra @vikasc

about 2 months ago

Classical computers run programs. Agents wrap models around programs. Neural Computers ask: what if the model is the program, the memory, and the machine? New paper exploring fully learned runtimes where computation emerges from weights alone. Paper: https://t.co/6s7JvVD6YQ

vikasc retweeted

Kelly Greer

@kellyjgreer

about 2 months ago

the market is reacting to the memory shortage by buying Sandisk, assuming that labs continue to run less optimized versions of their own models and overspend on hardware. but the longer term signal to take from the memory wall is to focus on the continued scaling down of models deployed. compute has scaled 3x every 2 years while memory bandwidth has only scaled 1.6x over 20 years - every new GPU generation widens this gap - but we've also been grossly overusing memory via the common training mode. APOLLO showed that AdamW, the standard LLM optimizer, stores redundant state for every single parameter, and that coarser gradient approximations achieve the same training quality with a fraction of the memory, enabling model pre-training on 1/8 the GPU capacity. many more proof points are coming to light re: optimizing algorithms and efficiency of smaller models. in fact, small models can outperform much larger models by spending more compute at inference time. good read on this from @vikasc

vikasc retweeted

Jürgen Schmidhuber

@SchmidhuberAI

about 2 months ago

Neural Computers https://t.co/XVk9bCLGbm

268

300K
