D Naveen Reddy @dnaveenr - Twitter Profile

dnaveenr retweeted

about 1 year ago

Thrilled to share our latest advances in video understanding 📽️: Gemini 2.5 Pro is a truly magical model to play with, excelling in traditional video analysis and unlocking new use cases I could not imagine a few months ago🪄 More in 🧵 and @Google blog: https://t.co/4993sJmBpG

11

373

50

164

126K

dnaveenr retweeted

Andi Marafioti

@andimarafioti

about 1 year ago

Today, we share the tech report for SmolVLM: Redefining small and efficient multimodal models.
🔥 Explaining how to design a tiny 256M VLM that uses less than 1GB of RAM and outperforms our 80B models from 18 months ago!

Here are the coolest insights from our experiments:
✨ Longer context = Big wins: Increasing the context length from 2K to 16K gave our tiny VLMs a 60% performance boost!
✨ Smaller is smarter with SigLIP: Surprise! Smaller LLMs didn't benefit from the usual large SigLIP (400M). Instead, we use the 80M base SigLIP that performs equally well at just 20% of the original size!
✨ Pixel shuffling magic: Aggressively pixel shuffling helped our compact VLMs "see" better, achieving the same performance with sequences 16x shorter!
✨ Learned positional tokens FTW: For compact models, learned positional tokens significantly outperform raw text tokens, enhancing efficiency and accuracy.
✨ System prompts and special tokens are key: Introducing system prompts and dedicated media intro/outro tokens significantly boosted our compact VLM’s performance, especially for video tasks.
✨ Less CoT, more efficiency: Turns out, too much Chain-of-Thought (CoT) data actually hurts performance in small models. They dumb
✨ Longer videos, better results: Increasing video length during training enhanced performance on both video and image tasks.

🌟 State-of-the-Art Performance: SmolVLM comes in three powerful yet compact sizes (256M, 500M, and 2.2B parameters), each setting new SOTA benchmarks for their hardware constraints in image and video understanding.
📱 Real-world Efficiency: We've created an app using SmolVLM on an iPhone 15 and got real-time inference directly from its camera!
🌐 Browser-based Inference? Yep! We get lightning-fast inference speeds of 40-80 tokens per second directly in a web browser. No tricks, just compact, efficient models!

If you’re into efficient multimodal models, you’ll love this one.

7

469

110

314

66K
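
Note on the "pixel shuffling" point in the SmolVLM tweet above: the term usually refers to a space-to-depth rearrangement, where each small block of neighbouring visual tokens is folded into the channel dimension so the sequence gets shorter while each token gets wider. The sketch below is only an illustration in NumPy, not SmolVLM's actual code. The function name `pixel_shuffle_tokens`, the 16x16 grid, the 8-dimensional features, and the ratio of 4 per spatial axis are all assumptions, chosen because a ratio of 4 gives the 16x shorter sequence the tweet mentions.

```python
import numpy as np


def pixel_shuffle_tokens(tokens: np.ndarray, ratio: int) -> np.ndarray:
    """Space-to-depth: (H, W, C) -> (H // ratio, W // ratio, C * ratio * ratio).

    Hypothetical helper for illustration; not taken from any library.
    """
    h, w, c = tokens.shape
    if h % ratio or w % ratio:
        raise ValueError("grid height and width must be divisible by ratio")
    # Split each spatial axis into (block index, offset inside the block).
    x = tokens.reshape(h // ratio, ratio, w // ratio, ratio, c)
    # Move the two within-block axes next to the channel axis.
    x = x.transpose(0, 2, 1, 3, 4)
    # Fold every ratio x ratio block of tokens into one wider token.
    return x.reshape(h // ratio, w // ratio, ratio * ratio * c)


# Assumed toy input: a 16 x 16 grid of visual tokens with 8 features each.
grid = np.arange(16 * 16 * 8, dtype=np.float32).reshape(16, 16, 8)
merged = pixel_shuffle_tokens(grid, ratio=4)

seq_before = grid.shape[0] * grid.shape[1]
seq_after = merged.shape[0] * merged.shape[1]
print(seq_before, seq_after)  # 256 16
```

No information is dropped by the rearrangement itself: the 256 tokens of width 8 become 16 tokens of width 128, so the language model sees a sequence 16 times shorter at the cost of a wider projection input.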

D Naveen Reddy @dnaveenr

over 1 year ago

@cursivekeys Hey, I have a GA FRONT ticket. You can DM for details.

0

20

dnaveenr retweeted

TwelveLabs (twelvelabs.io)

@twelve_labs

almost 2 years ago

The webinar recording with @JacobChalkie & @huh_jaesung, @HRaajesh & @dnaveenr, and @rodosingh23 & @sridhruv is up! Watch here: https://t.co/T69FhHXqKZ 📺

They discussed:
- Time-interval machine
- Movie identity-aware captioning
- TV Story Summarization

Enjoy!

0

3

2

0

234

dnaveenr retweeted

TwelveLabs (twelvelabs.io)

@twelve_labs

almost 2 years ago

In the 56th session of #MultimodalWeekly, we have three exciting presentations across different video understanding tasks: action recognition, video description, and video summarization.

1

4

0

719

dnaveenr retweeted

TwelveLabs (twelvelabs.io)

@twelve_labs

almost 2 years ago

✅ @HRaajesh and @dnaveenr will discuss Movie-Identity Captioner (MICap) - which is a new single stage approach that can seamlessly switch between id-aware caption generation or fill-in-the-blanks when given a caption with blanks. https://t.co/PXTunmgYdR

1

2

0

142

dnaveenr retweeted

Makarand Tapaswi @MakarandTapaswi

almost 2 years ago

Thanks to the organizers (@davmoltisanti +) for an opportunity to share my thoughts at the amazing @CVPR Workshop "What Is Next in Video Understanding" https://t.co/je25r8fwbQ Slides summarizing some of our work on video and open challenges here: https://t.co/KuyRJ1TpPi

2

86

11

24

3K

dnaveenr retweeted

Makarand Tapaswi @MakarandTapaswi

about 2 years ago

@iiit_hyderabad @rodosingh23 @sridhruv @HRaajesh @dnaveenr @zeeshank95 Presenting both papers tomorrow morning @CVPR (Thu AM). Meet us at poster 385 and 422 if you want to see how we continue to make improvements in machine understanding of stories!

0

10

2

0

717

D Naveen Reddy @dnaveenr

about 2 years ago

We'll be presenting our work today at #CVPR2024 at Arch 4A-E, 10:30-12:00 PT. Meet us at poster 422 to know more. @CVPR

Makarand Tapaswi @MakarandTapaswi

about 2 years ago

Given multiple short movie clips, can models generate coherent identity-aware descriptions? 🤔 Turns out, this is a complicated task as it requires linking identities and what they are doing over time. Our @CVPR 2024 work improves this: https://t.co/K4TQFm6GYg 🧵1/7

4

59

14

6K

0

124

dnaveenr retweeted

Makarand Tapaswi @MakarandTapaswi

about 2 years ago

Given multiple short movie clips, can models generate coherent identity-aware descriptions? 🤔 Turns out, this is a complicated task as it requires linking identities and what they are doing over time. Our @CVPR 2024 work improves this: https://t.co/K4TQFm6GYg 🧵1/7

4

59

14

6K

D Naveen Reddy @dnaveenr

over 2 years ago

Big thanks to @MakarandTapaswi Sir for being so supportive, understanding and an amazing guide/mentor. Being a part-time RA along with a full-time job was quite tough but Sir was so supportive throughout. 🙌🙏

Makarand Tapaswi @MakarandTapaswi

over 2 years ago

Very proud of @dnaveenr, who persevered as a part-time research assistant with a part-time advisor 😉, and @HRaajesh, a 3rd year undergrad who ramped up in 4 months 🚀 and ran the bulk of the experiments a week before his end-semester exams. Thanks @zeeshank95 for the support 🤝 3/4

2

10

0

6K

0

2

0

153

D Naveen Reddy @dnaveenr

over 2 years ago

Happy to share that our paper, Identity-aware video captioning, has been accepted to #CVPR2024. 2+ years of effort and persistence have finally paid off. With rockstar teammates @HRaajesh and @zeeshank95, and under the amazing guidance of @MakarandTapaswi Sir.

Makarand Tapaswi @MakarandTapaswi

over 2 years ago

📢Happy to announce two #CVPR2024 papers from our Katha AI group @iiit_hyderabad! 🎉🎞️🔥 1. On 📺TV episode story summarization using recaps with @rodosingh23, @sridhruv. 2. On identity-aware video captioning with @HRaajesh, @dnaveenr, @zeeshank95. Paper 🧵and arXiv soon. 1/4

4

95

9

7

7K

0

6

0

262

dnaveenr retweeted

Jeremy Howard

@jeremyphoward

almost 4 years ago

After 2 years, Practical Deep Learning for Coders v5 is finally ready! 🎊 This is a from-scratch rewrite of our most popular course. It has a focus on interactive explorations, & covers @PyTorch, @huggingface, DeBERTa, ConvNeXt, @Gradio & other goodies 🧵 https://t.co/nzv7pek0iq

107

5K

1K

2K

0

D Naveen Reddy @dnaveenr

about 4 years ago

@debarghya_das Great. Looking forward. Haha, that's possible. But you can probably link this thread there, it'll be a good comparison.

0

1

0

dnaveenr retweeted

Quentin Lhoest 🤗 @lhoestq

about 4 years ago

We just released 🤗 Datasets 2.3!

📚 New datasets:
- Load ImageNet easily (no more manual download!)
- BigBench🔥, QuickDraw, TruthfulQA, etc.

⚡️ New features:
- More optimized to_tf_dataset() to play with @TensorFlow
- Stream datasets in parallel with the @PyTorch DataLoader

2

190

37

43

0

D Naveen Reddy @dnaveenr

about 4 years ago

Thanks to @MarioSasko and @qlhoest for all the support in helping me add this dataset. :)

0

2

0

D Naveen Reddy @dnaveenr

about 4 years ago

Biwi Kinect Head Pose Database is now available on the @huggingface 🤗 hub. The dataset by @CVL_ETH consists of over 15K images of 20 people recorded with a Kinect while turning their heads around freely. 🧵⬇️ Try it now: https://t.co/Ms1xa9xTbs

1

9

6

0

D Naveen Reddy @dnaveenr

about 4 years ago

The dataset is primarily used for the task of head pose estimation. You can easily load it as follows:

1

0

D Naveen Reddy @dnaveenr

about 4 years ago

@SharathWhat Already sold out Anna. 😢😢

0
