vislang.ai @vislang - Twitter Profile

vislang retweeted

Rice University Computer Science @RiceCompSci

2 months ago

The Rice Workshop on LLMs, organized by @hanjie_chen and @bluevincent, explored AI safety, interpretability & human-AI collaboration. Key takeaway: progress depends on reliability, transparency, and stronger human alignment. https://t.co/lhCONBK7l4

vislang retweeted

Moayed Haji Ali @moayedhajiali

3 months ago

Not all pixels are equally hard, but DiTs still allocate compute uniformly across pixels, wasting effort on easy regions. ELIT adds two lightweight cross-attention layers to focus compute where it matters, cutting FID by 53%. ELIT: https://t.co/zwu8mkHkmf

vislang retweeted

Guilherme Favaron @guifav

3 months ago

Diffusion transformers waste compute by treating every pixel equally, regardless of content complexity. ELIT (Elastic Latent Interface Transformer) fixes this with a simple idea: insert a variable-length set of latent tokens that learn where to spend computation.

Two lightweight cross-attention layers (Read/Write) route information between spatial tokens and latents. The model learns importance ordering during training by randomly dropping tail latents, so earlier tokens capture global structure while later ones handle fine details.

Results on ImageNet 1K at 512px: 35.3% better FID, 39.6% better FDD scores, ~33% cheaper classifier-free guidance. Works across DiT, UViT, HDiT, and MMDiT architectures with no changes to the training objective.

By Moayed Haji Ali, @vislang (Rice University), @SergeyTulyakov, Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov and team at @Snap.

Accepted at CVPR 2026.
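
The tail-dropping idea described in this post can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the authors' implementation: the helper name `drop_tail_latents`, the `min_keep` parameter, and the uniform sampling of the kept length are all hypothetical choices made for the example.

```python
import random


def drop_tail_latents(latents, min_keep=1, rng=random):
    """Keep a random-length prefix of the latent tokens (hypothetical helper).

    Only the tail is ever dropped, so earlier latents are present in every
    training step. That is what encourages them to carry the most important
    information, while later latents refine details.
    """
    if not 1 <= min_keep <= len(latents):
        raise ValueError("min_keep must be between 1 and len(latents)")
    # randint is inclusive on both ends; uniform sampling is an assumption.
    keep = rng.randint(min_keep, len(latents))
    return latents[:keep]


# Toy usage: 8 latent tokens, each represented here by its index.
latents = list(range(8))
rng = random.Random(0)  # seeded so the run is repeatable
kept = drop_tail_latents(latents, min_keep=2, rng=rng)
print(len(kept), kept)
```

At inference time the same slicing could be applied with a fixed length instead of a random one, which is one way to read the "variable-length" compute budget mentioned in the post.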

vislang retweeted

Zilin Xiao @ZilinXiao2

4 months ago

🚀 Two papers accepted to #ICLR2026 on test-time scaling for vision-language systems (retrieval + reasoning)!

1) MetaEmbed (Oral Presentation): Meta Tokens + Matryoshka multi-vector training → flexible late interaction, choose #vectors at test time for accuracy↔efficiency.
Paper: https://t.co/Ry7uYogtHB
Work done at @AIatMeta with amazing collaborators: Qi Ma, @Mengting_Gu, Jason Chen, Xintao Chen, @vislang and @MohanVijaimohan!

2) ProxyThinker: training-free test-time guidance from small “slow-thinking” visual reasoners → self-verification / self-correction via distribution-level guidance.
Paper: https://t.co/hA8llmYKT3
Work done with @JaywonK17250, @Siru_Ouyang, @jefehern, @yumeng0818 and @vislang!

While I won't be able to travel to Brazil🇧🇷, please say Hi to the team :-)

#MultimodalRetrieval #VisualReasoning #VisionLanguage #TestTimeCompute #Embeddings

vislang retweeted

AK @_akhaliq

9 months ago

MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

vislang retweeted

Aran Komatsuzaki @arankomatsuzaki

9 months ago

Meta Superintelligence Labs presents MetaEmbed: Scalable multimodal retrieval

• Flexible late interaction via Meta Tokens
• Test-time scaling: trade off retrieval accuracy vs efficiency
• SOTA on MMEB + ViDoRe, robust up to 32B models
• Matryoshka training → coarse-to-fine multi-vector embeddings
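
The accuracy-versus-efficiency trade-off in this summary can be illustrated with a generic late-interaction (MaxSim) score: each query vector is matched to its best document vector and the matches are summed. The sketch below is an assumption-laden toy, not MetaEmbed's code. The function name `late_interaction_score`, the single shared `budget` parameter, and the idea that vectors are ordered coarse-to-fine so a prefix can be used are all hypothetical simplifications for the example.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs, doc_vecs, budget=None):
    """MaxSim-style score using only the first `budget` vectors per side.

    budget=None uses every vector. A smaller budget is cheaper but coarser,
    assuming the vectors are ordered from coarse to fine.
    """
    if budget is not None and budget < 1:
        raise ValueError("budget must be at least 1")
    q = query_vecs[:budget]  # slicing with None keeps the whole list
    d = doc_vecs[:budget]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


# Toy 2-D embeddings: doc_a matches both query vectors, doc_b only the first.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[1.0, 0.0], [1.0, 0.0]]

for budget in (1, 2):
    print(budget,
          late_interaction_score(query, doc_a, budget),
          late_interaction_score(query, doc_b, budget))
```

With a budget of one vector the two toy documents tie, and the second vector is what separates them. That is the kind of trade-off the post describes: more vectors at test time cost more compute but can resolve finer distinctions.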

vislang retweeted

AK @_akhaliq

over 4 years ago

CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations
abs: https://t.co/1l9BZixLNF

CLIP-Lite obtains a +15.4% mAP absolute gain in performance on Pascal VOC classification, and a +22.1% top-1 accuracy gain on ImageNet

vislang retweeted

Zilin Xiao @ZilinXiao2

12 months ago

Looking for a new (image) re-ranking paradigm? Check this out! LoCoRe (Long-Context Reranker) is trained with a long-context sequence model and token-level supervision to achieve **one-pass** re-ranking for all image candidates. Catch us at #CVPR Poster Session 2 #401 on Friday, 4pm-6pm!

vislang.ai @vislang

over 1 year ago

Check out our new work on cross-modal audio-video generation. Our work produces audio with the best alignment we have seen with respect to the actions happening in the video. Particularly useful in the era of astounding progress in generative video models.

Moayed Haji Ali @moayedhajiali

over 1 year ago

Can pretrained diffusion models connect for cross-modal generation?
📢 Introducing AV-Link ♾
Bridging unimodal diffusion models in one framework to enable:
📽️ ➡️ 🔊 Video-to-Audio
🔊 ➡️ 📽️ Audio-to-Video
🌐: https://t.co/Q8UPwwPM9V
📄: https://t.co/Wmz0RoY8ue
⤵️ Results

vislang retweeted

Reginald DesRoches @RDesRoches

over 1 year ago

Rice is shaping the future of AI! Our researchers are working on groundbreaking methods to eliminate the "weird" or distorted images that AI sometimes generates. This innovation could lead to more accurate and realistic visuals created by artificial intelligence. The future of AI-generated imagery is looking clearer and brighter thanks to our researchers! 🔍💡Read more about this fascinating research and its potential impact: https://t.co/BQjr3J5oIQ #RiceU #FutureOfAI #Innovation

vislang retweeted

Zilin Xiao @ZilinXiao2

over 1 year ago

I am excited to share that two of our research works will be presented at ECCV 2024. #ECCV2024 They focus on augmenting language models with fine-grained visual recognition ability.

AutoVER made successful attempts at generative visual recognition. It was accepted to the ECCV 2024 main conference and was invited to the ILR Workshop as an oral presentation. Collaboration w/ @pcascanteb @vislang #Microsoft

Extractive Reranker was accepted to the ILR Workshop as a poster. We explored how the long-context sequence modeling ability of language models can benefit image retrieval, a fundamental computer vision problem.

vislang retweeted

Rice University Computer Science @RiceCompSci

over 1 year ago

Rice CS welcomes Zhengzhong Tu, Texas A&M assistant professor, next Tuesday, 9/24 at 4pm in Duncan Hall 3076. Dr. Tu will discuss Democratizing Diffusion Models for Controllable & Efficient Computational Imaging. PLEASE RSVP: https://t.co/FPeDlfUe1o @_vztu @vislang

vislang retweeted

Rice University Computer Science @RiceCompSci

over 1 year ago

GenAI has struggled to create consistent images, but research from Rice CS' @vislang lab could make weird AI images a thing of the past. Moayed Haji Ali and Vicente Ordóñez-Román have developed a way to improve the performance of AI diffusion models. https://t.co/d7qfrxxOQz

vislang retweeted

Rice University Computer Science @RiceCompSci

almost 2 years ago

Rice CS' @cathyrzhe presented her paper, Improved Visual Grounding through Self-Consistent Explanations, at @CVPR 2024. SelfEQ helps computers ‘see’ more accurately and consistently. She is advised by faculty member Vicente Ordóñez-Román. https://t.co/f1xCOqDh0K @vislang

vislang.ai @vislang

almost 2 years ago

Check out the work by @cathyrzhe who is representing our group at #CVPR2024 at the poster session tomorrow morning. Poster #334. #CVPR.

Ruozhen Catherine He @cathyrzhe

almost 2 years ago

(1/4) Excited to share our latest work at #CVPR2024 @CVPR!🔥 Join us tomorrow, Thursday, June 20, from 10:30am to noon at Poster Session 3, #334, to learn about "Improved Visual Grounding through Self-Consistent Explanations" with @pcascanteb, Ziyan, @alexandercberg, @vislang.

vislang retweeted

harpreet @DataScienceHarp

almost 2 years ago

Chatted with @cathyrzhe and @pcascanteb from @vislang at @RiceUniversity about the paper they had accepted to @CVPR.

Their paper introduces Self-Consistency Equivalence Tuning (SelfEQ) to improve visual grounding in vision-and-language models using paraphrases.

The Problem: Models struggle with precise object localization when textual descriptions vary. Current methods need detailed annotations and show inconsistency with varied texts.

The Solution: SelfEQ uses paraphrases generated by a large language model and finetunes the model with GradCAM for consistent visual explanations.

How It Works
1) Start with an existing method: Uses ALBEF model without object location annotations.
2) Improvements by SelfEQ: Generates paraphrases and ensures consistent visual attention maps.

Why It's Better
• Expanded Vocabulary: Handles more textual descriptions.
• Improved Localization: Precise and consistent without bounding box annotations.
• Efficiency: Reduces need for detailed annotations.

Key Contributions
• Introduces SelfEQ for better visual grounding.
• Uses large language models for paraphrases.
• Improves performance on benchmarks (Flickr30k, ReferIt, RefCOCO+).

#CVPR2024 #CVPR
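
The "consistent visual attention maps" idea in this summary can be pictured as a penalty on how much two heatmaps disagree: one map for a phrase and one for its paraphrase. The sketch below is a deliberately simplified stand-in, not the SelfEQ objective from the paper. The function name `consistency_penalty` and the use of a plain mean squared difference are assumptions for illustration, and the maps are hand-written toy grids rather than real GradCAM outputs.

```python
def consistency_penalty(map_a, map_b):
    """Mean squared difference between two same-sized 2-D heatmaps.

    Hypothetical helper: 0.0 means the two maps agree exactly, and larger
    values mean the model attends to different regions for the two texts.
    """
    same_shape = len(map_a) == len(map_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(map_a, map_b)
    )
    if not same_shape:
        raise ValueError("heatmaps must have the same shape")
    diffs = [
        (a - b) ** 2
        for row_a, row_b in zip(map_a, map_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


# Toy 2x2 maps: the phrase highlights the top-right cell.
map_phrase = [[0.0, 1.0], [0.0, 0.0]]
map_same = [[0.0, 1.0], [0.0, 0.0]]     # paraphrase attends to the same cell
map_shifted = [[1.0, 0.0], [0.0, 0.0]]  # paraphrase attends elsewhere

print(consistency_penalty(map_phrase, map_same))
print(consistency_penalty(map_phrase, map_shifted))
```

Adding a term like this to a finetuning loss would push the maps for a phrase and its paraphrase toward each other, which is the intuition behind the consistency described in the post.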

vislang retweeted

MOAYED HAJi ALi @MoayedHaji

over 2 years ago

Great news from #CVPR2024 🎉🎉🎉 Happy to share that our paper ElasticDiffusion: Training-free Arbitrary Size Image Generation was accepted @CVPR. Big thanks to my collaborators @bluevincent and Guha Balakrishnan. Check out more details here: https://t.co/Ch8uelO4Mf

vislang retweeted

Jaspreet Ranjit @jaspreetranjit_

over 2 years ago

How do biases change before and after finetuning large scale visual recognition models? Our @afciworkshop paper incorporates sets of canonical images to highlight changes in biases for an array of off-the-shelf pretrained models. #NeurIPS2023 Link: https://t.co/641qNJULqI
