@JunhongShen1 2. The tokenizer's ability to handle different f could be acquired just by training with different compression ratios, without a predefined f. Did you try such a setting? Thanks!
@JunhongShen1 Hello! Very impressive work! I have a few questions:
1. How do you get the compression ratio of DiT-CAT in Table 4 for each eval image?
📢 My team at Meta (including @lipmanya and @RickyTQChen) is hiring a postdoctoral researcher to help us build the next generation of flow, transport, and diffusion models! Please apply here and message me:
https://t.co/mi42SVJAQD
Our team at Google DeepMind is looking for student researcher candidates working on multimodal reasoning! If you are excited about building next-generation personalized multimodal agents that interactively reason with humans, and would like to pursue it through rigorous hypothesis testing, controlled experiments, and solid engineering, please send an email to [email protected]
We look forward to creating the future together with you
✨ Exciting Opportunity at Google DeepMind Tokyo! ✨
We're seeking a brilliant Research Scientist to join our team. Are you passionate about audio and generative models? Apply now and help us push the boundaries of AI!
#Google #DeepMind #Audio #GenerativeModels #Tokyo #Hiring
📢✨ I am recruiting 1-2 PhD students at Virginia Tech this cycle.
If you are interested in efficient model development (including model merging, parameter-efficient fine-tuning & transfer learning), instruction tuning, advanced reasoning, LLMs-as-judges, etc., please apply!!
Excited to share that I'll be joining the University of California, Irvine as CS faculty in '25!🌟
Faculty apps: @_krishna_murthy, @liuzhuang1234 & I share our tips: https://t.co/ySaBIGB3aF
PhD apps: I'm looking for students in vision, robot learning, & AI4Science. Details👇
We’re hiring a fully funded PhD student (UK & overseas) at the School of Computer Science, University of Sheffield, UK, starting October 2025!
https://t.co/NbbiQzkZFa
📢 I’ll be admitting PhD students to Columbia CS in the heart of NYC 🗽—the most vibrant city in the world! 🌆
If you're passionate about advancing robot learning and envision a future where robots 🤖 are part of our daily lives, apply to join my group: https://t.co/xz22EUZZor
I’m hiring PhD students in computer science at Columbia!
Our lab will tackle core challenges in understanding and controlling neural models that interact with language.
For example:
- methods for LLM control
- discoveries of LLM properties
- pretraining for understanding
@jbhuang0604 We rely on image-text data because we need to evaluate the models in natural language. If we could evaluate the models by image nature (like MAE could predict an object by seeing its shadow), vision self-supervised models would have interesting properties like LLMs.
@sainingxie @jbhuang0604 I totally agree. Self-supervised pretraining in text is actually supervised learning. Such pretraining data can be collected like self-supervised training data, without annotation. However, vision tasks cannot work this way.
📢 Excited to share our recent work on Large Multimodal Models: ConvLLaVA. Instead of encoding multiple image patches or using multiple encoders, we use a hierarchical backbone, ConvNeXt, to achieve high-resolution understanding.
https://t.co/eIEYDE76mV
Key ideas:
1. Optimizing the representation of ConvNeXt. We find that simply updating it is good enough.
2. Training a successive stage for ConvNeXt to further compress the visual tokens.
ConvLLaVA compresses visual features by 64x, compared with 14x of LLaVA-1.5.
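To see what those ratios could mean for token counts, here is a rough sketch. It assumes the 64x and 14x figures are per-side downsampling ratios, which the post does not state explicitly, and the resolutions are illustrative picks rather than reported settings. `visual_tokens` is a hypothetical helper, not part of either codebase.

```python
def visual_tokens(image_size: int, downsample: int) -> int:
    """Visual token count for a square image.

    Assumption: `downsample` is a per-side spatial ratio, so the
    feature map is (image_size // downsample) tokens on each side.
    """
    side = image_size // downsample
    return side * side


# Ratios quoted in the post; treating them as per-side strides is an assumption.
LLAVA_15_RATIO = 14
CONVLLAVA_RATIO = 64

# Illustrative resolutions only (not taken from the post).
for size in (336, 768, 1536):
    llava = visual_tokens(size, LLAVA_15_RATIO)
    conv = visual_tokens(size, CONVLLAVA_RATIO)
    print(f"{size}px  LLaVA-1.5-style: {llava:6d} tokens  ConvLLaVA-style: {conv:5d} tokens")
```

Under that reading, a 1536px input at 64x yields the same number of tokens as a 336px input at 14x, which is one way to interpret the high-resolution claim.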