🚨 One question that has always intrigued me is the role of different ways to increase a model's capacity: parameters, parallelizable compute, or sequential compute?
We explored this through the lens of MoEs:
Want to send Trump a message before he escalates the war with Iran? Buy gas today. High oil prices are his kryptonite. Do it now. 🛢️ #PumpUpThePressure #NoWarCrimesInIran
Pay close attention
Rest assured that Israel's spies in Iran are not the American-educated. They pretend to be revolutionaries, and even in wartime they give themselves away by stoking sedition.
Tears of honor
Dr. Rahimi, project manager of the largest bridge in the Middle East, in Karaj:
This was a project for the people; it was a public project.
It had no military use, and none of the lies the media told apply to it. This is an engineering worksite.
What upsets me is that I failed to keep my promise to the people.
But they will take to the grave their wish that this bridge go unused.
By the end, this made me cry. Sir, let us come to Iran and work and live in peace. Let us put our expertise to use healing our own country's wounds. We are neither sellouts nor traitors; whatever disagreements we have with you come from love of Iran. How many of my friends must we watch return to work for Iran, only for you to destroy their jobs, homes, and lives on espionage charges and displace them again? One for bringing in the HPV vaccine, one for bringing in an international theater instructor, one for publishing a book ... What kind of fate is this? I hate living in a country that is bombing my parents! Do you understand what that means?
#نه_به_جنگ (No to war)
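The Tehran Province Natural Resources and Watershed Management Organization, which was officially pulverized in today's bombing. Were they building missiles and atomic bombs here too, fellow countryman? Why is it that wherever in the world there is a report on the destruction in Iran, some Iranian has written "the Revolutionary Guards were hiding here" or "they must have been building missiles here"? Stop this display of shamelessness, baseness, and depravity.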
For years I watched children of Gaza suffer. Helplessly all my poor heart could do was to study their faces, and every word, promising them I'd remember them forever.
And that's how I know what Israel is planning for my home country and our children now.
I understand this was unexpected for a lot of people: @Apple MLR has released a protein folding model! https://t.co/n6qpEvwByS. Here’s a summary of what SimpleFold is and what it represents:
- What is SimpleFold? A generative model that essentially treats protein folding almost exactly as if it were a text-to-image or text-to-3D problem.
- What are we sharing? A research paper and a codebase under an MIT license https://t.co/JnehdmQilR (looking forward to people contributing to it!). We are also releasing pre-trained checkpoints of different sizes so that researchers can best trade off performance against efficiency.
- Why protein folding? We are doing this work largely because protein folding is an excellent benchmark for structured data generation and multi-modality. Protein folding is a very interesting problem from a generative modeling perspective and we do research on generative modeling :)
- Why is it interesting? IMO SimpleFold is interesting because I believe in finding recipes (architectures, training objectives, etc.) that generalize across many different data modalities. If you are an ML expert in text-to-image or text-to-3D, you can now apply your latest and greatest architectural blocks or efficient samplers to protein folding with SimpleFold. I believe this is a net benefit for ML research and science in general.
Now getting more into the technical details:
- Our architecture is very simple (hence the name), just a stack of transformer blocks with time-step conditioning. This is important because it makes the model efficient at inference time. You can run SimpleFold directly on your Mac and get results quickly without data ever leaving your laptop.
- SimpleFold is not necessarily a model that “rejects” inductive biases; it just doesn’t enforce them directly in the architecture. For example, we apply rotation augmentation to all the protein structures during training. This makes the model “softly” invariant to this symmetry.
- There were some concerns online that data leakage from AFESM was driving SimpleFold's performance or making it overfit. We filtered the AFESM data so that the CASP14 sequences are not seen during training. In fact, we distilled structures from the AF2/ESMFold models, which have the same cutoff date as SimpleFold for PDB data. Both AF2 and ESMFold train on self-distilled datasets; we just train SimpleFold on a bigger set of distilled data.
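The rotation augmentation mentioned in the list above can be sketched roughly as follows. This is a minimal, generic illustration in plain Python, not SimpleFold's actual code: the function names `random_rotation_matrix` and `augment` are hypothetical, and centering the coordinates before rotating is an assumption made here for convenience.

```python
import math
import random

def random_rotation_matrix(rng):
    """Sample a 3x3 rotation matrix from a uniformly random unit quaternion."""
    # Normalizing four independent Gaussians gives a uniform point on the 3-sphere.
    w, x, y, z = (rng.gauss(0.0, 1.0) for _ in range(4))
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def augment(coords, rng):
    """Center the coordinates (an assumption), then rotate every point by one random rotation."""
    n = len(coords)
    center = [sum(p[i] for p in coords) / n for i in range(3)]
    rot = random_rotation_matrix(rng)
    out = []
    for p in coords:
        q = [p[i] - center[i] for i in range(3)]
        out.append([sum(rot[r][c] * q[c] for c in range(3)) for r in range(3)])
    return out

# Toy coordinates standing in for atom positions (made up for illustration).
coords = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0], [0.0, 1.0, 2.0]]
rotated = augment(coords, random.Random(0))
```

Each training step would see the same structure in a different orientation, so the model has to learn the symmetry from data instead of having it built into the architecture. Pairwise distances between points are unchanged by the augmentation.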
I want to thank my awesome team of collaborators, they are all rockstars.
That’s all, for now :)
Last year at @Apple MLR, we published a number of interesting papers like AIM, AIMv2, and Scaling laws for: Sparsity, Native Multimodal Models, Data mixing.
Today the team has open-sourced the training codebase we used for conducting this research!
https://t.co/WNvOWMkgm3
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders!
Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow 1/n 🧵
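The "fit small, extrapolate large" idea can be illustrated with a deliberately simplified sketch. Everything below is an assumption for illustration: it fits a plain power law `loss = a * size**(-b)` in log space, the data points are synthetic, and the names `fit_power_law` and `predict` are hypothetical. The laws in the thread also depend on the data mixture, which this sketch ignores.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size); returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict(a, b, size):
    """Evaluate the fitted power law at a given model size."""
    return a * size ** (-b)

# Synthetic "small-scale runs", generated from a known law purely for illustration.
small_sizes = [1e7, 3e7, 1e8, 3e8]
small_losses = [5.0 * s ** -0.3 for s in small_sizes]

a, b = fit_power_law(small_sizes, small_losses)
large_prediction = predict(a, b, 1e10)  # extrapolate well beyond the fitted range
```

With real, noisy runs the fit would not recover the parameters exactly, and the functional form itself becomes the main modeling choice.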
We have two awesome new videos on MLX at #WWDC25 this year.
- Learn all about MLX.
- Learn all about running LLMs locally with MLX.
@angeloskath, @shshnkp, myself, and others worked super hard to make these. Check them out and hope you find them useful!
Check out this post that has information about research from Apple that will be presented at ICLR 2025 in 🇸🇬 this week.
I will be at ICLR and will be presenting some of our work (led by @samira_abnar) at the SLLM (@sparseLLMs) workshop.
Happy to chat about JEPAs as well!
Our work on fine-grained control of LLMs and diffusion models via Activation Transport will be presented @iclr_conf as a spotlight ✨ Check out our new blog post https://t.co/dAJQtcETNX
We release a large scale study to answer the following:
- Is late fusion inherently better than early fusion for multimodal models?
- How do native multimodal models scale compared to LLMs?
- How can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
Training and scaling large multimodal models from scratch? This is the thread for you. In this new paper, we provide an extensive study with hundreds of runs, fitting scaling laws for early/late fusion models, MoEs, and exploring different data mixtures. Tons of cool findings.
We release AIMv2, the second iteration of the AIM family of large autoregressive vision encoders. This time we bring multimodality into the game 🔥
Paper: https://t.co/YpU6T8Pr9p
Repo: https://t.co/g1LO5rE5Y0
Model Gallery: https://t.co/j3jZ8TEtf5
@RamonDarioIT @DrewSteinman Our cost function is based only on FLOPs. To account for these factors (memory and communication costs), we would need to incorporate hardware specifications and other details, such as sharding, into the cost function…
@RamonDarioIT @DrewSteinman One thing to note is that, besides stability-related issues as we increase sparsity, memory and communication costs are not negligible in practice. At larger scales in particular, these would become bottlenecks, and hardware constraints would determine the efficient sparsity level.
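To make the point in these replies concrete, here is a back-of-the-envelope sketch of why a FLOPs-only cost function misses memory. It is not the cost function from the paper: the helper `moe_ffn_costs` is hypothetical, and it assumes a single MoE feed-forward layer with two weight matrices per expert, roughly 2 FLOPs per active parameter per token, sparsity defined as the fraction of inactive experts, and 2-byte weights. Router cost and communication are ignored.

```python
def moe_ffn_costs(d_model, d_ff, num_experts, top_k, bytes_per_param=2):
    """Rough per-layer costs for an MoE feed-forward block (illustrative assumptions only)."""
    params_per_expert = 2 * d_model * d_ff          # up- and down-projection, no biases
    total_params = num_experts * params_per_expert  # must all be stored somewhere
    active_params = top_k * params_per_expert       # actually used for each token
    return {
        "sparsity": 1 - top_k / num_experts,
        "flops_per_token": 2 * active_params,       # one multiply-add per active weight
        "weight_bytes": total_params * bytes_per_param,
    }

# Same compute per token, very different memory footprint (sizes made up for illustration).
dense = moe_ffn_costs(d_model=1024, d_ff=4096, num_experts=1, top_k=1)
sparse = moe_ffn_costs(d_model=1024, d_ff=4096, num_experts=64, top_k=1)
```

Under these assumptions the two configurations look identical to a FLOPs-only cost function, while the sparse one needs 64 times the weight memory, which is where hardware and sharding details start to matter.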