My research lab @Mila_Quebec is recruiting for multiple PhD positions starting next Fall, working on #biofoundational models. In particular, we are looking for students to work on:
(1) Generative models (diffusion models, flow matching) for protein science (in particular, protein dynamics and design);
(2) LLMs for multiomics (genomics, single-cell RNA seq, proteomics);
(3) Building #generative_agents for biomedical applications with LLMs.
Please DM or email me through [email protected]
😎 Couldn't be more thankful to the awesome @itsbautistam for warmly hosting and mentoring me together with @YuyangW95; I also appreciate my brilliant collaborators @YizheZhangNLP, @thoma_gu, Navdeep and @jmsusskind for helping me so much. Gonna miss this time.
Wrapped up my internship at @Apple MLR and heading back to school. Really lucky to learn from & work with amazing folks here. Excited to see SimpleFold (https://t.co/bFESZSr0RA) out and moving forward 🔥 Proud to contribute, and to make a minuscule effort for the community w/ the team.
So, I understand this was unexpected for a lot of people: @Apple MLR has released a protein folding model! https://t.co/n6qpEvwByS. Here’s a summary of what SimpleFold is and what it represents:
- What is SimpleFold? A generative model that essentially treats protein folding almost exactly as if it were a text-to-image or text-to-3D problem.
- What are we sharing? A research paper and a codebase under an MIT license https://t.co/JnehdmQilR (looking forward to people contributing to it!). We are also releasing pre-trained checkpoints of different sizes so that researchers can best trade off performance against efficiency.
- Why protein folding? We are doing this work largely because protein folding is an excellent benchmark for structured data generation and multi-modality. Protein folding is a very interesting problem from a generative modeling perspective and we do research on generative modeling :)
- Why is it interesting? IMO SimpleFold is interesting because I believe in finding recipes (architectures, training objectives, etc.) that generalize across the board to many different data modalities. Let’s say you are an ML expert in text-to-image or text-to-3D, now you can apply your latest and greatest architectural blocks or efficient samplers to protein folding with SimpleFold. I believe this is a net benefit for ML research and science in general.
Now getting more into the technical details:
- Our architecture is very simple (hence the name), just a stack of transformer blocks with time-step conditioning. This is important because it makes the model efficient at inference time. You can run SimpleFold directly on your Mac and get results quickly without data ever leaving your laptop.
- SimpleFold is not necessarily a model that “rejects” inductive biases, it just doesn’t enforce them directly on the architecture. For example, we apply rotation augmentation to all the protein structures during training. This makes the model “softly” invariant to this symmetry.
- There were some concerns online about data leakage from AFESM, and about that leakage driving SimpleFold's performance or making it overfit. We filtered the AFESM data so that CASP14 sequences are not seen during training. As a matter of fact, we distilled structures from AF2/ESMFold models, which have the same PDB cutoff date as SimpleFold. Both AF2 and ESMFold train on self-distilled datasets; we just train SimpleFold on a bigger set of distilled data.
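For readers who want to see what the rotation augmentation mentioned in the thread can look like, here is a minimal sketch in plain Python. It is not the SimpleFold code: the function names `random_rotation_matrix` and `rotate` are hypothetical, the coordinates are toy values, and rotating about the origin is an assumption made for brevity.

```python
import math
import random

def random_rotation_matrix(rng):
    """Sample a 3x3 rotation matrix from a random unit quaternion."""
    # Normalizing four Gaussian samples gives a uniform point on the 3-sphere,
    # which corresponds to a uniformly distributed rotation.
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def rotate(coords, rot):
    """Apply rotation matrix `rot` to every (x, y, z) point in `coords`."""
    return [
        tuple(sum(rot[i][j] * p[j] for j in range(3)) for i in range(3))
        for p in coords
    ]

# Toy "structure": three atom positions (hypothetical values).
rng = random.Random(0)
structure = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 2.0, 0.5)]
augmented = rotate(structure, random_rotation_matrix(rng))
```

Drawing a fresh rotation every time a structure is sampled means the model sees the same protein in many orientations, which is one way to encourage the "soft" symmetry described above without building it into the architecture.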
I want to thank my awesome team of collaborators; they are all rockstars.
That’s all, for now :)
SimpleFold is out! I'm excited to see how the community receives this work from our research group at Apple, and I hope to see people trying out protein folding on their own laptops with MLX!
Huge thanks to all my gorgeous mentors and collaborators at Apple MLR: @YuyangW95 (project lead), Navdeep Jaitly, @jmsusskind, and @itsbautistam. 🚀 One of (a strike thru) the COOLEST things from my internship this summer!!
Excited to share our recent work on protein folding with a "deliberately" minimalist (not small; it scales up to 3B params) non-equivariant model built on general-purpose transformers + FM. How about folding proteins on your M-chip Mac in secs? Try it now! 😎
New preprint & open-source! 🚨 “SimpleFold: Folding Proteins is Simpler than You Think” (https://t.co/9f1bzk5crS). We ask: Do protein folding models really need expensive and domain-specific modules like pair representation? We build SimpleFold, a scalable 3B folding model that relies solely on general-purpose transformers + flow matching and is trained on 9M structures. SimpleFold supports easy deployment and efficient inference on consumer-level hardware with PyTorch/MLX (try it on your MacBook!) (1/n)
SimpleFold: Folding Proteins is Simpler than You Think
"we introduce SimpleFold, the first flow-matching based protein folding model that solely uses general purpose transformer blocks. Protein folding models typically employ computationally expensive modules involving triangular updates, explicit pair representations or multiple training objectives curated for this specific domain. Instead, SimpleFold employs standard transformer blocks with adaptive layers and is trained via a generative flow-matching objective with an additional structural term."
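As a rough illustration of what a generative flow-matching objective means, here is a minimal sketch in plain Python. It assumes the common straight-line interpolant between a noise sample and a data sample; the paper's exact formulation and its additional structural term are not reproduced here, and the names `interpolate`, `target_velocity`, and `flow_matching_loss` are hypothetical.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight-line path, constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x1, rng):
    """One-sample estimate of the flow-matching regression loss."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    t = rng.random()                         # time drawn from [0, 1)
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)         # model call (any callable here)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)

# Toy usage: a "model" that always predicts zero velocity.
data_point = [0.5, -1.0, 2.0]
loss = flow_matching_loss(lambda xt, t: [0.0] * len(xt), data_point, random.Random(0))
```

In a real model, `predict_velocity` would be the transformer conditioned on the time step, and the loss would be averaged over a batch of structures and sampled times.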
Wrapping up #ICML2025 on a high note — thrilled (and pleasantly surprised!) to win the Best Paper Award at @genbio_workshop 🎉
Big shoutout to the team that made this happen!
Paper: Forward-Only Regression Training of Normalizing Flows (https://t.co/2dMjkvF4qX)
@Mila_Quebec
What makes a great scientist? Most AI scientist benchmarks miss the key skill: designing and analyzing experiments.
🧪 We're introducing SciGym: the first simulated lab environment to benchmark #LLM on experimental design and analysis capabilities.
#AI4SCIENCE #ICML25
@jakublala Congrats Jakub, super fun to see the animation! How do you see the difference between BAGEL and the good old Rosetta for protein design? I guess they share the core idea? What shows more promise for BAGEL?