🔓🧬First big unlock from vibe science-ing: rapid access to publicly available datasets. Sounds basic. It isn't.
If you've ever tried to pull raw data from a paper you care about, you know: the metadata is a mess, the supplementary tables are unstructured, the file formats don't match, and by the time you've got it working you've lost half a day. For people without strong bioinformatics skills, it's often a dead end entirely. 1/🧵
@Oliver__Hahn Incredible tweet! I too have found huge value in paper-associated datasets, but felt the pain of cleaning them up. Nice that this pain might largely be over.
@ChrisHayduk IMO the problematic part is generating good-quality data, not analyzing it. The costs involved often make experimentation difficult to justify.
@adamlewisgreen Really interesting, thanks for sharing. How did you come across the 2022 statistics paper, and how did you know it was worth building on?
I'm rebuilding AlphaFold2 from scratch in pure PyTorch.
No frameworks on top of PyTorch. No copy-paste from DeepMind's repo. Just nn.Linear, einsum, and the 60-page supplementary paper.
The project is called minAlphaFold2, inspired by Karpathy's minGPT. The idea is simple: AlphaFold2 is one of the most important neural networks ever built, and there should be a version of it that a single person can sit down and read end-to-end in an afternoon.
Where it stands today:
- ~3,500 lines across 9 modules
- Full forward pass works: input embedding → Evoformer → Structure Module → all-atom 3D coordinates
- Every loss function from the paper (FAPE, torsion angles, pLDDT, distogram, structural violations)
- Recycling, templates, extra MSA stack, ensemble averaging — all implemented
- 50 tests passing
- Every module maps 1-to-1 to a numbered algorithm in the AF2 supplement
The Structure Module was the most satisfying part to build. Invariant Point Attention is genuinely beautiful — it does attention in 3D space using local reference frames so the whole thing is SE(3)-equivariant, and the math fits in about 150 lines of PyTorch.
What's next:
- Build the data pipeline (PDB structures + MSA features)
- Write the training loop
- Train on a small set of proteins and see what happens
The repo is public. If you've ever wanted to understand how AlphaFold2 actually works at the level of individual tensor operations, this is meant for you.
Repo: https://t.co/k25vl5th1y
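A rough sketch of why the "local reference frames" idea in Invariant Point Attention works, in plain Python rather than PyTorch. It is not taken from the minAlphaFold2 repo: the helper names, frames, and points below are all made up for illustration. Each residue carries a rigid frame (rotation, translation); query and key points live in those local frames, get mapped to global coordinates, and attention depends on the squared distance between them. Moving every frame by the same global rotation and translation leaves that distance unchanged. The real module adds learned weights, multiple heads, and multiple points per head.

```python
import math

def matvec(R, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(R[r][c] * v[c] for c in range(3)) for r in range(3)]

def matmul(A, B):
    """Multiply two 3x3 matrices."""
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_x(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def apply_frame(frame, point):
    """Map a point from a residue's local frame to global coordinates."""
    R, t = frame
    rotated = matvec(R, point)
    return [rotated[i] + t[i] for i in range(3)]

def compose(outer, inner):
    """Return the frame equivalent to applying `inner`, then `outer`."""
    R_o, t_o = outer
    R_i, t_i = inner
    moved_t = matvec(R_o, t_i)
    return (matmul(R_o, R_i), [moved_t[i] + t_o[i] for i in range(3)])

def squared_distance(frame_i, q_local, frame_j, k_local):
    """Squared distance between a query point and a key point in global space."""
    q = apply_frame(frame_i, q_local)
    k = apply_frame(frame_j, k_local)
    return sum((a - b) ** 2 for a, b in zip(q, k))

# Hypothetical frames and local points for two residues i and j.
frame_i = (rot_z(0.3), [1.0, 0.0, 2.0])
frame_j = (rot_x(-1.1), [0.5, 4.0, -1.0])
q_local = [0.2, -0.7, 1.5]
k_local = [1.0, 0.3, -0.4]

# An arbitrary global rigid motion applied to the whole protein.
global_move = (matmul(rot_z(2.0), rot_x(0.7)), [10.0, -3.0, 5.0])

before = squared_distance(frame_i, q_local, frame_j, k_local)
after = squared_distance(compose(global_move, frame_i), q_local,
                         compose(global_move, frame_j), k_local)
print(before, after)  # the two values agree up to floating-point error
```

Because the attention logits only see distances like this one, they cannot depend on how the protein happens to be oriented in space.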
In January, @jonhoo, @jjgort, and I returned to @MIT_CSAIL to teach Missing Semester, a class on topics missing from most CS programs—tools and techniques that everyone should know, like Bash, Git, CI/CD, and AI tools. Today, we’re releasing the course for free online!
🎁 We have a gift for you!
You've heard about skrub and would like to discover more? Or you've never heard of it, but struggle with data preprocessing?
📽️ Riccardo Cappuzzo gave an awesome talk at PyData, and it was recorded: you can have a look here 👉 https://t.co/Cgm8r6wLxH
@jeremyphoward Read his paper 'Data analysis and statistics: an expository overview' a year ago and it blew my mind. Taught me that looking at residuals of a fitted model is a core part of data analysis. Amazing that something written so long ago remains so relevant!
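A tiny illustration of the residuals point, not taken from the paper: fit a straight line to data that is actually curved, then look at what is left over. The data and the `fit_line` helper are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data with curvature that a straight line cannot capture.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

for x, r in zip(xs, residuals):
    print(x, round(r, 3))
```

The residuals are positive at both ends and negative in the middle. That U shape is the signal that the model is missing a curved term, something a single goodness-of-fit number would not show.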
@simonw The live coding approach is much more useful. I basically learned to program from watching programmers do this on YouTube. Seeing people get stuck and how they get unstuck is a goldmine of insight.
Just finished a post explaining how to use the Union-Find algorithm to prepare protein structure data for ML model training 😊
https://t.co/UGiiIynZAK
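For readers who haven't met Union-Find, here is a minimal sketch of one plausible use. This is an assumption, since the linked post isn't reproduced here: grouping proteins that are linked by pairwise similarity, so that a whole group lands on one side of a train/test split. The protein IDs and similarity pairs are hypothetical.

```python
class UnionFind:
    """Disjoint-set structure with path compression and union by size."""

    def __init__(self, items):
        self.parent = {item: item for item in items}
        self.size = {item: 1 for item in items}

    def find(self, item):
        # Walk up to the root of the set.
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node straight at the root.
        while self.parent[item] != root:
            next_item = self.parent[item]
            self.parent[item] = root
            item = next_item
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        # Union by size: attach the smaller tree under the larger one.
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]

def cluster(items, linked_pairs):
    """Group items into connected components given pairwise links."""
    uf = UnionFind(items)
    for a, b in linked_pairs:
        uf.union(a, b)
    groups = {}
    for item in items:
        groups.setdefault(uf.find(item), []).append(item)
    return list(groups.values())

# Hypothetical protein IDs and pairs judged "too similar" to separate.
proteins = ["protA", "protB", "protC", "protD", "protE", "protF"]
similar_pairs = [("protA", "protB"), ("protB", "protC"), ("protD", "protE")]

groups = cluster(proteins, similar_pairs)
print(groups)
```

protA and protC end up in the same group even though they were never compared directly, because both are linked to protB. Assigning whole groups to train or test avoids leaking near-duplicates across the split.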