Five years ago, I left a comfortable software engineering job in Big Tech to start a PhD. Last year, I left the PhD to join Datology. Both decisions confused the people around me, and honestly both decisions were about the same thing: I wanted to do research. Not research as in chasing paper deadlines and applying for fellowships / grants, but research in the truest sense of the word - sitting with unsolved, sometimes previously unheard-of problems, contextualizing them, formulating them, exploring solutions to them.
I'd had a taste of research in college, flitting between disciplines, but never found something I felt truly passionate about until I came across deep learning. A field mixing empiricism, mathematics, and real-world impact all seamlessly - it made research the most exciting thing I'd ever done in my life. So in 2022 I started my PhD hoping for the chance to explore uncharted frontiers. Three years and several papers at the standard prestigious ML conferences later, I had technically done research. But I still didn't feel like I'd ever had the freedom, support, and resources to explore new and exciting ideas.
This is what brought me to Datology as an intern last summer: a hope to do research in the true sense, exploring new ideas, supported by my peers and leaders, unconstrained by resources. And of course, it was about the data. At the end of the summer, I took a risk and stayed, putting my PhD on hold.
Since then, I've been lucky enough to grow into leading multimodal data curation at DatologyAI, and with our team we've tackled every challenge possible: the engineering and optimizing of a VLM training stack we built from scratch; the at-times frustrating but ultimately rewarding deep refining of VLM evals in our work DatBench (link); and of course a lot of exhilarating new research on DATA CURATION. But more than anything, I felt like I finally got to do research!!
I'd like to specifically thank @arimorcos and @leavittron who entrusted me with this opportunity, empowered me to do the best work of my life (so far), and mentored me to grow not only as a researcher but also as a leader. And a huge thanks to the @datologyai team that made research feel FUN again.
Today, we're releasing 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone. This is the culmination of the Datology multimodal team's work over the past year.
At fixed architecture, recipe, and compute, varying only the pretraining data, we get +11.7pp at 2B across 20 public VLM benchmarks, beat InternVL3.5-2B by ~10pp at ~17x less training compute (without post-training), and hit near-frontier accuracy at 4B with 3.3x lower response FLOPs than Qwen3-VL-4B.
Take risks. Bet on yourself. I’m going to keep doing this. At least until my luck runs out :)
a 🧵
Congrats on the release @arcee_ai @latkins !!!
I couldn’t be more excited to see how far our customers are able to push the boundaries of frontier capabilities by leveraging our training data curation pipeline and pre-training with high quality data 🫡
Today we're releasing Trinity-Large-Thinking.
Available now on the Arcee API, with open weights on Hugging Face under Apache 2.0.
We built it for developers and enterprises that want models they can inspect, post-train, host, distill, and own.
New Datology Research: We expose "The Finetuner's Fallacy"
The standard approach to domain adaptation (pretrain on web data, finetune on your data) is leaving performance on the table.
Mixing just 1-5% domain data into pretraining, then finetuning, produces a strictly better model:
◾ 1.75x fewer tokens to reach the same domain loss
◾ 1B SPT model outperforms a 3B finetuned-only model
◾ +6pts MATH accuracy at 200B pretraining tokens
◾ Less forgetting of general knowledge
Tested across chemistry, symbolic music, and formal math proofs. SPT wins on every metric.
Led by @_christinabaek and @pratyushmaini, with the full Datology team.
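For anyone who wants to picture what "mixing 1-5% domain data into pretraining" could look like on the data-loading side, here is a minimal sketch. It is not the pipeline from the paper: `mix_streams`, `domain_fraction`, and the toy document names are hypothetical, and it assumes the fraction is applied per document (a real setup would more likely count tokens) and that the small domain set is simply repeated.

```python
import itertools
import random

def mix_streams(web_docs, domain_docs, domain_fraction=0.02, seed=0):
    """Yield documents, picking a domain doc with probability domain_fraction.

    Assumption: the mixing ratio is per document, not per token.
    """
    rng = random.Random(seed)
    web_iter = iter(web_docs)
    # The domain set is small, so it is cycled (repeated) as needed.
    domain_iter = itertools.cycle(domain_docs)
    while True:
        source = domain_iter if rng.random() < domain_fraction else web_iter
        try:
            yield next(source)
        except StopIteration:
            # Stop once the web stream (or an empty domain set) runs dry.
            return

# Toy usage: 5% domain documents mixed into a stream of web documents.
web = (f"web-{i}" for i in range(10_000))
domain = ["chem-0", "chem-1", "chem-2"]
batch = list(itertools.islice(mix_streams(web, domain, domain_fraction=0.05), 1000))
domain_share = sum(doc.startswith("chem") for doc in batch) / len(batch)
print(f"domain share in the mixed stream: {domain_share:.3f}")
```

The printed share should land near the requested 5%, with some sampling noise. Finetuning on the domain data would then follow as a separate stage.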
1/ People often think better multilingual models must come at the cost of English performance. Not true. The constraint isn’t capacity, it’s data quality, and we can fix it.
Today @datologyAI shares ÜberWeb: a year of multilingual curation lessons, scaled to 20T+ tokens.
We cut VLM eval compute by >10× while INCREASING signal.
The secret? Most benchmark samples are noise:
→ 70% solvable without the image
→ 42% mislabeled or ambiguous
→ MCQ formats hide 35-point capability gaps
Presenting: DatBench
🧵 1/n
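To make the "solvable without the image" point concrete, here is a rough sketch of how one might measure and filter such samples. This is not the DatBench procedure: the sample schema (`question`, `answer`), the helper names, and the toy text-only answerer are all assumptions for illustration. In practice the answerer would be a model queried with the question but no image.

```python
def blind_solvable_rate(samples, answer_without_image):
    """Fraction of samples answered correctly from the question text alone."""
    if not samples:
        return 0.0
    hits = sum(answer_without_image(s["question"]) == s["answer"] for s in samples)
    return hits / len(samples)

def keep_image_dependent(samples, answer_without_image):
    """Keep only samples the text-only answerer gets wrong."""
    return [s for s in samples if answer_without_image(s["question"]) != s["answer"]]

# Toy benchmark with made-up questions and answers.
samples = [
    {"question": "What color is the sky on a clear day?", "answer": "blue"},
    {"question": "What number is on the jersey?", "answer": "23"},
    {"question": "How many legs does a spider have?", "answer": "8"},
    {"question": "What is written on the sign?", "answer": "exit"},
]

# Stand-in for a text-only model: it only "knows" common-sense answers.
prior = {
    "What color is the sky on a clear day?": "blue",
    "How many legs does a spider have?": "8",
}

def text_only_guess(question):
    return prior.get(question, "unknown")

rate = blind_solvable_rate(samples, text_only_guess)
kept = keep_image_dependent(samples, text_only_guess)
print(rate, [s["question"] for s in kept])
```

Samples that survive the filter are the ones where looking at the image actually matters, which is the signal a VLM eval is supposed to measure.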
Scaling laws for robotics: large amounts of diverse but high-quality pretraining data allow for significant improvements in the low-data post-training regime.
Just dropped a new text embedding methodology. Fast as heck on CPU only and still great for document similarity analysis, clustering, and classification.
How? Use a tiny ReLU network to approximate a big transformer from lexical (term frequency / bag of words) features.
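A toy sketch of that idea, in pure Python so it runs anywhere. This is not the released method: the class and function names, the single hidden layer, the squared-error objective, and the made-up "teacher" vectors (standing in for embeddings from a big transformer) are all assumptions for illustration.

```python
import random

def bag_of_words(text, vocab):
    """Term-frequency vector over a fixed vocabulary (unknown words are ignored)."""
    counts = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            counts[vocab[token]] += 1.0
    return counts

class TinyReluNet:
    """One hidden ReLU layer mapping term frequencies to an embedding."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(self.w1, self.b1)]
        hidden = [max(0.0, p) for p in pre]
        out = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(self.w2, self.b2)]
        return hidden, out

    def embed(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr=0.05):
        """One SGD step on 0.5 * squared error against the teacher embedding."""
        hidden, out = self.forward(x)
        err = [o - t for o, t in zip(out, target)]
        # Gradient at the hidden layer, computed before w2 is updated.
        grad_hidden = [
            sum(err[k] * self.w2[k][j] for k in range(len(err))) if hidden[j] > 0 else 0.0
            for j in range(len(hidden))
        ]
        for k, e in enumerate(err):
            for j, h in enumerate(hidden):
                self.w2[k][j] -= lr * e * h
            self.b2[k] -= lr * e
        for j, g in enumerate(grad_hidden):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * g * xi
            self.b1[j] -= lr * g

def total_loss(net, features, targets):
    return sum(
        0.5 * sum((o - t) ** 2 for o, t in zip(net.embed(x), tgt))
        for x, tgt in zip(features, targets)
    )

docs = ["cats purr softly", "dogs bark loudly", "cats and dogs play"]
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d.split()}))}
# Made-up stand-ins for embeddings produced by a large transformer.
teacher = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

net = TinyReluNet(len(vocab), n_hidden=8, n_out=2)
features = [bag_of_words(d, vocab) for d in docs]
loss_before = total_loss(net, features, teacher)
for _ in range(300):
    for x, t in zip(features, teacher):
        net.train_step(x, t)
loss_after = total_loss(net, features, teacher)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

At inference time only `bag_of_words` and `embed` are needed, which is why this kind of student can run on CPU: the expensive transformer is used once, offline, to produce the targets.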
We've raised $100M from Kleiner Perkins, Index Ventures, Lightspeed, and NVIDIA.
Today we're introducing Sonic-3 - the state-of-the-art model for realtime conversation.
What makes Sonic-3 great:
- Breakthrough naturalness - laughter and full emotional range
- Lightning fast -
1/ Really looking forward to #PytorchConf this week in SF-- I've spent the last couple of months at @datologyai immersed in the DataLoader ecosystem (especially for our VLM stack) and I have a few topics I would love to discuss with folks (DMs are open, say hi if you see me, etc. etc.) 👇
1/ Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens 🧑🏼‍🍳
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance