Five years ago, I left a comfortable software engineering job in Big Tech to start a PhD. Last year, I left the PhD to join Datology. Both decisions confused the people around me, and honestly both decisions were about the same thing: I wanted to do research. Not research as in chasing paper deadlines and applying for fellowships / grants, but research in the truest sense of the word - sitting with unsolved, sometimes previously unheard-of problems, contextualizing them, formulating them, exploring solutions to them.
I'd had a taste of research in college, flitting between disciplines, but never found something I felt truly passionate about until I came across deep learning. A field mixing empiricism, mathematics, and real-world impact all seamlessly - it made research the most exciting thing I'd ever done in my life. So in 2022 I started my PhD hoping for the chance to explore uncharted frontiers. Three years and several papers at the standard prestigious ML conferences later, I had technically done research. But I still didn't feel like I'd ever had the freedom, support, and resources to explore new and exciting ideas.
This is what brought me to Datology as an intern last summer. A hope to do research in the true sense - explore new ideas, supported by my peers and leaders, unconstrained by resources. And of course, about the data. At the end of the summer, I took a risk and stayed, putting my PhD on hold.
Since then, I've been lucky enough to grow into leading multimodal data curation at DatologyAI, and with our team we've tackled every challenge possible: the engineering and optimizing of a VLM training stack we built from scratch; the at-times frustrating but ultimately rewarding deep refining of VLM evals in our work DatBench (link); and of course a lot of exhilarating new research on DATA CURATION. But more than anything, I felt like I finally got to do research!!
I'd like to specifically thank @arimorcos and @leavittron who entrusted me with this opportunity, empowered me to do the best work of my life (so far), and mentored me to grow not only as a researcher but also as a leader. And a huge thanks to the @datologyai team that made research feel FUN again.
Today, we're releasing 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone. This is the culmination of the multimodal team at Datology's work over the past year.
At fixed architecture, recipe, and compute, varying only the pretraining data, we get +11.7pp at 2B across 20 public VLM benchmarks, beat InternVL3.5-2B by ~10pp at ~17x less training compute (without post-training), and hit near-frontier accuracy at 4B with 3.3x lower response FLOPs than Qwen3-VL-4B.
Take risks. Bet on yourself. I’m going to keep doing this. At least until my luck runs out :)
a 🧵
@nickfrosst I mean, your health (life expectancy being one indicator) is a function of your dietary and exercise patterns as well. The American healthcare system is by no means perfect, but ignoring other causal factors along the way seems myopic
@trq212 I'm on my fifth try with /ultraplan - every time I get an "ultraplan needs your input" and open the web session, it's in some stale state. Example below
We’re excited to announce our partnership with Thomson Reuters, a collaboration focused on unlocking the full potential of proprietary data to build the next generation of domain-specific AI.
By applying DatologyAI’s data curation pipeline for legal domain adaptation mid-training, the results were clear:
- +5% improvement on legal benchmarks and +2.5% on general-purpose evaluations after mid-training
- >2.5x amplification in post-training gains on Thomson Reuters’ private legal evals
- Achieved with <1% of the original pre-training token budget
These gains demonstrate that better data doesn’t just improve models, but multiplies the effectiveness of everything built on top of them.
As @schwarzjn_, Head of AI Research at Thomson Reuters, put it:
“DatologyAI delivered clear, measurable improvements across both public and our proprietary legal evaluations…demonstrating the strength and generalizability of their approach.”
This partnership shows what’s possible when proprietary data and advanced data curation come together — not just incremental gains, but compounding advantages across the entire model lifecycle.
We’re excited to continue building with Thomson Reuters to push the boundaries of domain AI.
#AI #MachineLearning #LegalTech #DataCuration #Partnerships
this exercise seems more like a test of how the models handle ambiguous/nonsensical messages (a normal human would probably push back and say 'wth'), but the models' politeness makes them try to reason around whatever the user is trying to say. seems completely detached from their supposed ability to automate work imo 🤷‍♂️