Pretraining data curation alone — no SFT, no RL — within 1.8pp of Qwen3-VL-2B at ~87× less train compute. New VLM research from our team. https://t.co/zGLEjLspN9
all the memes, all at once.
@datologyai back with another banger = frontier VLM data curation this time!
beating InternVL3.5, and nearly matching SOTA open source VLM Qwen3.5 at 150x less training compute
with DATA CURATION alone
I promise you this is worth your time - the paper & thread below have a lot more interesting & exciting findings!
This is one unexpected result that I'm especially proud of.
With reasoning models / agents etc., serving models is getting INSANELY expensive.
What we show is that data curation doesn't just save you money once at training time, but keeps saving you money every day - with each model inference being more efficient.
You can imagine how this advantage compounds steadily - cheaper model outputs, cheaper test-time scaling, cheaper agentic loops - all just with the DATA.
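A back-of-the-envelope sketch of that compounding, in Python. Everything here is an assumption for illustration: the function names, the traffic numbers, and the idea of measuring serving cost purely in FLOPs per response. The only input meant to come from real measurements is the efficiency ratio (baseline FLOPs per response divided by the curated model's FLOPs per response).

```python
def cumulative_flops(flops_per_response: float, responses_per_day: int, days: int) -> float:
    """Total inference FLOPs spent over a serving period (hypothetical cost model)."""
    return flops_per_response * responses_per_day * days


def flops_saved(
    baseline_flops_per_response: float,
    efficiency_ratio: float,
    responses_per_day: int,
    days: int,
) -> float:
    """FLOPs saved by serving a model that is `efficiency_ratio` times cheaper per response.

    efficiency_ratio = baseline FLOPs per response / curated-model FLOPs per response.
    """
    if efficiency_ratio <= 0:
        raise ValueError("efficiency_ratio must be positive")
    curated_flops_per_response = baseline_flops_per_response / efficiency_ratio
    baseline_total = cumulative_flops(baseline_flops_per_response, responses_per_day, days)
    curated_total = cumulative_flops(curated_flops_per_response, responses_per_day, days)
    return baseline_total - curated_total


def fraction_saved(efficiency_ratio: float) -> float:
    """Share of inference compute saved, independent of traffic volume."""
    return 1.0 - 1.0 / efficiency_ratio


if __name__ == "__main__":
    # Placeholder traffic: 1e12 FLOPs per baseline response, 1M responses/day.
    for days in (1, 30, 365):
        saved = flops_saved(1e12, efficiency_ratio=2.0, responses_per_day=1_000_000, days=days)
        print(f"{days:>3} days: {saved:.3e} FLOPs saved")
```

The saved fraction stays constant, but the absolute savings grow linearly with every day of serving, which is the sense in which a one-time data improvement keeps paying out.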
Excited to announce @datologyai's VLM curation results!
Only varying data:
• +15.4pp across IID tasks
• +7.2pp across diverse fully OOD tasks
• Within spitting distance of Qwen3.5 w/ ~150x less train compute (and without post-training)
• Beats InternVL3.5 by ~10pp at ~17x less train compute
Better data ➡️ better VLMs
Five years ago, I left a comfortable software engineering job in Big Tech to start a PhD. Last year, I left the PhD to join Datology. Both decisions confused the people around me, and honestly both decisions were about the same thing: I wanted to do research. Not research as in chasing paper deadlines and applying for fellowships / grants, but research in the truest sense of the word - sitting with unsolved, sometimes previously unheard-of problems, contextualizing them, formulating them, exploring solutions to them.
I'd had a taste of research in college, flitting between disciplines, but never found something I felt truly passionate about until I came across deep learning. A field mixing empiricism, mathematics, and real-world impact all seamlessly - it made research the most exciting thing I'd ever done in my life. So in 2022 I started my PhD hoping for the chance to explore uncharted frontiers. Three years and several papers at the standard prestigious ML conferences later, I had technically done research. But I still didn't feel like I'd ever had the freedom, support, and resources to explore new and exciting ideas.
This is what brought me to Datology as an intern last summer. A hope to do research in the true sense - explore new ideas, supported by my peers and leaders, unconstrained by resources. And of course, about the data. At the end of the summer, I took a risk and stayed, putting my PhD on hold.
Since then, I've been lucky enough to grow into leading multimodal data curation at DatologyAI, and with our team we've tackled every challenge possible: the engineering and optimizing of a VLM training stack we built from scratch; the at-times frustrating but ultimately rewarding deep refining of VLM evals in our work DatBench (link); and of course a lot of exhilarating new research on DATA CURATION. But more than anything, I felt like I finally got to do research!!
I'd like to specifically thank @arimorcos and @leavittron who entrusted me with this opportunity, empowered me to do the best work of my life (so far), and mentored me to grow not only as a researcher but also as a leader. And a huge thanks to the @datologyai team that made research feel FUN again.
Today, we're releasing 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone. This is the culmination of the Datology multimodal team's work over the past year.
At fixed architecture, recipe, and compute, varying only the pretraining data, we get +11.7pp at 2B across 20 public VLM benchmarks, beat InternVL3.5-2B by ~10pp at ~17x less training compute (without post-training), and hit near-frontier accuracy at 4B with 3.3x lower response FLOPs than Qwen3-VL-4B.
Take risks. Bet on yourself. I’m going to keep doing this. At least until my luck runs out :)
a 🧵
We got VLMs closer to 20/20 vision with data curation alone.
No new architecture. No SFT. No RLHF/RLVR. Same recipe. Same compute.
WITH DATA CURATION ALONE
+11.7pp on 20 public evals
+11.3pp on DatBench
up to ~150× less training compute
🧵 1/23
Even if they don't get it right immediately, the success of Claude Code shows they will eventually.
This is the prime example of why every software company now needs to become an AI company. The same way nearly every company became an internet company.
And to do this in a sustainable, defensible way, you have to train your models. Start with fine-tuning if you must, but find a way to leverage the years of hard work you've put into building your product before this wave of AI.
That is your DATA.
Train on your data. Anthropic doesn't have it.
Train, or be trained on.
We’re excited to announce our partnership with Thomson Reuters, a collaboration focused on unlocking the full potential of proprietary data to build the next generation of domain-specific AI.
By applying DatologyAI’s data curation pipeline for legal domain adaptation mid-training, the results were clear:
- +5% improvement on legal benchmarks and +2.5% on general-purpose evaluations after mid-training
- >2.5x amplification in post-training gains on Thomson Reuters’ private legal evals
- Achieved with <1% of the original pre-training token budget
These gains demonstrate that better data doesn’t just improve models, but multiplies the effectiveness of everything built on top of them.
As @schwarzjn_, Head of AI Research at Thomson Reuters, put it:
“DatologyAI delivered clear, measurable improvements across both public and our proprietary legal evaluations…demonstrating the strength and generalizability of their approach.”
This partnership shows what’s possible when proprietary data and advanced data curation come together — not just incremental gains, but compounding advantages across the entire model lifecycle.
We’re excited to continue building with Thomson Reuters to push the boundaries of domain AI.
#AI #MachineLearning #LegalTech #DataCuration #Partnerships
Grateful to be working with @schwarzjn_ and his team at Thomson Reuters to help leverage their proprietary data to mid-train the world's best legal models!
Mid-training on domain-specific data can massively improve specialized performance without sacrificing general capabilities. And because the mid-trained model understands the domain better, post-training becomes far more effective.
Check out our case study linked below to learn more. And if you want to leverage your own proprietary data to build strong, domain-specific models where accuracy and reliability are key, please reach out to us @datologyai!
Two things I'm particularly proud of here:
1. The pretraining data are derived entirely from publicly available tokens.
2. No closed-source models were used in any part of the pretraining data curation pipeline.
insanely cool to see the RL infra we’ve been building with verifiers/prime-rl powering a true open frontier model
massive congrats to @arcee_ai
american open source is so unbelievably back :)
Trinity-Large-Thinking achieves state of the art results on Tau2 airline, and is at frontier level on Tau2 telecom.
It's also the #2 model on PinchBench, just behind Opus 4.6, and we're among the giants on BFCLv4
We have been so privileged to partner with Arcee since the beginning of their model building journey last year with AFM-4.5B through Trinity-Large, trained on 17T public tokens curated by @datologyai.
With the first thinking release today, Arcee is now at the frontier.
Most of the work (and cost) of building models is in the experiments beforehand, many of which are on data.
We've done the experiments for you, so that we can get the data right on the first try.
This means you can build models for a fraction of what you think it costs.
This is precisely "The Finetuner’s Fallacy".
We are seeing a consistent trend of businesses moving from:
[API➡️Finetune➡️Pretrain]
Time to stop paying The Finetuner's Tax. Read more here: https://t.co/7Lr41imX4e