Parth Doshi

27 days ago

sjoshi804's tweet photo. all the memes, all at once.

@datologyai back with another banger = frontier VLM data curation this time!

beating internVL3.5, and nearly matching SOTA open source VLM qwen3.5 at 150x less training compute

with DATA CURATION alone

I promise you this is worth your time - the paper & thread below have a lot more interesting & exciting findings!

0

19

4

7

2K

parthjdoshi retweeted

28 days ago

This is one unexpected result that I'm especially proud of. With reasoning models / agents etc., serving models is getting INSANELY expensive. What we show is that data curation doesn't just save you money once at training time, but keeps saving you money every day - with each model inference being more efficient. You can imagine how this advantage compounds steadily - cheaper model outputs, cheaper test-time scaling, cheaper agentic loops - all just with the DATA.

1

18

2

1

1K

parthjdoshi retweeted

28 days ago

Excited to announce @datologyai's VLM curation results! Only varying data:
• +15.4pp across IID tasks
• +7.2pp across diverse fully OOD tasks
• Within spitting distance of Qwen3.5 w/ ~150x less train compute (and without post-training)
• Beats InternVL3.5 by ~10pp at ~17x less train compute

Better data ➡️ better VLMs

0

26

7

5

2K

parthjdoshi retweeted

28 days ago

sjoshi804's tweet photo. Five years ago, I left a comfortable software engineering job in Big Tech to start a PhD. Last year, I left the PhD to join Datology. Both decisions confused the people around me, and honestly both decisions were about the same thing: I wanted to do research. Not research as in chasing paper deadlines and applying for fellowships / grants, but research in the truest sense of the word - sitting with unsolved, sometimes previously unheard-of problems, contextualizing them, formulating them, exploring solutions to them.

I'd had a taste of research in college, flitting between disciplines, but never found something I felt truly passionate about until I came across deep learning. A field mixing empiricism, mathematics, and real-world impact all seamlessly - it made research the most exciting thing I'd ever done in my life. So in 2022 I started my PhD hoping for the chance to explore uncharted frontiers. Three years and several papers at the standard prestigious ML conferences later, I had technically done research. But I still didn't feel like I'd ever had the freedom, support, and resources to explore new and exciting ideas.

This is what brought me to Datology as an intern last summer. A hope to do research in the true sense - explore new ideas, supported by my peers and leaders, unconstrained by resources. And of course, about the data. At the end of the summer, I took a risk and stayed, putting my PhD on hold.

Since then, I've been lucky enough to grow into leading multimodal data curation at DatologyAI, and with our team we've tackled every challenge possible: the engineering and optimizing of a VLM training stack we built from scratch; the at-times frustrating but ultimately rewarding deep refining of VLM evals in our work DatBench (link); and of course a lot of exhilarating new research on DATA CURATION. But more than anything, I felt like I finally got to do research!!

I'd like to specifically thank @arimorcos and @leavittron who entrusted me with this opportunity, empowered me to do the best work of my life (so far), and mentored me to grow not only as a researcher but also as a leader. And a huge thanks to the @datologyai team that made research feel FUN again.

Today, we're releasing 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone. This is the culmination of the multimodal team at Datology's work over the past year.

At fixed architecture, recipe, and compute, varying only the pretraining data, we get +11.7pp at 2B across 20 public VLM benchmarks, beat InternVL3.5-2B by ~10pp at ~17x less training compute (without post-training), and hit near-frontier accuracy at 4B with 3.3x lower response FLOPs than Qwen3-VL-4B.

Take risks. Bet on yourself. I’m going to keep doing this. At least until my luck runs out :)

a 🧵

10

335

34

144

791K

parthjdoshi retweeted

28 days ago

We got VLMs closer to 20/20 vision with data curation alone.

No new architecture. No SFT. No RLHF/RLVR. Same recipe. Same compute.

WITH DATA CURATION ALONE
+11.7pp on 20 public evals
+11.3pp on DatBench
up to ~150× less training compute

🧵 1/23

3

35

4

11

20K

parthjdoshi retweeted

28 days ago

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

0

58

13

32

14K

parthjdoshi retweeted

about 2 months ago

Even if they don't get it right immediately, the success of Claude Code shows they will eventually. This is the prime example of why every software company now needs to become an AI company. The same way nearly every company became an internet company. And to do this in a sustainable, defensible way, you have to train your models. Start with fine-tuning if you must, but find a way to leverage the years of hard work you've put into building your product before this wave of AI. That is your DATA. Train on your data. Anthropic doesn't have it. Train, or be trained on.

1

14

5

9

3K

parthjdoshi retweeted

DatologyAI @datologyai

2 months ago

We’re excited to announce our partnership with Thomson Reuters, a collaboration focused on unlocking the full potential of proprietary data to build the next generation of domain-specific AI.

By applying DatologyAI’s data curation pipeline for legal domain adaptation mid-training, the results were clear:

- +5% improvement on legal benchmarks and +2.5% on general-purpose evaluations after mid-training
- >2.5x amplification in post-training gains on Thomson Reuters’ private legal evals
- Achieved with <1% of the original pre-training token budget

These gains demonstrate that better data doesn’t just improve models, but multiplies the effectiveness of everything built on top of them.

As @schwarzjn_ , Head of AI Research at Thomson Reuters, put it:
“DatologyAI delivered clear, measurable improvements across both public and our proprietary legal evaluations…demonstrating the strength and generalizability of their approach.”

This partnership shows what’s possible when proprietary data and advanced data curation come together — not just incremental gains, but compounding advantages across the entire model lifecycle.

We’re excited to continue building with Thomson Reuters to push the boundaries of domain AI.

#AI #MachineLearning #LegalTech #DataCuration #Partnerships

1

35

5

9

10K

parthjdoshi retweeted

Aldo Gael Carranza @agcrnz

2 months ago

Grateful to be working with @schwarzjn_ and his team at Thomson Reuters to help leverage their proprietary data to mid-train the world's best legal models! Mid-training on domain specific data can massively improve specialized performance without sacrificing general capabilities. And because the mid-trained model understands the domain better, post-training becomes far more effective. Check out our case study linked below to learn more. And if you want to leverage your own proprietary data to build strong, domain specific models where accuracy and reliability are key, please reach out to us @datologyai!

2

39

9

5

3K

parthjdoshi retweeted

TechCrunch

@TechCrunch

2 months ago

I can’t help rooting for tiny open source AI model maker Arcee https://t.co/mYzp4NWx0r

14

150

24

22

28K

parthjdoshi retweeted

Matthew Leavitt

@leavittron

2 months ago

Two things I'm particularly proud of here:
1. The pretraining data are derived entirely from publicly-available tokens.
2. No closed-source models were used in any part of the pretraining data curation pipeline.

14

400

29

63

27K

parthjdoshi retweeted

will brown

@willccbb

2 months ago

insanely cool to see the RL infra we’ve been building with verifiers/prime-rl powering a true open frontier model

massive congrats to @arcee_ai

american open source is so unbelievably back :) https://t.co/22vmnuSWkL

8

364

28

71

20K

parthjdoshi retweeted

Pratyush Maini

@pratyushmaini

2 months ago

A great model coming out of a great team. It's been a privilege to have our tokens be eaten by this beast 🫡

0

39

3

1

3K

parthjdoshi retweeted

2 months ago

@arcee_ai, our bastion of open source in the western frontline

0

8

1

0

108

parthjdoshi retweeted

Lucas Atkins

@latkins

2 months ago

Trinity-Large-Thinking achieves state-of-the-art results on Tau2 airline, and is at frontier level on Tau2 telecom.
It's also the #2 model on PinchBench, just behind Opus 4.6, and we're among the giants on BFCLv4 https://t.co/JU0ePfgP2L

1

60

7

2

4K

parthjdoshi retweeted

Lucas Atkins

@latkins

2 months ago

This April Fool's Day, we decided to stop joking around. Trinity-Large-Thinking is out now.

20

379

37

45

69K

parthjdoshi retweeted

2 months ago

arimorcos's tweet photo. We have been so privileged to partner with Arcee since the beginning of their model building journey last year with AFM-4.5B through Trinity-Large, trained on 17T public tokens curated by @datologyai.

With the first thinking release today, Arcee is now at the frontier. https://t.co/bh12HFz95H

1

88

12

6

8K

parthjdoshi retweeted