Kaleigh Mentzer @KaleighMentzer - Twitter Profile

28 days ago

Five years ago, I left a comfortable software engineering job in Big Tech to start a PhD. Last year, I left the PhD to join Datology. Both decisions confused the people around me, and honestly both decisions were about the same thing: I wanted to do research. Not research as in chasing paper deadlines and applying for fellowships / grants, but research in the truest sense of the word - sitting with unsolved, sometimes previously unheard-of problems, contextualizing them, formulating them, exploring solutions to them.

I'd had a taste of research in college, flitting between disciplines, but never found something I felt truly passionate about until I came across deep learning. A field mixing empiricism, mathematics, and real-world impact all seamlessly - it made research the most exciting thing I'd ever done in my life. So in 2022 I started my PhD hoping for the chance to explore uncharted frontiers. Three years and several papers at the standard prestigious ML conferences later, I had technically done research. But I still didn't feel like I'd ever had the freedom, support, and resources to explore new and exciting ideas.

This is what brought me to Datology as an intern last summer. A hope to do research in the true sense - explore new ideas, supported by my peers and leaders, unconstrained by resources. And of course, about the data. At the end of the summer, I took a risk and stayed, putting my PhD on hold.

Since then, I've been lucky enough to grow into leading multimodal data curation at DatologyAI, and with our team we've tackled every challenge possible: the engineering and optimizing of a VLM training stack we built from scratch; the at-times frustrating but ultimately rewarding deep refining of VLM evals in our work DatBench (link); and of course a lot of exhilarating new research on DATA CURATION. But more than anything, I felt like I finally got to do research!!

I'd like to specifically thank @arimorcos and @leavittron who entrusted me with this opportunity, empowered me to do the best work of my life (so far), and mentored me to grow not only as a researcher but also as a leader. And a huge thanks to the @datologyai team that made research feel FUN again.

Today, we're releasing 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone. This is the culmination of the multimodal team at Datology's work over the past year.

At fixed architecture, recipe, and compute, varying only the pretraining data, we get +11.7pp at 2B across 20 public VLM benchmarks, beat InternVL3.5-2B by ~10pp at ~17x less training compute (without post-training), and hit near-frontier accuracy at 4B with 3.3x lower response FLOPs than Qwen3-VL-4B.

Take risks. Bet on yourself. I’m going to keep doing this. At least until my luck runs out :)

a 🧵



Kaleigh Mentzer @KaleighMentzer

3 months ago

Meme courtesy of @iamgroot42 🫡


Kaleigh Mentzer @KaleighMentzer

3 months ago

KaleighMentzer's tweet photo. https://t.co/l5wZH5rjl9

Christina Baek

@_christinabaek

3 months ago

Models are typically specialized to new domains by finetuning on small, high-quality datasets. We find that repeating the same dataset 10–50× starting from pretraining leads to substantially better downstream performance, in some cases outperforming larger models. 🧵

https://t.co/stFslu9Mv7


KaleighMentzer retweeted

Ari Morcos

@arimorcos

3 months ago

Love seeing the timeline wake up to the fact that data is the most underinvested area in ML. But let’s set the record straight: the world’s premier data research company isn't hypothetical. It already exists. It’s called @datologyai, and we’ve been building it for 2.5 years. 🧵



Kaleigh Mentzer @KaleighMentzer

4 months ago

🌎Making your model multilingual doesn't have to sacrifice English performance—you just need better data. @agcrnz, @RicardoMonti9, and I have been working on curating the best possible multilingual data with the team @datologyai, and it works! Check out the results 👇

https://t.co/GEgwlEpeZT

Ricardo Monti @RicardoMonti9

4 months ago

1/ People often think better multilingual models must come at the cost of English performance. Not true. The constraint isn’t capacity, it’s data quality, and we can fix it. Today @datologyAI shares ÜberWeb: a year of multilingual curation lessons, scaled to 20T+ tokens.

https://t.co/mVCWogFTYd


KaleighMentzer retweeted

Cody Blakeney

@code_star

4 months ago

Excited to announce the return of American OSS with Arcee Trinity Large. This model couldn't have been possible without the awesome collaboration of Modeling @arcee_ai, Infra @PrimeIntellect, and Data @datologyai. I can't say enough about how talented the whole team at Arcee is, being able to scale from their first MoE to a big boy like this in such a short time. Since the last data mix we have been in the lab pushing our midtraining and synthetic data to the limits. For Trinity Large we generated over 800B tokens of high quality synthetic code and 6.5T(!!!) tokens overall. We also added multilingual curation. This was a massive effort from the whole Datology family: scaling up the rephrasing workflows to support heterogeneous clusters efficiently (@isabelle226ku, @JackUrbs, @parthjdoshi, @haakonmongstad, @alvind319), pushing out midtraining and mixing (@_BrettLarsen), innovating on new code synthetic data (@amrokamal1997) and new math synthetic data (David Schwab), multilingual curation (@KaleighMentzer, @agcrnz, @RicardoMonti9), and of course all built on our great foundation of synthetic data (@pratyushmaini, Vineeth Dorna).


KaleighMentzer retweeted

Haoli Yin

@HaoliYin

5 months ago

We cut VLM eval compute by >10× while INCREASING signal.
The secret? Most benchmark samples are noise:
→ 70% solvable without the image
→ 42% mislabeled or ambiguous
→ MCQ formats hide 35-point capability gaps
Presenting: DatBench
🧵 1/n

https://t.co/4tJJnmgjvS


KaleighMentzer retweeted

Luke Merrick @lukemerrick_

6 months ago

Just dropped a new text embedding methodology. Fast as heck on CPU only and still great for document similarity analysis, clustering, and classification. How? Use a tiny ReLU network to approximate a big transformer from lexical (term frequency / bag of words) features.

https://t.co/IXfpZCVcgt


KaleighMentzer retweeted

JosH100

@josh_wills

8 months ago

1/ Really looking forward to #PytorchConf this week in SF-- I've spent the last couple of months at @datologyai immersed in the DataLoader ecosystem (especially for our VLM stack) and I have a few topics I would love to discuss with folks (DMs are open, say hi if you see me, etc. etc.) 👇


KaleighMentzer retweeted

Pratyush Maini

@pratyushmaini

10 months ago

1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼‍🍳
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance

https://t.co/MUittjMqOO


KaleighMentzer retweeted

Ari Morcos

@arimorcos

12 months ago

Congratulations to our friends and partners @arcee_ai on the release of AFM-4.5B! With data powered by @datologyai, this model outperforms Gemma3-4B and is competitive with Qwen3-4B despite being trained on a fraction of the data.


Kaleigh Mentzer @KaleighMentzer

12 months ago

Huge improvements from data alone! My teammates at @datologyai dropped some big results today:

Ricardo Monti @RicardoMonti9

about 1 year ago

. @datologyai is back: state of the art CLIP model performance using data curation alone 🚀

✅ state-of-the-art ViT-B/32 performance: ImageNet 1k 76.9% vs 74% reported by SigLIP2
✅ 8x training efficiency gains
✅ 2x inference efficiency gains
✅ Public model release

Details in the 🧵 thread below 👇



KaleighMentzer retweeted

Rahul Ponnala @RahulPonnala

over 2 years ago

We're incredibly honored to be named a Cool Vendor in the August 2023 Gartner® Cool Vendors™ in Cloud That Drive Business Disruption report. The @granica_ai efficiency platform enables organizations to drive rapid innovation & become disruptors. https://t.co/HUVwj0ijtt

https://t.co/bb326VgOL4


Kaleigh Mentzer @KaleighMentzer

almost 3 years ago

Now available here! #WiDSWorkshops https://t.co/JpEPr2RFRM


Kaleigh Mentzer @KaleighMentzer

almost 3 years ago

Join us to chat about public sector research collaborations and assigning kids to schools!


Kaleigh Mentzer @KaleighMentzer

almost 3 years ago

Come learn visualization tools for data science with me!


Kaleigh Mentzer @KaleighMentzer

about 3 years ago

Unrelated to my current work, but happy to announce that my summer work at @Livermore_Lab a few years back is finally published! Check out how we used neural networks with "phases" to model equations of state for inertial confinement fusion simulations. https://t.co/ZJdqkJ68Y4


Kaleigh Mentzer @KaleighMentzer

almost 4 years ago

Code here: https://t.co/uQmK17NNhs


Kaleigh Mentzer @KaleighMentzer

almost 4 years ago

A little bit of fun with NetworkX, OSMnx, and the YelpAPI... The shortest bike route to the top 10 coffee roasteries on Yelp in SF #DataVisualization #NetworkX #OSMnx #OpenStreetMap #python #maps #dataviz

https://t.co/CwpOdL9Xv2
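For readers curious how a "shortest route through a set of stops" can be computed, here is a minimal sketch. It is not the code linked from the tweet: it swaps NetworkX and OSMnx for a plain-Python Dijkstra so it runs with the standard library alone, and the `STREETS` graph, its node names, and its distances are hypothetical. It first computes shortest street distances between all stops, then brute-forces the visiting order, which stays feasible for around ten stops.

```python
import heapq
from itertools import permutations


def dijkstra(graph, source):
    """Return the shortest-path distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbor, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


def shortest_tour(graph, start, stops):
    """Brute-force the order of stops that minimizes total distance from start.

    The route is open: it ends at the last stop and does not return to start.
    """
    # Pairwise shortest distances between the start and every stop.
    table = {node: dijkstra(graph, node) for node in [start, *stops]}
    best_order, best_len = None, float("inf")
    for order in permutations(stops):
        path = [start, *order]
        length = sum(table[a][b] for a, b in zip(path, path[1:]))
        if length < best_len:
            best_order, best_len = list(order), length
    return best_order, best_len


# Hypothetical toy street network; edge weights are distances in km.
STREETS = {
    "home": {"a": 1.0, "b": 4.0},
    "a": {"home": 1.0, "b": 2.0, "c": 5.0},
    "b": {"home": 4.0, "a": 2.0, "c": 1.0},
    "c": {"a": 5.0, "b": 1.0},
}

order, length = shortest_tour(STREETS, "home", ["c", "a", "b"])
print(order, length)
```

On a real map, the graph would come from street data and the stops from a business listing; the ordering step is unchanged.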


KaleighMentzer retweeted

Nick Arnosti @NickArnosti

almost 4 years ago

This is a great example of market design: https://t.co/kxyjaEjzNr It is also a great example of the failed econ publishing process. The "new" mechanism started in 2005. Data from 2005-2011. I saw Canice's talk in 2013. What good does it do for the JPE to "publish" it in 2022?!?

