alvin @alvind319 - Twitter Profile

alvin @alvind319

28 days ago

Excited to share our latest iteration on VLM data curation :)

Siddharth Joshi

@sjoshi804

28 days ago

Five years ago, I left a comfortable software engineering job in Big Tech to start a PhD. Last year, I left the PhD to join Datology. Both decisions confused the people around me, and honestly both decisions were about the same thing: I wanted to do research. Not research as in chasing paper deadlines and applying for fellowships / grants, but research in the truest sense of the word - sitting with unsolved, sometimes previously unheard-of problems, contextualizing them, formulating them, exploring solutions to them.

I'd had a taste of research in college, flitting between disciplines, but never found something I felt truly passionate about until I came across deep learning. A field mixing empiricism, mathematics, and real-world impact all seamlessly - it made research the most exciting thing I'd ever done in my life. So in 2022 I started my PhD hoping for the chance to explore uncharted frontiers. Three years and several papers at the standard prestigious ML conferences later, I had technically done research. But I still didn't feel like I'd ever had the freedom, support, and resources to explore new and exciting ideas.

This is what brought me to Datology as an intern last summer. A hope to do research in the true sense - explore new ideas, supported by my peers and leaders, unconstrained by resources. And of course, about the data. At the end of the summer, I took a risk and stayed, putting my PhD on hold.

Since then, I've been lucky enough to grow into leading multimodal data curation at DatologyAI, and with our team we've tackled every challenge possible: the engineering and optimizing of a VLM training stack we built from scratch; the at-times frustrating but ultimately rewarding deep refining of VLM evals in our work DatBench (link); and of course a lot of exhilarating new research on DATA CURATION. But more than anything, I felt like I finally got to do research!!

I'd like to specifically thank @arimorcos and @leavittron who entrusted me with this opportunity, empowered me to do the best work of my life (so far), and mentored me to grow not only as a researcher but also as a leader. And a huge thanks to the @datologyai team that made research feel FUN again.

Today, we're releasing 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone. This is the culmination of the multimodal team at Datology's work over the past year.

At fixed architecture, recipe, and compute, varying only the pretraining data, we get +11.7pp at 2B across 20 public VLM benchmarks, beat InternVL3.5-2B by ~10pp at ~17x less training compute (without post-training), and hit near-frontier accuracy at 4B with 3.3x lower response FLOPs than Qwen3-VL-4B.

Take risks. Bet on yourself. I’m going to keep doing this. At least until my luck runs out :)

a 🧵

alvin @alvind319

2 months ago

@code_star

alvin @alvind319

2 months ago

Congrats on the release @arcee_ai @latkins !!! I couldn’t be more excited to see how far our customers are able to push the boundaries of frontier capabilities by leveraging our training data curation pipeline and pre-training with high quality data 🫡

Arcee.ai

@arcee_ai

2 months ago

Today we're releasing Trinity-Large-Thinking. Available now on the Arcee API, with open weights on Hugging Face under Apache 2.0. We built it for developers and enterprises that want models they can inspect, post-train, host, distill, and own.

alvind319 retweeted

DatologyAI @datologyai

3 months ago

New Datology Research: We expose "The Finetuner's Fallacy"

The standard approach to domain adaptation (pretrain on web data, finetune on your data) is leaving performance on the table.

Mixing just 1-5% domain data into pretraining, then finetuning, produces a strictly better model:

◾ 1.75x fewer tokens to reach the same domain loss
◾ 1B SPT model outperforms a 3B finetuned-only model
◾ +6pts MATH accuracy at 200B pretraining tokens
◾ Less forgetting of general knowledge

Tested across chemistry, symbolic music, and formal math proofs. SPT wins on every metric.

Led by @_christinabaek and @pratyushmaini, with the full Datology team.

alvin @alvind319

3 months ago

@code_star not you catching onto this meme bahahahah

alvind319 retweeted

Ricardo Monti @RicardoMonti9

4 months ago

1/ People often think better multilingual models must come at the cost of English performance. Not true. The constraint isn’t capacity, it’s data quality, and we can fix it.

Today @datologyAI shares ÜberWeb: a year of multilingual curation lessons, scaled to 20T+ tokens. https://t.co/mVCWogFTYd

alvin @alvind319

5 months ago

@leavittron @datologyai join @datologyai and yap

alvind319 retweeted

Haoli Yin

@HaoliYin

5 months ago

We cut VLM eval compute by >10× while INCREASING signal.
The secret? Most benchmark samples are noise:
→ 70% solvable without the image
→ 42% mislabeled or ambiguous
→ MCQ formats hide 35-point capability gaps
Presenting: DatBench
🧵 1/n https://t.co/4tJJnmgjvS

alvind319 retweeted

Chris Paxton

@chris_j_paxton

6 months ago

Scaling laws for robotics: large amounts of diverse but high-quality pretraining data allows for significant improvements in the low-data post-training regime.

alvind319 retweeted

Luke Merrick @lukemerrick_

6 months ago

Just dropped a new text embedding methodology. Fast as heck on CPU only and still great for document similarity analysis, clustering, and classification.

How? Use a tiny ReLU network to approximate a big transformer from lexical (term frequency / bag of words) features. https://t.co/IXfpZCVcgt

alvin @alvind319

6 months ago

Huge congrats to @latkins @stochasticchasm @arcee_ai on the trinity launch!! Was awesome to deliver and learn from our very own mixologists @code_star @_BrettLarsen 🫡

alvin @alvind319

7 months ago

@code_star

alvin @alvind319

7 months ago

@code_star the only bubble that matters

alvin @alvind319

7 months ago

@code_star this tweet goes non-trivial

alvind319 retweeted

JosH100

@josh_wills

7 months ago

Great talk on @datologyai trillion token synthetic data pipeline by @hurrycane and Fan Pan at #raysummit!

alvin @alvind319

7 months ago

@code_star it's actually tensorflow 32.0

alvind319 retweeted

Karan Goel

@krandiash

8 months ago

We've raised $100M from Kleiner Perkins, Index Ventures, Lightspeed, and NVIDIA. Today we're introducing Sonic-3 - the state-of-the-art model for realtime conversation. What makes Sonic-3 great: - Breakthrough naturalness - laughter and full emotional range - Lightning fast -

alvind319 retweeted

JosH100

@josh_wills

8 months ago

1/ Really looking forward to #PytorchConf this week in SF-- I've spent the last couple of months at @datologyai immersed in the DataLoader ecosystem (especially for our VLM stack) and I have a few topics I would love to discuss with folks (DMs are open, say hi if you see me, etc. etc.) 👇

alvind319 retweeted

Pratyush Maini

@pratyushmaini

10 months ago

1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼‍🍳
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance https://t.co/MUittjMqOO

alvin @alvind319

10 months ago

@code_star eval rich**
