Diana Pojar

GP @sparkcapital // databases, distributed systems, and developer tools // formerly @AmplifyPartners @GoogleCloud @Oracle 🌎🌱

5 months ago

Come have dinner with me

podiana retweeted

Latent.Space

@latentspacepod

10 months ago

🆕 pod: Better Data is All You Need https://t.co/acv2vxZseS A brief history of open LLM data corpuses: C4 -> Redpajama -> RefinedWeb -> FineWeb -> DCLM -> BetterWeb @arimorcos of @datologyai drops by to tell us about their automated Data Curation work, beating the DCLM baseline by 12x! also ft. 2025 updates on the state of Synthetic Data and the Return of Curriculum Learning!

podiana retweeted

Pratyush Maini

@pratyushmaini

11 months ago

1/ Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼‍🍳
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance https://t.co/MUittjMqOO

podiana retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

11 months ago

BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining "we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia."

11 months ago

It’s all about the data

Ari Morcos

@arimorcos

11 months ago

Always has been.

podiana retweeted

Ari Morcos

@arimorcos

about 1 year ago

Congratulations to our friends and partners @arcee_ai on the release of AFM-4.5B! With data powered by @datologyai, this model outperforms Gemma3-4B and is competitive with Qwen3-4B despite being trained on a fraction of the data.

podiana retweeted

about 1 year ago

We've definitely seen signs of this already — perhaps not surprisingly, post-training people tend to care more about the value of data. We see a number of companies turning to @datologyai for getting the most out of their existing datasets!

podiana retweeted

Ari Morcos

@arimorcos

about 1 year ago

We've improved our image-text curation significantly from our last blog post, now beating SigLIP2 through *data interventions alone* using vanilla CLIP.

So proud of @RicardoMonti9, @HaoliYin, @leavittron and the rest of the team! Check out the thread for all the details 👇 https://t.co/YsJsVlR9JN

about 1 year ago

Hey folks! I'll be at the Snowflake Summit on June 2–5 and the Databricks Data + AI Summit on June 9–12, so if anyone else is in SF during that time and wants to catch up over lunch or coffee, please reach out!

podiana retweeted

over 1 year ago

Definitely a paradigm shift we're still learning to navigate intelligently as an industry. Knowing when and how to use AI tools effectively is becoming essential.

podiana retweeted

Matthew Leavitt

@leavittron

over 1 year ago

🧵We’ve spent the last few months at @datologyai building a state-of-the-art data curation pipeline and I’m SO excited to share our first results: we curated image-text pretraining data and massively improved CLIP model quality, training speed, and inference efficiency 🔥🔥🔥

about 2 years ago

Hey! I’m at the Databricks Data + AI Summit - let me know if you’re here and want to catch up!

about 2 years ago

@umang Wow

about 2 years ago

@sarahcat21 Congrats! She’s adorable 😍

about 3 years ago

@johnnyrodgersis La Cabra for coffee ❤️

over 3 years ago

🫣

over 3 years ago

After like a month of chasing why my WiFi keeps breaking down periodically, it turns out that it's caused by some weird bug in WiFi calling on iPhones that floods the network with ESP UDP packets (to the point where it was doing 500Mbit+ of this)
