Thong

@thongonx

AI @Snowflake. Ex-physicist @Caltech & @CERN. Fmr Principal Science Mgr @Microsoft.

Seattle, WA

Joined October 2009

320 Following

182 Followers

352 Posts

Thong @thongonx

2 days ago

Very honored to be a part of this project! By decoupling GPU orchestration from high-level RL algorithms, ArcticRL introduces a unified open-source backend that houses the modern RL optimization stack to deliver up to 3.5x e2e training speedup.

2 days ago

After many months of intense work the @Snowflake AI Research team is happy to present to you the new open source project: Arctic RL

https://t.co/B5EgRoSOCb

- Arctic RL integrates with VeRL and SkyRL today; enable ZoRRo with one config flag, no code changes required
- ZoRRo delivers up to 6x actor-update acceleration and a 3.5x end-to-end training speedup, reducing Arctic-Text2SQL-R2 training from ~5 days to ~36 hours on 32 H200 GPUs
- Arctic-Text2SQL-R2 achieved higher accuracy scores (48.7) than Gemini 3.1 Pro (47.9) and Claude 4.7 (47.3) on Snowflake's evaluated enterprise SQL benchmark under the tested conditions
- Two open source recipes ship with this release: a text-to-SQL recipe that improved BIRD dev accuracy from 59.92% to 70.35%, and a multi-hop QA recipe that improved average accuracy from 69.6% to 72.3%
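
As a quick sanity check, the arithmetic behind the figures quoted in the announcement can be sketched in Python. The inputs are the rounded values from the post (~5 days, ~36 hours, and the before/after accuracies). The variable names are illustrative, and because the durations are approximate, the computed ratio is approximate too.

```python
# Rounded figures taken from the announcement; names are illustrative.
baseline_hours = 5 * 24   # "~5 days"
zorro_hours = 36          # "~36 hours"

# Ratio of wall-clock training times implied by the rounded durations.
speedup = baseline_hours / zorro_hours

# Accuracy changes reported for the two recipes, in percentage points.
bird_dev_gain = 70.35 - 59.92
multi_hop_gain = 72.3 - 69.6

print(f"implied end-to-end speedup: {speedup:.2f}x")
print(f"BIRD dev gain: {bird_dev_gain:.2f} points")
print(f"multi-hop QA gain: {multi_hop_gain:.1f} points")
```

The implied ratio of roughly 3.3x is in line with the "up to 3.5x" quoted in the post, given that both durations are rounded.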


thongonx retweeted

2 days ago

After many months of intense work the @Snowflake AI Research team is happy to present to you the new open source project: Arctic RL

https://t.co/B5EgRoSOCb

- Arctic RL integrates with VeRL and SkyRL today; enable ZoRRo with one config flag, no code changes required
- ZoRRo delivers up to 6x actor-update acceleration and a 3.5x end-to-end training speedup, reducing Arctic-Text2SQL-R2 training from ~5 days to ~36 hours on 32 H200 GPUs
- Arctic-Text2SQL-R2 achieved higher accuracy scores (48.7) than Gemini 3.1 Pro (47.9) and Claude 4.7 (47.3) on Snowflake's evaluated enterprise SQL benchmark under the tested conditions
- Two open source recipes ship with this release: a text-to-SQL recipe that improved BIRD dev accuracy from 59.92% to 70.35%, and a multi-hop QA recipe that improved average accuracy from 69.6% to 72.3%


Thong @thongonx

over 1 year ago

@sytelus Nit: Entropy measures chaos / uncertainty of a **system**, not the data itself. You would always want to reduce entropy by adding low-probability event data. Synthetic data typically follows the existing distribution (high-probability events), hence does not significantly reduce entropy.
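
One way to read the point about low-probability events is through Shannon surprisal, -log2(p): the rarer an event, the more information observing it carries. The following is a minimal Python sketch of that idea. The probabilities 0.9 and 0.01 are made-up values chosen for illustration, not figures from the thread, and `surprisal_bits` is a hypothetical helper name.

```python
import math

def surprisal_bits(p: float) -> float:
    """Information content, in bits, of observing an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

# Hypothetical probabilities, chosen only for illustration.
common_event = 0.9   # e.g. a sample that follows the existing distribution
rare_event = 0.01    # e.g. a low-probability sample

print(f"common event: {surprisal_bits(common_event):.3f} bits")
print(f"rare event:   {surprisal_bits(rare_event):.3f} bits")
```

Under these assumed probabilities the rare event carries far more information than the common one, which matches the intuition that data drawn from the already-likely region teaches a model little.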


Thong @thongonx

about 2 years ago

Physicists have been using fake data to build models for nearly a century. But no one has even bothered to call us out 🙃 https://t.co/sIAfV6kRm7



Thong @thongonx

about 2 years ago

https://t.co/X64CZTJSBQ


Thong @thongonx

about 2 years ago

Language models are leading a revolution into the new era: conversational computing. Read my latest piece on @wef https://t.co/Q5H9zOpbRA


thongonx retweeted

about 2 years ago

😔 RIP David Louis Goodstein @Caltech, April 5, 1939 - April 10, 2024 https://t.co/uQFHtk9FnQ
Famous opening paragraph of his "States of Matter" stat mech textbook https://t.co/8WH4GshACa


Thong @thongonx

over 2 years ago

Reddit ads algorithm is on another level https://t.co/7cDPMqrENW


Thong @thongonx

over 2 years ago

That Netflix paper analyzes embeddings from a simplistic linear matrix factorization setup without normalization and ML Twitter acts like it’s the end of RAG without even reading it. https://t.co/ZK85fOUake


Thong @thongonx

over 2 years ago

@dylan522p Isn’t GPT-4 also around 35% MFU? The RingAttention paper also reports 30-36% MFU, on par with LLaMa with memory-efficient attention & FFN.


Thong @thongonx

over 2 years ago

@ylecun @yaroslavvb
Step 1. Build an autoencoder with latent dimension ~ 2M.
Step 2. Flatten GPT-4 weights.
Step 3. Train and overfit the autoencoder with the flattened GPT-4 input.
Step 4. Extract the latent variable.
Congratulations, you’ve got the SOTA LLM in less than 8MB of information!
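
The "less than 8MB" punchline follows from simple arithmetic if each of the ~2M latent values is stored as a 32-bit float. The 4-byte precision is an assumption, since the post does not state it, and the names below are illustrative.

```python
# Size of the latent vector described in the post.
LATENT_DIM = 2_000_000   # "latent dimension ~ 2M" from the post
BYTES_PER_VALUE = 4      # assumption: float32 storage (not stated in the post)

latent_bytes = LATENT_DIM * BYTES_PER_VALUE
latent_megabytes = latent_bytes / 1_000_000   # decimal megabytes (MB)
latent_mebibytes = latent_bytes / 2**20       # binary mebibytes (MiB)

print(f"latent vector size: {latent_megabytes:.1f} MB ({latent_mebibytes:.2f} MiB)")
```

Under that assumption the latent vector is 8 MB in decimal units and a little under 8 in binary units, which is presumably where the figure in the joke comes from.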


Thong @thongonx

over 2 years ago

@satnam6502 @groq Is there any chance you could make the presentation public after this conference?


Thong @thongonx

over 2 years ago

Nonlinearity is the heart of deep learning. The superiority of Mamba over Transformers through linear modeling in state space is strikingly beautiful.


Thong @thongonx

over 2 years ago

The AlphaGeometry paper itself is great, but the media hype around it is unbearable. This is not AGI. A better comparison is Stockfish / AlphaGo / AlphaFold — marvelous AI-powered engines with embedded domain knowledge during construction, which aren’t generalizable.


Thong @thongonx

over 2 years ago

@eigenhector Any QFT book has a gentle-introduction-to-gauge-theory chapter :-) Also much more fun to read than a pure math textbook IMO


Thong @thongonx

over 2 years ago

@francoisfleuret @DaniloJRezende @pfau If simulation gave you new knowledge, we wouldn’t have to spend $20B building a new particle accelerator :-)


Thong @thongonx

over 2 years ago

Back in 2020 when I was at Google X, there was a QFT course taught by Lenny Susskind. I registered out of curiosity and ended up in the same room with Sergey. His understanding of field theory surpassed that of a lot of PhD students. Not surprised if he actually contributed to Gemini.

over 2 years ago

sergey brin being a core contributor on the gemini paper is peak technical founder https://t.co/ILqca6RjhT


Thong @thongonx

over 2 years ago

@sytelus @jeffboudier You can just configure fp8 during training directly, can’t you? This library is more about quantizing models trained in fp16 (on A100s) into fp8 for inference on H100.


Thong @thongonx

over 2 years ago

@ylecun There are multiple design pathways to achieve the same end goal. We would’ve ended up with a flappy airplane if we tried to copy a bird. Aerodynamics / first principles thinking gave us the airplane today. Same for AGI — it’s pointless to compare with the brain architecture.


Thong @thongonx

over 2 years ago

@ChrSzegedy Simplicity is a criterion, but putting too much weight on it has misled physicists over the past two decades.

