Very honored to be a part of this project!
By decoupling GPU orchestration from high-level RL algorithms, Arctic RL introduces a unified open-source backend that houses the modern RL optimization stack and delivers up to a 3.5x end-to-end training speedup.
After many months of intense work the @Snowflake AI Research team is happy to present to you the new open source project: Arctic RL
https://t.co/B5EgRoSOCb
- Arctic RL integrates with VeRL and SkyRL today; enable ZoRRo with one config flag, no code changes required
- ZoRRo delivers up to 6x actor-update acceleration and a 3.5x end-to-end training speedup, reducing Arctic-Text2SQL-R2 training from ~5 days to ~36 hours on 32 H200 GPUs
- Arctic-Text2SQL-R2 achieved higher accuracy scores (48.7) than Gemini 3.1 Pro (47.9) and Claude 4.7 (47.3) on Snowflake's evaluated enterprise SQL benchmark under the tested conditions
- Two open source recipes ship with this release: a text-to-SQL recipe that improved BIRD dev accuracy from 59.92% to 70.35%, and a multi-hop QA recipe that improved average accuracy from 69.6% to 72.3%
@sytelus Nit: Entropy measures chaos / uncertainty of a **system**, not the data itself. You would always want to reduce entropy by adding low-probability event data. Synthetic data typically follows the existing distribution (high-probability events), hence does not significantly reduce entropy.
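One way to make the reply above concrete is in terms of surprisal. This is a minimal sketch, not the author's own formulation: the function names are hypothetical, and it only illustrates that a low-probability event carries more bits of information than a high-probability one.

```python
import math
from typing import Iterable


def surprisal_bits(p: float) -> float:
    """Information content, in bits, of observing an event with probability p."""
    return -math.log2(p)


def shannon_entropy_bits(probs: Iterable[float]) -> float:
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# A rare event (p = 1/8) carries more information than a common one (p = 1/2).
rare = surprisal_bits(0.125)
common = surprisal_bits(0.5)

# A uniform distribution over four outcomes is more uncertain than a fair coin.
coin = shannon_entropy_bits([0.5, 0.5])
four_way = shannon_entropy_bits([0.25, 0.25, 0.25, 0.25])
```

Under this reading, samples drawn from the high-probability region of an existing distribution have low surprisal, which is consistent with the claim that such data tells the model little it does not already encode.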
😔 RIP David Louis Goodstein @Caltech April 5 1939-April 10 2024 https://t.co/uQFHtk9FnQ
Famous opening paragraph of his "States of Matter" stat mech textbook
That Netflix paper analyzes embeddings from a simplistic linear matrix factorization setup without normalization and ML Twitter acts like it’s the end of RAG without even reading it.
@dylan522p Isn’t GPT-4 also around 35% MFU? The RingAttention paper also reports 30-36% MFU, on par with LLaMA with memory-efficient attention & FFN.
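For context on the MFU figures in the reply above, here is a minimal sketch of how model FLOPs utilization is commonly estimated. It assumes the usual approximation of 6 FLOPs per parameter per token for a training step (forward plus backward, attention FLOPs ignored). The model size, throughput, and GPU count are hypothetical values chosen for illustration and are not taken from GPT-4, RingAttention, or LLaMA; 312 TFLOP/s is the A100 dense BF16 peak.

```python
def model_flops_utilization(
    n_params: float,
    tokens_per_second: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Estimate MFU as achieved model FLOP/s divided by aggregate peak FLOP/s.

    Uses the ~6 * N FLOPs-per-token approximation for training
    (an assumption; it ignores attention FLOPs).
    """
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    peak_flops_per_second = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical run: a 7B-parameter model on 8 A100s at 20,800 tokens/s overall.
A100_PEAK_BF16 = 312e12  # dense BF16 peak, FLOP/s
mfu = model_flops_utilization(
    n_params=7e9,
    tokens_per_second=20_800,
    num_gpus=8,
    peak_flops_per_gpu=A100_PEAK_BF16,
)
print(f"MFU = {mfu:.1%}")
```

With these made-up numbers the estimate comes out at about 35%, the same range as the figures quoted in the reply.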
@ylecun @yaroslavvb Step 1. Build an autoencoder with latent dimension ~ 2M.
Step 2. Flatten GPT-4 weights.
Step 3. Train and overfit the autoencoder with the flattened GPT-4 input.
Step 4. Extract the latent variable.
Congratulations, you’ve got the SOTA LLM in less than 8MB of information!
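The size quoted in the punchline follows from simple arithmetic. This sketch assumes each latent value is stored as a 4-byte fp32 number, which the original does not state, and takes the latent dimension as exactly 2,000,000.

```python
LATENT_DIM = 2_000_000   # "~ 2M" from Step 1, taken as exact here
BYTES_PER_VALUE = 4      # assumption: fp32 storage

latent_bytes = LATENT_DIM * BYTES_PER_VALUE
latent_mib = latent_bytes / 1024**2  # size in binary mebibytes

print(f"{latent_bytes:,} bytes = {latent_mib:.2f} MiB")
```

That is 8 MB in decimal units, or just under 8 MiB. None of it accounts for the decoder weights that an overfit autoencoder would need to reproduce the input, which is where the joke lands.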
The AlphaGeometry paper itself is great, but the media hype around it is unbearable. This is not AGI. A better comparison is Stockfish / AlphaGo / AlphaFold — marvelous AI-powered engines with embedded domain knowledge during construction, which aren’t generalizable.
Back in 2020 when I was at Google X, there was a QFT course taught by Lenny Susskind. I registered out of curiosity and ended up in the same room as Sergey. His understanding of field theory surpassed that of a lot of PhD students. I wouldn't be surprised if he actually contributed to Gemini.
@sytelus @jeffboudier You can just configure fp8 during training directly, can’t you? This library is more about quantizing models trained in fp16 (on A100s) into fp8 for inference on H100.
@ylecun There are multiple design pathways to achieve the same end goal. We would’ve ended up with a flappy airplane if we tried to copy a bird. Aerodynamics / first principles thinking gave us the airplane today. Same for AGI — it’s pointless to compare with the brain architecture.