Mini-R1: Reproduce @deepseek_ai R1 „aha moment“ a RL tutorial! Recreate an RL "aha moment" using Group Relative Policy Optimization (GRPO) and train an open model using reinforcement learning to teach it self-verification and search abilities all on its own to solve the Countdown Game.
TL;DR:
🤯 DeepSeek R1's "aha moment" demonstrates RL's potential for self-improvement in LLMs.
2️⃣ Using 2 reward functions, 1x for format (<think>,<answer>) and 1x for correctness
🤖 Qwen2.5-3B-Instruct model learns self-verification and search abilities.
⚙️ Use @MSFTDeepSpeed and @vllm_project for efficient and distributed online RL Training with @huggingface TRL
🤟 Include Training Observations and Hyperparameter improvements
🧮 Uses Countdown Game (arithmetic puzzles) to teach models self-correction via <think> and <answer> tags
📊 Achieves 50% success rate after 450 training steps on 4x H100 GPUs
⚡ Training takes ~6 hours on 4x H100 GPUs for 450 steps
NetworkX from NVIDIA is one THE most popular Python graph analytics library with ~15K Github starts and 80M downloads monthly.
This library is for working with networks and graphs. It helps analyze connections between things - like social networks, computer networks, or any system where objects are connected to each other.
And now NetworkX just got massively accelerated after its backend integration with NVIDIA's cuGraph.
✨ Up to 500x speedups on large graph workloads in NetworkX with zero code changes.
And it is Zero Code Change Acceleration.
📌 cuGraph is NVIDIA's GPU-accelerated graph analytics library within the RAPIDS ecosystem. The library provides fast graph algorithms on GPUs, supporting property graphs, remote operations, and graph neural networks (GNNs). Works with GPU DataFrames (cuDF) and integrates smoothly with NetworkX-like API.
--------
📌 The traditional bottleneck of NetworkX's pure Python implementation becomes apparent when processing graphs larger than 100K nodes and 1M edges.
📌 And so now cuGraph solves this by offloading supported algorithms to the GPU. PageRank, Louvain community detection, betweenness centrality, and about 60 other algorithms get instant acceleration.
📌 This acceleration enables previously impractical use cases. Fraud detection systems can now process massive transaction networks in real-time. Recommendation engines handle millions of user-item interactions efficiently. Social network analysis scales to entire platforms worth of data on a single machine.
@NVIDIAAIDev
Preprint on "BWT construction and search at the terabase scale". We can compress 100 human genomes to 11GB in 21 hours, find SMEMs with it, do affine-gap alignment and retrieve similar local haplotypes. 7.3Tb commonly sequenced bacterial genomes ⇒ 30GB https://t.co/DiRwZNHVVa
Introducing Critique Fine-Tuning (CFT): a more effective SFT method for enhancing LLMs' reasoning abilities.
📄 Paper: https://t.co/oK4vCIMP7z
CFT is simple: instead of training models to directly answer questions, we train them to critique noisy answers.
What's fascinating is that while most approaches focus on using generative critique or reward models to provide feedback for policy models, these critique models can themselves serve as policy models: directly answering questions with stronger reasoning.
Interestingly, we also found that CFT saturates quickly: overtraining on critiques can even degrade problem-solving performance.
Work led by @YuboWang726 and collaborated with @WenhuChen
Run DeepSeek-R1 (671B) locally on @OpenWebUI - Full Guide
No GPU required.
Using our 1.58-bit Dynamic GGUF and llama.cpp.
Tutorial: https://t.co/xaR9KpJzcj
DeepSeek [1] uses elements of the 2015 reinforcement learning prompt engineer [2] and its 2018 refinement [3] which collapses the RL machine and world model of [2] into a single net through the neural net distillation procedure of 1991 [4]: a distilled chain of thought system.
REFERENCES (easy to find on the web):
[1] #DeepSeekR1 (2025): Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2501.12948
[2] J. Schmidhuber (JS, 2015). On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models. arXiv 1210.0118. Sec. 5.3 describes the reinforcement learning (RL) prompt engineer which learns to actively and iteratively query its model for abstract reasoning and planning and decision making.
[3] JS (2018). One Big Net For Everything. arXiv 1802.08864. See also US11853886B2. This paper collapses the reinforcement learner and the world model of [2] (e.g., a foundation model) into a single network, using the neural network distillation procedure of 1991 [4]. Essentially what's now called an RL "Chain of Thought" system, where subsequent improvements are continually distilled into a single net. See also [5].
[4] JS (1991). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991. First working deep learner based on a deep recurrent neural net hierarchy (with different self-organising time scales), overcoming the vanishing gradient problem through unsupervised pre-training (the P in CHatGPT) and predictive coding. Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills - such approaches are now widely used. See also [6].
[5] JS (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990, introducing high-dimensional reward signals and the GAN principle). Contains summaries of [2][3] above.
[6] JS (AI Blog, 2021). 30-year anniversary: First very deep learning with unsupervised pre-training (1991) [4]. Unsupervised hierarchical predictive coding finds compact internal representations of sequential data to facilitate downstream learning. The hierarchy can be distilled [4] into a single deep neural network. 1993: solving problems of depth >1000.
🔥 o3-mini-high beats deepseek r1 and o1-pro! in a p5.js challenge!
03-mini result is so good that deserves a video on its own.
deepseek r1 (bad result) and o1-pro (better) in comments below.
Prompt in last comment.
1/4
🚨 o3-mini crushed DeepSeek R1 🚨
"write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically"
Transformers can overcome easy-to-hard and length generalization challenges through recursive self-improvement.
Paper on arxiv coming on Monday.
Link to a talk I gave on this below 👇
Super excited about this work!
o3-mini is out!
smart, fast model.
available in ChatGPT and API.
it can search the web, and it shows its thinking.
available to free-tier users! click the "reason" button.
with ChatGPT plus, you can select "o3-mini-high", which thinks harder and gives better answers.
📚🤖 Advanced RAG + Agents Cookbook
A comprehensive open-source guide delivering production-ready implementations of cutting-edge RAG techniques with AI agents. Built with LangChain and LangGraph, it features advanced implementations like Hybrid, Self, and ReAct RAG.
Learn more: https://t.co/pXkXMFFSYt
Fuck it, today we're open-sourcing the codebase used to train SmolVLM from scratch on 256 H100s🔥
Inspired by our team's effort to open-source DeepSeek's R1 training, we are releasing the training and evaluation code on top of the weights 🫡
Now you can train any of our SmolVLMs—or create your own custom VLMs!
Letter-dropping physics comparison: o3-mini vs. deepseek-r1 vs. claude-3.5 in one-shot - which is the best? Prompt:
Create a JavaScript animation of falling letters with realistic physics. The letters should:
* Appear randomly at the top of the screen with varying sizes
* Fall under Earth's gravity (9.8 m/s²)
* Have collision detection based on their actual letter shapes
* Interact with other letters, ground, and screen boundaries
* Have density properties similar to water
* Dynamically adapt to screen size changes
* Display on a dark background
AI Agents for Computer Use
This report provides a comprehensive overview of the emerging field of instruction-based computer control, examining available agents – their taxonomy, development, and resources.
Gemini 2.0 doesn’t get nearly enough credit. I just dumped all my workers-qb source code into it, hit it with a simple, humble prompt, and boom => it one-shotted the docs.
Not just good docs, way better than what I had before, packed with examples.
Kinda insane.