Introducing Critique Fine-Tuning (CFT): a more effective SFT method for enhancing LLMs' reasoning abilities.
📄 Paper: https://t.co/oK4vCIMP7z
CFT is simple: instead of training models to directly answer questions, we train them to critique noisy answers.
What's fascinating is that while most approaches focus on using generative critique or reward models to provide feedback for policy models, these critique models can themselves serve as policy models: directly answering questions with stronger reasoning.
Interestingly, we also found that CFT saturates quickly: overtraining on critiques can even degrade problem-solving performance.
Work led by @YuboWang726, in collaboration with @WenhuChen
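As a rough illustration of the data format described above: a CFT training example takes a question plus a noisy answer as input and uses a critique as the training target. The field names and prompt template below are assumptions made for this sketch, not the paper's actual format.

```python
from dataclasses import dataclass


@dataclass
class CFTExample:
    question: str
    noisy_answer: str  # candidate answer, possibly wrong
    critique: str      # target text the model learns to generate


def to_sft_pair(ex: CFTExample) -> dict:
    """Turn one example into a prompt/completion pair for ordinary SFT tooling.

    The template is hypothetical; only the input/target split reflects the
    idea in the post (input = question + noisy answer, target = critique).
    """
    prompt = (
        f"Question:\n{ex.question}\n\n"
        f"Proposed answer:\n{ex.noisy_answer}\n\n"
        "Critique the proposed answer."
    )
    return {"prompt": prompt, "completion": ex.critique}


# Made-up example data, for illustration only.
example = CFTExample(
    question="What is 17 * 6?",
    noisy_answer="17 * 6 = 112",
    critique="Incorrect: 17 * 6 = 102, since 10 * 6 = 60 and 7 * 6 = 42.",
)
pair = to_sft_pair(example)
```

Compared with plain SFT, the only change in this sketch is what sits on each side of the pair: the answer moves into the prompt and the critique becomes the completion.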
Run DeepSeek-R1 (671B) locally on @OpenWebUI - Full Guide
No GPU required.
Using our 1.58-bit Dynamic GGUF and llama.cpp.
Tutorial: https://t.co/xaR9KpJzcj
🔥 o3-mini-high beats deepseek r1 and o1-pro in a p5.js challenge!
The o3-mini result is so good that it deserves a video of its own.
deepseek r1 (bad result) and o1-pro (better) in comments below.
Prompt in last comment.
1/4
Transformers can overcome easy-to-hard and length generalization challenges through recursive self-improvement.
Paper on arxiv coming on Monday.
Link to a talk I gave on this below 👇
Super excited about this work!
o3-mini is out!
smart, fast model.
available in ChatGPT and API.
it can search the web, and it shows its thinking.
available to free-tier users! click the "reason" button.
with ChatGPT plus, you can select "o3-mini-high", which thinks harder and gives better answers.
📚🤖 Advanced RAG + Agents Cookbook
A comprehensive open-source guide delivering production-ready implementations of cutting-edge RAG techniques with AI agents. Built with LangChain and LangGraph, it features advanced implementations like Hybrid, Self, and ReAct RAG.
Learn more: https://t.co/pXkXMFFSYt
Fuck it, today we're open-sourcing the codebase used to train SmolVLM from scratch on 256 H100s🔥
Inspired by our team's effort to open-source DeepSeek's R1 training, we are releasing the training and evaluation code on top of the weights 🫡
Now you can train any of our SmolVLMs—or create your own custom VLMs!
Letter-dropping physics comparison: o3-mini vs. deepseek-r1 vs. claude-3.5 in one-shot - which is the best? Prompt:
Create a JavaScript animation of falling letters with realistic physics. The letters should:
* Appear randomly at the top of the screen with varying sizes
* Fall under Earth's gravity (9.8 m/s²)
* Have collision detection based on their actual letter shapes
* Interact with other letters, ground, and screen boundaries
* Have density properties similar to water
* Dynamically adapt to screen size changes
* Display on a dark background
AI Agents for Computer Use
This report provides a comprehensive overview of the emerging field of instruction-based computer control, examining available agents – their taxonomy, development, and resources.
Gemini 2.0 doesn’t get nearly enough credit. I just dumped all my workers-qb source code into it, hit it with a simple, humble prompt, and boom => it one-shotted the docs.
Not just good docs, way better than what I had before, packed with examples.
Kinda insane.
OpenAI o3-mini just one shotted this
prompt: write a script for 100 bouncing yellow balls within a sphere, make sure to handle collision detection properly. make the sphere slowly rotate. make sure balls stays within the sphere. implement it in p5.js
Finished an R1-style GRPO run on Qwen-2.5-0.5B (base model): it yields about +10 accuracy points on GSM8K. Literally just works. The base model scores 41.6% as reported in the Qwen paper vs. ~51% with GRPO
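For context, the core step of GRPO is a group-relative advantage: sample several completions for the same prompt, score each one, and normalize the rewards within that group. A minimal sketch, assuming a binary correctness reward on GSM8K-style answers; the reward values are made up, the use of population standard deviation is a choice made here, and this is not the poster's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The eps term
    avoids division by zero when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 sampled answers: 1.0 = correct, 0.0 = wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct completions end up with positive advantages and wrong ones with negative advantages, so no separate value model is needed to provide a baseline. If every completion in a group gets the same reward, all advantages are zero and that group contributes no learning signal.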
people learning gpu programming, and especially triton, should check out liger kernel by linkedin
it was released last year and is built on top of triton to provide pre-optimized, ready-to-use implementations of gpu optimization techniques specifically tailored for llm training
Excited to announce https://t.co/azlzx4Rrah
A website that turns any website into a GET API with the @firecrawl /extract endpoint. Data on the web has never been more accessible!
Thanks to @devdigest for starting this fabulous trend. Check out his GitHub repo below!
OpenAI o3-mini is a good model, but DeepSeek r1 has similar performance, is still cheaper, and reveals its reasoning.
Better models will come (can't wait for o3pro), but the "DeepSeek moment" is real. I think it will still be remembered 5 years from now as a pivotal event in tech history, due in part to the geopolitical implications, but for many other reasons too.
All this is discussed in a 5-hour technical podcast I just recorded on the state of the AI industry. Out tomorrow (hopefully).
OpenAI’s o3-mini is here - a significant jump forward from o1-mini
Initial results (full benchmarking coming soon):
➤ Artificial Analysis Quality Index of 89, matching DeepSeek R1 and just below o1
➤ Cheaper - $1.1/$4.4 input/output pricing per million tokens, lower than many DeepSeek R1 APIs (higher than DeepSeek’s first party R1 API)
➤ Fast - similar speed to o1-mini at 170 tokens/s, although that means 2000 tokens of ‘thinking’ time will still take ~12 seconds
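The speed and pricing figures above can be sanity-checked with a little arithmetic. A minimal sketch: the 1,000-token prompt is a hypothetical request size, and billing the "thinking" tokens at the output rate is an assumption of this example.

```python
def thinking_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate `tokens` at a steady output speed."""
    return tokens / tokens_per_second


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float = 1.1,   # $ per million input tokens (from the post)
    output_price_per_m: float = 4.4,  # $ per million output tokens (from the post)
) -> float:
    """Cost of one request at per-million-token pricing."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# 2000 thinking tokens at 170 tokens/s, as quoted in the post.
latency = thinking_seconds(2000, 170)

# Hypothetical request: 1000 input tokens plus the 2000 thinking tokens,
# assuming thinking tokens are billed as output tokens.
cost = request_cost_usd(1000, 2000)

print(f"{latency:.1f} s, ${cost:.4f}")
```

This gives roughly 11.8 seconds of thinking time, consistent with the ~12 seconds quoted, and a cost of about one cent for the hypothetical request.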
When working with o1/o3 models, I always have this feeling that I'm leaving a lot on the table with my prompting. Creating a long sequence of prompts for regular LLMs is good practice. This is because you don't want to overload what an LLM can process (or it'll lead to hallucinations). But Large Reasoning Models (LRMs) are different.