Enterprise UI is polished… and boring. I spent some time thinking about how AI agents may change that, and why we will need entirely new UX primitives.
https://t.co/oDz0Hpzzv9
What are some things you’ve seen really work when designing LLM evals that rely on LLM-as-a-judge? (e.g., prompt design, rubric structure, aggregation methods, calibration techniques, etc.)
Introducing ❄️ @snowglobe_so, the simulation engine for AI chatbots.
Magically simulate the behavior of your users to test and improve your chatbots.
Find failures before your users do.
🧠 Join the 10k developers supercharging their #LLM skills with Reinforcement Fine-tuning—and it's free! 🧠
Reinforcement Fine-Tuning (#RFT) and #GRPO are fast becoming popular techniques to teach LLMs how to reason.
We teamed up with @DeepLearningAI to build the definitive starter course for RFT, including everything you need to know to become a GRPO pro!
✅ 1-hour #FREE short course
✅ Hands-on labs: write #reward functions & train with GRPO and RFT
✅ Learn how to train an LLM to master Wordle and other complex tasks
✅ See exactly when RFT beats supervised fine-tuning (#SFT)
✅ Led by the researchers who built the first managed RFT platform
Join the 10,000+ learners who’ve already leveled up their model-tuning skills. Enroll free today and start shipping smarter #AI! 👇
Register for the free course: https://t.co/tayUbqNda6
And check out our AMA with the course creators: https://t.co/d1uDsFIbHa
Struggling with context management? Wish you could just stick it all in your model?
We’ve integrated Cartridges, a new method of leveraging sleep-time compute for learning long contexts, into Tokasaurus, an inference engine optimized for high throughput 🧵
Big news! We will be joining @RubrikInc to accelerate agentic AI adoption from pilot to production at scale! ⚡️
Together, we can deliver radical simplicity in models and data. This is an exciting next step in our journey. More from @devvret_rishi here: https://t.co/a1q9TXykOj
🚀 Fresh off our hit @DeepLearningAI course on RFT + #GRPO, we’re going live!
🎙️ Let’s Talk Tokens: Live #AMA on Reinforcement Fine-Tuning with the Experts Who Built the Definitive Course!
#RFT isn’t just research anymore; it’s driving real-world GenAI with tighter feedback loops and smarter #reasoning. Want to know how to ship it without melting GPUs or falling into reward-hack traps?
Join our live AMA to get a lightning-fast technical primer and then we'll hand the mic to you for an interactive #Q&A with the engineers who built the “Reinforcement Fine-Tuning #LLMs with GRPO” course.
Ask us anything about:
• When RFT beats supervised fine-tuning (and when it bombs) 💣
• Designing rewards that don’t collapse your model 🛡️
• Stress-testing RFT models in production 📊
• and more! 🙌
If you’re already fine-tuning models or about to get started, you’ll leave this session with the answers you need to deploy Monday morning.
No fluff. Real engineers. Real talk.
👉 Seats are limited so grab yours now: https://t.co/GwHM3OyEES
It was an honor getting to work together with the https://t.co/pvvRuZg4mC team and my colleague @grg_arnav on this course covering all things Reinforcement Fine-Tuning and GRPO.
Similar to our last course on efficient LLM inference, we wanted to really drill into the intuition behind RFT and build up the concepts piece by piece.
I hope you enjoy the end result!
New Course: Reinforcement Fine-Tuning LLMs with GRPO!
Learn to use reinforcement learning to improve your LLM performance in this short course, built in collaboration with @Predibase, and taught by @TravisAddair, its Co-Founder and CTO, and @grg_arnav, its Senior Engineer and Machine Learning Lead.
Reasoning models have been one of the most important developments in LLMs. Reinforcement Fine-Tuning (RFT) uses rewards to encourage LLMs to find solutions to multi-step reasoning tasks such as solving math problems and debugging code - without needing pre-existing training examples like in traditional supervised fine-tuning.
Group Relative Policy Optimization (GRPO) is a reinforcement fine-tuning algorithm gaining rapid adoption. Developed by the DeepSeek team and used to train the R1 reasoning model, GRPO uses reward functions that you can write in Python to assign rewards to model responses. It’s beneficial for tasks with verifiable outcomes and can work well even with fewer than 100 training examples. It can also significantly improve the reasoning ability of smaller LLMs, making applications faster and more cost effective.
In this course, you’ll take a technical deep dive into RFT with GRPO. You’ll learn to build reward functions that you can use in the GRPO training process to guide an LLM toward better performance on multi-step reasoning tasks.
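As a rough illustration of what a Python reward function for a verifiable task might look like, here is a minimal sketch that scores a single Wordle guess. The function names, the scoring weights (1.0 for a letter in the right spot, 0.5 for a misplaced letter), and the -1.0 format penalty are all assumptions made for this example; they are not taken from the course or from any particular training library.

```python
from typing import Dict, List


def wordle_feedback(guess: str, secret: str) -> List[str]:
    """Per-letter feedback: 'G' = right spot, 'Y' = wrong spot, '-' = absent."""
    feedback = ["-"] * len(secret)
    remaining: Dict[str, int] = {}
    # First pass: mark exact matches, count the secret letters left unmatched.
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            feedback[i] = "G"
        else:
            remaining[s] = remaining.get(s, 0) + 1
    # Second pass: mark misplaced letters, consuming the unmatched counts.
    for i, g in enumerate(guess):
        if feedback[i] == "-" and remaining.get(g, 0) > 0:
            feedback[i] = "Y"
            remaining[g] -= 1
    return feedback


def wordle_reward(completion: str, secret: str) -> float:
    """Hypothetical reward: partial credit per letter, penalty for bad format."""
    guess = completion.strip().lower()
    # Penalty term (assumed value): discourage outputs that are not a valid guess.
    if len(guess) != len(secret) or not guess.isalpha():
        return -1.0
    weights = {"G": 1.0, "Y": 0.5, "-": 0.0}
    feedback = wordle_feedback(guess, secret)
    return sum(weights[f] for f in feedback) / len(secret)
```

Because the score is computed from the guess alone, no labeled example is needed: a perfect guess earns 1.0, a partially correct guess earns something in between, and malformed output is penalized.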
In detail, you’ll:
- Learn when reinforcement fine-tuning is a better fit than supervised fine-tuning, especially for tasks involving multi-step reasoning or limited labeled data.
- Understand how GRPO uses programmable reward functions as a more scalable alternative to the human feedback required for other reinforcement learning algorithms, such as RLHF and DPO.
- Frame the Wordle game as a reinforcement fine-tuning problem and see how an LLM can learn to plan, analyze feedback, and improve its strategy over time.
- Design reward functions that power the reinforcement fine-tuning process.
- Learn techniques for evaluating more subjective tasks, such as rating the quality of a text summary, using an LLM as a judge.
- Understand why reward hacking happens and how to avoid it by adding penalty functions to discourage undesirable behaviors.
- Learn the four key components of the loss calculation in the GRPO algorithm: token probability distribution ratios, advantages, clipping, and KL-divergence.
- Launch reinforcement fine-tuning jobs using Predibase’s hosted training services.
By the end of this course, you’ll be able to build and fine-tune LLMs using reinforcement learning to improve reasoning without relying on large labeled datasets or subjective human feedback.
Please sign up here: https://t.co/2BSuKuzE6N
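The four loss components listed above (probability ratios, advantages, clipping, and KL-divergence) can be sketched for a single token as follows. This is a simplified, plain-Python illustration and not the course's or any library's implementation: the function names are hypothetical, `clip_eps=0.2` and `beta=0.04` are placeholder values, the population standard deviation is used for normalization, and the KL term uses a commonly used non-negative estimator.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Advantage of each response relative to its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_token_loss(
    logp_new: float,   # log-prob of the token under the policy being trained
    logp_old: float,   # log-prob under the policy that sampled the response
    logp_ref: float,   # log-prob under the frozen reference model
    advantage: float,  # group-relative advantage of the whole response
    clip_eps: float = 0.2,  # assumed clipping range
    beta: float = 0.04,     # assumed KL weight
) -> float:
    # 1. Token probability ratio between the new and old policies.
    ratio = math.exp(logp_new - logp_old)
    # 2-3. Advantage-weighted objective, clipped to limit large updates.
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # 4. KL penalty toward the reference model (non-negative estimator).
    log_ref_ratio = logp_ref - logp_new
    kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
    # Loss is the negative of the objective being maximized.
    return -(surrogate - beta * kl)
```

In a real training loop the per-token losses would be averaged over each response and over the group; here the point is only to show how the four pieces fit together.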
I had a blast working with the @DeepLearningAI team and my colleague @TravisAddair over the last few months to put this course together on Reinforcement Fine-Tuning with GRPO!
We’ve tried to make this course as practical as possible and help you build intuition. Hope you enjoy!
🚀 Serve and fine-tune #Qwen3 — in your cloud or ours with blazing fast #inference speeds! No need to share your data. 🚀
Qwen 3 is the latest #opensource LLM dominating the leaderboards. Don't get left behind!
Now you can serve and customize the latest Qwen models instantly on our shared serverless endpoints or deploy securely in your own #VPC!
➡️ Try Qwen 3 with $25 in free Predibase credits: https://t.co/P7s8UZSxvs
➡️ Get access to high-end GPUs to deploy Qwen 3 in your cloud: https://t.co/ujZvGSyA10
🐳 AI teams are testing DeepSeek—but nobody agrees on when to use it
In our recent survey of 500+ AI professionals, DeepSeek-R1 is getting serious attention—but it's far from mainstream. Here’s what we uncovered:
📊 57% of teams have experimented with DeepSeek-R1
⚠️ Only 3% have deployed it in production
🤷‍♂️ Nearly half are unsure how it stacks up to other models
And the demand for customization is clear:
🔧 46% want fine-tuning or distillation options
🧪 The takeaway? DeepSeek-R1 has potential—but teams are still figuring out how to unlock it.
👉 Ready to see if it fits your use case? Start experimenting on Predibase—free trial available.
#AI #LLM #DeepSeek #MLOps #Predibase #GenAI #MachineLearning #opensourcellms
As we all know by now, reasoning models often generate longer responses, which raises compute costs. Now, this new paper (https://t.co/UbBv4rzM09) shows that this behavior comes from the RL training process, not from an actual need for long answers for better accuracy. The RL loss tends to favor longer responses when the model gets negative rewards, which I think explains the "aha" moments and longer chains of thought that arise from pure RL training.
That is, if the model gets a negative reward (the answer is wrong), the math behind PPO causes the average per-token loss to become smaller when the response is longer. So, the model is indirectly encouraged to make its responses longer. This is true even if those extra tokens don't actually help solve the problem.
What does the response length have to do with the loss? When the reward is negative, longer responses can dilute the penalty per individual token, which results in lower (i.e., better) loss values (even though the model is still getting the answer wrong).
So the model "learns" that longer responses reduce the punishment, even though they are not helping correctness.
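To make the dilution intuition concrete, here is a toy calculation. It is a deliberate simplification and not the paper's actual derivation: it assumes a single fixed negative reward for a wrong answer that is averaged evenly over every token in the response, and the function name is made up for this example.

```python
def per_token_penalty(reward: float, num_tokens: int) -> float:
    """Toy model: one sequence-level reward averaged evenly over all tokens."""
    return reward / num_tokens


# Same wrong answer, same total reward of -1.0, different response lengths.
for num_tokens in (100, 1_000, 10_000):
    penalty = per_token_penalty(-1.0, num_tokens)
    print(f"{num_tokens:>6} tokens -> per-token penalty {penalty:.4f}")
```

The total penalty never changes, but the share carried by each token shrinks as the response grows, which is the sense in which longer wrong answers look "less bad" per token.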
In addition, the researchers show that a second round of RL (using just a few problems that are sometimes solvable) can shorten responses while preserving or even improving accuracy. This has big implications for deployment efficiency.
Today we're thrilled to announce the first end-to-end platform for Reinforcement Fine-Tuning.
With just a dozen labeled data points, you can outperform #OpenAI o1 and #DeepSeekR1 on complex tasks. Built on the #GRPO methodology that DeepSeek-R1 popularized, our platform delivers exceptional results.
In our real-world PyTorch to Triton transpilation case study, we achieved 3x higher accuracy than OpenAI o1 and DeepSeek-R1 when writing GPU code.
Check out the thread below to learn how you can adapt an #opensource #LLM to your use cases with unmatched efficiency. #rft