So excited to be opening up OpenEnv to the whole community. It will now be owned by @huggingface, Meta-PyTorch, @reflection_ai, @UnslothAI, @modal, @PrimeIntellect, @NVIDIAAI, @mercor_ai, and @fleet_ai.
the reason is: frontier labs train the model and the harness together, so the model is fitted to its harness. that coupling is a chunk of why claude code and codex feel so good.
open source can't do that. you bring whatever harness, whatever model, whatever env, whatever trainer. which is the whole point of open source and also the problem for training.
openenv is the socket in between all of this.
in short: it's a protocol layer, not a reward framework. it does not have opinions about your rewards or your training loop. those live in the libs that are actually good at them.
read more in the blog post. it's early, come break it.
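to make the "protocol layer, not a reward framework" idea concrete, here is a rough sketch. every name in it (`StepResult`, `Env`, `EchoEnv`, `rollout`, `length_reward`) is hypothetical and made up for illustration, not openenv's actual api. the env only speaks reset/step, and the reward lives on the trainer side:

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class StepResult:
    """What the env hands back: an observation and a done flag, no reward opinions."""
    observation: str
    done: bool


class Env(Protocol):
    """Hypothetical protocol surface; any env exposing these two methods fits."""
    def reset(self) -> StepResult: ...
    def step(self, action: str) -> StepResult: ...


class EchoEnv:
    """Toy env that echoes the action back and ends after max_turns steps."""
    def __init__(self, max_turns: int = 2) -> None:
        self.max_turns = max_turns
        self.turns = 0

    def reset(self) -> StepResult:
        self.turns = 0
        return StepResult("ready", False)

    def step(self, action: str) -> StepResult:
        self.turns += 1
        return StepResult(f"echo: {action}", self.turns >= self.max_turns)


def rollout(env: Env, policy: Callable[[str], str]) -> list:
    """Drive any Env with any policy and collect the observations."""
    result = env.reset()
    transcript = [result.observation]
    while not result.done:
        result = env.step(policy(result.observation))
        transcript.append(result.observation)
    return transcript


def length_reward(transcript: list) -> float:
    """Trainer-side reward (an arbitrary choice here); the env never sees it."""
    return float(len(transcript))


transcript = rollout(EchoEnv(), lambda obs: "hi")
reward = length_reward(transcript)
```

swap `EchoEnv` for any other object with the same two methods, or `length_reward` for any other scoring function, and `rollout` does not change. that separation is the point of the sketch.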
So proud of this release! It's the first step towards agents running on device.
We learned so, so much post-training this model (stay tuned!). Massive congrats to the team, you've been amazing to work with ♥️
Spent the weekend crossing one thing off my "to learn" list:
GRPO
In this blog, we walk through:
• What GRPO is and how it works
• Fine-tuning @liquidai's LFM2.5-1.2B-Instruct
• Using @UnslothAI and some free @kaggle T4s
Blog: https://t.co/vv3VK4GF1j
Kaggle Notebook: https://t.co/hXOV9z4mK3
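For anyone who has not met GRPO yet: each prompt gets a group of sampled completions, and each completion's advantage is its reward normalized against the group's own mean and standard deviation, so no separate value model is needed. A minimal sketch of just that step, assuming population standard deviation and a small epsilon for stability (both are choices made here for illustration; the function name is hypothetical and this is not code from the blog or the notebook):

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std, not sample std
    # eps keeps the division finite when every reward in the group is identical
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions sampled for one prompt, scored 1.0 (correct) or 0.0 (wrong)
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
# Correct completions end up with positive advantages, wrong ones with negative
```

When every completion in a group gets the same reward, all advantages come out as zero, so that group contributes no learning signal.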
🧮 Synthetic pretraining for sub-1B reasoning models
Cool write-up from Tufa Labs (Matteo Saponati) on whether synthetic data augmentation actually helps very small (<1B) models reason better.
They pretrain a 0.8B model with the Qwen3 architecture from scratch on 12B tokens of MegaMath-Web-Pro-Max, and compare the original corpus against three synthetic rewrite prompts.
→ The synthetic-pretrained models match the original's final accuracy with 3-6x fewer training tokens on GSM8K and ~2.5x fewer on MATH500.
→ The generator is Qwen3.5-0.8B in non-thinking mode, same parameter count as the student. This shows that a larger teacher is not necessarily needed.
→ The few-shot gap widens as you add more demonstrations (synthetic models pull 2-3x further ahead at higher shot counts), and it holds when demonstrations are randomized per question.
→ All three rewriting prompts beat the baseline despite very different output lengths (1.75x token ratio for the lightest "rephrasing" prompt, up to 3.53x for "first principles"). The shortest one is still competitive, which is interesting from a generation cost perspective.
I enjoyed that it was nicely self-contained and focused on small language models.
Time to think beyond human visitors and treat agents as first-class citizens. Cloudflare’s network now supports real-time content conversion to Markdown at the source using content negotiation headers.
https://t.co/B7wYH4PtA8
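Content negotiation here means the client states which format it prefers in the `Accept` request header and the server picks a representation to match. A minimal sketch of what an agent-side request could look like, assuming the server negotiates on the `text/markdown` media type (the post does not name the exact header value; the URL and the helper name `markdown_request` are placeholders):

```python
from urllib.request import Request


def markdown_request(url):
    """Build a GET request that prefers Markdown and falls back to HTML."""
    # q=0.8 marks HTML as acceptable but less preferred than Markdown
    return Request(url, headers={"Accept": "text/markdown, text/html;q=0.8"})


req = markdown_request("https://example.com/")
accept_header = req.get_header("Accept")
```

Passing `req` to `urllib.request.urlopen` would send it. Checking the response's `Content-Type` header then tells you whether Markdown or HTML actually came back.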