Meet the new Stitch, your vibe design partner.
Here are 5 major upgrades to help you create, iterate and collaborate:
🎨 AI-Native Canvas
🧠 Smarter Design Agent
🎙️ Voice
⚡️ Instant Prototypes
📐 Design Systems and DESIGN.md
Rolling out now. Details and product walkthrough video in 🧵
SFT Memorizes, RL Generalizes. New paper from @GoogleDeepMind shows that Reinforcement Learning generalizes across domains in rule-based tasks, while SFT primarily memorizes the training rule. 👀
Experiments
1️⃣ Model & Tasks: Llama-3.2-Vision-11B; GeneralPoints (text/visual arithmetic game); V-IRL (real-world robot navigation)
2️⃣ Setup: SFT-only vs RL-only vs hybrid (SFT→RL) pipelines + RL variants: 1/3/5/10 verification iterations (“Reject Sampling”)
3️⃣ Metrics: In-distribution (ID) vs out-of-distribution (OOD) performance
4️⃣ Ablations: Applied RL directly to base Llama-3.2 without SFT initialization; Tested extreme SFT overfitting scenarios; Compared computational costs versus performance gains
Insights/Learning
💡 Outcome-based rewards are key for effective RL training
🎯 SFT is necessary for RL training when the backbone model does not follow instructions
🔢 Multiple verification iterations (Reject Sampling) help improve generalization by up to ~6%
🧮 Used Outcome-based/rule-based reward focusing on correctness
🧠 RL generalizes in rule-based tasks (text & visual), learning transferable principles.
📈 SFT leads to memorization and struggles with out-of-distribution scenarios.
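To make "outcome-based/rule-based reward" and "verification iterations" concrete, here is a minimal sketch. It assumes a GeneralPoints-style task in which the model must combine a set of card values into an expression that equals a target (24 is assumed here). The names `outcome_reward`, `solve_with_verification`, and `propose` are hypothetical and are not taken from the paper; its actual reward and verifier may differ.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction

# Only the four basic arithmetic operators are allowed in an answer.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _evaluate(node, used):
    """Evaluate a parsed arithmetic expression, recording every number it uses."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)  # exact arithmetic, no float rounding
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def outcome_reward(expression, cards, target=24):
    """Return 1.0 only if the answer uses each card exactly once and hits the target."""
    used = []
    try:
        tree = ast.parse(expression, mode="eval")
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0  # malformed or illegal answers earn nothing
    if Counter(used) != Counter(cards):
        return 0.0
    return 1.0 if value == target else 0.0


def solve_with_verification(propose, cards, max_iters=5):
    """Query a (hypothetical) policy until the verifier accepts or iterations run out."""
    feedback = None
    for step in range(1, max_iters + 1):
        answer = propose(cards, feedback)
        if outcome_reward(answer, cards) == 1.0:
            return answer, step
        feedback = f"'{answer}' was rejected by the verifier"
    return None, max_iters
```

The reward looks only at the final outcome (legal and correct, or not), never at how closely the answer resembles a reference solution. `max_iters` plays the role of the 1/3/5/10 verification iterations in the setup.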
Today we launched our Pika 2.0 model. Superior text alignment. Stunning visuals. And ✨Scene Ingredients✨that allow you to upload images of yourself, people, places, and things—giving you more control and consistency than ever before.
It’s almost like twelve days’ worth of gifts in one 💅
Available now at https://t.co/GpHQsnLhx1
This may be the most important figure in LLM research since the OG Chinchilla scaling law in 2022. The key insight is 2 curves working in tandem. Not one.
People have been predicting a stagnation in LLM capability by extrapolating the training scaling law, yet they didn't foresee that inference scaling is what truly beats diminishing returns.
I posted in February that no self-improving LLM algorithm was able to gain much beyond 3 rounds. No one was able to reproduce AlphaGo's success in the realm of LLMs, where more compute would carry the capability envelope beyond human level.
Well, we have turned the page.
OpenAI Strawberry (o1) is out! We are finally seeing the paradigm of inference-time scaling popularized and deployed in production. As Sutton said in the Bitter Lesson, there're only 2 techniques that scale indefinitely with compute: learning & search. It's time to shift focus to the latter.
1. You don't need a huge model to perform reasoning. Lots of parameters are dedicated to memorizing facts, in order to perform well in benchmarks like trivia QA. It is possible to factor out reasoning from knowledge, i.e. a small "reasoning core" that knows how to call tools like browser and code verifier. Pre-training compute may be decreased.
2. A huge amount of compute is shifted to serving inference instead of pre/post-training. LLMs are text-based simulators. By rolling out many possible strategies and scenarios in the simulator, the model will eventually converge to good solutions. This kind of search is a well-studied problem, as in AlphaGo's Monte Carlo tree search (MCTS).
3. OpenAI must have figured out the inference scaling law a long time ago, which academia is just recently discovering. Two papers came out on arXiv a week apart last month:
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. Brown et al. find that DeepSeek-Coder increases from 15.9% with one sample to 56% with 250 samples on SWE-Bench, beating Sonnet-3.5.
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Snell et al. find that PaLM 2-S beats a 14x larger model on MATH with test-time search.
4. Productionizing o1 is much harder than nailing the academic benchmarks. For reasoning problems in the wild, how to decide when to stop searching? What's the reward function? Success criterion? When to call tools like code interpreter in the loop? How to factor in the compute cost of those CPU processes? Their research post didn't share much.
5. Strawberry easily becomes a data flywheel. If the answer is correct, the entire search trace becomes a mini dataset of training examples, which contain both positive and negative rewards.
This in turn improves the reasoning core for future versions of GPT, similar to how AlphaGo’s value network — used to evaluate quality of each board position — improves as MCTS generates more and more refined training data.
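For readers who want to see how "X% with one sample, Y% with k samples" numbers from point 3 are typically computed, here is a minimal sketch of the standard unbiased pass@k estimator from the code-generation literature. It assumes you drew n samples for a problem and a verifier marked c of them correct. The function name and arguments are illustrative, and whether a given paper uses exactly this estimator is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples is correct), given c correct out of n drawn."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a draw of size k
    # 1 minus the probability that all k drawn samples come from the n - c incorrect ones.
    return 1.0 - comb(n - c, k) / comb(n, k)


def coverage(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Coverage only says that a correct sample exists somewhere among the k draws. Picking it out still requires a verifier or reward function, which is the open question raised in point 4.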
Introducing 𝐀𝐋𝐎𝐇𝐀 𝐔𝐧𝐥𝐞𝐚𝐬𝐡𝐞𝐝 🌋 - Pushing the boundaries of dexterity with low-cost robots and AI. @GoogleDeepMind
Finally got to share some videos after a few months. The robots are fully autonomous, filmed in one continuous shot. Enjoy!
Introducing Sora, our text-to-video model.
Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions.
https://t.co/YYpOAcrXQ3
Prompt: “Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.”