There’s a big misconception about how GLM 5.2 was trained. Yes, they distilled Claude and GPT 5.5 — but distillation is not how they matched Opus quality. Distillation only fixed the cold start problem in RL.
RLing an agentic coding model isn’t rocket science. In simplified terms:
1. RL needs trajectories — rollouts where the model actually completed a task in some env
2. No successful trajectory on a task = zero gradient = you can’t RL it. This is the cold start problem
3. Distillation solves it. You seed your model with knowledge from a smarter one (Claude, GPT) on tasks it can’t do yet
4. Now it produces positive trajectories on those tasks
5. RL on those trajectories and hill climb agentic coding
6. At that point you no longer need to distill and can hill climb to better models with RL alone
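To make step 2 concrete, here is a minimal sketch of why a task with no successful rollout gives nothing to learn from. It assumes a GRPO-style group-relative advantage (reward minus group mean, divided by group standard deviation); the post does not say which RL algorithm was used, so the function name and the rewards are purely illustrative.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - mean) / (std + eps).

    Assumption: a GRPO-style estimator, used here only for illustration.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Cold start: every rollout on the task fails, so every advantage is zero
# and the policy-gradient update for this task vanishes.
cold = group_advantages([0.0, 0.0, 0.0, 0.0])

# After seeding: one rollout succeeds, so it is pushed up and the rest down.
warm = group_advantages([0.0, 0.0, 1.0, 0.0])

print(cold)
print(warm)
```

Once a single rollout in the group succeeds, the advantages become non-zero and RL has a signal to climb on.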
This is an interesting curve. I’d argue it’s harder to get to Opus 4.8 from scratch than to go from Opus 4.8 → Fable/Mythos tier.
GLM 5.2 is already producing positive trajectories, so they have plenty to RL on — they’ll keep climbing to Mythos quality without distilling any further. They no longer need American models.
If your WFH desk setup doesn't cost more than a used Honda Civic, you aren't serious about your pipeline.
My ergonomic chair is built from the salvaged suspension of a 2019 Tesla Model S.
My primary monitor is a converted IMAX screen I bought from a bankrupt theater in Oakland.
When I drag a cell in Google Sheets, I physically have to rotate my entire torso. I burn 400 active calories a day just searching for the Slack icon.
Stop complaining about back pain and optimize your environment.
to be clear, this is a closed source orchestrator on top of closed source models. if before you didn't control the models, now you don't even control which ones are used or how much. this is not "AI sovereignty"
i've also read the tech report to get an opinion on the technical stuff:
fugu (not the ultra version) is basically a classifier that selects which model at each turn is most likely to answer correctly (in other words a router). this leads to -10 points on SWE Bench pro compared to opus, with some gains on other benchmarks but very slight ones. the argument could be that it reduces cost, but there's no information about this, so it's likely the opposite. they also have an autoresearch benchmark where they compare to frontier models "Model A, B and C", and it's really crazy to not be transparent about which models you compare against. let's also say that this probably doesn't support adding a new llm out of the box, since you need to retrain the classifier
about fugu ultra, this is basically an advanced plan mode and orchestrator: a model that, for a query, outputs a plan with multiple "workflows". my understanding of workflows is that they say: "spawn model A subagents to achieve this, then use model B to judge it, then summarize this with model C", which is just a test time compute scaling strategy. i think this is an okish way to do it, but it's limited by the fact that they need to predict everything before the agents start working, which is why they limit this to 5 steps. imo you need to predict what to spawn at t+1 with the information you get at t, not with the info you get at t=0. there are also other issues, such as the fable 5 score on terminal bench being wrong and them being super vague and unclear about which models are in the LLM pool (they only mention closed source api ones)
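a rough sketch of what "a router" means here, under my own assumptions: each model in the pool gets a scorer that predicts how likely it is to answer a query correctly, and the router picks the argmax. the model names and scorers below are hypothetical stand-ins, not anything from the tech report.

```python
def route(query, scorers):
    """Return the model with the highest predicted success probability."""
    scores = {name: scorer(query) for name, scorer in scorers.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical per-model scorers standing in for a trained classifier.
SCORERS = {
    "model_a": lambda q: 0.9 if "code" in q else 0.3,
    "model_b": lambda q: 0.6,
}

choice_code, _ = route("fix this code bug", SCORERS)
choice_poem, _ = route("write a poem", SCORERS)
print(choice_code, choice_poem)
```

this also shows the retraining issue: a model with no entry in `SCORERS` can never be picked, so adding one means training a new scorer for it.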
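a toy sketch of the difference between planning everything at t=0 and choosing step t+1 from what you saw at step t. the step names, the fake `execute` environment and the `choose_next` policy are all made up for illustration, this is not how fugu ultra is implemented.

```python
def run_static(plan, execute):
    """Run a plan fixed at t=0; later steps cannot react to results."""
    return [(step, execute(step)) for step in plan]

def run_adaptive(first_step, choose_next, execute, max_steps=5):
    """Choose step t+1 from the observation made at step t."""
    history = []
    step = first_step
    while step is not None and len(history) < max_steps:
        history.append((step, execute(step)))
        step = choose_next(history)
    return history

# Toy environment: the first attempt at the code fails its tests.
RESULTS = {"write_code": "tests_failed", "fix_code": "tests_passed", "summarize": "done"}
execute = RESULTS.get

def choose_next(history):
    last_obs = history[-1][1]
    return {"tests_failed": "fix_code", "tests_passed": "summarize"}.get(last_obs)

static_run = run_static(["write_code", "summarize"], execute)
adaptive_run = run_adaptive("write_code", choose_next, execute)
print(static_run)
print(adaptive_run)
```

the static plan summarizes code that never passed its tests, while the adaptive loop inserts the fix step because it saw the failure.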
the biggest and most obvious issue is that they are introducing a "test time scaling" method with "best of N" over models, and they literally NEVER REPORT the number of output tokens or cost to achieve a benchmark/task
the right comparison here is not with opus but with opus with ultracode/workflows enabled, not with kimi but with kimi swarm, etc. very very confusing release
Lots of misinformation being spread about me the last couple days, so some quick facts
- My name is Tina, not Guo Can (or Jessie Anderson). I’m one of many Raptor flight operators on console since flight2. Before that, I wrote control software for the vehicle, and was a stage software operator for flight1
- Been living in Starbase since suborbital days in 2020, absolutely love it down here. The people are wonderful and so so excited about the mission - the lows are lows but the highs are very high. My friends here are the best in the world, and I love them to the moon/mars and back :)
- The reason I decided to say something was because facts matter, but also because I wanted to share my real life journey of how I got here. I don’t have a masters or a PhD, I started full time directly after college after 2x internships also at spacex doing software/automation. I was on a couple design teams in college, including Stanford solar car + mars rover. When I started at spacex as a software engineer, I knew very little about fluids / propulsion engineering - I learned a lot of it on the job with some pretty incredible mentors. Then I swapped over to propulsion about halfway through my career and have been loving it ever since
Prepare for takeoff. ✈️ Flight simulator is now available globally on web to all users. https://t.co/hQP0No142P
We've recently added many of our most powerful professional desktop features to web. Elevation profiles, new import types, but there's always been one other feature you've been asking us to add to the web version of Google Earth, just for fun...
Where will you fly? Share your best maneuvers, views, and flyovers with us!
🚀 How should LLMs sample on hard reasoning problems during post-training and inference where direct rollouts rarely produce a correct answer?
Best-of-N (e.g., GRPO) and tree search share two limitations:
🔻 Verification signals are sparse
🔻 Candidates stay within the model's own distribution
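A quick back-of-the-envelope sketch of why the verification signal is sparse for Best-of-N on hard problems. It assumes each rollout is correct independently with probability p, which is a simplification of ours and not a claim from the BES work; the numbers are illustrative.

```python
def prob_any_correct(p, n):
    """Chance that at least one of n independent rollouts is correct."""
    return 1.0 - (1.0 - p) ** n

# With a 0.1% per-rollout success rate, most groups contain no correct answer.
for n in (8, 64, 512):
    print(n, round(prob_any_correct(0.001, n), 4))
```

Even at N=512 a large share of groups contain no verified answer under this assumption, so the verifier stays silent on many of the hardest problems.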
We introduce BES: Bidirectional Evolutionary Search — a search framework that couples forward candidate evolution with backward goal decomposition.
✅ Works for both post-training and inference.
🌟Introducing🎻Violin — an Open-source Video Translation Skill.
📹Video is the dominant medium on the internet, yet most high-quality content (lecture, talk, podcast) is locked behind a single language, leaving global audiences behind.
So we built Violin: a video skill that combines speech recognition, LLM translation, and speech synthesis into one seamless pipeline.
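A minimal sketch of how such a three-stage pipeline composes. The stage functions are hypothetical stubs for illustration only and are not Violin's actual API; see the GitHub repo for the real interface.

```python
def make_pipeline(asr, translate, tts):
    """Chain speech recognition -> translation -> speech synthesis."""
    def run(audio, target_lang):
        transcript = asr(audio)
        translated = translate(transcript, target_lang)
        return tts(translated, target_lang)
    return run

# Hypothetical stand-ins for real ASR, LLM translation and TTS models.
def fake_asr(audio):
    return audio.decode("utf-8")

def fake_translate(text, lang):
    return f"[{lang}] {text}"

def fake_tts(text, lang):
    return text.encode("utf-8")

dub = make_pipeline(fake_asr, fake_translate, fake_tts)
print(dub(b"hello world", "fr"))
```

Each stage only depends on the previous stage's output, so any one of them can be swapped for a different model without touching the others.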
🌐 Demo: https://t.co/QFLuz4ANoE
📝 Blog: https://t.co/7FLQYQnCkn
🔗 GitHub: https://t.co/Allp6RZV4V
✨Key Features:
🎙️High-quality multilingual ASR & Translation & TTS.
🗣️Personalize translation & voice (turn an academic talk into something children can follow).
💬Chat with the video — ask any questions grounded in the video.
🧩Supports Web app, CLI, and Agent skill
🍃Fully open-source under MIT.
❤️Built with the wonderful @ShangZhu18 and advised by @james_y_zou !
All features powered by @togethercompute .
Try it and let us know what you think! 🎻
If you love fine-tuning open-source models (like me), then listen.
> Start with 1B, 2B, 4B, and 8B models. (Don't start with a 27B model or bigger at first.)
> Use cloud GPU providers. I use Google Colab Pro for any model smaller than 9B. A single A100 80GB costs around $0.60/hr, which is cheap. Enough for small models.
> Don’t buy GPUs until you've fine-tuned 7 to 10 models. You'll understand the nitty-gritty in the process.
> Use Codex 5.5 × DeepSeek v4 Pro to create datasets. Codex to plan, DeepSeek v4 Pro to generate rows.
> Use Unsloth's instruct models as a base from Hugging Face. Yes, there are others too, but Unsloth also provides fast fine-tuning notebooks.
> Use Unsloth's fine-tuning notebooks as a reference. Paste them into Codex, and Codex will write a custom notebook with the configs you need.
> Spend 1 day learning about:
- SFT (supervised fine-tuning)
- RL training (GRPO, DPO, PPO, etc.)
- LoRA / QLoRA training
- Quantization and types
- Local inference engines (llama.cpp)
- KV cache and prompt cache
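To give a feel for why LoRA is on that list, here is a small sketch of the parameter arithmetic. LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in instead. The layer size and rank below are example values picked for illustration, not a recommendation.

```python
def full_params(d_out, d_in):
    """Trainable parameters when fine-tuning the full weight matrix."""
    return d_out * d_in

def lora_params(d_out, d_in, r):
    """Trainable parameters for LoRA factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)

d_out, d_in, rank = 4096, 4096, 16   # example layer size and rank
full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(full, lora, f"{lora / full:.2%}")
```

For this example layer, the adapter trains under 1% of the weights, which is a big part of why small models fit on a single rented GPU.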
> Just get started. Claude, Codex, and ChatGPT can design a step-by-step plan for how you can fine-tune your first AI model.
Future tech is moving toward small 5B to 15B ELMs (Expert Language Models) rather than general 1T LLMs.
So fine-tuning is an important skill that anyone can acquire today.
Tune models, test them, use them. Then fine-tune for companies and make a career out of it. (Companies pay $50k+ to fine-tune models on their data so they can get personalized AI models.)
Shoot your questions below. I'll be sharing in-depth raw findings about this topic in the coming days.
Introducing Aurora, a new optimizer for training frontier-scale models.
We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks.
Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs.
By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity.
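As a rough illustration of what "update energy across neurons" can mean, the sketch below treats each row of an update matrix as one neuron and reports the fraction of rows whose update norm is far below the average. This is a hypothetical diagnostic written for this post, not Aurora's algorithm and not Muon's update rule; the threshold and the toy matrix are arbitrary.

```python
import math

def row_norms(update):
    """L2 norm of each row, treating one row as one neuron's update."""
    return [math.sqrt(sum(x * x for x in row)) for row in update]

def starved_fraction(update, rel_threshold=0.1):
    """Fraction of rows whose norm is below rel_threshold * mean norm."""
    norms = row_norms(update)
    avg = sum(norms) / len(norms)
    if avg == 0:
        return 1.0
    return sum(n < rel_threshold * avg for n in norms) / len(norms)

# Toy update where two of four neurons receive almost no update.
toy_update = [[3.0, 4.0], [0.01, 0.0], [0.0, 0.02], [1.0, 0.0]]
print(row_norms(toy_update))
print(starved_fraction(toy_update))
```

Tracking a number like this over training steps is one way to notice when many neurons stop receiving meaningful updates.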
What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.
Kinda crazy we can now generate realistic-looking 3D models with AI. I generated this one in @omma_ai and built a #threejs scene around it, as if it were a furniture website showcase.
Live link: https://t.co/PQn9WSjrm7
Here is an example prompt that I used:
🤯 Ollama now supports Claude Desktop via Claude’s built-in third-party inference.
ollama launch claude-desktop
This allows all models from Ollama's Cloud to be used across Claude Cowork and Claude Code from the Claude Desktop app.
Inference Chips for Agent Workflows
@sdianahu
Most AI chips are designed for "prompt in, response out." Agents don't work that way. They loop, branch, and hold context across dozens of steps, and current GPUs hit 30–40% utilization as a result.
That gap is where purpose-built silicon wins.
Fireside chat at Sequoia Ascent 2026 from ~a week ago. Some highlights:
The first theme I tried to push on is that LLMs are about a lot more than just speeding up what existed before (e.g. coding). Three examples of new horizons:
1. menugen: an app that can be fully engulfed by LLMs, with no classical code needed: input an image, output an image and an LLM can natively do the thing.
2. install .md skills instead of install .sh scripts. Why create a complex Software 1.0 bash script for e.g. installing a piece of software if you can write the installation out in words and say "just show this to your LLM". The LLM is an advanced interpreter of English and can intelligently target installation to your setup, debug everything inline, etc.
3. LLM knowledge bases as an example of something that was *impossible* with classical code because it's computation over unstructured data (knowledge) from arbitrary sources and in arbitrary formats, including simply text articles etc.
I pushed on these because in every new paradigm change, the obvious things are always in the realm of speeding up or somehow improving what existed, but here we have examples of functionality that either suddenly perhaps shouldn't even exist (1,2), or was fundamentally not possible before (3).
The second (ongoing) theme is trying to explain the pattern of jaggedness in LLMs. How it can be true that a single artifact will simultaneously 1) coherently refactor a 100,000-line code base *and* 2) tell you to walk to the car wash to wash your car. I previously wrote about the source of this as having to do with verifiability of a domain, here I expand on this as having to also do with economics because revenue/TAM dictates what the frontier labs choose to package into training data distributions during RL. You're either in the data distribution (on the rails of the RL circuits) and flying or you're off-roading in the jungle with a machete, in relative terms. Still not 100% satisfied with this, but it's an ongoing struggle to build an accurate model of LLM capabilities if you wish to practically take advantage of their power while avoiding their pitfalls, which brings me to...
Last theme is the agent-native economy. The decomposition of products and services into sensors, actuators and logic (split up across all of 1.0/2.0/3.0 computing paradigms), how we can make information maximally legible to LLMs, some words on the quickly emerging agentic engineering and its skill set, related hiring practices, etc., possibly even hints/dreams of fully neural computing handling the vast majority of computation with some help from (classical) CPU coprocessors.