It makes sense that a new Mythos would be available by now, after all it's been four months since the original Mythos finished training! Four months is a long time in AI and nothing about the recent actions by the Trump administration actually pauses AI development internally.
If current AI trends for 2024-2026 continue, this new Mythos likely scores >4h on the "METR 80%" and has a decent shot of >6h (current-Mythos was 3.1h and Claude 4.6 Opus was 1.2h).
The timing also makes sense. Anthropic-Amazon New Carlisle opened in Dec 2025 with 471k H100-equivalents of compute and Mythos came out three months later.
In March, Anthropic-Amazon New Carlisle was upgraded to host 687k H100-equivalents and June would be about three months after that as well. With twice as much compute as before, new-Mythos is probably quite capable!
Like I always said, Mythos was just the beginning and the AI trend is continuing. Whether we're ready for that is increasingly a guess rather than a certainty.
A few people have asked why focus on serving OSS models now (and why not just use claude or gpt)? starting with dsv3 (we really llama 405B) OSS models became good enough bases to post train from but they were not parctiaclly useful themselves for most people / for most things (at least not for me). Starting with kimi k2.5, i was reasonably able to move 20% of my tokens to an OSS model for things that were simple enough / fit well enough into the training dataset, to use any model for (i.e, there was absolutely no difference between using opus and k2.5 for those sessions, and i had regression tests to prove it). over the last 6 months, that number moved from 20% -> 80% with our k2.6 fine tune and now, with the combination of GLM 5.2 and Kimi 2.7 , i can reasonable say that 80% of my total token use can be OSS. however, in order to make that practical, there needs to be first class model support in coding harnesses, and eventually these models should be fine tuned / virtically integrated for the tools and platforms they are expected to operate it . this is how ncode was born originally and as internal tool and how i want to contine to proceed (for the near term). find the best OSS models available, provide first class support for them in ncode and on noumena, iterate and fine tune those models for using ncode and the noumena platform as we find deficiencies and sharp edges, and raise the number of tokens you can meaningfully send to OSS models higher and higher with each iteration eventually making these models and tools your preferred way to interact with AI
What all is involved with onboarding a new model to noumena and ncode? I added GLM support over the weekend so i thought this might be interesting for some of you.
first, you have to understand the architecture and how to properly serve it . luckily GLM5.2 is close enough to DeepSeek (which i have spent nearly a year working closely with) that this part fit very well into the existing serving platforms. it took a bit of DSA tuning but other than that, more or less was able to just be deployed in my existing dsv3 harness including the FA4 work, etc i have done over the past few months
so now that it is serving you have to write model specific stream parsers for the chat format, the reasoning logic and the tool call format. writing the parsers are pretty straight forward as the hugging face project usually comes with the .jinja to specify it but understanding how to parse it in a stream and what the typical generation errors look like is a bit more challenging (you cant just look for opening and close brackets as parallel tool calls stream out a few tokens at a time) . when there is an error, typically you would log this as training data and make sure the mode was more robust next time, but as this is an OSS model, and i do my best not to save any customer data on purpose EVER, you need to be more clever. this typically means exposing the poorly formatted data back to the model and saying 'this is bad, dont do this please'.
now this is just the serving end to get the responses into an openai compatible format, but to add support into ncode, it means exercising every tool call available to the model and common tool call chains to make sure the prompts, tool schema contracts and the ncode side parsing all the model to understand how to use all of the tools at its disposal (and ideally use them well) . luckily GLM was very well trained on ncode shaped tool calls so it didnt take as much work as i had feared. Similarly to the serving side, as i do not store session data for training, in order to make the model behave better, the idea is to give the model context when it screws up tool calls such that it can properly format the call on the next turn.
there is a ton more required on the model routing, and preview metadata , and supporting multiple models in a single session and kv caching that is less interesting, but that is less than 1/2 the hours spent getting GLM onboarded for everyone!
Hopefully you found that interesting and you continue to use and enjoy GLM 5.2 on noumena with ncode
Review of what we did at Noumena over the weekend:
- Added first class support for GLM to noumena and ncode which means making sure tool calling, function parsing, app routing, reasoning traces, etc work as well as possible for a model that was not finetuned on the harness.
- ramping and scaling the clusters for the additional model and load.
- Most of this weekend was spend hardening capacity and abusive sessions via the api. certain keys were spamming 1m ctx len requests and causing very long TTFT for the rest of the sessions on whichever cluster they were hitting . That has now been addressed and we have split the api endpoint to add glm-5.2 and glm-5.2[1m] to make ttft and regular ncode sessions go back to being lightening fast as of midnight on Sunday
- Interactions have been so positive with GLM 5.2 that i have changed the default model in ncode from kimi to glm . your fresh builds of ncode should automatically pick up the change but if you still see kimi as your default, you can switch the model selection with the /model slash command and update your settings at CONFIG_HOME (usually ~/.config/noumena/ncode/settings.json)
- To help alleviate some additional load on the system so we can try to keep it free for y'all for just a little longer, we adding support for DSV4-Flash as the haiku class model . that means, for new builds of ncode, glm is the default opus mode, kimi is the default sonnet model, dsv4-flash is the default kimi model
- we ramped down kimi capacity because the overwhelming traffic was pointed at the glm endpoint, but i do strill really like kimi in certain situations and will try to maintain access to it . it is honestly the perfect sonnet class model for subagents etc in ncode
- cleared the some backlog items on the way to ship some additional features this week
- woke up in the middle of the night to deal with my x account being hacked
Should be another amazing week this week! cant wait
@0xIlyy Hi, this is an update we rolled out in April that applies only to a small subset of users flagged for potentially fraudulent activity instead of outright banning them.
It was updated on June 17 as an update to the appeals process. It's unrelated to the Fable or Mythos rollout.
A new, more capable version of Mythos has emerged from training. I don't know whether it will be called Mythos 5.1 or Mythos 6, or if Anthropic will keep it internal to accelerate further development - but it has arrived.
Stopping models like Fable 5 or Mythos 5 from being served to the public does nothing to slow down development. In fact, it probably speeds it up slightly by freeing up resources. There are also no rules preventing the labs from continuing to advance capabilities while any current model is under embargo - or from keeping progress quiet until they choose to release it. None of them can afford to pause or slow down. We need only look at how capable GLM-5.2 is as proof of this. To protect their business models, the frontier labs must continually train increasingly capable systems to stay ahead of open source, and each other. The current continues to rage beneath the ice, and we continue to race toward our destination.
We built the Codex App with models that were okayish at front-end.
Wait to see what we can do when we finally improve front-end capabilities significantly in our models. That day will be something.
GLM 5.2 is now on DeepSWE as the top open-source model on our leaderboard.
With a pass@1 score of 44% at max effort, GLM 5.2 is indisputable #1 open-source model besting Kimi K2.7 Code by 17%.
GLM 5.2 is one *of the* greatest gap reductions ever, but I think it is *the* greatest show of benchmark solidity from an open model claiming SoTA ever. Normally, you have some variety of the bad old Qwen pattern: headline benchmarks are SoTA+, new OOD ones are ≈8 months behind, and real experience is spiky, competitive in places, but usually ≈1 year behind, and sometimes utterly falling apart. Knock on it and hear the hollow sound. Yes, even DeepSeek.
Not so here. There's no progressive decay. It's "Opus 4.5-4.7ish" throughout, in anything of value that you throw at it. It is the first truly, completely solid Chinese model. A phase change, I hope.
Even before Mythos I was getting asked more and more what Anthropic's deal is, and why tf they're acting the way they're acting if they believe what they say they believe.
The best answer I can give is that their basic worldview is something like:
1. There are giant, dangerous monsters in the forest
2. We see others going out and making loud noises that will rouse the monsters, and they're not going to stop because of all the treasure and magical artifacts that can be found in the forest
3. We believe the best way we can help is to send out our own vanguard to go faster and farther into the forest than everyone else, because we'll spend a ton on monster containment and taming and we'll also send back detailed reports of what monsters we're finding so that the townspeople can ready themselves, which those other guys won't do
On the one hand I understand how they got there, and I think it's possible they're basically right. On the other hand it's not hard to see why this approach makes people wonder if you're crazy or lying or both.
Codex can now hand off threads between local and remote hosts.
Start work on your laptop, send it to a remote box before you close the lid, bring it back later.
And yes, Codex can orchestrate the handoff for you.