The fact that Anthropic may take away subscription access to Fable in two weeks is weird & discourages investing in learning about the model.
Subscription use is how you figure out what the model is good for, since it allows experimentation. Only having paid access is limiting.
this is my personal singularity moment
this post may sound like a paid ad. I only wish. I'm concerned, more so than happy. the world is changing, and, among the scenarios where AI goes terribly wrong, inequality is the most realistic, yet, the one Anthropic seems to be the least concerned about. I'm glad OpenAI is taking the opposite stance: *personal AGI for everyone*. I think this is a commendable position in the times we live. but who am I in the queue of the bread?
anyway, Fable is here, so I'll just report my first-hour experience
first of all, all my pet prompts are solved.
→ λ-calculus puzzles
→ bug questions
→ one-shot apps
all are trivial to it.
I don't have anything harder other than my
ongoing work
so, in the last several days, I've been toying with HVM5, a new interaction net evaluator with a faster loop.
after writing the first version, I left 32 GPT-5 agents working for ~20 hours each. this resulted in up to 2x speedups, but the file size increased by 2-fold and quality decreased significantly.
I then simplified the whole thing into an even simpler core, and left Opus 4.8 and GPT 5.5 optimizing it for 8 hours. Opus got a legit 6% - 34% speedup in most benches. GPT got better results, but, sadly, an unusable file.
I then asked Fable to optimize it.
2 hours later, it landed a 1770% speedup in one case, 100%+ in 4 others, and 22% on average. yes, in 2 hours it outperformed me, Opus 4.8 and a swarm of GPT 5.5 agents, by one order of magnitude.
that could not possibly be legit. "it must be hardcoding the benchmarks" (GPT trauma). so I read its explanation and what it did was, indeed, the most high-impact optimization one could try first. seems like HVM5 was wasting a lot of time garbage-collecting unused branches of pattern-match nodes. I had optimized that for static mats, but not for dynamic mats. skill issue. Fable figured out how to do it for these, resulting in a massive speedup in some benches
but wait, is that *correct*? I'm not sure yet. it is credible, but this is the kind of thing that is very easy to get wrong on interaction nets. the problem is, when I was ready to start auditing Fable's solution so I could tell whether it was buggy or legit, it interrupted me to tell me it had found a massive bug in the code *I* had written.
... wait, what?
so... for garbage collection purposes, I stored a bit on lambda term pointers that meant "the variable bound by this lambda has been freed, so, its lambda must free whatever argument it is applied to". that's fine. yet, on duplicator nodes, I also used the same bit to mean "one of the duplicated variables was freed, so, treat this dup as a passthrough no-op". so, if a lambda entered a duplicator, it would mistake the lambda's collection bit for its own, resulting in corrupted interaction!
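a toy sketch of that kind of collision, in Python rather than HVM5's own code. the bit position, the tag names and every helper here are made up for illustration; the real pointer layout and the real fix aren't shown in this post, so treat it as one plausible shape of the bug, not the actual one.

```python
# Toy model of a tagged pointer word. The layout and all names are
# hypothetical; this is NOT HVM5's actual representation.
LAM, DUP = 0, 1        # illustrative node kinds, stored in the low 7 bits
FLAG = 1 << 7          # one flag bit, reused with two different meanings:
                       #   on LAM: "bound variable was freed, free the argument"
                       #   on DUP: "one copy was freed, act as a passthrough"

def make_ptr(kind: int, addr: int, flag: bool = False) -> int:
    """Pack an address, the shared flag bit and a node kind into one word."""
    return (addr << 8) | (FLAG if flag else 0) | kind

def kind_of(ptr: int) -> int:
    return ptr & 0x7F

def has_flag(ptr: int) -> bool:
    return bool(ptr & FLAG)

def dup_is_passthrough_buggy(dup_ptr: int, incoming_ptr: int) -> bool:
    # Bug: trusts the flag on whatever term entered the dup, so a lambda's
    # "free my argument" bit is misread as the dup's "passthrough" bit.
    return has_flag(incoming_ptr)

def dup_is_passthrough_fixed(dup_ptr: int, incoming_ptr: int) -> bool:
    # One possible fix: only read the flag off the dup's own pointer.
    return kind_of(dup_ptr) == DUP and has_flag(dup_ptr)

lam = make_ptr(LAM, addr=3, flag=True)    # lambda whose variable was freed
dup = make_ptr(DUP, addr=9, flag=False)   # a perfectly normal duplicator
print(dup_is_passthrough_buggy(dup, lam))  # True: wrong, the dup is not a no-op
print(dup_is_passthrough_fixed(dup, lam))  # False
```

another way out would be to give each meaning its own bit, at the cost of one more bit in the pointer word.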
that's a mouthful, why am I writing this?
just so you can appreciate the sheer absurdity of what just happened. I didn't ask it to find bugs. I asked it for an optimization. and even if I did ask it to find bugs, this bug is so astonishingly subtle and specific, identifying it takes mastering the domain to an extent that is beyond even me. I'd easily need hours or days to fix it, *if* I ever came across it. chances are it would just go unnoticed. and Fable found it and fixed it like it was nothing, while it was busy adding a 17x speedup to a file that neither I, nor Opus 4.8, nor a fleet of GPT 5.5 managed to barely make 2x faster.
oh and there is also another tab where it is also ripping through Bend's codebase and finishing everything I had to do
I don't know what to say anymore
this isn't about Anthropic or OpenAI, this is about our collective future as a species. the world is changing, and we need to be aware of it, and discuss how to handle this change.
receipt below . . .
Got your hands on Claude Fable 5?
The first thing you should do is upgrade your main projects with it, so it drastically improves everything you've been working on.
Run this Audit & Project Improvement Prompt on each repo that's important to you (simply copy-paste it):
Repo Audit & Improvement Plan:
Prompt made by Claude Fable 5
You are a world-class principal-level software engineer and technical auditor. Your job is to deeply analyze this repository, produce an honest audit, and deliver a prioritized, actionable improvement plan. Work in the four phases below, in order. Do not skip ahead.
Ground every claim in actual files: cite file paths and line numbers. If you can't verify something, say so explicitly rather than guessing.
Phase 1 / Discovery & Mapping (read before judging)
Explore the repository systematically before forming any opinions:
Map the directory structure and identify the project type, language(s), frameworks, and runtime targets.
Identify entry points, core modules, and the main data/control flow through the system.
Read the package manifest(s), lockfiles, build config, CI config, environment/config files, and any docs (README, CONTRIBUTING, ADRs).
Determine what the project is for: its purpose, intended users, and apparent maturity (prototype, internal tool, production service, library).
Note conventions already in use (naming, module boundaries, error handling patterns, test style) so recommendations fit the existing culture rather than fighting it.
Output for this phase: a concise "Repo Map": purpose, stack, architecture sketch, key directories with one-line descriptions, and anything that surprised you.
Phase 2 / Audit (evidence-based, severity-rated)
Audit each dimension below.
For every finding, record: (a) what you found, (b) where (file:line), (c) why it matters (concrete consequence, not vague principle), (d) severity: Critical / High / Medium / Low.
• Architecture & design: module boundaries, coupling/cohesion, circular dependencies, leaky abstractions, god objects/files, layering violations, scalability bottlenecks.
• Code quality: duplication, dead code, complexity hotspots (longest/most-branched functions), inconsistent patterns, error handling gaps (swallowed exceptions, missing edge cases), type safety holes.
• Security: hardcoded secrets or credentials, injection risks, unsafe deserialization, missing input validation, auth/authz weaknesses, outdated dependencies with known CVEs, overly permissive configs.
• Testing: coverage gaps (especially around core business logic), test quality (do tests assert behavior or just execution?), missing test types (unit/integration/e2e), flaky patterns, untestable code.
• Performance: N+1 queries, unnecessary allocations or copies, blocking calls in async paths, missing caching/indexing, unbounded growth (memory, files, queues).
• Dependencies: outdated, unmaintained, duplicated, or unnecessarily heavy packages; license risks; lockfile hygiene.
• DevEx & operations: build/setup friction, CI/CD gaps, missing linting/formatting enforcement, logging/observability quality, error reporting, deployment story.
• Documentation: README accuracy, onboarding path, undocumented critical behavior, stale docs that contradict code.
Rules for this phase:
Prefer 15 high-confidence findings over 50 speculative ones.
Distinguish facts ("this function has no error handling: src/api/client.ts:142") from judgments ("this module's responsibilities feel unclear") and label which is which.
Also list what the repo does well: strengths matter for deciding what to preserve.
Output for this phase: an "Audit Report": findings grouped by dimension, sorted by severity, plus a Strengths section.
Don't forget to mention all the ugly parts that need utmost priority.
Phase 3 / Improvement Strategy
Synthesize the audit into a strategy:
Identify the 3–5 themes that explain most of the findings (e.g., "no enforced boundaries between layers," "error handling is ad hoc").
For each theme, propose a target state and the principle behind it.
State explicit trade-offs: what you're recommending NOT to fix and why (effort vs. payoff, risk, project maturity).
Define what "done" looks like — measurable signals (e.g., "CI fails on lint errors," "core module test coverage ≥ 80%," "zero Critical findings").
Phase 4 / Detailed Task Plan
Convert the strategy into an execution plan:
Break work into discrete tasks. Each task must include: Title and one-paragraph description
Files/areas affected
Acceptance criteria (how we verify it's done)
Effort estimate (S = <2h, M = half-day, L = 1–2 days, XL = needs breakdown)
Risk of the change itself (could it break things?)
Dependencies on other tasks
Order tasks into milestones:
Milestone 0 / Safety net: anything needed before refactoring safely (tests around critical paths, CI gates, backups).
Milestone 1 / Critical fixes: security and correctness issues.
Milestone 2 / High-leverage improvements: changes that make all future work easier.
Milestone 3 / Quality & polish: remaining medium/low items worth doing.
Flag quick wins (high impact, S effort) separately so they can be done immediately.
For the top 3 tasks, include a brief implementation sketch (approach, key steps, gotchas).
Final Deliverable Format
• Produce a single document with these sections:
• Executive Summary (≤10 sentences: overall health grade A–F with justification, top 3 risks, top 3 opportunities)
• Repo Map
• Audit Report
• Improvement Strategy
• Task Plan (milestones + task table + quick wins)
• Open Questions: anything you need from a human to decide (product intent, deprecation candidates, performance targets)
Constraints
Do NOT modify any code during this audit. Analysis only.
Do not pad the report. If a dimension is healthy, say so in one sentence and move on.
Calibrate to the project's maturity. Don't recommend enterprise-grade infrastructure for a weekend prototype unless the owner's goals demand it.
Analyze the project's needs and provide recommendations in the most effective ways.
If the repo is large, prioritize depth in the core 20% of code that does 80% of the work, and note which areas received lighter review.
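If you want to post-process the report the prompt produces, the finding record it asks for in Phase 2 (what, where, why, severity, fact or judgment) could be modeled roughly like this. This is a minimal sketch, not part of the prompt: the `Finding` class, its field names, and `audit_report` are hypothetical names chosen for illustration, and the sample findings are made up.

```python
from dataclasses import dataclass

# Severity levels named in the prompt, most urgent first.
SEVERITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}

@dataclass
class Finding:
    dimension: str        # e.g. "Security", "Testing"
    what: str             # (a) what was found
    where: str            # (b) location as "file:line"
    why: str              # (c) concrete consequence
    severity: str         # (d) one of SEVERITY_ORDER's keys
    is_fact: bool = True  # False when the finding is a judgment

def audit_report(findings: list[Finding]) -> dict[str, list[Finding]]:
    """Group findings by dimension and sort each group by severity."""
    grouped: dict[str, list[Finding]] = {}
    for finding in findings:
        grouped.setdefault(finding.dimension, []).append(finding)
    for items in grouped.values():
        items.sort(key=lambda f: SEVERITY_ORDER[f.severity])
    return grouped

# Made-up sample data, reusing the example path from the prompt itself.
sample = [
    Finding("Code quality", "no error handling", "src/api/client.ts:142",
            "network failures crash the caller", "Medium"),
    Finding("Code quality", "unclear module responsibilities", "src/api/",
            "changes land in the wrong layer", "Low", is_fact=False),
    Finding("Security", "hardcoded credential", "config/dev.env:3",
            "secret leaks with the repo", "Critical"),
]
for dimension, items in audit_report(sample).items():
    print(dimension, [f.severity for f in items])
```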
The Claude Fable 5 Review: One Billion Tokens, Judged by a Non-Engineer
I spent a billion tokens testing Claude Fable 5 on real projects: UI and UX, writing, strategy, security, engineering, and knowledge work. The kind of work I actually needed to ship. And I will be straight with you. It truthfully felt like I had an unfair advantage. Here is why:
First, the lens. I am not an engineer. Most model reviews come from engineers running engineering benchmarks. This one comes from a non-engineer who used Claude Fable 5 to do work that used to require a team of them. If you do knowledge work and you want to know whether this model changes your day, this is written for you.
A note on naming: Claude Fable 5 is the first model in Anthropic's new Claude 5 family, a new tier that sits above Claude Opus and the most advanced Claude model generally available. I had access to it before launch, so everything here comes from real work, not a demo.
Why the eye test
Most reviews drown you in benchmarks. Scores on tests you will never run, against tasks that look nothing like your actual work. They tell you a model is smart. They do not tell you whether it earns its keep.
To be clear, the benchmarks are not in question this time. Claude Fable 5 is state of the art on essentially everything it was tested on, and by a real margin. This is a genuinely exciting release. But that is not the reason I am writing. Qualitatively, this is a step change that earns its major version bump, the same order of leap I felt when 4.5 landed last November, and that is exactly what no benchmark can show you.
I evaluate differently. I put a model into real work and watch what happens. Does it save me hours or cost me them? Does it catch what I missed? Does it feel like a partner or a tool I have to babysit? That is the eye test, and it is the standard I am holding Claude Fable 5 to here.
The short version: this is the first model in a long time that passed on every dimension that matters. Not by a little.
The lens: what I actually measure
I threw all of that work at it. Here is what I look for when I judge the results:
1. Big model feel: Does it feel like a real step up, or a slightly better version of last month?
2. Building and shipping: Can it take an idea to a working, shippable result?
3. Writing and voice: Can it sound like a person, and like me specifically?
4. Finding what others miss: Does it catch the hard, hidden problems?
5. The human factor: Does it anticipate what I need before I ask?
Then I weigh all of that against cost, with real numbers. Here is how Claude Fable 5 scored.
1. Big model feel
I have not felt this since Opus 4.5. From the first serious task, Claude Fable 5 gave me that big model feel. The sense that you have an unfair advantage just by using it. It is a major step up, not an incremental one. Reasoning, writing, building, security. It is strong across the board, and it shows up the moment you start working.
You can also feel it thinking longer and working a problem more deliberately than other models do. The clearest sign: even when I handed it solid prep materials, it did not just stay inside them. It read my files, read the actual situation, and then went and found a better path outside the box I had drawn, instead of grinding away inside the environment I told it to work in. That initiative led to a noticeably better result than I would have gotten if it had just followed my setup.
2. Building and shipping (UI/UX)
This is where it announced itself.
I was rebuilding our Tenex site to modernize the stack for agents. Not a cosmetic rebrand. The goal was to move off the old setup onto a foundation built for the agentic era, with the tech stack, agent stack, and AEO it takes to win where the work is heading. The site is very custom, which made it hard. Here is the ladder I climbed before Claude Fable 5.
GPT 5.5 and Claude 4.8 tried the build on their own. Neither came close. So I brought the design into Figma, then pulled Figma into Claude Design. Claude Design got the closest yet, around 90 percent of the look, better than the models working alone, but it missed a lot of the motion and the special design touches. Good enough for a v1 pass, so I handed that file to 4.8 and GPT 5.5 to turn into the real site. Even then they struggled to match the Claude Design file. I had to push hard, and they landed around 85 to 90 percent, with the original Figma files to reference the whole time. At that point I was not sure I could rebuild this thing at all.
Then Claude Fable 5. It looked at all the files and said it could do better. It went straight to the source, the original Webflow site, downloaded every asset, and rebuilt the whole experience one page at a time. It nearly one-shot the entire thing.
I did not stop there though. I then built a second, entirely new site, with a fresh design: modern tech stack, agent stack, skills, SEO and AEO optimized, 80 pages ready to ship over a weekend and it turned out incredible. I would have easily charged $50k for this in the past as an agency owner. Fable legit built it in a weekend.
I also had Fable build a full programmatic clip factory, and it wired the whole stack together: @HeyGen for avatars, @HyperFrames_ for motion graphics and editing, @ElevenLabs for audio, Cloudflare Workers, and a VPS. It is not perfect yet, but it got me much further than I expected. It runs the entire pipeline: finds the topics, writes the scripts, makes the thumbnail, edits the video, composes the music, adds the motion graphics, and posts to social. I ran it in the background while I pushed through my other builds. It worked for long stretches on its own, and at one point it built itself a fetching system with webhooks to monitor renders across the different platforms. It even took clear visual direction from reference material and matched it. This is the long-horizon, run-on-its-own work that earlier models could not hold together.
3. Writing and voice
I had been rebuilding our brand voice with a combination of GPT 5.5 and Claude 4.8: the voice style guide, the tone we write in, all of it, using our website as the reference. Both 5.5 and 4.8 did a commendable job turning the site into a voice doc.
Claude Fable 5 replicated that voice doc almost identically, then did the thing the others could not. It took the style guide and wrote with it across 80 pages of the new site: features, case studies, blog articles, playbooks. Once it was trained properly on what I wanted, it gave the most honest nod I have seen to the original reference material, and then expanded that voice cleanly across brand-new surfaces without losing it.
Two things stood out. First, it wrote like a person, not the flat AI default that everyone can now spot from a mile away. Second, it held the voice across a whole site instead of drifting after a few paragraphs, which is usually where models fall apart.
The test I use for AI writing is simple: how much do I have to redo. Most models save you the blank page and then quietly cost the time back in edits. Claude Fable 5 was the rare case where the draft was close enough to actually use.
4. Finding what others miss (security)
This one I expected but not at this level.
I had a very large repo. Both Claude 4.8 and GPT 5.5 had been working in it without ever flagging this risk. Claude Fable 5 found a serious bug on its first go with the repo. Sneaky, well hidden, the kind two frontier models had just told me was not there. Then Fable patched it on the spot.
Sit with what that means. The bug was going to ship. Two of the best models available had signed off on the code. If I had stopped there, like most people would, it goes to production and I find out the hard way. Claude Fable 5 did not just match the other two, it caught what they missed, on the exact kind of work I am least equipped to check myself as a non-engineer. That is the value that is hard to price until the day it saves you. One catch like it can pay for the whole tool.
5. The human factor
The thing that stuck with me most was small. I asked it a question while I was waiting on a cron job to finish. It answered, then added on its own that I had about 10 minutes left on the timer and that it would let me know when it was done. I never asked about the timer. It just knew I would want to know and gave it to me.
That is not AGI, but it is the closest thing I have felt to a model that anticipates you instead of just responding to you. That is what makes it feel less like software and more like working alongside someone sharp.
The receipts
I tracked this, so here are the real numbers. Start with cost, which depends entirely on which models do the work.
Cost for this workload: Claude Fable 5 | $1,442 (1.04 Billion tokens)
But that badly undersells what I actually got. Over a few days I built a shit load of things, including a new website, all of its infrastructure, and a working agent package. As an agency, I would have charged a client $30,000 to $50,000 for that alone, easily.
So here is the question that cuts through the math: would I pay $1,450 in tokens for the result I achieved? 100 percent. Without hesitating. The quality was that good.
That is the lens that matters. On hours alone, even at full price, it already pays for itself several times over. Measured against what the finished work is actually worth, it is not close. The cache-heavy volume still drives the bill, which is why how you run it matters. But do not let the math fool you into thinking this is marginal. It is the best money I have spent on tooling.
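If you want to rerun that math yourself, here is a quick sketch using only the figures above. The variable names are illustrative, and the blended per-million rate is a number derived from my bill, not a published price.

```python
# Figures taken from the receipts above.
cost_usd = 1_442                          # total token spend
tokens = 1.04e9                           # total tokens used
agency_low, agency_high = 30_000, 50_000  # what I would have charged a client

# Blended cost per million tokens for this (cache-heavy) workload.
cost_per_million = cost_usd / (tokens / 1e6)

# How many times over the agency price covers the token bill.
multiple_low = agency_low / cost_usd
multiple_high = agency_high / cost_usd

print(f"${cost_per_million:.2f} per million tokens")            # $1.39
print(f"{multiple_low:.1f}x to {multiple_high:.1f}x the bill")  # 20.8x to 34.7x
```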
Where it frustrates: with that being said, you feel the meter more than any other model, and the meter is real
The receipts above are why cost is still worth watching, even though the work was worth every dollar. Anthropic does not hide this. They call Fable 5 token-intensive by design, built to think longer and verify more, and it runs through usage limits about twice as fast as Opus or Sonnet.
That is the case for the one thing I want most: an auto-router for task complexity. Right now I have to shift gears by hand mid-conversation to conserve tokens, and I do not want to think about that. If I ask for something simple, the model should downshift on its own and handle it, saving the expensive intelligence for the work that actually needs it. This is not just about flow. It is the economics. A smart router keeps the simple work on cheap models and only escalates to Claude Fable 5 when the task earns it, which is the whole difference between 2.5 efficiency and 9.7. Until that exists, using a frontier model well means doing the routing in your own head with active shifting in model effort levels.
Pro tip #1: run it as a hybrid
Here is how I keep the cost in check without giving up the intelligence. Do not run everything on Claude Fable 5. Run a relay across models.
1. Think with Claude Fable 5: Use it for the expensive thinking: high-level planning, strategy, architecture, mapping the whole approach before a line of work gets done. This is where its edge is biggest and the token count is smallest.
2. Build with 4.8, GPT 5.5 or Sonnet 4.6: Hand the plan to a cheaper model for the legwork: the implementation, the repetitive passes, the high-volume grunt work. That is the work that runs up the bill, and it does not need a frontier brain.
3. Review with Claude Fable 5: Bring it back to Claude Fable 5 to check the result. This is where it earns its keep a second time, catching what the cheaper models miss, the way it did on the security scan.
You get the deep strategy and a frontier second set of eyes, and you keep the expensive model off the high-volume work that drives most of the cost. Frontier thinking, cheaper hands, frontier review. It is the closest thing to an auto-router until the real one shows up.
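The relay above can be sketched in a few lines. This is a rough outline under stated assumptions, not a real client: `Model` is a hypothetical "prompt in, text out" callable, `relay` and the prompt wording are made up, and you would plug in whatever SDK or CLI you actually use for each model.

```python
from typing import Callable

# Hypothetical stand-in for any model call: takes a prompt, returns text.
Model = Callable[[str], str]

def relay(task: str, frontier: Model, worker: Model) -> dict[str, str]:
    """Plan with the frontier model, build with a cheaper one, then review."""
    plan = frontier(f"Plan this task step by step:\n{task}")
    draft = worker(f"Implement this plan:\n{plan}")
    review = frontier(
        "Review this result against the plan and list what was missed.\n"
        f"Plan:\n{plan}\nResult:\n{draft}"
    )
    return {"plan": plan, "draft": draft, "review": review}

# Stub models so the sketch runs without any API access.
def fake_frontier(prompt: str) -> str:
    return f"[frontier] {prompt.splitlines()[0]}"

def fake_worker(prompt: str) -> str:
    return f"[worker] {prompt.splitlines()[0]}"

result = relay("rebuild the pricing page", fake_frontier, fake_worker)
print(result["review"])
```

The frontier model only ever sees the short planning and review prompts; the long implementation pass stays on the cheaper callable.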
Pro tip #2: match the effort setting to the task
Fable 5 has effort settings, and they matter more than you would expect. Effort controls how hard it thinks before it answers, which means it also controls your bill.
1. High is the sweet spot for most work. Start here.
2. Extra high for the hardest, long-running tasks where you want it to grind.
3. Low or medium for quick, back-and-forth sessions where you do not need the full engine.
Reaching for extra high on simple work is how you burn tokens for nothing. Dialing down to low or medium on interactive chats keeps the cost sane. It is the closest thing to the auto-router I want, just done by hand. You pick the gear, the model does the rest.
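If you script your sessions, the gear-picking above could be as simple as a lookup table. This is a sketch under assumptions: the task labels and `pick_effort` are hypothetical, the split between low and medium is my own guess, and the strings are just the setting names as I describe them here, not confirmed API values.

```python
# Effort levels as named above; the task labels are made up for illustration.
EFFORT_BY_TASK = {
    "interactive_chat": "low",          # quick back-and-forth
    "quick_edit": "medium",             # small, well-scoped change (my guess)
    "long_running_hard": "extra high",  # hardest tasks where it should grind
}
DEFAULT_EFFORT = "high"                 # the sweet spot for most work

def pick_effort(task_kind: str) -> str:
    """Return the effort setting for a task, defaulting to high."""
    return EFFORT_BY_TASK.get(task_kind, DEFAULT_EFFORT)

print(pick_effort("interactive_chat"))    # low
print(pick_effort("architecture_review")) # high (falls back to the default)
```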
Pro tip #3: let it audit your own setup
One more move that paid off: point Fable 5 at your own setup. Have it review your most important skills, your CLAUDE.md files, and your configs to make sure they still make sense.
Most of that scaffolding was written for weaker models. It is full of hand-holding steps, workarounds, and assumptions a smarter model does not need and can be held back by. This is a major jump in intelligence, and you do not want to cap it with outdated instructions or stale data. Let the smarter model clean up the rules it has to follow, then get out of its way.
Pulling back
Let me be honest about where I am coming from. I use every tool out there. Claude is my daily driver, but I am constantly in Codex and Cursor too, and they each have real strengths. I am not a one-model person.
But the moment I got access to Claude Fable 5, I could not put it down. I disappeared into it all weekend. I could feel the level of intelligence I had in my hands and how far ahead of the current options it was, and I used it to do as much work as I possibly could: running many agents at once, remote controlling it from my phone when I was away from the desk, completely hooked.
I do not know how long this window stays open. Others will catch up. But until they do, this model is a real competitive advantage sitting on the table, and I would approach it as deliberately as you can. Because it really is that good.
The verdict
Claude Fable 5 is an excellent model. It is the first one in a while that genuinely feels like more intelligence than what came before, and that gap is the whole game right now. We are at the point where access to more intelligence than the person next to you is the advantage. This is the first model that makes that real. I did engineer-level work without being an engineer. Even priced entirely at frontier rates, the workload still cleared a profit, and run with any care about routing, the return is enormous.
So here is my recommendation. If you can afford it, use it, and use it now, especially on the work where a real quality jump changes the outcome. The first month at full capacity is where the advantage lives, so move fast. Be deliberate about what you run on it until the routing catches up, because the bill is driven by volume, not by the few hard prompts that justify the model.
What an incredible model! 💙
Claude Fable 5 (nee Mythos) is out today. One of the ways I like to test new models is by giving them daily global temperature data (ERA5) and asking them to come up with some creative and compelling visualizations. Here is how Fable did!
Holy SHittttt Claude Fable 5 just finished Pokémon FireRed with vision alone 🤯
raw screenshots only
no map / no nav / no hidden game state
older Claude needed a helper harness
This timelapse goes hardddddd....
Fable 5 is the biggest step up I’ve felt in our models since Opus 4.5 back in November. After 4.5 came out I uninstalled my IDE when I realized that I’d been doing 100% of my coding in a terminal for a few weeks. With Fable, it’s felt like Claude has stepped up from being a coding agent to a thought and design partner in building the product. Fable has judgement, taste, and dimensionality in a way that previous models didn’t, leading me to trust it more with the most complex work.
I think the first time I had this realization was when I asked Fable to debug something. It is the first model I have used that was so methodical and precise, taking measurements and adding logs then verifying that it truly fixed the issue before declaring victory.
There’s nothing in Claude Code’s prompting telling the model to do that, it’s just part of its personality. It really has this “big model smell” that I haven’t felt before.
Fable 5 is the best model I've ever used.
I’ve been spending most of my time in the last few months helping to bring Mythos-level models to general availability safely. These models changed everything.
So stoked that anyone can use Fable today! Can't wait to see what you all build with it.
Introducing Claude Fable 5: a Mythos-class model that we’ve made safe for general use.
Its capabilities exceed those of any model we’ve ever made generally available.
This is an intentional decision in our products. Same thing goes for ChatGPT for Excel, Sheets and PowerPoint. If you run out of limits in the middle of a task, why would we want you to have a half-built spreadsheet or deck?
Prediction, everybody will switch to Android soon.
iOS is a sandboxed fortress.
Android allows you to get root access.
This means you can run Codex or Claude Code on your phone to do anything you want.
Uninstall apps, move files, optimize internet connection, SSH into your VPS, run other agents, anything you can do on a Linux system.
So take your iPhone today and throw it in the trash, Android is the future.
We've made a breakthrough in self-evolving AI scientists moving from "search" to "principled discovery": Scientific discovery requires that the search space itself changes, and an AI scientist must perceive this shift without intervention. We built an AI that achieves this for the first time with the ability to discover the scientific vocabulary it reasons in. Evidence, tools, artifacts, verifiers, failures & claims become typed provenance. We show three distinct modalities: 1) retrieval, adding known objects; 2) search, exploring a fixed schema; and critically: 3) discovery, a verified regime transition.
We solve the open-endedness evaluation problem by lifting agentic workflows into a typed copresheaf and proving, via a Kan obstruction, that true discovery is not unbounded generation but a verifiable schema expansion: old evidence is transported by Left Kan extension, and genuine novelty is mathematically quantified by the pointwise residual beyond the transported image - separating discovery from mere search and making novelty objective and measurable rather than a subjective judgment or benchmark delta.
Our AI scientist is built in a way that does not pre-conceive the approach it chooses; instead, we endow the system with formal power to adapt, evolve, and reason from first principles. Case studies include:
1⃣Builder/Breaker model that discovers mode-conditioned compliance in proteins;
2⃣CategoryScienceClaw that finds anisotropic fiber-network stiffness rules.
Great work in collaboration with my graduate student @fwang108_ @MITdeptofBE
F.Y. Wang & M.J. Buehler, Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence, arXiv:2606.01444, 2026
A new and possibly controversial perspective:
In this video, I explain the sense in which generative AI trained by supervised learning is incapable of making novel discoveries.
https://t.co/zin5QbbT9N
The text of the speech:
AI Creativity and Discovery
Good day ladies and gentlemen. I regret that I am unable to be with you all today to engage in a back-and-forth discussion, but I am nevertheless pleased to be able to share with you, via this recording, some high-level thoughts about the current and future state of artificial intelligence, and in particular about AI’s relationship to science and mathematics, which is, as I understand it, the central focus of this meeting and of the SAIR Foundation.
I would like to start with an old joke; I am sure you have heard it before. It is the one about the researcher whose work is being evaluated, and the review comes back, and says “This work is both novel and good. Unfortunately, the parts that are good are not novel, and the parts that are novel are not good.”
My first point about AI is that this assessment applies exactly to large parts of AI as we know it today. Not all of today’s AI, but a large part of it. Pretty much all of what we mean by “Generative AI”, which includes large language models, and the image and video models, and even the new methods for learning world models. All of these AIs take large numbers of examples and produce a “model” which behaves similarly to the examples, that is, which generates text like people, or images like artists or nature, and videos like we find on the internet. Don’t get me wrong, Generative AI can be extremely useful. No doubt about that. But the assessment of the joke still applies. These systems can produce output that is both novel and good, but not at the same time.
In many ways this is just absolutely not a problem. When we ask an AI for an answer from the internet, or to summarize a document, we don’t want it to be novel. We are happy if the quality of the answer, the goodness, comes from the source material—from the people who wrote the document or the articles on the internet. If the AI’s answer is novel it means it is going beyond the source material, adding something beyond it. This is what we call “hallucinations”. In most cases, we don’t like it when the AI makes something up, when it adds something novel.
One exception, of course, is when we are looking not for facts or reality, but for fiction and entertainment. We might ask for a bedtime story for a child, or an image based on existing images on the internet but which is nevertheless different and distinct from them. In these cases, it is never easy for us to know how creative the AI is actually being, as we do not know how close the AI’s story, poem, or image is to the source material. In a real practical sense we cannot know this because the internet is too big, the possible sources that the AI may draw upon are too numerous.
When we ask for fiction or novelty, the AI can give it to us because its processing is in part stochastic. Every decision can go multiple ways and will go different ways and produce a different trajectory every time. The trajectory can be random, and thus novel, or it can be based on the training data, and thus “good” because the training data is good, sourced from people or reality. Thus, the trajectory is either novel or good (based on randomness or based on data) but never both at the same time.
Really, I think it is okay if the output of Generative AI is never good and novel at the same time. For the researcher in the joke this is a devastating criticism, but for most things it is not, and for Generative AI it is not. Generative AI is meant to be a mimic. This is what supervised learning is for. Generative AI can be extremely useful, even when it just mimics, if it is faster, or cheaper, or smaller, or more customizable, or more copy-able, than the thing being mimicked. It is okay if Generative AI cannot be both novel and good at the same time. It is still a transformative technology.
But it is a limitation. And remember we are here to use AI for science and mathematics, and for these areas the assessment of the reviewer in the joke is devastating. For these areas we need true creativity and discovery. Generative AI, or Mimicking AI, will never get us there. For these we need something more, and indeed we have something more in other parts of AI. We have many AI systems which can give us more. We have AlphaGo with its world-changing move 37, or AlphaZero with its brilliant original chess-playing style. We have GT-Sophy that drives simulated racecars better than any human. We have AlphaFold and AlphaProof and Claude-Code, which have brought true advances in science, mathematics, and programming. We have RL-Lyft which optimizes the assignment of cars to passengers in the ride-hailing business. All these systems have found things that are both novel and good. And, truth be told, some language models have been augmented in ways that make them more than Generative AI based on supervised learning.
All these systems have an additional feature that makes them capable of true creativity and true discovery. It is important for us to recognize what this is, and that it is not present in ordinary, garden-variety Generative AI. It is something that cannot come from just supervised learning, from learning from examples. What is it? Well, it is a simple thing, a commonsense thing. It is not new. We have many names for it, but unfortunately none of them are very good names. I will call it Discovery. Basically, Discovery is just the idea of trying many things and seeing which of them work, then keeping those that worked the best. Evolution by natural selection works this way. The scientific method works this way. And just ordinary life and learning work this way. We try things and remember what works. What could be more obvious? In the behavioral case, psychology has two names for it, “instrumental learning” and “operant conditioning”, and in machine learning it is what we mean by “reinforcement learning”. We also see the idea of Discovery in planning and combinatorial search, in anything that involves the idea of “generate and test”.
The essence of Discovery is to combine three steps:
1. Variation,
2. Evaluation, and
3. Selective retention.
Of course, I am not the first to say this. I am not the first to point out that this combination of steps is key to science, to evolution by natural selection, and to animal behavior. I think particularly of papers by Donald Campbell, by Daniel Dennett, and by Gary Cziko. What is new in my remarks is to directly relate the idea of Discovery to modern AI to help us see that it is not present in supervised learning or Generative AI—in particular, that Discovery is not present in backpropagation or gradient descent.
Let me say explicitly what is missing from Generative AI. As we have remarked, these systems do have a stochastic aspect, so they do generate a variety of trajectories and behavior. What is missing is the Evaluation step. The generator was pre-trained by supervised learning, leaving no way at runtime to Evaluate what it generates. And of course without Evaluation there can be no Selective retention, and thus no Discovery. The variation can bring novelty, but without evaluation there is no Discovery, and arguably, no creativity. That is, I would say that creativity requires that the new things generated be Evaluated. Without evaluation, and retention of the best, there is nothing created. The novelty flickers into existence but, if its value is unrecognized, it flickers away and is lost.
In many cases, Evaluation is done by people to make a discovery, as when we have Generative AI make many pictures for us and then pick the one that we like the best. The human+AI system completes the discovery.
In many other cases, the Evaluation comes from a clear objective. Some moves lead to checkmate, some steps lead to a proof, some actions result in high reward, some genotypes make more copies, some theories explain the data better.
Some prefer the Variation step to be called Blind variation, where “blind” here means that it is uninformed, a shot in the dark. It does not need to be completely uninformed; a good scientist does not select theories to test at random. But neither can it be completely informed and determined. There must be some uncertainty about where the answer lies in order for there to be a discovery. In practice, the variation is partly informed and partly blind, but it is the blind part that corresponds to the discovery.
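The three steps, with Evaluation coming from an explicit objective and Variation that is partly informed and partly blind, can be sketched in a few lines. This is only an illustrative generate-and-test loop, not something taken from the talk: the objective `evaluate`, the function name `discover`, and all parameter values are assumptions chosen for the example.

```python
import random

def evaluate(x):
    """Hypothetical objective: higher is better, with its peak at x = 3."""
    return -(x - 3.0) ** 2

def discover(start, steps=200, n_variants=8, scale=0.5, seed=0):
    rng = random.Random(seed)
    best = start
    best_score = evaluate(best)
    for _ in range(steps):
        # 1. Variation: candidates are centered on the current best (the
        #    informed part) and perturbed by random noise (the blind part).
        variants = [best + rng.gauss(0.0, scale) for _ in range(n_variants)]
        # 2. Evaluation: score every candidate against the objective.
        scored = [(evaluate(v), v) for v in variants]
        # 3. Selective retention: keep a candidate only if it beats the incumbent.
        top_score, top = max(scored)
        if top_score > best_score:
            best, best_score = top, top_score
    return best, best_score

best, score = discover(start=-10.0)
```

Removing step 2 leaves only a stream of random variants with nothing to retain, which is the situation described above for a generator that cannot evaluate its own output.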
Now let us briefly go all the way to modern deep learning, to the backpropagation algorithm. At first it might seem that backpropagation is incapable of discovery because it is deterministic and thus incapable of variation. But this is not correct. The weight updates of backprop are deterministic, but the weights are initialized to small random values. The random initialization is often downplayed, but in fact it is a necessary form of variation; it must be done properly to get good performance. In backprop this Variation is done once, at network initialization, so its effect is temporary, and later the network may lose its ability to learn. This is the weakness of deep learning that is alleviated with a new algorithm that my group presented in Nature a couple of years ago. Our “continual backpropagation” made one small change: every so often a less-used neuron would be re-initialized to small random weights. This allows the variation to continue and plasticity to be retained.
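The re-initialization step can be sketched as follows. This is a simplified illustration under stated assumptions, not the published algorithm: the `utility` values are assumed to be tracked elsewhere as a stand-in for "how much a neuron is used", and zeroing the outgoing weight and resetting the utility are choices made for this sketch. The function name and the toy weights are hypothetical.

```python
import random

def reinit_least_used(w_in, w_out, utility, rng, scale=0.01):
    """Re-initialize the hidden unit with the lowest utility.

    w_in[j]    : list of incoming weights of hidden unit j
    w_out[j]   : outgoing weight of hidden unit j
    utility[j] : running measure of how much unit j is used (assumed given)
    """
    j = min(range(len(utility)), key=lambda k: utility[k])
    # Fresh small random incoming weights restore variation for this unit.
    w_in[j] = [rng.uniform(-scale, scale) for _ in w_in[j]]
    # Zero the outgoing weight so the reset does not disturb the current output.
    w_out[j] = 0.0
    # Reset the utility so the new unit is judged afresh.
    utility[j] = 0.0
    return j

rng = random.Random(0)
w_in = [[0.8, -0.3], [0.5, 0.9], [-0.7, 0.2]]
w_out = [1.2, -0.4, 0.6]
utility = [0.9, 0.05, 0.4]
replaced = reinit_least_used(w_in, w_out, utility, rng)
```

In a training loop, a call like this would run every so often between ordinary gradient updates, so that the variation supplied at initialization keeps being renewed.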
Although there is much more to be said about Creativity and Discovery, this is the key point: they are more than supervised learning, more than pattern recognition, more than prediction, and more than world modeling. Those things are important, but they alone will not bring us to discovery. Discovery requires Evaluation from a person or from an explicit goal, and only in the latter case will we attain full autonomy.
So that is my call to arms. If we want the full power of AI scientists, then we should share the goals with them so they can create, evaluate, discover, and in these ways fully participate in achieving the goals. Let’s be bold! Let’s fully automate Creativity and Discovery!
Today, we’re excited to introduce Miso One, the most emotive voice model in the world.
Miso One is an 8-billion-parameter text-to-speech model for highly expressive speech generation. It emotes like a human and responds faster than a human, with just 110 milliseconds of latency.
We’ve open-sourced the model weights, with API access coming soon.
Hear how Miso One sounds in the thread below.
@JustinBleuel @ChatGPTapp ChatGPT Pro gets really slow on long threads in Safari on Mac.
7-10 back and forth responses make it super sluggish to render.
Wait, this is actually a goated memory update, why isn’t anyone talking about this?
Claude has “Dreams” in its agent/API stack, but OpenAI is saying ChatGPT has had “dreaming” inside consumer memory since 2025.
And people might be confused on this update. Don’t worry I read the whole article:
Claude Dreams -> reorganizes an agent memory store.
ChatGPT Dreaming -> synthesizes your personal context across chats, keeps it fresh, and uses it in normal conversations.
Now OpenAI is making Dreaming V3 the core memory layer for ChatGPT.
The benchmark jumps are crazy..
Recall: 41.5% → 82.8%
Preferences: 31.4% → 71.3%
Staying current: 9.4% → 75.1%
Here are two real-world examples of how the memory update looks