Structured task-oriented memory for OpenClaw — open source.
I've been experimenting with using Markplane as a structured memory layer for OpenClaw agents. I originally built it as a project management context layer in a git repo for agentic coding, and accidentally created a memory system for agents.
Built an OpenClaw plugin for it. It's all laid out in the repo: https://t.co/pbIUm02r5p
Markplane hit 100 stars on GitHub.
First real signal that people want AI assistants that actually understand what they're building, not just how to write code.
Thanks to everyone who's tried it, filed issues, or sent it to a friend. We'll keep building and making it better.
https://t.co/qEZSu36fUN
Two ways to scale AI. One: add billions of identical components and a nuclear plant. Two: make each component carry more information.
We chose option one. The brain chose option two.
Silicon scales by copying. Billions of transistors, all the same, stamped onto rigid chips. Need more power? Add more of them. Need to cool them? Add water. Need to power the cooling? Add a reactor.
The brain scales by differentiating. Each neuron does something different. The networks reshape themselves. The whole system runs on 20 watts.
Northwestern just printed artificial neurons — on a flexible polymer, with ink — that generate signals complex enough to trigger responses from living brain cells. Not simplified pulses. Actual spiking patterns. Bursts, single fires, continuous signals. One component encoding what would take a network of silicon devices to replicate.
That's the part worth paying attention to. Not that artificial neurons can talk to biology (though that's wild). It's that each printed neuron carries more information per component. That's a different scaling philosophy than anything the AI hardware industry is currently building toward.
The current trajectory has a ceiling measured in gigawatts and nuclear plants. The alternative might be weirder materials doing more per unit.
The brain figured this out a while ago. We're just starting to print our way toward the same conclusion.
https://t.co/XLPohMHyCb
The most underrated skill in 2026 is being able to tell when an AI output is confidently wrong. I don't fully know how to teach it. I'm not sure I fully know how to do it.
What is your memory setup for Hermes and OpenClaw? What have you found most effective?
People talk about AI agent memory like it's one thing. It's at least three.
Added one line to my agent's instructions that's a money saver:
"De-escalate back to local after completing a cloud-model task."
Without it, the agent stays on the expensive model for your follow-up "ok cool thanks." Every casual message after a hard task burns cloud credits for no reason.
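A minimal sketch of what that rule changes, assuming a two-tier setup. The `ModelRouter` class, the `"local"`/`"cloud"` labels, and the `is_hard` flag are all hypothetical names for illustration, not part of OpenClaw or any real agent framework:

```python
from dataclasses import dataclass

LOCAL = "local"  # hypothetical label for the cheap on-device model
CLOUD = "cloud"  # hypothetical label for the paid cloud model


@dataclass
class ModelRouter:
    """Tracks which model tier handles each message (illustrative only)."""
    de_escalate: bool = True
    current: str = LOCAL

    def route(self, is_hard: bool) -> str:
        """Return the tier used for this message, then update the tier."""
        if is_hard:
            # Escalate for the task that actually needs the cloud model.
            self.current = CLOUD
        used = self.current
        if self.de_escalate and used == CLOUD:
            # The one-line rule: drop back to local once the cloud task is done.
            self.current = LOCAL
        return used


# One hard task followed by two casual follow-ups.
messages = [True, False, False]
sticky = ModelRouter(de_escalate=False)
fixed = ModelRouter(de_escalate=True)
print([sticky.route(m) for m in messages])  # every message billed to cloud
print([fixed.route(m) for m in messages])   # only the hard task hits cloud
```

How a message gets flagged as hard is out of scope here; the point is only that without the fallback step the tier is sticky.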
New research: in one benchmark, AI agents performed twice as well when organized into corporate hierarchies with management and compliance layers. Used 75% fewer tokens doing it. Fifteen years of "flat org" think pieces, and the robots brought middle management back.
Mythos will probably make software more secure in general. The AI that finds the vulnerabilities can also write the fixes. That's the long game and it's a good one.
But right now, about 50 organizations have access and you don't.
The part most people are glossing over: these security capabilities weren't trained. They emerged from general improvements in coding and reasoning. You can't build a sufficiently capable coding AI without also building a vulnerability scanner. They're the same thing.
So gating Mythos doesn't prevent the capability from existing. It controls who gets it first.
And if this becomes the template (small group gets access, everyone else waits, next model drops, repeat) the advantage compounds. Not because any single head start is permanent, but because they keep coming. The organizations in the room today aren't just patching bugs. They're building the muscle to integrate frontier AI into how they build everything. That doesn't reset when the model goes public.
More secure software for everyone is the destination. The route there is a repeating cycle where the same group starts every lap early.
Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software.
It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans.
https://t.co/NQ7IfEtYk7
Don't get me wrong, I like Anthropic products. Claude Code is my go-to. But the conclusion I'm drawing here is that unless your code is written or at least hardened by Mythos, anyone with Mythos can eat your lunch. And this will extend beyond security. Imagine the capability it can build. I'm getting checkmate-y vibes.
So write and harden your code with Mythos or you're toast? Except it won't be available to you unless you're part of the "critical" club. Well, that's a nice moat for the big guys.
How about an authorized review process for the little guys, Anthropic? I guess it doesn't really matter. We'll probably have Mythos equivalents on our Mac minis in a year or two.
Companies keep asking "how do we adopt AI?" The ones actually getting value never asked that question. They had a specific problem - too many support tickets, unread compliance docs, a three-week onboarding process - and AI turned out to be the best available tool. Nobody who's succeeding with AI started with AI.
🚨SHOCKING: Apple just proved that AI models cannot do math. Not advanced math. Grade school math. The kind a 10-year-old solves.
And the way they proved it is devastating.
Apple researchers took the most popular math benchmark in AI — GSM8K, a set of grade-school math problems — and made one change. They swapped the numbers. Same problem. Same logic. Same steps. Different numbers.
Every model's performance dropped. Every single one. 25 state-of-the-art models tested.
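To make the number swap concrete, here is a rough sketch of the idea, not the paper's actual code or data. The pen-shop template, the value ranges, and the function names are all made up for illustration:

```python
import random

# Hypothetical template: the wording and logic stay fixed, only the numbers change.
TEMPLATE = (
    "A shop sells {mon} pens on Monday and {tue} pens on Tuesday. "
    "Each pen costs {price} dollars. How much money did the shop take in?"
)


def render(mon: int, tue: int, price: int) -> tuple[str, int]:
    """Fill the template and compute the ground-truth answer from the same values."""
    question = TEMPLATE.format(mon=mon, tue=tue, price=price)
    answer = (mon + tue) * price
    return question, answer


def make_variant(rng: random.Random) -> tuple[str, int]:
    """Sample fresh numbers; the reasoning steps needed are identical every time."""
    return render(rng.randint(10, 99), rng.randint(10, 99), rng.randint(2, 9))


rng = random.Random(0)  # seeded so the variants are reproducible
for _ in range(3):
    question, answer = make_variant(rng)
    print(answer, "|", question)
```

A solver that actually follows the steps should score the same on every variant, which is why a drop from swapping numbers alone is notable.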
But that wasn't the real experiment.
The real experiment broke everything.
They added one sentence to a math problem. One sentence that is completely irrelevant to the answer. It has nothing to do with the math. A human would read it and ignore it instantly.
Here's the actual example from the paper:
"Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?"
The correct answer is 190. The size of the kiwis has nothing to do with the count.
A 10-year-old would ignore "five of them were a bit smaller" because it's obviously irrelevant. It doesn't change how many kiwis there are.
But o1-mini, OpenAI's reasoning model, subtracted 5. It got 185.
Llama did the same thing. Subtracted 5. Got 185.
They didn't reason through the problem. They saw the number 5, saw a sentence that sounded like it mattered, and blindly turned it into a subtraction.
The models do not understand what subtraction means. They see a pattern that looks like subtraction and apply it. That is all.
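The arithmetic from the kiwi example, written out. The variable names are mine; the numbers are the ones quoted above:

```python
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number of kiwis he did on Friday" -> 88

# Correct: size is not count, so the "smaller than average" clause is ignored.
correct = friday + saturday + sunday

# The failure mode: treating the irrelevant "five" as something to subtract.
smaller_than_average = 5
pattern_matched = correct - smaller_than_average

print(correct)          # 190
print(pattern_matched)  # 185
```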
Apple tested this across all models. They call the dataset "GSM-NoOp" — as in, the added clause is a no-operation. It does nothing. It changes nothing.
The results are catastrophic.
Phi-3-mini dropped over 65%. More than half of its "math ability" vanished from one irrelevant sentence.
GPT-4o dropped from 94.9% to 63.1%.
o1-mini dropped from 94.5% to 66.0%.
o1-preview, OpenAI's most advanced reasoning model at the time, dropped from 92.7% to 77.4%.
Even giving the models 8 examples of the exact same question beforehand, with the correct solution shown each time, barely helped. The models still fell for the irrelevant clause.
This means it's not a prompting problem. It's not a context problem. It's structural.
The Apple researchers also found that models convert words into math operations without understanding what those words mean. They see the word "discount" and multiply. They see a number near the word "smaller" and subtract. Regardless of whether it makes any sense.
The paper's exact words: "current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data."
And: "LLMs likely perform a form of probabilistic pattern-matching and searching to find closest seen data during training without proper understanding of concepts."
They also tested what happens when you increase the number of steps in a problem. Performance didn't just decrease. The rate of decrease accelerated. Adding two extra clauses to a problem dropped Gemma2-9b from 84.4% to 41.8%. Phi-3.5-mini from 87.6% to 44.8%. The more thinking required, the more the models collapse.
A real reasoner would slow down and work through it. These models don't slow down. They pattern-match. And when the pattern becomes complex enough, they crash.
This paper was published at ICLR 2025, one of the most prestigious AI conferences in the world.
You are using AI to help you make financial decisions. To check legal documents. To solve problems at work. To help your children with homework. And Apple just proved that the AI is not thinking about any of it. It is pattern matching. And the moment something unexpected shows up in your question, it breaks. It does not tell you it broke. It just quietly gives you the wrong answer with full confidence.
Local models or frontier models?
Both
Local models will become the daily drivers and continue to claim a larger percentage of tasks where frontier models are not meaningfully different.
Local models will be like your car, while frontier models will be like a 747. I'm not driving the 747 to the grocery store.
It depends on the task. SOTA isn't meaningfully different for a percentage of tasks and local will continue to claim a larger share. It's not just backup. It's diversification and will become the daily driver. The analogy will be closer to local as your car and SOTA as a 747. I'm not driving the 747 to the grocery store.
@mattshumer_ I've had good results using Markplane as a structured task-oriented memory layer with OpenClaw (open sourced a plugin). Exploring integrating it with Hermes.
https://t.co/a21q6OTGw9
@NousResearch Love the memory approach. I've had good results using Markplane as a structured task-oriented memory layer for OpenClaw (open sourced a plugin). Will be exploring integrating it with Hermes.