AI systems don't learn from experience. Ours does.
Claude Opus 4.6 had never passed a Terminal-Bench 2.0 task in 5 published trials. We wrapped it with our learning system. Pass in 42 steps.
No model changes. No prompt engineering.
We then ran the same learning system on GLM 5.1, a completely different LLM. GLM had failed the same task 8 times in a row. Pass in 7 steps.
One learning system. Any model.
https://t.co/chLqJUvgWB
@sonyatweetybird @harvey @FactoryAI The app layer can become the learning layer.
@FactoryAI routes models, Harvey is advancing agent/advisor routing, and at Rekursor we’re working on skill routing: selecting durable, reusable capabilities learned from prior work.
Moving up from apps to systems that compound.
Three routing innovations this week.
@FactoryAI routes the model: frontier quality, 25% lower cost.
@Harvey + @FireworksAI_HQ route the advisor: open model primary; frontier model selectively invoked.
Rekursor routes inspectable, customer-owned skills selected at runtime, with a new primitive that shows 100% top-pick correctness across an 18 → 504 skill library while RAG collapses past ~200 candidates.
Different layers. Same wave. Routing is becoming the architecture.
https://t.co/n9piULQ2XT
Introducing model routing to Factory.
Factory Router picks the right model for every task, automatically.
Maintain frontier performance while cutting costs by 25%.
@EnoReyes Awesome. We built something conceptually similar for skill routing, using a different selection primitive that works far better than standard RAG. Released yesterday in our latest blog post.
@garrytan called the skills resolver bottleneck: once you have a lot of skills, the hard part is picking the right one.
That's the problem Rekursor solved with a different selection primitive.
Across 18 → 504 skills, Rekursor's resolver picked the skill the run confirmed correct 100% of the time. Standard RAG over skill descriptions: ~11% at 18 skills, ~2% at 50, effectively zero by ~200.
A skill library only compounds if the resolver scales.
Full post:
https://t.co/n9piULQ2XT
The top frontier model passed 7.1% of LAB tasks end-to-end in the @harvey benchmark. That is the gap.
Post-training can raise the model floor. Long context can improve the run. Rekursor adds a third axis: a learning layer above the model that turns scored work into skills that can revise near-misses to all-pass.
New results: held-out transfer (45/49 → 48/49), all-pass revision (49/50 → 50/50), autonomous skill generation, revision without regressions, and routing that holds as the skill library grows.
Full post: https://t.co/n9piULQ2XT
@gabepereyra @harvey Thanks Gabe, really appreciate it. We’ll keep running Rekursor across more tasks and practice areas, and share what we find to help make LAB even more useful for evaluating legal agents.
Legal AI agents shouldn't just execute tasks. They should learn from them.
Rekursor's first result on @harvey's open benchmark, Harvey LAB, with their agent setup:
Baseline: 45/48 (fail)
With Rekursor: 48/48 (all-pass)
0 regressions. No fine-tuning.
https://t.co/IeltjbOOXn
Rekursor just hit 48/48 on the first task in the Harvey LAB benchmark (baseline was 45/48). This is continual learning for legal AI agents: @WeAreLegora @SpellbookLegal
@MaxJunestrand @AnthropicAI Strong post. One piece missing from the stack: continual learning. Vertical platforms + governance get you to production. They don't get the agent past the plateau every loop hits. That's what we just launched.
@scottastevenson Congrats. We just launched a layer that handles the long-tail review patterns university procurement throws at contract AI. Same kind of plateau Harvey wrote about in April.
@winstonweinberg 50% daily usage = the bottleneck Harvey named in April is now a daily event in production. We just launched the fix. When autoresearch loops hit a plateau, we break it.
A former Latham associate just released an open-source legal AI tool that he says replicates much of what Harvey ($11B) and Legora ($5.5B) charge enterprise prices for. Built it in two weeks.
The story isn't that he did it. It's that the conversation in every AI vendor renewal meeting just changed: from "is this magic?" to "what exactly am I paying for?"
Writing something longer on this for tomorrow.