We're excited to introduce the Retrieval Harness in LlamaParse, the 2026 version of RAG over documents.
Generalized agents need the right set of tools to scalably search and read through an arbitrary corpus of data (from 10 docs to 1M+ docs). They can already demonstrate great retrieval performance over a local filesystem, but they need a proper backend for a large collection of managed data. The Retrieval Harness exposes a diverse set of tools for various needs:
1. Hybrid Retrieval: Combine vector search with keyword search, and let the agent set the alpha value to shift the balance between the two
2. List Files: a scalable version of `ls` to list files within an index
3. File Grep: enable regex search within a given file
4. File Read: Allow agents to read a subsection from an existing document.
The agent can choose to interleave any sequence of these tools in order to complete a variety of tasks, from simple to hard.
Come check it out!
Blog: https://t.co/AGQV6JVkKj
Sign up to LlamaParse: https://t.co/XYZmx5TFz8
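To make the four Retrieval Harness tools described above concrete, here is a minimal sketch over a toy in-memory index. Everything in it is hypothetical and for illustration only: `ToyIndex`, `hybrid_score`, and the method names are not the LlamaParse API, and the convention that alpha = 1.0 means pure vector search is an assumption.

```python
import re
from dataclasses import dataclass


def hybrid_score(vector_score: float, keyword_score: float, alpha: float) -> float:
    """Blend two relevance scores that are assumed to be normalized to [0, 1].

    Assumed convention: alpha = 1.0 is pure vector search,
    alpha = 0.0 is pure keyword search.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * vector_score + (1.0 - alpha) * keyword_score


@dataclass
class ToyIndex:
    """Hypothetical stand-in for a managed index: file name -> file text."""

    files: dict[str, str]

    def list_files(self, prefix: str = "", limit: int = 100) -> list[str]:
        # A bounded, sorted listing so the result stays small on big indexes.
        names = sorted(name for name in self.files if name.startswith(prefix))
        return names[:limit]

    def file_grep(self, name: str, pattern: str) -> list[tuple[int, str]]:
        # Return (1-based line number, line) for every line matching the regex.
        regex = re.compile(pattern)
        lines = self.files[name].splitlines()
        return [(i, line) for i, line in enumerate(lines, start=1) if regex.search(line)]

    def file_read(self, name: str, start_line: int, end_line: int) -> str:
        # Read an inclusive, 1-based line range instead of the whole document.
        lines = self.files[name].splitlines()
        return "\n".join(lines[start_line - 1:end_line])


index = ToyIndex({
    "a/report.txt": "intro\nrevenue grew 12%\noutlook",
    "b/notes.txt": "todo",
})

# An agent might interleave the tools: list, grep for a lead, then read around it.
candidates = index.list_files(prefix="a/")
hits = index.file_grep(candidates[0], r"revenue\s+grew")
first_hit_line = hits[0][0]
context = index.file_read(candidates[0], first_hit_line, first_hit_line + 1)
```

The point of the sketch is the shape of the loop: cheap listing and grep calls narrow the search, and the targeted read keeps only the relevant lines in the agent's context.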
Devin Fusion uses a hybrid-model harness built around two ideas:
First, a “sidekick” agent: a smaller agent runs in parallel with the frontier agent. The frontier agent delegates work, monitors progress, and keeps ownership of planning, ambiguity, and final review.
This lets Fusion stay intelligent while avoiding unnecessary frontier-model spend.
With everything going on, it gives me hope that there's such a diversity of companies building open models today.
A lot of the story of open models unfolds under the shadow of the biggest frontier models. Lots of unearthed value.
Improving a product with a chat interface is so much easier than a website. With a website, you're stuck watching screen recordings and guessing what users wanted. With a chat interface, they literally tell you in English what they want your product to do.
We need a counter-culture design movement that appreciates products that are difficult to use initially, but tremendously fast, easy, accurate, explicit and efficient after topping the learning curve.
Ease-of-use is oversold. It's making everything suboptimal and slow.
We just crossed $100M annual run-rate. I know many AI companies are capturing much more $$$ these days, but still proud of the milestone!
Maximizing short-term revenue has never been our priority. In fact, we're proud that we manage to store and serve hundreds of petabytes of models and datasets while keeping HF free and open-source for 97% of our users. As a platform, we’re happy to hopefully create orders of magnitude more value for the community than what we capture. To me, that’s the very definition of a platform.
And it has helped us build one of the most loved platforms in tech, with network effects, a defensible position and a sustainable business, which is quite unique in AI.
Many many thanks to all the community members for building with us, we wouldn't be anywhere without you! Can’t wait for what’s next, especially as more companies start to see the value of open and local AI! Next milestone $1B?
Introducing LFM2.5-230M: our smallest model yet, built to run fast anywhere (CPUs, NPUs, and GPUs) to enable agentic tasks on phones, robots, home and network automation devices.
> 230M parameters, built on the LFM2 architecture
> Pre-trained on 19T tokens, with a 32K context extension
> Post-trained with distillation from LFM2.5-350M
> 213 tok/s decode speed on Galaxy S25 Ultra (CPU)
> 42 tok/s on a Raspberry Pi 5 (CPU)
> Competes with and often beats models more than twice its size on instruction following, data extraction, and tool use.
> Use it for large-scale data extraction pipelines or lightweight on-device agentic workloads.
🧵
Zyphra is sharing our first work in continual learning where we study: Can LLMs learn forever from new data?
Many see continual learning as a path to AGI through recursive self-improvement (RSI).
The first obstacle is plasticity loss. We derive a scaling law for its onset 🧵
With agentic coding, complexity compounds in a mechanical way: unnecessary code ends up in the codebase, moves to the context window, degrades the model's reasoning abilities, and leads to more unnecessary code (often to fix issues arising from the unnecessary code). It's exponential.
every PR will obviously come with 100% coverage of AI app testing, that tries every button in the interface to make sure it works as expected
why are the coding apps not making AI testing a first-class feature? 80% of problems are obvious for AI if it tries the app itself
We’re open-sourcing Unlimited OCR — built to read long documents in one pass.
With 3B total parameters and only 500M activated, Unlimited OCR sets new end-to-end SOTA results on OmniDocBench v1.5 and v1.6.
The key innovation is Reference Sliding Window Attention (R-SWA), inspired by how humans transcribe books: keeping the source, recent context, and next words in focus, while softly forgetting what’s no longer needed.
With constant KV Cache size and lower attention cost, Unlimited OCR can transcribe 40+ pages in a single forward pass — without losing context or slowing down.
Explore the model👇:
--GitHub: https://t.co/5ZJBsEldKd
--Hugging Face: https://t.co/4FKFr9EfOu
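The post does not spell out how R-SWA works, so the following is only a generic sketch of the idea it gestures at: a sliding window over recent output with a pinned block of reference tokens that always stays visible. The function names, the position layout, and the numbers are all assumptions for illustration, not the model's actual mechanism.

```python
def visible_positions(step: int, num_reference: int, window: int) -> list[int]:
    """Positions the token generated at `step` (0-based) may attend to.

    Assumed layout: positions [0, num_reference) hold the reference tokens,
    and the generated token for `step` sits at num_reference + step.
    """
    reference = list(range(num_reference))
    current = num_reference + step
    # Keep only the most recent `window` generated positions, including itself.
    first_recent = max(num_reference, current - window + 1)
    recent = list(range(first_recent, current + 1))
    return reference + recent


def kv_entries_full(step: int, num_reference: int) -> int:
    # Full attention caches every position seen so far, so it grows with step.
    return num_reference + step + 1


def kv_entries_windowed(step: int, num_reference: int, window: int) -> int:
    # With a window, the cache is capped at num_reference + window entries.
    return len(visible_positions(step, num_reference, window))


# Hypothetical sizes: 3 reference tokens and a window of 2 generated tokens.
early = visible_positions(step=0, num_reference=3, window=2)
late = visible_positions(step=5, num_reference=3, window=2)
```

Under these assumptions the windowed cache stops growing once `window` tokens have been generated, while the full-attention cache keeps growing linearly with output length.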
TMax: An open RL recipe for terminal agents
I’m very excited to get to share a new RL paper today that I got to have a small part in – a type of paper I suspect we’ll see much more of in the future. The key is that RL research is very different today, in mid-2026, than what most observers have in their context. The average conception of an RL paper is grounded in the RLVR revolution of early 2025, where many people could use vanilla RLVR libraries to hillclimb on math benchmarks. Crucially, this style of math work could be done on base models or fairly stably on already trained models. With agents, the tasks of focus are very hard, requiring complex tool-use, harnesses where the model automatically manages its history, and much more training to make smaller eval improvements. We’re shifting from a renaissance of RL study to rapidly needing to improve its empirical rigor and common community engagements.
TMax is the best open data for hillclimbing on frontier terminal tasks. It’s been validated with rigorous experiments, and if the authors wanted to just form a “RL environments startup” they could probably sell it for millions of dollars. This data work is some of my favorite stuff to be around in my 2.5+ years at Ai2.
As a general summary, the recipe is open data and recipe lessons from hillclimbing the smaller, dense Qwen 3.5 models on terminal tasks. These models are super hard to hillclimb in this area, as they’re already trained heavily on the task. The training is very infrastructure-dependent, and most of the RL innovations are designed more to make training stable than to improve the rate of learning.
I strongly recommend this paper. I joke around that I was happy to be an author just so I had to read it twice! You can find Hamish’s thread sharing more here or read the paper here. You can click through to find the model weights, the data, and even some fun further artifacts to study like all the RL rollouts from a training run – where the model sometimes became aware that it was being tested.
The biggest takeaway I have from following this work, and more of the work in the community, is how important recipe work is. Let me define “recipe work.” It is a style of paper that explains all the steps you need to make crucial model improvements – data, algorithm, codebase, pitfalls, etc.
Getting started in meaningful RL experiments today is a substantial expense. There are a ton of companies, an entire industry emerging really, around the idea of taking open-weight language models and finetuning them with RL on your domain-specific tasks. What I see in many projects is that getting an initial baseline is very hard. This phase, which can cost weeks and anywhere from $10K to $1M+, feels like spinning your wheels (A fun fact is that an RL step on a model like Nvidia Nemotron 3 Ultra on Tinker costs $1K and a meaningful RL run would be hundreds of steps – credit Edward Hu). It takes a lot of time to get traction in learning signal on meaningful, hard RL tasks.
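As a rough back-of-the-envelope check on the figures above, the sketch below multiplies the quoted $1K per step by a few step counts. The step counts are illustrative assumptions standing in for "hundreds of steps", not numbers from any real run.

```python
COST_PER_STEP_USD = 1_000  # per-step figure quoted in the text


def run_cost_usd(num_steps: int, cost_per_step: int = COST_PER_STEP_USD) -> int:
    """Total cost of an RL run if every step costs the same amount."""
    return num_steps * cost_per_step


# Hypothetical run lengths covering "hundreds of steps".
costs = {steps: run_cost_usd(steps) for steps in (100, 300, 1_000)}
for steps, cost in costs.items():
    print(f"{steps} steps: ${cost:,}")
```

Even the short end of that range lands at six figures, which is consistent with the $10K to $1M+ span mentioned for getting an initial baseline.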
What we need as a community is a way for people to study small ablations to established RL recipes, as most labs won’t have the resources to do it from scratch in a meaningful way. This is what I hope TMax can be for terminal agents, or at least the start of it. Yes, the training jobs are expensive, as the paper documents a standard training job being 8 nodes of H100s (2 train, 6 inference) for 2-3 days, but that is approaching something academics can study. The establishment of this recipe took O(100) of these training jobs to get right.
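To put the compute in this paragraph into GPU-hours, here is a small sketch. It assumes 8 GPUs per H100 node, which the text does not state, and treats O(100) as exactly 100 jobs for the sake of the estimate.

```python
GPUS_PER_NODE = 8   # assumption: a typical H100 node; not stated in the text
NODES = 8           # from the text: 2 training nodes + 6 inference nodes
NUM_JOBS = 100      # O(100) jobs, taken as exactly 100 for illustration


def gpu_hours(nodes: int, gpus_per_node: int, days: float) -> float:
    """GPU-hours consumed by one job running on all nodes for `days` days."""
    return nodes * gpus_per_node * days * 24


low = gpu_hours(NODES, GPUS_PER_NODE, days=2)    # shortest quoted job
high = gpu_hours(NODES, GPUS_PER_NODE, days=3)   # longest quoted job
recipe_low, recipe_high = NUM_JOBS * low, NUM_JOBS * high
train_fraction = 2 / NODES  # share of nodes spent on training, not rollouts

print(f"per job: {low:,.0f} to {high:,.0f} GPU-hours")
print(f"whole recipe: {recipe_low:,.0f} to {recipe_high:,.0f} GPU-hours")
```

Under these assumptions a single job is a few thousand GPU-hours, and only a quarter of the nodes do gradient updates; the rest generate rollouts.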
This isn’t my first time trying to establish this direction. When we launched Olmo 3 we had the “RL Zero” model families, which are clean RL runs from a base model on a certain domain. This type of recipe-dependent work is a clear indicator that meaningful post-training work today looks much more like pretraining work of years past. We need decision-making ladders, clear ways of seeing small improvements in the models, stability, and so on.
Part of this is down to academic gatekeepers, who won’t reward a paper doing very clean empirical work to push a recipe 1-2% up. They’ll favor a “new algorithm” that matches results, or something sort of bogus. My hope is that we can have multiple, stable, clear recipes across agent types, so innovations can be tested more clearly in multiple domains. (If you’re working on this, please reach out – I’m happy to support if I can, but I likely can’t reply to every email).
As a quick aside, the RL frameworks in vogue today seem to be SLIME and SkyRL. The libraries of choice have shifted throughout these seasons in RL, which further contributes to a form of fragility in the literature. A bit of continuity will go a long way.
So, go read this paper. It’s a really great example of how seemingly simple data and infrastructure work can be very hard and impactful. It’s also got me looking for more applications of Divergence Proximal Policy Optimization (DPPO) as another small evolution to the best RL algorithms of the day, by virtue of being a bit more stable by improving token-level clipping.
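For readers who want a reference point for "token-level clipping", the sketch below shows the standard PPO clipped surrogate for a single token. This is plain PPO, not DPPO; how DPPO changes the clipping is described in the paper and is not reproduced here. The function name and the default epsilon of 0.2 are choices made for this illustration.

```python
def ppo_token_objective(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Standard PPO clipped surrogate for one token (to be maximized).

    ratio     = pi_new(token | context) / pi_old(token | context)
    advantage = advantage estimate credited to this token
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to push the ratio
    # further outside [1 - epsilon, 1 + epsilon] in the helpful direction.
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain is capped once the ratio exceeds 1 + epsilon.
capped_gain = ppo_token_objective(ratio=1.5, advantage=1.0)
# Negative advantage with a large ratio: the penalty is not clipped away.
full_penalty = ppo_token_objective(ratio=1.5, advantage=-1.0)
```

Because the clip is applied per token, a handful of tokens with extreme ratios can dominate or zero out the update, which is one reason stability-oriented variants focus on this step.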
Search has its own bitter lesson
Incentivizing the world to optimize for your search engine matters more than the smartest algorithm
We saw it with Google. Now we're seeing it with Claude Code.
A fundamental problem with extending Codex/Cowork/Code to all knowledge work is that they remain very "software-brained" where the end result (the software) is what is important & that code serves as a source of truth.
For a lot of other knowledge work, the process is at least as important as the outcome. This includes researching what is known, an exploration of alternatives, failed efforts, prototype branches, experiments, etc. All of those things are valuable, so you cannot use the PowerPoint at the end the way you can use a codebase, nor is progress on a to-do list sufficient context post compaction. You work in learning loops, refining your perspectives as you go.
In some ways, this makes long-running models like Fable hard to use for deep knowledge work, since they are designed to deliver a product to you in the end. You can prompt your way around this problem, but everything about the Codex and Code harnesses wants you to be a software developer, and you have to fight them. There is a real disconnect between how a manager or analyst thinks about problems and how the agentic software tools approach solving them. Addressing this is critical to breaking out of the coding niche for these tools.
I’m surprised people are surprised. The latest big Chinese models were clearly meant as Anthropic/OpenAI reproductions, down to the param size range and MoE structure; the main difference is still the incredible breadth of data, environments and use cases well beyond public benchmarks.