Very interesting to see pixels compared to HTML from the perspective of RAG. In web browsing agents we see that models trained on HTML beat VLMs initially, but the ceiling for visual models is higher. Seems like RAG is now advanced enough to realize the gains from vision.
The web was never meant to be flattened into text.
Yet most web RAG systems start by parsing HTML, a complex and lossy process.
🔥 Introducing PixelRAG: the first RAG system that retrieves and reads 30M+ web pages as pixels.
Instead of extracting text, PixelRAG retrieves screenshots and lets a VLM read them directly.
PixelRAG not only preserves visual information, but also outperforms text-based RAG on text-only QA benchmarks by +18.1%.
Why?
(1) HTML-to-text conversion often discards layout, structure, tables, and other useful signals.
(2) We continued pretraining a VLM on web page screenshots and turned it into a surprisingly strong visual retriever.
(3) Recent VLMs are remarkably good at understanding web pages, often with better accuracy and token efficiency than text-only pipelines.
Takeaway: HTML parsing may be one of the biggest self-inflicted bottlenecks in web RAG.
Demo below 👇
Code: https://t.co/ssDF0nnVwZ
Paper: https://t.co/OIpQ26Vb8H
Playground: https://t.co/UdzM7GQmu3
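A rough sketch of the retrieval step described above, assuming each page screenshot has already been embedded into a vector by some visual retriever. The file names, toy 2-D vectors, and the `retrieve_screenshots` helper are hypothetical and are not taken from the PixelRAG code; the point is only that the index returns screenshot paths, not extracted text, and the top hits would then be handed to a VLM to read.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_screenshots(query_vec, index, k=2):
    """Return the k screenshot paths whose embeddings best match the query.

    `index` is a list of (screenshot_path, embedding) pairs. In a real system
    the embeddings would come from a visual retriever; here they are toy values.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [path for path, _ in ranked[:k]]


# Hypothetical index of three page screenshots with made-up embeddings.
index = [
    ("page_a.png", [1.0, 0.0]),
    ("page_b.png", [0.0, 1.0]),
    ("page_c.png", [0.7, 0.7]),
]
top = retrieve_screenshots([1.0, 0.1], index, k=2)
print(top)
```

At 30M+ pages a brute-force scan like this would be replaced by an approximate nearest-neighbor index, but the interface stays the same: query vector in, screenshot paths out.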
it’s in gemini, just create it in ai studio. oh, that’s for your personal google one account. for workspace you need gemini business. no, not gemini advanced, that’s ai pro now. unless you need ai ultra. oh agents? you do that in spark actually. no, not gemini api managed agents, that’s different. for coding use jules. unless you mean the agentic ide, that’s antigravity. no, that’s the old antigravity, download the new one. actually gemini cli is being deprecated, use antigravity cli. no the flash model is smarter than the pro model. unless you need pro. if it’s video, use flow. no, flow uses veo. no, nano banana is images. actually that’s in gemini now. unless you’re in search, then it’s ai mode. no, research is notebooklm. anyway it’s all very simple.
Remember action recognition? The days of trying to climb on Kinetics?👻
Announcing VideoNet, a CVPR 2026 Highlight 🎉 which revitalizes action recognition in the VLM era
Explore our data with this fun, interactive demo: https://t.co/W53aBi3QAX
(1/8) 🧵
Robotics models often struggle outside controlled environments. Ours is built to work in real ones.
Today we're launching MolmoAct 2, which can assist with a host of chores & lab tasks, plus the MolmoAct 2-Bimanual YAM dataset—the largest open robotics dataset of its kind. 🧵
LLM evals are hard. Agentic evals are very hard. Web browsing evals are crazy. The same webpage will show different content based on:
Time of year (seasonal promos)
Your IP (stores near me)
Your device (os+browser combo)
Random A/B tests
This codebase solves evals and training.
You can now train, adapt, and eval web agents on your own tasks.
We're releasing the full MolmoWeb codebase—the training code, eval harness, annotation tooling, synthetic data pipeline, & client-side code for our demo. 🧵
Today we're releasing WildDet3D—an open model for monocular 3D object detection in the wild.
It works with text, clicks, or 2D boxes, and on zero-shot evals it nearly doubles the best prior scores. 🧵
72hrs after the release, looking at the community’s excitement around MolmoWeb, I have been reflecting on what leading this project throughout the past year was actually like.
It didn’t feel like winning.
It felt like a constant uphill battle.
Making the case that this is worth building.
Building a team around the project from the ground up.
Working through compute constraints and org-wide competing priorities.
Showing early demos that didn’t quite land.
And so on.
But reading people’s comments, it is clear that builders wanted an open web agent they could run locally. They wanted MolmoWeb.
For me, it is a powerful reminder that sometimes you must go against the grain.
Sometimes you must work in silence until your results can speak for themselves.
If you are wrong, you will learn. If you are right, you might just give the world what it needs.
@bnafOg @allen_ai And we will soon release a tool that will allow you to finetune MolmoWeb on specific types of tasks or websites. This way you can tailor the model to your needs!
@Web3__Youth @allen_ai It’s great at tasks on a single website, like looking up plane tickets, finding specific information, online shopping, etc. We will soon release the eval code on GitHub, so you can test it on benchmarks!
@anitakirkovska @allen_ai Playwright is a tool to execute actions in a browser. MolmoWeb is a model that comes up with the right actions. Playwright is like the hand and MolmoWeb is like the brain.
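The hand/brain split can be sketched as a small loop. This is an illustration under stated assumptions, not MolmoWeb's or Playwright's actual API: `ScriptedPolicy` stands in for the model (the brain) and `RecordingBrowser` stands in for a browser driver such as Playwright (the hand). All class and function names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str         # e.g. "click", "type", or "done"
    target: str = ""  # free-form description of what to act on


class ScriptedPolicy:
    """Stand-in for the model: replays a fixed list of actions, then stops."""

    def __init__(self, actions):
        self._actions = list(actions)

    def next_action(self, observation):
        # A real model would look at the observation (a screenshot) to decide.
        return self._actions.pop(0) if self._actions else Action("done")


class RecordingBrowser:
    """Stand-in for the browser driver: records actions instead of running them."""

    def __init__(self):
        self.log = []

    def observe(self):
        return f"screenshot after {len(self.log)} actions"

    def execute(self, action):
        self.log.append((action.kind, action.target))


def run_agent(policy, browser, max_steps=10):
    """Alternate between observing, deciding (brain), and acting (hand)."""
    for _ in range(max_steps):
        action = policy.next_action(browser.observe())
        if action.kind == "done":
            break
        browser.execute(action)
    return browser.log


log = run_agent(
    ScriptedPolicy([Action("click", "search box"), Action("type", "SEA to JFK")]),
    RecordingBrowser(),
)
print(log)
```

Swapping `RecordingBrowser` for a real driver and `ScriptedPolicy` for a model call would leave `run_agent` unchanged, which is the point of the analogy.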
Ai2 just released MolmoWeb on Hugging Face
A fully open multimodal web agent that autonomously controls browsers to complete tasks,
achieving SOTA results and surpassing GPT-4o-based agents on WebVoyager and Mind2Web.
Still missing that sweet summer with the AI2 team ❤️ CUA research is incredibly hard in academia: the lack of trajectories and RL environments is still a real bottleneck (too profitable to open-source 🥲).
Excited to see MolmoWeb finally out and potentially unlock key directions for making CUA work: self-play, continual learning, RL in generative environments, and more.
2026 is going to be a big year for CUA. 🚀
Very proud & excited to share what I've been working on at Ai2. MolmoWeb is:
1. A pretty strong agent for browsing the web
2. A huge collection of artifacts. Synthetically generated data, human annotations, model checkpoints, evaluation codebase (coming soon)
Check it out!
Today we're releasing MolmoWeb, an open source agent that can navigate + complete tasks in a browser on your behalf.
Built on Molmo 2 in 4B & 8B sizes, it sets a new open-weight SOTA across four major web-agent benchmarks & even surpasses agents built on proprietary models. 🧵