Fair question. Different job, really.
Screenshot browser agents use the image to act: where to click, what to type, on one live page. It's a per-session action loop.
PixelRAG uses pixels to retrieve. We train a visual embedding model and index 30M+ pages as image tiles, then search over them. It's dense retrieval over the web, but visual.
And pixelbrowse is just the single-page tip of that: screenshot, read straight, no DOM, no parser. 74% fewer tokens than directly reading the HTML.
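A rough sketch of what "dense retrieval, but visual" means in practice: cut each page screenshot into tiles, embed each tile, and rank tiles by similarity to the query embedding. Everything here is an assumption for illustration; the tile size, the tile IDs, and the tiny hand-written vectors stand in for a trained visual embedding model and a real index, and none of it is PixelRAG's actual code.

```python
import math

def tile_boxes(page_width, page_height, tile_height):
    """Split a full-page screenshot into vertical tiles as (left, top, right, bottom)."""
    boxes = []
    top = 0
    while top < page_height:
        bottom = min(top + tile_height, page_height)
        boxes.append((0, top, page_width, bottom))
        top = bottom
    return boxes

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query_vec, index, k=2):
    """Return the IDs of the k tiles whose embeddings are closest to the query."""
    scored = [(cosine(query_vec, vec), tile_id) for tile_id, vec in index.items()]
    scored.sort(reverse=True)
    return [tile_id for _, tile_id in scored[:k]]

# Hypothetical toy index: tile ID -> embedding (a real one would come from the model).
index = {
    "page1#tile0": [0.9, 0.1, 0.0],
    "page1#tile1": [0.1, 0.9, 0.0],
    "page2#tile0": [0.0, 0.2, 0.9],
}

boxes = tile_boxes(1280, 2500, 1024)          # a 2500px-tall page becomes 3 tiles
top_hits = search([0.8, 0.2, 0.0], index, k=2)
print(boxes)
print(top_hits)
```

At 30M+ pages the brute-force loop would be replaced by an approximate nearest-neighbor index, but the ranking idea is the same.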
Your coding agent reads every web page wrong. And pays 2.5x extra to do it.
pixelbrowse fixes it with one screenshot. -74% tokens, 4x faster.
give Claude Code eyes.
New: pixelbrowse.
pip install pixelrag · https://t.co/wu7aSJ4auw
Built by @YichuanM and @andylizf at StarTrail, out of UC Berkeley's Sky Computing Lab @BerkeleySky, BAIR @berkeley_ai, and Berkeley NLP @BerkeleyNLP. Same team behind LEANN https://t.co/0od7GlyScX (RAG on everything, on-device, with a 97% smaller index).
Thanks to our advisors @matei_zaharia, @profjoeyg, and @sewon__min!
If LEANN gave your models memory of your private data, PixelRAG gives them eyes on the open web.
Code: https://t.co/iccv6aCG5h
Paper: https://t.co/uDEWZmXKhS
Playground: https://t.co/TzmCmIwhy8
pip install pixelrag
Zoom out: PixelRAG isn't really about RAG.
The more capable models get, the more multimodal they get. And the most natural way to read the world is the way people do, by looking, not by parsing markup. Pixels are the universal format: web pages, PDFs, charts, UIs, all of it.
Building agents that see the world the way we do is where this is headed. PixelRAG is one step there. 💪
Even our own status page does it, https://t.co/BPlE9HK8bA
WebFetch got the HTML (7.4KB) and couldn't read a single live number. The dashboard renders client-side, so none of it is in the HTML. The agent offered to go scrape our GitHub repo instead.
pixelbrowse screenshots it and reads the dashboard straight: all systems operational, uptime and latency for every service.
I asked both to pull benchmark scores off https://t.co/HQl8oaMgED. The numbers live in a chart image, so the HTML has none of them.
HTML agent: 7 turns, downloaded the image anyway, even burned a billed web search. 47.3k tokens, 97s. pixelbrowse: one screenshot. 12.0k tokens, 23s.
Same answer from both here. The counter-intuitive part, and the core finding in our paper: HTML to text is lossy, so reading pixels wins on accuracy and on tokens at the same time. A screenshot is ~1k visual tokens vs 2k+ for the same page as text. Better and cheaper, not a tradeoff. 🤯
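For anyone checking the headline numbers, here is the arithmetic on this one run, using only the figures quoted above (47.3k vs 12.0k tokens, 97s vs 23s). The variable names are just for illustration.

```python
# Figures reported for the single benchmark-page run above.
html_tokens, pixel_tokens = 47_300, 12_000
html_seconds, pixel_seconds = 97, 23

token_reduction = 1 - pixel_tokens / html_tokens   # fraction of tokens saved
speedup = html_seconds / pixel_seconds             # wall-clock ratio

print(f"{token_reduction:.1%} fewer tokens, {speedup:.1f}x faster")
# prints "74.6% fewer tokens, 4.2x faster"
```

That is one page, so treat it as an example rather than an average.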
Your agent never actually sees the page. It grabs the HTML, flattens it to text, and reasons over that. When the answer is in a chart or an image, that text doesn't have it, so the agent burns tokens working around a problem pixels don't have.
pixelbrowse screenshots the page and the model reads it.
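A minimal sketch of why the flattened text comes up empty, using only Python's standard `html.parser`. The page snippet and the `TextOnly` class are made up for illustration; real agents use their own HTML-to-text pipelines, but the failure mode is the same: the numbers live inside the image, and text extraction never opens the image.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only the text nodes, roughly what HTML-to-text flattening keeps."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

# Hypothetical page: the scores exist only as pixels inside scores.png.
page = """
<h1>Benchmark results</h1>
<p>Scores for each model are shown below.</p>
<img src="scores.png" alt="bar chart">
"""

parser = TextOnly()
parser.feed(page)
parser.close()
flattened = " ".join(parser.chunks)
print(flattened)
# prints "Benchmark results Scores for each model are shown below."
```

The text says scores are "shown below", and then nothing below survives. A screenshot of the same page still contains the chart.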
Really amazing results analyzing what's creative/novel vs. what's copied from Internet data, enabled by the amazing @liujc1998's Infini-gram! https://t.co/ZglfLg1dRF
This is also enabled in @allen_ai's OlmoTrace https://t.co/ayLePYwGbZ where anyone can find matching n-grams between LLM-generated text and its training data.
Codex basically replaced OpenClaw for me at this point
but the part I like most here is “memory as markdown files”
pretty refreshing to hear that from someone on the Codex team. model companies usually have every reason to want memory lock-in
Super stoked that UW SyFI (https://t.co/GsIZJi5LB5) members won a number of prizes at the MLSys'26 competition, NVIDIA Track. Huge congrats to @KeisukeKamahori, @sudopowr, Yile Gu, Wei Shen, Steven Gao! Thanks to @nvidia, @modal, and the FlashInfer team for the support.
1st place in the GDN Track: Full-Agent Approach
2nd place in the GDN Track: Agent-Assisted Approach
3rd place in the DSA Track: Full-Agent Approach
LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels.
CODA reparameterizes them to hide in the matmul's shadow, fused into its epilogue before results leave the chip.
Bonus: LLMs can write fast CODA kernels too (approaching SoLs).
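For readers new to the term, here is a plain-Python illustration of generic epilogue fusion, not CODA's reparameterization and not GPU code. The bias-plus-ReLU epilogue and both function names are assumptions chosen for the example. The unfused version stores the full matmul result and then re-reads it for the elementwise op; the fused version applies the op to each accumulator before it is stored, which is the memory round trip a fused kernel avoids.

```python
def matmul_then_epilogue(a, b, bias):
    """Unfused: store the full matmul result, then re-read it for bias + ReLU."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
         for i in range(rows)]                       # first pass: write C
    return [[max(0.0, c[i][j] + bias[j]) for j in range(cols)]
            for i in range(rows)]                    # second pass: reload C

def matmul_fused_epilogue(a, b, bias):
    """Fused: apply bias + ReLU to each accumulator before it is stored."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                acc += a[i][k] * b[k][j]
            out[i][j] = max(0.0, acc + bias[j])      # epilogue runs on the accumulator
    return out

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, -1.0]]
bias = [0.5, 0.5]

unfused = matmul_then_epilogue(a, b, bias)
fused = matmul_fused_epilogue(a, b, bias)
print(fused)
# prints [[1.5, 0.0], [3.5, 0.0]]
```

Both paths give the same numbers; the difference on real hardware is how many times the output travels to and from memory.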