Zhifei Li

Verified account

@andylizf

Incoming PhD @PrincetonCS. Building AI systems & infra. Prev @BerkeleySky & @ruc1937

Joined July 2022

112 Following

224 Followers

42 Posts

9 days ago

Fair question. Different job, really. Screenshot browser agents use the image to act: where to click, what to type, on one live page. It's a per-session action loop. PixelRAG uses pixels to retrieve. We train a visual embedding model and index 30M+ pages as image tiles, then search over them. It's dense retrieval over the web, but visual. And pixelbrowse is just the single-page tip of that: screenshot, read straight, no DOM, no parser. 74% fewer tokens than directly reading the HTML.

1

0

0

0

69

10 days ago

Your coding agent reads every web page wrong. And pays 2.5x extra to do it. pixelbrowse fixes it with one screenshot. −74% tokens, 4x faster. give Claude Code eyes. 👀 New: pixelbrowse. 𝚙𝚒𝚙 𝚒𝚗𝚜𝚝𝚊𝚕𝚕 𝚙𝚒𝚡𝚎𝚕𝚛𝚊𝚐 · https://t.co/wu7aSJ4auw

13

64

4

48

70K

10 days ago

Built by @YichuanM and @andylizf at StarTrail, out of UC Berkeley's Sky Computing Lab @BerkeleySky, BAIR @berkeley_ai, and Berkeley NLP @BerkeleyNLP. Same team behind LEANN https://t.co/0od7GlyScX (RAG on everything, on-device, with a 97% smaller index). Thanks to our advisors @matei_zaharia, @profjoeyg, and @sewon__min! If LEANN gave your models memory of your private data, PixelRAG gives them eyes on the open web. Code: https://t.co/iccv6aCG5h Paper: https://t.co/uDEWZmXKhS Playground: https://t.co/TzmCmIwhy8 🐍 𝚙𝚒𝚙 𝚒𝚗𝚜𝚝𝚊𝚕𝚕 𝚙𝚒𝚡𝚎𝚕𝚛𝚊𝚐

0

5

2

4

710

10 days ago

Zoom out: PixelRAG isn't really about RAG. The more capable models get, the more multimodal they get. And the most natural way to read the world is the way people do, by looking, not by parsing markup. Pixels are the universal format: web pages, PDFs, charts, UIs, all of it. Building agents that see the world the way we do is where this is headed. PixelRAG is one step there. 💪

0

0

0

0

234

10 days ago

This is PixelRAG up close: one page, in your agent. The same pixels-over-HTML model powers retrieval across 30M+ web pages.

0

0

0

0

233

10 days ago

Even our own status page does it, https://t.co/BPlE9HK8bA 😅 WebFetch got the HTML (7.4KB) and couldn't read a single live number. The dashboard renders client-side, so none of it is in the HTML. The agent offered to go scrape our GitHub repo instead. pixelbrowse screenshots it and reads the dashboard straight: all systems operational, uptime and latency for every service.

1

1

0

1

287

10 days ago

I asked both to pull benchmark scores off https://t.co/HQl8oaMgED. The numbers live in a chart image, so the HTML has none of them. HTML agent: 7 turns, downloaded the image anyway, even burned a billed web search. 47.3k tokens, 97s. pixelbrowse: one screenshot. 12.0k tokens, 23s. Same answer from both here. The counter-intuitive part, and the core finding in our paper: HTML to text is lossy, so reading pixels wins on accuracy and on tokens at the same time. A screenshot is ~1k visual tokens vs 2k+ for the same page as text. Better and cheaper, not a tradeoff. 🤯

0

2

0

1

345
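The numbers in the benchmark-scores post above can be checked with a few lines of arithmetic. This is only a sanity check on the figures quoted for that one run (47.3k vs 12.0k tokens, 97s vs 23s); the variable names are illustrative, and the source does not say whether the headline "−74% tokens, 4x faster" comes from this run or from a broader benchmark.

```python
# Figures quoted in the post for a single page (one run, not a benchmark).
html_tokens, pixel_tokens = 47_300, 12_000
html_seconds, pixel_seconds = 97, 23

# Fraction of tokens saved by the screenshot path relative to the HTML path.
token_reduction = 1 - pixel_tokens / html_tokens

# Wall-clock speedup of the screenshot path over the HTML path.
speedup = html_seconds / pixel_seconds

print(f"token reduction: {token_reduction:.1%}")
print(f"speedup: {speedup:.1f}x")
```

For this run the reduction works out to roughly 75% and the speedup to a little over 4x, which is in line with the headline claim.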

10 days ago

pixelbrowse is a Claude Code skill. Install it in three commands, and give Claude Code eyes 👇 https://t.co/5QMDQC7OkJ

0

5

0

1

398

10 days ago

Your agent never actually sees the page. It grabs the HTML, flattens it to text, and reasons over that. When the answer is in a chart or an image, that text doesn't have it, so the agent burns tokens working around a problem pixels don't have. pixelbrowse screenshots the page and the model reads it.

1

3

0

0

500
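The failure mode described in the post above (HTML flattened to text loses anything that lives in an image or is rendered client-side) can be illustrated with a minimal sketch. It uses only Python's standard `html.parser`; the page, the `TextFlattener` class, and the `flatten` helper are hypothetical and are not pixelbrowse's or WebFetch's actual pipeline.

```python
from html.parser import HTMLParser


class TextFlattener(HTMLParser):
    """Hypothetical flattener: keeps text nodes, skips <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def flatten(html: str) -> str:
    parser = TextFlattener()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented page: the scores live in an image and the uptime is filled in by a script.
PAGE = """
<html><body>
<h1>Benchmark results</h1>
<p>Scores are shown in the chart below.</p>
<img src="scores.png" alt="chart">
<div id="dashboard"></div>
<script>document.getElementById("dashboard").textContent = "uptime 99.9%";</script>
</body></html>
"""

text = flatten(PAGE)
print(text)
```

The flattened text keeps the heading and the sentence pointing at the chart, but neither the chart's numbers nor the script-rendered uptime value, which is the gap a screenshot-based reader is meant to close.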

about 1 month ago

@lihanc02 @OrwellNGoode @HuanzhiMao @MangQiuyang 😂

0

1

0

0

65

andylizf retweeted

about 1 month ago

Really amazing results analyzing what's creative/novel vs. what's copied from Internet data, enabled by the amazing @liujc1998's Infini-gram! https://t.co/ZglfLg1dRF This is also enabled in @allen_ai's OlmoTrace https://t.co/ayLePYwGbZ where anyone can find matching n-grams between LLM-generated text and its training data.

1

83

10

54

17K

about 1 month ago

congrats Zihan! excited to read RAGEN-2

Zihan "Zenus" Wang

about 1 month ago

RAGEN-2 is selected as ICML oral! Congrats and great appreciation to all collaborators!! https://t.co/J6uuv7nkcf

9

157

16

28

15K

1

5

0

1

497

about 1 month ago

Codex basically replaced OpenClaw for me at this point, but the part I like most here is “memory as markdown files”. pretty refreshing to hear that from someone on the Codex team. model companies usually have every reason to want memory lock-in 🙄

about 1 month ago

https://t.co/9KkXEINuIj

67

3K

313

6K

529K

0

2

0

0

238

about 1 month ago

@LijieyYang looking forward to it!

1

1

0

0

39

about 1 month ago

@romitjain_ @thepushkarp Congrats! That's awesome.

0

2

0

0

47

about 1 month ago

huge congrats! agent-written kernels are fun. VibeServe is even cooler! can't wait for more from @KeisukeKamahori and the SyFI crew

about 1 month ago

Super stoked that UW SyFI (https://t.co/GsIZJi5LB5) members won a number of prizes at the MLSys'26 competition, NVIDIA Track. Huge congrats to @KeisukeKamahori, @sudopowr, Yile Gu, Wei Shen, Steven Gao! Thanks to @nvidia, @modal, and the Flashinfer team for the support.

1st place in the GDN Track — Full-Agent Approach
2nd place in the GDN Track — Agent-Assisted Approach
3rd place in the DSA Track — Full-Agent Approach

3

38

6

5

10K

1

13

1

1

2K

about 1 month ago

one of the most annoying things in MLSys: tiny ops keep forcing data to move around. CODA: do them before the tile leaves the chip

about 1 month ago

LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels.

CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip.

Bonus: LLMs can write fast CODA kernels too (approaching SoLs). https://t.co/cOTeMUr4py

16

687

103

536

200K

0

5

0

0

282
