Peter Sushko @PeterSushko - Twitter Profile

5 days ago

Very interesting to see pixels compared to HTML from the perspective of RAG. In web browsing agents we see that models trained on HTML beat VLMs initially, but the ceiling for visual models is higher. It seems RAG is now advanced enough to realize the gains from vision.

Yichuan Wang

@YichuanM

5 days ago

The web was never meant to be flattened into text. Yet most web RAG systems start by parsing HTML, a complex and lossy process.

🔥 Introducing PixelRAG: the first RAG system that retrieves and reads 30M+ web pages as pixels. Instead of extracting text, PixelRAG retrieves screenshots and lets a VLM read them directly.

PixelRAG not only preserves visual information, but also outperforms text-based RAG on text-only QA benchmarks by +18.1%. Why?

(1) HTML-to-text conversion often discards layout, structure, tables, and other useful signals.

(2) We continued pretraining a VLM on web page screenshots and turned it into a surprisingly strong visual retriever.

(3) Recent VLMs are remarkably good at understanding web pages, often with better accuracy and token efficiency than text-only pipelines.

Takeaway: HTML parsing may be one of the biggest self-inflicted bottlenecks in web RAG.

Demo below 👇

Code: https://t.co/ssDF0nnVwZ

Paper: https://t.co/OIpQ26Vb8H

Playground: https://t.co/UdzM7GQmu3

25

693

116

701

72K

0

2

1

0

571

Peter Sushko @PeterSushko

11 days ago

Live demo of MolmoWeb at #cvpr with @zixianma02 @RanjayKrishna @allen_ai

0

18

2

0

1K

Peter Sushko @PeterSushko

12 days ago

I am at CVPR. Come chat with the team and learn more about open source vision models. Also, Ai2 is hiring.

Ai2 @allen_ai

12 days ago

We're at #CVPR2026 with papers & talks across the conference. Come say hello and learn about our latest research!

2

23

3

7

9K

1

22

2

9

4K

PeterSushko retweeted

Nathan Clark

@nathanclark_

26 days ago

it’s in gemini, just create it in ai studio. oh, that’s for your personal google one account. for workspace you need gemini business. no, not gemini advanced, that’s ai pro now. unless you need ai ultra. oh agents? you do that in spark actually. no, not gemini api managed agents, that’s different. for coding use jules. unless you mean the agentic ide, that’s antigravity. no, that’s the old antigravity, download the new one. actually gemini cli is being deprecated, use antigravity cli. no the flash model is smarter than the pro model. unless you need pro. if it’s video, use flow. no, flow uses veo. no, nano banana is images. actually that’s in gemini now. unless you’re in search, then it’s ai mode. no, research is notebooklm. anyway it’s all very simple.

512

19K

2K

3K

2M

PeterSushko retweeted

Tanush

@tanushyy

about 1 month ago

Remember action recognition? The days of trying to climb on Kinetics? 👻 Announcing VideoNet, a CVPR 2026 Highlight 🎉 which revitalizes action recognition in the VLM era. Explore our data with this fun, interactive demo: https://t.co/W53aBi3QAX (1/8) 🧵

3

58

23

20

10K

PeterSushko retweeted

Ai2 @allen_ai

about 1 month ago

Robotics models often struggle outside controlled environments. Ours is built to work in real ones. Today we're launching MolmoAct 2, which can assist with a host of chores & lab tasks, plus the MolmoAct 2-Bimanual YAM dataset—the largest open robotics dataset of its kind. 🧵

11

294

74

149

394K

Peter Sushko @PeterSushko

about 1 month ago

@DJiafei Congratulations Jiafei!

0

1

0

167

PeterSushko retweeted

Weikai Huang

@weikaih04

about 2 months ago

We released the training code and full inference code today for WildDet3D! Check out here https://t.co/pMUaXDFsw4

3

296

32

264

29K

Peter Sushko @PeterSushko

2 months ago

LLM evals are hard. Agentic evals are very hard. Web browsing evals are crazy. The same webpage will show different content based on:

Time of year (seasonal promos)

Your IP (stores near me)

Your device (OS + browser combo)

Random A/B tests

This codebase solves evals and training.

Ai2 @allen_ai

2 months ago

You can now train, adapt, and eval web agents on your own tasks. We're releasing the full MolmoWeb codebase—the training code, eval harness, annotation tooling, synthetic data pipeline, & client-side code for our demo. 🧵

https://t.co/yMGRuzbeXQ

3

228

42

119

29K

1

17

7

13

5K

PeterSushko retweeted

Ai2 @allen_ai

2 months ago

Today we're releasing WildDet3D—an open model for monocular 3D object detection in the wild. It works with text, clicks, or 2D boxes, and on zero-shot evals it nearly doubles the best prior scores. 🧵

9

280

62

164

85K

PeterSushko retweeted

Tanmay Gupta

@tanmay2099

3 months ago

72hrs after the release, looking at the community’s excitement around MolmoWeb, I have been reflecting on what leading this project throughout the past year was actually like.

It didn’t feel like winning.

It felt like a constant uphill battle.

Making the case that this is worth building.

Building a team around the project from the ground up.

Working through compute constraints and org-wide competing priorities.

Showing early demos that didn’t quite land.

And so on.

But reading people’s comments, it is clear that builders wanted an open web agent they could run locally. They wanted MolmoWeb.

For me, it is a powerful reminder that sometimes you must go against the grain.

Sometimes you must work in silence until your results can speak for themselves.

If you are wrong, you will learn. If you are right, you might just give the world what it needs.

3

28

3

3K

PeterSushko retweeted

Jae Sung Park

@jjaesungpark

3 months ago

MolmoWeb picked my night out from comedy shows in seattle. even my weekends run on open-source now. see y'all at Moore.

1

16

7

3

3K

Peter Sushko @PeterSushko

3 months ago

@quantum_citoyen @allen_ai Small enough to run locally! Even the 8B model fits on a 48 GB GPU

0

22

Peter Sushko @PeterSushko

3 months ago

@twlvone @allen_ai Open source for the win :)

0

8

Peter Sushko @PeterSushko

3 months ago

@bnafOg @allen_ai And we will soon release a tool that will let you finetune MolmoWeb on specific types of tasks or websites. This way you can tailor the model to your needs!

0

1

0

16

Peter Sushko @PeterSushko

3 months ago

@Web3__Youth @allen_ai It’s great at tasks on a single website. Like looking up plane tickets, finding specific information, online shopping etc. We will soon release the eval code on GitHub, so you can test it in benchmarks!

0

20

Peter Sushko @PeterSushko

3 months ago

@anitakirkovska @allen_ai Playwright is a tool to execute actions in a browser. MolmoWeb is a model that comes up with the right actions. Playwright is like the hand and MolmoWeb is like the brain.

0

18

PeterSushko retweeted

DailyPapers

@HuggingPapers

3 months ago

Ai2 just released MolmoWeb on Hugging Face A fully open multimodal web agent that autonomously controls browsers to complete tasks, achieving SOTA results and surpassing GPT-4o based agents on WebVoyager and Mind2Web.

https://t.co/FyFD6zSP82

1

54

12

18

7K

PeterSushko retweeted

Boyuan Zheng

@boyuan__zheng

3 months ago

Still missing that sweet summer with the AI2 team ❤️CUA research is incredibly hard in academia — the lack of trajectories and RL environments is still a real bottleneck. (too profitable to open-source🥲) Excited to see MolmoWeb finally out and potentially unlock key directions for making CUA work: self-play, continual learning, RL in generative environments, and more. 2026 is going to be a big year for CUA. 🚀

0

39

4

6

5K

Peter Sushko @PeterSushko

3 months ago

Very proud & excited to share what I've been working on at Ai2. MolmoWeb is:

1. A pretty strong agent for browsing the web

2. A huge collection of artifacts: synthetically generated data, human annotations, model checkpoints, evaluation codebase (coming soon)

Check it out!

Ai2 @allen_ai

3 months ago

Today we're releasing MolmoWeb, an open source agent that can navigate + complete tasks in a browser on your behalf. Built on Molmo 2 in 4B & 8B sizes, it sets a new open-weight SOTA across four major web-agent benchmarks & even surpasses agents built on proprietary models. 🧵

https://t.co/ivUIcQDXtm

21

806

114

536

131K

0

26

5

4K
