We're excited to announce MultiNet v1.0 - the first cross-domain benchmark for multimodal AI systems.
Unlike existing evaluations that test models within single domains, MultiNet reveals what happens when AI systems encounter the full complexity of real-world tasks.
Dataset here: https://t.co/RFdGroflPO
This work is a small iteration building on tremendous research from folks like @xwang_lk, @TianbaoX, @PangWeiKoh, @lateinteraction, @ZhiruoW, @Adamlu28 and many others.
What an exciting time to be working in this field!
GUI-Perturbed is open source. Use it to evaluate your own models.
Pipeline: https://t.co/b6DBxRG1My
Technical Report: https://t.co/ge5sKrFUE4
Results Viewer: https://t.co/AbKIP3rgPl
The three models (Qwen2.5-VL-7B, UI-TARS-1.5-7B, GTA1-7B) share a base checkpoint but differ in post-training, so any gap in robustness comes from the training recipe, not the architecture.
Which model do you think would perform the best?
The team @figbrains, along with our friends @manifoldrg, took three of the best computer-use models and, surprisingly, broke all of them with very simple perturbations like changing zoom or colors.
Read on to understand our research, including a new SoTA Evaluation Dataset for Browser-use models + a new kind of interactive data sandbox!
At @figbrains, we’re testing frontier models (Fable, Kimi, etc.) on simple web tasks that should be solvable.
They failed in ways that wouldn't stump a human (we think).
Results coming in a few days, but we want to see how good humans are:
Which change causes the most failures?
GUI-DR confirms an intuition we at @figbrains have had for a while: today’s computer control models often overfit to specific interfaces rather than learning the underlying task. Systematic GUI perturbations significantly reduce model performance. Read more below!
It was great working with @figbrains on GUI-DR!
We applied domain randomization from robotics to vary visual scenes and instructions, exposing fragile model behaviors like confusing the browser search bar with the formula bar in Google Sheets.
The Software Control research team at Manifold has been working on advancing new frontiers in long horizon computer control & grounding with @figbrains
Check out some of our early research below, with more to come soon!
Computer Control models can score 90%+ on standard benchmarks, but will fail when you set page zoom to 70%.
We built GUI-DR, an open-source pipeline that can restyle, reposition, and remove DOM elements on real webpages to reveal model weaknesses that fixed-scene benchmarks miss.
Fig wants to directly support researchers working on foundationally new takes on frontier models - targeting hard problems like long-horizon, multi-environment action.
Reach out to contact@fig.inc if you're working on these or related areas.
This week at #CVPR2026 we presented MultiNet v1.0 at the MMFM workshop. It is a benchmark built around a question most evaluations skip: what happens to a multimodal model when you take it out of the one domain it was trained for and ask it to handle everything at once?
Loved @pliang279’s #CVPR2026 talk on AI modalities beyond vision/language: touch, smell, etc. The vision-tactile retrieval work reinforces that good representations make hard-to-observe signals queryable. We’re applying a similar lens to trajectories at @figbrains. More soon!
We built MultiNet v1.0 to test how well frontier models generalize across domains, from text to robotics to gameplay, and found surprising patterns of failure.
We're presenting at the #CVPR2026 MMFM workshop @ 3PM, room Four Seasons 4. Come hear where & how they break!
Headed to #CVPR2026!
I'll be there on behalf of @figbrains and @ManifoldRG, presenting our research on next-generation multimodal models and evaluation systems.
If you're into multimodal models, VLAs, or how we actually evaluate them, come say hi - I'd love to talk!
I’ll be at CVPR in Denver, along w/ some brilliant colleagues 🚀
If you’re around anytime over the next few days and interested in computer control or long horizon robotics, please reach out - the @figbrains team is around! We’d love to give a sneak peek at what we’re building.
Our next Frontiers Talk is on Tuesday, Dec 2 at 12 PM PDT.
@pranavguru13, Founding Research Engineer at @figbrains and lead for MultiNet at Manifold, will walk through how to build the next generation of multimodal benchmarks for functional intelligence.
Register below!
Our next Frontiers Talk is this Friday, Nov 21 at 12 PM PDT!
@pranavguru13, Founding Research Engineer @figbrains and Research Lead for MultiNet at Manifold, will share how to build the next generation of multimodal benchmarks for functional intelligence.
Register below!
Thrilled to share MultiNet v1.0 with the research community - a collaboration between research groups @figbrains, Manifold Research, @GeorgiaTech, and @MIT.
This benchmark reveals critical limitations in how current AI systems generalize across domains. 🧵
We're grateful to work w/ research teams @ManifoldRG, @GeorgiaTech, and @MIT.
Explore the benchmark: https://t.co/WMl4At9rUL and let us know what you think!
The benchmark includes comprehensive evaluations of GPT-5, OpenVLA, Pi0, Magma, and other leading models - with open-source adaptations enabling testing on tasks far outside their original design.
Results show even our most advanced models struggle with true generalization.