The three models (Qwen2.5-VL-7B, UI-TARS-1.5-7B, GTA1-7B) share a base checkpoint but differ in post-training, so any gap in robustness comes from the training recipe, not the architecture.
Which model do you think would perform the best?
The team @figbrains, along with our friends @manifoldrg, took three of the best computer-use models and, surprisingly, broke all of them with very simple perturbations like changing zoom or colors.
Read on to understand our research, including a new SoTA Evaluation Dataset for Browser-use models + a new kind of interactive data sandbox!
At @figbrains, we’re testing frontier models (Fable, Kimi, etc.) on simple web tasks that should be solvable.
They failed in ways that wouldn't stump a human (we think).
Results coming in a few days, but we want to see how good humans are:
Which change causes the most failures?
The Software Control research team at Manifold has been working on advancing new frontiers in long horizon computer control & grounding with @figbrains
Check out some of our early research below, with more to come soon!
Computer Control models can score 90%+ on standard benchmarks, but will fail when you set page zoom to 70%.
We built GUI-DR, an OS pipeline that can restyle, reposition, and remove DOM elements on real webpages to reveal model weaknesses that fixed-scene benchmarks miss.
Foundation models assume capabilities transfer.
MultiNet tests that: what happens when a multimodal model leaves its training domain and has to operate somewhere else?
Excited to see this work presented at CVPR 2026, developed with the @figbrains team!
This week at #CVPR2026 we presented MultiNet v1.0 at the MMFM workshop. It is a benchmark built around a question most evaluations skip: what happens to a multimodal model when you take it out of the one domain it was trained for and ask it to handle everything at once?
Loved @pliang279’s #CVPR2026 talk on AI modalities beyond vision/language: touch, smell, etc. The vision-tactile retrieval work reinforces that good representations make hard-to-observe signals queryable. We’re applying a similar lens to trajectories at @figbrains. More soon!
I’ll be at CVPR in Denver, along w/ some brilliant colleagues 🚀
If you’re around anytime over the next few days and interested in computer control or long horizon robotics, please reach out - the @figbrains team is around! We’d love to give a sneak peek at what we’re building.
Members of the GOLEM team at Manifold will be presenting work today at CVPR’s MMFM workshop - come by to learn more about MultiNet, a next gen benchmark for frontier action systems!
More details on room and time below ⬇️
We built MultiNet v1.0 to test how well frontier models generalize across domains from text to robotics to gameplay and found surprising patterns of failure.
We're presenting at the #CVPR2026 MMFM workshop @ 3PM, room Four Seasons 4. Come hear where & how they break!
Headed to #CVPR2026!
I'll be there on behalf of @figbrains and @ManifoldRG, presenting our research on next-generation multimodal models and evaluation systems.
If you're into multimodal models, VLAs, or how we actually evaluate them, come say hi - I'd love to talk!
What will it take to build the next generation of AI systems and frontier technologies?
The Manifold team will attend both CVPR 2026 and Vision Weekend UK this week!
If you’ll be there, come say hello! We’d love to meet folks interested in ambitious science and technology.
Join us for our first ever Vision Weekend in the UK!
2026 marks 40 years of Foresight. Over three days, we will gather leading researchers, builders, and funders to look forward: exploring what scientific and technological frontiers will shape the coming decades, and how to make them reality.
June 5–7 | London
Confirmed speakers include:
• Ed Boyden (MIT) on biologically accurate brain simulation
• Greg Wayne (Google DeepMind) on universal AI assistants
• Jano Costard (SPRIND) on challenges as a tool for breakthrough innovation
• Christine Peterson (Foresight Institute) on Foresight, 40 years later
• Dorothy Chou (Google DeepMind) on capital for the long game: financing durable innovation in an age of hype
• Irina Rish (Mila) on beyond scaling: toward continual and adaptive intelligence
• Chris Rozell (Georgia Tech) on closed-loop neuroengineering: algorithms that learn from the brain in real time
• Lee Cronin (University of Glasgow)
• Mehmet Fisek (Meridial) on Focused Research Organisation mission and setup
• Zoë Brammer (Google DeepMind) on AI for science 2030
• João Pedro de Magalhães (University of Birmingham) on hacking aging biology
and many more.
Get your tickets: https://t.co/nrK9PKN0ES
Powered by:
@apolloaievals @ARIA_research @e184media @CUHPartners @RenPhilanthropy @SPRIND @andnowstudio
Meet Sidh Sikka, PhD researcher in orbital robotics, co-founder of Manifold Research, and Foresight Fellow 2026.
@SikkaSidh is working toward autonomous robotic swarms, capable of assembling and managing large-scale infrastructure in orbit: the foundational layer for a sustainable industrial economy in space.
His R&D institute @ManifoldRG is currently seeking technical collaborators across several research projects, including their autonomous assembly project. Learn more: https://t.co/PtvLOIiMBJ
Sidh and his team are also seeking funding to grow this work. Reach out at sid [at] sidhsikka [dot] com
Manifold Research Group works on high-impact problems that fall between academia and industry. Small teams, real systems, published results.
Learn more about what we’re building at Manifold: https://t.co/pKzAwX6fqO
What does it actually take to build in space at scale?
The next phase of the space economy will depend on our ability to build and service systems at scale, directly in orbit.
We recently gave a talk on this with @foresightinst.
Link to the talk in the thread below 🧵
We are building toward coordinated, autonomous systems that can enable large scale construction in orbit, turning this from concept into deployable capability.
If you want to work with us on this, check out: https://t.co/5Y19ZLR0Yw
Manifold Research Group works on high-impact problems that fall between academia and industry.
Small teams, real systems, published results.
All open roles → https://t.co/5Y19ZLR0Yw
Can a multimodal model that reasons well in language also do so in a grid world? In a 3D sim? On a different task?
MultiNet tests whether models really generalize across tasks, across modalities. We're building it at @ManifoldRG, and we want researchers to join.
🧵 Roles below.
OS Research Fellow — Benchmark Platform & Release Engineering
Package MultiNet for the research community — notebooks, APIs, documentation, and reproducible tooling.
You'll shape how the field measures cross-modal, cross-task generalization.
Apply → https://t.co/IhQ6rColCP