Computer use models shouldn't learn from screenshots.
We built a new foundation model that learns from video like humans do. FDM-1 can construct a gear in Blender, find software bugs, and even drive a real car through San Francisco using arrow keys.
We’ve raised $75M in new funding from Sequoia and Spark Capital, partnering with @sonyatweetybird, @MikowaiA, and @YasminRazavi, all of whom are deeply supportive of our long-term mission. We’ve also brought on angels & advisors including @karpathy, @tszzl, and @_milankovac_.
-----
Our early results with FDM-1 moved computer use from a data-constrained regime to a compute-constrained one; this latest round of funding unlocks several orders of magnitude of compute scaling for that work. With the FDM model series we have a path to scale agentic capabilities through video pretraining, and we expect to achieve superhuman performance on general computer tasks in the same way that current language models have superhuman performance on coding tasks.
We’re also now able to invest in the blue-sky research necessary for our long-term mission of building aligned general learners. To realize the civilizationally transformative impacts of AI, models must generalize far outside their training distributions, actively exploring and building skills in new environments. This capability represents a substantial shift from the current paradigm of model training. We believe that current alignment techniques are insufficient to predictably and safely steer a model with human-level learning capabilities, so we’re studying small versions of this problem in controlled environments to develop a science of alignment for general learners.
We’re a team of 6 people in San Francisco. We’re hiring world-class researchers and engineers to help us achieve our mission. If that’s you, please get in touch.
New from me this morning: standard intelligence has raised $75m @ $500m to develop computer use models
Their hypothesis is that video pretraining gives a better action prior than text and screenshots ➡️ continual learning
And their training runs are very brat
Back when we were raising our seed round, Lachy was one of the only people in Silicon Valley who saw our idea, immediately got it, and wrote the check that let us train FDM-1.
Incredibly grateful to have him as an early supporter.
@si_pbc @sonyatweetybird @MikowaiA @YasminRazavi @tszzl @_milankovac_ VPT (https://t.co/CSxHcXY6Vh) blew my mind back in 2022 so I was very excited to see SI scale up the idea with FDM-1, but for knowledge work / computer use. Excited and looking forward to more!
There are very few moments in any decade where you come across a team with truly world-historic potential. I remember sitting down with Galen and Devansh and immediately knowing we had to find a way to work together. Partnering with the @si_pbc team has been, and continues to be, a privilege. I’m incredibly excited to see them thrive and to watch what the future holds for both the company and the exceptional people behind it.
@tbpn @devanshpandey @Roon my blog post in ‘22 actually emphasizes the importance of adept (rip) but also just about the utility of being able to prompt a computer use agent - because a prompt is text and can be arbitrarily created, transformed, piped, split, forked, etc
https://t.co/chotPjd6to
Standard Intelligence's @devanshpandey responds to @tszzl's tweet that "text is the universal interface," and explains why their new foundation model is trained on video:
"At some point in the arbitrarily long future, if we only use text models, we could force most things to be text. But I think there are just a lot of things that are much more native when done from a computer-use [perspective]."
"GUIs are designed for humans to use. We have this massive long tail of things on the internet that are entirely undoable by LLMs."
"For example, when I do ML engineering most of my time is spent doing the grunt work of engineering. It's a lot of looking at graphs, analyzing, and comparing loss curves. You can do this in text, but it's a much larger pain than doing it in the native interface."
"There's a reason humans don't interact with a computer purely through text, it would kind of suck."
some great design decisions here. masked diffusion, binning of delta mouse movements, IDM in the wild, self-supervision embedding objectives
video modeling, computer use, and robotics are not too far away from each other. great job to the @si_pbc team!
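For readers unfamiliar with "binning of delta mouse movements", here is a minimal sketch of the general idea: continuous per-frame mouse deltas are clipped and mapped to a small set of discrete bins so a model can predict them as tokens. This is not FDM-1's actual scheme. The bin count, clip range, and the mu-law style compression (which gives finer resolution to small movements) are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

# All three constants are hypothetical values picked for illustration.
NUM_BINS = 11      # odd, so the middle bin (index 5) means "no movement"
MAX_DELTA = 100.0  # clip range, in pixels per frame
MU = 10.0          # compression strength; larger means finer bins near zero


def delta_to_bin(delta: float) -> int:
    """Map a signed per-frame mouse delta (pixels) to a discrete bin index."""
    clipped = max(-MAX_DELTA, min(MAX_DELTA, delta))
    x = clipped / MAX_DELTA  # now in [-1, 1]
    # Mu-law style compression: small movements are spread over more bins.
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    # Rescale [-1, 1] to [0, NUM_BINS - 1] and round to the nearest bin.
    return round((compressed + 1.0) / 2.0 * (NUM_BINS - 1))


def bin_to_delta(index: int) -> float:
    """Invert delta_to_bin, returning the delta at the center of the bin."""
    compressed = index / (NUM_BINS - 1) * 2.0 - 1.0
    magnitude = math.expm1(abs(compressed) * math.log1p(MU)) / MU
    return math.copysign(magnitude, compressed) * MAX_DELTA


if __name__ == "__main__":
    for d in (-250.0, -20.0, -1.0, 0.0, 3.0, 40.0, 100.0):
        b = delta_to_bin(d)
        print(f"delta={d:7.1f} -> bin {b:2d} -> decoded {bin_to_delta(b):7.2f}")
```

A horizontal and a vertical delta would each get their own bin index under this sketch; the round trip is lossy by design, since many nearby deltas share one bin.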
computer use today lags pretty far behind other capabilities. a lot of it depends on the model guessing the right pixel coordinates to click on, which just feels so jank.
what's even more of an issue:
interacting with the web is hard to do properly by taking screenshots and not having a continuous stream of info (you can't watch videos, you miss important but disappearing elements and visual feedback, etc)
excited to see more