What if the entire video world model bet is the wrong modality?
A single F1 car generates 1.1 million sensor data points per second, compared to very few images.
Are we sure we want to learn from the less informative representation?
@ArtificialAnlys Image-to-video generation is just as impressive.
Input image:
"Generate a 16:9 image from a dashcam view of a formula 1 racing event"
Video prompt:
"A high-speed racing event where a car navigates multiple winding turns"
Sound on - generated by Cosmos 3.
While I share Elon's distaste for credentialism, I think he's swung a bit far on the anti-researcher pendulum.
It is my experience you need researchers chewing on cigarettes, walking circles in the parking lot, lying on the floor staring at the ceiling, producing many ideas that don't work, and making terrible engineering decisions, to find the novel borders of Truth.
One of RenTech's earliest successful strategies didn't make money until an entry-level engineer came in and fixed a trivial mistake in the code. Necessary, but so too were the thousands of hours of cigarette chewing from mediocre engineers.
Having the right ideas >> beautifully engineering the wrong idea.
Beautifully engineering the wrong idea == lots of work with little displacement
I was going through some of my old projects and came across Cheemscity, a Duolingo for robotics set in a cyberpunk city, with Cheems as the pedestrians.
That was a wild idea.
For every comment, I'll spill a piece of the lore.