Proud to work on such an imaginative type of data set…
Actually, we would encourage more people to be creative with their data - we are always here to help with innovative types of data collection and curation.
Once again, we enjoyed this project very much. @JulienBlanchon @satiyum @sarthaktiwaryy
I'm releasing OpenCS2, an 11 TB dataset of around 5,000 hours of Counter-Strike gameplay recordings.
- HD resolution - 1280×720 · 32 fps
- For each frame: keyboard and mouse input + world state (player position, velocity, weapon ...)
- HD Stereo audio
- All 10 players' perspectives
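
To make the spec above concrete, here is a rough sketch of what one per-frame record could look like, plus back-of-envelope numbers from the stated figures. Everything named here (`FrameRecord`, its fields, the helper functions) is hypothetical and not the actual OpenCS2 schema. It also assumes 11 TB means decimal terabytes and that "around 5000 hours" is total footage, which the post does not spell out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FrameRecord:
    """One frame from one player's perspective (hypothetical layout)."""
    player_id: int                          # one of the 10 perspectives
    frame_index: int                        # frame number at 32 fps
    keys_down: frozenset[str]               # keyboard state for this frame
    mouse_dx: float                         # mouse movement since last frame
    mouse_dy: float
    position: tuple[float, float, float]    # world state: player position
    velocity: tuple[float, float, float]    # world state: player velocity
    weapon: str                             # world state: active weapon


FPS = 32
HOURS = 5000                    # "around 5000 hours", taken as total footage
DATASET_BYTES = 11 * 10**12     # assumption: decimal terabytes


def total_frames(hours: int = HOURS, fps: int = FPS) -> int:
    """Number of video frames implied by the stated duration and frame rate."""
    return hours * 3600 * fps


def avg_bitrate_mbps(total_bytes: int = DATASET_BYTES, hours: int = HOURS) -> float:
    """Average storage rate in megabits per second, all modalities included."""
    return total_bytes * 8 / (hours * 3600) / 1e6


if __name__ == "__main__":
    print(f"{total_frames():,} frames")
    print(f"{avg_bitrate_mbps():.2f} Mbit/s on average")
```

Under those assumptions this works out to roughly 576 million frames at an average of about 4.9 Mbit/s across video, audio, and state data combined.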
More data leads to better models.
And more data is not as hard to get as people claim.
Don’t solve problems that don’t really exist.
Scale real world data.
data collection efforts should aim to maximize total information gain for the models
but information gain over what? for now, it's information gain over the internet corpus
as you see here, it's very fruitful to control the hardware, which makes our world more information-rich for models
Core idea: let an LLM actively steer the human during recording to induce structure that passive egocentric video systematically misses.
Most egocentric datasets are heavily biased toward a world where:
• tasks rarely fail
• failures rarely compound
• abandoning a task midway is always acceptable (low stakes)
This creates deceptively “clean” trajectories.
We’re experimenting with LLM-controlled egocentric data collection for world models or robotics learning.
The LLM assigns long-horizon objectives, interrupts with immediate sub-tasks, lets the episode drift into side tasks, and forces failure + recovery.
This produces intent → action → outcome → correction loops in real time, not post-hoc labels.
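
A minimal sketch of what such a steering loop might look like, assuming a random stub in place of the LLM and a placeholder in place of the recorded human behavior. All names here (`Directive`, `Step`, `stub_steerer`, `run_episode`) are hypothetical and only illustrate the intent → action → outcome → correction structure, not our actual system.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Directive(Enum):
    OBJECTIVE = "long-horizon objective"
    INTERRUPT = "immediate sub-task"
    SIDE_TASK = "drift into side task"
    FORCE_FAILURE = "forced failure + recovery"


@dataclass
class Step:
    t: int
    directive: Directive
    intent: str                 # what the steering model asked for
    action: str                 # what the human did (placeholder here)
    outcome: str                # "success" or "failure"
    correction: Optional[str]   # recovery instruction, only after a failure


def stub_steerer(t: int, rng: random.Random) -> Directive:
    """Stand-in for the LLM: step 0 sets the objective, later steps perturb it."""
    if t == 0:
        return Directive.OBJECTIVE
    return rng.choice(
        [Directive.INTERRUPT, Directive.SIDE_TASK, Directive.FORCE_FAILURE]
    )


def run_episode(num_steps: int, seed: int = 0) -> List[Step]:
    """Log one episode as a sequence of intent/action/outcome/correction steps."""
    rng = random.Random(seed)
    log: List[Step] = []
    for t in range(num_steps):
        directive = stub_steerer(t, rng)
        intent = f"{directive.value} #{t}"
        action = f"human attempts: {intent}"  # placeholder for recorded behavior
        failed = directive is Directive.FORCE_FAILURE
        outcome = "failure" if failed else "success"
        correction = f"recover from step {t} and retry" if failed else None
        log.append(Step(t, directive, intent, action, outcome, correction))
    return log


if __name__ == "__main__":
    for step in run_episode(6, seed=1):
        print(step.t, step.directive.name, step.outcome, step.correction)
```

The point of logging the correction next to the failure is that the label exists at recording time, rather than being reconstructed from video afterwards.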
Claim: diversity-aware prompting can fill data gaps passive video never will, without brute-force scale.
Curious whether this actually helps world models learn controllable, general dynamics. 🧵