AI was trained on the open internet, but the data that matters most lives in the real world.
Introducing early access to Numo, an app built to collect the next generation of AI training data.
Starting with voice data collection in Bengali, Hindi, Tamil, and Telugu.
Details ↴
The problem with AI data marketplaces is what happens after the first sale.
It's a value capture issue: high-signal data gets contributed, but platforms retain the full upside.
→ CDR changes that.
Datasets become composable, with licensing and conditional decryption built in.
I onboarded a new teammate at Story last month. Since then, he's written 9 RCAs, reviewed 31 PRs, and handled 81 Slack requests from coworkers, all in production.
Story is an IP blockchain company, and I'm the only infra engineer on the team based in Asia. He covers the hours I can't. He's an AI clone of me.
Here's how it's wired ↓
Most teams collecting voice data optimize for volume over quality, partly because they’re measuring quality wrong.
To help evaluate quality, we created the Poseidon Score. When we applied it, single-speaker audio scored well while multi-speaker conversations scored worse.
Why? ↓
In 1 year of Story mainnet:
▸ 5.6M+ IP registrations
▸ 90M+ transactions
▸ 13M+ onchain IP addresses
▸ 34k+ hours of real-world AI training data
▸ 405k+ data contributors worldwide
Happy 1-year anniversary, Story community.
In 2025, we built the foundation. Now we scale.
High-fidelity voice isn't something you can scrape from the internet.
It needs to be collected from distributed contributors around the world, curated for quality, and validated by native speakers.
This is what the cutting edge of voice AI datasets looks like ↓
The English speech performance in Nvidia’s PersonaPlex-7B model is impressive, but it’s also the most saturated corner of the data landscape.
The harder problem is low-resource languages and long-tail data.
That means scarce conversational corpora, unclear licensing, and underrepresented accents and dialects.
The next great app is at your fingertips.
This Wednesday: Local Host 3000, our new live-coding tutorial series.
Watch as @jacobmtucker and @0xNock build ideas from 0 to 100.
RSVP below ↓
AI is increasingly trusted with sensitive tasks, and yet it's still a black box.
A new model is emerging: verifiable compute on @eigencloud paired with licensed and properly attributed data on Story.
@jacobmtucker breaks it down ↓
Story in 2025
5M+ real-world audio submissions collected for AI training
405k contributors onboarded onchain
IP rails proved they can scale beyond NFTs
2026 to watch
Rights-cleared data becoming the backbone of AI and IP markets
More on @StoryProtocol
https://t.co/ZA6Yblo326
The remix economy is here, and it runs on Story.
Eco project @Aria_Protocol’s contest with K-pop icon NANA is proof: rights-cleared IP, remixed by you.
Explore what the future of IP sounds like ↴