Numo just passed 400,000 training data contributions.
all while still in early access, with near-zero marketing, and before the mobile app has even launched 🔥
gnumo
$6B+ in fresh crypto VC closing or raising rn
@dragonfly_xyz / $650M
@paraficapital / $125M
@HaunVentures / $1B
@a16zcrypto / $2.2B
@bcap / $700M (raising)
@paradigm / $1.5B (rumored for AI/robotics)
thesis is the same across the board: stables, RWAs, AI agents transacting onchain 24/7
VC money is chasing the infra to bring IRL finance and autonomous economies onchain at bear valuations
infra supercycle?
In 2021, Nancy Pelosi was asked if Congress should be banned from trading stocks.
Her response: "No… This is a free market."
While serving nearly 39 years in Congress with an average annual salary of $168,000, she increased her net worth to an estimated $280,000,000.
AI training data has a massive problem.
The chart below shows language distribution from the latest Microsoft VibeVoice model, in actual linear scale.
do you see the problem?
the language representation is embarrassingly bad.
e.g. Hindi has <0.099%, but 600M ppl speak it worldwide.
e.g. arabic sits at 0.19%... yet it's spoken by 400M people across 25+ countries and is one of the UN's six official languages.
e.g. japanese is dead last on the chart at 0.002% — but ~80M people speak it. that's more native speakers than german, which gets 1.85% on this same chart.
so german is roughly 925x more represented than japanese (1.85 / 0.002).
e.g. bengali, the world's 6th most-spoken language with 270M speakers, doesn't appear on the chart at all!
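a quick sanity check on those numbers, as a minimal sketch: it only uses the shares and speaker counts quoted above, treats hindi's "<0.099%" as an upper bound, and the helper names (`ratio`, `share_per_100m_speakers`) are made up for illustration.

```python
# Shares (% of training data) and speaker counts as quoted in this post.
# Hindi is quoted as "<0.099%", so 0.099 is an upper bound (assumption).
share_pct = {"german": 1.85, "arabic": 0.19, "hindi": 0.099, "japanese": 0.002}
speakers_m = {"hindi": 600, "arabic": 400, "japanese": 80}  # millions of speakers


def ratio(a: str, b: str) -> float:
    """How many times larger language a's share is than language b's."""
    return share_pct[a] / share_pct[b]


def share_per_100m_speakers(lang: str) -> float:
    """Percentage points of training data per 100M speakers."""
    return share_pct[lang] / speakers_m[lang] * 100


if __name__ == "__main__":
    print(f"german vs japanese: {ratio('german', 'japanese'):.0f}x")
    for lang in speakers_m:
        print(f"{lang}: {share_per_100m_speakers(lang):.4f} pct points per 100M speakers")
```

dividing the quoted german share by the quoted japanese share gives about 925x, and none of the three languages with speaker counts gets even 0.05 percentage points per 100M speakers.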
That is now changing.
Numo is sourcing high-quality, long-tail data, starting with voice recordings in languages like Hindi and Bengali, for better AI training that the whole world can benefit from.
every contribution on Numo is fully licensed, attributed and IP safe thanks to Story. Full data transparency.
AI was trained on the open internet, but the data that matters most lives in the real world.
Introducing early access to Numo, an app built to collect the next generation of AI training data.
Starting with voice data collection in Bengali, Hindi, Tamil, and Telugu.
Details ↴
Italian efficiency when it comes to coffee should be studied.
In Italy:
- Walk into a bar and look at the guy
- "Un caffè"
- 30 seconds later it’s ready
- Shoot it
- Leave €1
- Walk out
In the US:
- Join a line
- Wait
- Order coffee
- Answer 12 questions: Size? Milk? Roast? Sugar? Temperature? Colombian beans? Name? How do you spell it?
- $12.34
- Get asked for a 20% tip. Tap 5 times on an iPad to enter a custom tip
- Tap phone
- ask where to send the invoice
- Wait again on a different line
- Someone calls a name that sounds similar to mine
- get the coffee
- too hot, can't drink it
- finally at temperature
- Tastes like shit
every AI model you use was trained on your work. your tweets. your code. your photos. your voice.
the industry called it "publicly available" (or slid it into their terms of use) and moved on. but that initial wave of "free" data is drying up.
AI models constantly need new real-world data: how people speak, work, move, decide. and that data has to be contributed.
contribution means consent. consent means ownership. ownership means IP.
suddenly... "data is IP" makes a lot more sense.
we see this a lot through Poseidon.
the buyers are asking for real, verified, consented data on how humans actually behave. and they're willing to pay for it in a way the old scrape economy never required.
we've been building the rails for this. something you can use, not just read about. very soon.
zkSync dropped 38% in q1 and 295m ZK got staked during the drawdown. 75% of season 1 cap filled at the lows. the UAE central bank is settling $20b+ in sovereign assets on this infrastructure right now. people don't lock tokens into a 38% crash unless they know what's in the pipeline. price says dead L2. staking says informed accumulation. one of these is wrong.
Anthropic published a blog post one hour ago.
Cybersecurity stocks have lost $10B since.
CrowdStrike -6.5%. Cloudflare -6%. Okta -5.7%.
One blog post. One hour. $10B gone.
Copyright won’t die, it will need to reinvent itself.
docusigns, manual licensing negotiations and slow government portals are no longer an option.
We are in a “remix first, ask for forgiveness later/never” state of the world. So we need automated retroactive licensing at scale.
programmatic licensing is the only way forward, and it needs to be routed where things happen, e.g. at the inference level, or even deeper at the infra level of the internet, embedded within the HTTP handshake.
Either way, IP has never been more important.