@dambuildshit Step 1: Build a free DNS provider
Step 2: Collect tolls for access to sites behind their DNS
Step 3: Build a crawler that doesn't need to pay their own toll
Step 4: Profit
The web isn't a database. @diffbot makes it one. 10B+ entities and 1T facts extracted from 60B+ pages, rebuilt every 4-5 days. DuckDuckGo, Snapchat, and Dow Jones run on it.
Massive powers the proxy infra behind their continuous crawl.
Ever wondered what your white name should have been? Introducing: https://t.co/d4nEQgv88j
Upload a picture of you, and let the puppy guess your name! Let's test out nominative determinism 🫡
(Immigrants who named themselves will correlate more highly. Give us feedback plz)
Our thanks to:
- @modal for their generous credits toward training this meme model
- @diffbot for the clean, diverse dataset!
- @leannch86920 for the training research!
- Everyone NOT named David (biggest & noisiest dataset ever)
@devanshu_twt Sorry! It’s not ideal but it’s the easiest way to weed out 99% of abusers. When the product makes it easy to crawl the web, you get a lot of bad actors.
Still thinking of a better way to solve this!
@groby Sorry for the late reply (and happy New Year!)
It's not on the immediate horizon, but implementing a credit balance model with a low minimum is something we've discussed. I personally prefer it.
Would you mind emailing me at jerome[@]diffbot?
State of E-commerce Data Providers - Q4 2025
E-commerce runs on constant measurement: prices, promos, availability, seller changes, and "what the shelf actually looks like" across retailers and marketplaces.
The challenges are stable collection at scale, retries when sites break, anti-bot evasion, clean geo signals, and turning messy HTML into usable structured data.
In preparation for the holiday season, we mapped the landscape of e-commerce data providers:
Competitive intel + digital shelf: @dataweavein, @Price2Spy, @bigdataNODE, @Profitero, @WiserInc
Marketplace intelligence + data: @junglescout, @H10Software, @datahawkco, @SellerSprite_EN
Trade, Supply Chain, Imports / Exports: @Trademo1, @ImportYeti, @datamyne
Scraper APIs & Extraction Platforms: @zytedata, @diffbot, @Stratalis, AutoScraping, @serpapi
Managed Data Extraction & Services: @groupBWT, @Data_Ox, @epctex, @MrScraper_
Retail Media & Ad Platforms: @Pacvue, @PerpetuaLabs, @Teikametrics
Network & runtime infra for e-com scraping: @playwrightweb, Puppeteer, @browserless
YouTube, TikTok, Mastodon, & Threads are mostly there but need optimizing.
Diffbot goes incredibly far with articles & that’s also moving along well.
Reddit & Bluesky are readily available but I haven’t spent the time.
X is finished but the endpoint gets rate limited 😞
BREAKING: The Internet
Massive outage being reported across platforms including Spotify, Google Cloud, AWS, Cloudflare, Claude, YouTube, Gmail, and many, many more
#Perplexity Sonar Pro API launched last week as the best performing model on factuality.
24 hours later, it's the 2nd best performing model (and it's not because of #DeepSeek).
Why? 👇
Diffbot launches open-source AI model that achieves 81% accuracy by querying a trillion-fact Knowledge Graph in real-time instead of relying on static training data 🧠📊
Read more: https://t.co/YcyWz8DKnw #ArtificialIntelligence #Enterprise #MachineLearning @diffbot