🚀So you wanna build a data business?
One of the things I didn’t understand when starting @cbinsights was the many ways you can collect and create the information that underpins a data business.
As we've built CBI and as I've studied others, I think there are 7 primary strategies employed for data collection & creation (herein "data collection").
I’ll describe each below with some examples.
Notes:
▪️ Most data companies that have been around for a while employ several of these tactics, although they typically start with 1, maybe 2. For example, CB Insights uses 5 of the 7 data collection methods now but we started with just 1.
▪️ Some of these data collection methods are inherently more valuable than others. I don’t discuss that below but may if this niche’y topic is of interest to folks.
If you’re building a data business (or interested in this space), hope this is helpful.
For other data business builders, keep me honest in the comments if I’ve missed anything.
The 7 data collection methods:
1. Ground & pound
2. “Web scraping”
3. UGC
4. Pooled / data consortiums
5. Surveys & Interviews
6. Sawdust collection
7. God algorithms
---------------------
1. Ground & Pound
---------------------
▪️ Directly collecting data through manual effort, often involving significant human resources.
▪️ This is the way a lot of people start.
▪️ Data businesses often require a high pain tolerance in the beginning. This is good and bad. Good in that not everyone likes doing data janitor work. Bad in that you’re doing data janitor work.
Examples
▪️ ZoomInfo: Initially collected contact information by cold-calling companies.
▪️ Waze: Initially relied on drivers to manually report traffic conditions, accidents, and hazards.
▪️ CB Insights: We initially extracted company and transaction information from reading and manually parsing 50,000 articles. Once we saw patterns there, they informed the machine learning capabilities we eventually built.
---------------------
2. “Web scraping”
---------------------
▪️ Aggregating, linking, and harmonizing data from various public and third-party sources, often unstructured or semi-structured.
▪️ This often involves web crawling, machine learning, etc.
▪️ I put “web scraping” in quotes because people seem to think this is easy (“Oh, so you have bots?”). Extracting data from semi-structured and unstructured documents in a high-fidelity way is not easy to do.
Examples
▪️ Meltwater: Collects data from online news and social media for media intelligence that they sell to comms and PR teams.
▪️ CoStar: Aggregates data from public records, such as property tax records, deeds, and zoning information, to enhance their real estate database.
▪️ Bloomberg: Aggregates financial data from multiple sources and provides harmonized datasets.
--------
3. UGC
--------
▪️ Users contribute content. These are often lead gen or advertising revenue models vs direct data subscription businesses.
▪️ This is the model most at risk from generative AI (GAI) and the resulting changes to Google.
Examples
▪️ Yelp
▪️ Glassdoor
-------------------------------
4. Pooled / Data Consortium
-------------------------------
▪️ Also referred to as Data Co-operatives or Data Co-Ops, pooled data typically entails collecting data from multiple entities within an industry to gain broader insights and more comprehensive datasets.
▪️ Typically, contributors of the data are provided back anonymized evaluation benchmarks and analytics that help them operate their businesses better.
Examples
▪️ Equifax: Part of a consortium of credit bureaus sharing data for comprehensive credit reporting.
▪️ Verisk: Collects insurance industry data.
▪️ Raiser's Edge NXT: Pooling fundraising and donor data across nonprofits for better insights.
▪️ Pave & Payscale: Gather salary information from employers and employees to create a comprehensive database for compensation analysis.
▪️ IQVIA (formerly IMS Health & Quintiles): Collects data directly from clinical trials and research studies conducted worldwide.
---------------------------
5. Surveys & Interviews
---------------------------
▪️ Collecting data through structured surveys, interviews, and questionnaires.
Examples
▪️ Nielsen: Uses TV diaries, online surveys, and in-person interviews to gather data on viewing habits and consumer preferences.
▪️ JD Power: Known for its customer satisfaction and product quality surveys in various industries including automotive, healthcare, and finance.
▪️ Gallup: Known for its public opinion polls and surveys, which cover a wide range of topics including politics, the economy, and social issues.
▪️ CB Insights: We interview software buyers and make those transcripts available along with extracting structured data from those conversations around metrics like pricing, CSAT, etc.
-------------------------
6. Sawdust Collection
-------------------------
▪️ Creating valuable data as a byproduct of a company's primary operations or services.
Examples
▪️ Google: Generates data from search queries and user behavior on its platforms.
▪️ Slice: Shopping app that organized the order confirmations and shipping notifications that retailers e-mail to consumers after purchase. Slice mined this for insights into purchasing trends they could sell to marketers, investors, etc. Slice was acquired by Rakuten.
▪️ John Deere: Collects agricultural data from sensors on farming equipment. For example, harvesting equipment carries sensors that measure crop yield, moisture levels, and operational efficiency, which can be used to analyze field performance and improve future yields.
▪️ LinkedIn: Insights on where people being hired are coming from, what roles they are hiring in, growth trends, etc. come from the updates people make to their core individual profiles.
▪️ Yodlee/Envestnet: Data on financial transactions (e.g., credit card purchases). While they don’t provide directly identifiable information, they do provide aggregate data that can help analysts understand how Uber’s or Instacart’s revenue might be looking.
---------------------
7. God Algorithms
---------------------
▪️ Using advanced algorithms to create valuable data products from existing data.
▪️ Typically requires market credibility/blessing before introduction
Examples
▪️ Zillow: Uses the Zestimate algorithm to estimate property values.
▪️ FICO: Generates credit scores based on proprietary algorithms analyzing financial data.
▪️ CB Insights: Our private company health score (Mosaic) and Commercial Maturity Score are created using algorithms.
So those are the 7 ways to collect data upon which to build a data business.
However, just because you can collect data doesn’t mean you have a potential business.
You need to understand if that data is valuable.
If there is interest, I’ll dig into what makes data valuable in a separate post.
The future of building startups:
- MVP speed (1x per month)
- AI-accelerated
- Superniche is the new niche
- Community 1st, software 2nd
- No-code 1st, some code 2nd
- 10x more automated
- Global teams, localized products
- 95% dominated by solopreneurs and microentrepreneurs (teams of fewer than 12)
- Pop-up digital experiences (apps that only work at certain times)
- Needs the marketing holy-trinity to hit escape velocity: 1. product/market fit, 2. content/market fit and 3. community/market fit
- Team is half robots 🤖, half humans 👨‍🦰 (cc @youneedarobot)
- Accelerated by "boring marketing" (cc @boringmarketer)
- Multiple revenue streams
- Design matters. The bar is high
- Partnered w/ creators (creators are the distribution)
- Feels like a game (levels, status, badges, in-app currency, challenges, collectibles/items)
- Purpose-driven moonshots: societal impact matters
- Productized agencies to generate cashflow (e.g., design agency @DispatchDesign)
- Product studios become the norm
- 99% of MVPs won't need VC
Sam’s advice here is great if you’re Sam. Yeah, vastly wealthy repeat entrepreneurs can (and sometimes should) raise huge amounts on very high valuations, build huge teams, and run research projects.
Everyone else who can’t raise $100m’s with zero traction on an idea…
I've talked to multiple founders recently who have changed their minds about remote work and are trying to get people back to the office. I doubt things will go all the way back to the way they were before Covid, but it looks like they will go most of the way back.
A bunch of my friends have gotten REALLY into therapy. Which is cool.
The upside: they become really self-aware (of past traumas, shortcomings, etc.).
The downside: they become "too" self-aware.
Self-awareness kills your self-delusion. And self-delusion is a powerful tool.
What an interesting moment.
We're staring at two distinctly different visions of the future. They may co-exist, but they are radically different takes on what's modern, what's current, and where things are headed.
One vision gets the UI out of the way. The other vision is UI everywhere you look.
One vision gets the computer out of the way. The other vision mounts a computer on your face.
One vision is get it and go. One vision is get it and stay.
One vision is about information. The other vision is about immersion.
One vision is natural and understands you. The other vision requires new methods of interaction that you have to learn and master.
One vision feels like an assist. One vision feels like obstruction.
One vision fits with whatever you already have. One vision requires you buy something that fits.
One vision is simply text. One vision is anything but.
One vision feels like before. One vision feels like after. But I'm not sure which is which.
being a VC is so easy, so high-status, the money is so great, and the lifestyle is so fun.
very dangerous trap that many super talented builders never escape from, until they eventually look back on it all and say “damn i’m so unfulfilled”.
I started my company 16 years, 3 months, and 5 days ago.
Today, it went public.
But let's rewind for a second...
5,939 days ago, I was a barista at a small cafe called @2percentJazz2, in Victoria, Canada.
I made $6.50 an hour.
Two guys, Chris and Jeff, started coming into the cafe.
They'd sit there all day drinking espresso and typing away on their laptops, using the wifi.
After weeks of this, I asked them what they did for a living.
Didn't they have jobs?
They told me they were "web designers" and this — sitting on their laptops — was their job.
As I dug in, they told me how it worked:
They asked local businesses if they needed a website, then charged them a couple thousand bucks to make one.
They could whip a website together in a few days, and each make $1,000.
Simple.
This blew my mind.
And at that moment, I realized something:
I wanted to be the guy drinking the espresso, not the one serving it.
Chris and Jeff were clearly smart, but I knew some basic HTML and figured I could do the same.
I decided to try it out.
When I got off my shift, I took the bus over to a book store downtown and bought a book called 'Bulletproof Web Design' by Dan Cederholm (@simplebits) to hone my skills.
Then, I googled "freelance web design jobs" and found a tech job board called Authentic Jobs made by this guy in Utah, @cameronmoll.
There were hundreds of posts, mostly from startups in San Francisco, looking for freelance web designers.
I decided to try to win one of these contracts, but I had a critical insight:
Nobody wants to hire an 18-year-old barista to build their website.
So, I decided I'd create a fake design agency.
Using tricks from Dan Cederholm's book, I whipped together a slick looking site and called my "agency" MetaLab (after the <meta> tag in HTML).
The website was very vague as to what exactly MetaLab did, who worked there, or where we were located.
It also featured a cringe-inducing tagline "We Help People Make Cool Stuff."
Like an email spammer, I started sending emails to every single web design job post I could find.
I was met with crickets, until I got an email from a guy named Kavin Stewart (@kavinstewart).
He worked at a startup called Offermatica in San Francisco and told me he needed an interface designed for a web app.
I barely understood what a web app was, but I assured him I could do it.
He proposed a $2,000 USD budget and my eyes went wide.
This was more than I earned in a month, and the project was just a few days of work.
I walked into the cafe the next day and quit my job.
I told myself that if I could just make enough money to wake up whenever I wanted and comfortably make rent, I'd be good.
The rest is history.
But I slightly overshot.
I still own MetaLab, but along the way me and my business partner @_Sparling_ started dozens of companies, then began buying wonderful businesses, including one (Dribbble) — amazingly — from Dan Cederholm, the designer whose book I bought when I first started.
Today, Tiny went public, and as of this moment has a market capitalization of just under $800 million.
I can't even begin to explain how mind boggling this is to me.
This has not been a feat of entrepreneurial genius.
My key skill has been choosing incredible people to work with, both internally and externally, and I wanted to say a huge thank you to everyone who has worked at Tiny and our various companies over the years.
And a special thanks to @simplebits, @cameronmoll, and @kavinstewart for helping with my first step😉
Watch for us on the TSX Venture Exchange under the ticker TINY (how cool is that ticker?).
https://t.co/tlpgiEO2MU
If you are founding a start-up in Kansas and any of your investors ask you to participate in the @KansasCommerce Angel Investor Tax Credit Program... say NO!
After we were acquired last year, lawyers for the Kansas Department of Commerce sent a demand letter for repayment, with no discussion or phone call to understand the ramifications.
After my emails to the program director were ignored, I finally got ahold of their legal counsel. I tried to explain the situation (they accused us of moving out of state, which we didn't). 2 months later (after no update), I sent an email requesting an update. I received an immediate response...
"Commerce will not be modifying the agreement. Please remit repayment of $147,500 on or before close of business Friday, April 14, 2023". lol.
If your investors ask you to participate, DON'T DO IT. If they ask questions, feel free to reach out to me. I'm happy to provide details on why you don't want to touch this program with a ten foot pole.
Very excited for my conversation with @BenRabidoux about the Canadian Real Estate Market to go live this morning! Thank you Ben for your insights on Immigration, Interest Rates, Compounding Pain, and more!
https://t.co/wwGMPxW93b
A round of applause to our winners 👏 Julie with 2ndScan, Toby with Ram, Jordan with Tappen Apothecary, Sutunary with Sutunary and Athena with Inkblot Games from our Spring #PlanIt competition.
#UVicInnovation @uvic
To: My fellow data/info-biz nerds
This is a great summary and another reason why proprietary quantitative or qualitative data & insights are going to be so powerful in this generative AI / LLM future
h/t @joeobrien513
The gov’t has about 48 hours to fix a soon-to-be-irreversible mistake. By allowing @SVB_Financial to fail without protecting all depositors, the gov’t has woken the world up to what an uninsured deposit is: an unsecured illiquid claim on a failed bank. Absent @jpmorgan, @citi or @BankofAmerica acquiring SVB before the open on Monday, a prospect I believe to be unlikely, or the gov’t guaranteeing all of SVB’s deposits, the giant sucking sound you will hear will be the withdrawal of substantially all uninsured deposits from all but the ‘systemically important banks’ (SIBs). These funds will be transferred to the SIBs, US Treasury (UST) money market funds and short-term UST. There is already pressure to transfer cash to short-term UST and UST money market accounts due to the substantially higher yields available on risk-free UST vs. bank deposits. These withdrawals will drain liquidity from community, regional and other banks and begin the destruction of these important institutions. The increased demand for short-term UST will drive short rates lower, complicating the @federalreserve’s efforts to raise rates to slow the economy. Already, thousands of the fastest growing, most innovative venture-backed companies in the U.S. will begin to fail to make payroll next week. Had the gov’t stepped in on Friday to guarantee SVB’s deposits (in exchange for penny warrants which would have wiped out the substantial majority of its equity value), this could have been avoided and SVB’s 40-year franchise value could have been preserved and transferred to a new owner in exchange for an equity injection. We would have been open to participating. This approach would have minimized the risk of any gov’t losses, and created the potential for substantial profits from the rescue. Instead, I think it is now unlikely any buyer will emerge to acquire the failed bank.
The gov’t’s approach has guaranteed that more risk will be concentrated in the SIBs at the expense of other banks, which itself creates more systemic risk. For those who make the case that depositors be damned as it would create moral hazard to save them, consider the feasibility of a world where each depositor must do their own credit assessment of the bank they choose to bank with. I am a pretty sophisticated financial analyst and I find most banks to be a black box despite the 1,000s of pages of @SECGov filings available on each bank. SVB’s senior management made a basic mistake. They invested short-term deposits in longer-term, fixed-rate assets. Thereafter short-term rates went up and a bank run ensued. Senior management screwed up and they should lose their jobs. The @FDICgov and OCC also screwed up. It is their job to monitor our banking system for risk and SVB should have been high on their watch list with more than $200B of assets and $170B of deposits from business borrowers in effectively the same industry. The FDIC’s and OCC’s failure to do their jobs should not be allowed to cause the destruction of 1,000s of our nation’s highest potential and highest growth businesses (and the resulting losses of 10s of 1,000s of jobs for some of our most talented younger generation) while also permanently impairing our community and regional banks’ access to low-cost deposits. This administration is particularly opposed to concentrations of power. Ironically, its approach to SVB’s failure guarantees duopolistic banking risk concentration in a handful of SIBs. My back-of-the-envelope review of SVB’s balance sheet suggests that even in a liquidation, depositors should eventually get back about 98% of their deposits, but eventually is too long when you have payroll to meet next week. So even without assigning any franchise value to SVB, the cost of a gov’t guarantee of SVB deposits would be minimal.
On the other hand, the unintended consequences of the gov’t’s failure to guarantee SVB deposits are vast and profound and need to be considered and addressed before Monday. Otherwise, watch out below.