I wrote Deep Learning with Python to be the definitive guide to how deep learning works and how to best make use of it. Tens of thousands of people got their career start via this book. 120,000 copies sold, and downloaded by millions more.
And now it's free to read online: https://t.co/3CbcQ7hmjp
Buckle up because we're crashing into the new year with my annual database retrospective: License change blowbacks! @databricks vs. @SnowflakeDB gangwar! @DuckDB shotgun weddings! Buying a college quarterback with database money for your new lover! https://t.co/NnFHGElFNy
Databricks spent $1-2 BILLION dollars to acquire a ~30 person company from the creators of Apache Iceberg.
A revolution is going on in the Big Data space and its centering around Iceberg. 🧊
Why would Databricks spend this outrageous amount of money on such a small company?
Figure out here (2-minute read) 👇
The story of Iceberg is a classic disruption story. You set out to solve a relatively niche problem and the solution ends up “accidentally” solving a larger problem. 🏆
The relatively niche problem in this case was Apache Hive. 🐝
Apache Hive was a popular query engine for big data sets, and it implicitly used a simple table format.
✋ Pause. Quick primer on the terminology used here:
• 📁 file format - a format which adds additional metadata to a file to help you organize the file’s data (e.g so you can read only what you need, so you can modify and evolve the file’s contents, etc) - stuff like CSV, Parquet, ORC, Avro
• 🗃️ table format - a format which adds additional metadata to a COLLECTION of files, so that again you enhance what you can do with them - e.g only read the file you need, modify the structure and organization of the files, add ACID capabilities, etc.
This includes open table formats like Iceberg, Delta, Hudi and many implicit ones (e.g MySQL’s InnoDB, PostgreSQL, Snowflake)
Hive had done a few things quite well:
• its format was simple and easy to understand 👍
• this made it ubiquitous - Hive tables are WIDELY supported in most query engines - Hive, Spark, Presto, Flink, Pig
• the gain was that the whole ecosystem could use the same at-rest data 👌
But Hive also had a few problems, succinctly:
• non-atomic writes when writing to multiple partitions of the data, resulting in mishaps (deleted data, half-done jobs) 😨
• inefficient with relation to cloud object storage ☁️
• scale challenges
Ryan Blue and Daniel Weeks, the creators of Iceberg, figured out that the main bottleneck was the table format itself.
So, while at Netflix, they set out to create a new format that solved for these issues.
The new format - Iceberg - improved on the following:
1. all changes are atomic with serializable isolation
2. support for many concurrent writers ⚡️
3. native cloud object store support 🌤️
4. no gotchas & surprises (e.g renaming a Parquet column in Hive breaks a ton of stuff)
5. … a lot more
In classic disruption fashion, the first three improvements ended up solving a much bigger problem. 🏆
Which problem was that?
The problem of Shared Database Storage.
With an open table format like Iceberg, you can store your data in one single source of truth (e.g S3) and have many different engines access and modify the data at the same time. 🤯
This is the rise of the so-called headless data architecture, where the storage layer (data) is decoupled from the query layers (engines) that use it. 💡
It is the key enabler of the growing trend called zero copy.
Zero Copy means that you do NOT have to spend millions in expensive cloud networking costs to copy petabytes of data to have it be used by the right processing engine – you can use the same set of data in the standardized Iceberg table format. 🧊
The two layers have always been tightly coupled because the query layer relies deeply on optimizations in the storage layer which allow for data to be fetched efficiently for faster querying.
And because the table format essentially defines the storage layer, you have big dogs like Snowflake and Databricks outbidding each other for Tabular (a company founded by the Iceberg creators) and aggressively competing with each other on the table formats. 💸
Just in the last few weeks we had some major announcements:
• June 3: Snowflake’s Open Source Polaris Iceberg Catalog announced
• June 4: Databricks acquires Tabular
• June 13: Databricks’ Unity Delta Catalog open sourced
And it seems like this is just the beginning…
Interested in more concise, simple content around the table format wars and the lakehouse revolution?
1. Follow me here - ✅ @kozlovski
2. Retweet this story so your network learns too. It takes 5 seconds to do, and it takes me 5 hours to write 🙏
The fundamental error all of these GenAI use cases make is in assuming that people will want to read something that other people couldn't be bothered to write.
The xz situation is absolutely insane and almost certainly state sponsored.
This is an excellent example of a widely used software being maintained by basically one person.
Read this web article and then frown and become sad.
https://t.co/0nHEh7hsBY
Good morning, and say "Hello!" to Raspberry Pi 5! The #Pi5 is ×2–3 faster, has PCIe support, a new RTC, and our own silicon designed in‑house here in Cambridge The everything computer. Optimised, https://t.co/tsjQaESmiH. More at https://t.co/6PKTuSiDxq. #RaspberryPi#RaspberryPi5
Today might have seen the biggest physics discovery of my lifetime. I don't think people fully grasp the implications of an ambient temperature / pressure superconductor. Here's how it could totally change our lives.