Experience how easy it is to take data from your source data systems, ingest them into Apache Iceberg and serve a BI dashboard from the confines of your laptop with these tutorials.
#DataLakehouse#DataLake#DataEngineering#ApacheIceberg
Know Someone Learning Data Engineering, share this with them:
Hands-on with Apache Iceberg on Your Laptop: Deep Dive with Apache Spark, Nessie, Minio, Dremio… https://t.co/QltefURtV3
#DataEngineering
ICEBERG METADATA TABLES
This article will walk you through a hands-on exercise to get familiar with the Iceberg metadata tables.
Read here: https://t.co/32q9KUqKhh
#DataEngineering#ApacheIceberg#DataLakehouse
DREMIO.ICEBERG.DBT.NESSIE.MINIO.POSTGRES.MONGODB
If you want to try a deep end-to-end tutorial that will get you hands-on with a variety of popular data tools, try this one out.
https://t.co/jsZFBLMJ97
#DataLakehouse#DataEngineering#ApacheIceberg#Dremio#dbt#DataScience #DataAnalytics
HOW ICEBERG CATALOGS WORK
Iceberg tables are one part data stored in several parquet files and a second part metadata files that provide context and understanding of that data as a singular table.
The metadata entry point is a file called metadata.json which tracks the tables schemas, partition schemes and snapshots. Everytime the table changes a new metadata.json is created.
So when there is possibly dozens or hundreds of these metadata.json files, how does an engine like Dremio, Snowflake or Apache Spark know which is the right one to query the table accurately. This is where a catalog comes in like Nessie and Polaris.
A catalog acts like a traffic controller maintaining a list of tables along with the file address where the current metadata.json is stored. These references are updated at the end of a transaction after the new metadata.json is created enabling Atomicity guarantees.
A catalog directs queries to the right metadata.json and updates that list when writes are complete.
If you enjoyed this post, give it a like and a share! Also check out https://t.co/EfSOHlh2PV for a lot more Apache Iceberg education resources.
#ApacheIceberg #DataLakehouse #DataEngineering
OPTIMIZING ICEBERG TABLES
One the things that make Iceberg queries fast is that the metadata can be used eliminate files that don’t need scanning from the scan plan. This is great but if the data is not clustered properly or spread out across many small files, you can still see less than ideal performance.
** Compaction **
When you have more manifests and data files than you need, you are doing more file operations and slowing down performance. By rewriting these files so you can collapse the data into fewer larger files you have the opposite effect. This can be done the REWRITE_DATA_FILES or REWRITE_MANIFESTS procedures in Spark or the OPTIMIZE TABlE command in Dremio.
** Clustering **
If I only am searching for agent in the northwest region, it’d be nice if all those reps where in the same few files, this is known as clustering. When rewriting data files with Spark, there is a “sort” parameter you can pass so it can cluster the data as it rewrites the files.
By compacting and clustering you data, the Apache Iceberg metadata becomes even more powerful in skipping data files when executing queries.
Read more in my new article on maintaining Apache Iceberg lakehouses here:
https://t.co/jJ5epcl6ST
#DataLakehouse #ApacheIceberg #DataEngineering
Join us on September 5th at 10am PT for a MinIO x @dremio x @Carahsoft webinar about how modern #datalakes can help government customers solve their modernization initiatives. Register here: https://t.co/Y80uY8zyur
Join us for "An Apache Iceberg Lakehouse Crash Course" an in-depth series designed to provide a comprehensive understanding of Apache Iceberg, taught by Iceberg expert Alex Merced.
https://t.co/4YiSfUTVXN
🎙️ Dive into the minds of data disruptors! 🚀 Join us on the #DataDisruptors podcast as we unravel the strategies and insights shaping the future of data leadership. Tune in for exclusive conversations that redefine the data landscape. Listen now!
🔗 https://t.co/3oLtvDojOQ