Apache Iceberg Data Lakehouse Tips

@IcebergDataLake

Unofficial account tweeting content on working with Apache Iceberg Data Lakehouses

Joined May 2023

62 Following

191 Followers

149 Posts

Pinned Tweet

Apache Iceberg Data Lakehouse Tips @IcebergDataLake

about 2 years ago

Experience how easy it is to take data from your source data systems, ingest them into Apache Iceberg and serve a BI dashboard from the confines of your laptop with these tutorials. #DataLakehouse #DataLake #DataEngineering #ApacheIceberg

Apache Iceberg Data Lakehouse Tips @IcebergDataLake

over 1 year ago

Know someone learning data engineering? Share this with them:

Hands-on with Apache Iceberg on Your Laptop: Deep Dive with Apache Spark, Nessie, Minio, Dremio… https://t.co/QltefURtV3

#DataEngineering

Apache Iceberg Data Lakehouse Tips @IcebergDataLake

over 1 year ago

ICEBERG METADATA TABLES

This article will walk you through a hands-on exercise to get familiar with the Iceberg metadata tables.

Read here: https://t.co/32q9KUqKhh

#DataEngineering #ApacheIceberg #DataLakehouse

IcebergDataLake retweeted

Alex Merced | Open Data Lakehouse Advocate

@AMdatalakehouse

almost 2 years ago

APACHE ICEBERG MIGRATION GUIDE https://t.co/Or8yRMuqHn

IcebergDataLake retweeted

Alex Merced | Open Data Lakehouse Advocate

@AMdatalakehouse

almost 2 years ago

DREMIO.ICEBERG.DBT.NESSIE.MINIO.POSTGRES.MONGODB If you want to try a deep end-to-end tutorial that will get you hands-on with a variety of popular data tools, try this one out. https://t.co/jsZFBLMJ97 #DataLakehouse #DataEngineering #ApacheIceberg #Dremio #dbt #DataScience #DataAnalytics

IcebergDataLake retweeted

Alex Merced | Open Data Lakehouse Advocate

@AMdatalakehouse

almost 2 years ago

RECENT DATA ARCHITECTURE/ENGINEERING/ANALYTICS CONTENT

— Apache Iceberg —

> What is Data Lakehouse Table Format?
https://t.co/Er7POv8eR4

> Comparing Iceberg to Other Lakehouse Solutions
https://t.co/b6azd3f7y2

> Iceberg Migration Guide
https://t.co/zjaiSZBWuP

> Hands-on with Managed Polaris Catalog
https://t.co/wf9v1QkE43

> Hands-on with Self-Managed Polaris
https://t.co/DpXVIKr9oQ

— Hybrid Lakehouse —

> 3 Dremio Use Cases for On-Prem Data Lakes
https://t.co/ys6H1nJ5F4

> Hybrid Lakehouse Solution: NetApp
https://t.co/hDXCtLy5Ac

> Hybrid Lakehouse Solution: Minio
https://t.co/VJzxKYDXWV

> Hybrid Lakehouse Solution: Vast Data
https://t.co/jNEwtQzELj

> Hybrid Lakehouse Solution: Pure Storage
https://t.co/lm4EIiZjGa

— Unified Analytics —

> Analysts Guide to JDBC/ODBC, REST, and Arrow Flight
https://t.co/Ww1ZN6tZ97

> Unified Lakehouse
https://t.co/ey85YBC45K

#DataEngineering #DataLakehouse #DataScience #DataAnalytics #DataArchitecture

IcebergDataLake retweeted

Alex Merced | Open Data Lakehouse Advocate

@AMdatalakehouse

almost 2 years ago

HOW ICEBERG CATALOGS WORK

An Iceberg table has two parts: the data, stored across several Parquet files, and the metadata files that give engines the context to treat that data as a single table.

The metadata entry point is a file called metadata.json, which tracks the table's schemas, partition schemes, and snapshots. Every time the table changes, a new metadata.json is created.

So when there are dozens or hundreds of these metadata.json files, how does an engine like Dremio, Snowflake, or Apache Spark know which one to read to query the table accurately? This is where a catalog such as Nessie or Polaris comes in.

A catalog acts like a traffic controller, maintaining a list of tables along with the file location of each table's current metadata.json. That reference is updated at the end of a transaction, after the new metadata.json has been written, which is what enables atomicity guarantees.

A catalog directs queries to the right metadata.json and updates its list when writes are complete.

If you enjoyed this post, give it a like and a share! Also check out https://t.co/EfSOHlh2PV for a lot more Apache Iceberg education resources.

#ApacheIceberg #DataLakehouse #DataEngineering
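To make the "traffic controller" idea concrete, here is a minimal sketch in plain Python. It is not how Nessie, Polaris, or any real catalog is implemented: `InMemoryCatalog`, `CommitConflict`, and the `s3://warehouse/...` paths are all hypothetical names made up for illustration. It assumes the commit works as a compare-and-swap on the pointer to the current metadata.json, so a writer that started from a stale pointer is rejected instead of silently overwriting another writer's change.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table pointer moved since the writer last read it."""


class InMemoryCatalog:
    """Toy catalog: maps a table name to the location of its current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def register(self, table, metadata_location):
        with self._lock:
            self._pointers[table] = metadata_location

    def current_metadata(self, table):
        # Readers ask the catalog which metadata.json is current.
        with self._lock:
            return self._pointers[table]

    def commit(self, table, expected, new_location):
        # Compare-and-swap: move the pointer only if nobody else committed first.
        with self._lock:
            if self._pointers[table] != expected:
                raise CommitConflict(f"{table} is now at {self._pointers[table]}")
            self._pointers[table] = new_location


catalog = InMemoryCatalog()
catalog.register("sales", "s3://warehouse/sales/metadata/v1.metadata.json")

# Writer A and writer B both start from v1.
base = catalog.current_metadata("sales")

# Writer A has written its new metadata file, so it swaps the pointer.
catalog.commit("sales", base, "s3://warehouse/sales/metadata/v2.metadata.json")

# Writer B's commit is rejected because the pointer no longer matches v1.
try:
    catalog.commit("sales", base, "s3://warehouse/sales/metadata/v2-b.metadata.json")
    conflict = False
except CommitConflict:
    conflict = True

print(catalog.current_metadata("sales"))  # s3://warehouse/sales/metadata/v2.metadata.json
print("writer B conflicted:", conflict)   # writer B conflicted: True
```

In this sketch the losing writer would have to re-read the current pointer, rebuild its metadata on top of it, and try again; readers only ever see the old pointer or the new one, never a half-finished change.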

IcebergDataLake retweeted

Alex Merced | Open Data Lakehouse Advocate

@AMdatalakehouse

almost 2 years ago

OPTIMIZING ICEBERG TABLES

One of the things that makes Iceberg queries fast is that the metadata can be used to eliminate files that don’t need scanning from the scan plan. This is great, but if the data is not clustered properly or is spread across many small files, you can still see less-than-ideal performance.

** Compaction **

When you have more manifests and data files than you need, you do more file operations, which slows performance. Rewriting them into fewer, larger files has the opposite effect. This can be done with the REWRITE_DATA_FILES or REWRITE_MANIFESTS procedures in Spark, or the OPTIMIZE TABLE command in Dremio.

** Clustering **

If I am only searching for agents in the northwest region, it’d be nice if all of those reps were in the same few files. This is known as clustering. When rewriting data files with Spark, you can set the strategy to “sort” and pass a sort order so the data is clustered as the files are rewritten.

By compacting and clustering your data, the Apache Iceberg metadata becomes even more powerful at skipping data files when executing queries.

Read more in my new article on maintaining Apache Iceberg lakehouses here: https://t.co/jJ5epcl6ST

#DataLakehouse #ApacheIceberg #DataEngineering
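The effect of clustering on file skipping can be illustrated with a small simulation. This is a sketch in plain Python, not Iceberg's actual planner: `DataFile`, `write_files`, and `files_to_scan` are hypothetical names, and the only assumption is that each file exposes a minimum and maximum value for the `region` column that a reader can check before opening the file.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    name: str
    regions: list  # the "region" value of every row in the file

    @property
    def lower(self):
        return min(self.regions)

    @property
    def upper(self):
        return max(self.regions)


def write_files(rows, rows_per_file):
    """Split rows into fixed-size files, preserving the order given."""
    return [
        DataFile(f"file-{i // rows_per_file}", rows[i:i + rows_per_file])
        for i in range(0, len(rows), rows_per_file)
    ]


def files_to_scan(files, region):
    """Keep only files whose min/max range could contain the region."""
    return [f for f in files if f.lower <= region <= f.upper]


# 12 rows across four regions, arriving interleaved.
rows = ["northwest", "southeast", "midwest", "northeast"] * 3

unsorted_files = write_files(rows, rows_per_file=3)
clustered_files = write_files(sorted(rows), rows_per_file=3)

print(len(files_to_scan(unsorted_files, "northwest")))   # 4
print(len(files_to_scan(clustered_files, "northwest")))  # 1
```

With interleaved rows, every file's min/max range covers "northwest", so nothing can be skipped. Once the rows are sorted before being written, only one of the four files has to be read.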

IcebergDataLake retweeted

Minio

almost 2 years ago

Join us on September 5th at 10am PT for a MinIO x @dremio x @Carahsoft webinar about how modern #datalakes can help government customers solve their modernization initiatives. Register here: https://t.co/Y80uY8zyur

IcebergDataLake retweeted

dremio

almost 2 years ago

Join us for "An Apache Iceberg Lakehouse Crash Course," an in-depth series designed to provide a comprehensive understanding of Apache Iceberg, taught by Iceberg expert Alex Merced. https://t.co/4YiSfUTVXN

Apache Iceberg Data Lakehouse Tips @IcebergDataLake

almost 2 years ago

Basics of Lakehouse Engineering - Apache Iceberg, Nessie (2 hour Course) https://t.co/3JkRfI9g8V #DataEngineering #Nessie #Dremio #ApacheIceberg

IcebergDataLake retweeted

Alex Merced | Open Data Lakehouse Advocate

@AMdatalakehouse

almost 2 years ago

Do you use an open table format? If so, how has your experience been? Vote, reply, and share! #ApacheIcberg #DeltaLake #ApacheHudi #DataEngineering #DataLakehouse

IcebergDataLake retweeted

Alex Merced | Open Data Lakehouse Advocate

@AMdatalakehouse

about 2 years ago

Open Tables (Apache Iceberg) + Open (Nessie, Polaris, Gravitino) Catalogs = No Vendor Lock-in Lakehouses

Read More: https://t.co/q67i8pHbK2

#DataEngineering #DataLakehouse #DataLake

@dremio @SnowflakeDB @ApacheIceberg

IcebergDataLake retweeted

Alex Merced | Open Data Lakehouse Advocate

@AMdatalakehouse

about 2 years ago

NEW MEDIA ON OPEN SOURCE APACHE ICEBERG CATALOGS

- New Episode of "Datanation" which you can find on iTunes and Spotify

- Substack: https://t.co/9Mq7j0VucC

#ApacheIceberg #DataLakehouse #Dremio #Snowflake #Databricks #DataEngineering

IcebergDataLake retweeted

Alex Merced | Open Data Lakehouse Advocate

@AMdatalakehouse

about 2 years ago

DATA PROFESSIONAL FOLLOW TRAIN

- reply to this tweet, I will follow you
- follow me and everyone else who replies
- retweet to maximize reach of the train

#DataEngineering #DataAnalytics #DataScience #BigData #DataLakehouse #DataLake

Apache Iceberg Data Lakehouse Tips @IcebergDataLake

about 2 years ago

@AMdatalakehouse Follow me

IcebergDataLake retweeted

dremio

about 2 years ago

🎙️ Dive into the minds of data disruptors! 🚀 Join us on the #DataDisruptors podcast as we unravel the strategies and insights shaping the future of data leadership. Tune in for exclusive conversations that redefine the data landscape. Listen now!

🔗 https://t.co/3oLtvDojOQ

IcebergDataLake retweeted

Antony Savvas @AntonySavvas

about 2 years ago

#Dremio paddles harder in the growing #lakehouse market with product improvements https://t.co/3vHq19YdfG via @BlocksandFiles #AI #OpenSource

IcebergDataLake retweeted

Alex Merced | Open Data Lakehouse Advocate

@AMdatalakehouse

about 2 years ago

WHAT IS A DATA LAKEHOUSE? #DataEngineering #DataAnalytics #BigData #ApacheIceberg

IcebergDataLake retweeted

Alex Merced | Open Data Lakehouse Advocate

@AMdatalakehouse

about 2 years ago

8-BIT GAME TRAILER: Quest for the Data Lakehouse #DataLakehouse #DataEngineering #ApacheIceberg #Dremio

IcebergDataLake retweeted

Alex Merced | Open Data Lakehouse Advocate

@AMdatalakehouse

about 2 years ago

FREE YOUR DATA AND GET A FREE BOOK Watch to learn more #DataEngineering #DataAnalytics #DataLakehouse #DataLake #BigData
