⏰ Final Reminder – Delta Lake Webinar Tomorrow!
Wondering if data engineering design patterns can unlock new insights into Delta Lake? Or how Delta Lake can become a key part of your streaming data architecture?
Join @newfront (@bufbuild) and @waitingforcode as they tackle these questions head-on!
🗓️ Oct 14 @ 9AM PT
🎥 Live on LinkedIn, YouTube & X
📍 Reserve your spot today: https://t.co/ERn2SbT0qZ
#opensource #oss #deltalake #streaming #dataengineering
Why don’t Iceberg or Delta Lake have secondary indexes?
Because analytics workloads and OLTP workloads optimize for opposite I/O patterns.
See my dive into data layout, pruning, and what “indexing” really means in open table formats: https://t.co/beurjdS8u4
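A minimal sketch of what "pruning" can look like without a secondary index, assuming each data file carries min/max statistics for one column. `FileStats`, `prune_files`, and the file names are hypothetical and do not come from any table format's API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file column statistics kept in table metadata."""
    path: str
    min_value: int
    max_value: int


def prune_files(files, lower, upper):
    """Keep only files whose [min, max] range can overlap [lower, upper]."""
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


# Illustrative files, laid out so each covers a distinct value range.
files = [
    FileStats("part-000.parquet", 1, 100),
    FileStats("part-001.parquet", 101, 200),
    FileStats("part-002.parquet", 201, 300),
]

# A range filter of 150..180 only needs to scan one of the three files.
candidates = prune_files(files, 150, 180)
print([f.path for f in candidates])  # ['part-001.parquet']
```

The pruning only pays off when the data layout keeps value ranges narrow per file; if every file spanned 1 to 300, nothing would be skipped.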
Are you wondering if general concepts like data engineering design patterns can help you learn about #DeltaLake? Or, if it's possible to leverage Delta Lake within your streaming data architecture?
In this webinar, Scott Haines and Bartosz Konieczny will answer these two questions. Scott, who gained streaming expertise at Yahoo, Twilio, and Nike, will share with you best practices for leveraging Delta Lake as a component of your streaming architecture. ✅
Bartosz, who recently published Data Engineering Design Patterns, will reverse-engineer a few of these design patterns to explain which Delta Lake features make everything tick.
🗓️ Tuesday, Oct 14
🕝 9AM PT
Don't miss it! 🔗 Register today: https://t.co/ERn2SbT0qZ
#opensource #oss #dataarchitecture #dataengineering @waitingforcode
Releasing Soon! Pre-order now https://t.co/qAV65je7Du
Data Engineering Design Patterns
By Bartosz Konieczny (@waitingforcode), with @OReillyMedia
It covers various aspects of data engineering, including data ingestion, data quality, idempotency, and more. #dataengineering
If you want to understand the consistency models of the table formats mentioned in the paper, I've written about them extensively and built formal models.
* https://t.co/JE0oPUBtAt
* https://t.co/1E1F9WaXJz
* https://t.co/qAQF6HUSNJ
* https://t.co/nxZljyLHuw
@AdiPolak I'm not that new anymore, but "Stream Processing with Apache Flink" was my first learning resource; well structured, covering IMO the most important parts to start. Now, I'm deeply appreciating Flink Forward technical deep dives to go further 🤩
Join @newfront and @waitingforcode and learn all about streaming Delta Lake tables with Apache Spark Structured Streaming! 🦀
🗓 March 21st
🕝 9:00AM PT / 12:00PM ET
💻 Join this webinar via LinkedIn, YouTube, or Zoom!
Learn more: https://t.co/FYjB9Uy2Fz
#deltalake #streaming
I have been busy the last few months writing a book for O'Reilly about how to build ML systems (batch, real-time, and LLMs), distilling much of what I have learnt from both working with customers as well as students. Why could the book interest you?
* Data Scientists - transition from training models to building ML systems
* ML Engineers - learn how to build batch, real-time, and LLM systems in modular parts that you compose into an ML system
* Data Engineers - learn about the data transformation taxonomy for ML and how badly structured DAGs prevent reuse in ML systems
* Architects - divide et impera - learn how modularity helps you build faster and better ML systems.
Early access to the first chapter (52 pages) is available here:
https://t.co/px4BmxCnUV
I don't want to start a flame war here, but IMO it is a mistake to jump straight to distributed databases (and 90% of the content below is distributed databases) without first learning fundamentals on single node databases.
Here are my 10 things to understand about databases:
1. Relational model. Primary keys, foreign keys, normal form.
2. SQL language. Ideally with advanced SQL (CTE, analytics)
3. ACID and how transactions work
4. Write-ahead log (or binlog) and how it is used. Especially around restarts, recovery and replication.
5. Buffer cache, disk storage layout and how they interact
6. What happens when databases start? When they shut down?
7. Indexes, cluster tables, partitions and other types of database structures.
8. Query parsing, planning and optimizing.
9. MVCC and how to deal with its quirks in your DB of choice
10. Security - authentication, authorization, encryption on wire and at rest.
11. (Bonus) Investigating performance issues and making sense of benchmarks.
That's an entire world of its own, and it's stuff that 99% of developers use daily. You can be a deep expert without ever looking at distributed databases. And this also serves as a strong foundation once you do.
And if you use Postgres, I found this free book super helpful in making sense of things: https://t.co/cPNk493KU5
The early release of Delta Lake: The Definitive Guide is here! 🎉 The latest edition adds Chapter 12: Performance Tuning.
Download here ➡️ https://t.co/rXMjhs4dyV
Authors @dennylee, Prashanth Babu, Tristen Wentling, & @newfront #opensource #deltalake #oss
Data Engineering patterns on the cloud: How to solve common data engineering problems with cloud services? https://t.co/s70rgr8RRD by Bartosz Konieczny is the featured book on the Leanpub homepage! https://t.co/7B8N80e7nt @waitingforcode #CloudComputing #AmazonWebServices
Last week I spent some time understanding the #PySpark applyInPandasWithState. This week I'm refactoring the code, hoping to still understand it 2 months later ;) 👉 https://t.co/qja12phovZ
In the previous release, #PySpark got an interesting streaming feature: arbitrary stateful processing. Its API differs from the Scala version but is better adapted to the Python world.
More 👉 https://t.co/KfzgtIby32
[ANNOUNCEMENT] Congrats to the Apache Spark community and all the contributors! The Apache Spark 3.5.0 release is here. Try it out! https://t.co/o8YcLnSysZ
It's not a rebranding but more a regrouping 😉 All my additional #dataengineering content is now available from there https://t.co/5VjHc37ZsL (planning to add some stream processing materials soon)
If Delta Lake implemented only commits, I could stop exploring this transactional part after the previous article. But like an RDBMS, #DeltaLake implements other ACID-related concepts, such as isolation levels 👉 https://t.co/OilPib06dK