The project structure dbt recommends for data modeling is complex and confusing.
At almost every company I've worked at, we've used these 3 layers:
1. raw tables as-is from source
2. tables modeled as facts and dims
3. summary tables
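A minimal sketch of those three layers, using Python's built-in sqlite3 instead of dbt so it runs anywhere. The table and column names (raw_orders, dim_customer, fact_orders, summary_customer_revenue) and the sample rows are made up for illustration, not taken from any real project.

```python
import sqlite3


def build_layers(conn):
    """Build raw -> facts/dims -> summary tables in one SQLite database."""
    cur = conn.cursor()

    # Layer 1: raw table, loaded as-is from the source (hypothetical columns).
    cur.execute(
        "CREATE TABLE raw_orders "
        "(order_id INTEGER, customer_name TEXT, amount REAL, order_date TEXT)"
    )
    cur.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
        [
            (1, "alice", 10.0, "2024-01-01"),
            (2, "bob", 5.0, "2024-01-01"),
            (3, "alice", 7.5, "2024-01-02"),
        ],
    )

    # Layer 2: a dimension with a surrogate key, and a fact table that points at it.
    cur.execute(
        "CREATE TABLE dim_customer "
        "(customer_key INTEGER PRIMARY KEY, customer_name TEXT UNIQUE)"
    )
    cur.execute(
        "INSERT INTO dim_customer (customer_name) "
        "SELECT DISTINCT customer_name FROM raw_orders ORDER BY customer_name"
    )
    cur.execute(
        "CREATE TABLE fact_orders AS "
        "SELECT r.order_id, d.customer_key, r.amount, r.order_date "
        "FROM raw_orders r JOIN dim_customer d ON d.customer_name = r.customer_name"
    )

    # Layer 3: summary table aggregated from the fact and dimension tables.
    cur.execute(
        "CREATE TABLE summary_customer_revenue AS "
        "SELECT d.customer_name, SUM(f.amount) AS total_amount, COUNT(*) AS order_count "
        "FROM fact_orders f JOIN dim_customer d ON d.customer_key = f.customer_key "
        "GROUP BY d.customer_name"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
build_layers(conn)
```

In dbt each of these statements would likely live in its own model file; the point here is only the direction of the dependencies, where each layer reads from the one before it.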
I'm working on upgrading my dbt tutorial
#dbt
If you are a data engineer or looking to break into Data Engineering, Apache Airflow is a must-know.
Check out my post on how to think about orchestration and scheduling with Airflow.
https://t.co/dIq3cLdG9J
#dataengineering #apacheairflow #datapipelines
Stuck preparing for data engineering system design interviews?
Use this article to guide you through the prep.
https://t.co/7Pf2EgN8AQ
#dataengineering #interview #systemdesign
Pipeline code can be generated; set yourself apart by understanding how to design pipelines.
Learn how to design your pipelines based on given inputs and required outputs.
https://t.co/BdVQb37rqU
#dataengineering #datawarehouse #sql #etl #datapipeline
I strongly believe there are entire companies right now under heavy AI psychosis and it's impossible to have rational conversations about it with them. I can't name any specific people because they include personal friends I deeply respect, but I worry about how this plays out.
I lived through the great MTBF vs MTTR (mean-time-between-failure vs. mean-time-to-recovery) reckoning of infrastructure during the transition to cloud and cloud automation. All those arguments are rearing their ugly heads again but now it's... the whole software development industry (maybe the whole world, really).
It's frightening, because the psychosis folks operate under an almost absolute "MTTR is all you need" mentality: "it's fine to ship bugs because the agents will fix them so quickly and at a scale humans can't do!" We learned in infrastructure that MTTR is great but you can't yeet resilient systems entirely.
The main issue is I don't even know how to bring this up to people I know personally, because bringing this topic up leads to immediate dismissals like "no no, it has full test coverage" or "bug reports are going down" or something, which just don't paint the whole picture.
We already learned this lesson once in infrastructure: you can automate yourself into a very resilient catastrophe machine. Systems can appear healthy by local metrics while globally becoming incomprehensible. Bug reports can go down while latent risk explodes. Test coverage can rise while semantic understanding falls. Changes happen so fast that nobody notices the underlying architecture decaying.
I worry.
Your data warehouse bill is high for one reason.
Full table scans. Every query reads everything, whether it needs to or not.
Here are 6 storage patterns that fix this 👇
https://t.co/x6bvhqy97K
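The six patterns are in the linked post and are not listed here. As one illustration, assuming date partitioning is the kind of pattern meant, here is a toy sketch in plain Python that counts how many rows a query touches with and without partition pruning. The row layout and function names are hypothetical.

```python
from collections import defaultdict


def partition_by_date(rows):
    """Group rows by event_date, mimicking date-partitioned storage."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["event_date"]].append(row)
    return dict(partitions)


def full_scan(rows, wanted_date):
    """Read every row, then filter. Returns (matches, rows_scanned)."""
    matches = [r for r in rows if r["event_date"] == wanted_date]
    return matches, len(rows)


def pruned_scan(partitions, wanted_date):
    """Read only the partition for the wanted date. Returns (matches, rows_scanned)."""
    candidate = partitions.get(wanted_date, [])
    return list(candidate), len(candidate)


# Hypothetical data: 30 days with 100 rows per day.
rows = [
    {"event_date": f"2024-01-{day:02d}", "amount": float(day)}
    for day in range(1, 31)
    for _ in range(100)
]
partitions = partition_by_date(rows)

full_matches, full_scanned = full_scan(rows, "2024-01-15")
pruned_matches, pruned_scanned = pruned_scan(partitions, "2024-01-15")
```

Both scans return the same rows, but the pruned one only touches the single partition it needs. A real warehouse does this at the file or block level rather than with Python dicts.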
6. Read the Spark UI: Slow stages, skewed tasks, and spill to disk are all there. If you can't diagnose a hanging job, you can't own one in production.
7. Observability, audit, and lineage: You need to know what ran, when, on what data, and whether it succeeded.
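For the audit part of point 7, a minimal sketch of recording what ran, when, on what data, and whether it succeeded. The in-memory AUDIT_LOG list, the audited_run helper, and the record fields are all hypothetical; a real pipeline would more likely write these records to a table or a logging system.

```python
import time
from contextlib import contextmanager

AUDIT_LOG = []  # hypothetical in-memory sink, used only for illustration


@contextmanager
def audited_run(job_name, input_ref):
    """Record job name, input reference, start/end times, and final status."""
    record = {
        "job": job_name,
        "input": input_ref,
        "started_at": time.time(),
        "status": "running",
    }
    AUDIT_LOG.append(record)
    try:
        yield record
        record["status"] = "success"
    except Exception as exc:
        # Keep the failure in the audit trail, then let it propagate.
        record["status"] = "failed"
        record["error"] = repr(exc)
        raise
    finally:
        record["finished_at"] = time.time()


# Usage: wrap each pipeline step.
with audited_run("load_orders", "s3://example-bucket/orders/2024-01-01/"):
    pass  # the actual work would go here
```

The bucket path above is a placeholder. Lineage needs more than this (which outputs came from which inputs), but a per-run record like this is a reasonable starting point.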
Large datasets are stored as individual files in S3. Too many small files per dataset make reads expensive!
Learn to detect it, compact it, or use table properties during insert.
Read how to here 👇
https://t.co/qT3XwyDgRy
#dataengineering #data #apachespark #apacheiceberg
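A rough sketch of the detect-and-compact idea in plain Python, working on a list of file sizes rather than a real S3 listing. The DataFile type, the thresholds, and the greedy grouping are assumptions for illustration; table formats such as Iceberg ship their own compaction, which this does not reproduce.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    path: str
    size_bytes: int


def find_small_files(files, threshold_bytes):
    """Detect: return the files smaller than the threshold."""
    return [f for f in files if f.size_bytes < threshold_bytes]


def plan_compaction(files, target_bytes):
    """Compact: greedily group files so each group stays within target_bytes.

    A single file larger than the target still gets a group of its own.
    """
    groups, current, current_size = [], [], 0
    for f in sorted(files, key=lambda f: f.size_bytes, reverse=True):
        if current and current_size + f.size_bytes > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size_bytes
    if current:
        groups.append(current)
    return groups


# Hypothetical listing: ten 1 MB files plus one healthy 128 MB file.
files = [DataFile(f"part-{i:05d}.parquet", 1 * MB) for i in range(10)]
files.append(DataFile("part-big.parquet", 128 * MB))

small = find_small_files(files, threshold_bytes=32 * MB)
plan = plan_compaction(small, target_bytes=4 * MB)
```

Each group in the plan would be rewritten as one larger file. The 4 MB target is tiny on purpose so the example stays readable; real targets are usually in the hundreds of megabytes.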
PSA: Understand the concepts and read the docs before using LLMs
Claude sent me on a wild goose chase: hallucinations, a complex setup that breaks stuff, etc.
Wasted a lot of time, only to realize the tool (Quarto) I work with already does what I needed
Too many small files in your data lake impact performance.
Detect it with Spark UI
1. Go to the stages tab, see the event timeline.
2. Many small tasks (1 task = 1 green chunk) indicate a many-small-files (or partitions) problem.
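The same check can be roughed out in code if you have per-task metrics for a stage. This sketch assumes each task is a dict shaped like taskMetrics -> inputMetrics -> bytesRead, which resembles what Spark's monitoring REST API returns for a stage's task list; verify the field names against your Spark version. The function names and thresholds are hypothetical.

```python
from statistics import median

MB = 1024 * 1024


def summarize_task_input(tasks, small_bytes=8 * MB):
    """Summarize how much input each task in a stage read."""
    sizes = [
        t.get("taskMetrics", {}).get("inputMetrics", {}).get("bytesRead", 0)
        for t in tasks
    ]
    if not sizes:
        return {"task_count": 0, "median_bytes": 0, "small_fraction": 0.0}
    small = sum(1 for s in sizes if s < small_bytes)
    return {
        "task_count": len(sizes),
        "median_bytes": median(sizes),
        "small_fraction": small / len(sizes),
    }


def looks_like_small_files(summary, min_tasks=200, min_small_fraction=0.8):
    """Flag a stage with many tasks where most of them read very little data."""
    return (
        summary["task_count"] >= min_tasks
        and summary["small_fraction"] >= min_small_fraction
    )


# Hypothetical stage: 500 tasks that each read about 1 MB.
tasks = [{"taskMetrics": {"inputMetrics": {"bytesRead": 1 * MB}}} for _ in range(500)]
summary = summarize_task_input(tasks)
flagged = looks_like_small_files(summary)
```

This is the numeric version of eyeballing lots of tiny green chunks in the event timeline: many tasks, each doing very little reading.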
Fix coming tomorrow
#dataengineering