Joseph Machado @startdataeng - Twitter Profile

Pinned Tweet

about 6 years ago

Exercise project for anyone starting in data engineering https://t.co/F3GegLjIrH
#dataengineering #bigdata #ETL #ApacheAirflow #AWS #ApacheSpark https://t.co/M3B3fSK7mL

14

475

91

422

0

Joseph Machado @startdataeng

1 day ago

The project structure dbt recommends for data modeling is complex and confusing.

At almost every company I've worked at, we've used these 3 layers:
1. raw tables as-is from source
2. tables modeled as facts and dims
3. summary tables

I'm working on upgrading my dbt tutorial.
#dbt https://t.co/VCLz0L4tTa

1

3

0

2

167
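A minimal sketch of the three layers from the tweet above, using Python's built-in `sqlite3` instead of dbt. The table names (`raw_orders`, `dim_customer`, `fact_orders`, `summary_customer_revenue`), the sample rows, and the `build_layers` helper are all hypothetical and only illustrate the idea; they are not taken from the author's tutorial.

```python
import sqlite3


def build_layers(conn):
    """Build raw -> fact/dim -> summary tables and return the summary rows."""
    cur = conn.cursor()

    # Layer 1: raw table, loaded as-is from a (hypothetical) source system.
    cur.execute(
        "CREATE TABLE raw_orders "
        "(order_id INTEGER, customer_name TEXT, amount REAL, order_date TEXT)"
    )
    cur.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
        [
            (1, "alice", 20.0, "2024-01-01"),
            (2, "bob", 35.5, "2024-01-01"),
            (3, "alice", 10.0, "2024-01-02"),
        ],
    )

    # Layer 2a: dimension table, one row per customer with a surrogate key.
    cur.execute(
        "CREATE TABLE dim_customer "
        "(customer_key INTEGER PRIMARY KEY, customer_name TEXT UNIQUE)"
    )
    cur.execute(
        "INSERT INTO dim_customer (customer_name) "
        "SELECT DISTINCT customer_name FROM raw_orders ORDER BY customer_name"
    )

    # Layer 2b: fact table, one row per order, pointing at the dimension key.
    cur.execute(
        "CREATE TABLE fact_orders AS "
        "SELECT r.order_id, d.customer_key, r.amount, r.order_date "
        "FROM raw_orders r "
        "JOIN dim_customer d ON r.customer_name = d.customer_name"
    )

    # Layer 3: summary table, pre-aggregated for reporting.
    cur.execute(
        "CREATE TABLE summary_customer_revenue AS "
        "SELECT d.customer_name, COUNT(*) AS order_count, "
        "SUM(f.amount) AS total_amount "
        "FROM fact_orders f "
        "JOIN dim_customer d ON f.customer_key = d.customer_key "
        "GROUP BY d.customer_name"
    )

    return cur.execute(
        "SELECT customer_name, order_count, total_amount "
        "FROM summary_customer_revenue ORDER BY customer_name"
    ).fetchall()


summary = build_layers(sqlite3.connect(":memory:"))
```

In a dbt project each of these statements would typically live in its own model file; the point here is only the direction of dependencies: summary tables read from facts and dims, which read from raw.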

Joseph Machado @startdataeng

1 day ago

If you are a data engineer or looking to break into Data Engineering, Apache Airflow is a must-know.

Check out my post on how to think about orchestration and scheduling with Airflow.

https://t.co/dIq3cLdG9J

#dataengineering #apacheairflow #datapipelines https://t.co/9go74R10ec

0

6

2

10

666

Joseph Machado @startdataeng

2 days ago

Adding short video explanations to my posts, let's see how that goes https://t.co/4OX2X8tlMZ #dataengineering

0

2

0

133


Joseph Machado @startdataeng

9 days ago

Stuck preparing for data engineering system design interviews? Use this article to guide you through the prep. https://t.co/7Pf2EgN8AQ #dataengineering #interview #systemdesign

0

2

0

4

205

Joseph Machado @startdataeng

14 days ago

Everyone says you need to know Python for data work. But what does that mean exactly? In this post I cover the key Python concepts every DE needs to know ⬇️ https://t.co/2e23ybfu81 #dataengineering #python #datapipeline #softwareengineering #dataanalysis

1

5

1

4

343

Joseph Machado @startdataeng

17 days ago

Pipeline code can be generated; set yourself apart by understanding how to design pipelines. Learn how to design your pipelines based on given inputs and required outputs. https://t.co/BdVQb37rqU #dataengineering #datawarehouse #sql #etl #datapipeline

0

17

1

12

591

startdataeng retweeted

Mitchell Hashimoto

@mitchellh

19 days ago

I strongly believe there are entire companies right now under heavy AI psychosis and it's impossible to have rational conversations about it with them. I can't name any specific people because they include personal friends I deeply respect, but I worry about how this plays out.

I lived through the great MTBF vs MTTR (mean-time-between-failure vs. mean-time-to-recovery) reckoning of infrastructure during the transition to cloud and cloud automation. All those arguments are rearing their ugly heads again but now it's... the whole software development industry (maybe the whole world, really).

It's frightening, because the psychosis folks operate under an almost absolute "MTTR is all you need" mentality: "it's fine to ship bugs because the agents will fix them so quickly and at a scale humans can't do!" We learned in infrastructure that MTTR is great but you can't yeet resilient systems entirely.

The main issue is I don't even know how to bring this up to people I know personally, because bringing this topic up leads to immediate dismissals like "no no, it has full test coverage" or "bug reports are going down" or something, which just don't paint the whole picture.

We already learned this lesson once in infrastructure: you can automate yourself into a very resilient catastrophe machine. Systems can appear healthy by local metrics while globally becoming incomprehensible. Bug reports can go down while latent risk explodes. Test coverage can rise while semantic understanding falls. Changes happen so fast that nobody notices the underlying architecture decaying.

I worry.

513

15K

2K

5K

2M

Joseph Machado @startdataeng

20 days ago

@sspaeti Nice, how'd you get image view inside vim? I use render markdown but no image protocols. Any recommendations?

1

0

144

Joseph Machado @startdataeng

20 days ago

Your data warehouse bill is high for one reason. Full table scans. Every query reads everything, whether it needs to or not. Here are 6 storage patterns that fix this 👇 https://t.co/x6bvhqy97K

0

3

0

10

364

Joseph Machado @startdataeng

21 days ago

The API is the easy part. Production is where the real learning happens.

0

1

0

78

Joseph Machado @startdataeng

21 days ago

Spark API is easy to learn.

But to debug a hanging job, you need to know Spark internals.

Here are 7 topics to know for production Spark 👇 https://t.co/xC9itpJBSP

1

9

1

14

391

Joseph Machado @startdataeng

21 days ago

6. Read the Spark UI: Slow stages, skewed tasks, and spill to disk are all there. If you can't diagnose a hanging job, you can't own one in production.
7. Observability, audit, and lineage: You need to know what ran, when, on what data, and whether it succeeded.

1

0

95

Joseph Machado @startdataeng

21 days ago

@VicVijayakumar Thank you

0

140

Joseph Machado @startdataeng

22 days ago

Large datasets are stored as individual files in S3. Too many small files per dataset make reads expensive! Learn to detect it, compact it, or use table properties during insert. Read how here 👇 https://t.co/qT3XwyDgRy #dataengineering #data #apachespark #apacheiceberg

0

4

2

3

249
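A rough sketch of the "detect it" step from the tweet above, assuming the dataset's files are reachable as a local directory (a stand-in for an S3 prefix). The `small_file_report` helper and the 32 MiB threshold are illustrative assumptions, not values from the linked post; pick a threshold that matches your engine's target file size.

```python
from pathlib import Path


def small_file_report(dataset_dir, small_file_bytes=32 * 1024 * 1024):
    """Summarize how many files under dataset_dir fall below a size threshold.

    small_file_bytes is an assumed cutoff for what counts as "small".
    """
    # Collect the size of every regular file under the dataset directory.
    sizes = [p.stat().st_size for p in Path(dataset_dir).rglob("*") if p.is_file()]
    small = [s for s in sizes if s < small_file_bytes]
    return {
        "file_count": len(sizes),
        "small_file_count": len(small),
        # Share of files below the threshold; 0.0 for an empty dataset.
        "small_file_ratio": len(small) / len(sizes) if sizes else 0.0,
        "total_bytes": sum(sizes),
    }
```

A high `small_file_ratio` combined with a low `total_bytes` per file suggests the dataset is a candidate for compaction.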

Joseph Machado @startdataeng

22 days ago

PSA: Understand the concepts and read the docs before using LLMs. Claude sent me on a wild goose chase: hallucinations, complex setup that breaks stuff, etc. Wasted a lot of time, only to realize the tool (Quarto) I work with already does what I needed.

0

3

0

1

405

Joseph Machado @startdataeng

23 days ago

Too many small files in your data lake impact performance.

Detect it with Spark UI

1. Go to the stages tab, see the event timeline.
2. Many small tasks (1 task = 1 green chunk) indicate a many-small-files (or partitions) problem.

Fix coming tomorrow
#dataengineering https://t.co/9G1u10fvSL

0

9

3

5

582
