luminousmen @luminousmen - Twitter Profile

Pinned Tweet

over 2 years ago

Hello wonderful people! I’m thrilled to announce that my new book, “Grokking Concurrency,” has officially hit the shelves! You can find it here: https://t.co/TwSqRbNfJF I could really use your support in spreading the word!

luminousmen @luminousmen

about 1 month ago

The answer is almost always in the UI. Every item came from a real production screwup. Full checklist: https://t.co/dwlsrm68v8

luminousmen @luminousmen

about 1 month ago

Your Spark job worked fine in dev. In prod it takes 4 hours. Welcome to the club.

luminousmen @luminousmen

about 1 month ago

The rest of it:
- Joins that explode (broadcast, skew, bucket)
- JDBC reads that are silently single-threaded
- Spark 4.x defaults you probably haven't turned on yet
- What to actually look at in the UI before you guess

luminousmen @luminousmen

about 1 month ago

A data engineer on Reddit got drunk and wrote down everything he learned in 10 years. The post lives on. I agree with almost all of it. Preserved it here: https://t.co/xo5RgUbu0g

luminousmen @luminousmen

about 2 months ago

"How many unique users did X AND Y?" - simple question, brutal at scale. Theta sketches solve it. But the gap between the algorithm and not getting burned in production is huge. Wrote the guide I wish existed: https://t.co/Ye4LTkOuGa #dataengineering
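
To make the "X AND Y" question concrete, here is a minimal sketch of the idea behind Theta-style intersection, in plain Python. It is a simplification, not the linked guide's method and not the Apache DataSketches API: it keeps every hash below a fixed threshold `theta`, whereas real Theta sketches cap the number of retained hashes and lower theta adaptively. The helper names (`_unit_hash`, `build_sketch`, `estimate_intersection`) are hypothetical.

```python
import hashlib


def _unit_hash(item):
    """Map an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_sketch(items, theta=0.1):
    """Keep only the hashes below theta.

    Assumption: theta is fixed up front, so the sketch is a uniform
    theta-fraction sample of the distinct items. Duplicates hash to the
    same value and collapse inside the set.
    """
    return {h for h in map(_unit_hash, items) if h < theta}


def estimate_intersection(sketch_a, sketch_b, theta):
    """Estimate |A AND B| from two sketches built with the same theta.

    A hash survives in both sketches only if the item is in both sets,
    so the retained overlap is scaled back up by 1 / theta.
    """
    return len(sketch_a & sketch_b) / theta
```

Both sketches must use the same hash function and the same `theta`. The estimate gets noisier as the true intersection shrinks relative to the input sets, which is one of the production gaps the tweet alludes to.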

luminousmen @luminousmen

2 months ago

The full sketch family:
- Theta/HLL → COUNT DISTINCT
- KLL/REQ → percentiles & distributions
- Frequent Items → heavy hitters
- Reservoir Sampling → representative samples

Each has sharp trade-offs. I break them all down in the full post: https://t.co/QzAH5jjuCT
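
As an illustration of the last item in that list, here is a minimal sketch of reservoir sampling (the classic "Algorithm R"), assuming a single-pass stream of unknown length. The function name and the optional `rng` parameter are illustrative choices, not taken from the linked post.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Only k items are held in memory, no matter how long the stream is.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Calling `reservoir_sample(events, 1000)` on any iterable `events` keeps memory bounded by 1000 items. The trade-off is that the sample is representative of volume, so rare values can be missed entirely.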

luminousmen @luminousmen

2 months ago

Your COUNT DISTINCT has been running for 3 hours. Your cluster is on fire. Nobody tells you early on: exact answers to simple questions can be impossibly expensive at scale. Data sketches fix this. Here's how 👇 #dataengineering

luminousmen @luminousmen

2 months ago

The real breakthrough: sketches are mergeable. Build them in parallel on hundreds of machines. Merge in milliseconds. Handle late-arriving data by merging it in. Pre-compute at ingestion and store a 2KB "unique users" column instead of 2GB.
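
Mergeability can be shown with a small K-minimum-values (KMV) sketch, a simple relative of Theta sketches. This is a minimal sketch in plain Python under stated assumptions: the class `KMVSketch`, its default `k`, and the SHA-1 based hash are illustrative choices, not a production library's API.

```python
import hashlib
import heapq


class KMVSketch:
    """Distinct-count sketch that keeps the k smallest hash values seen."""

    def __init__(self, k=1024):
        self.k = k
        self.hashes = set()
        self.theta = 1.0  # hashes at or above theta can no longer be retained

    @staticmethod
    def _hash(item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64  # float in [0, 1)

    def add(self, item):
        h = self._hash(item)
        if h >= self.theta or h in self.hashes:
            return
        self.hashes.add(h)
        if len(self.hashes) > self.k:
            # Drop the largest hash and tighten the threshold.
            self.hashes.remove(max(self.hashes))
            self.theta = max(self.hashes)

    def merge(self, other):
        """Return a new sketch equivalent to one built over both inputs."""
        merged = KMVSketch(min(self.k, other.k))
        merged.hashes = set(heapq.nsmallest(merged.k, self.hashes | other.hashes))
        if len(merged.hashes) == merged.k:
            merged.theta = max(merged.hashes)
        return merged

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # nothing was dropped, count is exact
        return (self.k - 1) / max(self.hashes)
```

Merging two partition-level sketches with the same `k` yields the same retained hashes as a single sketch built over all the data, which is why partial sketches can be built in parallel, stored per partition, and combined later, including for late-arriving data.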

luminousmen @luminousmen

2 months ago

Built it in Vue + TypeScript. The irony of a data engineer building a frontend game is not lost on me. More data eng stuff (the serious kind): https://t.co/KJgcxCGU3w

luminousmen @luminousmen

2 months ago

Built a tycoon game about data engineering. You start at the ground level. Goal: reach AGI before you go bankrupt. Spoiler: the cash flow problem is just as real as in actual startups. Free, browser, no signup: https://t.co/hnJXcxuPHB

luminousmen @luminousmen

2 months ago

My favorite design decision: if you hire AI researchers, Meta shows up after 60 seconds and takes half of them. I didn't even need to make that part up.

luminousmen @luminousmen

3 months ago

Made this for engineers who aren't sure if they're using AI correctly or just laundering their thinking. Full rant on my Substack: https://t.co/RUVILA3N25

luminousmen @luminousmen

3 months ago

Adding this to my bookmarks bar, just in case: https://t.co/XxMC773FZb

luminousmen @luminousmen

5 months ago

Then it breaks. And 9 out of 10 times, it breaks because of memory. Full blog post: https://t.co/YfRS9oYKjr

luminousmen @luminousmen

5 months ago

Spark is powerful. It scales. It's fast out of the box. And yeah, the defaults are surprisingly decent - until your dataset grows, your joins get messy, or you start mixing Scala with PySpark and Arrow and some eager ML engineer starts throwing 200MB Pandas UDFs at the cluster.
