I’m very thrilled to share that I’ve been selected as a Kafka catalyst in the Kafka community by Confluent for the year 2023-2024 ❤️🎉
Many thanks to @ena_9428 for the amazing swag 🙏
Not able to make the Kafka Current this year, but hope to make the Kafka Summit in 2024
How good is the idea of writing e-books at this point of time? I ask because at least most of the folks I know just use AI to skim everything and read only the summary.
@purosanantonio_ @Wimbledon I don’t think Carlos got lucky but Fritz choked. He was not too brave to go for the shots when he had chances on every point after 5-4 in the tie break
Event Streaming at scale is hard.
Real-time Analytics at scale is hard.
Building and maintaining Lakehouse at scale is hard.
Life as a Data Engineer in the future without knowing above is hard.
Choose your hard 🤷♂️
Dear Flink users and Data Engineers,
I feel the managed Flink from AWS is not really suitable for intensive stream loads with respect to both low memory and low compute nodes offered by AWS for their Flink clusters.
Does anyone feel the same?
#Flink#dataengineering
wtf is in apple silicon? my single threaded code is 3x faster on macbook than on server while burning like half the power.
if apple bothered making a 64-core part they'd take over every datacenter
Data pipelines will put you in the top 1% of the market.
If you could only learn one skill for the next decade, I can't think of anything more critical than learning to move and process data at scale.
I like to tell people I'm a Machine Learning Engineer, but in reality, 90% of the value I produce comes from my ability to move data around consistently and correctly.
In the field, we like to use the term "orchestration" when talking about coordinating workflows that move and process data. At a high level, there are three main steps you need to worry about:
1. Getting the data from its source
2. Processing and cleaning that data
3. Delivering the cleaned data to the right place
You might have also heard about "ETL" (Extract, Transform, Load). That's how most people refer to the process above.
Of course, building a simple ETL system isn't complex; most developers can do it without too much trouble. The problem is designing resilient, scalable, and fault-tolerant systems.
You can't code your way to a production-ready orchestration platform (ask me how I know.) I started with AirFlow and eventually moved to @kestra_io because of its event-driven architecture.
Event-driven means you can kick off a workflow automatically based on different triggers. For instance, when somebody uploads a new file to a folder, an app updates a database table, or there's a new message in a queue.
It's hard to summarize everything you get from Kestra, but here are some of the highlights:
• Kestra is free and open-source
• You install it from a Docker container
• Workflows as Code using YAML <--- this is awesome
• Scales to millions of executions
• It integrates with every cloud platform you've seen
• Language agnostic (but I still like Python the most)
Here is a link to their GitHub repository:
https://t.co/h8mWoYcymb
Here are the three things I recommend:
1 - Take a look at their live demo in their GitHub repo
2 - Build a simple workflow (it will take 5 minutes)
3 - Talk to your boss. Where can you plug this into your company?
I started using Kestra at the height of the pandemic. It's an awesome tool, and I'm proud that they are sponsoring my writing. I hope you find it helpful as well.
@BarclayCard18 Did they shake hands after delpotro withdrew? I remember watching the YouTube video of this one like a decade back but not able to find the full match.
And yeah, at US open in the same year, when Murray won, they were embracing at the net.
The biggest problem of #Java is poor perception. It's technically super-solid, but too often folks discard it based on misconceptions or information outdated years ago.
@Archimusik @Lil_Miss_AM @AustinTunnell Even when an employee wants to quit, they need to follow a fair process. Serve notice period of 3 months. They can’t do it as they wish the next day. I recently had a conversation with my friend who is in US and can understand where this is coming from.
Most of the Data Engineers & Analysts don’t realise their power and the impact they can create organisation-wide despite working in this field for several years.
#DataEngineers#data#dataengineering
To all those aspiring Data Engineers,
Kindly make sure you put enough effort to get the computer basics right - CPU, main memory, caching, multithreading, hyper threading, I/O, etc..
This topic is very commonly overlooked by many DEs unfortunately.
#dataengineering#data
How do I tackle this issue?
Go for multiprocessing.
With multiprocessing, you are guaranteed to launch multiple processes which is then set to run on dedicated CPU cores available on your machine with its own memory thus bypassing the limitations of the GIL per process.
You have a CPU bound task to implement in Python such as computing a sum of squares of 50 million integers.
You want to run on your macOS/linux with multiple cores. You want to smartly do your task by making use of all the available cores.
#Python
🧵
And when each thread attempt to read the Python byte code, they are blocked by the current thread reading the byte code since the Global Interpreter Lock allows only one thread per process to read the Python byte code at a time.
What is a graph database and why should you use it when you’re already doing analytics efficiently with relational databases like PostgreSQL or Snowflake?*
🧵
#graph#graphdatabase
If your analytics is based on a graph over a window period or you want to serve real-time graph analytics, you need to consider databases like memgraph which processes the entire graph in memory.
Choosing the right database has its own trade-offs.