I break PySpark (and recently Databricks) for a living (and then fix it).
Making sense of Spark chaos & Parallelism.
12+ yrs in data, Open to opportunities
Why should we avoid using Interactive Clusters for production workloads?
Interactive Clusters seem convenient.
But convenience can become a risk.
Since Interactive Clusters are long-running and often shared:
• library changes can affect other workloads
• cluster state may persist between runs
• debugging code and production code can coexist
• resource contention becomes common
That’s why most prod ETL pipelines move toward:
→ Job Clusters
→ Serverless Jobs
→ isolated execution environments
Data engineering is not only about making pipelines work.
It’s about making them predictable, reproducible, and stable at scale.
#Databricks #ApacheSpark #DataEngineering #Lakehouse #DatabricksInterviewPrep
Apache Spark 4.1 is out today. 🚀
AI data agents are now common in data engineering. They're also a real risk in production: tool sprawl and the glue code required to run real pipelines create a huge surface area for silent errors. The cost is wasted time and wasted compute on jobs you only notice are broken three hours into a four-hour run.
Three architectural changes in 4.1 shrink that surface area.
1️⃣ Spark Declarative Pipelines (SDP)
2️⃣ Real-Time Mode
3️⃣ Spark Connect + Project Feather
Three architectural changes. One platform shape. Fewer surfaces for the agent to drift on. Less technical debt as you ship.
👉 Get started: https://t.co/dnakBRz8IE
#ApacheSpark #DataEngineering #OSS #AIagents
Can we use Serverless and Classic compute together in Databricks?
For a single task, usually you pick one.
But in a multi-task Databricks Job, yes, each task can use different compute.
Example:
Bronze load takes 2 mins → use Serverless
Heavy transformation takes longer → use Classic job compute for more control
That flexibility is actually useful when designing real pipelines.
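The Bronze/heavy-transformation split above could be sketched as a job definition like this. It is a minimal sketch shaped like a Jobs API payload, not a tested deployment: the job name, notebook paths, node type, and runtime version are all made up for illustration, and it assumes a workspace where a notebook task with no cluster settings runs on Serverless. The `compute_for` helper is hypothetical too; it only shows how the two tasks differ.

```python
# Hypothetical job definition, shaped like a Databricks Jobs API payload.
# Job name, notebook paths, node type and runtime version are assumptions.
job_spec = {
    "name": "bronze_to_silver",
    "job_clusters": [
        {
            "job_cluster_key": "heavy_transform_cluster",
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "Standard_D8ds_v5",
                "num_workers": 8,
            },
        }
    ],
    "tasks": [
        {
            # No cluster settings: assumed to run on Serverless.
            "task_key": "bronze_load",
            "notebook_task": {"notebook_path": "/Pipelines/bronze_load"},
        },
        {
            # Points at the classic job cluster defined above.
            "task_key": "heavy_transform",
            "depends_on": [{"task_key": "bronze_load"}],
            "job_cluster_key": "heavy_transform_cluster",
            "notebook_task": {"notebook_path": "/Pipelines/heavy_transform"},
        },
    ],
}

CLASSIC_FIELDS = ("job_cluster_key", "new_cluster", "existing_cluster_id")


def compute_for(task: dict) -> str:
    """Classify a task by whether it declares any classic compute field."""
    if any(field in task for field in CLASSIC_FIELDS):
        return "classic"
    return "serverless"


compute_by_task = {t["task_key"]: compute_for(t) for t in job_spec["tasks"]}
print(compute_by_task)
```

The point of the sketch is that the compute choice sits on the task, not on the job, so the short load and the long transformation do not have to share a sizing decision.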
#Databricks #DataEngineering #AzureDatabricks #Lakehouse
Serverless is great when I want Databricks to handle the compute side and let me focus more on the pipeline logic.
Classic / job compute still makes sense when I need more control over cluster config, libraries, runtime, sizing, or cost behavior.
Both have their place.
The key is not to treat one as a replacement for everything.
#Databricks #DataEngineering #PySpark #Lakehouse
External Locations in Unity Catalog made more sense to me when I looked at them from the Azure side.
It is not about giving every Databricks user an ADLS key.
You create a Storage Credential using an Azure Managed Identity, map it to an ADLS path through an External Location, and then control access through Unity Catalog permissions.
Azure RBAC controls the identity.
Unity Catalog controls the user access.
That separation is what makes it cleaner.
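To make the split concrete, here is a small sketch that builds the Unity Catalog side as SQL strings. It assumes the Storage Credential already exists and is backed by the managed identity; the location name, storage account, credential name, and group name are all hypothetical. The helper only generates text, so nothing runs against a workspace.

```python
def external_location_sql(location: str, url: str, credential: str,
                          principal: str) -> list[str]:
    """Build the SQL that maps an ADLS path to Unity Catalog permissions."""
    return [
        # Map the ADLS path to an existing storage credential.
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {location} "
        f"URL '{url}' WITH (STORAGE CREDENTIAL {credential})",
        # User access is granted in Unity Catalog, not with storage keys.
        f"GRANT READ FILES ON EXTERNAL LOCATION {location} TO `{principal}`",
    ]


# All names below are illustrative assumptions.
statements = external_location_sql(
    location="raw_landing",
    url="abfss://raw@examplestorage.dfs.core.windows.net/landing",
    credential="adls_managed_identity_cred",
    principal="data_engineers",
)
for statement in statements:
    print(statement)
```

Azure RBAC on the managed identity is deliberately absent here: that part is configured on the Azure side, which is exactly the separation described above.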
#Databricks #UnityCatalog #AzureDatabricks #DataGovernance #Lakehouse
Unity Catalog is often explained as a security layer, but I think that undersells it.
For me, the bigger value is how it brings structure to the whole Databricks environment.
Data organization, permissions, lineage, auditing, storage governance, and cross-workspace consistency all start coming together in one place.
That is when Databricks starts feeling much easier to manage at scale.
#Databricks #UnityCatalog #DataGovernance #Lakehouse
Volumes in Unity Catalog helped me understand that not everything in Databricks has to be a table.
Sometimes you need governed access to files:
CSV, JSON, Images, ML artifacts, Config files, Raw landing data
That’s where Volumes fit nicely.
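Files in a Volume are addressed by path under `/Volumes/<catalog>/<schema>/<volume>/`. A tiny sketch of building such a path, where the catalog, schema, volume, and file names are placeholders I made up and `volume_path` is a hypothetical helper:

```python
from pathlib import PurePosixPath


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a /Volumes/<catalog>/<schema>/<volume>/... file path."""
    return str(PurePosixPath("/Volumes", catalog, schema, volume, *parts))


# Hypothetical names, for illustration only.
config_file = volume_path("main", "landing", "raw_files", "config", "pipeline.json")
print(config_file)  # /Volumes/main/landing/raw_files/config/pipeline.json
```

The same three levels that qualify a table also qualify the file, which is what makes the access governed rather than just a storage path.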
#Databricks #UnityCatalog #DataEngineering #PySpark
External Locations in Unity Catalog are underrated.
They make the connection between cloud storage and Databricks much more controlled.
You are not just pointing to an S3/ADLS/GCS path randomly.
You define who can access which storage location and under what governance boundary.
#Databricks #UnityCatalog #DataLakehouse
One thing I like about Unity Catalog is that access control becomes much easier to reason about.
Instead of managing permissions separately across workspaces, storage paths, and random tables, you can centralize governance at the catalog/schema/table level.
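As a rough illustration, read access to one table comes down to three grants, one per level. The table and group names here are hypothetical, and `read_access_grants` is a helper written for this sketch that only produces the SQL text:

```python
def read_access_grants(table_fqn: str, principal: str) -> list[str]:
    """Return the grants needed to read one table, one per namespace level."""
    catalog, schema, table = table_fqn.split(".")
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`",
    ]


# Hypothetical table and group names.
grants = read_access_grants("sales.silver.orders", "analysts")
for grant in grants:
    print(grant)
```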
#Databricks #UnityCatalog #DataGovernance #BigData
Unity Catalog finally made Databricks feel like a proper governed data platform to me.
Earlier, it was easy to create tables, jobs, notebooks, and access patterns everywhere.
But UC forces you to think clearly:
Catalog → Schema → Table / View / Volume
That structure is how control and governance become manageable.
#Databricks #UnityCatalog #DataEngineering #Lakehouse
One big mental shift with Databricks: Where does the hardware actually live?
In a Hybrid setup, your VMs and resources sit under your own Azure/AWS/GCP subscription.
But with Serverless, that infrastructure lives in the Databricks subscription instead.
It’s the difference between managing the plumbing yourself vs. just turning on the tap.
#Databricks #Azure #DataEngineering #CloudComputing
Nothing kills a budget faster than "mystery" cloud costs because someone forgot to label their resources.
Stop manually policing cluster usage. Use Azure Policy to enforce tagging: if the ProjectID and Environment tags aren't there, the cluster simply doesn't start. Guardrails always beat "asking nicely" when the bill comes due.
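The guardrail logic itself is simple. This is a plain-Python sketch of the check, not an Azure Policy definition: it only shows the "deny when a required tag is missing" rule, and the `admit` function and the sample tag values are hypothetical.

```python
REQUIRED_TAGS = ("ProjectID", "Environment")


def missing_tags(tags: dict) -> list[str]:
    """List required tags that are absent or empty."""
    return [name for name in REQUIRED_TAGS if not tags.get(name)]


def admit(tags: dict) -> tuple[bool, list[str]]:
    """Allow the request only when every required tag is present."""
    missing = missing_tags(tags)
    return (not missing, missing)


# Sample requests with made-up tag values.
print(admit({"ProjectID": "P-1042", "Environment": "prod"}))
print(admit({"ProjectID": "P-1042"}))
```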
#Azure #CloudGovernance #DataEngineering #FinOps #Databricks
Moving a production pipeline to Databricks Serverless is a total shift for CI/CD.
Honestly, it's such a relief to stop obsessing over whether the cluster config is "perfect" and just focus on the logic instead.
The only caution: Serverless removes infra friction, not cost discipline. You still need to know your processes and the resources they can consume, and back that up with strong monitoring.
We’re finally getting close to that "Write Code, Run Data" dream at scale.
#PySpark #SoftwareEngineering #DataEngineering #Databricks
The biggest mental shift with Databricks Serverless is accepting that the compute plane no longer lives inside your Azure subscription.
At first, it feels strange not seeing those VMs in the Azure Portal.
But not having to troubleshoot “Subscription quota exceeded” errors anymore is a massive productivity win.
#Databricks #DataOps #BigData
A slightly controversial opinion: Serverless isn’t just about speed; it’s a total shift in unit economics.
For bursty, 2-minute ETL jobs, paying the DBU premium can actually be cheaper than keeping an idle Classic cluster alive or paying for the 5-minute startup time of a cold VM.
In the end it is all about the "Total Cost of Ownership."
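Here is the back-of-the-envelope version of that argument. Every rate below is a made-up number for illustration, not real pricing, and the sketch assumes Serverless startup is negligible while the Classic run pays for the full 5-minute cold start:

```python
def cost_per_run(busy_min: float, overhead_min: float, rate_per_min: float) -> float:
    """Cost of one run: billed minutes (work plus overhead) times the rate."""
    return (busy_min + overhead_min) * rate_per_min


# Hypothetical all-in rates in dollars per minute (assumptions, not pricing).
CLASSIC_RATE = 0.10
SERVERLESS_RATE = 0.25  # a 2.5x premium over Classic in this example

classic_cost = cost_per_run(busy_min=2, overhead_min=5, rate_per_min=CLASSIC_RATE)
serverless_cost = cost_per_run(busy_min=2, overhead_min=0, rate_per_min=SERVERLESS_RATE)

# Serverless wins while its premium stays below (busy + overhead) / busy.
break_even_premium = (2 + 5) / 2

print(f"classic:    ${classic_cost:.2f}")
print(f"serverless: ${serverless_cost:.2f}")
print(f"break-even premium: {break_even_premium}x")
```

With these assumed numbers the short job is cheaper on Serverless, but the break-even premium shrinks as the job gets longer, which is why the comparison only favors Serverless for bursty, short workloads.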
#DataStrategy #TechStack #Databricks
Autoscaling on Hybrid clusters always felt a bit reactive: you’re constantly waiting for Azure to provision hardware during peak loads.
Serverless compute hits different. The "warm pool" architecture means you actually get the elasticity that cloud marketing has been promising for years.
#Scaling #FinOps #Databricks
In Azure Databricks, you can choose where your data sits. The "Hybrid" model lets you use your own private network (VNet) for maximum security.
The catch? You are now the "landlord." You have to manage the security rules (NSGs) and ensure you don’t run out of IP addresses. Great for security, but heavy on manual maintenance.
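The IP address part is easy to estimate up front. This is a rough sketch that assumes Azure reserves 5 addresses in every subnet and that each cluster node takes one address in each of the two Databricks subnets, so the smaller subnet caps the node count. The CIDR ranges are examples, and `max_nodes` is a hypothetical helper.

```python
import ipaddress

# Assumption: Azure reserves 5 addresses in every subnet.
AZURE_RESERVED_PER_SUBNET = 5


def max_nodes(subnet_cidr: str) -> int:
    """Upper bound on cluster nodes that fit in one subnet."""
    subnet = ipaddress.ip_network(subnet_cidr)
    return max(subnet.num_addresses - AZURE_RESERVED_PER_SUBNET, 0)


# Example ranges, for illustration only.
for cidr in ("10.0.0.0/26", "10.0.0.0/24"):
    print(cidr, "->", max_nodes(cidr), "nodes")
```

A /26 that looked generous on day one tops out at 59 nodes across every cluster in the workspace, which is the kind of limit you only notice during a peak.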
#AzureDatabricks #DataEngineering #CloudSecurity
Why pay for a platform?
Because of the Photon engine. Running Spark tasks up to 20x faster on a C++ vectorized engine isn't just a "nice to have"; it’s a massive cost saver for enterprise-scale workloads.
#PySpark #TechStack #Databricks
Why Databricks?
There comes a point in every data project where manual tuning doesn't scale.
If you're tired of OOM errors and wasted spend on idle nodes, it’s time.
Features like Serverless and Autoscaling make sure you only pay for what you actually compute. Efficient scaling is the goal.
#FinOps #DataPipelines #Databricks
Why Databricks?
The "plumbing" of Big Data is the biggest time-sink. Setting up Spark on-prem or manual VMs is a headache.
Moving to a managed SaaS like Databricks means spinning up clusters in seconds across AWS, Azure, or GCP.
#CloudComputing #DataArchitecture
Why Databricks?
Stop thinking of Databricks as just "Cloud Spark."
It’s the difference between buying an engine and buying a car.
With the optimized runtime and Delta Lake, you’re getting a Lakehouse architecture that handles ACID transactions on top of your raw data.
#BigData #Lakehouse #Databricks