Sagar Lakshmipathy @saawgr - Twitter Profile

saawgr retweeted

17 days ago

⚡ Speed is Iceberg’s biggest Achilles’ heel.

Most teams chase Velox or Gluten for Spark speed. Yet the real wins sit in fast I/O and layers untouched.

We’ve spent years making lakehouse pipelines fast—but this isn’t a table-format debate.

It’s slow I/O, shuffles, and reprocessing driving costs up.

Speed is a stack:
⚡ SIMD
📊 Columnar processing
🔍 Query plans
💾 Fast I/O
🗄️ Metadata
📐 Storage layout
🔗 Indexes
🔄 Merging

Full-stack engine: 3–4× faster than OSS Spark on Iceberg, better price/perf than Photon, no lock-in.

1

19

2

19

2K

saawgr retweeted

Vinoth Chandar

@byte_array

18 days ago

🧠 Anthropic and OpenAI may have taught Databricks and big data vendors a pricing lesson.

🤖 Frontier models price inference by tokens, not GPU FLOPs. It’s intuitive: bigger context or more work costs more.

Cloud data platforms still price by compute: DBUs, EC2 hours, slot time.

Ask: “How many DBUs will this query use on Databricks vs EMR?” 🤷

Ask: “How much data will it scan?” Most engineers can answer fast.

2

5

1

0

285

saawgr retweeted

Rajesh Mahindra @rmahindra

23 days ago

Data engineers: hours tuning Spark jobs, yet Hudi, Iceberg, or Parquet reads still take minutes? We've all been there. Most pipelines bottleneck pre-join. Details below 👇

1

4

2

1

303

saawgr retweeted

Vinoth Chandar

@byte_array

3 months ago

Everyone assumes usage-based pricing in cloud data is fair and efficient. ⚖️

But it has a real problem: it can stop vendors from building faster engines.

Traditional models priced on value—Oracle earned more for standout features.

Now, with EMR or Databricks, bills hinge on compute usage. Customers win from compute efficiency (lower costs), but vendors lose revenue, pushing them to own the compute layer for pricing control.

Sure, usage models offer flexibility, but they misalign incentives long-term.

What's better? We need outcome-based pricing that rewards real value, like queries executed or data processed. 🚀📊

0

9

4

1

671

Sagar Lakshmipathy @saawgr

4 months ago

@bayer04_en @grok whats the meaning of this post?

1

0

15

saawgr retweeted

Vinoth Chandar

@byte_array

7 months ago

🚀 Quanton now also powers Apache Iceberg natively — delivering 3× faster Spark workloads!

When we launched Quanton, the goal was ambitious: make Spark truly lakehouse-optimized — faster, smarter, and format-aware.

👇 https://t.co/4quZYhsVJ1

1

14

5

1

998

saawgr retweeted

Onehouse @Onehousehq

7 months ago

💸 Most teams running Apache Spark™ are burning 30-70% of their compute budget, and they don’t even know it.

Why? Because Spark’s defaults are built for throughput, not efficiency.

On Nov 18, join us for a live session on The True Cost of Spark, and how to cut it in half.

We’ll unpack:
⚙️ Where Spark leaks money (and how to see it in your own jobs)
📉 Real-world fixes that delivered 50–60% cost savings
🚀 How to achieve 2-3x better price/performance, no code changes required

📅 Nov 18, 10am PT

👉Save your spot: https://t.co/MaGCoCucpu

#ApacheSpark #DataEngineering #ETL #Lakehouse

0

4

3

0

192

saawgr retweeted

Apache Hudi

@apachehudi

9 months ago

📊 @ApacheSpark + @apachehudi users! Which Spark APIs are the most common for reading, writing, and managing Hudi tables?

0

2

1

0

175

saawgr retweeted

Apache Hudi

@apachehudi

9 months ago

🚀 [New Blog] Performing append-only write operations is quite easy in #ApacheHudi!

Since v0.14 (2023-09), you don’t need to set a record key field to start writing to Hudi tables. Auto key generation lowers the barrier for getting started with a data lakehouse—perfect for append-only writes.

Check out the deep dive blog by Hudi PMC member @_xushiyan on the design: https://t.co/AJyzbNucql

#DataEngineering #DataLakehouse #OpenSource

0

5

3

1

315

saawgr retweeted

Sagar Lakshmipathy @saawgr

about 1 year ago

Supplementing Onehousehq’s Open Engines feature (https://t.co/Yt4JvFiTFv), I had the opportunity to write a deep-dive style blog (https://t.co/SJziAu8Gyp) comparing popular streaming engines like Apache Flink, Spark’s Structured Streaming and Kafka Streams. #dataeng #streaming

0

11

2

1

251

saawgr retweeted

Onehouse @Onehousehq

about 1 year ago

Trying to pick the right streaming engine? Check out our no-fluff breakdown of @ApacheFlink , #KafkaStreams, and #SparkStructuredStreaming Spoiler: they all rock… but in very different ways. 👉 https://t.co/ebjZMgS1JV #dataengineering #streaming #opensource #realtimedata #Flink

0

3

1

0

148

saawgr retweeted

Onehouse @Onehousehq

about 1 year ago

🚨 Announcing Open Engines™, a quick + reliable way to deploy @trinodb, @raydistributed, and @ApacheFlink making it easy to choose the right engine for analytics, streaming, or ML/DS. Read the details👉 https://t.co/Gvg7J95NYR

0

8

5

0

367

saawgr retweeted

Onehouse @Onehousehq

over 1 year ago

🥦 SQL Server CDC makes it possible to keep analytics fresh and up-to-date. But you have to hook up the change stream to your analytics data store to keep it current.

✅ ✅✅ You also need a streaming platform to deliver the news, and a flexible data store. Kafka? Check. Onehouse? Double-check.

🤓 Our new solution guide shows you how to do just that. Check it out!

#onehouse #dataengineering #nolockin
#datalakehouse #opensource

https://t.co/VbMqjzWGl2

0

2

1

0

135

Sagar Lakshmipathy @saawgr

over 1 year ago

Have you been keeping up with Hudi docs lately? Esp. in the past few months the team made some great strides. If you are looking at your spark UI for Hudi writes and wondering, "wait, what's going on?"

Here's what's going on: https://t.co/oYmBiBUtUS https://t.co/PpoY7hDnAv

0

4

1

2

210

saawgr retweeted

Onehouse @Onehousehq

over 1 year ago

☝️ What if you could use the catalog(s) and query engine(s) of your choice, against a single source of truth?

🐎 In this blog post, Po Hong shows you how to make that vision real.

😃 And see our cool infographic for a few useful data architecture principles.

https://t.co/XgdyjlAEBA

#onehouse #dataengineering #nolockin
#universaldatalakehouse #apachextable #opensource

0

1

0

233

saawgr retweeted

Robin Moffatt 🍻🏃🥓

@rmoff

almost 2 years ago

This looks pretty neat - a TUI for Apache Kafka https://t.co/XXqPTcQaNc

2

194

45

141

20K

Sagar Lakshmipathy @saawgr

almost 2 years ago

RT @Onehousehq: 🎉 Exciting News! For Onehouse and those rooting for the open data lakehouse 🎉 We are happy to announce our $35M Series B r…

0

4

0

saawgr retweeted

Onehouse @Onehousehq

almost 2 years ago

🏇 You can mix and match data lakehouse table formats and query engines for best performance.

🧱 In one case, Hudi drove Databricks query execution time from 1 minute to 4 seconds.

⚒️ Find out how in this technical deep dive from Sagar Lakshmipathy of Onehouse.

https://t.co/4KuXobRiMj

#onehouse #dataengineering #nolockin
#universaldatalakehouse #apachehudi
#apachextable #opensource

0

4

1

2

226

Sagar Lakshmipathy @saawgr

over 2 years ago

Here’s a blog where I go into detail about this architecture: https://t.co/bRtc5hIoTs

0

1

0

26

Sagar Lakshmipathy @saawgr

over 2 years ago

In the past few weeks I saw several users trying to use @OnetableOSS to migrate between table formats and also catalog them while they are at it.

1

0

37

Sagar Lakshmipathy @saawgr

over 2 years ago

For example, a @apachehudi table can be translated to and also be catalogued as an #Iceberg table on AWS Glue directly without running a crawler.

1

0

43
