Apache Spark

Verified account

@ApacheSpark

Lightning-fast unified analytics engine

Joined June 2013

1 Following

44.1K Followers

202 Posts

19 days ago

For a long time, streaming architecture advice boiled down to two engines: one for high-throughput ETL, another when you need millisecond latency. At Data Engineering Open Forum 2026, Indrajit Roy (@databricks) walked through how Apache Spark Structured Streaming took a different path from day one: micro-batch processing. 🔸 Micro-batch model: Records arrive on a stream; the engine waits briefly, forms a batch, processes it, then repeats. 🔸 Batch query on stream slices: Each step is effectively running a small batch query over the latest slice of data. 🔸 One engine, different tradeoffs: The design challenges the “two streaming engines” default instead of accepting it as fixed. Watch the full keynote: https://t.co/mpQstlPVSF #ApacheSpark #StructuredStreaming #DataEngineering #OpenSource
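The micro-batch loop described in this post can be sketched in plain Python. This is a toy illustration, not Spark code: the names `micro_batches`, `run_batch_query`, and `process` are hypothetical, and a fixed record count stands in for the trigger interval a real engine would use to decide when a batch is formed.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator


def micro_batches(stream: Iterable[str], max_records: int) -> Iterator[list[str]]:
    """Cut a stream into small slices (a stand-in for 'wait briefly, form a batch')."""
    it = iter(stream)
    while True:
        batch = list(islice(it, max_records))
        if not batch:
            return
        yield batch


def run_batch_query(batch: list[str]) -> Counter:
    """The small batch query run over one slice: count records per key."""
    return Counter(batch)


def process(stream: Iterable[str], max_records: int = 3) -> Counter:
    """Form a batch, process it, fold the result into running totals, repeat."""
    totals: Counter = Counter()
    for batch in micro_batches(stream, max_records):
        totals.update(run_batch_query(batch))
    return totals


events = ["click", "view", "click", "view", "view", "buy", "click"]
totals = process(events)
```

Each pass through the loop is an ordinary batch computation over the latest slice, which is why the same query logic can serve both batch and streaming inputs.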

20 days ago

#DataAISummit Session Spotlight ➡️ Learn how to build agentic workflows with OSS Spark Declarative Pipelines, with patterns for deterministic, testable, production-ready data workflows. 🗓️ June 15–18 📍 San Francisco 🔗 Session details: https://t.co/fWzAL6uUcQ #ApacheSpark #DataAISummit

23 days ago

#DataAISummit (June 15-18) Session Spotlight 👇 Get a year in review and the roadmap for Apache Spark Structured Streaming in open source: what's shipping in Spark 4.1 and what's ahead in 4.2 for mission-critical streaming ingestion and ETL pipelines. Jerry Peng and Anish Shrigondekar (@databricks) will cover recent advances and what's next! 🔗Details: https://t.co/FUejfH20yH #ApacheSpark #DataAISummit #StructuredStreaming #DataEngineering

23 days ago

At DEOF 2026, Indrajit Roy (@databricks) opened with a keynote on how Apache Spark Structured Streaming innovated on throughput, latency, and flexibility, and what that means for data engineers in 2026. 👇 Real-time isn’t just for streaming specialists anymore. Express the logic. Let the engine handle the rest. 📹 Full video: https://t.co/c4erC4bdai #ApacheSpark

24 days ago

#DataAISummit Session Spotlight 👇 Apache Spark™ 4.2: unified batch + streaming for AI workloads—feature pipelines, multimodal data, planner-level optimizations. 🎤 DB Tsai & Xiao Li | 🗓️ June 15-18 | 📍 San Francisco Session details: https://t.co/TXHCFk5Ivi #ApacheSpark #DataAISummit

24 days ago

#DataAISummit Session Spotlight 👇 Andreas Neumann and @lisancao will cover Spark Declarative Pipelines (4.1). Declare what your pipeline does, and Spark manages execution, parallelization, checkpoints, and failure recovery. 🗓️ June 15–18 | 📍 San Francisco 🔗 Session details: https://t.co/k7aL1owlOW 🎟️ Register: https://t.co/shoIN3p2Gc #ApacheSpark #DataAISummit

25 days ago

Stateless: 4.1. Stateful + RTM: upstream. If you have a streaming workload that "shouldn't be on Spark" because it needed ms, pull the RC and try it. The next move is yours.

25 days ago

For a decade, “streaming on Spark” meant micro-batches. Fine for ETL. A wall if your latency budget was under a second. Spark 4.1 stops that. Real-Time Mode (SPARK-50708) 👇

25 days ago

How: • Continuous execution — long-lived tasks • Simultaneous scheduling — stage N+1 on N’s first record • Streaming shuffle — in-memory handoff, no batch boundary
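The difference between a batch boundary and record-at-a-time handoff can be shown with Python generators. This is a toy sketch under loud assumptions: generators stand in for long-lived tasks, a plain list stands in for a materialized batch, and the names `stage1`, `stage2`, `run_with_batch_boundary`, and `run_pipelined` are hypothetical. It does not reflect how Spark implements Real-Time Mode internally.

```python
from typing import Iterable, Iterator


def stage1(records: Iterable[int], log: list[str]) -> Iterator[int]:
    """Upstream stage: emits records one at a time."""
    for r in records:
        log.append(f"stage1 emits {r}")
        yield r


def stage2(upstream: Iterable[int], log: list[str]) -> Iterator[int]:
    """Downstream stage: doubles each record it receives."""
    for r in upstream:
        log.append(f"stage2 handles {r}")
        yield r * 2


def run_with_batch_boundary(records: Iterable[int]) -> tuple[list[int], list[str]]:
    """Stage 1 finishes the whole slice before stage 2 sees anything."""
    log: list[str] = []
    materialized = list(stage1(records, log))  # the batch boundary
    return list(stage2(materialized, log)), log


def run_pipelined(records: Iterable[int]) -> tuple[list[int], list[str]]:
    """Stage 2 pulls from stage 1 directly, so it starts on the first record."""
    log: list[str] = []
    return list(stage2(stage1(records, log), log)), log


batch_out, batch_log = run_with_batch_boundary([1, 2])
stream_out, stream_log = run_pipelined([1, 2])
```

Both runs produce the same output; only the interleaving in the log differs. In the pipelined run, stage 2 handles record 1 before stage 1 has emitted record 2, which is the latency benefit the post attributes to simultaneous scheduling and in-memory handoff.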

26 days ago

Agent-written Spark can pass static checks and a 10K-row sample, then fail at hour three. @lisancao breaks down how Spark 4.1 addresses that, with three patterns worth knowing 👇 🔹 SDP: declare intent, not triggers/checkpoints 🔹 RTM: one engine for sub-sec + batch 🔹 Connect: pyspark-client; prod = URL change 🔗 Read more: https://t.co/B6jjJ28ieD #ApacheSpark

26 days ago

#DataAISummit Session Spotlight 👇 Structured Streaming: year in review + roadmap. Real-Time Mode, stateful transforms, Spark 4.2 ahead. 🎤 Jerry Peng & Anish Shrigondekar 🗓️ June 15–18 📍 San Francisco 🔗 Details: https://t.co/qwuMTrTSeh 🎟️ Register: https://t.co/SQkHYvLD2u

26 days ago

#DataAISummit Session Spotlight 👇 Spark 4.1 introduces Spark Declarative Pipelines (SDP). Declare datasets and transformations. Spark manages the execution plan. Less boilerplate. Faster path to production. The session covers dependency resolution, checkpoint coordination, failure recovery, incremental processing, and testing patterns. 🎤 Andreas Neumann & Lisa Cao 📆 June 15-18 📍 San Francisco Session details: https://t.co/k7aL1ovNZo #ApacheSpark #DataAISummit #DataEngineering #Spark
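The "declare datasets and transformations, let the engine plan execution" idea can be illustrated with a toy registry in plain Python. Everything here is an assumption for illustration: the `dataset` decorator, the `_registry` dict, and `run_pipeline` are hypothetical and are not the Spark Declarative Pipelines API. The sketch only shows dependency resolution, using the standard library's `graphlib`.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Hypothetical registry: dataset name -> (names it reads, function that builds it)
_registry: dict[str, tuple[tuple[str, ...], Callable]] = {}


def dataset(*depends_on: str):
    """Register a function as a named dataset plus the datasets it reads."""
    def wrap(fn: Callable) -> Callable:
        _registry[fn.__name__] = (depends_on, fn)
        return fn
    return wrap


# Declarations appear out of order on purpose; the runner works out the order.
@dataset("cleaned")
def daily_total(cleaned):
    return sum(cleaned)


@dataset("raw")
def cleaned(raw):
    return [x for x in raw if x is not None]


@dataset()
def raw():
    return [3, None, 4]


def run_pipeline() -> tuple[list[str], dict[str, object]]:
    """Resolve dependencies, then build each dataset after its inputs."""
    graph = {name: set(deps) for name, (deps, _) in _registry.items()}
    order = list(TopologicalSorter(graph).static_order())
    results: dict[str, object] = {}
    for name in order:
        deps, fn = _registry[name]
        results[name] = fn(*(results[d] for d in deps))
    return order, results


order, results = run_pipeline()
```

The author of the pipeline never writes the execution order; it falls out of the declared dependencies. A real engine would layer checkpointing, incremental processing, and failure recovery on top of the same graph.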

27 days ago

#DataAISummit Session Spotlight 👇 Apache Spark™ 4.2: unified batch + streaming for AI workloads: feature pipelines, multimodal data, planner-level optimizations. 🎤 DB Tsai & Xiao Li | 🗓️ June 15–18 | 📍 San Francisco 🔗 Session details: https://t.co/LXLFbSuqDh #ApacheSpark

27 days ago

Spark 4.1 for agents 👇 🔹 SDP: triggers/checkpoints/DAG off the agent; dry-run fails fast 🔹 RTM: sub-second + batch, one engine (stateless in 4.1) 🔹 Connect: pyspark-client, no local JVM; sandbox→prod = URL Agent owns intent. Spark absorbs the rest. 🔗 Read more: https://t.co/B6jjJ28ieD #ApacheSpark #DataEngineering

about 1 month ago

Apache Spark 4.1 is out today. 🚀 AI data agents are now common in data engineering. They're also a real risk in production: tool sprawl and the glue code required to run real pipelines create a huge surface area for silent errors. The cost is wasted time and wasted compute on jobs you only notice are broken three hours into a four-hour run. Three architectural changes in 4.1 shrink that surface area. 1️⃣ Spark Declarative Pipelines (SDP) 2️⃣ Real-Time Mode 3️⃣ Spark Connect + Project Feather Three architectural changes. One platform shape. Fewer surfaces for the agent to drift on. Less technical debt as you ship. 👉 Get started: https://t.co/dnakBRz8IE #ApacheSpark #DataEngineering #OSS #AIagents

about 1 month ago

#DataAISummit Session Spotlight (June 15–18 | San Francisco)👇 What's New in Apache Spark™ 4.1? 🔧 Spark Declarative Pipelines (SDP) ⚡ Structured Streaming Real-Time Mode 🐍 PySpark 🔗 Spark Connect & SQL Session details: https://t.co/cqcVvuarTp Register: https://t.co/shoIN3ouQE #ApacheSpark #DataAISummit

about 1 month ago

Apache Spark is great at petabytes. It can be heavy at 100 megabytes. Project Feather is a new SPIP to fix that. 👇 Three lines of work, all targeting Spark in local mode: 1️⃣ Compilation and scheduling. Skip unnecessary shuffles when the planner knows a scan is one file. Mark it SinglePartition and let the next aggregate run in place. 2️⃣ Arrow-based df.cache. Swap the row-oriented cache for Apache Arrow IPC. Columnar, compressed, iterable. 3️⃣ Shuffle-free execution. On a single node, replace blocking shuffle with in-process channels and Java virtual threads. No disk round-trip. Prototype today: a filter-and-sort query on a small in-memory table runs in 150 ms instead of 330 ms. One stage instead of two. The win compounds as the optimizations stack. 🔗 Project Feather: https://t.co/9mo1gMj5Fq The SPIP is open for comment. Pull the prototype, run it against your hardest small-data pipeline, file the bug we missed. ✍ Authors: Daniel Tenedorio and Liang-Chi Hsieh. #ApacheSpark #SPIP #OpenSource #DataEngineering #ApacheArrow
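The first line of work, skipping the shuffle when a scan is known to be a single file, can be sketched as a toy planning rule in plain Python. This is an illustration under stated assumptions, not Spark's planner: the `Scan` class, the step labels, and the functions `plan_aggregate` and `stage_count` are hypothetical, and the only idea carried over from the post is that a single-partition input lets the aggregate run in place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    """Hypothetical scan node: just the list of files it reads."""
    files: tuple[str, ...]


def plan_aggregate(scan: Scan) -> list[str]:
    """Return the physical steps for 'scan, then aggregate' (labels are illustrative)."""
    if len(scan.files) == 1:
        # One file means one partition: all rows are already co-located,
        # so the aggregate can run in place with no shuffle.
        return ["scan[SinglePartition]", "aggregate(final)"]
    # Otherwise aggregate per partition, exchange rows, then finish.
    return ["scan", "aggregate(partial)", "shuffle", "aggregate(final)"]


def stage_count(steps: list[str]) -> int:
    """Each shuffle ends one stage and starts another."""
    return 1 + steps.count("shuffle")


small = plan_aggregate(Scan(("events.parquet",)))
large = plan_aggregate(Scan(("part-0.parquet", "part-1.parquet")))
```

In this sketch the single-file plan has one stage and the multi-file plan has two, which mirrors the "one stage instead of two" outcome the post reports for its prototype query.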

about 1 month ago

#DataAISummit Session Spotlight 👇 Faster, Leaner, and Easier to Debug: PySpark UDFs in 2026 At Data + AI Summit, Tian Gao and Yicong Huang will cover Arrow-based execution and improved debuggability for PySpark UDFs — including Native Arrow UDFs/UDTFs and built-in faulthandler + profiling. 📍 June 15–18 · SF Add to your agenda: https://t.co/DVAf8drwz6 #ApacheSpark #PySpark #DataAISummit #DataEngineering #OpenSource
