For a long time, streaming architecture advice boiled down to two engines: one for high-throughput ETL, another when you need millisecond latency.
At Data Engineering Open Forum 2026, Indrajit Roy (@databricks) walked through how Apache Spark Structured Streaming took a different path from day one: micro-batch processing.
- Micro-batch model: Records arrive on a stream; the engine waits briefly, forms a batch, processes it, then repeats.
- Batch query on stream slices: Each step is effectively running a small batch query over the latest slice of data.
- One engine, different tradeoffs: The design challenges the “two streaming engines” default instead of accepting it as fixed.
Watch the full keynote: https://t.co/mpQstlPVSF
#ApacheSpark #StructuredStreaming #DataEngineering #OpenSource
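The micro-batch loop described in the post above can be sketched in a few lines of plain Python. This is a simplified illustration, not Spark's implementation: `run_micro_batches`, `batch_query`, and `max_batch_size` are hypothetical names, and a real engine forms batches on a time-based trigger rather than a fixed size.

```python
from collections import deque


def run_micro_batches(source, batch_query, max_batch_size):
    """Drain `source` in slices and run `batch_query` on each slice."""
    pending = deque(source)  # stands in for records waiting on the stream
    results = []
    while pending:
        # Form a batch from the records that have arrived (capped for the sketch).
        size = min(max_batch_size, len(pending))
        batch = [pending.popleft() for _ in range(size)]
        # Each step is a small batch query over the latest slice of data.
        results.append(batch_query(batch))
    return results


events = [3, 1, 4, 1, 5, 9, 2]
per_batch_sums = run_micro_batches(events, sum, max_batch_size=3)
print(per_batch_sums)
```

Any function that works on a list can serve as the batch query here, which is the point of the model: the streaming step reuses ordinary batch logic.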
#DataAISummit Session Spotlight: Learn how to build agentic workflows with OSS Spark Declarative Pipelines, with patterns for deterministic, testable, production-ready data workflows.
June 15–18
San Francisco
Session details: https://t.co/fWzAL6uUcQ
#ApacheSpark #DataAISummit
#DataAISummit (June 15-18) Session Spotlight
Get a year in review and the roadmap for Apache Spark Structured Streaming in open source: what's shipping in Spark 4.1 and what's ahead in 4.2 for mission-critical streaming ingestion and ETL pipelines.
Jerry Peng and Anish Shrigondekar (@databricks) will cover recent advances and what's next!
Details: https://t.co/FUejfH20yH
#ApacheSpark #DataAISummit #StructuredStreaming #DataEngineering
At DEOF 2026, Indrajit Roy (@databricks) opened with a keynote on how Apache Spark Structured Streaming innovated on throughput, latency, and flexibility, and what that means for data engineers in 2026.
Real-time isn’t just for streaming specialists anymore. Express the logic. Let the engine handle the rest.
Full video: https://t.co/c4erC4bdai
#ApacheSpark
#DataAISummit Session Spotlight
Andreas Neumann and @lisancao will cover Spark Declarative Pipelines (4.1). Declare what your pipeline does, and Spark manages execution, parallelization, checkpoints, and failure recovery.
June 15–18 | San Francisco
Session details: https://t.co/k7aL1owlOW
Register: https://t.co/shoIN3p2Gc
#ApacheSpark #DataAISummit
Stateless: 4.1. Stateful + RTM: upstream.
If you have a streaming workload that "shouldn't be on Spark" because it needed ms, pull the RC and try it. The next move is yours.
For a decade, “streaming on Spark” meant micro-batches. Fine for ETL. A wall if your latency budget was under a second.
Spark 4.1 stops that. Real-Time Mode (SPARK-50708)
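A back-of-the-envelope sketch of why micro-batching puts a floor under latency: a record has to wait for the next trigger before any processing starts. This is a simplified model with a hypothetical helper, `micro_batch_wait`, and an assumed fixed trigger interval; it is not a measurement of Spark and it ignores processing time entirely.

```python
import math


def micro_batch_wait(arrival_s, trigger_interval_s):
    """Seconds a record waits for the next trigger boundary (processing time excluded)."""
    next_trigger = math.ceil(arrival_s / trigger_interval_s) * trigger_interval_s
    return next_trigger - arrival_s


# Assumed 1-second trigger interval; arrival times are in seconds since start.
waits = [micro_batch_wait(t, trigger_interval_s=1.0) for t in (0.25, 0.5, 1.0)]
print(waits)
```

In this model a record that arrives just after a trigger waits almost the full interval, so shrinking latency means shrinking the interval or dropping the batch boundary altogether.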
Agent-written Spark can pass static checks and a 10K-row sample, then fail at hour three.
@lisancao breaks down how Spark 4.1 addresses that, with three patterns worth knowing:
- SDP: declare intent, not triggers/checkpoints
- RTM: one engine for sub-sec + batch
- Connect: pyspark-client; prod = URL change
Read more: https://t.co/B6jjJ28ieD
#ApacheSpark
#DataAISummit Session Spotlight
Spark 4.1 introduces Spark Declarative Pipelines (SDP). Declare datasets and transformations. Spark manages the execution plan. Less boilerplate. Faster path to production.
The session covers dependency resolution, checkpoint coordination, failure recovery, incremental processing, and testing patterns.
Speakers: Andreas Neumann & Lisa Cao
June 15–18
San Francisco
Session details: https://t.co/k7aL1ovNZo
#ApacheSpark #DataAISummit #DataEngineering #Spark
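A toy illustration of "declare datasets and transformations, let the engine work out the execution order", written in plain Python rather than the SDP API. The `dataset` decorator, the `DATASETS` registry, and `run_pipeline` are hypothetical names invented for this sketch; checkpoint coordination, incremental processing, and failure recovery are not modeled.

```python
from graphlib import TopologicalSorter

DATASETS = {}  # name -> (dependency names, transformation function)


def dataset(name, depends_on=()):
    """Register a transformation under `name` (hypothetical helper)."""
    def register(fn):
        DATASETS[name] = (tuple(depends_on), fn)
        return fn
    return register


# Declarations are deliberately out of order: the runner resolves dependencies.
@dataset("revenue", depends_on=["valid_orders"])
def revenue(valid_orders):
    return sum(o["amount"] for o in valid_orders)


@dataset("valid_orders", depends_on=["raw_orders"])
def valid_orders(raw_orders):
    return [o for o in raw_orders if o["amount"] > 0]


@dataset("raw_orders")
def raw_orders():
    return [{"id": 1, "amount": 20}, {"id": 2, "amount": -5}, {"id": 3, "amount": 30}]


def run_pipeline():
    """Order datasets so every dependency is computed before its consumers."""
    graph = {name: set(deps) for name, (deps, _) in DATASETS.items()}
    order = list(TopologicalSorter(graph).static_order())
    outputs = {}
    for name in order:
        deps, fn = DATASETS[name]
        outputs[name] = fn(*(outputs[d] for d in deps))
    return order, outputs


order, outputs = run_pipeline()
print(order, outputs["revenue"])
```

The author only states what each dataset depends on; the run order falls out of the dependency graph, which is the boilerplate a declarative framework takes off your hands.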
#DataAISummit Session Spotlight
Apache Spark™ 4.2: unified batch + streaming for AI workloads: feature pipelines, multimodal data, planner-level optimizations.
Speakers: DB Tsai & Xiao Li | June 15–18 | San Francisco
Session details: https://t.co/LXLFbSuqDh
#ApacheSpark
Apache Spark 4.1 is out today.
AI data agents are now common in data engineering. They're also a real risk in production: tool sprawl and the glue code required to run real pipelines create a huge surface area for silent errors. The cost is wasted time and wasted compute on jobs you only notice are broken three hours into a four-hour run.
Three architectural changes in 4.1 shrink that surface area.
1. Spark Declarative Pipelines (SDP)
2. Real-Time Mode
3. Spark Connect + Project Feather
Three architectural changes. One platform shape. Fewer surfaces for the agent to drift on. Less technical debt as you ship.
Get started: https://t.co/dnakBRz8IE
#ApacheSpark #DataEngineering #OSS #AIagents
Apache Spark is great at petabytes. It can be heavy at 100 megabytes. Project Feather is a new SPIP to fix that.
Three lines of work, all targeting Spark in local mode:
1. Compilation and scheduling. Skip unnecessary shuffles when the planner knows a scan is one file. Mark it SinglePartition and let the next aggregate run in place.
2. Arrow-based df.cache. Swap the row-oriented cache for Apache Arrow IPC. Columnar, compressed, iterable.
3. Shuffle-free execution. On a single node, replace blocking shuffle with in-process channels and Java virtual threads. No disk round-trip.
Prototype today: a filter-and-sort query on a small in-memory table runs in 150 ms instead of 330 ms. One stage instead of two. The win compounds as the optimizations stack.
Project Feather: https://t.co/9mo1gMj5Fq
The SPIP is open for comment. Pull the prototype, run it against your hardest small-data pipeline, file the bug we missed.
Authors: Daniel Tenedorio and Liang-Chi Hsieh.
#ApacheSpark #SPIP #OpenSource #DataEngineering #ApacheArrow
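The first line of work in the post above, skipping the shuffle when a scan is known to be a single file, can be pictured with a toy planner. This is an assumption-laden sketch, not Spark's planner or the Feather prototype: `Scan`, `plan_aggregate`, and the stage labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    files: tuple  # paths the scan would read


def plan_aggregate(scan):
    """Return the stage list for `scan -> aggregate` in this toy model."""
    if len(scan.files) == 1:
        # One file means all rows already sit in a single partition,
        # so the aggregate can run in place with no exchange step.
        return ["scan(single_partition)", "aggregate"]
    # Otherwise rows for one key may live in several partitions,
    # so a shuffle has to bring them together before the final aggregate.
    return ["scan", "partial_aggregate", "shuffle", "final_aggregate"]


small = plan_aggregate(Scan(files=("events.parquet",)))
large = plan_aggregate(Scan(files=("part-0.parquet", "part-1.parquet")))
print(small)
print(large)
```

The small-data plan has no shuffle step at all, which is the kind of saving the proposal is after.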
#DataAISummit Session Spotlight
Faster, Leaner, and Easier to Debug: PySpark UDFs in 2026
At Data + AI Summit, Tian Gao and Yicong Huang will cover Arrow-based execution and improved debuggability for PySpark UDFs, including Native Arrow UDFs/UDTFs and built-in faulthandler + profiling.
June 15–18 · SF
Add to your agenda: https://t.co/DVAf8drwz6
#ApacheSpark #PySpark #DataAISummit #DataEngineering #OpenSource