.@SourcetableApp CTO @andrewgrosser shares his recommended tech stack for serious startups - a "wicked combination" that includes:
- S3 + Cassandra for data
- Daft for processing
- Python, WASM, Ray
Learn how they built the first AI-powered spreadsheet:
https://t.co/ZX8fYQzfwt
First-class observability in Daft.
Operators, Tasks, Rows, and Memory are all surfaced in a dashboard that ships with the install.
+ OTel endpoints for your existing collector.
+ Stuck detection.
+ DAFT_TRACE for console debugging.
~45 PRs across the observability stack.
https://t.co/HWyjBiaePN
daft.VideoFile is perfect for Physical AI.
Open X-Embodiment aggregates over a million episodes. DROID alone runs 350+ hours of multi-camera 60fps footage. That's hundreds of millions of frames across a single dataset, and most action-model training doesn't need them all.
- read_video_frames — filter on keyframes; supports S3, GCS, & YouTube URLs.
- video_metadata — resolution, fps, duration, frame count from file headers.
- video_frames(start_time, end_time) — decode a 10-second window from a 90-minute file.
Frames land as Image columns in the same DataFrame.
Feed them to a vision model, compute embeddings, and write to Iceberg.
Check out the blog
https://t.co/Ucn3SzF12g
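A quick back-of-envelope check on those frame counts, plus what a 10-second window means in frame indices. This is plain Python arithmetic, not the Daft API. The camera count of 3 is an assumption (the post only says "multi-camera"), and `window_to_frame_range` is a hypothetical helper for illustration.

```python
import math

HOURS = 350      # from the post ("350+ hours")
FPS = 60         # from the post
CAMERAS = 3      # ASSUMPTION: the post only says "multi-camera"

# Frames recorded by one camera stream, then across all assumed streams.
frames_per_camera = HOURS * 3600 * FPS
total_frames = frames_per_camera * CAMERAS


def window_to_frame_range(start_time: float, end_time: float, fps: float) -> range:
    """Map a [start_time, end_time) window in seconds to frame indices.

    Hypothetical helper: it only shows how few frames a short window
    touches compared with decoding the whole file.
    """
    if end_time < start_time:
        raise ValueError("end_time must be >= start_time")
    first = math.ceil(start_time * fps)
    last = math.ceil(end_time * fps)
    return range(first, last)


# A 10-second window out of a 90-minute, 60fps file.
window = window_to_frame_range(120, 130, FPS)
file_frames = 90 * 60 * FPS
window_share = len(window) / file_frames
```

Under these assumptions a 10-second window covers 600 of 324,000 frames, which is why time-bounded decoding matters at this scale.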
VLAs are dead, long live World Action Models
So declares @DrJimFan, the most credible researcher in robotics today.
https://t.co/UOFvpoz41l
👆We just published a short blog where @ykdojo breaks down the video. It certainly helped me correct my mental model.
So turns out I'm not the only one who builds on @daftengine 😆
In fact, there's a TON of projects that leverage Daft natively to power their AI & data processing.
Daft is the Data Engine for AI.
> I say it because it's true.
> I keep saying it because the Daft community keeps giving back!
Check out all these projects! (link in the comments)
Probably my favorite episode yet!
Just finished filming our latest episode of Zero Shot Espresso with @danimberman, an @ApacheAirflow PMC member who developed the @kubernetesio executor and now helps technical teams ship production AI as a consultant.
🚢 Daft v0.7.10
30 contributors (a release record!)
41 new features and functions.
Distributed as_of joins, SimHash dedupe, temporal arithmetic, C++ extensions.
https://t.co/Blit46bYww
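For anyone new to SimHash dedupe, here is a minimal plain-Python sketch of the general idea: near-duplicate texts get fingerprints that differ in only a few bits. It is not Daft's implementation. The whitespace tokenizer, the MD5-based token hash, and the 64-bit width are all assumptions picked for illustration.

```python
import hashlib

BITS = 64  # fingerprint width (assumed for this sketch)


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: BITS // 8], "big")


def simhash(text: str) -> int:
    """Each token votes +1/-1 on every bit; the sign of the tally becomes the bit."""
    tally = [0] * BITS
    for token in text.lower().split():  # ASSUMPTION: whitespace tokens, no shingling
        h = token_hash(token)
        for i in range(BITS):
            tally[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, votes in enumerate(tally):
        if votes > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    """Treat two texts as near-duplicates when their fingerprints are close."""
    return hamming(simhash(a), simhash(b)) <= max_distance
```

The `max_distance` threshold is a tuning knob, and real pipelines usually fingerprint word or character shingles rather than bare tokens.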
The fastest H3 geospatial indexing in Daft wasn't written by the Daft team.
Developed by Garrett Weaver, daft-h3 runs 3–16x faster than simply wrapping h3-py in a Python UDF. That speedup is thanks to Daft's Native Extensions, powered by Apache Arrow's C Data Interface.
Most image embedding pipelines are actually two pipelines stitched together.
Script one: PySpark reads images from S3, resizes them, joins with metadata, writes to Delta Lake.
Script two: PyTorch loads ResNet, generates embeddings on GPU, writes back to Delta Lake.
Two frameworks. Two sets of dependencies. Two GPU configs. Serialization overhead at every boundary.
With Daft, it's one script. download → resize → join → embed → write. daft.cls handles GPU placement and batching. No handoff.
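The shape of that single script can be sketched in plain Python: a stateful class whose expensive setup runs once, fed in batches by one pipeline function. This does not use the Daft API. `Embedder`, `batched`, and `run_pipeline` are hypothetical names, the "resize" is a byte truncation, and the "embedding" is a toy byte histogram standing in for a real model.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Optional


class Embedder:
    """Stand-in for a GPU model: setup happens once, then batches flow through."""

    loads = 0  # counts how many times the "model" was loaded

    def __init__(self, dim: int = 4) -> None:
        Embedder.loads += 1  # pretend the weights are loaded onto a device here
        self.dim = dim

    def __call__(self, batch: List[bytes]) -> List[List[float]]:
        vectors = []
        for image in batch:
            vec = [0.0] * self.dim
            for byte in image:  # toy embedding: fold byte values into dim buckets
                vec[byte % self.dim] += 1.0
            vectors.append(vec)
        return vectors


def batched(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def run_pipeline(rows: Iterable[dict], metadata: Dict[int, str], batch_size: int = 2) -> List[dict]:
    """resize -> join -> embed in one pass, with the model created exactly once."""
    embedder = Embedder()
    results = []
    for batch in batched(rows, batch_size):
        resized = [row["image"][:8] for row in batch]  # stand-in for an image resize
        vectors = embedder(resized)
        for row, vec in zip(batch, vectors):
            label: Optional[str] = metadata.get(row["id"])  # the metadata join
            results.append({"id": row["id"], "label": label, "embedding": vec})
    return results
```

Every stage shares one process and one set of in-memory rows, so nothing is serialized between a preprocessing framework and a model framework.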
Eventual was ranked #47 globally on Paraform’s Talent Density Index.
What I liked most about this wasn’t the ranking itself, but how they define it: not by who looks impressive on paper, but by who’s actually developing people the market is fighting for.
A friend put it better than I could:
“Honestly, it’s a testament to the talent you’re recruiting and fostering.”
Feels right.
Grateful to be building alongside this team.
https://t.co/86OlaDaTMf
daft.File is lazy: nothing opens until a UDF calls .open() or .to_tempfile().
Filter millions of files by path and MIME type. Then open only the survivors. Markdown, PDFs, code, audio, video — same interface.
https://t.co/tH8KmVavT1
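The filter-then-open pattern is easy to see in a stdlib-only sketch. `LazyFile` below is a hypothetical stand-in, not daft.File. It guesses the MIME type from the file extension, which is an assumption and may not be how Daft detects it.

```python
import mimetypes
import os
import tempfile
from typing import Dict, List, Optional, Tuple


class LazyFile:
    """Holds a path only; no file handle exists until open() is called."""

    opened = 0  # counts real opens, to show that filtering costs no I/O

    def __init__(self, path: str) -> None:
        self.path = path

    @property
    def mime_type(self) -> Optional[str]:
        # ASSUMPTION: extension-based guess, so no bytes are read.
        return mimetypes.guess_type(self.path)[0]

    def open(self):
        LazyFile.opened += 1
        return open(self.path, "rb")


def read_survivors(paths: List[str], prefix: str, mime: str) -> Dict[str, bytes]:
    """Filter on path and MIME type first, then open only what passed."""
    files = [LazyFile(p) for p in paths]
    keep = [f for f in files if f.path.startswith(prefix) and f.mime_type == mime]
    contents = {}
    for f in keep:
        with f.open() as handle:
            contents[f.path] = handle.read()
    return contents


def demo() -> Tuple[List[str], int]:
    """Write three small files, keep only the PDFs, and report how many were opened."""
    with tempfile.TemporaryDirectory() as root:
        paths = []
        for name in ("a.pdf", "b.txt", "c.pdf"):
            path = os.path.join(root, name)
            with open(path, "wb") as handle:
                handle.write(name.encode("utf-8"))
            paths.append(path)
        LazyFile.opened = 0
        contents = read_survivors(paths, prefix=root, mime="application/pdf")
        return sorted(os.path.basename(p) for p in contents), LazyFile.opened
```

In `demo()` the text file is rejected by the filter and never opened, so only the two PDFs cost any I/O.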
been looking into data/ai/ml stuff lately and came across @daftengine, pretty cool ngl.
distributed query engine for running data + AI workloads at scale (text, images, embeddings, all of it). turns messy data into structured outputs without a ton of infra glue.
plus it’s open source, which makes it even better.
might try building something cool with it @Sammy_Sidhu
We @TeraflopAI have worked together with @johngfriedman and @daftengine to open-source all major filings from SEC EDGAR completely for free on @huggingface. It is now more important than ever to push for open dataset releases.
Daft v0.7.9.
8 new temporal functions for Spark-compatible date arithmetic. video_frames() for column-level video decoding. Native UUID type.
Plus byte-level dashboard observability and initial ASOF join support.
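If as-of joins are new to you: each left row is matched to the most recent right row at or before its timestamp, the same idea as pandas' `merge_asof`. Below is a minimal plain-Python sketch of that backward match. It is not Daft's implementation, and Daft's default direction and tie handling are assumptions here.

```python
from bisect import bisect_right
from typing import Any, Dict, List


def asof_join(left: List[Dict[str, Any]], right: List[Dict[str, Any]],
              on: str, take: str) -> List[Dict[str, Any]]:
    """Backward as-of join: attach right[take] from the latest right row with key <= left key.

    Left rows with no earlier right row get None.
    """
    right_sorted = sorted(right, key=lambda r: r[on])
    keys = [r[on] for r in right_sorted]
    joined = []
    for row in left:
        i = bisect_right(keys, row[on])  # number of right keys <= this left key
        value = right_sorted[i - 1][take] if i > 0 else None
        joined.append({**row, take: value})
    return joined


# Trades matched to the latest quote at or before each trade time.
trades = [{"t": 1, "qty": 10}, {"t": 4, "qty": 5}, {"t": 5, "qty": 7}, {"t": 10, "qty": 2}]
quotes = [{"t": 9, "price": 102}, {"t": 2, "price": 100}, {"t": 4, "price": 101}]
result = asof_join(trades, quotes, on="t", take="price")
```

The trade at `t=1` has no earlier quote and gets `None`, while the trade at `t=5` picks up the quote from `t=4`.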
8 million SEC filings. 43 billion tokens. 590 GB spanning 20 years of corporate financial data.
Processed on 12 cores in under 24 hours for $1.10.
@EnricoShippole, @TeraflopAI, and @daftengine open-sourced the full dataset on Hugging Face.
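Some quick arithmetic on those figures, taking 24 hours as the upper bound and decimal gigabytes as an assumption. The variable names are just for illustration.

```python
FILINGS = 8_000_000
TOKENS = 43_000_000_000
SIZE_GB = 590          # ASSUMPTION: decimal GB (1 GB = 1e6 KB)
HOURS = 24             # upper bound; the post says "under 24 hours"
COST_USD = 1.10

# Average filing size in tokens and kilobytes.
tokens_per_filing = TOKENS / FILINGS
kb_per_filing = SIZE_GB * 1e6 / FILINGS

# Finishing in under 24 hours means throughput was at least this high.
min_filings_per_second = FILINGS / (HOURS * 3600)

# Cost per billion tokens processed.
usd_per_billion_tokens = COST_USD / (TOKENS / 1e9)
```

That works out to about 5,375 tokens and 73.75 KB per filing, at least 92 filings per second, and under 3 cents per billion tokens.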
“Daft is a distributed query engine that’s going to be replacing Spark.”
I remember this talk like it was yesterday 👇️
Two years ago @ SF Systems Meetup, @colin_ho99 and @desmondcheongzx walked up and floored the crowd, debuting v1 of Swordfish - our local execution engine.
Even at the time, Daft demonstrated dramatically lower memory usage than Spark on TPC-H. 17 months later we've delivered multiple iterations of Swordfish and Flotilla (distributed) with compounding adoption across top labs and startups.