Late night rant: Spark is an awesome piece of software. But a horrible developer experience.
What happened to OSS that was simply `apt install` and 🚀? Why should software be excused for slow local performance because it was built for "production scale"?
So much of "big data" JVM-based tooling was hacked together in the massive datacenters of tech giants. The world has changed, and so too must our big data tooling.
⭐️ Rust: self-contained compiled native binaries that have no dependencies. Hello, clean installs, my old friend.
🐍 Python: the undeniable winner of iterative plumbing for data/ML. Build with a Python API in mind. Using the JVM through a Py4J gateway should be an automatic disqualification.
☁️ Cloud: Build cloud-first, lightweight, ephemeral software. Cattle vs Pets. S3, not NFS/HDFS. Spot instances, not machines on a rack.
🤓 Dev UX: build for the single developer, on their laptop, then think about scaling. A docker-compose local dev story is lazy bundling of overly complicated software.
☀️ Open Formats: let software TALK to each other, so devs can choose the right tool for the right job, and so devs can keep building better tooling. This is why JSON is awesome. Arrow is awesome. Iceberg is awesome. Parquet and CSV are (I begrudgingly admit) somewhat awesome. And please build flexible SDKs for these formats, in C++ or Rust, not just for the JVM.
First-class observability in Daft.
Operators, Tasks, Rows, and Memory are all surfaced in a dashboard that ships with the install.
+ OTel endpoints for your existing collector.
+ Stuck detection.
+ DAFT_TRACE for console debugging.
~45 PRs across the observability stack.
https://t.co/HWyjBiaePN
A conversation with @sirupsen on scaling Shopify, building turbopuffer, and the future of databases.
0:00 - Scaling Shopify through flash sales and outages
8:13 - How top infrastructure teams collaborated in the 2010s
10:35 - Engineering principles from Logrus and on-call
17:38 - The story behind Simon’s famous-ish blog, Napkin Math
23:05 - Why new database companies keep winning
32:21 - How Simon became a fan of databases
35:45 - AI coding, and where agents still fail
42:10 - Hiring P99 engineers in the AI era
48:45 - What’s next for databases
This week's Physical AI Newsletter is packed with updates.
Definitely check out the survey on World Action Models.
Not only does it clarify the differences between VLAs, World Models, and World Action Models, but it also contextualizes the algorithm and training strategies for all of the models being released.
daft.VideoFile is perfect for Physical AI.
Open X-Embodiment aggregates over a million episodes. DROID alone runs 350+ hours of multi-camera 60fps footage. That's hundreds of millions of frames across a single dataset, and most action-model training doesn't need them all.
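Quick napkin math on that "hundreds of millions of frames" figure. This is only a sketch: the hours and fps come from the post, but the number of camera streams is an assumption (the post just says "multi-camera"), and it assumes the 350 hours are recording time captured by every camera at once.

```python
# Back-of-the-envelope frame count for the DROID numbers quoted above.
HOURS = 350          # "350+ hours" from the post, used as a lower bound
FPS = 60             # frame rate quoted in the post
CAMERA_STREAMS = 3   # ASSUMPTION: the post only says "multi-camera"


def total_frames(hours: float, fps: int, streams: int = 1) -> int:
    """Frames in `hours` of footage at `fps`, recorded by `streams` cameras."""
    return int(hours * 3600 * fps * streams)


per_stream = total_frames(HOURS, FPS)
all_streams = total_frames(HOURS, FPS, CAMERA_STREAMS)

print(f"{per_stream:,} frames per camera stream")
print(f"{all_streams:,} frames across {CAMERA_STREAMS} streams")
```

Even one stream is tens of millions of frames, which is why decoding only the frames you need matters.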
- read_video_frames — filter on keyframes; supports S3, GCS, & YouTube URLs.
- video_metadata — resolution, fps, duration, frame count from file headers.
- video_frames(start_time, end_time) — decode a 10-second window from a 90-minute file.
Frames land as Image columns in the same DataFrame.
Feed them to a vision model, compute embeddings, and write to Iceberg.
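To see why a time-window decode is such a win, here is a plain-Python sketch of the arithmetic behind mapping a (start_time, end_time) window to frame indices. This is not Daft's implementation, and `frame_window` is a hypothetical helper; the 30 fps frame rate and the 600-second start offset are assumptions picked for illustration. Real decoders also have to seek to the nearest keyframe, which this ignores.

```python
import math


def frame_window(start_time: float, end_time: float, fps: float) -> range:
    """Indices of frames whose timestamps (index / fps) fall in [start_time, end_time)."""
    if end_time < start_time:
        raise ValueError("end_time must be >= start_time")
    first = math.ceil(start_time * fps)
    last = math.ceil(end_time * fps)
    return range(first, last)


FPS = 30.0            # ASSUMPTION: the post does not give a frame rate for this file
DURATION_S = 90 * 60  # the 90-minute file from the post

window = frame_window(600.0, 610.0, FPS)  # a 10-second window starting at 10:00
total = int(DURATION_S * FPS)

print(f"decode {len(window)} of {total} frames ({len(window) / total:.2%})")
```

Under these assumptions the window touches well under 1% of the file's frames.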
Check out the blog
https://t.co/Ucn3SzF12g
We’re launching @JudgmentLabs today and announcing $32M in funding.
As AI agents take on more of the work that creates economic value, they generate massive amounts of production data: the clearest record of how they behave with users, software, and the real world.
Judgment builds infrastructure for improving AI agents from production data.
Probably my favorite episode yet!
Just finished filming our latest episode of Zero Shot Espresso with @danimberman, who is an @ApacheAirflow PMC member, developed the @kubernetesio executor, and now helps technical teams ship production AI as a consultant.
We chatted about how open-source software is changing in the AI era, what it's like running a solo consulting business, and the biggest difference between senior and principal engineers.
🚢 Daft v0.7.10
30 contributors (a release record!)
41 new features and functions.
Distributed as_of joins, SimHash dedupe, temporal arithmetic, C++ extensions.
https://t.co/Blit46bYww
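If as_of joins are new to you, here is a minimal single-process sketch of the backward as-of semantics in plain Python. It is not Daft's API or its distributed implementation; `as_of_join` and the trades/quotes data are hypothetical, made up for illustration.

```python
from bisect import bisect_right


def as_of_join(left, right):
    """Backward as-of join over (timestamp, value) pairs.

    Each left row is paired with the most recent right row whose timestamp
    is <= the left timestamp, or with None when no such row exists.
    """
    right_sorted = sorted(right, key=lambda row: row[0])
    right_times = [ts for ts, _ in right_sorted]

    joined = []
    for ts, left_value in left:
        # Number of right rows with timestamp <= ts.
        idx = bisect_right(right_times, ts)
        match = right_sorted[idx - 1][1] if idx > 0 else None
        joined.append((ts, left_value, match))
    return joined


# Hypothetical data: match each trade with the latest quote at or before it.
trades = [(1, "t0"), (5, "t1"), (10, "t2")]
quotes = [(2, 100.0), (4, 101.5), (10, 99.0)]

result = as_of_join(trades, quotes)
print(result)
```

The binary search keeps each lookup at O(log n) once the right side is sorted; a distributed version additionally has to partition both sides so matching rows end up together.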
The fastest H3 geospatial indexing in Daft wasn't written by the Daft team.
Developed by Garrett Weaver, daft-h3 runs 3–16x faster than simply wrapping h3-py in a Python UDF. That speedup is thanks to Daft's Native Extensions, powered by Apache Arrow's C Data Interface.
The pace of multimodal AI is actually crazy right now
I think this is it. I’ve been crying wolf for 4 straight years, but I think it’s coming for real now. We’re about to see that ChatGPT moment very, very soon.
Most image embedding pipelines are actually two pipelines stitched together.
Script one: PySpark reads images from S3, resizes them, joins with metadata, writes to Delta Lake.
Script two: PyTorch loads ResNet, generates embeddings on GPU, writes back to Delta Lake.
Two frameworks. Two sets of dependencies. Two GPU configs. Serialization overhead at every boundary.
With Daft, it's one script. download → resize → join → embed → write. daft.cls handles GPU placement and batching. No handoff.
Kind of a bummer that the first result on Google for "how to build a claude skill" is an @AnthropicAI PDF??
My CC can't read this without going into a pypdf install death spiral...
We need an md gist instead... https://t.co/GpBLzKDzEo