My eyes spy a killer 🔥 new feature coming out soon in @lancedb: "first-class support for table branching across the Rust, Python and TypeScript SDKs".
The team is COOKING!!
🥳
https://t.co/myj62XERbs
At @lancedb, these kinds of problems around ease of experimentation, exploration, reusability and reproducibility, along with the immense scale of data that AI researchers are handling on a daily basis, are fundamental to how we think about where AI data infra is headed.
Read the blog for more details, and watch out for more interesting posts, including benchmarks and training experiments, coming soon!
6/6
https://t.co/tqV57Bc9ty
Dataset curation looks like a search problem at first, but for training pipelines the hard part is everything around it: filtering, deduping, sampling, inspection, materializing the subset, and being able to reproduce the decision later.
I wrote about how we're thinking about this at @lancedb for the next generation of AI data infrastructure:
https://t.co/tqV57Bc9ty
1/6
From a user perspective, the end state is not simply "I found a useful subset."
It's a durable training + search artifact: a Lance table, accessible in @lancedb with the selected rows, raw data, embeddings, metadata, indexes, provenance, and version history.
Downstream feature engineering, search, analytics and training workflows can pin to the same curated dataset instead of reconstructing it from an ad hoc notebook state that knows nothing about data provenance.
5/6
Lots of new followers, so allow me to say hi! I'm an AI engineer, with a passion for technical communication and writing good content that helps people get productive with amazing open source tools and frameworks.
I've been writing a lot with AI lately, and it's so much better and more fun than before. Look for more stuff from me over the coming weeks!
So cool! @nvidia's Cosmos 3 team (building next gen Physical AI infrastructure) is building on top of Lance!
"Unified data layer: SILA organizes data curation as a unified columnar Lance dataset (Pace et al., 2025), where each row represents a data sample and each typed column represents a curation signal such as a caption, tag, quality score, or annotation."
Link to the full report: https://t.co/CpcQZZXgHf
Cosmos 3 by @nvidia released today: a frontier omnimodal world model for Physical AI.
For the data infrastructure behind it, they built on Lance.
SILA, NVIDIA's internal curation platform, processes tens of billions of multimodal training candidates as a single Lance dataset. Curation signals, embeddings, and vector indexes all in one table. No separate vector DB.
One table from raw data to training-ready.
@raveeshbhalla @LinghuaJ @georgehe0 @cocoindex_io No, just got back to Toronto. But I'll make it a point to let folks know when I'm next in SF, life always brings me back there eventually
It's been amazing to see the vision that @LinghuaJ and @georgehe0 have coming to fruition. They're building out in the open, engaging with users, and are approaching building a thriving OSS community in all the right ways.
I can't wait to see where @cocoindex_io goes in the coming months! Will definitely be writing more about it, but if you haven't already, do check it out!
Catching up with the amazing @tech_optimist this week on AI & data infra and Rust 🦀 for AI. Excited for what's coming on @cocoindex_io and @lancedb and many amazing projects in the space.
When friends are in town, we crack coconuts together!
@sdhilip @cocoindex_io @LinghuaJ This is really cool! Glad it's showing such value, it's really solving SUCH a relevant problem for AI/data engineers. I was hooked the moment I saw it myself.
Congrats @sdhilip on shipping his 10th project in production, and thanks for continuously sharing what you build with @cocoindex_io with the community!
CocoIndex helps agents with fresh data views from your source. It is an incremental engine that continuously takes codebases, meeting notes, and any dynamic content and transforms them into insights for AI.
We are grateful to have such a great community. Many of our users have been with us for a long time, since our early launches, and continuously give us feedback. Thank you all and let's go!