This is generally right, but in practice it's important to differentiate between point reads (get this key) and scans (get this range of keys).
SQL-like workloads have a mix of both, and their proportion changes the trade-offs here substantially.
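A minimal sketch of the distinction, using hypothetical in-memory structures rather than any real storage engine: a hash map serves point reads directly but has no key order, so a range scan needs a sorted view of the keys. The names `rows`, `point_read`, and `scan` are illustrative assumptions.

```python
import bisect

# Hypothetical in-memory data; keys and values are illustrative only.
rows = {"k03": "c", "k01": "a", "k07": "g", "k05": "e"}

def point_read(key):
    """Point read: look up one key without touching any other key."""
    return rows.get(key)

# A scan needs the keys in sorted order to locate a contiguous range.
sorted_keys = sorted(rows)

def scan(lo, hi):
    """Range scan: return (key, value) pairs with lo <= key < hi, in key order."""
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_left(sorted_keys, hi)
    return [(k, rows[k]) for k in sorted_keys[start:end]]
```

A workload dominated by `point_read` can live with the hash map alone; once `scan` matters, the cost of keeping keys ordered becomes part of the trade-off.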
Let's go back to the 70s!
📣 Introducing databow: one command-line tool to query them all, built with Rust and ADBC.
Query any SQL source that has an ADBC driver (30+ and counting) right from your terminal with one simple CLI.
To install 👉 uv tool install databow
Link to announcement in the comments👇
dbt Fusion is the Python-to-Rust rewrite that started at SDF Labs before it was acquired by dbt Labs. After a huge refactoring effort over the last few months, we are announcing dbt Core 2.0 – the open-source slice of the Fusion codebase that is continuously updated w/ Copybara.
A pre-condition that can make a lexer/parser run much faster is assuming that the input is already validated and the job of the parser is just extracting the contents from the input.
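As a rough illustration of that pre-condition (a sketch, not taken from any particular parser): the hypothetical `extract_uints` below assumes its input was already validated as a non-empty, comma-separated list of ASCII decimal digits, so the loop does no error checking at all and only extracts values.

```python
def extract_uints(validated):
    """Extract unsigned ints from bytes like b"12,7,300".

    Assumption: the caller already validated the input (digits and commas
    only, no empty fields). Malformed input yields garbage, not an error.
    """
    out = []
    value = 0
    for b in validated:
        if b == 0x2C:  # ASCII ','
            out.append(value)
            value = 0
        else:
            # No digit check: trust the validation pass.
            value = value * 10 + (b - 0x30)
    out.append(value)
    return out
```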
I just had the chance to watch Samyak Sarnayak's talk about cancellation safety and async Rust (and how a `&mut` can lead to a deadlock). If this is a topic that interests you, I recommend checking it out: https://t.co/TX85yIgNgE
When a partitioning change to our petabyte-scale ClickHouse cluster caused critical billing jobs to stall, standard metrics showed no obvious errors. Here's how we identified severe lock contention in ClickHouse's query planner and built upstream patches to fix it. https://t.co/C4UF6RJTp6
get-if-not-match is important for building fast databases on object storage. used in e.g. tpuf for the WAL check to make sure the cache has the latest data.
of the big 3 (Azure/S3/GCS), it may surprise many that Azure comes out the winner!
(S3X is S3 one-zone, GCR is GCP's equivalent)
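A minimal sketch of the conditional read described above, assuming it resembles an HTTP GET with an `If-None-Match` header: the store returns the body only when the caller's ETag is stale. `ObjectStore` and `CachedReader` are hypothetical in-memory stand-ins, not a real cloud SDK or turbopuffer's implementation.

```python
import hashlib

class ObjectStore:
    """Toy in-memory stand-in for an object store (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, body):
        # Use a content hash as the ETag for this sketch.
        etag = hashlib.sha256(body).hexdigest()
        self._objects[key] = (etag, body)
        return etag

    def get_if_none_match(self, key, etag):
        """Return (etag, body) if the object changed, else None (like a 304)."""
        current_etag, body = self._objects[key]
        if current_etag == etag:
            return None
        return current_etag, body

class CachedReader:
    """Keeps the last seen body and only transfers it again when it changed."""

    def __init__(self, store, key):
        self.store = store
        self.key = key
        self.etag = None
        self.body = None
        self.fetches = 0  # number of times a full body was transferred

    def read(self):
        result = self.store.get_if_none_match(self.key, self.etag)
        if result is not None:
            self.etag, self.body = result
            self.fetches += 1
        return self.body
```

The check still costs one round trip per read, but an unchanged object transfers no body, which is why the latency of this call matters for a cache in front of object storage.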
The founders of FloeDB (@markcusack + Kurt Westerfeld) gave an interesting talk with @CMUDB about their new @ApacheIceberg-compatible query engine. Two key takeaways from their talk:
1⃣ Floe is a hard fork of @YellowbrickData.
2⃣ Floe is building a "catalog-of-catalogs"
https://t.co/BzovMq4AVP
At AWS we're big Rust users. Lambda, DSQL, S3, EC2, Bedrock, and many more run Rust code.
Dial9 is a new tool, built at AWS, for diving deep into the performance of tokio-based applications.
Good work, Russel!
The 2nd edition of Designing Data-Intensive Applications, by @martinkl and me, is finished and sent to the printers! Ebooks available next week, and print books in 3–4 weeks. Sigh of relief. 😅
(BTW, this is a good opportunity to support your favourite local bookshop!)
Much of the credit should go to sqlglot’s test suite. For projects of this type, the test suite is the “source” and the code can be generated pretty much any way you like.
Ever wondered how an engine actually reads an Iceberg table?
Iceberg read path in one line:
Catalog → Metadata → Manifest list → Manifest files → Data files
Apache Iceberg Read Path (Engine → Table)
When an engine reads an Iceberg table, it walks this chain from top to bottom:
1) Catalog
The starting point.
Stores a pointer to the table’s current metadata file, which represents the latest snapshot reference.
2) Metadata File
Defines the table schema, lists snapshots, and references the manifest list for the snapshot being read.
3) Manifest List
Tracks all manifest files associated with the selected snapshot.
4) Manifest Files
Contain metadata about data files, including partition values and file-level statistics, which help determine which files should be read.
5) Data Files
The actual table data is stored in object storage. This is what the engine ultimately reads.
Why this matters
During reads, Iceberg resolves the snapshot through the catalog and metadata layers, then uses manifest metadata to identify the exact set of data files for that snapshot.
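The chain above can be sketched with plain dictionaries. Everything here is a simplified, hypothetical stand-in: the file names, the `min_id`/`max_id` statistics, and `plan_files` are assumptions for illustration, and real Iceberg metadata carries many more fields.

```python
# 1) Catalog: table name -> pointer to the current metadata file.
catalog = {"db.events": "metadata/v2.json"}

# Hypothetical contents of the files the engine would fetch from storage.
files = {
    # 2) Metadata file: snapshots and the manifest list of each one.
    "metadata/v2.json": {
        "current_snapshot_id": 2,
        "snapshots": {1: "snap-1.avro", 2: "snap-2.avro"},
    },
    # 3) Manifest lists: the manifest files belonging to a snapshot.
    "snap-1.avro": ["manifest-a.avro"],
    "snap-2.avro": ["manifest-a.avro", "manifest-b.avro"],
    # 4) Manifest files: data file paths plus file-level statistics.
    "manifest-a.avro": [
        {"path": "data/f1.parquet", "min_id": 1, "max_id": 100},
    ],
    "manifest-b.avro": [
        {"path": "data/f2.parquet", "min_id": 101, "max_id": 200},
        {"path": "data/f3.parquet", "min_id": 201, "max_id": 300},
    ],
}

def plan_files(table, id_value, snapshot_id=None):
    """Return the data files (step 5) that may contain rows with id == id_value."""
    metadata = files[catalog[table]]
    if snapshot_id is None:
        snapshot_id = metadata["current_snapshot_id"]
    manifest_list = files[metadata["snapshots"][snapshot_id]]
    selected = []
    for manifest in manifest_list:
        for entry in files[manifest]:
            # File-level statistics let the engine skip files that cannot match.
            if entry["min_id"] <= id_value <= entry["max_id"]:
                selected.append(entry["path"])
    return selected
```

Passing an older `snapshot_id` walks the same chain from a different manifest list, which is roughly how a time-travel read would pick a different set of data files.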
Using Cap’n Proto in Rust to get zero-copy deserialization, but Codex insists on writing de/serialization functions that build a struct full of heap-allocated strings. Because that’s what dominates the training set of models and programmers everywhere.
SIMD can produce insane yields but it's worth bearing in mind that at least some of the yield isn't from the SIMD instructions, it's from disciplining the programmer into writing branchless functions and pipelines
the SIMD is the cheese at the end of the bit-hacking maze
A somewhat academic talk about the AI use cases driving changes in @ApacheParquet and new formats in "Column Storage for the AI Era"
Recording: https://t.co/f4HxgyMZcb
Slides: https://t.co/sq3auKoojo