Creator of SQLite on pull requests: "You say, oh, it's free. No. It's not free. What you're doing is asking me ... to maintain it for you, to to document it for you, to test it for you, to maintain it for you for the next 25 years. That's not free." Yep.
Wise words from a wiser man than me. I've told people for the past decade and I have recent posts on here saying the same: the merge button is the easy part. It's the decade+ (Richard says 25 years) that follows, where you've accepted the transfer of maintenance, that's hard.
fun fact: during the keynote, Apple cuts out a small slice around 3k, 4k, 5k and 6kHz whenever they say "Siri", so that not everyone's HomePods start talking back 🗣️🚫
a stroll on planet pgsql helped me come across this, yet another amazing blog by @BonesMoses.
pg19 is getting query hints, except they're called "plan advice". the implementation is better than any hint system i've seen elsewhere. it took so long because they wanted to make sure that if they added hints, the hints wouldn't suck.
now, pg19 brings `pg_plan_advice`. it lets you nudge the postgres planner toward a specific scan or join strategy without overriding its judgment entirely.
if you'd like to learn more, you should read shaun's blog on this.
smolvm has hit stable release: v1.0.0!
You can now fork smolvm.
It means you can fork to create virtual machines off of an existing one in less than 100ms, with all the processes cloned and running.
smolvm is the first to have this feature + cross-platform compatibility (macOS and Linux natively).
Here's a demo of a counter continuing on a forked clone while I only started it on the original!
"This is a protectionist tale as old as time. And the justifications are just as tired: It's about quality! It's about attribution! It's about workers! Spare me. It's about you, your insecurities, and your privileges." https://t.co/SP6DubrXXh
I've got an agent in a loop optimizing a renderer with the goal to minimize frame times (and tests to measure). It got times down from 88ms to 2ms and allocations down from ~150K to 500. Sounds good, right? Wrong. This is exactly why agent psychosis is a big fucking problem.
As an experiment, I rewrote the Ghostty core render state in Go, with access to identically laid out data structures as Ghostty and the exact same validation tests. I made a purposely naive renderer (simple, correct, but slow). 88ms per frame with 150,000 allocations (horrendous, lol)!
I then kickstarted a Ralph loop to bring the frame times down. I told it it can't modify input data structures or the public API or tests (they're correct), but it can do anything else it wants. It got to work.
It has worked for about 4 hours. I've spent around $350 on this experiment so far. The results?
88ms => 1.5ms
150K allocs => ~500 allocs
Incredible right? Nope.
My hand-written renderer I ported has frame times (same benchmark) of ~20us (0.020ms) and 0 allocations in the update path.
This is the problem with psychosis and lacking systems understanding. If you don't understand the system, you're going to accept that this is an incredible result. If you understand the system, you'll see better solutions immediately and can do roughly 75x better on throughput.
The people who blindly trust agent output are in the former camp. They're sheeple, overdrinking from a fountain of mediocrity.
Standard disclaimer: I use AI all the time. I like AI. The point I'm making is to not blindly accept results. Think. Analyze. Learn.
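The "roughly 75x" figure follows directly from the numbers quoted above. A quick sketch of the arithmetic (the helper name `speedup` is made up for illustration; the timings are the ones reported in this thread):

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster `after_ms` is compared to `before_ms`."""
    return before_ms / after_ms

naive_ms = 88.0          # purposely naive renderer
agent_ms = 1.5           # after ~4 hours of the agent loop
hand_written_ms = 0.020  # hand-written port, ~20us per frame

agent_gain = speedup(naive_ms, agent_ms)            # about 58.7x over the naive version
remaining_gap = speedup(agent_ms, hand_written_ms)  # about 75x still left on the table

print(f"agent vs naive: {agent_gain:.1f}x")
print(f"hand-written vs agent: {remaining_gap:.1f}x")
```

The agent's gain over the naive baseline looks huge in isolation, but the gap it leaves to the hand-written renderer is even larger.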
"Because we don't necessarily know at this point"
- commit from 2004 that still exists in Postgres today.
The below screenshot is from the `analyze.c` file in the Postgres source code. The number 300 is a hardcoded value inside of Postgres's ANALYZE code. The rationale is based on a paper entitled "Random sampling for histogram construction: how much is enough?" written in 1998 when data sizes were much smaller and hardware was much slower. The question the paper answers is: to build statistics enabling optimization of queries of unindexed data, how many rows does ANALYZE need to sample to build accurate enough statistics?
The answer is 300-ish samples for each bin you want in your equi-height histogram. Why? The paper shows that required sample size grows linearly with the number of bins but only logarithmically with table size for most cases, so you see diminishing returns beyond a few hundred samples per bin.
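To see where "300-ish" comes from, here is a minimal sketch. It assumes the bound quoted in the `analyze.c` comment is r = 4 * k * ln(2n / gamma) / f^2 (k bins, table size n, maximum relative bin-size error f, error probability gamma) with f = 0.5, gamma = 0.01 and n = 10^6. Both the formula and the parameter values are worth checking against the source file, and `min_sample_rows` is a hypothetical name.

```python
import math

def min_sample_rows(k: int, n: float = 1e6, f: float = 0.5, gamma: float = 0.01) -> float:
    """Sample size bound r = 4 * k * ln(2n / gamma) / f^2 (assumed form, see lead-in)."""
    return 4 * k * math.log(2 * n / gamma) / f ** 2

per_bin_1e6 = min_sample_rows(k=1)            # rows per bin for a 10^6-row table
per_bin_1e12 = min_sample_rows(k=1, n=1e12)   # rows per bin for a 10^12-row table

print(f"n = 10^6:  {per_bin_1e6:.2f} rows per bin")
print(f"n = 10^12: {per_bin_1e12:.2f} rows per bin")
```

Under these assumptions the per-bin requirement is about 305.8 rows at a million rows, and a table a million times larger needs less than twice as many rows per bin. That weak dependence on table size is why a fixed multiplier of 300 is workable.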
For instance, the default statistics target (`default_statistics_target`) is 100. That means Postgres aims to sample 300 x 100 = 30,000 values to build an equi-height histogram with 100 bins, while also storing the 100 most common values.
(Check out the previous post for deets on how Postgres uses equi-height histograms and most common values)
Why all this work for unindexed data?
Because in 1998, indexes were extraordinarily costly to build and maintain. Indexes took up valuable disk space and used limited IOPS during writes and builds. Additionally, table scans were slow and blocking. In 1998, hard drive performance was measured in RPM, so IOPS varied: random page seeks required waiting for the disk to rotate, and the location on disk was unknown. The tests for this paper ran on a 200MHz Pentium with 64MB of RAM and a 7.2k RPM SCSI drive.
Postgres users continue to benefit from this work, done during an era of constrained resources. Indexes aren't free today, and you can have too many indexes, but they aren't as costly as they were. Also, unindexed data isn't as costly as it was.
The paper also acknowledges the problem is "provably difficult by establishing a limit on the achievable accuracy of estimation in the worst-case." Thus, "we devise a simple estimator which we believe is optimal." This number is a tradeoff between accuracy and performance. A smaller multiplier would lead to less accurate statistics, which could cause the planner to make bad decisions. A larger multiplier would lead to more accurate statistics, but it would also make ANALYZE slower. And remember, ANALYZE was much, much slower back then.
What does statistics_target control?
The statistics target controls the number of values stored for Most Common Values and the Equi-height Histogram. The following is true:
```
statistics_target = 100 → 30,000 samples, 100 MCVs, 100 buckets
statistics_target = 500 → 150,000 samples, 500 MCVs, 500 buckets
statistics_target = 1000 → 300,000 samples, 1000 MCVs, 1000 buckets
```
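The mapping above is a single multiplication. A tiny sketch, assuming the 300 multiplier described earlier (`sample_rows` is a hypothetical helper, not a Postgres function):

```python
SAMPLES_PER_BUCKET = 300  # multiplier hardcoded in ANALYZE, per the post

def sample_rows(statistics_target: int) -> int:
    """Rows ANALYZE aims to sample for a given statistics target."""
    return SAMPLES_PER_BUCKET * statistics_target

for target in (100, 500, 1000):
    print(f"statistics_target = {target} -> {sample_rows(target):,} samples")
```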
The statistics target has a default set at the database level, and it can be overridden at the column level.
```
-- Per-column override:
ALTER TABLE requests ALTER COLUMN status_code SET STATISTICS 500;
ANALYZE requests;
```
For larger databases, there is usually at least one column where a per-column setting may be the right approach. Don't raise the global default just because one column needs more granularity. Given the performance gains of the underlying hardware, the performance gains from changing column statistics aren't as significant as they once were.
Here’s an early signal of what I said
For almost a year now, we at @soketlabs have been working on curating a frontier-scale pretraining data corpus and on finding the best architecture that fits the diversity of languages while being compute-optimal for both training and inference.
Sharing one of the many successes we have encountered. Our current version of the arch (yeah, it's not a clone of DeepSeek or any other known arch) is at least 30% more compute-optimal than DeepSeek’s sparse MoE. These are just initial results and we hope to find a lot more. It shows that we have a lot more to learn about these architectures.
Also excited about the pretraining data we have curated but more on that later
Research efforts take time, but they also yield exponential outcomes, and that's what most people in India should be building towards