Saurabh Sawant @sausaw - Twitter Profile

sausaw retweeted

2 days ago

Creator of SQLite on pull requests: "You say, oh, it's free. No. It's not free. What you're doing is asking me ... to maintain it for you, to document it for you, to test it for you, to maintain it for you for the next 25 years. That's not free." Yep. Wise words from a wiser man than me. I've told people for the past decade, and I have recent posts on here saying the same: the merge button is the easy part. It's the decade+ (Richard says 25 years) that follows, where you've accepted the transfer of maintenance, that's hard.

58

6K

396

828

270K

sausaw retweeted

Seth Rosen

@sethrosen

11 days ago

I just open sourced my "Is this slop?" simple test

116

19K

1K

456K

sausaw retweeted

luuk de leest @luuk58

11 days ago

fun fact: during the keynote, Apple cuts out a slice at 3k, 4k, 5k and 6kHz whenever they say "Siri", so that not everyone's HomePods start talking back 🗣️🚫 https://t.co/x13WbNPztr
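The tweet doesn't say how those bands are removed. As a rough illustration only, the sketch below cascades four standard biquad notch filters (the common audio-EQ cookbook form) at 3, 4, 5 and 6 kHz. The 48 kHz sample rate, the Q value, and every function name here are assumptions for the example, not anything Apple has described.

```python
import math

def notch_coefficients(f0, fs, q=30.0):
    """Normalized biquad notch coefficients centred on f0 Hz (q is an assumed width)."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b = (1 / a0, -2 * math.cos(w0) / a0, 1 / a0)
    a = (1.0, -2 * math.cos(w0) / a0, (1 - alpha) / a0)
    return b, a

def biquad(samples, b, a):
    """Run one second-order section (direct form I) over the samples."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out

def remove_bands(samples, fs, bands=(3000, 4000, 5000, 6000)):
    """Cascade one notch per band; the band list comes from the tweet."""
    for f0 in bands:
        b, a = notch_coefficients(f0, fs)
        samples = biquad(samples, b, a)
    return samples

def tone(freq, fs, n):
    """Generate n samples of a unit-amplitude sine at freq Hz."""
    return [math.sin(2 * math.pi * freq * i / fs) for i in range(n)]

def rms(xs):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))
```

Feeding a 4 kHz tone through `remove_bands` should leave almost nothing once the filters settle, while a 1 kHz tone should pass through nearly unchanged.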

115

25K

970

2K

1M

sausaw retweeted

Devi Parikh

@deviparikh

11 days ago

Claude Code wrote a JavaScript Madhubani generator.

36

799

49

281

51K

Saurabh Sawant @sausaw

11 days ago

https://t.co/94FvQPRVIq is kind of a big deal

0

1

15

sausaw retweeted

Noah Smith 🐇🇺🇸🇺🇦🇹🇼

@Noahpinion

13 days ago

Pretty damn impressive.

60

3K

358

427

136K

sausaw retweeted

Charles Patterson

@CharlesPattson

13 days ago

Glass on web has been achieved. The greatest invention since the drop shadow.

68

6K

177

3K

880K

sausaw retweeted

antirez @antirez

13 days ago

[blog] A new era for software testing: https://t.co/J52JqzKyfh

23

616

72

536

62K

sausaw retweeted

winit @hiwinit

13 days ago

a stroll on planet pgsql helped me come across this, yet another amazing blog by @BonesMoses.

pg19 is getting query hints, except they're called "plan advice". the implementation is better than any hint system i've seen elsewhere. it took so long because the goal was to make sure that if they add hints, they don't suck.

now, pg19 brings `pg_plan_advice`. it lets you nudge the postgres planner toward a specific scan or join strategy without overriding its judgment entirely.

if you'd like to learn more, you should read shaun's blog on this.

1

28

4

11

2K

sausaw retweeted

BinBin

@binsquares

14 days ago

smolvm has hit stable release: v1.0.0! You can now fork smolvm. It means you can fork to create virtual machines off of an existing one in less than 100ms, with all the processes cloned and running. smolvm is the first to have this feature + cross platform compatibility (macOS and linux natively). Here's a demo of a counter continuing on a forked clone while I only started it on the original!

31

286

24

243

55K

sausaw retweeted

DHH

@dhh

19 days ago

"This is a protectionist tale as old as time. And the justifications are just as tired: It's about quality! It's about attribution! It's about workers! Spare me. It's about you, your insecurities, and your privileges." https://t.co/SP6DubrXXh

66

692

55

198

108K

sausaw retweeted

Vala Afshar

@ValaAfshar

20 days ago

“A bird does not sing because it has an answer. It sings because it has a song.”

5

108

11

16

14K

sausaw retweeted

Dialed In Bookkeeping

@DIBookkeeping

21 days ago

If you have toddlers who want to work on your computer, show them this https://t.co/8jivomcgJx

26

1K

106

2K

228K

sausaw retweeted

kai

@ssvankai

22 days ago

accidentally stumbled upon this crazy paper abstract

436

27K

2K

5K

1M

sausaw retweeted

Austin

@IamAroke

23 days ago

Raw SQL will always outperform your ORM.

35

225

2

21

27K

sausaw retweeted

Mitchell Hashimoto

@mitchellh

22 days ago

I've got an agent in a loop optimizing a renderer with the goal to minimize frame times (and tests to measure). It got times down from 88ms to 2ms and allocations down from ~150K to 500. Sounds good, right? Wrong. This is exactly why agent psychosis is a big fucking problem.

As an experiment, I rewrote the Ghostty core render state in Go, with access to identically laid out data structures as Ghostty and the exact same validation tests. I made a purposely naive renderer (simple, correct, but slow). 88ms per frame with 150,000 allocations (horrendous, lol)!

I then kickstarted a Ralph loop to bring the frame times down. I told it it can't modify input data structures or the public API or tests (they're correct), but it can do anything else it wants.

It got to work. It has worked for about 4 hours. I've spent around $350 on this experiment so far.

The results?

88ms => 1.5ms
150K allocs => ~500 allocs

Incredible right? Nope. My hand-written renderer I ported has frame times (same benchmark) of ~20us (0.020ms) and 0 allocations in the update path.

This is the problem with psychosis and lacking systems understanding. If you don't understand the system, you're going to accept that this is an incredible result. If you understand the system, you'll see better solutions immediately and can do roughly 75x better on throughput. The people who blindly trust agent output are in the former camp. They're sheeple, overdrinking from a fountain of mediocrity.

Standard disclaimer: I use AI all the time. I like AI. The point I'm making is to not blindly accept results. Think. Analyze. Learn.
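For reference, the ratios behind those figures can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted in the post; the `speedup` helper is a made-up name for the example.

```python
def speedup(before_ms, after_ms):
    """How many times faster 'after' is than 'before' (both in milliseconds)."""
    return before_ms / after_ms

# Naive renderer versus the agent-optimized one, using the post's figures.
agent_gain = speedup(88, 1.5)

# Agent-optimized renderer versus the hand-written one (~20us = 0.020ms).
hand_written_gain = speedup(1.5, 0.020)

print(f"agent loop: {agent_gain:.1f}x, hand-written over agent: {hand_written_gain:.0f}x")
```

The second ratio is where the post's "roughly 75x" comes from: 1.5ms divided by 0.020ms.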

308

9K

978

2K

794K

sausaw retweeted

Crunchy Data

@crunchydata

24 days ago

"Because we don't necessarily know at this point" - commit from 2004 that still exists in Postgres today.

The below screenshot is from the `analyze.c` file in the Postgres source code. The number 300 is a hardcoded value inside of Postgres's ANALYZE code. The rationale is based on a paper entitled "Random sampling for histogram construction: how much is enough?" written in 1998 when data sizes were much smaller and hardware was much slower. The question the paper answers is: to build statistics enabling optimization of queries of unindexed data, how many rows does ANALYZE need to sample to build accurate enough statistics?

The answer is 300-ish samples for each bin you want in your equi-height histogram. Why? The paper shows that required sample size grows linearly with the number of bins but only logarithmically with table size for most cases, so you see diminishing returns beyond a few hundred samples per bin.
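To make "equi-height" concrete, here is a minimal sketch of building such a histogram from a random sample. It is not Postgres's ANALYZE code: the function names are invented for the example, the input is a plain Python list, and it ignores details such as separating out most common values first.

```python
import random

SAMPLES_PER_BUCKET = 300  # assumption: mirrors the multiplier described above

def equi_height_bounds(sample, buckets):
    """Return buckets + 1 boundaries so each bucket covers about the same number of sampled values."""
    ordered = sorted(sample)
    last = len(ordered) - 1
    return [ordered[(i * last) // buckets] for i in range(buckets + 1)]

def build_histogram(rows, buckets, seed=0):
    """Sample roughly 300 rows per bucket, then cut the sample at equal quantiles."""
    rng = random.Random(seed)
    size = min(len(rows), SAMPLES_PER_BUCKET * buckets)
    sample = rng.sample(rows, size)
    return equi_height_bounds(sample, buckets)

rows = list(range(200_000))  # stand-in for a column's values
bounds = build_histogram(rows, buckets=100)
print(len(bounds), bounds[:3], bounds[-3:])
```

Because the boundaries are taken at equal quantiles of the sample, each bucket represents about the same fraction of rows, whatever the spread of the values.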

For instance, the default statistics target (the `default_statistics_target` setting) is 100. That means Postgres aims to sample 300 x 100 = 30,000 rows to build an equi-height histogram with 100 bins, while also storing up to 100 most common values.

(Check out the previous post for deets on how Postgres uses equi-height histogram and most common values)

Why all this work for unindexed data?

Because in 1998, indexes were extraordinarily costly to build and maintain. Indexes took up valuable disk space and consumed limited IOPS during writes and builds. Additionally, table scans were slow and blocking. In 1998, hard drive performance was measured in RPM, and IOPS varied because random page seeks required waiting for the disk to rotate, and location on disk was unknown. The tests for this paper ran on a 200MHz Pentium with 64MB of RAM and a 7.2k RPM SCSI drive.

Postgres users continue to benefit from this work, done during the era of constrained resources. Indexes aren't free today, and you can have too many indexes, but they aren't as costly as they were. Also, unindexed data isn't as costly as it was.

The paper also acknowledges the problem is "provably difficult by establishing a limit on the achievable accuracy of estimation in the worst-case." Thus, "we devise a simple estimator which we believe is optimal." This number is a tradeoff between accuracy and performance. A smaller multiplier would lead to less accurate statistics, which could cause the planner to make bad decisions. A larger multiplier would lead to more accurate statistics, but it would also make ANALYZE slower. And remember, ANALYZE was much, much slower back then.

What does statistics_target control?

The statistics target controls the number of values stored for Most Common Values and the Equi-height Histogram. The following is true:

```
statistics_target = 100 → 30,000 samples, 100 MCVs, 100 buckets
statistics_target = 500 → 150,000 samples, 500 MCVs, 500 buckets
statistics_target = 1000 → 300,000 samples, 1000 MCVs, 1000 buckets
```
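The figures above follow directly from the 300x multiplier. A small sketch of the arithmetic, with a made-up helper name and dictionary keys chosen for the example:

```python
SAMPLES_PER_TARGET = 300  # the hardcoded multiplier discussed in the post

def analyze_budget(statistics_target):
    """Rows sampled and the upper limits on stored statistics for a given target."""
    return {
        "sample_rows": SAMPLES_PER_TARGET * statistics_target,
        "max_mcvs": statistics_target,
        "histogram_buckets": statistics_target,
    }

for target in (100, 500, 1000):
    print(target, analyze_budget(target))
```

Sample size, and so ANALYZE work, scales linearly with the target, which is one reason to raise it per column rather than globally.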

This value is set by default at the database level, and can be overridden at the column level.

```
-- Per-column override:
ALTER TABLE requests ALTER COLUMN status_code SET STATISTICS 500;
ANALYZE requests;
```

For larger databases, there is usually at least one column where a per-column setting may be the right approach. Don't raise the global default just because one column needs more granularity. Given the performance gains of the underlying hardware, the performance gains from changing column statistics aren't as significant as they once were.

2

184

17

104

26K

sausaw retweeted

Abhishek Upperwal

@upperwal

25 days ago

Here’s an early signal to what I said

For almost a year now, we at @soketlabs have been working on curating a frontier-scale pretraining data corpus, along with finding the best architecture that fits the diversity of languages while being compute optimal for both training and inference.

Sharing one of the many successes we have encountered. Our current version of the arch (yeah, it's not a clone of DeepSeek or any other known arch) is at least 30% more compute optimal than DeepSeek’s sparse-MoE. These are just initial results and we hope to find a lot more. Shows that we have a lot more to learn about these architectures.

Also excited about the pretraining data we have curated, but more on that later

Research efforts take time, but they also yield exponential outcomes, and that's what most people in India should be building towards

7

110

13

45

7K

Saurabh Sawant @sausaw

27 days ago

Cursor is so back. Composer 2.5 is incredible.

0

1

0

30
