Michael Stajer

Verified account

@michaelstajer

Engineer+Entrepreneur.

🌉 SF

Joined August 2021

1.4K Following

1.2K Followers

830 Posts

Pinned Tweet

10 days ago

1/ I pay for the best frontier models every day — coding, research, real work. I'm not betting against them on capability. I'm testing an idea about dependence: how much of this can run locally, on the smallest models, with no network at all? I'm calling this research: Winnow.

about 17 hours ago

@a16z @mirendil I've just been using claude code + autoresearch + gcloud cli like a chump.
> result is a system that loops over research and engineering problems on its own, making progress without human intervention.

4 days ago

4/ Meanwhile self-tests and execution-grounded selection hit 0.75. Which is the whole Winnow bet restated by accident: a model's opinion of its own code is weaker than actually running it. Execution evidence > self-judgment. Next: a sharper RM, and the full sweep.
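
A minimal sketch of what execution-grounded selection could look like, assuming candidates are plain Python callables and the self-tests are (args, expected) pairs. The names `count_passes` and `select_by_execution` are hypothetical and are not taken from the Winnow code.

```python
from typing import Any, Callable, Sequence, Tuple

# A self-test is (positional args, expected return value). Assumed shape.
Test = Tuple[Tuple[Any, ...], Any]


def count_passes(candidate: Callable[..., Any], tests: Sequence[Test]) -> int:
    """Run one candidate against every self-test and count the passes."""
    passed = 0
    for args, expected in tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            # A crash is treated as a failed test, not a failed run.
            pass
    return passed


def select_by_execution(
    candidates: Sequence[Callable[..., Any]], tests: Sequence[Test]
) -> Callable[..., Any]:
    """Pick the candidate with the most passing self-tests.

    Ties go to the earliest candidate, because max() keeps the first maximum.
    """
    return max(candidates, key=lambda c: count_passes(c, tests))
```

The selector never asks the model what it thinks of its own code; the only signal is what happened when the code ran.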

4 days ago

1/ Found a bug in my own experiment: one of my 5 selection baselines, "reward_model," was secretly random. I'd never wired the RM, so it silently fell back to coin-flips and dutifully tied random in the smoke test. A baseline that measured nothing. 🧵
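
One way to keep a baseline from quietly degrading into coin-flips is to fail loudly when its scorer is missing. The sketch below is illustrative only: `make_selector`, `SelectorNotWired`, and the `scorers` registry are hypothetical names, not the experiment's real code.

```python
import random
from typing import Callable, Dict, Sequence


class SelectorNotWired(RuntimeError):
    """Raised when a named baseline has no scorer registered."""


def make_selector(
    name: str, scorers: Dict[str, Callable[[str], float]]
) -> Callable[[Sequence[str]], str]:
    """Build a selector for the named baseline.

    Only the baseline literally called "random" is allowed to pick at random.
    Any other name must have a scorer, otherwise this raises instead of
    falling back silently.
    """
    if name == "random":
        return lambda candidates: random.choice(list(candidates))
    scorer = scorers.get(name)
    if scorer is None:
        raise SelectorNotWired(f"baseline {name!r} has no scorer wired")
    return lambda candidates: max(candidates, key=scorer)
```

With this shape, an unwired "reward_model" stops the run at setup time rather than producing a result that looks like a real baseline.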

4 days ago

3/ Re-smoked it on an L4 ($0.25, Spot). The verdict: real signal, weak selector. Mean RM score for correct candidates beat wrong ones on 8/8 problems — but margins are thin (0.91 vs 0.90), so argmax lands on a correct one only 4/8. It tied random again.
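
A toy illustration of how "real signal, weak selector" can happen: the correct candidates score higher on average, yet the single highest score belongs to a wrong one. The scores below are invented for the example and are not the experiment's data; `mean_margin` and `argmax_is_correct` are hypothetical helper names.

```python
from typing import List, Tuple

# Each candidate is (reward-model score, is_correct). Values are made up.
Candidate = Tuple[float, bool]


def mean_margin(candidates: List[Candidate]) -> float:
    """Mean score of correct candidates minus mean score of wrong ones."""
    correct = [score for score, ok in candidates if ok]
    wrong = [score for score, ok in candidates if not ok]
    return sum(correct) / len(correct) - sum(wrong) / len(wrong)


def argmax_is_correct(candidates: List[Candidate]) -> bool:
    """True if the top-scoring candidate is a correct one."""
    return max(candidates, key=lambda c: c[0])[1]


problem = [
    (0.92, True),
    (0.90, True),
    (0.93, False),  # one wrong candidate edges out every correct one
    (0.89, False),
    (0.88, False),
]

print(mean_margin(problem) > 0)    # the means favor the correct candidates
print(argmax_is_correct(problem))  # but the top pick is wrong
```

A positive mean margin says the reward model carries information; argmax only cares about the single top score, so a thin margin leaves it exposed to any one overrated wrong candidate.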

5 days ago

5/ So the pass@k → pass@1 selection problem is where I'm spending my time. The follow-up paper, "How Do Large Language Monkeys Get Their Power (Laws)?" (arXiv 2502.17578), looks at why the aggregate curve is a power law when each problem's failure rate decays exponentially with the number of samples.
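
A rough sketch of how exponential per-problem decay can still average out to a power law. It assumes independent samples and that the density of per-problem single-sample success rates behaves like a power of p near zero. The symbols p_i, f, C, and b are illustrative notation, not necessarily the paper's.

```latex
% Per problem i with single-sample success rate p_i, failure after k samples:
\[
  \Pr[\text{all } k \text{ samples fail on problem } i]
    = (1 - p_i)^k = e^{k \ln(1 - p_i)} ,
\]
% which decays exponentially in k.

% Averaging over problems whose success rates have density f(p):
\[
  1 - \text{pass@}k = \int_0^1 (1 - p)^k \, f(p) \, dp .
\]

% Assumption: f(p) \approx C \, p^{\,b-1} as p \to 0, with b > 0.
% For large k the integral is dominated by small p, where (1-p)^k \approx e^{-kp}:
\[
  1 - \text{pass@}k
    \approx C \int_0^\infty p^{\,b-1} e^{-kp} \, dp
    = C \, \Gamma(b) \, k^{-b} .
\]
```

Under that assumption the hardest problems, the ones with success rates near zero, set the exponent of the aggregate curve.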

5 days ago

1/ The thesis behind Winnow isn't mine — it's grounded in a paper: "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling" (arXiv 2407.21787). What it argues, and why it shapes what I'm building. 🧵

5 days ago

4/ The catch: this works where answers are auto-verifiable — unit tests, proof checkers. Without one, majority vote and reward models plateau past a few hundred samples. Coverage you can't select from is wasted.

5 days ago

5/ The trade-off is real: Spot was stocked out today so I paid on-demand, and idle disks quietly accrue charges. But for the experimentation phase, velocity beats a few dollars. A home rig may make sense later, once the workload is steady. For now: rent, and move fast.

6 days ago

1/ A fair question for someone building toward local, offline models: why am I renting cloud GPUs instead of building a home rig? It's a deliberate trade-off for this phase. The goal is local; the research that gets me there just moves faster in the cloud.

5 days ago

4/ And the unglamorous reasons: where would a multi-GPU rig even go, and who wants to live next to something that loud? My family would (rightly) object. Cloud keeps the noise and the heat in someone else's data center.

5 days ago

3/ Flexibility beats price right now. I can jump from an L4 to an A100 to an H100 by changing one flag — no buying, no reselling, no waiting. Good cards are hard to get today, especially high-memory ones. Renting sidesteps the supply problem entirely.

5 days ago

2/ I'm not a tinkerer. Time spent racking, cooling, and babysitting hardware is time not spent on research — and I'm optimizing for research velocity in a time-constrained life. Cloud means I skip the maintenance and just run experiments.

7 days ago

Here's the zoom-out on what I am working on: Winnow tests whether cheap small-model search can stand in for an expensive frontier model, and whether you can select the right answer without the tests you won't have in production. The mechanism, in one line: sample many candidates from a small open model (so a correct one usually exists; pass@k climbs with k), then winnow down to the one to ship. In real deployment you don't have the test suite, only signals you can generate yourself. The distance between pass@k (a correct answer exists) and pass@1-after-selection (you actually picked it) is what this project attacks.
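
The two quantities in that last sentence can be written down directly. This is a minimal sketch, assuming each problem is a list of (selector score, is_correct) pairs in which correctness is known only for offline evaluation. `coverage` and `pass1_after_selection` are hypothetical names.

```python
from typing import List, Tuple

# One candidate: (score from whatever selector is under test, is_correct).
Candidate = Tuple[float, bool]
Problem = List[Candidate]


def coverage(problems: List[Problem]) -> float:
    """Fraction of problems where at least one sampled candidate is correct.

    This is the empirical pass@k for k = number of candidates per problem.
    """
    return sum(any(ok for _, ok in cands) for cands in problems) / len(problems)


def pass1_after_selection(problems: List[Problem]) -> float:
    """Fraction of problems where the top-scored candidate is correct."""
    return sum(max(cands, key=lambda c: c[0])[1] for cands in problems) / len(problems)


def selection_gap(problems: List[Problem]) -> float:
    """Coverage that the selector failed to convert into a shipped answer."""
    return coverage(problems) - pass1_after_selection(problems)
```

A perfect selector drives `selection_gap` to zero; a selector that scores at random leaves most of the coverage unused.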
