1/ I pay for the best frontier models every day for coding, research, real work. I'm not betting against them on capability. I'm testing an idea about dependence: how much of this can run locally, on the smallest models, with no network at all? I'm calling this research Winnow.
@a16z @mirendil I've just been using claude code + autoresearch + gcloud cli like a chump.
> result is a system that loops over research and
> engineering problems on its own, making progress
> without human intervention.
4/ Meanwhile self-tests and execution-grounded selection hit 0.75. Which is the whole Winnow bet restated by accident: a model's opinion of its own code is weaker than actually running it. Execution evidence > self-judgment. Next: a sharper RM, and the full sweep.
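A minimal sketch of what execution-grounded selection can look like, assuming each candidate is a Python source string that defines a function named `solve` and that the self-tests are `(args, expected)` pairs the model wrote itself. The names `load_candidate`, `count_passed`, and `select_by_execution` are hypothetical, not Winnow's actual code. It uses a bare `exec` with no sandbox or timeout, so it is only suitable for trusted toy inputs.

```python
from typing import Any, Callable, List, Optional, Tuple

Test = Tuple[tuple, Any]  # (positional args, expected return value)


def load_candidate(source: str, entry_point: str = "solve") -> Optional[Callable]:
    """Execute candidate source and return its entry-point function, or None."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # assumption: trusted input; sandbox this in real use
    except Exception:
        return None  # syntax errors and import failures count as a dead candidate
    fn = namespace.get(entry_point)
    return fn if callable(fn) else None


def count_passed(fn: Callable, tests: List[Test]) -> int:
    """Count how many self-generated tests the candidate passes when run."""
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash is simply a failed test
    return passed


def select_by_execution(sources: List[str], tests: List[Test]) -> int:
    """Return the index of the candidate that passes the most tests (first wins ties)."""
    best_idx, best_score = 0, -1
    for i, src in enumerate(sources):
        fn = load_candidate(src)
        score = count_passed(fn, tests) if fn is not None else 0
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


if __name__ == "__main__":
    candidates = [
        "def solve(a, b): return a - b",  # wrong operator
        "def solve(a, b): return a + b",  # correct
        "def solve(a, b) return a + b",   # does not even parse
    ]
    self_tests = [((1, 2), 3), ((0, 0), 0)]
    print(select_by_execution(candidates, self_tests))
```

The selector never asks the model whether the code is right; it only counts what happens when the code runs.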
1/ Found a bug in my own experiment: one of my 5 selection baselines, "reward_model," was secretly random. I'd never wired the RM, so it silently fell back to coin flips, and dutifully tied random in the smoke. A baseline that measured nothing. 🧵
3/ Re-smoked it on an L4 ($0.25, Spot). The verdict: real signal, weak selector. Mean RM score for correct candidates beat wrong ones on 8/8 problems, but margins are thin (0.91 vs 0.90), so argmax lands on a correct one only 4/8. It tied random again.
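To make the "real signal, weak selector" distinction concrete, here is a small sketch that computes both diagnostics for one problem. The function name `rm_diagnostics` and the scores below are made up for illustration; they are not the experiment's data. The point is only that correct candidates can win on the mean while argmax still picks a wrong one.

```python
from statistics import mean
from typing import List, Tuple


def rm_diagnostics(scores: List[float], correct: List[bool]) -> Tuple[bool, bool]:
    """Return (mean score of correct > mean score of wrong, argmax pick is correct).

    Assumes the problem has at least one correct and one wrong candidate.
    """
    right = [s for s, ok in zip(scores, correct) if ok]
    wrong = [s for s, ok in zip(scores, correct) if not ok]
    mean_signal = mean(right) > mean(wrong)
    top = max(range(len(scores)), key=lambda i: scores[i])  # argmax selection
    return mean_signal, correct[top]


if __name__ == "__main__":
    # Illustrative numbers only: thin margins, one overconfident wrong candidate.
    scores = [0.93, 0.94, 0.91, 0.88, 0.90]
    correct = [True, False, True, False, True]
    print(rm_diagnostics(scores, correct))
```

With margins this thin, a single overconfident wrong candidate is enough to flip the pick, even though the averages point the right way.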
5/ So the pass@k → pass@1 selection problem is where I'm spending my time. Follow-up reading: "How Do Large Language Monkeys Get Their Power (Laws)?" (arXiv 2502.17578), on why the aggregate curve is a power law when each problem decays exponentially.
1/ The thesis behind Winnow isn't mine; it's grounded in a paper: "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling" (arXiv 2407.21787). What it argues, and why it shapes what I'm building. 🧵
4/ The catch: this works where answers are auto-verifiable, e.g. unit tests, proof checkers. Without one, majority vote and reward models plateau past a few hundred samples. Coverage you can't select from is wasted.
5/ The trade-off is real: Spot was stocked out today so I paid on-demand, and idle disks quietly accrue charges. But for the experimentation phase, velocity beats a few dollars. A home rig may make sense later, once the workload is steady. For now: rent, and move fast.
1/ A fair question for someone building toward local, offline models: why am I renting cloud GPUs instead of building a home rig? It's a deliberate trade-off for this phase. The goal is local; the research that gets me there just moves faster in the cloud.
4/ And the unglamorous reasons: where would a multi-GPU rig even go, and who wants to live next to something that loud? My family would (rightly) object. Cloud keeps the noise and the heat in someone else's data center.
3/ Flexibility beats price right now. I can jump from an L4 to an A100 to an H100 by changing one flag: no buying, no reselling, no waiting. Good cards are hard to get today, especially high-memory ones. Renting sidesteps the supply problem entirely.
2/ I'm not a tinkerer. Time spent racking, cooling, and babysitting hardware is time not spent on research, and I'm optimizing for research velocity in a time-constrained life. Cloud means I skip the maintenance and just run experiments.
Here's the zoom-out on what I am working on:
Winnow tests whether cheap small-model search can stand in for an expensive frontier model, and whether you can select the right answer without the tests you won't have in production.
The mechanism, in one line: sample many candidates from a small open model (so a correct one usually exists, since pass@k climbs with k), then winnow down to the one to ship.
In real deployment you don't have the test suite, only signals you can generate yourself. The distance between pass@k (a correct answer exists) and pass@1-after-selection (you actually picked it) is what this project attacks.
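As a rough sketch of the two quantities being compared, the snippet below computes pass@k with the standard unbiased estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct, alongside the fraction of problems where a selector's single pick was correct. The function names and the toy data are hypothetical, and this is not necessarily how Winnow measures either number.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples drawn from n is correct."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    # comb(n - c, k) is 0 when fewer than k wrong samples exist, giving 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_1_after_selection(correct: List[List[bool]], picks: List[int]) -> float:
    """Fraction of problems where the selector's single pick is a correct candidate."""
    hits = sum(1 for flags, i in zip(correct, picks) if flags[i])
    return hits / len(correct)


if __name__ == "__main__":
    # Toy data: 3 problems, 4 candidates each; True marks a correct candidate.
    correct = [
        [False, True, False, False],
        [True, True, False, False],
        [False, False, False, False],
    ]
    picks = [0, 1, 2]  # hypothetical selector output, one index per problem
    coverage = sum(pass_at_k(4, sum(f), 4) for f in correct) / len(correct)
    print(coverage, pass_at_1_after_selection(correct, picks))
```

On the toy data a correct candidate exists for two of three problems, but the selector lands on one for only one of three. That difference is the gap described above.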