Benjamin Babik

@localoptimiser

future big account

Joined October 2024

24 Following

11 Followers

405 Posts

Benjamin Babik @localoptimiser

23 minutes ago

Okay, this model is small enough that I can put my limited money where my big mouth is. I'll bet that the 12B holds up in reasoning benches because reasoning ~alpha~ is overtrained/oversampled, and I predict regimes such as SEAL, Spurious Reward, Absolute Zero Reasoner et al. yield wins.

about 13 hours ago

Gemma 4 dropped a 12B.
I put it on RTX 5090 against its 31B sibling.

when you cut a model from 31B to 12B, what do you actually lose?

~ reasoning barely moves
GSM8K (math) 97.5 > 96.4 (−1.1)
ARC-C (sci reasoning) 97.6 > 94.0 (−3.6)

~ knowledge falls off a cliff
MMLU (world knowledge) 87.8 > 78.9 (−8.9)
HellaSwag (commonsense) 92.0 > 81.6 (−10.4)

~~~
parameters store facts, not thinking. the 19B you delete is mostly where the model kept its trivia and world-priors. cut it and recall collapses, while the reasoning machinery stays nearly whole.

a 12B reasons almost like its big brother. It just knows less.

122 tok/s vs 53 (2.3x faster generation), ~10GB instead of ~24, meaning that you get 20GB+ free on a 32GB card for long context or a second model.

so it depends on your workload:

reasoning / math / agentic loops = the 12B is nearly free

broad-knowledge Q&A with no retrieval = that's the one job worth paying for the 31B.

[witcheer's tweet photo: same text as above]
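The arithmetic in the quoted post can be checked with a short sketch. The numbers are copied from the post; the dictionary layout and variable names are assumptions for illustration only, and the memory figure uses the post's approximate ~10GB footprint.

```python
# Scores quoted in the post, as (31B, 12B) per benchmark.
scores = {
    "GSM8K": (97.5, 96.4),
    "ARC-C": (97.6, 94.0),
    "MMLU": (87.8, 78.9),
    "HellaSwag": (92.0, 81.6),
}

# Point drop when going from 31B to 12B, rounded to one decimal place.
drops = {name: round(big - small, 1) for name, (big, small) in scores.items()}

# Throughput and memory figures quoted in the post.
speedup = round(122 / 53, 1)  # tok/s of the 12B divided by tok/s of the 31B
free_gb = 32 - 10             # approximate headroom on a 32GB card with the 12B loaded

for name, drop in drops.items():
    print(f"{name}: -{drop}")
print(f"speedup: {speedup}x, free VRAM: ~{free_gb}GB")
```

The two reasoning benchmarks lose under 4 points each, while the two knowledge benchmarks lose roughly 9 to 10, which is the split the post is describing.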

Benjamin Babik @localoptimiser

about 7 hours ago

@maxleiter They're not made out of weights and they don't make the words.

Benjamin Babik @localoptimiser

about 7 hours ago

@joshwhiton @AdrienneLaF Consciousness is made up. That's why it evades physical probing. It's not because we don't know how yet, it's because it doesn't exist in the world. It exists in the mind, where the colour of an apple, the warmth of a fire, and the odour of methane exist.

Benjamin Babik @localoptimiser

about 7 hours ago

@joshwhiton @AdrienneLaF LLMs have no BASIS on which to conceive of consciousness or anything else. They can't be conscious, they can't BE. They don't learn to imagine the world made of objects circa 36 months. They don't test if a caregiver is independent by saying a gestalt "no" to everything.

Benjamin Babik @localoptimiser

about 11 hours ago

@spockwoz Are you seriously suggesting that agents will simply sit between APIs that provide well typed, reliable interfaces to various programs, services and hardware devices?

Benjamin Babik @localoptimiser

about 15 hours ago

No he isn't. Models aren't getting cheaper. Cheaper models are available, but at the frontier, where the illusion of "you don't have to write any code or worry about anything technical" plays out all day, every day, prices are going up. What is more, mistakes cost more there too.

1 day ago

garry tan is so right about not building massive rails factories for agents but nobody talks about what actually goes in its place

after building this way for a while the shift is actually super simple

1. your backend code should just be dumb hands and feet. no complex business logic, no nested if/else loops trying to predict what the model will do. just clean deterministic apis, db reads, auth, and sandboxes. the plumbing.

2. all the actual brains and workflow procedures live in markdown skills. the first time an agent solves a weird problem, it takes a minute. but instead of throwing that away you freeze the procedure by stripping out the specific data. next time someone asks for the same shape you serve it instantly and deterministically. zero agent latency, zero model cost.

3. and the golden rule for keeping the agent from burning your house down is that you never trust its self report. if the agent says tests passed or the write succeeded, you don't believe it. you rerun the check in your dumb code. you let go of control on the way out but you buy it back on the way in.

build the harness, not the factory
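The "never trust its self report" rule from the quoted post might look something like the sketch below. Everything here is hypothetical: `AgentReport`, `rerun_check`, and `accept` are illustrative names, and the two check commands are stand-ins that simply exit 0 or 1 rather than a real test suite.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class AgentReport:
    """What a hypothetical agent claims about its own work."""
    claims_tests_passed: bool


def rerun_check(command: list[str]) -> bool:
    """Run the check ourselves; only the exit code counts."""
    result = subprocess.run(command, capture_output=True)
    return result.returncode == 0


def accept(report: AgentReport, command: list[str]) -> bool:
    """Accept the work only if our own rerun passes, whatever the agent said."""
    verified = rerun_check(command)
    if report.claims_tests_passed and not verified:
        print("agent claimed success but the rerun failed")
    return verified


# Stand-in check commands (assumptions): one exits 0, the other exits 1.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]
ok_result = accept(AgentReport(claims_tests_passed=True), passing)
bad_result = accept(AgentReport(claims_tests_passed=True), failing)
```

The agent's claim is only used for logging a mismatch. The decision comes from the deterministic rerun, which is the "buy it back on the way in" part.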

Benjamin Babik @localoptimiser

about 15 hours ago

A good harness with a local model will work *with* you to produce software. No harness and a frontier model over an API will *pretend* to do everything for you and eventually fail catastrophically, every single time. Garry Tan is living proof of this. Look at his stupid website.

Benjamin Babik @localoptimiser

1 day ago

@Dan_Jeffries1 Unfortunately, yes both. ICL is more efficient than training an adapter for all the numbers that matter. Skills files are not necessarily good examples of ICL but the point is that they can be. You can ~autoresearch~ a best-in-class preamble provided you have a way to validate.
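A rough sketch of what searching for a preamble against a validator could look like. This is an assumption about the general shape, not a description of any particular tool: `toy_model` is a lookup stand-in for an LLM call, and a real setup would keep the validation pairs out of the candidate preambles.

```python
from typing import Callable

Model = Callable[[str, str], str]  # (preamble, question) -> answer


def score(preamble: str, model: Model, validation: list[tuple[str, str]]) -> float:
    """Fraction of validation questions answered exactly right with this preamble."""
    hits = sum(model(preamble, q) == expected for q, expected in validation)
    return hits / len(validation)


def best_preamble(candidates: list[str], model: Model,
                  validation: list[tuple[str, str]]) -> str:
    """Keep whichever candidate preamble scores highest on the validation set."""
    return max(candidates, key=lambda p: score(p, model, validation))


def toy_model(preamble: str, question: str) -> str:
    """Hypothetical stand-in for an LLM: only answers pairs shown in the preamble."""
    for line in preamble.splitlines():
        q, _, a = line.partition(" -> ")
        if q == question:
            return a
    return "?"


validation = [("2+2", "4"), ("3+3", "6")]
candidates = ["", "2+2 -> 4", "2+2 -> 4\n3+3 -> 6"]
winner = best_preamble(candidates, toy_model, validation)
```

The loop is only as good as `score`, which is the point about needing a way to validate.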

Benjamin Babik @localoptimiser

1 day ago

@Sentio_xbt @atmoio Yes, because that's what tacit means. Doing things and learning all the ways they go wrong. Self-organising systems learn to correct errors. We don't learn how to walk, we learn how to not fall over.

Benjamin Babik @localoptimiser

1 day ago

@Maciej_M @MLStreetTalk @VictorTaelin With all due respect isn't this just TodoMVC "but make it for recruitment" with an unjustifiably complicated architecture?

Benjamin Babik @localoptimiser

1 day ago

@stevibe @malikwas1f FWIW I've been doing similar with YOLO etc. for a long time and this kind of setup beats SOTA multimodal every time. I'm honestly at a loss why anyone bothers with the aggro of multimodal training when it always turns out dogshit.

Benjamin Babik @localoptimiser

1 day ago

This is the way.

1 day ago

Qwen3.6 35B A3B can't fill out a paper form on its own. But give it NVIDIA's LocateAnything-3B (the #1 trending model on HuggingFace) as its eyes, and the two small models get it done together.

(The test: place each element at the right pixel position on a blank form image, not type into a field.)

Setup:
> Qwen is the brain (main model), LocateAnything is the eyes (helper model acting as a tool).
> I gave Qwen a new tool: ask "where's the email field?" and LocateAnything returns the exact x, y, width, height.
> The blue boxes on the screen are its detections. Look how tight they are: it nails every field.

Result:
> Qwen3.6 35B A3B + LocateAnything-3B: form completed, all info correct.
> Name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code: all landed in the right field areas.
> Character-box alignment still a touch loose, but every value is where it belongs.
> 9m10s, 224.5k input, 24.3k output, 21 turns.

Why it matters:
> Qwen alone can't finish this test. Bolt on a 3B model that does exactly one thing (locate) and suddenly it can.
> A combination of small models can do the work of a single large one.
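The "helper model as a tool" setup in the quoted post can be sketched roughly as follows. This is not the actual interface of either model: `Box`, `locate`, `place_value`, and the `FAKE_DETECTIONS` table are all hypothetical, and the detector is replaced by a fixed lookup so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box in pixels, matching the x, y, width, height in the post."""
    x: int
    y: int
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        return (self.x + self.width // 2, self.y + self.height // 2)


# Hypothetical stand-in for the helper model: a fixed lookup, not a real detector.
FAKE_DETECTIONS = {
    "email field": Box(x=120, y=340, width=300, height=28),
    "phone field": Box(x=120, y=390, width=300, height=28),
}


def locate(query: str) -> Box:
    """Tool the main model would call: 'where is X?' returns a bounding box."""
    return FAKE_DETECTIONS[query]


def place_value(query: str, value: str) -> dict:
    """Decide where to place a value on the form image, using the tool's box."""
    box = locate(query)
    return {"text": value, "at": box.center()}


placement = place_value("email field", "user@example.com")
```

The main model never has to estimate pixel coordinates itself. It only has to name the field it wants, and the narrow helper supplies the geometry.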

Benjamin Babik @localoptimiser

1 day ago

@mattshumer_ @OpenAI Man that is a lot of tokens for a meal planner.

Benjamin Babik @localoptimiser

1 day ago

@Swizec @kentcdodds Funny you think retirement funds aren't already invested in this?

Benjamin Babik @localoptimiser

1 day ago

@imrobertjames @kentcdodds Training on all of our content doesn't make it not theft.

Benjamin Babik @localoptimiser

2 days ago

I wish I was a cracked dev so I could have this sort of entirely avoidable problem.

Vercel Developers

3 days ago

A prompt can cost a million times more than an HTTP request, so token theft is a high-margin business for attackers. How we protect our AI endpoints ↓ https://t.co/Nhb1kPKbwD

Benjamin Babik @localoptimiser

2 days ago

@melvynx @Avenoxai Worse than 0%?!?

Benjamin Babik @localoptimiser

2 days ago

@melvynx They're all trained on benchmarks. I just had it one-shot a complicated change to a language grammar. It did it, updated the docs, the LSP and the highlighter. Hands off, it just did it. I am more than happy to have a local model benchmaxxxed on tasks much like the tasks I want done.

Benjamin Babik @localoptimiser

2 days ago

@NikilKuruvilla @DanielMiessler @atmoio And so really, if we're honest about it, simulation theory is quite unsophisticated. It's not so much a theory as a science fiction trope. It doesn't survive a few minutes of thinking about it. And it only survives if you allude to a pseudo-math that allows it, as our math doesn't.

Benjamin Babik @localoptimiser

2 days ago

@NikilKuruvilla @DanielMiessler @atmoio This could all be happening outside and it would NEVER be detectable inside. All possible worlds would exist, in principle indefinitely. Nobody would ever know, and every possible substep of the world would execute necessarily eventually as readily as you could add 1 to infinity.
