Okay this model is small enough I can put my limited money where my big mouth is.
I'll bet that the 12B holds up in reasoning benches because reasoning ~alpha~ is overtrained/oversampled, and I predict regimes such as SEAL, Spurious Reward, Absolute Zero Reasoner et al. yield wins.
Gemma 4 dropped a 12B.
I put it on RTX 5090 against its 31B sibling.
when you cut a model from 31B to 12B, what do you actually lose?
~ reasoning barely moves
GSM8K (math) 97.5 > 96.4 (−1.1)
ARC-C (sci reasoning) 97.6 > 94.0 (−3.6)
~ knowledge falls off a cliff
MMLU (world knowledge) 87.8 > 78.9 (−8.9)
HellaSwag (commonsense) 92.0 > 81.6 (−10.4)
~~~
parameters store facts, not thinking. the 19B you delete is mostly where the model kept its trivia and world-priors. cut it and recall collapses, while the reasoning machinery stays nearly whole.
a 12B reasons almost like its big brother. It just knows less.
122 tok/s vs 53 (2.3x faster generation), ~10GB instead of ~24, meaning that you get 20GB+ free on a 32GB card for long context or a second model.
so it depends on your workload:
reasoning / math / agentic loops = the 12B is nearly free
broad-knowledge Q&A with no retrieval = that's the one job worth paying for the 31B.
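The deltas and ratios quoted above can be rechecked from the raw numbers in a few lines. This is only a sketch: the figures are the ones from the post, while the dictionary layout and variable names are made up for illustration.

```python
# Scores as quoted in the post: (31B, 12B). Layout and names are illustrative.
scores = {
    "GSM8K": (97.5, 96.4),
    "ARC-C": (97.6, 94.0),
    "MMLU": (87.8, 78.9),
    "HellaSwag": (92.0, 81.6),
}

# Change in points when going from the 31B to the 12B.
deltas = {name: round(small - big, 1) for name, (big, small) in scores.items()}

# Generation speed (tok/s) and approximate memory (GB), as quoted.
speedup = round(122 / 53, 1)
free_gb = 32 - 10  # 32GB card minus roughly 10GB for the 12B

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
print(f"speedup: {speedup}x, free VRAM: ~{free_gb}GB")
```

The reasoning benches lose under 4 points and the knowledge benches lose roughly 9 to 10, which is the split the post is describing.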
@joshwhiton @AdrienneLaF Consciousness is made up. That's why it evades physical probing. It's not because we don't know how yet, it's because it doesn't exist in the world. It exists in the mind, where the colour of an apple, the warmth of a fire, and the odour of methane exist.
@joshwhiton @AdrienneLaF LLMs have no BASIS on which to conceive of consciousness or anything else. They can't be conscious, they can't BE. They don't learn to imagine the world made of objects circa 36 months. They don't test if a caregiver is independent by saying a gestalt "no" to everything.
@spockwoz Are you seriously suggesting that agents will simply sit between APIs that provide well typed, reliable interfaces to various programs, services and hardware devices?
No, he isn't. Models aren't getting cheaper. Cheaper models are available, but at the frontier, where the illusion of "you don't have to write any code or worry about anything technical" plays out all day, every day, prices are going up. What is more, mistakes cost more there too.
garry tan is so right about not building massive rails factories for agents but nobody talks about what actually goes in its place
after building this way for a while the shift is actually super simple
1. your backend code should just be dumb hands and feet. no complex business logic, no nested if/else branches trying to predict what the model will do. just clean deterministic apis, db reads, auth, and sandboxes. the plumbing.
2. all the actual brains and workflow procedures live in markdown skills. the first time an agent solves a weird problem, it takes a minute. but instead of throwing that away you freeze the procedure by stripping out the specific data. next time someone asks for the same shape you serve it instantly and deterministically. zero agent latency, zero model cost.
3. and the golden rule for keeping the agent from burning your house down is that you never trust its self report. if the agent says tests passed or the write succeeded, you don't believe it. you rerun the check in your dumb code. you let go of control on the way out but you buy it back on the way in.
build the harness, not the factory
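The "never trust its self report" rule in point 3 can be sketched in a few lines. Everything here is hypothetical and only for illustration: the `AgentReport` shape, the `accept` function, and the in-memory `db` standing in for a real store. The point is just that the harness ignores the claim text and reruns a deterministic check itself.

```python
from dataclasses import dataclass
from typing import Callable

# Toy stand-in for a real database (assumption, for illustration only).
db: dict[str, str] = {}


@dataclass
class AgentReport:
    claim: str                 # what the agent says happened
    check: Callable[[], bool]  # deterministic check the harness reruns


def accept(report: AgentReport) -> bool:
    """Ignore the claim text entirely; only the rerun check decides."""
    return report.check()


# The agent claims a write succeeded, but nothing was actually written.
bad = AgentReport("wrote user:1", lambda: db.get("user:1") == "alice")
bad_result = accept(bad)

# Here the write really happened before the report came in.
db["user:2"] = "bob"
good = AgentReport("wrote user:2", lambda: db.get("user:2") == "bob")
good_result = accept(good)

print(bad_result, good_result)  # prints: False True
```

In a real harness the check would be a test run, a db read, or a file stat in the "dumb" backend code, never something the model gets to answer.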
A good harness with a local model will work *with* you to produce software. No harness and a frontier model over an API will *pretend* to do everything for you and eventually fail catastrophically, every single time. Garry Tan is living proof of this. Look at his stupid website.
@Dan_Jeffries1 Unfortunately, yes both. ICL is more efficient than training an adapter for all the numbers that matter. Skills files are not necessarily good examples of ICL but the point is that they can be. You can ~autoresearch~ a best-in-class preamble provided you have a way to validate.
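A rough sketch of what "provided you have a way to validate" could look like: score each candidate preamble on a small validation set and keep the best one. The names `pick_preamble`, `toy_model` and the data are all hypothetical; `toy_model` is a stub standing in for a real model call, and a real search would generate candidates rather than list them by hand.

```python
from typing import Callable, Sequence


def pick_preamble(
    candidates: Sequence[str],
    run_model: Callable[[str, str], str],
    validation_set: Sequence[tuple[str, str]],
) -> str:
    """Return the candidate preamble with the highest validation accuracy."""

    def accuracy(preamble: str) -> float:
        hits = sum(run_model(preamble, q) == a for q, a in validation_set)
        return hits / len(validation_set)

    return max(candidates, key=accuracy)


def toy_model(preamble: str, question: str) -> str:
    """Stub model (assumption): uppercases only if the preamble asks for it."""
    return question.upper() if "uppercase" in preamble else question


validation_set = [("a", "A"), ("b", "B")]
candidates = ["Answer briefly.", "Reply in uppercase."]
best = pick_preamble(candidates, toy_model, validation_set)
print(best)
```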
@Sentio_xbt @atmoio Yes, because that's what tacit means. Doing things and learning all the ways they go wrong. Self-organising systems learn to correct errors. We don't learn how to walk, we learn how to not fall over.
@Maciej_M @MLStreetTalk @VictorTaelin With all due respect, isn't this just TodoMVC "but make it for recruitment" with an unjustifiably complicated architecture?
@stevibe @malikwas1f FWIW I've been doing similar with YOLO etc. for a long time and this kind of setup beats SOTA multimodal every time. I'm honestly at a loss why anyone bothers with the aggro of multimodal training when it always turns out dogshit.
Qwen3.6 35B A3B can't fill out a paper form on its own. But give it NVIDIA's LocateAnything-3B — the #1 trending model on HuggingFace — as its eyes, and the two small models get it done together.
(The test: place each element at the right pixel position on a blank form image, not type into a field.)
Setup:
> Qwen is the brain (main model), LocateAnything is the eyes (helper model acting as a tool).
> I gave Qwen a new tool: ask "where's the email field?" and LocateAnything returns the exact x, y, width, height.
> The blue boxes on the screen are its detections. Look how tight they are — it nails every field.
Result:
> Qwen3.6 35B A3B + LocateAnything-3B: form completed, all info correct.
> Name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code: all landed in the right field areas.
> Character-box alignment still a touch loose, but every value is where it belongs.
> 9m10s, 224.5k input, 24.3k output, 21 turns.
Why it matters:
> Qwen alone can't finish this test. Bolt on a 3B model that does exactly one thing (locate) and suddenly it can.
> A combination of small models can do the work of a single large one.
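The brain-plus-eyes setup described above can be sketched as a tool the main model calls to get a bounding box, then uses to decide where to draw. This is a sketch under stated assumptions, not LocateAnything's actual API: `locate` is a stub with hard-coded detections, and `Box`, `place_text` and the padding value are hypothetical names and choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: int
    y: int
    width: int
    height: int


def locate(query: str) -> Box:
    """Stand-in for the helper model: returns hard-coded boxes (assumption)."""
    fake_detections = {
        "email field": Box(x=120, y=340, width=400, height=32),
        "phone field": Box(x=120, y=390, width=400, height=32),
    }
    return fake_detections[query]


def place_text(query: str, value: str, padding: int = 4) -> dict:
    """Turn a located box into a draw instruction for the form image."""
    box = locate(query)
    return {
        "text": value,
        "x": box.x + padding,          # left-aligned inside the field
        "y": box.y + box.height // 2,  # vertically centred in the field
    }


print(place_text("email field", "a@b.c"))
```

The main model never has to guess pixel coordinates; it only picks which field to ask about and what value goes there.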
A prompt can cost a million times more than an HTTP request, so token theft is a high-margin business for attackers.
How we protect our AI endpoints ↓
https://t.co/Nhb1kPKbwD
@melvynx They're all trained on benchmarks. I just had it one shot a complicated change to a language grammar. It did it, updated the docs, the LSP and the highlighter. Hands off just did it. I am more than happy to have a local model benchmaxxxed on tasks much like the tasks I want done.
@NikilKuruvilla @DanielMiessler @atmoio And so really, if we're honest about it, simulation theory is quite unsophisticated. It's not so much a theory as a science fiction trope. It doesn't survive a few minutes of thinking about it. And it only survives if you allude to a pseudo-math that allows it, as our math doesn't.
@NikilKuruvilla @DanielMiessler @atmoio This could all be happening outside and it would NEVER be detectable inside. All possible worlds would exist, in principle indefinitely. Nobody would ever know, and every possible substep of the world would execute necessarily eventually as readily as you could add 1 to infinity.