Which local models can actually handle tool calling?
I built a framework to find out.
15 scenarios. 12 tools. Mocked responses. Temperature 0. No cherry-picking.
Tested every Qwen3.5 size from 0.8B to 397B, and since some of you asked after the distillation tests: yes, I included Jackrong's Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled too.
Only two models went all green: the 27B dense and the distilled 27B.
The 397B? Failed two tests. The 122B? Failed one. The 35B? Failed two.
The timed-out results, mostly on the smaller models, are cases where the model got stuck in a loop, repeating the same tool call until it hit the 30-second limit.
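If you want your own harness to tell "stuck in a loop" apart from "just slow", here is a minimal sketch. It is not the framework's actual code: `run_with_loop_guard`, `next_tool_call`, and the `max_repeats` threshold are all hypothetical names and values chosen for illustration. Only the 30-second limit comes from the test setup above.

```python
import json
import time


def run_with_loop_guard(next_tool_call, time_limit_s=30.0, max_repeats=3,
                        clock=time.monotonic):
    """Drive a tool-calling loop; stop on timeout or on repeated identical calls.

    next_tool_call is a hypothetical callable that returns a dict such as
    {"name": ..., "arguments": {...}}, or None when the model is done.
    """
    start = clock()
    last_key, repeats = None, 0
    while True:
        if clock() - start > time_limit_s:
            return "timed_out"
        call = next_tool_call()
        if call is None:
            return "finished"
        # Canonical JSON so argument key order cannot hide a repeat.
        key = json.dumps(call, sort_keys=True)
        repeats = repeats + 1 if key == last_key else 1
        last_key = key
        if repeats >= max_repeats:
            return "stuck_in_loop"
```

Labelling the loop early also saves wall-clock time, since the run can be cut off after a few identical calls instead of waiting out the full limit.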
The test that exposed the most models: "Search for Iceland's population, then calculate 2% of it." Simple, but 35B, 122B, and 397B all used a rounded number from memory instead of the actual search result. They didn't trust their own tool output.
Small models hallucinate data.
Big models ignore data.
The 27B just threaded it through.
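A rough sketch of how a check like the Iceland one could be scored, assuming the mocked search returns a deliberately non-round number. The function name, the 5% "near miss" tolerance, and the mock value are all assumptions made for illustration, not the framework's real scoring code.

```python
# Arbitrary mock value for the search tool, NOT Iceland's real population.
MOCK_POPULATION = 123_457


def classify_population_argument(used_value, tool_value, near_tolerance=0.05):
    """Classify the number a model fed into its calculator call.

    used_value: the population figure the model actually passed along.
    tool_value: the exact figure the mocked search tool returned.
    """
    if used_value == tool_value:
        return "threaded"              # tool output passed through unchanged
    if abs(used_value - tool_value) / tool_value <= near_tolerance:
        return "rounded_from_memory"   # close, but not what the tool said
    return "hallucinated"              # unrelated to the tool output


print(classify_population_argument(123_457, MOCK_POPULATION))
print(classify_population_argument(120_000, MOCK_POPULATION))
```

Using a mock value that no model could have memorized is what makes the difference visible: an exact match can only come from the tool result.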
Meet Gemma 4 12B!
A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license.
Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇
Qwen3.6 35B A3B can't fill out a paper form on its own. But give it NVIDIA's LocateAnything-3B — the #1 trending model on HuggingFace — as its eyes, and the two small models get it done together.
(The test: place each element at the right pixel position on a blank form image, not type into a field.)
Setup:
> Qwen is the brain (main model), LocateAnything is the eyes (helper model acting as a tool).
> I gave Qwen a new tool: ask "where's the email field?" and LocateAnything returns the exact x, y, width, height.
> The blue boxes on the screen are its detections. Look how tight they are — it nails every field.
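For the curious, a minimal sketch of what a "locate" tool like this could look like. The post does not show the real schema, so everything here is an assumption: the JSON-schema function-calling layout, the names `locate_region`, `Box`, and `handle_locate_call`, and the `locator` callable that stands in for whatever actually runs LocateAnything-3B.

```python
from dataclasses import asdict, dataclass

# Tool description in the common JSON-schema function-calling style (assumed).
LOCATE_TOOL = {
    "type": "function",
    "function": {
        "name": "locate_region",
        "description": "Return the bounding box of a named area in the current image.",
        "parameters": {
            "type": "object",
            "properties": {
                "description": {"type": "string", "description": "e.g. 'email field'"},
            },
            "required": ["description"],
        },
    },
}


@dataclass
class Box:
    x: int
    y: int
    width: int
    height: int

    def center(self):
        """Pixel at the middle of the box, e.g. where to place text or click."""
        return (self.x + self.width // 2, self.y + self.height // 2)


def handle_locate_call(arguments, image, locator):
    """Run the helper model and return a JSON-serializable tool result.

    locator is a hypothetical callable (image, description) -> Box or None.
    """
    box = locator(image, arguments["description"])
    if box is None:
        return {"found": False}
    return {"found": True, **asdict(box)}
```

Returning an explicit `found` flag gives the main model something to react to when the helper misses, instead of a guessed box.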
Result:
> Qwen3.6 35B A3B + LocateAnything-3B: form completed, all info correct.
> Name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code: all landed in the right field areas.
> Character-box alignment still a touch loose, but every value is where it belongs.
> 9m10s, 224.5k input, 24.3k output, 21 turns.
Why it matters:
> Qwen alone can't finish this test. Bolt on a 3B model that does exactly one thing (locate) and suddenly it can.
> A combination of small models can do the work of a single large one.
A bit of background: I started this project 60 days ago (going by the Codex session history I saw when I ran /resume today).
Back then I tried using Qwen3.5 0.8B to identify screen regions, but no matter how many grid guidelines I overlaid on the screenshot, it just couldn't do what I wanted.
Today, with the locate-region task delegated to LocateAnything-3B, it suddenly all feels possible again.
I explored a further possibility with local models: Qwen3.6 35B A3B + NVIDIA LocateAnything-3B as a local Computer Use agent (proof of concept).
In the demo, I asked it to switch my Mac to light mode. It did. Then back to dark. Did that too — finding the right toggle in System Settings, clicking it, and verifying the change itself.
It's fully screenshot-based, so no Accessibility API needed. If it's on screen, the agent can see it and act on it. This runs entirely on your own hardware — private, local, built from two small open models.
@largePrawn Agreed! In the form-filling context, the data is likely confidential/private, so having the capability to run all of this locally is a key advantage.
@pipobarraca @dison_franco @infosecatrandom I added a tool for this test that takes the name/description of the area the main model wants to inspect and triggers LocateAnything to return that area's x, y, width, and height. The main model calls the tool whenever it needs to.