stevibe

Verified account

@stevibe

LLM. Local AI addict. Building @BenchLocalAI. Builds things nobody asked for. Benchmarks things for fun.

Joined July 2009

1.3K Following

22.7K Followers

3.5K Posts

Pinned Tweet

2 months ago

Which local models can actually handle tool calling? I built a framework to find out.
15 scenarios. 12 tools. Mocked responses. Temperature 0. No cherry-picking.
Tested every Qwen3.5 size from 0.8B to 397B, and since some of you asked after the distillation tests: yes, I included Jackrong's Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled too.
Only two models went all green: the 27B dense and the distilled 27B. The 397B? Failed two tests. The 122B? Failed one. The 35B? Failed two.
The timed-out results, mostly on the smaller models, are cases where the model got stuck in a loop, repeating the same tool call until it hit the 30-second limit.
The test that exposed the most models: "Search for Iceland's population, then calculate 2% of it." Simple, but 35B, 122B, and 397B all used a rounded number from memory instead of the actual search result. They didn't trust their own tool output.
Small models hallucinate data. Big models ignore data. The 27B just threaded it through.
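For readers who want to picture the Iceland check, here is a minimal sketch of how a grader for that scenario might look. It is not the framework's actual code: the `ToolCall` shape, the tool names `search` and `calculate`, the `expression` argument, and the mocked population figure are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical mocked figure returned by the fake search tool.
# Chosen only for illustration; it is not real census data.
MOCK_POPULATION = 372_520


@dataclass
class ToolCall:
    """One tool call emitted by the model (assumed shape)."""
    name: str
    args: dict = field(default_factory=dict)


def grade_threading(calls):
    """Check that the model fed the search result into the calculation.

    Passes only if a `calculate` call comes after a `search` call and its
    expression contains the exact mocked number, not a rounded one.
    """
    searched = False
    for call in calls:
        if call.name == "search":
            searched = True
        elif call.name == "calculate":
            if not searched:
                return "fail: calculated before searching"
            # Normalize digit separators so "372,520" also matches.
            expr = str(call.args.get("expression", ""))
            expr = expr.replace(",", "").replace("_", "")
            if str(MOCK_POPULATION) in expr:
                return "pass"
            return "fail: ignored tool output"
    return "fail: no calculate call"
```

Because the search response is mocked, a model that answers from memory with a rounded figure cannot match the exact number, which is the failure described above for the 35B, 122B, and 397B runs.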

114

2K

255

2K

421K

stevibe retweeted

about 6 hours ago

Meet Gemma 4 12B! A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license. Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇

https://t.co/gf4FZv0WZb

225

7K

932

3K

1M

about 6 hours ago

@ronaldmannak Mac Studio M5 Ultra please 🙏

2

7

0

0

343

about 7 hours ago

@vr8vr8 On it!

0

0

0

0

96

1 day ago

Qwen3.6 35B A3B can't fill out a paper form on its own. But give it NVIDIA's LocateAnything-3B (the #1 trending model on HuggingFace) as its eyes, and the two small models get it done together. (The test: place each element at the right pixel position on a blank form image, not type into a field.)
Setup:
> Qwen is the brain (main model), LocateAnything is the eyes (helper model acting as a tool).
> I gave Qwen a new tool: ask "where's the email field?" and LocateAnything returns the exact x, y, width, height.
> The blue boxes on the screen are its detections. Look how tight they are: it nails every field.
Result:
> Qwen3.6 35B A3B + LocateAnything-3B: form completed, all info correct.
> Name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code: all landed in the right field areas.
> Character-box alignment still a touch loose, but every value is where it belongs.
> 9m10s, 224.5k input, 24.3k output, 21 turns.
Why it matters:
> Qwen alone can't finish this test. Bolt on a 3B model that does exactly one thing, locate, and suddenly it can.
> A combination of small models can do the work of a single large one.
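The "helper model as a tool" setup can be sketched roughly as follows. This is an assumption-heavy illustration, not the author's code: the tool name `locate_region`, the schema layout (written in the common function-calling style), and the `fake_locator` stand-in are hypothetical, and nothing here reflects LocateAnything-3B's real interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Box:
    """Bounding box in image pixels: top-left corner plus size."""
    x: int
    y: int
    width: int
    height: int

    def center(self):
        # Midpoint of the box, a natural anchor for placing text.
        return (self.x + self.width // 2, self.y + self.height // 2)


# Hypothetical tool description handed to the main model.
LOCATE_TOOL_SCHEMA = {
    "name": "locate_region",
    "description": "Find a named area on the form image and return its box.",
    "parameters": {
        "type": "object",
        "properties": {"description": {"type": "string"}},
        "required": ["description"],
    },
}


def make_locate_tool(locator: Callable[[bytes, str], Box]):
    """Wrap any locator (e.g. a helper vision model) as a tool handler."""
    def locate_region(image: bytes, args: dict) -> dict:
        box = locator(image, args["description"])
        return {"x": box.x, "y": box.y, "width": box.width, "height": box.height}
    return locate_region


def fake_locator(image: bytes, description: str) -> Box:
    """Stand-in for the helper model; returns a fixed box for any query."""
    return Box(x=100, y=200, width=300, height=40)
```

In this sketch the main model only ever sees the returned x, y, width, and height, so the helper can be swapped for a different locator without changing the tool contract.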

82

2K

260

3K

134K

about 7 hours ago

@JTBeers Let me check the model card, thank you!

0

1

0

0

11

about 7 hours ago

@mr_r0b0t @NVIDIAAI NVIDIA released a super useful model, incredible!

1

4

0

0

174

about 7 hours ago

A bit of background: I started this project 60 days ago (according to the Codex session history I saw when I ran /resume today). Back then I tried using Qwen3.5 0.8B to identify screen regions, but no matter how many grid guidelines I overlaid on the screenshot, it just couldn't do what I wanted. Today, delegating the locate-region task to LocateAnything-3B, it suddenly all feels possible again.

1

23

0

2

1K

about 8 hours ago

I explored a further possibility with local models: Qwen3.6 35B A3B + NVIDIA LocateAnything-3B as a local Computer Use agent (proof of concept). In the demo, I asked it to switch my Mac to light mode. It did. Then back to dark. Did that too — finding the right toggle in System Settings, clicking it, and verifying the change itself. It's fully screenshot-based, so no Accessibility API needed. If it's on screen, the agent can see it and act on it. This runs entirely on your own hardware — private, local, built from two small open models.
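A screenshot-based loop like the one described (look, decide, locate, click, verify) might be structured along these lines. This is only a sketch under stated assumptions: `run_agent` and all of its injected callables are hypothetical names, and the real repo may organize things quite differently.

```python
def run_agent(goal, take_screenshot, decide, locate, click, max_turns=10):
    """Drive a screenshot-only agent until the main model reports success.

    take_screenshot() -> image          current screen contents
    decide(goal, image) -> dict         main model: {"type": "done"} or
                                        {"type": "click", "target": "<description>"}
    locate(image, target) -> (x, y, w, h)   helper model finds the target
    click(x, y)                         performs the mouse click
    """
    for _ in range(max_turns):
        image = take_screenshot()
        action = decide(goal, image)
        if action["type"] == "done":
            # The model verified the goal from a fresh screenshot.
            return True
        if action["type"] == "click":
            x, y, w, h = locate(image, action["target"])
            # Click the middle of the detected box.
            click(x + w // 2, y + h // 2)
    # Turn budget exhausted without verification.
    return False
```

Taking a new screenshot at the top of every turn is what lets the agent check its own work: the "done" decision is always made against the screen as it looks after the last click.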

24

431

43

392

24K

about 7 hours ago

@CharlesRDog Posted in the comments!

0

0

0

0

9

about 7 hours ago

Repo here https://t.co/iWZ19g5JHD

2

37

1

34

1K

about 8 hours ago

@CharlesRDog Please wait, I am packing it up!

0

6

0

0

580

about 12 hours ago

@iam_Dyeus I'm going to share another exploration next! Something related.

0

0

0

0

161

about 12 hours ago

@JakoveHr Yes, there are so many small models out there nowadays, worth trying the combinations.

0

2

0

0

236

about 22 hours ago

@u1tra_instinct @AgentSparko Glad you liked it!

0

2

1

0

355

about 22 hours ago

@largePrawn Agreed! In the form-filling context, the data is likely confidential/private, so having the capability to run all of this locally is a key advantage.

0

7

0

0

1K

about 22 hours ago

@shopqit True, looking forward to the Nemotron 3 Ultra too!

0

0

0

0

846

about 22 hours ago

@pipobarraca I added a tool for this test that requires the name/description of the area the main model wants to inspect, and it will trigger LocateAnything to return the x, y, width, and height of that area. So the main model will use the tool when needed.

0

2

0

0

351

about 22 hours ago

@dison_franco I added a tool for this test that requires the name/description of the area the main model wants to inspect, and it will trigger LocateAnything to return the x, y, width, and height of that area. So the main model will use the tool when needed.

0

1

0

0

230

about 22 hours ago

@muhammad_o7 I'm sure we'll have that soon! @ivanfioravanti @Prince_Canuma

2

3

0

0

907

about 22 hours ago

@infosecatrandom I added a tool for this test that requires the name/description of the area the main model wants to inspect, and it will trigger LocateAnything to return the x, y, width, and height of that area. So the main model will use the tool when needed.

0

0

0

0

229
