@atl_flaneur @0xSero Tailscale SSH handles everything for you. No SSH keys to add. It just works, and it works great with Hermes agent. I often ask my agent to SSH in and install stuff or change things on remote machines.
@mr_r0b0t @NVIDIAAI @NousResearch How does Nemotron-3-Ultra compare to mimo-v2.5-pro? I am happy on mimo but constantly looking for something better. Worth the switch?
BenchLocal 0.3.0 is out. It introduces a second type of benchpack, and it opens up a lot.
Until now, every benchpack was a table type: fixed inputs, expected outputs, scored field by field. That's perfect for extraction, tool-calling, and reasoning suites, but it quietly boxed every benchmark into whatever fits in a row and a column.
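To make "scored field by field" concrete, here is a minimal sketch of that style of scoring. The names (`FieldScore`, `score_row`, `accuracy`) and the sample data are hypothetical illustrations, not BenchLocal's actual scorer or schema.

```python
from dataclasses import dataclass


@dataclass
class FieldScore:
    field: str
    expected: object
    actual: object
    correct: bool


def score_row(expected: dict, actual: dict) -> list[FieldScore]:
    """Compare a model's output to the expected output, one field at a time."""
    return [
        FieldScore(name, want, actual.get(name), actual.get(name) == want)
        for name, want in expected.items()
    ]


def accuracy(scores: list[FieldScore]) -> float:
    """Fraction of expected fields the model got exactly right."""
    return sum(s.correct for s in scores) / len(scores) if scores else 0.0


# Hypothetical extraction task: one row of fixed expected outputs.
expected = {"name": "Ada Lovelace", "year": 1843, "country": "UK"}
actual = {"name": "Ada Lovelace", "year": 1842, "country": "UK"}

scores = score_row(expected, actual)
for s in scores:
    print(s.field, "ok" if s.correct else "wrong")
print(accuracy(scores))
```

Exact-match comparison like this works well when the answer is a handful of values, which is also why it struggles with anything visual or interactive.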
Today I'm adding a whole new kind: web type benchpacks.
A web type benchpack is, at its core, a web app. That sounds like a small thing. It isn't. It means a benchmark can now render anything a browser can: canvases, forms, diagrams, interactive UIs, and ask a model to actually operate in that space. Visual tasks, spatial reasoning, multi-step tool use: the kinds of capabilities that never fit neatly into a table.
The first one is live: FormSight.
You may have seen me running it on the latest model drops: the test where a multimodal model has to fill out a real paper form. Not "read the form." Place every character in the right box, tick the right checkboxes, get the alignment right. Vision + spatial reasoning + tool use + long context, all at once. It's brutal, and it's been the most revealing test I've built. Now anyone can run it themselves on BenchLocal 0.3.0.
Here's the part I'm most proud of, and the principle behind the entire web type design:
Your API credentials never leave your machine.
With benchlocal/web-sdk, a website can talk to your local BenchLocal. The site sends an inference request → BenchLocal runs it locally → the result goes back to the site. The web app drives the benchmark; your machine does the inference. Keys stay home.
That solves two things people keep running into:
> Local models that are a pain to expose to a web app: BenchLocal bridges them.
> Cloud models you'd never want to hand an API key to a third-party site: now you don't have to.
So you get the full flexibility of the web (build essentially any test you can imagine: visual, diagrammatic, interactive) with the safety of local-only credentials.
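The request flow described above (site sends a request, the local side runs inference, only the result goes back) can be sketched roughly like this. This is an offline simulation under stated assumptions: `local_bridge`, `LOCAL_CREDENTIALS`, `fake_run_model`, and the payload shape are all hypothetical and are not the benchlocal/web-sdk API.

```python
from typing import Callable

# Hypothetical names throughout; this is not the benchlocal/web-sdk API.
# The credential store lives only on the user's machine.
LOCAL_CREDENTIALS = {"example-model": "sk-local-secret"}


def local_bridge(request: dict, run_model: Callable[[str, str, str], str]) -> dict:
    """Handle an inference request from a web app without exposing credentials."""
    model = request["model"]
    api_key = LOCAL_CREDENTIALS[model]  # looked up locally, never sent to the site
    output = run_model(model, api_key, request["prompt"])
    return {"model": model, "output": output}  # the response carries the result only


def fake_run_model(model: str, api_key: str, prompt: str) -> str:
    """Stand-in for a real inference call so the sketch runs offline."""
    assert api_key  # the key is available here, on the local side
    return prompt.upper()


# What the web app sends: no key anywhere in the payload.
site_request = {"model": "example-model", "prompt": "fill in box 3"}
site_response = local_bridge(site_request, fake_run_model)
print(site_response)
```

The point of the design is visible in the two dictionaries: neither the request the site builds nor the response it receives ever contains the key.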
This is the foundation. FormSight is the first web type benchpack, and I can already picture visual-reasoning suites, diagram tasks, and UI-operation tests built the same way.
BenchLocal is open source, MIT licensed. Try 0.3.0, run FormSight, and tell me what breaks.
Implicit caching is now live on Qwen3.7-Max: it kicks in automatically, no setup needed.
⚡️ Faster + cheaper out of the box.
Need higher, more deterministic hit rates? Try explicit caching instead.
Best practices: https://t.co/3hSs6zquBH
@HermesAgentTips @AnthropicAI If you treat your developer community poorly, they will remember. It does not matter if you have a shiny new thing to show them. Once an asshole, always an asshole. Just like their logo.