Transparent, interoperable FOSS AI is the great equalizer. I invest//advocate//build for developers accelerating progress toward expanded human potential
the AI diffusion bottleneck is reliability. not capability.
most teams don't have the resources to measure agents.
the right way to transition to agents safely is open evals infrastructure. that's what @silverstreamAI @ServiceNowRSRCH @nvidia @IBM @thealliance_ai are doing
Bench for Claude Code was #1 Product of the Day on Product Hunt
Also featured in the @ProductHunt newsletter
Thanks to everyone who supported us
Built to store, review, and share your Claude Code sessions
More coming soon
Agents touch real systems: databases, APIs, permissions, configs.
We've been using Bench to give customers curated traces and get full observability into what our agents did and why. Share traces directly in PRs.
Now we're releasing it. Live on Product Hunt today.
https://t.co/DLAlvH69jH
At GTC, the same question kept coming up:
Is there a way to track what Claude Code does and share it?
Tomorrow, we're launching the answer on Product Hunt.
The @silverstreamAI and @ServiceNowRsch teams built the infrastructure and observability, we host a managed visualization layer compatible with CUBE: https://t.co/hebBPv3zgm
if you're running agents in production right now, what has stopped you from creating broader evals?
right now every team builds eval infra from scratch. no way to compare results across models. no standard way to measure failure. every team starts from a vibecoded shell.
wrap a benchmark once, run it everywhere. no custom integration. built on MCP, Gym and @opentelemetry. not another benchmark. infrastructure for all of them.
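a rough sketch of what "wrap once, run everywhere" can look like with a Gym-style reset/step contract. everything here (`Benchmark`, `EchoBenchmark`, `run_episode`) is a hypothetical name made up for illustration, not the CUBE API, and the MCP and OpenTelemetry layers are left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Benchmark(Protocol):
    """Hypothetical Gym-style contract: any wrapped benchmark exposes reset/step."""

    def reset(self) -> str: ...

    def step(self, action: str) -> tuple[str, float, bool]: ...


@dataclass
class EchoBenchmark:
    """Toy benchmark (illustration only): the agent must answer with the target word."""

    target: str = "ping"

    def reset(self) -> str:
        # The initial observation tells the agent what to say.
        return f"say {self.target}"

    def step(self, action: str) -> tuple[str, float, bool]:
        # One-step episode: reward 1.0 for the right word, 0.0 otherwise.
        reward = 1.0 if action == self.target else 0.0
        return "done", reward, True


def run_episode(bench: Benchmark, agent: Callable[[str], str], max_steps: int = 10) -> float:
    """Generic runner: works with any object satisfying the Benchmark contract."""
    obs = bench.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done = bench.step(agent(obs))
        total += reward
        if done:
            break
    return total
```

the runner only depends on the contract, so a new benchmark needs one wrapper and no changes to the runner or the agent.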
we entered the 90s, the age of optical flow
Lucas-Kanade works!
dense opt flow didn't :(
ego motion: floor features move 80px downward, distant windows move 40px upward
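for reference, a minimal single-window Lucas-Kanade sketch in NumPy. it is not the code behind the posts above: the function name, window size, and the single-level (no pyramid) setup are assumptions for illustration. a single level only holds for small, roughly sub-pixel to few-pixel motion, so displacements like 80px would need a coarse-to-fine pyramid on top of this.

```python
import numpy as np


def lucas_kanade_window(prev, curr, y, x, half=10):
    """Estimate flow (dx, dy) for the window centred at row y, column x.

    Positive dy means motion toward larger row indices (downward in the image).
    Assumes small motion and a window fully inside the image.
    """
    prev = prev.astype(np.float64)
    curr = curr.astype(np.float64)

    # Spatial gradients of the first frame; np.gradient returns (d/drow, d/dcol).
    Iy, Ix = np.gradient(prev)
    # Temporal gradient between the two frames.
    It = curr - prev

    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))

    # Brightness constancy per pixel: Ix*dx + Iy*dy = -It, solved in least squares.
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    b = -It[win].ravel()
    flow, *_ = np.linalg.lstsq(A, b, rcond=None)
    return flow  # array [dx, dy]
```

the least-squares solve is only well conditioned when the window has gradient in both directions (corners, texture), which is one reason sparse tracking on good features tends to behave better than solving at every pixel.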
@LiTianleli We've been evaluating Grok on enterprise tools (ServiceNow, Oracle, Odoo, obscure high-revenue SaaS) with full computer use and DOM replay. Grok has lower FP for task refusals (good TP), but pixel control is lagging.
happy to share finetuning data if useful.