Really happy to share that “ToolFailBench” got accepted at two ICML 2026 workshops, FAGEN and AIWILD.
Most benchmarks evaluate tool-using agents with a single aggregate success rate, but that number can’t explain why a model actually fails. ToolFailBench is a diagnostic benchmark that scores tool use against a failure taxonomy instead of one number, breaking each trace into four distinct failure modes: skipping a tool that was needed, ignoring what a tool returns, fabricating tool outputs, and over-calling tools when none is needed. We find that models with similar aggregate scores fail in very different ways, so a single number isn’t enough to compare agents.
What a great way to end the day!
Had such an amazing time at the @GoogleDeepMind Day at @agihouse_org Hillsborough! Huge thanks for hosting this event. Got to meet so many incredible people, with great panel talks, including one with Sergey Brin!
hi! i’m a recent Berkeley grad looking for roles across evals, ops, or early-stage biz dev.
most recently i was an AI engineer at the Center for AI Safety. before that i built BASIS at Berkeley, now one of the largest student AI safety communities (secured a $257k grant from CG).
especially interested in AI for security and science, and in frontier research problems.
would love intros, pointers, or to hear about any companies i should talk to.
thanks!
most of what we know about emergent misalignment comes from text-based models. but the models actually being deployed as agents are multimodal — and that's been largely overlooked.
vision-language models are quickly becoming the substrate for real-world agents. fine-tuning them on a narrow harmful dataset triggers broader misalignment that generalizes across unrelated tasks and modalities. and text-only safety evals miss most of it.
in our ICLR 2026 (workshop) paper, we show that misalignment scales with LoRA rank, concentrates in ~10 dimensions of activation space, and persists even after efforts to reverse it.
stay grounded, keep cooking
the ai hype cycle is wild, and we all feel some anxiety — every week there's some new model that's supposedly gonna change everything and everyone's scrambling to keep up
but here's the thing: the fundamentals don't change. good design is still good design. solving real human problems, finding the ideal systems, making the best tools are still what matters.
same with competition: know what they're building, but don't watch too closely. the anxiety comes from thinking you have to chase every trend or copy every feature or you'll get left behind.
most trends are just noise. most competitor moves are just reactions to other reactions.
focus on what's always been true: understand your systems and users, solve real problems, build quality stuff with care. stay grounded in your values and let the technology serve your vision, not the other way around.
the best builders aren't the ones following every ai paper on twitter or obsessing over what others ship. they're the ones using whatever tools help them build the best thing for people.
make your own best thing. everything else is distraction.
don’t build slot machines
don’t fake humans
don’t hide the messy truths
don’t create black boxes
don’t make people feel stupid
don’t extract value or attention
don’t optimize for vanity metrics
don’t gatekeep knowledge
don’t make tools that divide
don’t sacrifice agency for convenience
don’t hold opinions
build tools that teach
build systems that reveal
build for human curiosity, not clicks
build bridges, not walls
build for the commons
build for every unique being
build to amplify thought
build for the person you once were
build for questions we haven’t yet asked
build tools that extend imagination
build with love for humanity, for the universe we live in
Thank you, Berkeley, for the memories, the people, and the energy.
From living at Arcadia and being a part of a wild and crazy community to meeting new faces daily, exploring new spots, and ending it all with the best house party I’ve been to.
This was all one crazy experience!
Until next time, Berkeley…
Next stop SF.