Unless you are operating at very large scale, prefix-aware routing is fairly easy for most voice agent use cases. You can cache the system instruction on agent start, while the rest of the pipeline is spinning up. After that, sticky HTTP connections are all you need. There are fancier ways to do it. But they probably aren’t necessary!
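For anyone who does want one of the "fancier" options, here is a minimal sketch of hash-based session affinity. It is a different technique from plain sticky HTTP connections: it pins a session to a replica explicitly instead of relying on connection reuse. The function name, session ID, and backend URLs are all hypothetical, not taken from any particular serving stack.

```python
import hashlib


def pick_backend(session_id: str, backends: list[str]) -> str:
    """Map a session ID to one backend, deterministically.

    Every turn of the same session lands on the same replica, so the
    system prompt cached there at agent start keeps getting reused.
    """
    if not backends:
        raise ValueError("need at least one backend")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer bucket selector.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Hypothetical inference replicas.
backends = ["http://llm-0:8000", "http://llm-1:8000", "http://llm-2:8000"]

first_turn = pick_backend("call-1234", backends)
later_turn = pick_backend("call-1234", backends)
```

Note that plain modulo hashing reshuffles most sessions whenever the backend list changes. That is usually fine for short voice calls, which is part of why the simpler approach tends to be enough.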
@beknabdik This is just not true. Did you look at the benchmark? It is designed to test the “big stable system prompt” plus multi-turn. Cache hit rate should be >85%. So part of what this benchmark tests is how efficient cached prefill is.
This weekend, join us in SF for our 4th WeaveHacks hackathon!
Sponsored by @OpenAIDevs for the first time (+ @dkundel judging!), @cursor_ai, @Redisinc, and @CopilotKit. Hackers will get over $150 in credits to build multi-agent orchestration systems, plus over $15K in prizes!
Looking at these numbers last night, I realized that I need to update the TTFT numbers for Nemotron 3 Super and Nemotron 3 Nano. The numbers in the aiewf-evals table are from the launch day for each of those models, and vLLM/SGLang now have much better support for hybrid/Mamba prefix caching, plus more optimized kernels for Blackwell!
@slobkebap Let me know how it goes and if you have any issues. There’s both a Python and a (more optimized) C++ server in the repo. The C++ server only supports the en model but I can fix that this weekend!
https://t.co/8RP0CEZmYx
@stolsvik 40 languages. Here’s the screenshot from the model card. (Also linked to from the post.) What languages do you need for what you’re building?
More information ...
ASSERT launch post: https://t.co/1F7dJutwDR
ASSERT on GitHub: https://t.co/9tEJjB1fHx
Microsoft Build session about ASSERT, evals, and security governance: https://t.co/KTEgusI3Mb
Microsoft announced a bunch of interesting new AI models and tools this week. Model launches always get lots of attention. But don't sleep on the new ASSERT evals framework that launched today.
I'm on record as arguing that 2026 is the year of evals.
Evals are the glue for all the "jobs to be done" at every level of AI: model training; testing and deciding on what models to use and how to use them; and testing and improving AI agents in production.
Evals unify our work on those different layers of the stack.
These days, when we talk about evals, observability, and testing, we're talking about overlapping parts of a large set of tools we're still early on in figuring out.
As the AI engineering ecosystem matures, diversifies, and increases massively in scale, we really, really need good evaluation (observability, monitoring, testing, data management) frameworks.
I got a chance to test the new Microsoft ASSERT evals framework before it was released, and it has some very nice core ideas.
1) ASSERT is open in two important ways. First, the team is serious about broad support for models, frameworks, and use cases. Microsoft spent time understanding voice agent use cases and building Pipecat support, for example. Second, the code is completely open source, released under an open MIT license.
2) We're all working in and with agentic coding tools today. That means we are planning in natural language, and all of our software development and ops tools have to evolve for these new natural-language workflows. ASSERT takes descriptions of desired agent behavior and generates specifications for the ASSERT suite of tools to run against.
In a world where "English is the programming language," how we actually make natural language "code" precise enough and repeatable enough is perhaps the big unsolved tooling problem that all of us are working towards in different ways. This is true whether we work on coding agents, AI ops tooling, orchestration frameworks, or vertical applications.
3) Microsoft describes ASSERT as a policy-driven framework. Rather than eval against generic performance metrics, ASSERT aims to generate stable but adaptable evaluation criteria for specific agents.
"Policy-driven" also implies a full loop design. Policy (generated from specific requirements) -> evaluation -> optimization -> monitoring in production -> improving the policy description -> evaluation -> ...
4) Enterprise agents need to be evaluated along many dimensions: task completion, individual conversation turn behavior, latency, mode-specific metrics like audio disfluencies, and safety/security. Microsoft designed ASSERT to be used together with a new safety governance toolkit called Agent Control Specification.
5) Finally, ASSERT is integrated into the Microsoft Foundry ecosystem. Today, AI engineering tools have to be open source and vendor neutral to get attention from developers and gain widespread adoption. *And* it's equally important to give enterprise customers tools that work as a coherent stack.
This is hard to do well. There are real tensions between open source development and engineering a great full stack developer experience. However, if you sweat the details on both ends, you benefit from a full spectrum of feedback about real-world development pain points. It's more work, but it's worth it!
Kudos to Microsoft for embracing this and committing to an open, community-oriented approach, plus doing the extra work to build the full stack for enterprise customers.
Fantastic commentary on the Microsoft MAI tech report released yesterday. Wonderful to see this level of detail about a large scale training effort, and this kind of super helpful analysis from @eliebakouch.
microsoft MAI tech report is a gold mine, one of the most transparent for a model at this scale.
this model uses zero synthetic data or distillation from previous models. this means reasoning, agentic behavior, tool use are all learned fully during post-training with no cold start. bold choice that makes it harder and requires more iterations to reach sota, but you get FULL control over your model series and it proves they are serious about being a frontier lab.
the tech report is insanely detailed and precise about numbers. to give an example, they give the exact MFU across all the iterations of the model, with the exact changes etc. they also share the full scaling ladder recipe, to my knowledge this is the first time i've seen this in a tech report at this scale
let's look at all of this in this likely very long thread 🧵
Arresting the downhill slide into sloppification of generally cited AI benchmarks is possibly one of the most important things someone thoughtful, stubborn, and well connected could work on today.
We don't talk enough about how the marketing of a typical model release today focuses on benchmarks that are somewhere between useless and actively misleading.
The incentives here are totally understandable. But the impact on every part of the ecosystem is significant. The most useful hills are not climbed. Adoption of genuinely good models for tasks they are well suited for is slower than it should be. Teams struggle to turn a great POC into a production agent deployed at scale.
This might not be solvable. Perhaps it's one of those things it's better to be sanguine about. See also, "the optimal amount of fraud is not zero," and "democracy is the worst form of government except for all the others." But I hope someone like @willdepue tilts at this windmill.
i know the solution to the AI benchmark problem but nobody is gonna like it
it’s easy: just report test perplexity on uncontaminated high-quality code/lang/etc
you give me base model api. i run on my secret dataset. i give you test ppl. all evals are downstream of that. solved
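For reference, test perplexity is the exponential of the average negative log-likelihood per token: PPL = exp(-(1/N) * sum(log p_i)). Here is a minimal sketch of that calculation, assuming the base model API returns per-token natural-log probabilities for the held-out text. The function name and the sample values are hypothetical.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Lower is better: a model that assigns every token probability 1/k
    scores a perplexity of exactly k.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical example: four tokens, each assigned probability 0.25.
sample_logprobs = [math.log(0.25)] * 4
sample_ppl = perplexity(sample_logprobs)
```

Comparing perplexity across models only makes sense when they share a tokenizer, since the per-token average depends on how the text is split. Otherwise the score would need to be normalized per byte or per character.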