kwindla

@kwindla

Infrastructure and developer tools for real-time voice, video, and AI. @trydaily // ᓚᘏᗢ // @pipecat_ai

San Francisco, CA

Joined September 2008

3.9K Following

14.6K Followers

6.4K Posts

kwindla

@kwindla

about 1 hour ago

Unless you are operating at very large scale, prefix-aware routing is fairly easy for most voice agent use cases. You can cache the system instruction on agent start, while the rest of the pipeline is spinning up. After that, sticky HTTP connections are all you need. There are fancier ways to do it. But they probably aren’t necessary!
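
A minimal sketch of the idea in the post above, not anyone's production code: warm the prefix cache with the system instruction at agent start, then keep every request for that session on the same backend. `FakeBackend`, `StickyRouter`, the backend names, and the session ID are all hypothetical stand-ins; a real deployment would get stickiness from its HTTP connection or load balancer rather than from a hash like this.

```python
import hashlib


class FakeBackend:
    """Hypothetical stand-in for one inference server with a prefix cache."""

    def __init__(self, name):
        self.name = name
        self.cached_prefixes = set()

    def generate(self, system_prompt, turns):
        # A real server matches token prefixes; this toy treats the whole
        # system prompt as a single cacheable prefix.
        hit = system_prompt in self.cached_prefixes
        self.cached_prefixes.add(system_prompt)
        return {"backend": self.name, "prefix_cache_hit": hit}


class StickyRouter:
    """Pins each session to one backend, imitating a sticky HTTP connection."""

    def __init__(self, backends):
        self.backends = backends

    def pick(self, session_id):
        # Deterministic hash, so the same session always lands on the same backend.
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.backends)
        return self.backends[index]

    def warm_up(self, session_id, system_prompt):
        # Called at agent start, while the rest of the pipeline spins up.
        return self.pick(session_id).generate(system_prompt, turns=[])

    def send(self, session_id, system_prompt, turns):
        return self.pick(session_id).generate(system_prompt, turns)


SYSTEM_PROMPT = "You are a helpful voice agent."
router = StickyRouter([FakeBackend("gpu-0"), FakeBackend("gpu-1"), FakeBackend("gpu-2")])
warm = router.warm_up("call-123", SYSTEM_PROMPT)
first_turn = router.send("call-123", SYSTEM_PROMPT, ["Hi!"])
```

In this toy, the warm-up request misses the cache and the first real turn hits it, because both requests are routed to the same backend.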

kwindla

@kwindla

about 6 hours ago

https://t.co/dZw9EWdoGt

kwindla

@kwindla

about 2 hours ago

@beknabdik This is just not true. Did you look at the benchmark? It is designed to test the “big stable system prompt” plus multi-turn. Cache hit rate should be >85%. So part of what this benchmark tests is how efficient cached prefill is.
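
A rough back-of-the-envelope sketch of why a big stable system prompt plus multi-turn history should give a high cache hit rate. The token counts are made-up illustrative values, not numbers from the benchmark, and the model is simplified: it assumes every request resends the full history and that everything sent on the previous request is still cached.

```python
def cached_fraction(system_tokens, tokens_per_turn, num_turns):
    """Fraction of prompt tokens served from the prefix cache over a conversation.

    Assumptions (illustrative only): the system prompt is warmed before turn 1,
    each turn appends `tokens_per_turn` new tokens, and the whole previous
    prompt is a cache hit on the next request.
    """
    total = 0
    cached = 0
    for turn in range(1, num_turns + 1):
        prompt = system_tokens + turn * tokens_per_turn
        reused = system_tokens + (turn - 1) * tokens_per_turn
        total += prompt
        cached += reused
    return cached / total


# Hypothetical conversation: 5,000-token system prompt, 100 new tokens per turn, 30 turns.
rate = cached_fraction(system_tokens=5000, tokens_per_turn=100, num_turns=30)
print(f"cached fraction: {rate:.1%}")
```

Under these assumptions only the newest turn is ever uncached, so the fraction stays well above 85% whenever the system prompt dominates the prompt length.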

kwindla

@kwindla

about 2 hours ago

@mehuljindal18 I need to spend more time with the big Qwen 3.5! What are you using it for?

kwindla

@kwindla

about 4 hours ago

Hang out with @altryne and crew this weekend in SF!

Alex Volkov

@altryne

3 days ago

This weekend, join us in SF for our 4th WeaveHacks hackathon!

Sponsored by @OpenAIDevs for the first time (+ @dkundel judging!), @cursor_ai, @Redisinc and @CopilotKit. Hackers will get over $150 in credits to build multi-agent orchestration systems + over $15K in prizes! https://t.co/Co2KuHzGIy

kwindla

@kwindla

about 6 hours ago

Looking at these numbers last night, I realized that I need to update the TTFT numbers for Nemotron 3 Super and Nemotron 3 Nano. The numbers in the aiewf-evals table are from the launch day for each of those models, and vLLM/SGLang now have much better support for hybrid/Mamba prefix caching, and more optimized kernels for Blackwell!

kwindla

@kwindla

about 6 hours ago

Hoisting this Pareto frontier chart out of the article.

kwindla

@kwindla

about 7 hours ago

@LazyCoda I don't think it's actually quicker than that.

kwindla

@kwindla

about 13 hours ago

https://t.co/9b5cQpa2lq

kwindla

@kwindla

about 10 hours ago

@FurkanGozukara Yes, definitely, for low latency use cases (voice agents). What are you using these models for?

kwindla

@kwindla

about 11 hours ago

@slobkebap Let me know how it goes and if you have any issues. There’s both a Python and a (more optimized) C++ server in the repo. The C++ server only supports the en model but I can fix that this weekend! https://t.co/8RP0CEZmYx

kwindla

@kwindla

about 11 hours ago

@FurkanGozukara Yes, much better. Whisper is a very old model at this point. Nemotron ASR models are both much more accurate and much faster.

kwindla

@kwindla

about 11 hours ago

@stolsvik 40 languages. Here’s the screenshot from the model card. (Also linked to from the post.) What languages do you need for what you’re building?

kwindla

@kwindla

about 12 hours ago

@rithm84 This is a continuation of the Parakeet work, from the same (great) team at NVIDIA!

kwindla

@kwindla

about 12 hours ago

@altryne Big launch day today! I have low latency benchmarks and code for Ultra, too, but don’t want to post everything all at once. 😁

kwindla

@kwindla

1 day ago

@eve_bouff minority report!

kwindla

@kwindla

1 day ago

More information ... ASSERT launch post: https://t.co/1F7dJutwDR ASSERT on GitHub: https://t.co/9tEJjB1fHx Microsoft Build session about ASSERT, evals, and security governance: https://t.co/KTEgusI3Mb

kwindla

@kwindla

1 day ago

Microsoft announced a bunch of interesting new AI models and tools this week. Model launches always get lots of attention. But don't sleep on the new ASSERT evals framework that launched today.

I'm on record as arguing that 2026 is the year of evals.

Evals are the glue for all the "jobs to be done" at every level of AI: model training; testing and deciding on what models to use and how to use them; and testing and improving AI agents in production.

Evals unify our work on those different layers of the stack.

These days, when we talk about evals, observability, and testing, we're talking about overlapping parts of a large set of tools we're still early on in figuring out.

As the AI engineering ecosystem matures, diversifies, and increases massively in scale, we really, really need good evaluation (observability, monitoring, testing, data management) frameworks.

I got a chance to test the new Microsoft ASSERT evals framework before it was released, and it has some very nice core ideas.

1) ASSERT is open in two important ways. First, the team is serious about broad support for models, frameworks, and use cases. Microsoft spent time understanding voice agent use cases and building Pipecat support, for example. Second, the code is completely open source, released under an open MIT license.

2) We're all working in and with agentic coding tools today. That means we are planning in natural language, and all of our software development and ops tools have to evolve for these new natural-language workflows. ASSERT takes descriptions of desired agent behavior and generates specifications for the ASSERT suite of tools to run against.

In a world where "English is the programming language," how we actually make natural language "code" precise enough and repeatable enough is perhaps the big unsolved tooling problem that all of us are working towards in different ways. This is true whether we work on coding agents, AI ops tooling, orchestration frameworks, or vertical applications.

3) Microsoft describes ASSERT as a policy-driven framework. Rather than eval against generic performance metrics, ASSERT aims to generate stable but adaptable evaluation criteria for specific agents.

"Policy-driven" also implies a full loop design. Policy (generated from specific requirements) -> evaluation -> optimization -> monitoring in production -> improving the policy description -> evaluation -> ...

4) Enterprise agents need to be evaluated along many dimensions: task completion, individual conversation turn behavior, latency, mode-specific metrics like audio disfluencies, and safety/security. Microsoft designed ASSERT to be used together with a new safety governance toolkit called Agent Control Specification.

5) Finally, ASSERT is integrated into the Microsoft Foundry ecosystem. Today, AI engineering tools have to be open source and vendor neutral to get attention from developers and gain widespread adoption. *And* it's equally important to give enterprise customers tools that work as a coherent stack.

This is hard to do well. There are real tensions between open source development and engineering a great full stack developer experience. However, if you sweat the details on both ends, you benefit from a full spectrum of feedback about real-world development pain points. It's more work, but it's worth it!

Kudos to Microsoft for embracing this and committing to an open, community oriented approach, plus doing the extra work to build the full stack for enterprise customers.

kwindla

@kwindla

1 day ago

Fantastic commentary on the Microsoft MAI tech report released yesterday. Wonderful to see this level of detail about a large scale training effort, and this kind of super helpful analysis from @eliebakouch.

elie

@eliebakouch

2 days ago

microsoft MAI tech report is a gold mine, one of the most transparent for a model at this scale.

this model uses zero synthetic data or distillation from previous models. this means reasoning, agentic behavior, tool use are all learned fully during post-training with no cold start. bold choice that makes it harder and requires more iterations to reach sota, but you get FULL control over your model series and it proves they are serious about being a frontier lab.

the tech report is insanely detailed and precise about numbers. to give an example, they give the exact MFU across all the iterations of the model, with the exact changes etc. they also share the full scaling ladder recipe, to my knowledge this is the first time i've seen this in a tech report at this scale

let's look at all of this in this likely very long thread 🧵

kwindla

@kwindla

7 days ago

Arresting the downhill slide into sloppification of generally cited AI benchmarks is possibly one of the most important things someone thoughtful, stubborn, and well connected could work on today. We don't talk enough about how the marketing of a typical model release today focuses on benchmarks that are somewhere between useless and actively misleading. The incentives here are totally understandable. But the impact on every part of the ecosystem is significant. The most useful hills are not climbed. Adoption of genuinely good models for tasks they are well suited for is slower than it should be. Teams struggle to turn a great POC into a production agent deployed at scale. This might not be solvable. Perhaps it's one of those things it's better to be sanguine about. See also, "the optimal amount of fraud is not zero," and "democracy is the worst form of government except for all the others." But I hope someone like @willdepue tilts at this windmill.

will depue

@willdepue

7 days ago

i know the solution to the AI benchmark problem but nobody is gonna like it

it’s easy: just report test perplexity on uncontaminated high-quality code/lang/etc

you give me base model api. i run on my secret dataset. i give you test ppl.

all evals are downstream of that. solved
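
For reference, a small sketch of the "test ppl" number the post refers to, using the standard definition: perplexity is the exponential of the mean negative log-likelihood per token. How the per-token log-probabilities come back from a base model API is an assumption here; the post does not specify an interface, and the input values below are illustrative.

```python
import math


def perplexity(token_logprobs):
    """exp(mean negative log-likelihood per token), using natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Illustrative input: a model that assigns probability 0.25 to every held-out token.
uniform_logprobs = [math.log(0.25)] * 8
ppl = perplexity(uniform_logprobs)
```

Lower is better: a model that is, on average, as uncertain as a uniform choice among four tokens scores a perplexity of about 4.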
