Several people asked about the compute required for this research. Training takes about half a day on 2-3 GPUs. Ofc we used more compute during research and experimentation, but each iteration itself is lightweight. If you can SFT a 7B model (like Moshi), you can try it too.
Weâre excited to introduce KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI, accepted at #ICASSP2026! ð¢
Blog https://t.co/arVz1TGpJJ
Paper https://t.co/0EwpyRXeCs
Can a speech AI think deeply without pausing to process?
In real conversation, we donât wait until weâve fully worked out what we want to sayâwe start talking, and our thoughts catch up as the sentence unfolds.
Fast speech-to-speech models achieve this, but their reasoning tends to stay shallow. Cascaded pipelines that route through a knowledgeable LLM are smarter, but the added latency breaks the flowâthey fall back to "think, then speak."
In our new paper, we propose a way to break this trade-off. We call it KAME (Turtle in Japanese).
A speech-to-speech model handles the fast response loop and starts replying immediately. In parallel, a backend LLM runs asynchronously, generating response candidates that are continuously injected as "oracle" signals in real time.
This shifts the AI paradigm from "think, then speak" to "speak while thinking."
The backend LLM is completely swappable. You can plug in GPT-4.1, Claude Opus, or Gemini 2.5 Flash depending on the task without changing the frontend. In our experiments, Claude tended to score higher on reasoning, while GPT did better on humanities questions.
Try the model yourself here: https://t.co/uDA0nvvjhS
For the past few years, humans have been doing âprompt engineeringâ to coax the best performance out of different LLMs. In this work, we explored what happens if we train an AI to do that job instead.
By training a Conductor model with RL, we found that it naturally learns to write highly effective, custom instructions for a whole pool of other models. It essentially learns to âmanageâ them in natural language.
What surprised me most was how it dynamically adapts. For simple factual questions, it just queries one model. But for hard coding problems, it autonomously spins up a whole pipeline of planners, coders, and verifiers.
Really excited to see where this paradigm of âAI managing AIâ goes next, especially as we start moving from single-agent chain-of-thought to multi-agent âchain-of-commandâ.
Link to our #ICLR2026 paper: https://t.co/EwbjjRPLUb
Along with our TRINITY paper which we announced earlier, this work also powers our new multi-agent system: Sakana Fugu (https://t.co/2m8VRdOYqG) ð¡
Weâre excited to introduce KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI, accepted at #ICASSP2026! ð¢
Blog https://t.co/arVz1TGpJJ
Paper https://t.co/0EwpyRXeCs
Can a speech AI think deeply without pausing to process?
In real conversation, we donât wait until weâve fully worked out what we want to sayâwe start talking, and our thoughts catch up as the sentence unfolds.
Fast speech-to-speech models achieve this, but their reasoning tends to stay shallow. Cascaded pipelines that route through a knowledgeable LLM are smarter, but the added latency breaks the flowâthey fall back to "think, then speak."
In our new paper, we propose a way to break this trade-off. We call it KAME (Turtle in Japanese).
A speech-to-speech model handles the fast response loop and starts replying immediately. In parallel, a backend LLM runs asynchronously, generating response candidates that are continuously injected as "oracle" signals in real time.
This shifts the AI paradigm from "think, then speak" to "speak while thinking."
The backend LLM is completely swappable. You can plug in GPT-4.1, Claude Opus, or Gemini 2.5 Flash depending on the task without changing the frontend. In our experiments, Claude tended to score higher on reasoning, while GPT did better on humanities questions.
Try the model yourself here: https://t.co/uDA0nvvjhS
Weâre excited to introduce KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI, accepted at #ICASSP2026! ð¢
Blog https://t.co/arVz1TGpJJ
Paper https://t.co/0EwpyRXeCs
Can a speech AI think deeply without pausing to process?
In real conversation, we donât wait until weâve fully worked out what we want to sayâwe start talking, and our thoughts catch up as the sentence unfolds.
Fast speech-to-speech models achieve this, but their reasoning tends to stay shallow. Cascaded pipelines that route through a knowledgeable LLM are smarter, but the added latency breaks the flowâthey fall back to "think, then speak."
In our new paper, we propose a way to break this trade-off. We call it KAME (Turtle in Japanese).
A speech-to-speech model handles the fast response loop and starts replying immediately. In parallel, a backend LLM runs asynchronously, generating response candidates that are continuously injected as "oracle" signals in real time.
This shifts the AI paradigm from "think, then speak" to "speak while thinking."
The backend LLM is completely swappable. You can plug in GPT-4.1, Claude Opus, or Gemini 2.5 Flash depending on the task without changing the frontend. In our experiments, Claude tended to score higher on reasoning, while GPT did better on humanities questions.
Try the model yourself here: https://t.co/uDA0nvvjhS
Weâre launching the beta for our new commercial AI product: Sakana Fugu ð¡, a multi-agent orchestration system!
Blog: https://t.co/36Ud311KCP
Fugu hits SOTA on SWE-Pro, GPQA-D, and ALE-Bench, and has been our internal secret weapon. It dynamically coordinates frontier models, autonomously selecting the optimal agent combinations and roles for each task.
Available as an OpenAI-compatible API, you can seamlessly integrate Fugu into your existing workflows with minimal changes.
ð Fugu Mini: High-speed orchestration optimized for latency
ð¡ Fugu Ultra: Full model pool utilization for deep, complex reasoning
Apply for the beta test here: https://t.co/1fjuAha7ci