Gemma4 E4B is perfect for voice agents - except it's not. Fused TTS+LLM, good knowledge - what's the issue?
It ain't obvious at first.
It takes a two-turn conversation to see the problem: that second turn needs user context from the first turn. Unless we want to build up (and blow out) all user context via voice, we'll need to transcribe the first turn's voice input. No big deal.
OK, so should we run the model twice - first to respond fast, then to transcribe? Unfortunately, the model sometimes hears things differently. It's hella confusing when the response and transcript don't fit together.
So transcribe first, then run it again to respond? Now we're losing all those fused-voice speed benefits.
Maybe transcribe and respond within the same request? But now we're wasting time waiting for the transcription stream before getting to the response.
OK, so maybe ask Gemma to first respond and then transcribe what it actually heard within the same request? Now we're aggregating and chunking streams, but yeah, that works!
... most of the time. Unless Gemma messes up the output format, which it eventually will, since we keep rewriting the history, and after a few turns Gemma's output starts to align with the assistant output it sees, which only contains the response but no transcript (because the transcript, by definition, gets assigned to the user).
System prompt? Stop kidding me. You didn't actually read that last run-on sentence either, did you?
The solution? Back to chained STT before LLM. And now there's no need for Gemma4 E4B in the first place.
@aryanlabde $200 claude max is a utility bill for larger projects. Use for planning, codex or anything on OR or even local models work great for execution
claude max double quota hack - make sure your quotas reset on mon or tue. whenever something's obviously broken, @claudeai resets the limits on thu. /s
Life lesson - It's easier to adapt your workflows to the AI agent than the other way around. If something wasn't part of their training, it's impossible to make it work consistently. My notes from one year of agent-first coding:
1/ Choose your stack and language wisely. @claudeai , @ChatGPTapp , @GeminiApp & Co know everything on the surface level, but they have different strengths. We initially used Claude with Java, but switched to TS and Python - felt like productivity doubled.
2/ Agents do 90% of the work, but you don't know with 10% is missing. Use Superpowers, agent frameworks, etc - doesn't matter. There's always something that falls off the plate once the project grows. Checklists, hooks, and CICD are all great, but context-engineering remains an art.
3/ AI models change all the time. Just because you you finally made it work tday doesn't mean it keeps working tomorrow. @AnthropicAI made headlines recently, but this holds true in general. Thinking budgets, alignment, caches, etc change all the time. Feels like parenting a toddler, but without the reward of growth.
It's an awesome tech when it works, but you can't rely on it without guardrails and common sense yet.
A year ago I was running 1-2 build servers for an entire team - now that's 1-2 per developer. Coding agents can be a big productivity boost, but sometimes with unexpected side-effects. The massive growth of CI usage is one of these.