Little golang package to make deploying cross platform apps with ONNX models a little easier: https://t.co/rmSDyfRoWL standing on the shoulders of https://t.co/W3YpxjOurX.
I feel startups are the same. When you’re small the problems are hard, when you’re big the problems are hard. I think it’s just the nature of problem solving the limiting factor. Now AI is taking more of the easy things — we’re left just with just the hard problems it can’t solve (or getting it to solve is not worth ROI). Thus the problems are harder and the competition can solve all the easy things too now. More than ever differentiation on hard problems is required.
If you’re willing to burn ~30% of a cpu core on a custom semantic VAD and hack through the bugs in gpt-realtime2 you can build something that feels much more responsive and natural than OpenAI semantic VAD and as a bonus it keeps your computers toasty.
Stack: WebRTC carries full-duplex opus audio from local server that also connects with WebSocket to GPT Realtime2 with OpenAI VAD disabled. Local server decodes the different PCM/sample-rate paths for all the different detectors+openai and also encodes for browser playback of assistant voice. Server runs local Silero VAD (ML), assistant echo-aware barge-in gating with non word utterance detection and auto continuation. It uses a tuned multi-checkpoint Smart Turn threshold curve (smart turn is a ML model for end of turn detection, but running it 7 times at different times is much better). Server playhead telemetry drives deterministic interrupt, truncate, cancel, and context-repair logic, and works around Realtime API bugs and edge cases.
@zebassembly I tried this but found that it was still worth having apply_patch on edit with gpt5.5 using lark. Way more reliable as it doesn’t have to json escape everything.
Little trick for harness devs that has worked nicely for me, when i need to inject important “pushed” information mid conversation, i create a fake tool call request and result for the pull version of that thing.
Like get_modified_files, or get_chat_notifications. Agents seem to trust tools more than injected system messages.
@tunguz lol timely, I just posted how if you give up about 30% of a core to custom semantic VAD + encoding/decoding/resampling/detecting you can make your realtime voice assistant feel a lot more realistic.
@threepointone Somewhat similar I’ve been working on a join where the main agent decides it’s got enough info from subs and writes a summary that ends up replacing where the fork started off.
@r0ck3t23 Yeah but do I want an AI dog therapist for pennies on the dollar or a human one who costs $20/hr, is only available in work hours, and can only serve one customer at a time. Even if there are new jobs, they are not going to us meat bags. Clueless logic.