Been seeing double responses in @OpenAI @OpenAIDevs for the gpt 5.4 API for a while after its release. This is bugging me in production with random repeated responses. Why is it still unresolved?
Hey @royalenfield - a thread about how your customer support handles a stranded rider on a warranty-covered motorcycle.
My Himalayan 450 (KL 52 U 2633) failed mid-ride on 9 June with ECU-TPS failure. Stuck at Teknik Motorcycles, Sarjapur Rd, Bangalore since.
@Elevenlabs be doing everything except improving their TTS. Come on, you guys are lacking there. Pull v3 out of beta… add a streaming endpoint to v3… more practical emotional/sound tags etc. Can point out a bunch of niche flag bugs too, lmk
> I was out to Shillong for five days.
> Uninstalled X cuz my storage was full.
> Reinstalled back now, sees Clawdbot, moltbot, moltbook, molthub bs everywhere.
> Uninstalls again.
btw here's Rainbow Falls in Meghalaya
Tailwind lays off 75% of their team. The reason is so ironic:
> their css framework became extremely popular w AI coding agents, 75m downloads/mo
> that meant nobody would visit their docs where they promoted paid offerings
> resulting in 40% drop in traffic & 80% revenue loss
Before jumping into Vector Quantized Variational Autoencoders,
let me explain why I implemented this from scratch.
For the past few months, I've been deep into audio deep learning:
understanding how text-to-speech, speech-to-text, and audio-to-audio models work.
Along the way, I covered mel spectrograms, Fourier transforms, signal processing basics, and even infra pieces like WebSockets and WebRTC.
A few weeks ago, I implemented a TTS model from scratch. I initially coded NVIDIA's Tacotron2 architecture
Code
https://t.co/O29ldXlD6X
but instead of training it, I moved on to a Transformer-based TTS model, which I trained for ~24 hours on an A100. [will share the code in the future]
After that, I started building a speech-to-text model.
While working on the STT architecture, one block kept appearing repeatedly:
Residual Vector Quantizers
To understand RVQ, I needed VQ-VAE.
To understand VQ-VAE, I needed Variational Autoencoders.
And to understand VAEs properly, I first implemented a vanilla autoencoder.
So I ended up implementing all three:
Autoencoder → Variational Autoencoder → Vector Quantized VAE
Autoencoders compress data into a latent space and reconstruct it back, while VAEs add a probabilistic structure by learning a distribution over latents and regularizing it toward a prior, enabling generation but often causing blurry outputs and unstable training.
AE Code
https://t.co/Fvua50p5oH
VAE Code
https://t.co/yzF2RPHKWD
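To make the "learn a distribution and regularize it with a prior" part concrete, here is a minimal sketch of the two pieces a VAE adds on top of a plain autoencoder: sampling the latent with the reparameterization trick, and the KL penalty toward a prior. It assumes a diagonal Gaussian posterior and a standard normal prior, which is the common textbook setup and not necessarily what the linked code does. The function names are made up for illustration.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

# Hypothetical encoder outputs for a 2-dimensional latent.
rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]

z = reparameterize(mu, log_var, rng)      # noisy latent that is fed to the decoder
kl = kl_to_standard_normal(mu, log_var)   # added to the reconstruction loss
```

The KL term is zero only when the posterior equals the prior, so it pulls every latent toward N(0, 1) while the reconstruction loss pulls the other way.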
VQ-VAE changes this by making the latent space discrete. Each encoder output is snapped to the nearest vector in a learned codebook, turning each latent into a token from a fixed vocabulary.
VQ-VAE Code
https://t.co/kKaENveqHg
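The nearest-codebook lookup described above can be sketched in a few lines. This is a toy version in plain Python, assuming squared Euclidean distance and a tiny hand-written codebook. A real model would do the same thing with batched tensor ops, and the names here are only illustrative.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook vector (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    indices = [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]
    quantized = [codebook[k] for k in indices]
    return indices, quantized

# Toy codebook with 3 entries of dimension 2 (values are made up).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.9, 1.2], [-0.2, 0.1], [-0.8, 1.7]]

indices, quantized = quantize(latents, codebook)
print(indices)  # [1, 0, 2]
```

The `indices` list is the token sequence, and `quantized` is what the decoder actually sees.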
Why this matters:
• Speech carries discrete structure (phonemes, words), even though the raw waveform is continuous
• Quantization yields efficient compression
• Discrete tokens integrate seamlessly with Transformers
Unlike VAEs, VQ-VAEs use a deterministic encoder, no KL loss, and learn the token prior separately. They train via reconstruction, codebook, and commitment losses with straight-through gradients, making them fundamental to speech synthesis, recognition, and modern audio tokenizers.
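Here is a rough sketch of how the codebook loss, commitment loss, and straight-through gradient fit together for a single latent vector. The gradients are written out by hand instead of using an autograd framework, so this is an illustration and not how the linked implementation works. The function name is hypothetical, and beta = 0.25 is an assumed default taken from the original VQ-VAE paper.

```python
def vq_losses_and_grads(z_e, e, grad_rec_wrt_zq, beta=0.25):
    """z_e: encoder output; e: its nearest codebook vector (lists of floats).
    grad_rec_wrt_zq: gradient of the reconstruction loss w.r.t. the decoder input."""
    diff = [a - b for a, b in zip(z_e, e)]
    sq = sum(d * d for d in diff)

    codebook_loss = sq             # ||sg[z_e] - e||^2, only updates the codebook
    commitment_loss = beta * sq    # beta * ||z_e - sg[e]||^2, only updates the encoder

    # Straight-through: the argmin has no gradient, so the decoder-input gradient
    # is copied to the encoder output unchanged; the commitment term is added on top.
    grad_z_e = [g + 2.0 * beta * d for g, d in zip(grad_rec_wrt_zq, diff)]
    # The codebook vector is pulled toward the encoder output.
    grad_e = [-2.0 * d for d in diff]
    return codebook_loss, commitment_loss, grad_z_e, grad_e

# Made-up numbers: one 2-d latent, its codebook match, and a decoder gradient.
cb, commit, g_enc, g_code = vq_losses_and_grads(
    z_e=[1.0, 2.0], e=[0.0, 2.0], grad_rec_wrt_zq=[0.5, -0.5]
)
```

The two loss terms have the same value up to beta. What differs is which side the stop-gradient (sg) freezes, and so which parameters each term moves.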
I'm going to implement RVQ next and share a detailed post about building TTS and STT models from scratch, voice cloning, and fine-tuning audio models.
If you're into deep learning or computer science, follow me for insightful content,
and don't forget to repost this!
Had this deep convo on male character development, how increasingly impressionable someone becomes with losses/heartbreaks. That feeling of losing someone who knows 99.9% of you, and the ego that fires up to prove that everything they knew about you is wrong by evolving yourself.