@thibaudz Are you familiar with APIs? You could just use https://t.co/oynALRmeHM for that. Call the Whisper model and parse the JSON response. You can set the transcription parameter to VTT or SRT, and that will give you the actual caption file.
Link to the model: https://t.co/xblfCXApNy
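A rough sketch of the "call the model and parse the JSON" step, in case it helps. The payload field names (`audio`, `transcription`) and the response shape (`output` holding the caption text) are assumptions for illustration, not the confirmed schema of the linked API, so check the model page for the real ones. The HTTP call itself is left out so the snippet runs offline.

```python
import json

ALLOWED_FORMATS = ("srt", "vtt")


def build_request(audio_url: str, caption_format: str = "srt") -> dict:
    """Build a request payload for a hosted Whisper model.

    Field names here are hypothetical; the real API may name them differently.
    """
    if caption_format not in ALLOWED_FORMATS:
        raise ValueError(f"caption_format must be one of {ALLOWED_FORMATS}")
    return {"audio": audio_url, "transcription": caption_format}


def extract_captions(response_body: str, key: str = "transcription") -> str:
    """Parse the JSON response and return the caption text.

    Assumes (hypothetically) that the captions sit under output[key],
    or directly under key if there is no "output" wrapper.
    """
    data = json.loads(response_body)
    output = data.get("output", data)
    if not isinstance(output, dict) or key not in output:
        raise KeyError(f"no '{key}' field found in response")
    return output[key]


def save_captions(text: str, path: str) -> None:
    """Write the caption text to disk as a .srt or .vtt file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

Usage would look like `save_captions(extract_captions(body), "captions.srt")`, where `body` is the raw JSON string returned by whatever HTTP client you use.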
@levelsio That can also be fine-tuned with a character/word-aware model. For example, instead of using DreamBooth, try fine-tuning your current model on images with the following caption: "..., wearing a tshirt with the message anti social social club" https://t.co/JE4fTyjSJX
@levelsio Nice! Using the Waifu Diffusion training code + BLIP-captioned pics from Unsplash, I suppose? Or simply textual inversion? Looking forward to seeing the final result, keep up the great work!
@thibaudz @cloneofsimo Pretty interesting results. The face still looks a bit blurry, but that could be fixed with GFPGAN.
How many pictures + steps per picture? Also, what's your hit rate with the best-working prompt?