Diffusion autoencoders have been theoretically backed in terms of learning informative representations, and exploring the modellability of the learnt representations makes a lot of sense! The tweak of leveraging a frozen SSL model is cool.
💡 The idea
We start from a frozen self-supervised encoder (DINOv2, MAE, or CLIP) and combine it with a generative decoder.
Then we fine-tune only the [CLS] token embedding - injecting low-level info while keeping the rest frozen.
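A rough, framework-free sketch of what "fine-tune only the [CLS] token" can look like in an optimizer step. This is not the authors' code: the parameter names (`cls_token`, `patch_embed`, `blocks`), the `sgd_step` helper, and the plain-SGD update are all hypothetical stand-ins for illustration.

```python
from typing import Dict, List, Set


def sgd_step(
    params: Dict[str, List[float]],
    grads: Dict[str, List[float]],
    trainable: Set[str],
    lr: float = 0.5,
) -> Dict[str, List[float]]:
    """Apply one SGD update, touching only parameters named in `trainable`."""
    updated = {}
    for name, values in params.items():
        if name in trainable:
            # Trainable parameter: standard gradient-descent update.
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            # Frozen parameter: copied through unchanged.
            updated[name] = list(values)
    return updated


# Hypothetical parameter groups standing in for a ViT-style SSL encoder.
params = {
    "cls_token": [0.5, -0.2],
    "patch_embed": [1.0, 2.0],
    "blocks": [0.3, 0.7],
}
grads = {name: [1.0, 1.0] for name in params}

# Only the [CLS] token embedding receives updates; everything else stays frozen.
new_params = sgd_step(params, grads, trainable={"cls_token"})
```

In a deep-learning framework the same effect is usually obtained by disabling gradients on every parameter except the [CLS] token before building the optimizer.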
Preprint of today: Vavilala et al., "Generative Blocks World: Moving Things Around in Pictures" -- https://t.co/2UV2B2qzXL
I have a soft spot for reviving old ideas in modern methods -- blocks world via primitives, now with diffusion models for generating/editing images.
@tkipf Yes - AudioSlots is yet another motivator behind our work. Though they are not directly comparable due to the different problem domains, they share a common high-level idea of encoding constituent sources as separate latent entries, which I really appreciate.
Pumped to see a comeback of GMVAE among a sea of VQ!
https://t.co/ZqPqIWRkTj
Speaking of, Wei-Ning's https://t.co/W9q7jHCzeP on TTS has had a substantial impact on my research on style transfer via (unsupervised) disentanglement. But it seems overshadowed by his own work HuBERT😅
@92HsChoi The two-stage paradigm relies on the 1st stage's reconstruction and the 2nd stage's distribution modelling. Not noising the posterior defo gives better 1st-stage reconstruction. Not sure about the effect on the 2nd, but I think having a proper prior during 1st-stage training is more impactful.🤔
My first-author paper has been accepted to APSIPA Trans.🙌
Our paper has been accepted for publication in APSIPA Transactions!!🚀
A big thanks to the co-authors (professors!), reviewers, and everyone who supported this work. Special mention to @jun_luolo, @_tai_shi, and @yoshipon0520🙏
At the NeurIPS Audio Imagination workshop, we present a supervised method as a preliminary step towards answering these questions.
https://t.co/tCnStND5Fw
How to train a model to extract separate representations associated with the individual sources of a music mixture?
How to also divide each representation into subspaces of pitch and timbre?
How to then have the model take arbitrary combinations of these building blocks to sample novel mixtures?
By feeding to the decoder different combinations of pitch and timbre latents, we achieve applications such as:
- instrument swapping between two mixtures or within a mixture
- stem exchange between two mixtures
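The recombinations above can be sketched as plain operations on per-source latents. This is only an illustration under assumptions: `SourceLatent`, `swap_timbre`, `swap_timbre_within`, and `exchange_stem` are hypothetical names; instrument swapping is assumed to mean exchanging timbre latents while keeping pitch, and stem exchange is assumed to mean exchanging a source's whole (pitch, timbre) pair. The decoder that would turn the recombined latents back into audio is not shown.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class SourceLatent:
    """Hypothetical per-source latent, split into pitch and timbre subspaces."""
    pitch: Tuple[float, ...]
    timbre: Tuple[float, ...]


Mixture = List[SourceLatent]


def swap_timbre(mix_a: Mixture, i: int, mix_b: Mixture, j: int) -> Tuple[Mixture, Mixture]:
    """Instrument swap between two mixtures: exchange timbre, keep pitch."""
    a, b = list(mix_a), list(mix_b)
    a[i] = replace(mix_a[i], timbre=mix_b[j].timbre)
    b[j] = replace(mix_b[j], timbre=mix_a[i].timbre)
    return a, b


def swap_timbre_within(mix: Mixture, i: int, j: int) -> Mixture:
    """Instrument swap within one mixture: sources i and j trade timbre."""
    out = list(mix)
    out[i] = replace(mix[i], timbre=mix[j].timbre)
    out[j] = replace(mix[j], timbre=mix[i].timbre)
    return out


def exchange_stem(mix_a: Mixture, i: int, mix_b: Mixture, j: int) -> Tuple[Mixture, Mixture]:
    """Stem exchange: trade a whole source (pitch and timbre) between mixtures."""
    a, b = list(mix_a), list(mix_b)
    a[i], b[j] = mix_b[j], mix_a[i]
    return a, b
```

Each function returns new lists and leaves its inputs untouched, so the original mixtures remain available for comparison after decoding.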
There goes the one proof of my ISMIR presence.
It was very nice to catch up w/ the Taiwanese Gang, and it's my honour to be confronted by "why are you still doing disentanglement?"
That's right, I will also be presenting DisMix https://t.co/tCnStND5Fw at the NeurIPS Workshop!
@ArxivSound Great to see more interest in pitch-timbre disentanglement. Using paired data with shared attributes has shown good results in speech
https://t.co/yXfF1uLmaI
https://t.co/V3cCUNe6DX
They have also been proven identifiable under certain assumptions
https://t.co/J0PRYyK2fy
https://t.co/IvBF3gGCPp
``Self-Supervised Multi-View Learning for Disentangled Music Audio Representations,'' Julia Wilkins, Sivan Ding, Magdalena Fuentes, Juan Pablo Bello, https://t.co/vWc7Ulfz2q