@___Harald___ @antferdom how do typical robotics models work for self-driving? Like a simple diffusion policy that is pretrained on all data and finetuned on good data? In my mind this would already completely solve self-driving (with high context length).
@tokenpilled65B @madebyollin @Shauray7 maybe it's an eval issue. JiT, at least for me, doesn't produce good fine details for far-away objects, which might not impact FID much but might impact perceptual losses.
@tokenpilled65B @madebyollin @Shauray7 I'm still working on the arch before scaling. I did a small-scale test with 1.4B params total, with the decoder having a dim of 1920.
@tokenpilled65B @madebyollin @Shauray7 I'm also training a tokenizer right now, and LPIPS and DINO losses can help my large-patch-size issues a little, but not much.
@madebyollin @Shauray7 JiT allows large patch sizes to work, which doesn't mean that large patch sizes are equally good. Also, JiT really only works well for ImageNet, as the subjects are close. As soon as a face is only handled by like 1 patch, it just can't model it properly.
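
To put rough numbers on the "face handled by 1 patch" point, here is a minimal sketch that counts how many patches a square object spans. The helper name `patches_covering` and the pixel sizes are illustrative assumptions, not values from the thread, and it assumes the object is aligned to the patch grid.

```python
import math


def patches_covering(object_px: int, patch_px: int) -> int:
    """Number of patches spanned by an object_px x object_px box.

    Hypothetical helper: assumes the box is aligned to the patch grid,
    so this is the minimum number of patches it can touch.
    """
    per_side = math.ceil(object_px / patch_px)
    return per_side * per_side


# Illustrative sizes only (not from the thread).
close_face = patches_covering(object_px=200, patch_px=16)  # large, close subject
far_face_small_patch = patches_covering(object_px=32, patch_px=16)
far_face_large_patch = patches_covering(object_px=32, patch_px=32)

print(close_face, far_face_small_patch, far_face_large_patch)
```

Under these assumed sizes, the far-away face collapses to a single patch once the patch size reaches the face size, while a close subject still spans many patches.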