@jeremyphoward @ggerganov I understand that the amount of memory is a bottleneck on consumer GPUs, but wouldn't the inference speed still be better with fewer active parameters during generation?
@SSBM_Arte I think it happened to me once when I had pasted an image that wouldn't properly upload, but refreshing the page fixed it iirc. It could also be that a previous message is open for editing i guess
@jaxmorphy i mean, 3 denoising steps is not a lot. you can still see the stage outline really well. do you plan on using rolling diffusion/diffusion forcing?
@y0b1byte i think the DreamerV3 paper mentions that it uses the same set of hyperparameters for every experiment, so the comparison might not be entirely fair
@spikedoanz @filipviz normalize by average won't work with negative logits. you could offset everything by the smallest logit maybe, and it would also get you the translation invariance property of softmax
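A minimal sketch of the offset idea above, in plain Python. The helper names `naive_normalize` and `shift_normalize` are made up for illustration, and "normalize by average" is assumed here to mean dividing each logit by the sum (dividing by the mean only differs by a constant factor). It shows the naive version producing a negative "probability", and the shifted version staying non-negative and unchanged when a constant is added to every logit.

```python
def naive_normalize(logits):
    # Assumed reading of "normalize by average": divide each logit by the sum.
    total = sum(logits)
    return [x / total for x in logits]


def shift_normalize(logits):
    # Hypothetical helper: subtract the smallest logit, then divide by the sum.
    lo = min(logits)
    shifted = [x - lo for x in logits]
    total = sum(shifted)
    if total == 0:
        # All logits are equal, so fall back to a uniform distribution.
        return [1.0 / len(logits)] * len(logits)
    return [s / total for s in shifted]


logits = [-1.0, 2.0, 3.0]
print(naive_normalize(logits))   # first entry is negative
print(shift_normalize(logits))   # non-negative, sums to 1

# Adding a constant to every logit leaves the shifted version unchanged.
print(shift_normalize([x + 10.0 for x in logits]))
```

One difference from softmax worth keeping in mind: with this offset the smallest logit always ends up with exactly zero probability.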
@rami_mmo this seems contrary to what's commonly thought about quantized latents, and about why VQVAE was made in the first place. do you think KL works better here because the Minecraft scenery is not that diverse? (i think you allude to this in the article)
@torchcompiled @giffmana I think the issue here is that for Transformer they plot the cumulative training time (124M+354M+757M+1.4B), instead of comparing to just the 1.4B trained from scratch, which seems to take about the same amount of TPU hours as the Tokenformer 1.4B, so the graph seems disingenuous