1/ we are excited to release ERNIE-Image, after 3 months of building from scratch.
an 8b text-to-image model from baidu's ernie image team. honestly, we didn't expect an 8b dit to get this far, this fast.
strong instruction following. best-in-class text rendering. runs on a 24gb gpu.
huge thanks to the ERNIE-Image team, this wouldn't exist without an incredibly talented group of people who shipped fast and cared deeply.
thread below. ๐ ๐
Introducing Modular Diffusers ๐ฅ
The `DiffusionPipeline` abstraction in Diffusers has established a standard in the community. But it has also limited flexibility.
Modular Diffusers breaks those shackles & enables the next gen. of creative user workflows ๐งจ
Details โฌ๏ธ
@NoonienStar for that we need to wait for the condition pipeline to be merged. But for I2V and control it will lower the number of frames by a lot, this is just text to video, with image to video or video to video with those constraints probably it will be 10-8s for that resolution.
Finally got some time to play with LTX2. With diffusers, you can generate 20-second videos with 24 GB of VRAM and 10-second videos with 16 GB GPUs, both with less than 32 GB of RAM. Here are some recipes to suit your needs: https://t.co/kmJAXlydlK
I've created a new repository with diffusers recipes, starting with Z-Image. It has easy copy & paste code with benchmarks (RAM, VRAM, inference time), so you can choose the best optimization for your environment: https://t.co/pfofFby4Av
While also testing Z-Image-Turbo, I tried by mistake a prompt with an hexadecimal color and it worked, so I tested it more and it also understands colors and gradients!
I was reading the Z Image Turbo report and saw that it can understand multiple languages, so I tried Spanish, which is my native language, and it delivered everything I asked for.
If you want to use the new FLUX.2-dev with 8โ12 GB GPUs, you can do it with Diffusers. You'll need to use the remote text encoder and this script: https://t.co/KRpEDSHWEK
In case you're wondering, the remote text encoder is free to use, you just need a hf token.
@xhinker@RisingSayak that's true and you're not the first one to ask for this, I'm thinking of opening a repo with examples and best practices for the popular models so people can just browse it. Also there's some other efforts we're doing to bring a better experience to the users.
@xhinker@RisingSayak I'll do some tests but my experience wasn't the same one, I tested WAN2.1 and saw a quality drop and for Flux I tested with a 24GB GPU and it was slower than using group offloading. It has been a while since then so I'll do some new benchmarks and see if something changed.
@KristjanRetter I thought that was implied in my answer sorry. We don't have a limit on how many loras you can load, so yes, you can use any other loras you want but if there's one that fails you can open an issue with it.
Qwen-Image-Lightning 8-Steps runs in 22s and using less than 16GB with a 3090. You can find the models and the code to test it here: https://t.co/HZGuSpSMrT
so it turns out we don't need to do anything special for the text encoder, this is with both the transformer and text encoder using bitsandbytes with 4-bit quantization, using under 16GB of VRAM and in ~1m40s with a 3090