@GlosPazura @ModelScope2022 Sorry for the inconvenience. Could you DM me how to reproduce your decode rate, such as which quantized GGUF you use and which quantized version of Qwen 3.5 9B you are comparing with? The specific llama.cpp version that LM Studio uses would also be great.
💥 Introducing MiniCPM-o 2.6: an 8B-size, GPT-4o-level omni model that runs on device
✨ Highlights:
~Matches GPT-4o-202405 in vision, audio and multimodal live streaming
~End-to-end real-time bilingual audio conversation
~Voice cloning & emotion control
~Advanced OCR & video understanding
~Offline iPad-compatible multimodal live streaming
🔗 Try it out:
GitHub: https://t.co/gtRJoHOlfd
HF: https://t.co/IY9KgoOqSI
Demo: https://t.co/IzZuyz0qB1
Here is the PR / tech blog:
https://t.co/p1heIyecw1
I've tried to describe most of the interesting implementation details. I believe the performance is quite good and it should run nicely even on low-mid range hardware.
Enjoy your local copilot in the terminal!
Here's my take on the Sora technical report, with a good dose of speculation that could be totally off. First of all, really appreciate the team for sharing helpful insights and design decisions – Sora is incredible and is set to transform the video generation community.
What we have learned so far:
- Architecture: Sora is built on our diffusion transformer (DiT) model (published in ICCV 2023) — it's a diffusion model with a transformer backbone, in short:
DiT = [VAE encoder + ViT + DDPM + VAE decoder].
According to the report, it seems there are not many additional bells and whistles.
- "Video compressor network": Looks like it's just a VAE but trained on raw video data. Tokenization probably plays a significant role in getting good temporal consistency. By the way, VAE is a ConvNet, so DiT technically is a hybrid model ;) (1/n)
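To make the DiT recipe above a bit more concrete, here is a toy, shape-level sketch in plain Python of the two pieces that sit between the VAE encoder and decoder: ViT-style patchification of a latent into tokens, and the DDPM forward noising step the transformer is trained to undo. The VAE and the transformer itself are left out. All function names here are made up for illustration, and the linear beta schedule (1e-4 to 0.02) is an assumption borrowed from the original DDPM setup, not something the Sora report confirms.

```python
import math
import random


def patchify(latent, p):
    """Split an H x W x C latent (nested lists) into flattened p x p patch tokens."""
    h, w = len(latent), len(latent[0])
    assert h % p == 0 and w % p == 0, "latent size must be divisible by patch size"
    tokens = []
    for i in range(0, h, p):
        for j in range(0, w, p):
            token = []
            for di in range(p):
                for dj in range(p):
                    token.extend(latent[i + di][j + dj])
            tokens.append(token)
    return tokens


def unpatchify(tokens, h, w, c, p):
    """Inverse of patchify: rebuild the H x W x C latent from patch tokens."""
    latent = [[None] * w for _ in range(h)]
    cols = w // p
    for n, token in enumerate(tokens):
        i, j = (n // cols) * p, (n % cols) * p
        for di in range(p):
            for dj in range(p):
                k = (di * p + dj) * c
                latent[i + di][j + dj] = token[k:k + c]
    return latent


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Product of (1 - beta_s) over the first t steps of a linear beta schedule (assumed)."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def add_noise(tokens, t, num_steps, rng):
    """DDPM forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    ab = alpha_bar(t, num_steps)
    noise = [[rng.gauss(0.0, 1.0) for _ in tok] for tok in tokens]
    noisy = [
        [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(tok, eps)]
        for tok, eps in zip(tokens, noise)
    ]
    return noisy, noise


# Toy latent standing in for the VAE encoder output (4 x 4 spatial, 2 channels).
rng = random.Random(0)
H, W, C, P = 4, 4, 2, 2
latent = [[[rng.gauss(0.0, 1.0) for _ in range(C)] for _ in range(W)] for _ in range(H)]

tokens = patchify(latent, P)                        # what the transformer would consume
noisy, noise = add_noise(tokens, 500, 1000, rng)    # training input and noise target
restored = unpatchify(tokens, H, W, C, P)           # what the VAE decoder would consume
```

In a real DiT the transformer takes `noisy` plus the timestep and is trained to predict `noise`; sampling runs the reverse process on tokens and only then unpatchifies and decodes.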