We identified an issue with the Mamba-2 🐍 initialization in the HuggingFace and FlashLinearAttention repositories (dt_bias being incorrectly initialized).
This bug is related to 2 main issues:
1. Incorrect init (torch.ones) when Mamba-2 layers are used in isolation, without the Mamba2ForCausalLM model class (this has already been fixed: https://t.co/oahfxjIsKb).
2. Skipping initialization due to meta device init for DTensors with FSDP-2 (https://t.co/hLC8nnQFc3 will fix this issue upon merging).
The difference is substantial. Mamba-2 seems to be quite sensitive to the initialization.
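For intuition on why a torch.ones init is so far off, here is a minimal sketch of the usual Mamba-style dt_bias init: sample dt log-uniformly, then store its inverse softplus so that softplus(dt_bias) recovers dt. The function names and the defaults (dt_min=1e-3, dt_max=1e-1, dt_init_floor=1e-4) are assumptions for illustration and are not taken from the linked PRs. It is written in plain Python rather than torch so it runs anywhere.

```python
import math
import random


def softplus(x):
    # softplus(x) = log(1 + exp(x)); this maps dt_bias to the effective step size dt
    return math.log1p(math.exp(x))


def sample_dt_bias(n_heads, dt_min=1e-3, dt_max=1e-1, dt_init_floor=1e-4, seed=0):
    """Hypothetical helper: one dt_bias per head, using assumed default ranges."""
    rng = random.Random(seed)
    biases = []
    for _ in range(n_heads):
        # Sample dt log-uniformly in [dt_min, dt_max]
        u = rng.random()
        dt = math.exp(u * (math.log(dt_max) - math.log(dt_min)) + math.log(dt_min))
        dt = max(dt, dt_init_floor)
        # Inverse softplus: softplus(b) = dt  =>  b = dt + log(1 - exp(-dt))
        biases.append(dt + math.log(-math.expm1(-dt)))
    return biases


biases = sample_dt_bias(n_heads=8)
effective_dt = [softplus(b) for b in biases]   # all within [dt_min, dt_max]
ones_dt = softplus(1.0)                        # what a dt_bias of 1.0 would give
```

Under these assumed ranges every effective dt lands between 0.001 and 0.1, while a bias of 1.0 gives softplus(1.0) ≈ 1.31, more than 10x above the upper end.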
Check out our experiments at the 7B MoE scale: https://t.co/n8iuUICRux
Special thanks to @kevinyli_, @bharatrunwal2, @HanGuo97, @tri_dao and @_albertgu 🙏
Also thanks to @SonglinYang4 for quickly helping in merging the PR.
For people thinking that DeepSeek-OCR is the first model to render text as images, the University of Copenhagen already did this in 2023
Paper is called "Language Modelling with Pixels". They trained a Masked AutoEncoder (MAE) by rendering text as images and masking patches
DeepSeek dropped the OCR model they trained last year. Against VL models, they highlight OCR; against OCRs, they highlight conv downsampling token count, yet more params.
Quite a scene watching people's reaction.
Model's only better than 0.9b paddle when it comes to math formulas.
@teortaxesTex Gotta whisper this: under heavy data costs way > training, spectral norm is basically the worst you could do from a 2nd-moment view, no? Can't splash cash on data & defend burning it.
@teortaxesTex For celebrity identification, I'm pretty sure the model just learned the name tags. It knows which face goes with which name, but it doesn't actually know who the person is. Typical Qwen superficial work.
If it messes up the order, chances are it has conflicting information from a bad update (like new wiki data layered on old data).
If it gets it right, its knowledge is current as of late 2024.
If its answer is old (from before 2023), its knowledge probably cuts off in 2022.
@teortaxesTex nay, garbage 'audio head parallel decoder' with ~1T tokens wasted.
see: https://t.co/S5MkewnVDd
might be a peculiar curse, but all the models with this architecture that I've observed have been poorly trained...
now it also applies to inference, tokenizers...
Know your inference - vllm, *.cpp, ggml, hf? Everyone believes their inference implementation is correct. Verify it yourself. Otherwise, you're no different from someone who types `ollama run deepseek-r1`.
https://t.co/L5bHPGbn1m
This suggests that, in reality, NO 'serious' LLM training is actually centered around the Hugging Face ecosystem - many who claim to surpass Meta LLaMA3.1, don't even know how to train a model properly - script kiddies
@teortaxesTex LLM lacks control over paralinguistic info, resulting in a monotonous voice similar to that of cascaded TTS systems... <del>but benchmaxxing...</del>
Sounds like, and is functionally equivalent to, a cascaded TTS. The LLM fails to manipulate paralinguistic cues. Almost the same as MiniCPM-o-2-6: whether it passes text tokens or LLM hidden states is illusory; no discernible difference.
No omni API works like this.
Is this `E2E omni`?
https://t.co/60RTECijf9
coming soon...
If I understand correctly, 32k ctx window can handle ~10 minutes of audio i/o, whereas the capacity drops to just a few minutes with vision input.
25 tok/s audio & qwen2.5-vl vision tokens are heavy.
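Back-of-envelope check of the ~10 minute figure, as a rough sketch. Assumptions not stated in the post: "32k" means 32,768 tokens, and audio input and output each cost 25 tok/s, so 50 tok/s combined. The function name is made up for illustration, and vision tokens are left out since no rate is given.

```python
def audio_minutes(ctx_tokens=32_768, tokens_per_second=25.0, streams=2):
    """Minutes of audio that fit in the context window.

    streams=2 assumes input and output audio are both tokenized at the
    same rate and share one context window (an assumption, not a spec).
    """
    seconds = ctx_tokens / (tokens_per_second * streams)
    return seconds / 60.0


both_ways = audio_minutes()            # audio in + audio out
one_way = audio_minutes(streams=1)     # a single audio stream only
```

With these numbers, two-way audio comes out to about 10.9 minutes and a single stream to about 21.8 minutes, which lines up with the ~10 minute estimate for audio i/o.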