The last thing I worked on at Aleph Alpha!
With @SohirMaskey, Constantin Eichenberg, and @douglasahorr
tl;dr:
- Quantisation-aware training works really well
- If you have a fixed memory budget, you should probably go with many parameters and few bits
- k-means quantisation is better than uniform
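To make the last bullet concrete, here is a minimal sketch comparing a uniform grid with a 1-D k-means codebook on synthetic Gaussian "weights". Everything in it is an assumption for illustration (the function names, the 3-bit setting, the Gaussian data); it only measures reconstruction error after the fact and is not the quantisation-aware training setup from the paper.

```python
import random

def nearest(levels, x):
    """Return the quantisation level closest to x."""
    return min(levels, key=lambda c: abs(c - x))

def mse(values, levels):
    """Mean squared error after snapping each value to its nearest level."""
    return sum((v - nearest(levels, v)) ** 2 for v in values) / len(values)

def uniform_levels(values, bits):
    """2**bits evenly spaced levels spanning [min, max] of the data."""
    lo, hi = min(values), max(values)
    n = 2 ** bits
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

def kmeans_levels(values, bits, iters=20):
    """1-D Lloyd's algorithm, initialised from the uniform grid."""
    levels = uniform_levels(values, bits)
    for _ in range(iters):
        buckets = {i: [] for i in range(len(levels))}
        for v in values:
            idx = min(range(len(levels)), key=lambda i: abs(levels[i] - v))
            buckets[idx].append(v)
        # Move each level to the mean of its bucket; keep empty ones in place.
        levels = [sum(b) / len(b) if b else levels[i] for i, b in buckets.items()]
    return levels

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(2000)]  # synthetic stand-in
uniform_mse = mse(weights, uniform_levels(weights, bits=3))
kmeans_mse = mse(weights, kmeans_levels(weights, bits=3))
print(f"uniform: {uniform_mse:.4f}  k-means: {kmeans_mse:.4f}")
```

Because the k-means codebook starts from the uniform grid and each Lloyd step cannot increase the error, it ends up at least as good as uniform on this toy data; the levels simply move to where the weights actually are.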
Would you rather use 1 million × 16-bit weights, 4 million × 4-bit weights, or even 16 million × 1-bit weights?
In joint work between Aleph Alpha Research and Graphcore, we asked this question of LLMs — the answer encouraged us to embrace the wonder ✨ of 1-bit weights, which can outperform 4-bit and 16-bit weights on a fixed weight memory budget.
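The three options in the question are memory-matched, which a quick bit of arithmetic confirms. This is a small sketch with a hypothetical helper name; it counts raw weight bits only and ignores any overhead such as scales or codebooks.

```python
def weight_memory_bits(num_weights: int, bits_per_weight: int) -> int:
    """Raw storage for the weights alone, in bits (no scales or codebooks)."""
    return num_weights * bits_per_weight

# The three configurations from the question above.
configs = {
    "16-bit": (1_000_000, 16),
    "4-bit": (4_000_000, 4),
    "1-bit": (16_000_000, 1),
}

for name, (num_weights, bits) in configs.items():
    total_bits = weight_memory_bits(num_weights, bits)
    megabytes = total_bits / 8 / 1e6
    print(f"{name}: {num_weights:,} weights x {bits} bits = {megabytes:.0f} MB")
```

All three come out at 16 million bits, i.e. 2 MB, so the only thing that changes is how that budget is split between parameter count and precision.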
In our work:
- ⚖️ A scaling laws evaluation prompts us to consider very low-bit formats
- 📈 Scaled-up tests show the power of memory-matched models with 1-bit weights
- ⚡ Kernel benchmarking demonstrates their feasibility for autoregressive inference
Read all about it in our blog and paper (link below! ⬇️)
Massive thanks to our collaborators at Aleph Alpha Research!
Authors: @SohirMaskey, Constantin Eichenberg, @atomicflndr and @douglasahorr
Today we’re releasing Trinity Large, a 400B MoE LLM with 13B active parameters, trained over 17T tokens
The base model is on par with GLM-4.5 Base, while being significantly faster at inference because it’s sparser and hybrid
The architecture we picked is one of my favorites: 3:1 local/global with SWA, NoPE on the global layers and RoPE on the local layers, gated attention, depth-scaled sandwich norm, and smooth training with Muon.
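For readers unfamiliar with the 3:1 pattern, here is a rough sketch of what such a layer schedule could look like. The names (`LayerSpec`, `layer_schedule`), the layer count, and the choice to place the global layer last in each group of four are assumptions for illustration, not details of the released model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerSpec:
    attention: str   # "local" (sliding window) or "global" (full attention)
    positional: str  # "rope" or "nope"

def layer_schedule(num_layers: int, local_per_global: int = 3) -> list[LayerSpec]:
    """Repeat `local_per_global` local layers followed by one global layer."""
    period = local_per_global + 1
    specs = []
    for i in range(num_layers):
        if (i + 1) % period == 0:
            # Global layer: full attention, no positional encoding (NoPE).
            specs.append(LayerSpec("global", "nope"))
        else:
            # Local layer: sliding-window attention with RoPE.
            specs.append(LayerSpec("local", "rope"))
    return specs

schedule = layer_schedule(8)  # 8 layers is a hypothetical depth
print([s.attention for s in schedule])
```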
Our dataset is also high quality, curated by @datologyai.
We trained it on 2,000 B300s for a month on @PrimeIntellect infrastructure.
This is a preview release with an instruct model only — we’re ramping up RL on it.
When @latkins approached us a couple of months ago to train this model together, I thought he was crazy — but then he hired @stochasticchasm, and here we are.
@NWalhan @KuittinenPetri @Aleph__Alpha The tech report is coming soon, but the positional encoding is standard RoPE (for all sub-transformers). I don't have the loss curve of these particular checkpoints at hand right now, but I can show you a cpt curve from a different HAT model I'm currently training; it's quite boring
Curious how to accelerate inference of some of the recent byte-level models like HAT/HNet/BLT?
Check out this vLLM fork developed by my friends and colleagues, Pablo and Lukas!
To my knowledge, this is the first demonstration of inference speedups from dynamic chunking in byte models!
First high-performance inference for hierarchical byte models.
@LukasBluebaum and I developed batched inference for the tokenizer-free HAT (Hierarchical Autoregressive Transformers) models from @Aleph__Alpha Research. In some settings, we outcompete the baseline Llama. 🧵
🤯 MERZ AND MACRON JUST CONFIRMED A PAN-EUROPEAN LEGAL ENTITY IS COMING
It's now up to all of us to ensure the solution that gets passed into law is fit for purpose for European startups.
That means: EU–INC. 🚀
Support us and we will all get this done together. 🇪🇺🤝
Our work on tokenizer-free LLMs: Hierarchical Autoregressive Transformers (HAT)!
We recently dropped HAT models on HF, pretrained from scratch! https://t.co/S1AavLetGN
You can try them with both HF Inference AND our vLLM fork: https://t.co/HOPFj3vQv9
🧵
(1/6)
And imo the coolest part is that we made it ready for production-grade inference with our own vLLM fork (more details on this soon!): https://t.co/HOPFj3vQv9
So now you can enjoy all the vLLM features like continuous batching, paged attention, etc. for HAT as well!
(5/6)
Seeing this pushback a lot - and it's fair!
However, these models don’t have a fixed vocabulary, i.e. there are infinitely many words the model can operate over instead of a finite set of tokens.
I wouldn't really consider these to be tokenizer-free tbh.
Unlike HNets, these models are word-level. The sequence is turned into words (this is literally called tokenization).
Then, the bytes of these words are turned into embeddings, which are then processed by a model.
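The word-then-bytes step under discussion can be sketched in a few lines. The splitting rule here (each word keeps its trailing whitespace) and the function names are assumptions for illustration; the actual HAT splitting rule may differ.

```python
import re

def split_into_words(text: str) -> list[str]:
    """Split text into word chunks; joining the chunks restores the text.

    Assumption: a chunk is a run of non-whitespace plus trailing whitespace.
    """
    return re.findall(r"\S+\s*|\s+", text)

def words_to_bytes(text: str) -> list[list[int]]:
    """Map each word chunk to its UTF-8 byte values (0..255)."""
    return [list(word.encode("utf-8")) for word in split_into_words(text)]

text = "Tokenizer-free na\u00efve models"
chunks = words_to_bytes(text)
for word, byte_values in zip(split_into_words(text), chunks):
    print(repr(word), byte_values)
```

This illustrates both sides of the exchange: there is a splitting step, but no fixed vocabulary, since any word, including ones never seen before, maps to a sequence drawn from the same 256 byte values.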