Postdoctoral researcher at @meta FAIR and @uwnlp, interested in the inner workings of neural networks, multilingualism, and fairer NLP (he/him)
We present Compute Optimal Tokenization! 🔡
Commonly, LLM scaling works stick to one tokenizer, sweeping data/model size.
But what happens when we control the tokenizer’s compression rate (bytes/token)?
Here we sweep tokenizers, params, and data across compute budgets: [1/N]
COLM 2026 will host 16(!) workshops:
https://t.co/Lf90oZTfiT
CFPs are all online, and deadlines are coming up, so check the CFPs of the workshops you're interested in.
Announcing First Call for Papers: Second Tokenization Workshop 🔡 📣
Non-archival submissions of two types:
▶️ Research papers (up to 9 pages)
▶️ Extended abstracts (up to 2 pages)
Submission deadline June 23, 2026 (AoE)
Acceptance notification on July 24, 2026 (AoE)
MoEs are everywhere, but the design space is confusing: total vs active experts? expert size? shared experts? routing? token dropping?
We train >2000 MoE LMs 🫠 to investigate and bring you:
📄🔪🍰 Slicing and Dicing MoEs
Tl;dr: it's all about expert size and count
[1/9]
In SuperBPE we found: as tokenizer compression increases, the compute-optimal ratio of train tokens to model params decreases — and remarkably, corresponds to the same underlying ratio of train *bytes* / param! Our new work makes it official: scaling laws depend on compression.
1/
The "20 tokens per parameter" Chinchilla scaling law is flawed. It is an artifact of your tokenizer. Scaling shouldn't be measured in tokens at all. It should be measured in bytes. 🧵
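To make the unit change concrete, here is a minimal sketch of the arithmetic, assuming the compute-optimal ratio is fixed in bytes per parameter and only the tokenizer's compression rate (bytes/token) varies. The function name and the 80 bytes/param figure are hypothetical, picked so that a 4 bytes/token tokenizer lands on 20 tokens/param; they are not numbers from the paper.

```python
def tokens_per_param(bytes_per_param: float, bytes_per_token: float) -> float:
    """Convert a bytes-per-parameter ratio into tokens per parameter."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return bytes_per_param / bytes_per_token


# Hypothetical value, for illustration only (not a result from the paper).
ILLUSTRATIVE_BYTES_PER_PARAM = 80.0

for compression in (2.0, 4.0, 8.0):
    ratio = tokens_per_param(ILLUSTRATIVE_BYTES_PER_PARAM, compression)
    print(f"{compression:.1f} bytes/token -> {ratio:.1f} tokens/param")
```

Under this assumption the tokens-per-parameter number moves with the tokenizer while the bytes-per-parameter number stays put, which is the sense in which a fixed "tokens per parameter" rule would be tokenizer-dependent.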
There is life beyond BPE!
🔠🌱🥪
Don’t miss this amazing work from @JulieKallini tackling one of the key challenges of byte-level LLMs: generation speed.
Diffusion and speculative decoding come to the rescue, enabling much faster BLT generation at similar performance.
Fast Byte Latent Transformer is accepted to ICML 2026! ⚡🥪
Byte-level LMs promise to free us from subword tokenizers, but decoding one byte at a time is super slow.
We make BLT generation more efficient with BLT-D: text diffusion for parallel byte decoding. 1/
@arimedai @AnthropicAI With lower compression, we allow more compute for the same data sample, which benefits performance. But during training, a low-compression model needs more compute to process enough data.
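A rough sketch of that trade-off, assuming the widely used approximation of about 6 training FLOPs per parameter per token (an assumption brought in here for illustration, not a formula quoted from this thread). The function name and the example sizes are hypothetical.

```python
def train_flops(params: float, train_bytes: float, bytes_per_token: float) -> float:
    """Approximate training FLOPs for a fixed byte budget.

    Assumes FLOPs ~= 6 * params * tokens (a common rule of thumb),
    with tokens = train_bytes / bytes_per_token.
    """
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    tokens = train_bytes / bytes_per_token
    return 6.0 * params * tokens


# Hypothetical setup: the same model and the same bytes of text,
# tokenized at two different compression rates.
PARAMS = 1e9
TRAIN_BYTES = 4e9

for compression in (2.0, 4.0):
    flops = train_flops(PARAMS, TRAIN_BYTES, compression)
    print(f"{compression:.1f} bytes/token -> {flops:.2e} FLOPs")
```

Halving the compression rate doubles the token count for the same text, so under this approximation the same bytes cost twice the training compute.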
@Vijay2050977 A 128k subword tokenizer is constrained to a closed vocabulary. A latent tokenizer supports any string as a token, while maintaining the set average compression across the sequence.
@jan_metzen That's interesting. We compared different tokenization schemes and got consistent trends, though the optimal compression varied a bit. You can check Appendix C for more details.
Tokens are not a universal unit of data.
In our new work on Compute Optimal Tokenization, we show that when adapting scaling recipes across tokenizers, bytes are the more stable unit. And the compute-optimal compression rate is not necessarily what today’s BPE tokenizers use.
Extremely excited about our work on Compute Optimal Tokenization! This paper categorically nails down the role that compression plays in compute optimality and recommends how to scale models keeping compression in mind. Cool results on multiple languages too!
Larger compute prefers a smaller vocabulary, interesting.
2 follow-up questions:
1. can we decouple in/out tokenization? to isolate the effect of more-input-tokens vs. finer-prediction-granularity.
(see also https://t.co/0pGh4OGJVM)
2. can we combine it with n-gram embed?