Tomasz Limisiewicz @tomlimi - Twitter Profile

Pinned Tweet

about 1 month ago

We present Compute Optimal Tokenization! 🔡 Commonly, LLM scaling works stick to one tokenizer, sweeping data/model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]

21

620

95

508

104K

Tomasz Limisiewicz @TomLimi

6 days ago

@unixpickle sad, optimizing pretok is the most fun part

0

1

0

112

TomLimi retweeted

Conference on Language Modeling @COLM_conf

10 days ago

COLM 2026 will host 16(!) workshops: https://t.co/Lf90oZTfiT CFPs are all online, and deadlines are coming up, so check the CFP of your workshops of interest

0

74

19

36

15K

Tomasz Limisiewicz @TomLimi

11 days ago

Happy to share that the unprocessed results and code for fitting scaling laws and plotting are now available at: https://t.co/4NwVB4Nurg

0

24

6

2K

TomLimi retweeted

Tokenization Workshop (TokShop) @COLM2026 @tokshop2025

15 days ago

Announcing First Call for Papers: Second Tokenization Workshop 🔡 📣 ▶️ Non-archival submissions of two types: Research papers (up to 9 pages) ▶️ Extended abstracts (up to 2 pages) Submission deadline June 23, 2026 (AoE) Acceptance notification on July 24, 2026 (AoE)

1

15

12

1

4K

Tomasz Limisiewicz @TomLimi

18 days ago

@OrgadHadas Congrats! We (@tokshop2025) are also transferring from ICML to COLM this year!

0

2

0

296

TomLimi retweeted

Margaret Li @margs_li

19 days ago

MoEs are everywhere, but the design space is confusing: total vs active experts? expert size? shared experts? routing? token dropping? We train >2000 MoE LMs 🫠 to investigate and bring you: 📄🔪🍰 Slicing and Dicing MoEs Tl;dr: it's all about expert size and count [1/9]

15

376

56

337

36K

Tomasz Limisiewicz @TomLimi

20 days ago

@yoavgo No wonder LLM adoption is so low in Europe, with blunders like this

0

5

0

246

TomLimi retweeted

Alisa Liu @alisawuffles

22 days ago

In SuperBPE we found: as tokenizer compression increases, the compute-optimal ratio of train tokens to model params decreases — and remarkably, corresponds to the same underlying ratio of train *bytes* / param! Our new work makes it official: scaling laws depend on compression.

3

198

23

118

26K

Tomasz Limisiewicz @TomLimi

23 days ago

See you there! 🌉🔠

Tokenization Workshop (TokShop) @COLM2026 @tokshop2025

23 days ago

TokShop will be at #COLM2026! 🗓️ October 9th, 2026 📍 San Francisco, USA More details and a call for papers coming soon.

0

14

5

1

2K

0

10

0

349

Tomasz Limisiewicz @TomLimi

24 days ago

@che_shr_cat That's a nice one! 🤖😄

0

1

0

16

TomLimi retweeted

Grigory Sapunov

@che_shr_cat

25 days ago

1/ The "20 tokens per parameter" Chinchilla scaling law is flawed. It is an artifact of your tokenizer. Scaling shouldn't be measured in tokens at all. It should be measured in bytes. 🧵

6

343

46

302

19K

Tomasz Limisiewicz @TomLimi

26 days ago

There is life beyond BPE! 🔠🌱🥪 Don’t miss this amazing work from @JulieKallini tackling one of the key challenges of byte-level LLMs: generation speed. Diffusion and speculative decoding come to the rescue, enabling much faster generation with BLT with similar performance.

Julie Kallini ✨

@JulieKallini

26 days ago

Fast Byte Latent Transformer is accepted to ICML 2026! ⚡🥪 Byte-level LMs promise to free us from subword tokenizers, but decoding one byte at a time is super slow. We make BLT generation more efficient with BLT-D: text diffusion for parallel byte decoding. 1/

14

737

111

461

97K

1

25

3

12

3K

Tomasz Limisiewicz @TomLimi

30 days ago

@arimedai @AnthropicAI With lower compression, we allow more compute for the same data sample, which benefits performance. But during training, a low-compression model needs more compute to process enough data.

0

12

Tomasz Limisiewicz @TomLimi

30 days ago

@Vijay2050977 A 128k subword tokenizer is constrained to a closed vocabulary. A latent tokenizer supports any string as a token, while maintaining the set average compression across the sequence.

0

1

0

92

Tomasz Limisiewicz @TomLimi

30 days ago

@Artificially999 mine too! 💙

0

2

Tomasz Limisiewicz @TomLimi

30 days ago

@jan_metzen That's interesting. We compared different tokenization schemes and got consistent trends, though the optimal compression varied a bit. You can check Appendix C for more details.

0

81

TomLimi retweeted

Artidoro Pagnoni

@ArtidoroPagnoni

about 1 month ago

Tokens are not a universal unit of data. In our new work on Compute Optimal Tokenization, we show that when adapting scaling recipes across tokenizers, bytes are the more stable unit. And the compute-optimal compression rate is not necessarily what today’s BPE tokenizers use.

3

69

6

26

8K

TomLimi retweeted

Srini Iyer

@sriniiyer88

about 1 month ago

Extremely excited about our work on Compute Optimal Tokenization! This paper categorically nails down the role that compression plays in compute optimality and recommends how to scale models keeping compression in mind. Cool results on multiple languages too!

0

7

4

2

1K

TomLimi retweeted

You Jiacheng @YouJiacheng

about 1 month ago

Larger compute prefers a smaller vocabulary, interesting. 2 follow-up questions: 1. Can we decouple in/out tokenization, to isolate the effect of more input tokens vs. finer prediction granularity? (see also https://t.co/0pGh4OGJVM) 2. Can we combine it with n-gram embed?

1

32

5

14

4K
