Benjamin Warner @benjamin_warner - Twitter Profile

Pinned Tweet

over 1 year ago

Today we released ModernBERT, the first encoder to reach SOTA on most common benchmarks across language understanding, retrieval, and code, while running twice as fast as DeBERTaV3 on short context and three times faster than NomicBERT & GTE on long context.

https://t.co/UDPJveQ6is

2

80

12

25

11K

Benjamin Warner @benjamin_warner

1 day ago

@KyrieBlunders Did you use the custom C++ & CUDA registration or Triton wrapping to make the custom kernel compile-compatible? https://t.co/GanbaRFVRi

2

1

0

48

Benjamin Warner @benjamin_warner

1 day ago

@NirantK Claude Haiku 4.5 is the latest version of Haiku. There are no newer generations https://t.co/efzl2BhrQ3

1

0

310

Benjamin Warner @benjamin_warner

1 day ago

@johnowhitaker If the models gain agency, then the better comparison would probably be online/mobile banking.

0

23

Benjamin Warner @benjamin_warner

2 days ago

The current best comparison for agentic coding is ATMs, which increased the demand for bank tellers.

Azeem Azhar

@azeem

2 days ago

Software engineering roles are growing and concentrating in the top-paying US companies via @GergelyOrosz https://t.co/HtwyVIS9KU

0

4

0

2

6K

1

0

1

1K

Benjamin Warner @benjamin_warner

1 day ago

@johnowhitaker Perhaps. Unlike banking, every coding productivity increase so far (punch cards to assembly, compilers, IDEs, etc.) has led to increased demand for programmers.

1

0

40

benjamin_warner retweeted

Tim Dettmers

@Tim_Dettmers

9 days ago

Not to detract from this work, but TurboQuant is not a competitive method nor a good benchmark. Researchers, including me, cannot replicate the TurboQuant paper, and even then, the performance is not great. Please. Just. Stop.

17

456

29

131

61K

Benjamin Warner @benjamin_warner

12 days ago

@latkins Thanks!

1

0

67

Benjamin Warner @benjamin_warner

12 days ago

@latkins Trinity Large has quite different recommended generation settings across Hugging Face/Arcee docs: HF model card: temperature=0.45–0.6 HF generation config: temperature=0.8 Website docs: temperature=0.3 For evaluation, which settings should we be using?

1

0

312

Benjamin Warner @benjamin_warner

13 days ago

Bearish signal on the practical usefulness of Mythos.

Will McGugan

@willmcgugan

13 days ago

My concern for the AI era, or at least this phase of it, is that a generation is being taught that "close enough" is just fine. Take @AnthropicAI for example. Text wrapping in Claude Code has been broken for weeks. Superfluous spaces appear on the left edge. One engineer to another: you know it's an out-by-one error. I refuse to believe that nobody has noticed this. The shtick they are selling is that AI can fix this kind of thing. Either they tried to prompt a fix, and Claude ain't good enough to fix an out-by-one error. Or they haven't attempted it because it is "close enough". It can't be the case that AI is only good enough if we lower our standards. It can't. I'm well aware I have both feet firmly planted in my "grumpy old man" phase of life...

73

596

43

83

112K

0

261

Benjamin Warner @benjamin_warner

14 days ago

@eliebakouch Weird way of spelling StableAdam

0

3

0

79

Benjamin Warner @benjamin_warner

14 days ago

It's astounding how much worse Claude Opus 4.7 still is at searching for up-to-date and accurate information compared to GPT-5.5 Thinking.

0

171

Benjamin Warner @benjamin_warner

24 days ago

@Dorialexander Is it truly tokenizer-less, or is utf-8 the tokenizer?

0

120

Benjamin Warner @benjamin_warner

29 days ago

@code_star Too bad it got abandoned

0

2

0

94

Benjamin Warner @benjamin_warner

about 1 month ago

@thsottiaux Codex app is missing ssh support and connecting to dev containers

0

70

benjamin_warner retweeted

Timothy B. Lee @binarybits

about 1 month ago

Never trust financial analysis from a guy who thinks Feb 30 is a thing.

1

44

4

7

6K

benjamin_warner retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

about 1 month ago

Excited that @SophontAI + @MedARC_AI has a paper accepted to ICML! Will share more details soon :)

7

61

5

0

5K

benjamin_warner retweeted

Alvaro Bartolome

@alvarobartt

about 1 month ago

IBM Granite just released two multilingual embedding models with 97M and 311M parameters 🤏🏻 ModernBERT-based, 200+ languages, 32K context, and built for retrieval, search, similarity, and code. And... day-zero support on Text Embeddings Inference and friends!

https://t.co/GCngeuiA6c

7

455

60

376

43K

Benjamin Warner @benjamin_warner

about 1 month ago

@stochasticchasm Do any Mistral models perform well enough to suspect a 30T training token budget?

1

7

0

2K

Benjamin Warner @benjamin_warner

about 1 month ago

A 90% confidence interval of 0.3-3x size would imply GPT 5.5 is anywhere from ~3 trillion parameters to ~27T. The latter is obviously not true, so this size estimation method doesn't seem useful.

Bojie Li

@bojie_li

about 1 month ago

Closed labs hide model sizes. They can't hide what their models know, and what a model knows is an indicator on how big it is. Reasoning compresses. Factual knowledge doesn't. So you can size a frontier model from black-box API calls alone, and across releases you can literally watch a single fact arrive in the parameters over time. For three years, my friends Jiyan He and Zihan Zheng have been asking frontier LLMs the same question: "what do you know about USTC Hackergame?", a CTF contest. May 2024: GPT-4o invented fake titles. Feb 2025: Claude 3.7 Sonnet listed 19 verified 2023 challenges. By April 2026, frontier models recall specific challenges across consecutive years. After DeepSeek-V4 dropped, I instructed my agent to spend four days autonomously turning that habit into Incompressible Knowledge Probes (IKP) — 1,400 questions, 7 tiers of obscurity, 188 models, 27 vendors. Three findings: 1/ You can approximately size any black-box LLM from factual accuracy alone. Penalized accuracy is log-linear in log(params), R² = 0.917 on 89 open-weight models from 135M to 1.6T params. Project closed APIs onto the curve → GPT-5.5 ~9T, Claude Opus 4.7 ~4T, GPT-5.4 ~2.2T, Claude Sonnet 4.6 ~1.7T, Gemini 2.5 Pro ~1.2T (90% CI: 0.3-3x size). 2/ Citation count and h-index don't predict whether a frontier model recognizes a researcher. Two researchers with similar citation profiles get very different responses. Models memorize impact — work that shaped a field, not many incremental papers. 3/ Factual capacity doesn't compress over time. Across 96 open-weight models across 3 years, the IKP time coefficient is statistically zero, rejecting the Densing-Law prediction of +0.0117/month at p<10⁻¹⁵. Reasoning benchmarks saturate; factual capacity keeps scaling with parameters. Website: https://t.co/CkwJsXqnsX Paper: https://t.co/eNUdC9ye7w
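
The range in the reply above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the quoted "0.3-3x" multipliers apply directly to each point estimate; that reading, the function name, and the variable names are illustrative assumptions and are not taken from the paper.

```python
# Point estimates quoted in the post, in trillions of parameters.
estimates_t = {
    "GPT-5.5": 9.0,
    "Claude Opus 4.7": 4.0,
    "GPT-5.4": 2.2,
    "Claude Sonnet 4.6": 1.7,
    "Gemini 2.5 Pro": 1.2,
}

# Assumption: the 90% CI is the point estimate scaled by these multipliers.
LOW_MULT, HIGH_MULT = 0.3, 3.0


def interval(point_estimate, low_mult=LOW_MULT, high_mult=HIGH_MULT):
    """Return (lower, upper) bounds implied by a multiplicative interval."""
    return point_estimate * low_mult, point_estimate * high_mult


intervals = {name: interval(size) for name, size in estimates_t.items()}

for name, (low, high) in intervals.items():
    print(f"{name}: {low:.2f}T to {high:.2f}T")
```

Under that assumption the GPT-5.5 estimate of ~9T spans roughly 2.7T to 27T, which matches the "~3 trillion parameters to ~27T" range in the reply. The upper bound is always ten times the lower bound, whichever model is plugged in.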

71

2K

234

1K

390K

0

13

0

1

2K

benjamin_warner retweeted

Theo - t3.gg

@theo

about 1 month ago

Despite the price increase, GPT-5.5 (xhigh) still came out cheaper than Sonnet on the Artificial Analysis Index. It's more expensive than 5.4, but barely. Also check those 5.5 (medium) numbers, they're closer to a mini model with 5.4-xhigh-level performance

https://t.co/UrRZmWgsaT

34

2K

93

319

242K
