Nitay Calderon @NitCal - Twitter Profile

Pinned Tweet

3 months ago

[1/7] Why do frontier LLMs make factual errors? Is it because they never learned the fact… or because they can’t access knowledge they already encoded? In our new paper, we show: The bottleneck is not encoding; it is recall. 🧵👇 Paper: https://t.co/mkkqr0KN4X Many thanks to @_galyo @bd_eyal @zorikgekhman @eran_ofek59358 @GoogleResearch

4

124

33

90

13K

NitCal retweeted

DailyPapers

@HuggingPapers

2 days ago

A matter of TASTE Current agent benchmarks are saturated. TASTE reverses how they're built—starting from tool sequences, not hand-written scenarios. Models scoring 90% on current tests crash to 30% on TASTE, facing 2× more tool combinations.

HuggingPapers's tweet photo. https://t.co/3KhzGc4z66

1

25

5

8

2K

NitCal retweeted

Samuel Schmidgall

@SRSchmidgall

6 days ago

Our posting for joining Google DeepMind as a Research Scientist was down for a few days but now it is back up! Apply here: https://t.co/Yk5iMbMQPu And fill out this form: https://t.co/zdeqryH3hB

7

361

34

341

71K

NitCal retweeted

Vin Howe

@vinhowe

15 days ago

We build on existing work showing that frontier performance on all sorts of transfer is more inconsistent than we might hope, especially after learning from trillions of tokens: https://t.co/mYBiTyVoWk @NitCal https://t.co/Au95cAwhWX @omerNLP https://t.co/AC6IahZYI4 @LChoshen

1

0

368

NitCal retweeted

Vin Howe

@vinhowe

15 days ago

Preprint 🧵! How compartmentalized are LLMs? For data in different formats (English/Chinese, Wiki/Q&A), how much transfer occurs? We provide evidence that LLMs can struggle with this sort of transfer, with consequences like sample inefficiency and capacity competition.

3

9

3

4

2K

NitCal retweeted

Mor Ventura @mor_ventura95

15 days ago

🚨 New preprint alert! 🎨 How do image editing models handle "make it look like a rainy day" vs. "add an umbrella"? While visual models excel at explicit commands, interpreting abstract instructions remains a major bottleneck.🧵👇 [1/10]

1

27

15

0

1K

Nitay Calderon

@NitCal

18 days ago

@Moshe_Friedman_ Why do you trust the code? Why would others trust your analysis? Not once has an LLM-based analysis that students presented to me been correct from A to Z. There was always some hidden, incorrect functionality that they were not even aware of (for example, handling of missing values, weighting, changing classes...)

1

3

0

59

Nitay Calderon

@NitCal

19 days ago

@ziv_ravid @Tyler_Menzer Do you truly trust authors who didn't verify their citations to also verify the AI-generated code or analysis? I don't. Once I see signs of slop, it becomes hard to trust anything else in the paper.

0

45

Nitay Calderon

@NitCal

19 days ago

@mtutek Exactly. I wish conferences would also impose penalties

0

2

0

157

Nitay Calderon

@NitCal

20 days ago

My advisor always says time is our most valuable resource. I tell students I teach/work with that I don't plan to spend more time reading something than the author spent writing it. I support arXiv's decision. Asking authors to polish AI-generated content is a *VERY* low bar.

Thomas G. Dietterich @tdietterich

21 days ago

Attention @arxiv authors: Our Code of Conduct states that by signing your name as an author of a paper, each author takes full responsibility for all its contents, irrespective of how the contents were generated. 1/

140

6K

918

1K

1M

5

107

4

6

5K

Nitay Calderon

@NitCal

20 days ago

@yanaiela I would bet that there is a very high correlation between unpolished papers and unreliable results. If an author can't be bothered to proofread the text, I don't trust that they verified their AI-generated analysis.

2

3

0

193

Nitay Calderon

@NitCal

20 days ago

@ziv_ravid Why? Polishing/checking AI-generated content is a very minimal requirement. The steep punishment is a good way to ensure this. PIs' students should be scared to death of uploading AI-slop papers.

0

13

1

0

635

NitCal retweeted

Gal Yona

@_galyo

23 days ago

It’s 2026 and frontier LLMs STILL hallucinate. Why? In our new ICML 2026 Position Paper, we offer a simple diagnosis and a constructive path forward.

_galyo's tweet photo. https://t.co/eO95dydye8

13

229

41

154

13K

NitCal retweeted

Jonathan Karin @JonathanKarin3

23 days ago

1/7 How are cells spatially reorganized between conditions in tissues? Introducing CASEI: a method for inferring condition-associated spatial phenotypes in spatial omics data. w/ Roy Friedman (https://t.co/qm7qK1lBlS) & @mor_nitzan https://t.co/9SNJZKsOkD

JonathanKarin3's tweet photo. https://t.co/pUoGzcJIrP

1

33

7

14

3K

Nitay Calderon

@NitCal

27 days ago

@henrytdowling @_galyo Yes

0

2

0

30

Nitay Calderon

@NitCal

27 days ago

If you care about knowledge in LLMs, and why parametric knowledge remains a fundamentally important research problem, you should read Gal’s tweet 🤯

Gal Yona

@_galyo

27 days ago

.@NitCal will be presenting "Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality" at ICML 2026 next month.

(tl;dr: we show encoding is near-saturated on frontier LLMs, but models still struggle to recall encoded facts.)

One recurring piece of feedback we've gotten since posting the paper: "you show LLMs struggle with factual recall, but does that even matter when today's agents can use external retrieval?"

Here's how I currently think about this, and more broadly about the role of parametric knowledge in today's systems:

The theoretical argument for why knowledge matters (true in principle, but I don't know of work that measures this in practice): parametric knowledge is important for making efficient use of search and for knowing how to properly integrate retrieved information. Imagine finding some weird pizza recipe online: can you trust it without knowing a lot about cooking, chemistry, etc.? I think this is going to become a bigger issue moving forward, the "sloppier" the internet becomes.

The realistic case for why knowledge matters: today's agents are far from producing responses that are fully grounded in external evidence. Even when search triggers properly (which it often doesn't), only the "big" claims tend to be grounded, while models still volunteer a lot of extra information from their parametric knowledge.

Since models are still poor at "knowing what they know" (more on that in my next post, about our other ICML paper...), our best bet is making models actually more knowledgeable, and our paper reveals where the headroom for that actually lies.

2

30

5

13

4K

0

22

1

7

3K

NitCal retweeted

Gal Yona

@_galyo

27 days ago

.@NitCal will be presenting "Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality" at ICML 2026 next month.

(tl;dr: we show encoding is near-saturated on frontier LLMs, but models still struggle to recall encoded facts.)

One recurring piece of feedback we've gotten since posting the paper: "you show LLMs struggle with factual recall, but does that even matter when today's agents can use external retrieval?"

Here's how I currently think about this, and more broadly about the role of parametric knowledge in today's systems:

The theoretical argument for why knowledge matters (true in principle, but I don't know of work that measures this in practice): parametric knowledge is important for making efficient use of search and for knowing how to properly integrate retrieved information. Imagine finding some weird pizza recipe online: can you trust it without knowing a lot about cooking, chemistry, etc.? I think this is going to become a bigger issue moving forward, the "sloppier" the internet becomes.

The realistic case for why knowledge matters: today's agents are far from producing responses that are fully grounded in external evidence. Even when search triggers properly (which it often doesn't), only the "big" claims tend to be grounded, while models still volunteer a lot of extra information from their parametric knowledge.

Since models are still poor at "knowing what they know" (more on that in my next post, about our other ICML paper...), our best bet is making models actually more knowledgeable, and our paper reveals where the headroom for that actually lies.

2

30

5

13

4K

Nitay Calderon

@NitCal

about 1 month ago

Our paper got accepted to @icmlconf! 🥳🥳 I also want to say a few warm words about the reviewers and AC. Maybe because our paper was under Policy A (LLM use is prohibited), but the review process felt unusually professional and refreshing, almost like a reminder of pre-2024 peer review 😇 I hope more authors get to experience this kind of review process in the future.

Nitay Calderon

@NitCal

3 months ago

[1/7] Why do frontier LLMs make factual errors? Is it because they never learned the fact… or because they can’t access knowledge they already encoded? In our new paper, we show: The bottleneck is not encoding; it is recall. 🧵👇 Paper: https://t.co/mkkqr0KN4X Many thanks to @_galyo @bd_eyal @zorikgekhman @eran_ofek59358 @GoogleResearch

4

124

33

90

13K

1

55

3

17

6K

NitCal retweeted

roeeaharoni @roeeaharoni

about 1 month ago

Proud of being part of Google Translate (even if it was for a few months as an intern, almost a decade ago!). One of the most fun and rewarding professional experiences of my life, in a truly revolutionary team. PS not many know but lots of the groundwork to LLMs happened there!

0

16

1

0

922

NitCal retweeted

Itay Nakash @itay__nakash

about 1 month ago

🚨New Paper (ACL-26) 'Efficient Agent Evaluation via Diversity-Guided User Simulation' We tackle a core pain in agent evaluation: Current methods aim for coverage (pass@k) but mostly re-run the same conversations → low diversity, high cost. We're excited to introduce DIVERT 🧵

itay__nakash's tweet photo. https://t.co/4hZda3I0vk

2

21

8

5

550

NitCal retweeted

Omer Nahum @omer6nahum

3 months ago

Do LLMs have motivation? Motivation is a key lens for explaining human behavior. As LLM behavior becomes more human-like, a natural question arises: could it help understand model behavior too? With @AsaelSklar @GoldsteinYAriel @roireichart 📄 Paper: https://t.co/cdh2qmGNmE 1/5

omer6nahum's tweet photo. https://t.co/6amCXyQ26G

3

49

16

25

3K
