Andreas Grivas

@andreasgrv

Interested in Bottlenecks in Neural Networks; Unargmaxable Outputs. Postdoc in ML/NLP at the University of Edinburgh.

Edinburgh, Scotland

Joined August 2015

664 Following

513 Followers

565 Posts

Pinned Tweet

Andreas Grivas @andreasgrv

over 2 years ago

How expressive is your deep multi-label classifier? Can it represent all outputs of interest? 🤔 SPOILER🚨: Your model can have *test set* outputs that are impossible to predict! 🚫 Check out our paper! https://t.co/7Jw6gLuYYM 🧵(1/7)

21K

andreasgrv retweeted

Piotr Nawrot

@p_nawrot

22 days ago

The best supervisor I could have asked for is building something really exciting at Imperial London. If you're considering a PhD in Efficient NLP — seriously, look into this.

andreasgrv retweeted

Edoardo Ponti @PontiEdoardo

23 days ago

I am moving to @ICComputing at @imperialcollege as an associate professor, where I will be expanding my lab! I am looking for PhDs and postdocs to join me on my quest to build foundation models with adaptive tokenisation and memory (AToM FMs, funded by @ERC_Research)

209

14K

Andreas Grivas @andreasgrv

5 months ago

Fantastic opportunity to work on #reliable #LLM #agents in Edinburgh! Antonio is phenomenal and super fun to work with!

antonio vergari ⚔️ @tetraduzione

5 months ago

🚨🚨🚨 I'll hire a #postdoc (1+1 yrs) to work on 💥#reliable #LLM #agents with #neurosymbolic #nesy layers💥 with Huawei Trustworthy Technology and Engineering Laboratory Munich 👉https://t.co/zzVSYWluzm 🙏 please share!

400

andreasgrv retweeted

Benjamin Minixhofer

@bminixhofer

6 months ago

Bolmo is now on arXiv!

andreasgrv retweeted

Desmond Elliott @delliott

6 months ago

I am grateful that the Carlsberg Foundation is supporting our basic research on tokenization-free language models at the University of Copenhagen. I will be hiring Ph.D students to start in September 2026. Feel free to reach out early if you want to express informal interest.

andreasgrv retweeted

Benjamin Minixhofer

@bminixhofer

6 months ago

We are releasing Bolmo today! Bolmo is the best byte-level model so far. It comes close to and sometimes surpasses Olmo 3. Bolmo also performs competitively in terms of speed & is fully open. I was skeptical of byte-level models for a long time but I finally switched camps🧵

113

20K

andreasgrv retweeted

Edoardo Ponti @PontiEdoardo

6 months ago

Finally, you can count the r's in strawberry and check if 3.11 is higher than 3.9 without tokenisation interfering: Here's Bolmo, a fully open byte-level LLM with latent tokenisation, derived from a SOTA LLM (Olmo 3). Promising on coding and char-level understanding!

andreasgrv retweeted

Eleonora Giunchiglia @e_giunchiglia

6 months ago

📣 PhD opening – Fall 2026 The DUCK Lab @imperialcollege is looking for a PhD student to join us! Why 🦆? We work on foundational aspects of #neurosymbolicAI and #SafeAI. 👉 DUCK = Data, Uncertainty, Constraints & Knowledge 📩 Apply by emailing: [email protected]

317

180

24K

andreasgrv retweeted

Pasquale Minervini @PMinervini

6 months ago

This was presented today by the neurosymbolic wizard @EmilevanKrieken at @EurIPSConf, and by @tetraduzione and @PontiEdoardo at @NeurIPSConf! We officially achieved quantum superposition 🚀🚀🚀🚀🚀

andreasgrv retweeted

Pasquale Minervini @PMinervini

6 months ago

🚀🚀🚀🚀🚀

andreasgrv retweeted

Ivan Titov @iatitov

6 months ago

Happy to announce one (or more) postdoctoral positions at the U Amsterdam! There’s a lot of flexibility in research direction, including continual learning, memory in LLMs, AI safety, unlearning/editing, reasoning, and interpretability - areas our group is currently focused on.

andreasgrv retweeted

Piotr Nawrot

@p_nawrot

6 months ago

We'll present "Inference-Time Hyper-Scaling with KV Cache Compression", both at NeurIPS and EurIPS. We believe that future advances in AI will require model efficiency, and this work is another step in this direction.

Save the date!
-San Diego, Thur 11:00
-Copenhagen, Thur 10:30

andreasgrv retweeted

Ivan Titov @iatitov

7 months ago

Excited about the collaboration with Kolya @FelineAutomaton . We’re offering a fully funded PhD at @EdinburghNLP (start Sept 2026), working on language-based state representations for time series, comes with a generous budget for travel and experiments.

andreasgrv retweeted

Sarah Wiegreffe @sarahwiegreffe

7 months ago

I am recruiting 2 PhD students to work on LM interpretability at UMD @umdcs starting in fall 2026! We are #3 in AI and #4 in NLP research on @CSrankings. Come join us in our lovely building just a few miles from Washington, D.C. Details in 🧵

772

169

359

111K

andreasgrv retweeted

Ivan Titov @iatitov

7 months ago

We at @EdinburghUni are looking for new PhD students to join us through the Centre for Doctoral Training in Responsible NLP. Work with us on making AI systems more responsible, trustworthy and safe @EdinburghNLP

andreasgrv retweeted

GLADIA Research Lab

@GladiaLab

7 months ago

After reading many of the replies, we would like to issue a few clarifications:
- we cannot extract training data from the model using our method
- LLMs are not injective w.r.t. the output text, that function is definitely non-injective and collisions occur all the time
- for the same reasons, LLMs are not invertible from the output text

We hope this clears up any confusion and we welcome any feedback on the matter. For any further questions, feel free to reach out to the authors: @GiorgosNik02, @tommaso_mncttn, @DonatoCrisosto1, @teelinsan, Yannis Panagakis, @EmanueleRodola

148

478

191K

andreasgrv retweeted

Piotr Nawrot

@p_nawrot

7 months ago

> From the perspective of a year ago, a hybrid of Lightning Attention and Full Attention looked just as good as pure full attention.
> Did we find a free lunch? Not quite.
> The price became clear at larger scales: the model showed obvious weaknesses in complex, multi-hop reasoning tasks.

Long post, but the above motivated me to describe what I believe is the most interesting theory I've developed about the efficiency-performance trade-off and the evaluation of efficient methods.

It matters how easy a given task is for your model

When we worked on the Sparse Frontier study, a large-scale evaluation of training-free sparse attention, we systematically tested: 6 sparse methods; 4 model sizes (7-72B); 9 tasks; 4 sequence lengths (16-128k). Everything was tightly controlled.

At first, some results made no sense. For instance, a 70B model solved a task perfectly across all sparsity methods up to 64k tokens (with 95% sparsity, an impressive ~20× theoretical efficiency gain). But at 128k tokens, performance suddenly collapsed, even with moderate sparsity (around 60–70%). Meanwhile, the 14B model, though never perfect, maintained a consistent 70% accuracy across all sequence lengths for the same task and sparsity methods, again up to 95% sparsity. My intuition has always been that larger models should tolerate sparsity better, so what's going on? Why does performance stay constant for 14B and drop for 70B?

After some investigation, I developed a theory. Sparse methods inherently reduce model capacity: the more you compress, the less capable the model becomes. To understand how far you can push compression, you have to look at the relationship between initial model capability (C) and task difficulty (D):
* If C ≫ D, you can compress aggressively and performance will stay strong.
* If C ≈ D, even small compression can break the model’s performance.

In the example above, the 70B model had enough capacity to achieve 100% accuracy at 64k tokens. But at 128k, with added distractors, the task difficulty increased, pushing the model right to its limit. A bit of compression was enough to tip it over. The 14B model, on the other hand, couldn’t solve every input, but its consistent 70% success rate came from easier samples. Since those inputs were very easy, adding distractors had little impact. The remaining 30% of samples, which the 14B model could never solve, were challenging, and at 128k they pushed the 70B model to its limits.

Takeaway: When a paper reports “no accuracy drop” on easy benchmarks, that doesn’t mean the method is safe; it just means the benchmark wasn’t hard enough to expose the weaknesses. That’s why *Needle-in-a-Haystack* tests aren’t meaningful for evaluating sparse attention or token eviction. Modern models already solve them perfectly; they’re too easy. We need benchmarks that push models to their limits, and then we should apply efficiency mods.

[Extra insight / thing to pay attention to] In sparse attention and KV compression, context relevance matters a lot

I’ve also noticed that in some papers, the evaluation setup changes quietly, for example switching from 0-shot to 5-shot settings in tasks where extra shots don't make a real difference.

If the performance gap between 0-shot and 5-shot is within the standard deviation, those extra shots don’t add meaningful information. But they can make compression methods appear stronger.

Why? Because in these cases, the “extra” context tokens (the shots) can be compressed with almost no loss in accuracy. A paper might then report “5× compression with no performance drop”, but if you ran the same experiment under strict 0-shot conditions, performance would likely fall sharply.

TLDR: Efficiency gains often look good until you test them at the edge of a model’s true capability. The closer you get to that edge, the more trade-offs reveal themselves.
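
The arithmetic behind two claims in the post above can be sketched in Python. The first function assumes that attention cost scales linearly with the fraction of entries kept, which is how 95% sparsity maps to a roughly 20× theoretical gain. The second is a hypothetical illustration of the few-shot point: if the shot tokens can be dropped without hurting accuracy, the reported compression ratio reflects prompt padding and says little about the method. The function names and the 200/800 token split are assumptions for illustration and do not come from the post.

```python
def theoretical_speedup(sparsity: float) -> float:
    """Upper-bound gain, assuming cost is proportional to the fraction kept."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return 1.0 / (1.0 - sparsity)


def apparent_compression(query_tokens: int, shot_tokens: int) -> float:
    """Ratio reported when every shot token is dropped and the query is untouched.

    Hypothetical setup: the shots carry no information the model needs,
    so removing them costs no accuracy.
    """
    if query_tokens <= 0 or shot_tokens < 0:
        raise ValueError("query_tokens must be positive, shot_tokens non-negative")
    return (query_tokens + shot_tokens) / query_tokens


if __name__ == "__main__":
    # Sparsity levels mentioned in the post: moderate (60-70%) and high (95%).
    for s in (0.60, 0.70, 0.95):
        print(f"sparsity {s:.0%}: ~{theoretical_speedup(s):.1f}x theoretical gain")

    # Assumed prompt: 200 query tokens plus 800 tokens of few-shot examples.
    ratio = apparent_compression(query_tokens=200, shot_tokens=800)
    print(f"apparent compression with redundant shots: {ratio:.1f}x")
```

With the assumed split, the prompt shrinks fivefold while the query itself is not compressed at all, which matches the post's point that the same method could fall sharply under strict 0-shot conditions.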

104

15K

andreasgrv retweeted

NeSy 2026 @nesyconf

9 months ago

@luislamb We're glad to announce the NeSy 2025 Test of Time award for "Probabilistic Inference Modulo Theories"! 🏆Rodrigo de Salvo Braz was here to accept the award. This is groundwork for recent NeSy approaches like DeepSeaProbLog and the probabilistic algebraic layer.

887

andreasgrv retweeted

Orion Weller @orionweller

9 months ago

Instructions/reasoning are now everywhere in retrieval - we want embeddings to do it all! 🚀 But... is it even possible? 🤔 Turns out, it's not possible for single-vector models 😱 theoretically and empirically! To make it obvious we OSS a simple eval SoTA models flop on! 🧵

322

215

35K
