How expressive is your deep multi-label classifier?
Can it represent all outputs of interest? 🤔
SPOILER🚨: Your model can have *test set* outputs that are impossible to predict! 🚫
Check out our paper! https://t.co/7Jw6gLuYYM
🧵(1/7)
The best supervisor I could have asked for is building something really exciting at Imperial London. If you're considering a PhD in Efficient NLP — seriously, look into this.
I am moving to @ICComputing at @imperialcollege as an associate professor, where I will be expanding my lab!
I am looking for PhDs and postdocs to join me on my quest to build foundation models with adaptive tokenisation and memory (AToM FMs, funded by @ERC_Research)
🚨🚨🚨
I'll hire a #postdoc (1+1 yrs) to work on
💥 #reliable #LLM #agents with #neurosymbolic #nesy layers 💥
with Huawei Trustworthy Technology and Engineering Laboratory Munich
👉https://t.co/zzVSYWluzm
🙏 please share!
I am grateful that the Carlsberg Foundation is supporting our basic research on tokenization-free language models at the University of Copenhagen.
I will be hiring Ph.D students to start in September 2026. Feel free to reach out early if you want to express informal interest.
We are releasing Bolmo today!
Bolmo is the best byte-level model so far. It comes close to and sometimes surpasses Olmo 3.
Bolmo also performs competitively in terms of speed & is fully open.
I was skeptical of byte-level models for a long time but I finally switched camps🧵
Finally, you can count the r's in strawberry and check if 3.11 is higher than 3.9 without tokenisation interfering:
Here's Bolmo, a fully open byte-level LLM with latent tokenisation, derived from a SOTA LLM (Olmo 3).
Promising on coding and char-level understanding!
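To make the byte-level point concrete, here is a minimal Python sketch of what a byte-level model receives as input: one ID per UTF-8 byte, so every letter and digit is individually visible. This is not Bolmo's code; the helper names `byte_ids` and `count_byte` are made up for illustration.

```python
def byte_ids(text: str) -> list[int]:
    """Return the UTF-8 byte values of `text`, one ID per byte."""
    return list(text.encode("utf-8"))


def count_byte(text: str, char: str) -> int:
    """Count occurrences of a single ASCII character at the byte level."""
    return byte_ids(text).count(ord(char))


print(byte_ids("strawberry"))              # ten IDs, one per letter
print(count_byte("strawberry", "r"))       # 3
print(byte_ids("3.11"), byte_ids("3.9"))   # every digit gets its own ID
```

A subword tokeniser may instead merge several of these characters into one opaque ID, which is the interference the post refers to.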
📣 PhD opening – Fall 2026
The DUCK Lab @imperialcollege is looking for a PhD student to join us!
Why 🦆?
We work on foundational aspects of #neurosymbolicAI and #SafeAI.
👉 DUCK = Data, Uncertainty, Constraints & Knowledge
📩 Apply by emailing: [email protected]
Happy to announce one (or more) postdoctoral positions at the U Amsterdam! There’s a lot of flexibility in research direction, including continual learning, memory in LLMs, AI safety, unlearning/editing, reasoning, and interpretability - areas our group is currently focused on.
We'll present "Inference-Time Hyper-Scaling with KV Cache Compression", both at NeurIPS and EurIPS. We believe that future advances in AI will require model efficiency, and this work is another step in this direction.
Save the date!
-San Diego, Thur 11:00
-Copenhagen, Thur 10:30
Excited about the collaboration with Kolya @FelineAutomaton. We’re offering a fully funded PhD at @EdinburghNLP (start Sept 2026), working on language-based state representations for time series. It comes with a generous budget for travel and experiments.
I am recruiting 2 PhD students to work on LM interpretability at UMD @umdcs starting in fall 2026!
We are #3 in AI and #4 in NLP research on @CSrankings.
Come join us in our lovely building just a few miles from Washington, D.C. Details in 🧵
We at @EdinburghUni are looking for new PhD students to join us through the Centre for Doctoral Training in Responsible NLP. Work with us on making AI systems more responsible, trustworthy and safe
@EdinburghNLP
After reading many of the replies, we would like to issue a few clarifications:
- we cannot extract training data from the model using our method
- LLMs are not injective w.r.t. the output text, that function is definitely non-injective and collisions occur all the time
- for the same reasons, LLMs are not invertible from the output text
we hope this clears up any confusion and we welcome any feedback on the matter.
For any further questions, feel free to reach out to the authors:
@GiorgosNik02, @tommaso_mncttn, @DonatoCrisosto1, @teelinsan, Yannis Panagakis, @EmanueleRodola
> From the perspective of a year ago, a hybrid of Lightning Attention and Full Attention looked just as good as pure full attention.
> Did we find a free lunch? Not quite.
> The price became clear at larger scales: the model showed obvious weaknesses in complex, multi-hop reasoning tasks.
Long post but the above sparked some motivation in me to describe what I believe is the most interesting theory I've developed about the efficiency-performance trade-off and the evaluation of Efficient Methods.
It matters how easy a given task is for your model
When we worked on the Sparse Frontier study, a large-scale evaluation of training-free sparse attention, we systematically tested: 6 sparse methods; 4 model sizes (7-72B); 9 tasks; 4 sequence lengths (16-128k). Everything was tightly controlled.
At first, some results made no sense. For instance, a 70B model solved a task perfectly across all sparsity methods up to 64k tokens (with 95% sparsity, an impressive ~20× theoretical efficiency gain). But at 128k tokens, performance suddenly collapsed, even with moderate sparsity (around 60–70%). Meanwhile, the 14B model, though never perfect, maintained a consistent 70% accuracy across all sequence lengths for the same task and sparsity methods, again up to 95% sparsity. My intuition has always been that larger models should tolerate sparsity better, so what's going on? Why does performance stay constant for 14B and drop for 70B?
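As a quick sanity check on the "~20×" figure: if we assume cost scales with the fraction of attention entries that are kept (an idealisation that ignores indexing and memory overheads), the theoretical gain is 1 / (1 - sparsity). The function name below is hypothetical.

```python
def theoretical_gain(sparsity: float) -> float:
    """Idealised speed-up when only a (1 - sparsity) fraction of entries is computed."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return 1.0 / (1.0 - sparsity)


for s in (0.60, 0.70, 0.95):
    # 95% sparsity gives roughly 20x; 60-70% gives roughly 2.5-3.3x.
    print(f"sparsity={s:.2f} -> ~{theoretical_gain(s):.1f}x")
```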
After some investigation, I developed a theory. Sparse methods inherently reduce model capacity — the more you compress, the less capable the model becomes. To understand how far you can push compression, you have to look at the relationship between initial model capability (C) and task difficulty (D):
* If C ≫ D, you can compress aggressively and performance will stay strong.
* If C ≈ D, even small compression can break the model’s performance.
In the example above, the 70B model had enough capacity to achieve 100% accuracy at 64k tokens. But at 128k, with added distractors, the task difficulty increased, pushing the model right to its limit. A bit of compression was enough to tip it over. The 14B model, on the other hand, couldn’t solve every input, but its consistent 70% success rate came from easier samples. Since those inputs were very easy, adding distractors had little impact. The remaining 30% of samples, which the 14B model could never solve, were the challenging ones, and at 128k they pushed the 70B model to its limits.
Takeaway: When a paper reports “no accuracy drop” on easy benchmarks, that doesn’t mean the method is safe; it just means the benchmark wasn’t hard enough to expose the weaknesses. That’s why *Needle-in-a-Haystack* tests aren’t meaningful for evaluating sparse attention or token eviction. Modern models already solve them perfectly; they’re too easy. We need benchmarks that push models to their limits, and then apply efficiency mods.
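The C-versus-D argument can be sketched as a toy simulation. Everything here is an assumption made for illustration, not data from the study: the capability and difficulty numbers, the 70/30 split into easy and hard samples, and the linear penalty that sparsity applies to capability.

```python
# Toy illustration only: every number and the linear penalty are assumptions.
EASY_SHARE, HARD_SHARE = 7, 3  # 70% easy samples, 30% hard samples

# Hypothetical task difficulty D per context length: (easy, hard).
DIFFICULTY = {"64k": (1.0, 8.0), "128k": (1.2, 9.5)}

# Hypothetical initial capability C of each model.
CAPABILITY = {"14B": 5.0, "70B": 10.0}


def effective_capability(c: float, sparsity: float, penalty: float = 0.2) -> float:
    """Assume capability shrinks linearly as sparsity grows."""
    return c * (1.0 - penalty * sparsity)


def accuracy(model: str, context: str, sparsity: float) -> float:
    """A sample counts as solved when effective capability >= its difficulty."""
    c = effective_capability(CAPABILITY[model], sparsity)
    easy, hard = DIFFICULTY[context]
    solved = EASY_SHARE * (c >= easy) + HARD_SHARE * (c >= hard)
    return solved / (EASY_SHARE + HARD_SHARE)


for model, context, sparsity in [
    ("70B", "64k", 0.95),   # C >> D: aggressive sparsity is fine
    ("70B", "128k", 0.0),   # C ~ D: still solved when dense
    ("70B", "128k", 0.6),   # C ~ D: moderate sparsity tips it over
    ("14B", "64k", 0.95),   # only easy samples are ever solved
    ("14B", "128k", 0.95),  # distractors barely matter for easy samples
]:
    print(model, context, sparsity, accuracy(model, context, sparsity))
```

In this toy setting the large model drops from 1.0 to 0.7 at 128k under 60% sparsity, while the small model sits at 0.7 throughout, which mirrors the qualitative pattern described above.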
[Extra insight / thing to pay attention to] In Sparse Attention and KV Compression, context relevance matters a lot
I’ve also noticed that in some papers, the evaluation setup changes quietly: for example, switching from 0-shot to 5-shot settings in tasks where extra shots don't make a real difference.
If the performance gap between 0-shot and 5-shot is within the standard deviation, those extra shots don’t add meaningful information. But they can make compression methods appear stronger.
Why? Because in these cases, the “extra” context tokens (the shots) can be compressed with almost no loss in accuracy. A paper might then report “5× compression with no performance drop” — but if you ran the same experiment under strict 0-shot conditions, performance would likely fall sharply.
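A back-of-the-envelope sketch of that effect, assuming the shots contribute nothing to accuracy so a method can evict all of them for free. The token counts and the function name are hypothetical.

```python
def free_compression_ratio(n_shots: int, tokens_per_shot: int, query_tokens: int) -> float:
    """Compression ratio obtained by evicting only the few-shot tokens."""
    total_tokens = n_shots * tokens_per_shot + query_tokens
    return total_tokens / query_tokens


# Hypothetical prompt: 5 shots of 200 tokens each plus a 250-token query.
print(free_compression_ratio(5, 200, 250))  # 5.0: "5x compression, no drop"
# Under strict 0-shot there is nothing free to evict.
print(free_compression_ratio(0, 200, 250))  # 1.0
```

In the 0-shot case any further compression has to remove tokens from the query itself, which is where an accuracy drop would show up.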
TLDR: Efficiency gains often look good — until you test them at the edge of a model’s true capability. The closer you get to that edge, the more trade-offs reveal themselves.
@luislamb We're glad to announce the NeSy 2025 Test of Time award for "Probabilistic Inference Modulo Theories"!
🏆Rodrigo de Salvo Braz was here to accept the award.
This is groundwork for recent NeSy approaches like DeepSeaProbLog and the probabilistic algebraic layer.
Instructions/reasoning are now everywhere in retrieval - we want embeddings to do it all! 🚀
But... is it even possible? 🤔
Turns out, it's not possible for single-vector models 😱 theoretically and empirically! To make it obvious, we open-sourced a simple eval that SoTA models flop on!
🧵