@sytelus This paper is saying that data filtering is a compute efficiency problem & not model quality problem.
And its worth saying imo. I think your qualms are that you don't buy the "high-compute, data-scarce" assumption -- under which they operate.
@OwainEvans_UK Yes, but I see you finetuned w/o any chat template, not sure if that's realistic for deployed llm (did eval also not use it?)
Does this result replicates w/ chat template? That would be surprising (and more interesting!). But currently model just acts like unaligned base model.
We’re training models wrong and it’s due to chatGPT. Even the modern coding agents used daily still use message-based exchanges: They send messages to users, to themselves (CoT) and to tools, and receive messages in turn.
This bottlenecks even very intelligent agents to a single stream. The models cannot read while writing, cannot act while thinking and cannot think while processing information.
In our new paper, see below, we discuss LLMs with parallel streams. We show that multi-stream LLMs can …
🔵Be created by instruction-tuning for the stream format
🔵Simplify user and tool use UX removing many pain points with agents and chat models (such as having to interrupt the model to get a word in)
🔵Multi-Stream LLMs are fast, they can predict+read tokens in all streams in parallel in each forward pass, improving latency
🔵 LLMs with multiple streams have an easier time encoding a separation of concerns, improving security
🔵 LLMs with many internal streams provide a legible form of parallel/cont. reasoning. Even if the main CoT stream is accidentally pressured or too focused on a particular task to voice concerns, other internal streams can subvocalize concerns that would otherwise not be verbalized.
Does this sound related to a recent thinky post :) - Yes, but I don’t feel so bad about being outshipped with such a cool report on their side by 23 hours. I’ll link a 2nd thread below with a more direct comparison. I actually think both are complementary in interesting ways.
🧵1/ A single neuron is sufficient to bypass safety alignment in LLMs.
Across 7 models, 2 families, and scales from 1.7B to 70B, suppressing one MLP neuron bypasses refusal behavior — with no fine-tuning and no prompt engineering.
We call them refusal neurons.
We also study concept neurons: neurons that encode harmful knowledge itself. As a proof of concept, we identify suicide-related neurons. Our analysis reveals several interesting results⬇️
Joint work with @AtoosaChegini (equal contribution) , Maria Safi
Excited to announce my first preprint in LM interpretability!
Latent reasoning models are not monitorable by default, since they don't reason in human-readable, natural language text. But can we make progress in understanding their intermediate reasoning steps using mech interp?
Open Source Pangram is out now! We have released the datasets, code, and two models based on our EditLens work on quantifying the extent of AI editing in texts.
⛷️Here’s my entry for the fast generative model olympics🥇
The Sphere Encoder is an autocoder so powerful that it produces high quality images quickly and without diffusion.
At training time, we learn an encoder that maps natural images uniformly onto the surface of a sphere. At inference time, we sample a random vector from the sphere, and a decoder makes it into an image.
🚀 Train Small, Run Big - Surrogate Training for Giant VLMs. Training a tiny 400M vision encoder that plugs into a 70B LLM – ✅ No billion $ GPU bills ✅ No endless fine‑tuning. Sounds like a free lunch? 🍱 Our #ICCV2025 paper shows it’s real with Zero‑Shot Grafting.📝 Paper: https://t.co/5A7hEc24pG 🧶 Thread ↓
We released 44B synthetic tokens from our CoT-guided rewriting, offering higher quality pretraining data than the average human-written web texts📈
🤗Data: https://t.co/FN6X1oFPNL
📜Paper: https://t.co/78Vu89UvuD (accepted at #COLM2025)
Excited to see what the community builds!
>always exist an input ... for which autoregressive models will break
that's a strong statement -- are you asserting models never truly generalize and never will?
@jachiam0 On downside, it feels more realigning/consolidation of oai model stack than a breathrough foundation model. It doesn't seem like a model oai would have proud to call gpt5 (say 2 years ago). the expectation (much of it invited) were/are huge. Not bad, but not otherworldly either.
@jachiam0 My first impression from gpt5 is that its a great model, qualitatively different from gpt4o, it feels closer to o3 (like its in middle of user and o3). there is obv smart model picker dynamics being played out. thinking mode is good too.
@max_spero_ I keep my Zoom unlocked (not anymore) and someone anonymously blasted indecent stuff on screensharing. Ended up having a bit too happening talk lol (thanks!)
@max_spero_@moultano not at all: https://t.co/fwK4skxZxP :)
weirdly, I was in market for actual literal binoculars, and its true, those things have no upper limit.
There’s been heated debate lately: Can generative AI truly self-improve?
✅Some say yes, pointing to models learning like curious humans.
❌Others say no, invoking the first law of thermodynamics: You can’t get something from nothing. No new info, no gain.
🧠 But what if the right questions could be generated on demand?🧠 🎯 Not static, not pre-written, but tailored to exactly what the model struggles with, right now.
🧑🎓Humans improve because we seek out new material, questions, feedback, and curriculum, just beyond our current abilities.
🔥 So what if AI agents 🤖 had access to the same kind of dynamic challenge? 🔥
A limitless pool of evolving tasks and questions that
-💡scale with skill
- 🔍 that reveal blind spots
- 📚 that teach
That’s the vision behind MORSE-500:
🎥 A programmatically controllable video benchmark to stress-test and train multimodal reasoning.
🧠 Abstract
🔄 Temporal
📦 Spatial
📈 Planning
⚙️ Physical
🧩 Mathematical
Each instance is scripted with Python (Manim, Matplotlib, MoviePy), gen-AI models, and real footage.
📉 Unlike static benchmarks that models quickly outgrow,
🚀 MORSE-500 evolves.
It’s not just a benchmark -- 🛝 it’s a reasoning simulator (**infinite training data!**) for next-gen AI.
📄 Paper: https://t.co/E5ALP9LjNr
🌐 Project: https://t.co/0gKjjcQTqf
#MORSE500 #VLM #MultimodalReasoning #AIResearch #SelfImprovingAI
✨ We believe this is just the first step toward building a playground to test the self-improvement potential of generative AI.
If you are around at #ICML2025, let's talk!