Fast Byte Latent Transformer is accepted to ICML 2026! ⚡🥪
Byte-level LMs promise to free us from subword tokenizers, but decoding one byte at a time is super slow.
We make BLT generation more efficient with BLT-D: text diffusion for parallel byte decoding. 1/
1/8 RLVR improves accuracy but does not always lead to causal and verifiable CoTs.
Surprisingly, this happens even on reasoning-intensive tasks!
But we can fix this with reward shaping and SFT-before-RL.
Can we build a blind, *unlinkable inference* layer where ChatGPT/Claude/Gemini can't tell which call came from which users, like a “VPN for AI inference”?
Yes! Blog post below + we built it into open source infra/chat app and served >15k prompts at Stanford so far. How it helps with AI user privacy:
# The AI user privacy problem
If you ask AI to analyze your ChatGPT history today, it’s surprisingly easy to infer your demographics, health, immigration status, and political beliefs. Every prompt we send accumulates into an (identity-linked) profile that the AI lab controls completely and indefinitely. At a minimum this is a goldmine for ads (as we know now). A bigger issue is the concentration of power: AI labs can easily become (or asked to become) a Cambridge Analytica, whistleblow your immigration status, or work with health insurance to adjust your premium if they so choose.
This is a uniquely worse problem than search engines because your average query is now more revealing (not just keywords), interactive, and intelligence is now cheap. Despite this, most of us still want these remote models; they’re just too good and convenient! (this is aka the "privacy paradox".)
# Unlinkable inference as a user privacy architecture
The idea of unlinkable inference is to add privacy while preserving access to the remote models controlled by someone else. A “privacy wrapper” or “VPN for AI inference”, so to speak.
Concretely, it’s a blind inference middle layer that:
(1) consists of decentralized proxies that anyone can operate;
(2) blindly authenticates requests (via blind signatures / RFC9474,9578) so requests are provably sandboxed from each other and from user identity;
(3) relays prompts over randomly chosen proxies that don’t see or log traffic (via client-side ephemeral keys or hosting in TEEs); and
(4) the provider simply sees a mixed pool of anonymous prompts from the proxies. No state, pseudonyms, or linkable metadata.
If you squint, an unlinkable inference layer is essentially a vendor for per-request, anonymous, ephemeral AI access credentials (for users or agents alike). It partitions your context so that user tracking is drastically harder.
Obviously, unlinkability isn’t a silver bullet: the prompt itself still goes to the remote model and can leak privacy (so don't use our chat app for a therapy session!). It aims to combat *longitudinal tracking* as a major threat to user privacy, and its statistical power increases quickly by mixing more users and requests.
Unlinkability can be applied at any granularity. For an AI chat app, you can unlinkably request a fresh ephemeral key for every session so tracking is virtually impossible.
# The Open Anonymity Project
We started this project with the belief that intelligence should be a truly public utility. Like water and electricity, providers should be compensated by usage, not who you are or what you do with it. We think unlinkable inference is a first step towards this “intelligence neutrality”.
# Try it out! It’s quite practical
- Chat app “oa-chat”: https://t.co/ELf8LvxFzX
(<20 seconds to get going)
- Blog post that should be a fun read: https://t.co/OwFmyFlZH5
- Project page: https://t.co/Swerz1xDE2
- GitHub: https://t.co/38CeKajCy2
Relational Foundation Models face a scaling problem: diverse training datasets are rarely public due to privacy constraints 🔒.
🚀 We are excited to introduce "PluRel": a framework that synthesizes diverse multi-table relational databases from scratch, unlocking scaling laws for RFMs. 🧵
Kudos to the amazing collaborators at @StanfordAILab@Kumo_ai_team , and @SAP : @_rishabhranjan_@VHudovernik@vijaypradwi@johanneshoffart@guestrin@jure
Announcing 🌇HumanLM, a RL framework that trains LLMs to simulate human users’ responses, along with 🌆Humanual, a comprehensive user simulation benchmark
https://t.co/LivDkQ2ioo
🌄 One thing that’s fascinating about our society: human users shape the world and determine the value of almost everything
👨💼 Human reactions reflect how justifiable policies are
👩🎨 Human preferences determine the popularity of blogs/products/media
👩💻 Human feedback evaluates LLMs and makes the best LLM collaborators
🌅If we know how to simulate users **accurately**, we know how things are evaluated and what the future looks like, and we can improve things in a way that like or can collaborate well with.
So, meet HumanLM, our effort to enable a more human-centric future by simulating users.
1/ Wonderful student projects from CS329H (Fall ’25) ML from Human Preferences at Stanford University! 🚀
@sangttruong@andreas_h0wpt and I introduced students to preference learning + alignment, culminating in final projects. Out of ~50, here are 5 standouts 👇
Interested in building and benchmarking deep research systems?
Excited to introduce DeepScholar-Bench, a live benchmark for generative research synthesis, from our team at Stanford and Berkeley!
🏆Live Leaderboard https://t.co/SdE1tRtrYJ
📚 Paper: https://t.co/E1CBUqVMjO
🛠️ Code: https://t.co/FUxmY9QmBE
🧵👇
Some of the most exciting AI apps require LLM reasoning over large datasets at test time.
For these types of NL questions, RAG or Text2SQL + your favorite LLM are simply not enough.
Excited to announce our new leaderboard, from the TAG team at Stanford and Berkeley, to benchmark progress on these difficult types of NL questions over data.
Make a submission & top our leaderboard: https://t.co/Y753p8oGhJ
🔍 Vision language models are getting better - but how do we evaluate them reliably? Introducing AutoConverter: transforming open-ended VQA into challenging multiple-choice questions!
Key findings:
1️⃣ Current open-ended VQA eval methods are flawed: rule-based metrics correlate poorly with true performance (0.09 on VQAv2), while model-based eval has reproducibility issues (updates in GPT-4o versions constantly increase scores by 6% on MMVet).
2️⃣ To address this challenge, we propose AutoConverter, an agentic framework that automatically converts open-ended VQA to multiple-choice questions. It generates distractors matching/exceeding human difficulty, with only 3% of generated questions incorrect.
3️⃣ Using AutoConverter, we built VMCBench: 9,018 multiple-choice questions from 20 datasets testing 33 VLMs in a unified format!
🎯 Our goal: Make VLM evaluation more reliable, efficient & scalable
https://t.co/BSnOW8QmYr
Joint work with a really fantastic team: @hhhhh2033528 (co-lead) @leoliuym@XiaohanWang96@jmhb0@elaine__sui@ChenyuW64562111@AkliluJosiah2@Ale9806_@anjiangw advised by @lschmidt3@yeung_levy!
We've been building LOTUS at Stanford and Berkeley to make LLM-powered data processing fast, easy and declarative.
LOTUS is an open-source query engine that makes programming as easy as writing Pandas and optimizes your programs for up to 400x speedups.
To celebrate the holidays, we're excited to share our release of LOTUS 1.0.0 with a batch of new updates that make reasoning over your data faster, easier and better than ever!
Code: https://t.co/qQVJ6Vg6fi
🧵👇
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
🤔 Why are VLMs (even GPT-4V) worse at image classification than CLIP, despite using CLIP as their vision encoder?
Presenting VLMClassifier at #NeurIPS2024:
⏰ Dec 11 (Wed), 11:00-14:00
📍 East Hall #3710
Key findings:
1️⃣ VLMs dramatically underperform CLIP (>20% gap)
2️⃣ After testing 6 hypotheses, we found it's not architecture or training objective—it's lack of alignment data
3️⃣ Solution: adding classification data makes VLMs SOTA classifiers + improves their general capabilities!
🔗 https://t.co/dRVxwvaupk
Joint work w/ @AlyssaUnell, @XiaohanWang96, Dhruba Ghosh, Yuchang Su, @lschmidt3, @yeung_levy at @StanfordAILab!
Prior work has used LLMs to simulate survey responses, yet their ability to match the distribution of views remains uncertain.
Our new paper [https://t.co/DleesiPbif] introduces a benchmark to evaluate how distributionally aligned LLMs are with human opinions.
🧵
AI has the potential to transform real-world domains. But can AI actually improve outcomes in live interactions?
We conducted the first large-scale intervention of a Human-AI Approach that has statistically significant positive learning gains w/ 900 tutors & 1,800 K12 students.