What happens when you compare the distributions of real and simulated user behaviors?
🔍 The gap is large.
We introduce a method to measure this gap and evaluate 24 LLM-based user simulators across coding and writing tasks.
@convai_uiuc@MSFTResearch@berkeley_ai
🧵 1/N
The better LLMs get at reasoning, the longer their traces get—thousands of tokens, dozens of tool calls. But in law, medicine, and agentic AI, "usually correct" isn't good enough: answers must be verifiably correct.
We built interwhen at @MSFTResearch to make that tractable. And it's now open source.
Across benchmarks, plugging interwhen into an LLM yields:
✅ 100% soundness (with full verifiers)
📈 up to 15% accuracy gain
⚡ ~ 1.5× compute cost
🧵
We believe forecasting and simulations should be accessible at SCALE,
the way everyone uses weather forecasts today!
For the past few months, we have been building towards that VISION.
More details coming soon!! Stay in touch. @KairosityAI
Great discussion this evening at the Hidden Layers reading group on State-Space Models. Thanks to @SaiMadhavanG (IIIT-B) and @cloneofatharva (Kairosity Co-founder) for leading us through it with their insights!
Astitva by Anurag, Prasad and Sagar.
Best Overall Film at Mumbai AI Film Festival
Prize: 12 Lakhs in cash + a trip with the best creators of the world to Japan in 2026
Award sponsored by @morphic
Excited to share that our paper “STACKFEED”, in collaboration with
@namak_kun
will be presented at EMNLP 2025! 🎉We explore how to make knowledge bases in RAG systems editable and adaptive instead of staying static and stale.