Starting in 30 min! If you’re interested in deploying LLMs in decision-making domains and reasoning under uncertainty, come chat with me and @JillianRossA_
Heading to #ICLR2026 (@iclr_conf) 🇧🇷 to present OpenEstimate!
As LLMs get deployed in decision-making domains, they're increasingly expected to do subjective probability estimation, drawing on everything they know to form beliefs about unknown quantities. Our paper studies this capability with a leakage-resistant benchmark.
This sits at the intersection of a few things I care about: RL in hard-to-verify domains, forecasting, and making LLMs honest about what they don't know.
Come find me Saturday 10:30–1 at poster #1716 in Pavilion 3! And if you'd like to grab coffee and chat about any of these, DMs are open!
One piece of code can replace 100 chains of thought when training LLMs.
Come chat with us tomorrow at the afternoon poster session #iclr2026
P3 poster #507 🕺
On my way to #ICLR2026 to present OpenEstimate with @alanamarzoev and give a spotlight talk at the FINAI Workshop.
Over the past few years, @AndrewWLo and I have been studying whether LLMs can be trusted to give sound investment advice. In my talk, I'll show that LLMs demonstrate heuristic collapse: rather than weighing all relevant factors, they latch onto a few salient features and ignore the rest. Heuristic collapse has direct consequences for whether LLMs can meet the legal standard of a fiduciary — and for AI advisors more broadly.
This is one of many reasons I think investing is one of the best domains for studying LLMs. Through this domain, I've been able to study LLM reasoning, human-LLM interaction, and emergent systemic effects. If you're working on any of these topics, I'd love to meet. Come find me before or after the talk on Monday at 1:35PM!
Do AI agents ask good questions? We built “Collaborative Battleship” to find out—and discovered that weaker LMs + Bayesian inference can beat GPT-5 at 1% of the cost.
Paper, code & demos: https://t.co/ZFPt46XYUj
Here's what we learned about building rational information-seeking agents... 🧵🔽
👉 New preprint! We have lots of great benchmarks for tasks where it's possible, in principle, for models to get all the answers exactly correct. But what about tasks that *intrinsically* require reasoning about uncertain facts and quantities?
This was joint work with @JillianRossA_, @MikeCafarella, and @jacobandreas
We’ve open sourced our benchmark OpenEstimate to drive research and progress in this space.
Stay tuned for more!
📝 Paper: https://t.co/dJkDBBNmJr
⚙️ Source code: https://t.co/KhBtz5wluA
🚨 New paper up on how LLMs reason under uncertainty! 🎲
Many real world uses of LLMs are characterized by the unknown—not only are the models prompted with partial information, but often even humans don't know the "right answer" to the questions asked.
Yet most LLM evals focus on problems with clearly defined success criteria. There’s a gap in our understanding of how models perform in this setting.
We investigate... 🔎
All of that’s to say… There's a lot of room for improvement!
And we’re starting to see some action: maybe new RL methods like RLCR from @MehulDamani2 and @ishapuri101 could make things better 👀
https://t.co/CdrkcLoIgC
🚨New Paper!🚨
We trained reasoning LLMs to reason about what they don't know.
o1-style reasoning training improves accuracy but produces overconfident models that hallucinate more.
Meet RLCR: a simple RL method that trains LLMs to reason and reflect on their uncertainty -- improving both accuracy ✅ and calibration 🎯. [1/N]
Dr. GRPO paper was presented at @COLM_conf today, and it's a great read: https://t.co/SnWBLvZKhg
If I had a nickel for every time someone found a bug in a core ML algorithm, I would have at least two nickels
Bonjour from Montreal 🇨🇦 spending the next few days here @ COLM! DM me if you’re around and want to chat about research or non-research topics, including but not limited to: reasoning under uncertainty, forecasting, summarization/RAG, and startups
✈️ 🦙 Heading to COLM through Thursday!
We’re hiring ML researchers at Jane Street for intern and full-time roles, and we support grad students through our fellowship program. DM me or stop by the JS booth if you want to chat about what we’re doing with ML @ JS!
Streaming dataflow provides a unique solution to scaling OLTP applications. Want to learn how?
Founder and CEO of Readyset, @alanamarzoev, will be giving a talk on this subject at @qconlondon on Tuesday, April 9th at 10:35AM BST! Learn more:
https://t.co/L3JrIuWBY5
caching can be really helpful to reduce backend load, but cache invalidation is famously one of the hard problems in CS
enter https://t.co/cpDJ9xjU20 - a cache that is **always in sync** with postgres, so you don't need to invalidate stale data 😮