My new Weaviate podcast from Argentina with a freshly brewed mate in hand!🧉🇦🇷
It's always fun to chat with @CShorten30: I talk about search agents and how rapidly this field has changed, and share ORBIT findings & research ideas! 💚
How do we train and evaluate Search Agents? 👾🔎
I am SUPER EXCITED to publish a new episode of the Weaviate Podcast with Nandan Thakur (@beirmug) on Search Agents! 🎙️💚
Firstly, congratulations to Nandan, who has just completed his Ph.D. at the University of Waterloo, advised by Professor Jimmy Lin (@lintool)! 🎉
During this time, Nandan published several impactful works such as BEIR 🍻, MIRACL 🌍🙌🌏, FreshStack 🥞, and many more.
This podcast dives into his new work on ORBIT and the current state of Search Agents! ⚛️
ORBIT contains 20K training examples, each one a complex, multi-hop question paired with a short verifiable answer. For example, "What was the runtime of the 2017 animated film set inside a smartphone, directed by..." (Answer: 86 minutes). 🎬
This dataset is used to train Search Agents on queries that require, say, 4 to 5 searches to answer.
The crazy part is that ORBIT was generated entirely without paid Web Search APIs! The entire pipeline runs on a 2018 Linux laptop driving DeepSeek's free chat interface! 💻♻️
Trained on ORBIT, Qwen3-4B beats InfoSeeker-4B by 4.3 EM and Search-R1-4B by 9.0 EM across 7 Wikipedia QA benchmarks.
A lot of interesting nuggets in this one! As always, I hope you find it useful, and I'm happy to discuss further! 👋
By now, everyone knows that single-vector embedding models are hugely limiting for modern workflows.
But they contain more than you think: you can extract sparse Latent Terms from them.
And it turns out that BM25 is all you need to turn this vocabulary into a strong retriever.
Grateful that my PhD thesis was recognized as one of the top dissertations in the 2026 Faculty of Mathematics Doctoral Prize at the @UWaterloo ! 🎉
And it is always especially nice to hear kind words from your PhD supervisor @claclarke . I guess that feeling never really goes away, even after you graduate. 😊
https://t.co/P6huzVj0Y9
Does retrieval help RAG or did the LLM already memorize the answer? 🤔 Too often, the overlap between RAG corpora and what LLMs “know” is unclear.
Better RAG evaluation needs tighter alignment between NLP and IR
📚 That's why for RAG 2026 we are using @nvidia's ClimbMix corpus
I’ve never been this excited about search.
6-7 years ago, IR got an influx of the paradigms we still use, all enabled by the big headroom MS MARCO and then BEIR created. Then progress slowed.
Today, Diane releases perhaps the most ambitious IR benchmark to date: OBLIQ-Bench.
Queries in it are meant to be increasingly opaque to current first-stage retrieval paradigms. Oblique queries put the bottleneck very early in the search process, as the relevance of a document to the query is quite latent.
I can't wait for core IR research on fundamentally more powerful paradigms for first-stage search to be reignited. Stay tuned for more stories about this, and read Diane's thread and her paper below!!
Introducing ⚛️ORBIT, a 20K reasoning-intensive web training dataset for search agents frugally generated without relying on paid APIs. Small (<4B) search agents trained with ORBIT outperform others by up to 9.0 EM accuracy on single & multi-hop Wikipedia QA. 🧵
Haha, I can't emphasize enough that compute/funding should not limit us in academia.
In ⚛️ ORBIT, I found that a single Linux laptop running non-stop for months is enough to generate a pretty good dataset; you don't need expensive APIs!