Jimmy Lin

2 days ago

Search is no longer just a ranked list...LLM agents can now query, inspect, reformulate, and decide when to stop 🤖 At TREC RAG 2026, we’re introducing new metrics for agentic search: evaluating not only final results, but the search process itself 📊 Stay tuned!

lintool retweeted

5 days ago

🤨 Is your agent confused about what to build because it says there aren’t any guidelines? Now your agent has no more excuses - track guidelines for TREC RAG 2026 are out 🔥 And yes, they’re available via SKILLz 😎 Tell your agents to showcase your agentic search system!

lintool retweeted

Associate Professor @UCLAengineering/@UCLA. Area: #NLProc/#ML/#AI https://t.co/zj1ssZj9ox

19 days ago

Does retrieval help RAG or did the LLM already memorize the answer? 🤔 Too often, the overlap between RAG corpora and what LLMs “know” is unclear. Better RAG evaluation needs tighter alignment between NLP and IR 📚 That's why for RAG 2026 we are using @nvidia's ClimbMix corpus.

19 days ago

But I think we can do better... what about zero parameters? Let me introduce you to something else that's awesome: It's called grep. https://t.co/mjNYuZl2dC

19 days ago

Since we're counting model parameters, let me introduce you to a two-parameter model for agentic search that's awesome: It's called BM25. I haven't tried it yet, but I think fp4 will work fine. https://t.co/KvMqGKCRJF
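
For readers counting along: the two parameters are presumably the ones usually written k1 (term-frequency saturation) and b (length normalization). Below is a minimal sketch of the scoring function, assuming pre-tokenized text and one common IDF variant. The name `bm25_scores` and the default values are illustrative, not taken from the thread or from any particular toolkit.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each tokenized document against a tokenized query with BM25.

    k1 and b are the only tunable parameters; the defaults are illustrative.
    """
    n = len(docs)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # One common smoothed IDF form (always non-negative).
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length-normalized term-frequency saturation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["bm25", "is", "a", "baseline"],
    ["dense", "retrieval", "model"],
    ["bm25", "bm25", "hybrid"],
]
scores = bm25_scores(["bm25"], docs)
```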

20 days ago

Thus, our conclusions: This I believe is the first demonstration of the need for hybrid search. Hence the claim that hybrid search is a @UWaterloo innovation. You're welcome!

The broader lesson is that old baselines are still surprisingly important. Let's not forget them. https://t.co/4EH0qPrkTJ

20 days ago

I think @xueguang_ma is being too modest, so I'll provide context: he along with @rpradeep42 and a UWaterloo ugrad (Kai Sun) popularized hybrid search in its current form. So, if you're using hybrid search today, thank them. 🙏 Yes, this is clickbait-y, so I'll support my claims 🧵

Xueguang Ma

@xueguang_ma

22 days ago

This plot reminds me of my first IR work reproducing DPR in Pyserini, where we found BM25 is amazingly helpful when hybridized with a dense retriever. BM25 is never just a simple baseline: used the right way, it can easily outperform many fancy methods.

BM25 was the most robust method shown in BEIR, the most effective and efficient method for long-context search shown in LongEmbed, and now @mattjustram and @xuzihuan4 show that BM25 can push search agents onto the best efficiency frontier.

p.s. Pyserini and pi-serini are two different repos.

20 days ago

But that's not what we found: even with DPR, a dense-sparse hybrid with BM25 is significantly better than DPR alone. https://t.co/kxCvJhgWEv https://t.co/7VU3TYDiJ0
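
A dense-sparse hybrid can be as simple as combining two score lists. The sketch below shows one common recipe, a weighted sum of min-max normalized scores. It is not necessarily the fusion used in the work referenced above; `hybrid_fuse`, `alpha`, and the toy scores are all hypothetical.

```python
def hybrid_fuse(sparse, dense, alpha=0.5):
    """Fuse two {doc_id: score} dicts into one ranked list.

    alpha weights the sparse (e.g. BM25) side; 1 - alpha weights the dense side.
    Documents missing from one list contribute 0 from that side.
    """

    def normalize(scores):
        # Min-max normalize so both retrievers score on a [0, 1] scale.
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        if hi == lo:
            return {doc: 1.0 for doc in scores}
        return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

    s, d = normalize(sparse), normalize(dense)
    fused = {
        doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
        for doc in s.keys() | d.keys()
    }
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy scores only, made up for illustration.
sparse_hits = {"a": 10.0, "b": 5.0, "c": 1.0}
dense_hits = {"b": 0.9, "c": 0.8, "d": 0.1}
ranked = hybrid_fuse(sparse_hits, dense_hits)
```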

lintool retweeted

Jheng-Hong Yang

@mattjustram

20 days ago

https://t.co/qVJYa9YDp6

lintool retweeted

Jheng-Hong Yang

@mattjustram

22 days ago

someone already wrote a love letter to pi, by @badlogicgames. so we wrote a love paper to pi :) with my teammates @xuzihuan4 and @lintool. a few days ago, i promised i’d share some fun plots once Pi-Serini joined the BrowseComp-Plus deep research agent party. now, it’s about time. here weeeee goooooo. bear with the sloppy images first. the serious one is at the end. the question was simple: how far can we push deep research with BM25 + pi? turns out: weirdly far.

lintool retweeted

22 days ago

TREC RAG is returning for 2026! 🎉 This year’s iteration is special because agents 🤖 can join the fun… but what might agent-first community evaluation look like? 🧵👇

lintool retweeted

Tz-Huan Hsu @xuzihuan4

22 days ago

Does a lexical retriever suffice for agentic search when agents can keep refining their queries? As LLMs become more capable in agentic loops, agents can continuously refine their actions based on environmental feedback. We couldn’t help but ask the question above.

23 days ago

What I'm cooking up... 👨‍🍳

lintool retweeted

Zhuofeng Li

@zhuofengli96475

26 days ago

🔥 Introducing Direct Corpus Interaction (DCI)! The best retriever for agentic search is no retriever.

🚀 We replaced the entire agentic search pipeline (embedding model, vector index, top-k retrieval) with only `grep` and `bash`. 🔧

📄 Paper: https://t.co/9FVvrdLCRf

DCI unlocks the full agentic potential of any Claude Sonnet 4.6: 69.0% → 80.0% on BrowseComp-Plus (+11.0, −$424).

💡The Magic:
The agent searches the raw corpus directly (`grep`, `find`, `bash`, shell pipelines), exactly like a coding agent navigating a codebase. No preprocess. No embedding model. No vector index. No offline indexing.

📊The Results:
DCI outperforms top baselines across 13 benchmarks, with average gains of:
🔍 Agentic Search: +11.0%
🧠 Multi-hop QA: +30.7%
📈 IR Ranking: +21.5%

💡 Insights:
Beyond accuracy, we conduct a series of controlled ablation studies to pinpoint the sources of DCI’s gains. Specifically, we examine trajectory-level search, evidence utilization corpus, context management, and tool usage (RQ2-RQ6).

Try it yourself!
🛠️Code: https://t.co/A8ch5QM1E4
🤖 Demo: https://t.co/Y9H4abb2P6
🔎 Eval logs: https://t.co/4iM7u1M8mz
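
To make the idea concrete, here is a rough Python stand-in for a `grep`-style pass over a raw corpus with no index. It is an illustration only, not the DCI code. `grep_corpus`, the `*.txt` file layout, and `max_hits` are assumptions made for this sketch.

```python
import re
import tempfile
from pathlib import Path


def grep_corpus(root, pattern, max_hits=20):
    """Scan raw text files under root for a regex, like a case-insensitive grep -rn.

    Returns (file name, line number, stripped line) tuples, in file-name order.
    No index is built: every call rereads the files.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                if regex.search(line):
                    hits.append((path.name, lineno, line.strip()))
                    if len(hits) >= max_hits:
                        return hits
    return hits


# Tiny throwaway corpus, made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("BM25 is a baseline\nnothing here\n", encoding="utf-8")
    Path(tmp, "b.txt").write_text("dense retrieval\nhybrid with bm25\n", encoding="utf-8")
    demo_hits = grep_corpus(tmp, r"bm25")
```

An agent in this setup would issue a pattern, read the hits, and reformulate the pattern based on what it sees.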

28 days ago

@s_gaweda Two criteria come to mind: (1) accuracy - did the agent do what the skill promises? (2) token efficiency - how many tokens did the agent have to burn?

28 days ago

⁉️ What's the goal of code review for SKILLz? 🙋‍♂️ I'm interested in hearing your opinion on this: What are CR best practices for SKILLz?

28 days ago

Does this change if the SKILL is shared widely within an org? Does this change if an org uses multiple agents? Should I get Codex and Claude to iterate on the PR until they're both happy (and stay out of it)? What are emerging best practices here? ⁉️
