Thinh @thinhphp_vt - Twitter Profile

The Sanghani Center at Virginia Tech @SanghaniCtrVT

about 1 month ago

Ph.D. student Thinh Pham (@thinhphp_vt) @VT_CS @SanghaniCtrVT presented his collaborative work, "SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models," during a poster session at #ICLR2026 in Rio de Janeiro. 🔗to paper: https://t.co/42SmDlBLFs

SanghaniCtrVT's tweet photo. Ph.D. student Thinh Pham (@thinhphp_vt) @VT_CS @SanghaniCtrVT presented his collaborative work, "SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models," during a poster session at #ICLR2026 in Rio de Janeiro.
🔗to paper: https://t.co/42SmDlBLFs https://t.co/z5BdwXNgjb

0

14

3

2

1K

Thinh

@thinhphp_vt

about 1 month ago

Come checkout our posters at #ICLR2026 🇧🇷! April 23, 10:30 AM - 1:00 PM 📍Pavilion 3 P3-#1424 SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models (https://t.co/3YkR5SzpRg) I'm also looking for opportunities this summer and would love to connect and chat!

thinhphp_vt's tweet photo. Come checkout our posters at #ICLR2026 🇧🇷!

April 23, 10:30 AM - 1:00 PM
📍Pavilion 3 P3-#1424
SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models (https://t.co/3YkR5SzpRg)

I'm also looking for opportunities this summer and would love to connect and chat! https://t.co/zAXzLUH2Ru

0

19

5

1

2K

thinhphp_vt retweeted

AK

@_akhaliq

about 2 months ago

MERRIN A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments paper: https://t.co/UZpJdGxIxY

_akhaliq's tweet photo. MERRIN

A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

paper: https://t.co/UZpJdGxIxY https://t.co/ZmRa2TcuAu

2

42

12

13

16K

thinhphp_vt retweeted

Han Wang

@HanWang98

about 2 months ago

🚨 Excited to announce MERRIN: a human-annotated, expert-vetted benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments! 🌐🔍 The web is noisy, conflicting, and multimodal. Even strong search agents fail when questions require reasoning across text, images, video & audio — MERRIN is built to diagnose exactly where they break down, via multi-hop reasoning on open-web search without explicit references to specific modality sources.

HanWang98's tweet photo. 🚨 Excited to announce MERRIN: a human-annotated, expert-vetted benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments! 🌐🔍

The web is noisy, conflicting, and multimodal. Even strong search agents fail when questions require reasoning across text, images, video & audio — MERRIN is built to diagnose exactly where they break down, via multi-hop reasoning on open-web search without explicit references to specific modality sources.

1

45

25

6

14K

thinhphp_vt retweeted

Tu Vu

@tuvllms

2 months ago

Very impressive results from Chroma Context-1, which push the Pareto frontier of agentic search on our SealQA (Seal-0) & LongSealQA benchmarks. Check out their techical report below 👇 https://t.co/eYtJZDHIwP

0

7

1

2

1K

Thinh

@thinhphp_vt

4 months ago

🔥Our paper "SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models" has been accepted to #ICLR 2026!! 🎉🎉🎉 Huge thanks to my supervisor @tuvllms and the other co-authors for all your hard work! See you in Brazil ✈️

thinhphp_vt's tweet photo. 🔥Our paper "SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models" has been accepted to #ICLR 2026!! 🎉🎉🎉

Huge thanks to my supervisor @tuvllms and the other co-authors for all your hard work! See you in Brazil ✈️ https://t.co/rhShfgJ9Lr

3

33

5

1

2K

thinhphp_vt retweeted

alex zhang

@a1zhang

5 months ago

Much like the switch in 2025 from language models to reasoning models, we think 2026 will be all about the switch to Recursive Language Models (RLMs). It turns out that models can be far more powerful if you allow them to treat *their own prompts* as an object in an external environment, which they understand and manipulate by writing code that invokes LLMs! Our full paper on RLMs is now available—with much more expansive experiments compared to our initial blogpost from October 2025! https://t.co/x47pIfIkTb

a1zhang's tweet photo. Much like the switch in 2025 from language models to reasoning models, we think 2026 will be all about the switch to Recursive Language Models (RLMs).

It turns out that models can be far more powerful if you allow them to treat *their own prompts* as an object in an external environment, which they understand and manipulate by writing code that invokes LLMs!

Our full paper on RLMs is now available—with much more expansive experiments compared to our initial blogpost from October 2025!

https://t.co/x47pIfIkTb

252

7K

1K

7K

2M

thinhphp_vt retweeted

Kimi.ai @Kimi_Moonshot

7 months ago

🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here. 🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%) 🔹 Executes up to 200 – 300 sequential tool calls without human interference 🔹 Excels in reasoning, agentic search, and coding 🔹 256K context window Built as a thinking agent, K2 Thinking marks our latest efforts in test-time scaling — scaling both thinking tokens and tool-calling turns. K2 Thinking is now live on https://t.co/YutVbwktG0 in chat mode, with full agentic mode coming soon. It is also accessible via API. 🔌 API is live: https://t.co/EOZkbOwCN4 🔗 Tech blog: https://t.co/n7xxaszqzF 🔗 Weights & code: https://t.co/4ukcXB0iP6

Kimi_Moonshot's tweet photo. 🚀 Hello, Kimi K2 Thinking!
The Open-Source Thinking Agent Model is here.

🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%)
🔹 Executes up to 200 – 300 sequential tool calls without human interference
🔹 Excels in reasoning, agentic search, and coding
🔹 256K context window

Built as a thinking agent, K2 Thinking marks our latest efforts in test-time scaling — scaling both thinking tokens and tool-calling turns.

K2 Thinking is now live on https://t.co/YutVbwktG0 in chat mode, with full agentic mode coming soon. It is also accessible via API.

🔌 API is live: https://t.co/EOZkbOwCN4
🔗 Tech blog: https://t.co/n7xxaszqzF
🔗 Weights & code: https://t.co/4ukcXB0iP6

577

10K

1K

4K

5M

thinhphp_vt retweeted

Sentient

@SentientAGI

9 months ago

Announcing ROMA (Recursive Open Meta Agent): our new multi-agent framework that sets SOTA in reasoning + search. Seal-0: 45.6% FRAMES: 81.7% SimpleQA: 93.9% 🧵 Read more about how recursive coordination lets agents tackle complex queries.

SentientAGI's tweet photo. Announcing ROMA (Recursive Open Meta Agent): our new multi-agent framework that sets SOTA in reasoning + search.

Seal-0: 45.6%
FRAMES: 81.7%
SimpleQA: 93.9%

🧵 Read more about how recursive coordination lets agents tackle complex queries. https://t.co/LmEKY6aCcB

764

2K

596

189

506K

thinhphp_vt retweeted

Rohan Paul

@rohanpaul_ai

9 months ago

OpenAI realesed new paper. "Why language models hallucinate" Simple ans - LLMs hallucinate because training and evaluation reward guessing instead of admitting uncertainty. The paper puts this on a statistical footing with simple, test-like incentives that reward confident wrong answers over honest “I don’t know” responses. The fix is to grade differently, give credit for appropriate uncertainty and penalize confident errors more than abstentions, so models stop being optimized for blind guessing. OpenAI is showing that 52% abstention gives substantially fewer wrong answers than 1% abstention, proving that letting a model admit uncertainty reduces hallucinations even if accuracy looks lower. Abstention means the model refuses to answer when it is unsure and simply says something like “I don’t know” instead of making up a guess. Hallucinations drop because most wrong answers come from bad guesses. If the model abstains instead of guessing, it produces fewer false answers. 🧵 Read on 👇

rohanpaul_ai's tweet photo. OpenAI realesed new paper.

"Why language models hallucinate"

Simple ans - LLMs hallucinate because training and evaluation reward guessing instead of admitting uncertainty.

The paper puts this on a statistical footing with simple, test-like incentives that reward confident wrong answers over honest “I don’t know” responses.

The fix is to grade differently, give credit for appropriate uncertainty and penalize confident errors more than abstentions, so models stop being optimized for blind guessing.

OpenAI is showing that 52% abstention gives substantially fewer wrong answers than 1% abstention, proving that letting a model admit uncertainty reduces hallucinations even if accuracy looks lower.

Abstention means the model refuses to answer when it is unsure and simply says something like “I don’t know” instead of making up a guess.

Hallucinations drop because most wrong answers come from bad guesses. If the model abstains instead of guessing, it produces fewer false answers.

🧵 Read on 👇

96

2K

316

2K

372K

thinhphp_vt retweeted

Ken Liu

@kenziyuliu

9 months ago

New paper! We explore a radical paradigm for AI evals: assessing LLMs on *unsolved* questions. Instead of contrived exams where progress ≠ value, we eval LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far:

kenziyuliu's tweet photo. New paper! We explore a radical paradigm for AI evals: assessing LLMs on *unsolved* questions.

Instead of contrived exams where progress ≠ value, we eval LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far: https://t.co/3TzD9ULEtg

15

368

76

193

67K

Thinh

@thinhphp_vt

10 months ago

DeepSeek achieved a strong result on SEAL0, a challenging benchmark for reasoning with conflicting search results. 🎊

DeepSeek

@deepseek_ai

10 months ago

Tools & Agents Upgrades 🧰 📈 Better results on SWE / Terminal-Bench 🔍 Stronger multi-step reasoning for complex search tasks ⚡️ Big gains in thinking efficiency 3/5

deepseek_ai's tweet photo. Tools & Agents Upgrades 🧰

📈 Better results on SWE / Terminal-Bench
🔍 Stronger multi-step reasoning for complex search tasks
⚡️ Big gains in thinking efficiency

3/5 https://t.co/qIknYSxfGd

11

822

50

63

194K

0

5

1

0

267

thinhphp_vt retweeted

Tu Vu

@tuvllms

10 months ago

Excited to share that our paper on efficient model development has been accepted to #EMNLP2025 Main conference @emnlpmeeting. Congratulations to my students @linusdd44804 and @Sub_RBala on their first PhD paper! 🎉

0

51

9

10

6K

thinhphp_vt retweeted

basvanopheusden @basvanopheusden

10 months ago

A few weeks ago, I started a new job at @OpenAI. I wrote a document about my interview process and recommendations for anyone on the job market for AI research positions. I hope it's helpful! https://t.co/0I6f6UrAqD

63

4K

340

6K

336K

thinhphp_vt retweeted

Sheryl Hsu

@SherylHsu02

10 months ago

1/n I’m thrilled to share that our @OpenAI reasoning system scored high enough to achieve gold 🥇🥇 in one of the world’s top programming competitions - the 2025 International Olympiad in Informatics (IOI) - placing first among AI participants! 👨‍💻👨‍💻

SherylHsu02's tweet photo. 1/n I’m thrilled to share that our @OpenAI reasoning system scored high enough to achieve gold 🥇🥇 in one of the world’s top programming competitions - the 2025 International Olympiad in Informatics (IOI) - placing first among AI participants! 👨‍💻👨‍💻 https://t.co/k3RQxIzXPH

195

3K

285

437

2M

thinhphp_vt retweeted

Intelligent Internet @ii_posts

10 months ago

Most search models need the cloud. II-Search-4B doesn’t. 4B model tuned for reasoning with search tools, built for local use. Performance of models 10x its size. Search that is small, smart, and open.

21

639

102

507

500K

Thinh

@thinhphp_vt

10 months ago

🥳Congrats @ii_posts for an impressive result on SEAL-0, a challenging benchmark for search-augmented LLMs. 🤩Looking forward to the evaluation standards it shapes in this field. 📚Read more: https://t.co/3YkR5SzpRg

Intelligent Internet @ii_posts

10 months ago

Overall Results:

1

21

2

5

4K

0

6

0

186

thinhphp_vt retweeted

Peter H. Diamandis, MD

@PeterDiamandis

10 months ago

. @EMostaque came back on the show to chat about: --how we can't compete against AI agents --his solution for a POSITIVE AI world --Why UBI won't work but UBAI might.. --we need to be focused on incentivizing the right outcomes -- Nations need sovereign AI stacks or they'll be left behind by the mega-models

24

234

57

69

28K

thinhphp_vt retweeted

Jasper Dekoninck @j_dekoninck

11 months ago

We just released the evaluation of LLMs on the 2025 IMO on MathArena! Gemini scores best, but is still unlikely to achieve the bronze medal with its 31% score (13/42). 🧵(1/4)

j_dekoninck's tweet photo. We just released the evaluation of LLMs on the 2025 IMO on MathArena! Gemini scores best, but is still unlikely to achieve the bronze medal with its 31% score (13/42). 🧵(1/4) https://t.co/29sQuXZlxM

13

217

39

59

37K

Thinh

@thinhphp_vt

Last Seen Users on Sotwe

Trends for you

Most Popular Users