Shrey Pandit

about 2 months ago

(4/7) Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math: https://t.co/Sj1GfmR5MP Hard2Verify is a human-annotated benchmark (500+ hours of annotation) for assessing step-level verifiers on frontier math. Evaluating 29 models, it reveals that most open-source verifiers still lag behind closed-source systems on challenging, open-ended mathematical reasoning. Authors: Shrey Pandit @ShreyPandit2001, Austin Xu @austinsxu, Xuan-Phi Nguyen @xuanphinguyen, Yifei Ming @ming5_alvin, Caiming Xiong @CaimingXiong, Shafiq Joty @JotyShafiq Accepted to #ACL2026

SFResearch's tweet photo. (4/7) Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math: https://t.co/Sj1GfmR5MP

Hard2Verify is a human-annotated benchmark (500+ hours of annotation) for assessing step-level verifiers on frontier math. Evaluating 29 models, it reveals that most open-source verifiers still lag behind closed-source systems on challenging, open-ended mathematical reasoning.

Authors: Shrey Pandit @ShreyPandit2001, Austin Xu @austinsxu, Xuan-Phi Nguyen @xuanphinguyen, Yifei Ming @ming5_alvin, Caiming Xiong @CaimingXiong, Shafiq Joty @JotyShafiq

Accepted to #ACL2026

131

about 2 months ago

Glad to share our paper “Hard2Verify” has been accepted at #ACL2026 - Main Conference. #LLM can now solve IMO-level math, but can they reliably verify each step of a proof ? Turns out, not really. We built Hard2Verify: a step-level verification benchmark for open-ended, frontier-level math problems sourced from competitions like IMO and Putnam. The dataset comprises 1,860 steps across 200 model responses, fully annotated by PhD-level math experts over 500+ hours of human effort with three rounds of independent agreement checks. We benchmark 29 verifiers, and the takeaway is stark, models that score 60%+ on existing benchmarks like ProcessBench barely crack 20% on ours. If you're working on reasoning, reward models, or verification, the dataset is publicly available on HuggingFace and we think it's a useful stress test for where things actually stand. 📄 Paper: https://t.co/DGgyZ3ueYw 🤗 Dataset: https://t.co/HmxGx6rmYx Grateful to work alongside an amazing team - @austinsxu @xuanphinguyen @ming5_alvin @CaimingXiong @JotyShafiq at @SFResearch

ShreyPandit2001's tweet photo. Glad to share our paper “Hard2Verify” has been accepted at #ACL2026 - Main Conference. #LLM can now solve IMO-level math, but can they reliably verify each step of a proof ? Turns out, not really. We built Hard2Verify: a step-level verification benchmark for open-ended, frontier-level math problems sourced from competitions like IMO and Putnam.

The dataset comprises 1,860 steps across 200 model responses, fully annotated by PhD-level math experts over 500+ hours of human effort with three rounds of independent agreement checks. We benchmark 29 verifiers, and the takeaway is stark, models that score 60%+ on existing benchmarks like ProcessBench barely crack 20% on ours. If you're working on reasoning, reward models, or verification, the dataset is publicly available on HuggingFace and we think it's a useful stress test for where things actually stand.
📄 Paper: https://t.co/DGgyZ3ueYw
🤗 Dataset: https://t.co/HmxGx6rmYx

Grateful to work alongside an amazing team - @austinsxu @xuanphinguyen @ming5_alvin @CaimingXiong @JotyShafiq at @SFResearch

626

ShreyPandit2001 retweeted

AI @maximor_ai || Prev: @getpostman @Microsoft || Alumni @bitspilaniindia || Opinions are my own

3 months ago

Poisoning the Well: Search Agents Get Tricked by Maliciously Hosted Content https://t.co/X10GzjuBjm AI agents that rely on web search are vulnerable to a deceptively simple attack: adversaries publish fake but authoritative-sounding content designed to be retrieved during search. Think "AI Slop" for agents. Our research shows that when agents encounter planted content, they stop critically evaluating what they find and start accepting it at face value. Key findings: → ~80% of queries returned the attacker's chosen answer when poisoned content was manually injected into search results → Agents shift from information-seeking mode to verification mode, performing fewer searches and reporting higher confidence → Even in realistic settings with 100K+ clean documents and just a handful of poisoned ones, nearly 1 in 4 queries were compromised → Agent self-reported confidence actually increases in the presence of adversarial content, making poisoned answers harder to detect At @Salesforce, trust is a core value. These findings reinforce why we invest in defense mechanisms like our trust layer to help agents navigate hostile information landscapes reliably. Authors: Shafiq Joty @jotyshafiq, Xuan Phi Nguyen @xuanphinguyen, Shrey Pandit @ShreyPandit2001, Yifei Ming @ming5_alvin #FutureOfAI #EnterpriseAI #AIAgents

930

ShreyPandit2001 retweeted

Shafiq Joty

@JotyShafiq

3 months ago

A single “poisoned” webpage can turn LLM search agents from investigators into lazy confirmers—showing how easily adversaries can manipulate agentic research. Check our interesting findings here.

243

Who to follow

Rajaswa Patil

@RajaswaPatil

Abheesht Sharma

@penstrokes75

ML, LLMs, RL, RecSys @Google | Cricket fanatic | Bursts of random poetry | Prev. Applied Science @AmazonScience

Ajay Subramanian

@ajaysub110

Robot learning @BostonDynamics

ShreyPandit2001 retweeted

4 months ago

(1/5) MoE scaling loves Expert Parallelism (EP)… until routing gets imbalanced. Even “well-trained” MoEs route a big chunk of tokens to a small subset of experts (specialization), so EP can overload a few GPUs and waste the rest—especially in post-training/inference where explicit load-balancing is often off-limits. We introduce Least-Loaded Expert Parallelism (LLEP): dynamically spill excess tokens *and* the corresponding expert weights to the least-loaded devices, minimizing collective latency under memory constraints. Up to 5× faster + 4× lower peak mem; ~1.9× faster for gpt-oss-120b.

SFResearch's tweet photo. (1/5) MoE scaling loves Expert Parallelism (EP)… until routing gets imbalanced.

Even “well-trained” MoEs route a big chunk of tokens to a small subset of experts (specialization), so EP can overload a few GPUs and waste the rest—especially in post-training/inference where explicit load-balancing is often off-limits.

We introduce Least-Loaded Expert Parallelism (LLEP): dynamically spill excess tokens *and* the corresponding expert weights to the least-loaded devices, minimizing collective latency under memory constraints. Up to 5× faster + 4× lower peak mem; ~1.9× faster for gpt-oss-120b.

ShreyPandit2001 retweeted

4 months ago

MoE scaling loves Expert Parallelism (EP)… until routing gets imbalanced. Even “well-trained” MoEs route a big chunk of tokens to a small subset of experts (specialization), so EP can overload a few GPUs and waste the rest—especially in post-training/inference where explicit load-balancing is often off-limits. We introduce Least-Loaded Expert Parallelism (LLEP): dynamically spill excess tokens *and* the corresponding expert weights to the least-loaded devices, minimizing collective latency under memory constraints. Up to 5× faster + 4× lower peak mem; ~1.9× faster for gpt-oss-120b. @SFResearch

CaimingXiong's tweet photo. MoE scaling loves Expert Parallelism (EP)… until routing gets imbalanced.
Even “well-trained” MoEs route a big chunk of tokens to a small subset of experts (specialization), so EP can overload a few GPUs and waste the rest—especially in post-training/inference where explicit load-balancing is often off-limits.

We introduce Least-Loaded Expert Parallelism (LLEP): dynamically spill excess tokens *and* the corresponding expert weights to the least-loaded devices, minimizing collective latency under memory constraints. Up to 5× faster + 4× lower peak mem; ~1.9× faster for gpt-oss-120b. @SFResearch

133

10K

ShreyPandit2001 retweeted

4 months ago

Mixture-of-Experts models are typically trained with explicit load-balancing constraints—yet in practice, even strong MoEs exhibit highly imbalanced expert routing. Interestingly, this imbalance is often desirable, reflecting expert specialization across domains. The problem? Expert Parallelism (EP) quietly assumes balanced routing; under extreme imbalance—especially during post-training or inference—this assumption breaks, leading to severe compute and memory bottlenecks. We propose Least-Loaded Expert Parallelism (LLEP), a new EP algorithm that dynamically redistributes excess tokens and the corresponding expert weights from overloaded devices to underutilized ones—minimizing collective latency while respecting memory constraints. 📈 Up to 5× speedup 📉 Up to 4× peak memory reduction ⚡ ~1.9× faster inference for gpt-oss-120B Paper: https://t.co/Sda0l8dH0C Code: https://t.co/rGBLkDFmwg Demo: https://t.co/QcyVI3THQd

SFResearch's tweet photo. Mixture-of-Experts models are typically trained with explicit load-balancing constraints—yet in practice, even strong MoEs exhibit highly imbalanced expert routing. Interestingly, this imbalance is often desirable, reflecting expert specialization across domains. The problem? Expert Parallelism (EP) quietly assumes balanced routing; under extreme imbalance—especially during post-training or inference—this assumption breaks, leading to severe compute and memory bottlenecks. We propose Least-Loaded Expert Parallelism (LLEP), a new EP algorithm that dynamically redistributes excess tokens and the corresponding expert weights from overloaded devices to underutilized ones—minimizing collective latency while respecting memory constraints.

📈 Up to 5× speedup
📉 Up to 4× peak memory reduction
⚡ ~1.9× faster inference for gpt-oss-120B

Paper: https://t.co/Sda0l8dH0C
Code: https://t.co/rGBLkDFmwg
Demo: https://t.co/QcyVI3THQd

955

ShreyPandit2001 retweeted

Turing @turingcom

7 months ago

Turing partnered with @SFResearch to evaluate Olympiad-grade mathematical reasoning in frontier models like GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4. Over 200 long-form math responses were annotated step-by-step using a zero-tolerance, carry-forward logic rubric. Each step received a binary correctness label and written justification, executed by PhDs and domain experts. 500+ hours of expert annotation produced a benchmark-grade dataset with 100% compliance to Salesforce’s strict evaluation standards. The results: higher fidelity in reasoning verification, stronger model alignment, and a clearer view of how LLMs handle symbolic problem solving at scale. From RL Gyms to reasoning datasets, Turing accelerates post-training research with human precision and enterprise-grade rigor. Case Study Below.

ShreyPandit2001 retweeted

7 months ago

Hard2Verify: A Step-Level Verification Benchmark for Frontier Math 🎯 LLMs now achieve gold-level performance on IMO problems 🏅, but training these reasoners requires verifiers that can catch subtle step-level mistakes. We introduce Hard2Verify—500+ hours of human annotation assessing verifiers on frontier LLM responses to challenging, open-ended math problems 🧮 Key findings: Open-source verifiers struggle to identify errors, often marking nearly all steps as correct. Weaker models achieve near-0 True Negative Rate while True Positive Rate approaches 1 📊 📄 Paper: https://t.co/Sj1GfmQxXh 💻 Code: https://t.co/A2nWJAPsJe 📊 Dataset: https://t.co/DqMIGggXX7 Work by @ShreyPandit2001, @austinmxu, Xuan-Phi Nguyen, @yifeiming1, @caimingxiong, @JotyShafiq #FutureOfAI #EnterpriseAI #MachineLearning #DeepLearning #MathematicalReasoning #AIVerification

SFResearch's tweet photo. Hard2Verify: A Step-Level Verification Benchmark for Frontier Math 🎯

LLMs now achieve gold-level performance on IMO problems 🏅, but training these reasoners requires verifiers that can catch subtle step-level mistakes. We introduce Hard2Verify—500+ hours of human annotation assessing verifiers on frontier LLM responses to challenging, open-ended math problems 🧮

Key findings: Open-source verifiers struggle to identify errors, often marking nearly all steps as correct. Weaker models achieve near-0 True Negative Rate while True Positive Rate approaches 1 📊

📄 Paper: https://t.co/Sj1GfmQxXh
💻 Code: https://t.co/A2nWJAPsJe
📊 Dataset: https://t.co/DqMIGggXX7

Work by @ShreyPandit2001, @austinmxu, Xuan-Phi Nguyen, @yifeiming1, @caimingxiong, @JotyShafiq

#FutureOfAI #EnterpriseAI #MachineLearning #DeepLearning #MathematicalReasoning #AIVerification

ShreyPandit2001 retweeted

8 months ago

One of the key challenges for building web-based “deep research” agents is to construct sufficiently difficult long-horizon agentic data. At @SFResearch, We introduce ProgSearch, a controlled data synthesis pipeline that builds tasks of increasing complexity until a frontier agent fails — yielding more diverse, effective training data that pressures agents to reason on the open web. 🧠🌐 Paper: https://t.co/3Uyr2RABBC

CaimingXiong's tweet photo. One of the key challenges for building web-based “deep research” agents is to construct sufficiently difficult long-horizon agentic data.
At @SFResearch, We introduce ProgSearch, a controlled data synthesis pipeline that builds tasks of increasing complexity until a frontier agent fails — yielding more diverse, effective training data that pressures agents to reason on the open web. 🧠🌐

Paper: https://t.co/3Uyr2RABBC

105

20K

ShreyPandit2001 retweeted

8 months ago

📊Excited to introduce our Hard2Verify from @SFResearch , the benchmark for measuring the ability of verifiers to provide step-level correctness labels for model-generated responses to largely open-ended, frontier-level math problems📚 The current paradigm of RLVR requires questions with “easy to verify” answers. How can we move from answering algebra questions with concrete final answers 🎯 to writing complex proofs 📝 where correctness depends on all steps being correct? We need step-level verifiers! Hard2Verify is built to assess step-level verifiers at the frontiers of math reasoning. This benchmark is the product of 500+ hours of human expert effort. Thanks for the partnership, @turingcom! Paper: https://t.co/BE7DTDRmqM Evaluation code: https://t.co/7BuyppeEch Dataset: https://t.co/m7xzg3xZN0

CaimingXiong's tweet photo. 📊Excited to introduce our Hard2Verify from @SFResearch , the benchmark for measuring the ability of verifiers to provide step-level correctness labels for model-generated responses to largely open-ended, frontier-level math problems📚

The current paradigm of RLVR requires questions with “easy to verify” answers. How can we move from answering algebra questions with concrete final answers 🎯 to writing complex proofs 📝 where correctness depends on all steps being correct? We need step-level verifiers!

Hard2Verify is built to assess step-level verifiers at the frontiers of math reasoning.

This benchmark is the product of 500+ hours of human expert effort. Thanks for the partnership, @turingcom!

Paper: https://t.co/BE7DTDRmqM
Evaluation code: https://t.co/7BuyppeEch
Dataset: https://t.co/m7xzg3xZN0

133

10K

ShreyPandit2001 retweeted

elvis

@omarsar0

9 months ago

RL done right is no joke! The most interesting AI paper I read this week. It trains a top minimal single-agent model for deep research. Great example of simple RL-optimized single agents beating complex multi-agent scaffolds. Now let's break it down:

omarsar0's tweet photo. RL done right is no joke!

The most interesting AI paper I read this week.

It trains a top minimal single-agent model for deep research.

Great example of simple RL-optimized single agents beating complex multi-agent scaffolds.

Now let's break it down: https://t.co/I1SNcaQDNu

673

797

171K

ShreyPandit2001 retweeted

9 months ago

Meet SFR-DeepResearch (SFR-DR) 🤖: our RL-trained autonomous agents that can reason, search, and code their way through deep research tasks. 🚀SFR-DR-20B achieves 28.7% on Humanity's Last Exam (text-only) using only web search 🔍, browsing 🌐, and Python interpreter 🐍, surpassing DeepResearch with OpenAI o3 and Kimi Researcher. 🤖SFR-DR agents are trained to operate independently, without pre-defined multi-agent workflows. They autonomously plan, reason, and propose and take actions as defined by their tools. 🔄SFR-DR agents are trained with end-to-end RL. Starting from reasoning optimized models, our RL pipeline carefully preserves reasoning abilities while training models to become more capable research agents. 📝SFR-DR agents are also trained to manage their own memory by summarizing previous results when context becomes limited. This enables a virtually unlimited context window, enabling long-horizon tasks Paper: https://t.co/32idhdknhh #AIAgents #ReinforcementLearning #DeepResearch

CaimingXiong's tweet photo. Meet SFR-DeepResearch (SFR-DR) 🤖: our RL-trained autonomous agents that can reason, search, and code their way through deep research tasks.

🚀SFR-DR-20B achieves 28.7% on Humanity's Last Exam (text-only) using only web search 🔍, browsing 🌐, and Python interpreter 🐍, surpassing DeepResearch with OpenAI o3 and Kimi Researcher.

🤖SFR-DR agents are trained to operate independently, without pre-defined multi-agent workflows. They autonomously plan, reason, and propose and take actions as defined by their tools.

🔄SFR-DR agents are trained with end-to-end RL. Starting from reasoning optimized models, our RL pipeline carefully preserves reasoning abilities while training models to become more capable research agents.

📝SFR-DR agents are also trained to manage their own memory by summarizing previous results when context becomes limited. This enables a virtually unlimited context window, enabling long-horizon tasks

Paper: https://t.co/32idhdknhh
#AIAgents #ReinforcementLearning #DeepResearch

938

144

693

310K

ShreyPandit2001 retweeted

9 months ago

@emnlpmeeting / #EMNLP2025 Accepted Paper: MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models 🏥 📝 Paper: https://t.co/P7YToATxNs 🔗 Dataset & Code: https://t.co/OXgCMUuAOa This work introduces MedHallu, the first comprehensive benchmark specifically designed for detecting medical hallucinations in LLMs - a critical safety challenge where models generate plausible but factually incorrect medical information that could impact patient safety. Key contributions: ➡️ 10,000 high-quality medical question-answer pairs with systematically generated hallucinations across 4 categories ➡️ Controlled generation pipeline with multi-LLM filtering and difficulty stratification (easy, medium, hard) ➡️ Semantic analysis revealing harder-to-detect hallucinations are closer to ground truth in vector space ➡️ Comprehensive evaluation showing even state-of-the-art models like GPT-4o achieve only 62.5% F1 on ""hard"" hallucinations Key findings: 🔍 General-purpose LLMs outperform medical fine-tuned models in hallucination detection 📚 Providing domain knowledge improves detection by up to 38% relative improvement 🤔 Adding ""not sure"" option significantly boosts precision for critical medical applications 🎯 Incomplete information is the hardest hallucination type to detect (54% accuracy) The benchmark reveals concerning gaps in current LLMs' ability to detect medical misinformation, with implications for safe deployment in healthcare settings. 👥 Authors: Shrey Pandit @ShreyPandit2001, Jiawei Xu @jiaweixu0230, Junyuan Hong @hjy836, Zhangyang Wang, Tianlong Chen @TianlongChen4, Kaidi Xu @KaidiXu1, Ying Ding #FutureOfAI #EnterpriseAI #MedicalAI #MachineLearning #NLP #HealthcareAI #AIEthics #Hallucination #PatientSafety

SFResearch's tweet photo. @emnlpmeeting / #EMNLP2025 Accepted Paper: MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models 🏥

📝 Paper: https://t.co/P7YToATxNs
🔗 Dataset & Code: https://t.co/OXgCMUuAOa

This work introduces MedHallu, the first comprehensive benchmark specifically designed for detecting medical hallucinations in LLMs - a critical safety challenge where models generate plausible but factually incorrect medical information that could impact patient safety.

Key contributions:
➡️ 10,000 high-quality medical question-answer pairs with systematically generated hallucinations across 4 categories
➡️ Controlled generation pipeline with multi-LLM filtering and difficulty stratification (easy, medium, hard)
➡️ Semantic analysis revealing harder-to-detect hallucinations are closer to ground truth in vector space
➡️ Comprehensive evaluation showing even state-of-the-art models like GPT-4o achieve only 62.5% F1 on ""hard"" hallucinations

Key findings:
🔍 General-purpose LLMs outperform medical fine-tuned models in hallucination detection
📚 Providing domain knowledge improves detection by up to 38% relative improvement
🤔 Adding ""not sure"" option significantly boosts precision for critical medical applications
🎯 Incomplete information is the hardest hallucination type to detect (54% accuracy)

The benchmark reveals concerning gaps in current LLMs' ability to detect medical misinformation, with implications for safe deployment in healthcare settings.

👥 Authors: Shrey Pandit @ShreyPandit2001, Jiawei Xu @jiaweixu0230, Junyuan Hong @hjy836, Zhangyang Wang, Tianlong Chen @TianlongChen4, Kaidi Xu @KaidiXu1, Ying Ding

#FutureOfAI #EnterpriseAI #MedicalAI #MachineLearning #NLP #HealthcareAI #AIEthics #Hallucination #PatientSafety

265

ShreyPandit2001 retweeted

10 months ago

🌟 Excited to present our work at Empirical Methods in Natural Language Processing @emnlpmeeting - a leading conference in NLP and AI research! 📄 Our accepted papers: Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization 👥Authors: Chuyuan Li @ChuyuanLi, Austin Xu @austinsxu, Shafiq Joty @JotyShafiq, and Giuseppe Carenini @careninigiusepp 📝Paper: https://t.co/bAI3XZVmCH Demystifying Domain-adaptive Post-training for Financial LLMs 👥Authors: Zixuan Ke @KeZixuan, Yifei Ming @ming5_alvin, Xuan-Phi Nguyen, Caiming Xiong @CaimingXiong, Shafiq Joty @JotyShafiq 📝Paper: https://t.co/8g1LWKTB3o CEMTM: Contextual Embedding-based Multimodal Topic Modeling 👥Authors: Amirhossein Abaskohi @AmirAbaskohi, Raymond Li, Chuyuan Li @ChuyuanLi, Shafiq Joty @JotyShafiq, and Giuseppe Carenini @careninigiusepp 📝Paper: https://t.co/9z6WeOCGvl From Charts to Fair Narratives: Uncovering and Mitigating Geo-Economic Biases in Chart-to-Text 👥Authors: Ridwan Mahbub @mahbub_ridwan, Mohammed Saidul Islam, Mir Tafseer Nayeem @mtnayeem, Md Tahmid Rahman Laskar, Mizanur Rahman, Shafiq Joty @JotyShafiq, Enamul Hoque @Enamul_Hoque 📝Paper: https://t.co/SqrOVicp1g Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text 👥Authors: Mizanur Rahman, Md Tahmid Rahman Laskar, Shafiq Joty @JotyShafiq, Enamul Hoque @Enamul_Hoque 📝Paper: https://t.co/sdveg5WciG Direct Judgement Preference Optimization 👥Authors: Peifeng Wang @PeifengWang3, Austin Xu @austinsxu, Yilun Zhou @YilunZhou, Caiming Xiong @CaimingXiong, Shafiq Joty @JotyShafiq 📝Paper: https://t.co/CR3uMCx83B MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models 👥Authors: Shrey Pandit @ShreyPandit2001, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding 📝Paper: https://t.co/UHSsPld72o ActionStudio: A Lightweight Framework for Data and Training of Large Action Models 👥Authors: Jianguo Zhang @JianguoZhang3, Thai Hoang, Ming Zhu @ming_zhu0527, Zuxin Liu @LiuZuxin, Shiyu Wang @shiyu04490786, Tulika Awalgaonkar @tulika614, Akshara Prabhakar @aksh_555, Haolin Chen @HaolinChen11, Weiran Yao @iscreamnearby, Zhiwei Liu, Juntao Tan, Juan Carlos Niebles @jcniebles, Shelby Heinecke @shelbyh_ai, Huan Wang @huan__wang, Silvio Savarese @silviocinguetta, Caiming Xiong @CaimingXiong 📝Paper: https://t.co/QxwroZ4dTl LATTE: Learning to Think with Vision Specialists 👥Authors: Zixian Ma, Jianguo Zhang @JianguoZhang3, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles @jcniebles, Shelby Heinecke @shelbyh_ai, Huan Wang @huan__wang, Caiming Xiong @CaimingXiong, Ranjay Krishna @RanjayKrishna, Silvio Savarese @silviocinguetta 📝Paper: https://t.co/HySTJwWYXr Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D 👥Authors: Artemis Panagopoulou @artemispng, Le Xue @Le_Xue01, Honglu Zhou @zhou_honglu 📝Paper: https://t.co/uyeQNtlnNH #EMNLP2025 #FutureOfAI #EnterpriseAI #LanguageModels #NLP

SFResearch's tweet photo. 🌟 Excited to present our work at Empirical Methods in Natural Language Processing @emnlpmeeting - a leading conference in NLP and AI research!

📄 Our accepted papers:

Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization
👥Authors: Chuyuan Li @ChuyuanLi, Austin Xu @austinsxu, Shafiq Joty @JotyShafiq, and Giuseppe Carenini @careninigiusepp
📝Paper: https://t.co/bAI3XZVmCH

Demystifying Domain-adaptive Post-training for Financial LLMs
👥Authors: Zixuan Ke @KeZixuan, Yifei Ming @ming5_alvin, Xuan-Phi Nguyen, Caiming Xiong @CaimingXiong, Shafiq Joty @JotyShafiq
📝Paper: https://t.co/8g1LWKTB3o

CEMTM: Contextual Embedding-based Multimodal Topic Modeling
👥Authors: Amirhossein Abaskohi @AmirAbaskohi, Raymond Li, Chuyuan Li @ChuyuanLi, Shafiq Joty @JotyShafiq, and Giuseppe Carenini @careninigiusepp
📝Paper: https://t.co/9z6WeOCGvl

From Charts to Fair Narratives: Uncovering and Mitigating Geo-Economic Biases in Chart-to-Text
👥Authors: Ridwan Mahbub @mahbub_ridwan, Mohammed Saidul Islam, Mir Tafseer Nayeem @mtnayeem, Md Tahmid Rahman Laskar, Mizanur Rahman, Shafiq Joty @JotyShafiq, Enamul Hoque @Enamul_Hoque
📝Paper: https://t.co/SqrOVicp1g

Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text
👥Authors: Mizanur Rahman, Md Tahmid Rahman Laskar, Shafiq Joty @JotyShafiq, Enamul Hoque @Enamul_Hoque
📝Paper: https://t.co/sdveg5WciG

Direct Judgement Preference Optimization
👥Authors: Peifeng Wang @PeifengWang3, Austin Xu @austinsxu, Yilun Zhou @YilunZhou, Caiming Xiong @CaimingXiong, Shafiq Joty @JotyShafiq
📝Paper: https://t.co/CR3uMCx83B

MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
👥Authors: Shrey Pandit @ShreyPandit2001, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding
📝Paper: https://t.co/UHSsPld72o

ActionStudio: A Lightweight Framework for Data and Training of Large Action Models
👥Authors: Jianguo Zhang @JianguoZhang3, Thai Hoang, Ming Zhu @ming_zhu0527, Zuxin Liu @LiuZuxin, Shiyu Wang @shiyu04490786, Tulika Awalgaonkar @tulika614, Akshara Prabhakar @aksh_555, Haolin Chen @HaolinChen11, Weiran Yao @iscreamnearby, Zhiwei Liu, Juntao Tan, Juan Carlos Niebles @jcniebles, Shelby Heinecke @shelbyh_ai, Huan Wang @huan__wang, Silvio Savarese @silviocinguetta, Caiming Xiong @CaimingXiong
📝Paper: https://t.co/QxwroZ4dTl

LATTE: Learning to Think with Vision Specialists
👥Authors: Zixian Ma, Jianguo Zhang @JianguoZhang3, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles @jcniebles, Shelby Heinecke @shelbyh_ai, Huan Wang @huan__wang, Caiming Xiong @CaimingXiong, Ranjay Krishna @RanjayKrishna, Silvio Savarese @silviocinguetta
📝Paper: https://t.co/HySTJwWYXr

Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D
👥Authors: Artemis Panagopoulou @artemispng, Le Xue @Le_Xue01, Honglu Zhou @zhou_honglu
📝Paper: https://t.co/uyeQNtlnNH

#EMNLP2025 #FutureOfAI #EnterpriseAI #LanguageModels #NLP

10 months ago

Excited to share that our paper, “MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models,” was accepted to EMNLP 2025 (Main)! Huge thanks to my co-authors @jiaweixu0230, @hjy836, Atlas Wang, @TianlongChen4, @KaidiXu1, and Ying Ding!

over 1 year ago

🚨 New Research Alert! 🚨 Happy to share my latest work - MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models 📄Arxiv - https://t.co/WlyhOroZk3 🤗HuggingFace Dataset - https://t.co/me6zGKCcId 🌐Website - https://t.co/tBg3s4S7Q8

ShreyPandit2001's tweet photo. 🚨 New Research Alert! 🚨
Happy to share my latest work -
MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models

📄Arxiv - https://t.co/WlyhOroZk3
🤗HuggingFace Dataset - https://t.co/me6zGKCcId
🌐Website - https://t.co/tBg3s4S7Q8 https://t.co/Sals6VbnoM

406

over 1 year ago

🧑‍🔬 Research Team: @ShreyPandit2001 , @jiaweixu0230 , @hjy836 , Zhangyang Wang, @TianlongChen4 , @KaidiXu1 , Ying Ding @UTAustin @UNC @DrexelUniv #NLProc #Hallucination #MedicalAI #AI4Health #LLM

115

over 1 year ago