๐ Excited to share our new preprint: Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs.
To study research-level mathematical reasoning, we introduce Soohak, a benchmark of 439 research-level math problems created from scratch by 64 mathematicians, including 38 faculty members.
ResearchMath-14K: 14K open research-level math problems
Curated by agents from academic sources, with 220K reasoning traces. Fine-tuning filtered attempts improves Qwen3 by 9.2 points. Newer models also make 5x more fake references.
๐ฐ๐ทDespite rapid progress in AI agent research, Korean agentic benchmarks remain largely absent!
To narrow this gap, we release K-BrowseComp, a benchmark that requires searching across Korean websites and Korean-language content.
https://t.co/kuHby48uif
Huge thanks to @minseokwak103 , @dongseok1220 , @gson_AI , @alan_ritter , and @jhkim940331 for their support in writing this work!
We present LaRA (Layer-wise Representation Analysis), a framework for detecting data contamination in RL post-trained LLMs by examining how internal representations change across layers rather than relying on output-level signals such as likelihood or entropy.
https://t.co/Dcu29huKEa
(1/N)
The frontier of mathematics is defined by problems whose solutions are still unknown.
But where do we get prompts at that frontier?
The math literature already contains thousands of open problems ๐
In our recent work, we turn them into 14,056 research-level math problems for LLMs.
Link to dataset:
https://t.co/DVFYUGPXcM
The surprising part: imperfect attempts still help โจ
Since over 70% of the problems are still open, most traces are unlikely to be correct.
But after filtering out low-quality traces, fine-tuning Qwen3 models improves over the base models by 9.2 points on average.
Seems like generating cad with agents are attracting attention. But how do you know they are done right?
In our new work we use finite element analysis as verification to test cad-quality!
Creating CAD with LLMs (or agents) is cool but how do you guarantee these outputs are physically sound?
In my recent work with @seonggyung33214 we adopted finite-element analysis to confirm that the Agent-created outputs are sound.
Result? Even the best models fail to create CAD outputs that meet real-world engineering standards. Check out our paper here > https://t.co/0oj5XWedPI
@mythkernel@Zai_org@deepseek_ai hi! we are out of funds/credits at the moment and is looking for ways to add new models.
we definitely want to add periodic updates to the paper with new models
๐ Excited to share our new preprint: Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs.
To study research-level mathematical reasoning, we introduce Soohak, a benchmark of 439 research-level math problems created from scratch by 64 mathematicians, including 38 faculty members.
We are conducting an AI-assisted review of FrontierMath: Tiers 1-4. This has flagged fatal errors in about a third of problems, and we believe most of these flags to be valid. We will release updated scores on a corrected dataset after completing a thorough human review.
Big thanks to @linguist_cat, @seungonekim, @AkariAsai, @wellecks, @gneubig, and @Youngjae4Yu for their help writing the paper, and to many others who contributed problems, feedback, evaluations, and support throughout the project.
This work would not have been possible without the broader community of mathematicians and researchers who helped shape Soohak.
If you want your model or agentic harness evaluated on Soohak, please feel free to reach out to me on X or by email at [email protected].
We are also looking for support to run a public leaderboard. If you would like to help with API credits, GPU compute, funding, or infrastructure, weโd be very grateful to chat.
Our goal is to make research-level math evaluation more transparent, rigorous, and useful for the community.