Guijin Son @gson_AI - Twitter Profile

Pinned Tweet

about 2 months ago

🚀 Excited to share our new preprint: Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs. To study research-level mathematical reasoning, we introduce Soohak, a benchmark of 439 research-level math problems created from scratch by 64 mathematicians, including 38 faculty members.

4

98

19

47

18K

Guijin Son

@gson_AI

8 days ago

@Aflah02101 this is nonsense

1

3

0

566

Guijin Son

@gson_AI

14 days ago

@j_dekoninck thanks for the mention! we are planning some follow-up work; are you interested in a collab by any chance?

1

0

36

gson_AI retweeted

DailyPapers

@HuggingPapers

about 1 month ago

ResearchMath-14K: 14K open research-level math problems, curated by agents from academic sources, with 220K reasoning traces. Fine-tuning on filtered attempts improves Qwen3 by 9.2 points. Newer models also make 5x more fake references. https://t.co/mm3zpJfAuE

1

38

5

14

2K

gson_AI retweeted

Seungone Kim

@seungonekim

about 1 month ago

🇰🇷 Despite rapid progress in AI agent research, Korean agentic benchmarks remain largely absent! To narrow this gap, we release K-BrowseComp, a benchmark that requires searching across Korean websites and Korean-language content. https://t.co/kuHby48uif https://t.co/rCCEqUmJzN

5

109

27

32

20K

gson_AI retweeted

minju gwak ✈️ acl2026

@MGwak96587

about 1 month ago

Huge thanks to @minseokwak103, @dongseok1220, @gson_AI, @alan_ritter, and @jhkim940331 for their support in writing this work! We present LaRA (Layer-wise Representation Analysis), a framework for detecting data contamination in RL post-trained LLMs by examining how internal representations change across layers rather than relying on output-level signals such as likelihood or entropy. https://t.co/Dcu29huKEa

0

13

3

2

925

Guijin Son

@gson_AI

about 1 month ago

Thanks to @devpotatopotato, @cartinoe__5930, @MGwak96587, and @Youngjae4Yu for their help throughout the work. Link to paper: https://t.co/dH3sPTjVGG

0

3

2

440

Guijin Son

@gson_AI

about 1 month ago

(1/N) The frontier of mathematics is defined by problems whose solutions are still unknown. But where do we get prompts at that frontier? The math literature already contains thousands of open problems 📚 In our recent work, we turn them into 14,056 research-level math problems for LLMs. Link to dataset: https://t.co/DVFYUGPXcM

1

40

10

39

5K

Guijin Son

@gson_AI

about 1 month ago

The surprising part: imperfect attempts still help ✨ Since over 70% of the problems are still open, most traces are unlikely to be correct. But after filtering out low-quality traces, fine-tuning Qwen3 models improves over the base models by 9.2 points on average. https://t.co/QM1YrRnt0m

1

0

107

Guijin Son

@gson_AI

about 1 month ago

Seems like generating CAD with agents is attracting attention. But how do you know it is done right? In our new work we use finite element analysis as verification to test CAD quality!

0

4

0

157

Guijin Son

@gson_AI

about 1 month ago

Creating CAD with LLMs (or agents) is cool, but how do you guarantee these outputs are physically sound? In my recent work with @seonggyung33214 we adopted finite-element analysis to check whether the agent-created outputs are sound. Result? Even the best models fail to create CAD outputs that meet real-world engineering standards. Check out our paper here: https://t.co/0oj5XWedPI

Ruben Kostandyan

@ruben_kostard

about 1 month ago

There we go

4

23

1

18

35K

0

2

0

271

Guijin Son

@gson_AI

about 2 months ago

@mythkernel @Zai_org @deepseek_ai hi! we are out of funds/credits at the moment and are looking for ways to add new models. we definitely want to add periodic updates to the paper with new models

1

0

92

Guijin Son

@gson_AI

about 2 months ago

Looks like FrontierMath is under maintenance ⚠️ In the meantime, check out Soohak, our new research-level math dataset! https://t.co/4l4YmUauz3

Epoch AI

@EpochAIResearch

about 2 months ago

We are conducting an AI-assisted review of FrontierMath: Tiers 1-4. This has flagged fatal errors in about a third of problems, and we believe most of these flags to be valid. We will release updated scores on a corrected dataset after completing a thorough human review.

30

868

67

195

477K

0

7

0

726

Guijin Son

@gson_AI

about 2 months ago

Big thanks to @linguist_cat, @seungonekim, @AkariAsai, @wellecks, @gneubig, and @Youngjae4Yu for their help writing the paper, and to many others who contributed problems, feedback, evaluations, and support throughout the project. This work would not have been possible without the broader community of mathematicians and researchers who helped shape Soohak.

0

3

0

241

Guijin Son

@gson_AI

about 2 months ago

If you want your model or agentic harness evaluated on Soohak, please feel free to reach out to me on X or by email at [email protected]. We are also looking for support to run a public leaderboard. If you would like to help with API credits, GPU compute, funding, or infrastructure, we’d be very grateful to chat. Our goal is to make research-level math evaluation more transparent, rigorous, and useful for the community.

1

4

1

3K
