Manasi Sharma @ ICLR 2026 @ManasiSharma_ - Twitter Profile

Pinned Tweet

Manasi Sharma @ ICLR 2026 @ManasiSharma_

7 months ago

🚀New @scale_AI paper: 𝗥𝗲𝘀𝗲𝗮𝗿𝗰𝗵𝗥𝘂𝗯𝗿𝗶𝗰𝘀, a benchmark for evaluating Deep Research (DR) agents. Even top agents like Gemini & OpenAI DR achieve <𝟲𝟴% 𝗿𝘂𝗯𝗿𝗶𝗰 𝗰𝗼𝗺𝗽𝗹𝗶𝗮𝗻𝗰𝗲. We built 𝟮.𝟱𝗞+ expert rubrics with 𝟮.𝟴𝗞+ hrs of human labor to measure why.

https://t.co/aPYN3WZBhW

12

222

33

129

32K

Manasi Sharma @ ICLR 2026 @ManasiSharma_

2 months ago

also excited to chat more broadly about agents (deep research agents, multimodal computer-use agents, etc.), evals & post-training

1

0

140

Manasi Sharma @ ICLR 2026 @ManasiSharma_

2 months ago

i'll be attending @iclr_conf in Rio this week! 🇧🇷 excited to be presenting ResearchRubrics at the 4/23 Thurs afternoon poster session, and a poster on rubric-benchmark sensitivity at the Agents in the Wild workshop

Manasi Sharma @ ICLR 2026 @ManasiSharma_

5 months ago

excited to share that ResearchRubrics has been accepted to ICLR 2026! see you in Rio☀️

4

67

5

23

8K

1

33

1

5

2K

Manasi Sharma @ ICLR 2026 @ManasiSharma_

4 months ago

@mariyaivasileva @WiMLworkshop hi @WiMLworkshop, I just ran into this issue as well

1

0

62

ManasiSharma_ retweeted

Epoch AI

@EpochAIResearch

4 months ago

Can AI do real digital work? We reviewed three benchmarks to find out: RLI, GDPval, and APEX-Agents. Our take: progress here will indicate substantial economic value, but tasks are too self-contained to tell us about wholesale automation. Thread for more:

https://t.co/ZnzwMxSbkl

6

139

20

63

22K

ManasiSharma_ retweeted

Spreadsheet Arena

@sheetarena

4 months ago

TL;DR: Spreadsheet generation is multi-dimensional. Human preference data captures what users actually value, but different dimensions matter across domains, and some signals surface more clearly than others. Spreadsheet Arena gives us a powerful foundation for evaluation, and a new lens for improving post-training. Start a battle at https://t.co/MHGdWX6xxm Read the paper at https://t.co/6GX4ify8CS @srkundurthy @claranahhh @Zachkirshner @calvincbzhang @ManasiSharma_ @jhnling

1

10

1

0

724

ManasiSharma_ retweeted

Calvin Zhang

@calvincbzhang

4 months ago

New paper from @scale_AI & @MeridianAgent: SpreadsheetArena 📄 We evaluated 16 LLMs on end-to-end spreadsheet generation via 4,300+ blind pairwise votes. Crucially, we move beyond scalar Elo ratings to decompose the latent preference signal into functional, structural, and stylistic components. 🧵

2

32

4

9

8K

ManasiSharma_ retweeted

Scale AI

@scale_AI

4 months ago

🎙️ In our latest Chain of Thought episode we unpack ResearchRubrics, our benchmark for evaluating deep research agent performance. We explore what meaningful agent evaluation looks like, where today’s agents still fall short, and why clearer evaluation frameworks are critical as agent use accelerates.

4

18

5

4

3K

Manasi Sharma @ ICLR 2026 @ManasiSharma_

5 months ago

🎙️check out the full episode and paper: https://t.co/9AuTRbGzuP https://t.co/VQurPwFeqY

0

5

1

391

Manasi Sharma @ ICLR 2026 @ManasiSharma_

5 months ago

i also recently sat down with @calvincbzhang and @Bckenstler on the CoT podcast at @scale_AI to discuss Deep Research agents & the ResearchRubrics paper in particular 🧵

1

39

4

27

2K

Manasi Sharma @ ICLR 2026 @ManasiSharma_

5 months ago

we dug into (1) how to assess deep research queries & reports, (2) where today’s agents still fall short, and (3) why clearer evaluation frameworks are critical as real-world agent use accelerates. :)

1

3

0

291

Manasi Sharma @ ICLR 2026 @ManasiSharma_

5 months ago

excited to share that ResearchRubrics has been accepted to ICLR 2026! see you in Rio☀️

4

67

5

23

8K

Manasi Sharma @ ICLR 2026 @ManasiSharma_

7 months ago

chatted with other folks at NeurIPS last week working on browser / CUA post-training, and it seems like people are shifting toward vision-only agents, which appear to generalize better

0

5

0

1

308

Manasi Sharma @ ICLR 2026 @ManasiSharma_

7 months ago

@LiuZuxin would love to chat!

0

51

Manasi Sharma @ ICLR 2026 @ManasiSharma_

7 months ago

@996roma would love to chat!

0

51

Manasi Sharma @ ICLR 2026 @ManasiSharma_

7 months ago

@DrJimFan would love to meet!

0

171

Manasi Sharma @ ICLR 2026 @ManasiSharma_

7 months ago

i'll be attending NeurIPS in San Diego next week! excited to chat about agents in general (deep research agents, multimodal browser agents for long-horizon tasks, etc.), evals for open-ended + economically valuable tasks, and RL / post-training

3

9

0

2

814

ManasiSharma_ retweeted

Rohan Paul

@rohanpaul_ai

7 months ago

ResearchRubrics is a benchmark that tests how well AI deep research agents handle open ended web questions.

The key finding is that these agents still fall short of expert expectations once checked against rubrics.

The benchmark has 101 prompts from 9 domains, each with a human written checklist about content, reasoning, synthesis, citations, and clarity.

Each checklist item has a weight, some are mandatory, some optional, and a judge marks them as fully met, partly met, or not met.

Tasks get 3 tags describing topic breadth, reasoning depth, and how open ended the goal is.

Top agents stay below 68% compliance because they miss implicit context and fail to connect information across documents, even though language model judges and humans agree when rubrics use concrete examples.

----

Paper – arxiv. org/abs/2511.07685

Paper Title: "ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents"

2

20

5

15

4K
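A minimal sketch of how a weighted rubric score like the one described in the retweet above could be computed. This is an illustration under stated assumptions, not the paper's scoring code: the names `RubricItem`, `VERDICT_CREDIT`, `compliance`, and `unmet_mandatory` are hypothetical, the 0.5 credit for a partly met item is an assumed value, and the way mandatory items are reported is a guess at one reasonable design.

```python
from dataclasses import dataclass

# Assumed credit per judge verdict. The tweet names three verdicts
# (fully met, partly met, not met); the 0.5 for "partial" is illustrative.
VERDICT_CREDIT = {"met": 1.0, "partial": 0.5, "not_met": 0.0}


@dataclass
class RubricItem:
    description: str
    weight: float            # importance of this checklist item
    verdict: str             # one of the keys in VERDICT_CREDIT
    mandatory: bool = False  # hypothetical flag for mandatory vs optional


def compliance(items):
    """Weighted fraction of rubric credit earned, between 0 and 1."""
    total = sum(item.weight for item in items)
    if total <= 0:
        raise ValueError("rubric needs a positive total weight")
    earned = sum(item.weight * VERDICT_CREDIT[item.verdict] for item in items)
    return earned / total


def unmet_mandatory(items):
    """Descriptions of mandatory items that were not fully met."""
    return [i.description for i in items if i.mandatory and i.verdict != "met"]


# Made-up example rubric for a single response.
items = [
    RubricItem("cites primary sources", 3.0, "met", mandatory=True),
    RubricItem("connects findings across documents", 2.0, "partial", mandatory=True),
    RubricItem("notes limitations", 1.0, "not_met"),
]
score = compliance(items)
print(f"{score:.3f}")          # 0.667
print(unmet_mandatory(items))  # ['connects findings across documents']
```

Here the response earns 3.0 + 1.0 + 0.0 = 4.0 of 6.0 possible weighted points, so a "<68% compliance" figure would correspond to a score below 0.68 under this assumed formula.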

Manasi Sharma @ ICLR 2026 @ManasiSharma_

7 months ago

thanks @_akhaliq for featuring our new work! 💻we also recently released the code to evaluate the benchmark: https://t.co/POPYsW0SGR

AK

@_akhaliq

7 months ago

ResearchRubrics A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

4

56

7

25

13K

1

13

0

1

1K

Manasi Sharma @ ICLR 2026 @ManasiSharma_

7 months ago

@atasteoff thanks for sharing! we added a few more details and the link to the dataset here: https://t.co/bjB0AkaB9x

0

1

0

93

Manasi Sharma @ ICLR 2026 @ManasiSharma_

7 months ago

This work was led by our amazing team at @scale_AI & other institutions: @calvincbzhang, @cbandieth, @clintonjwang, @ankitaich30, @hnghiem_ai, @TahseenRab74917, @yehtetCS, @Rkwmdkrhrh, @sumanabasu21, @Iishiiyaa, @DenisPeskoff, @marcos_scale, @SeanHendryx, @Bckenstler, @vbingliu

2

10

0

588

Manasi Sharma @ ICLR 2026 @ManasiSharma_

7 months ago

ResearchRubrics is publicly available 🎉 📄 100+ realistic prompts 🧮 2.5K expert rubrics 📄 arXiv: https://t.co/AEwQL6FLNs 🌐 Website: https://t.co/dtYIR9qmm7 🤗 Dataset: https://t.co/EmwtIHI2zh

1

17

3

4

973
