Erica @ericavaneee - Twitter Profile

Pinned Tweet

18 days ago

We built TERMS-Bench, a three-tier benchmark for LLM agents in real-world economic negotiation. No LLM-as-judge, no outcome rubrics: the environment itself is the verifier. 🏆Among frontier models, @AnthropicAI Claude Opus 4.6 #1, @Zai_org GLM 5.1 #2. ✨Surprisingly strong: @GoogleDeepMind @googlegemma Gemma 4 31B — best open-weight, holds up as negotiations get harder. 🔗 https://t.co/XajAyaZRct

ericavaneee's tweet photo. We built TERMS-Bench, a three-tier benchmark for LLM agents in real-world economic negotiation. No LLM-as-judge, no outcome rubrics: the environment itself is the verifier.

🏆Among frontier models, @AnthropicAI Claude Opus 4.6 #1, @Zai_org GLM 5.1 #2.
✨Surprisingly strong: @GoogleDeepMind @googlegemma Gemma 4 31B — best open-weight, holds up as negotiations get harder.
🔗 https://t.co/XajAyaZRct

22

235

27

107

45K

Erica

@ericavaneee

6 days ago

@googlegemma Thank you so much for highlighting TERMS-Bench! We’re excited to see Gemma 4 31B perform as the top open-weight model in our environment-as-verifier benchmark for LLM negotiation agents. Really grateful for the Gemma team’s engagement and support!

0

6

0

238

Erica

@ericavaneee

6 days ago

Very honored to see Gemma @googlegemma feature TERMS-Bench! We built TERMS-Bench to evaluate LLM negotiation agents in settings where success is not cleanly verifiable by math/code-style checks, but also should not be outsourced to LLM-as-judge. Instead, the economic environment verifies the outcome. Gemma 4 31B is the top open-weight model on our benchmark, competitive alongside frontier peers. Excited to see open models advancing in these social-strategic, agentic evaluation domains 💛🚀

Google Gemma

@googlegemma

7 days ago

Honored to see Gemma 4 31B on TERMS-Bench, a benchmark for LLM negotiation agents based on economic negotiation! 🤝 - Environment verifies outcomes (no LLM-as-judge) - Top open-weight model alongside frontier peers - Allow diagnosing why and where agents fail

googlegemma's tweet photo. Honored to see Gemma 4 31B on TERMS-Bench, a benchmark for LLM negotiation agents based on economic negotiation! 🤝

- Environment verifies outcomes (no LLM-as-judge)
- Top open-weight model alongside frontier peers
- Allow diagnosing why and where agents fail https://t.co/cYnpVpVxkU

27

531

50

62

35K

2

13

2

4

4K

ericavaneee retweeted

Google Gemma

@googlegemma

7 days ago

Honored to see Gemma 4 31B on TERMS-Bench, a benchmark for LLM negotiation agents based on economic negotiation! 🤝 - Environment verifies outcomes (no LLM-as-judge) - Top open-weight model alongside frontier peers - Allow diagnosing why and where agents fail

27

531

50

62

35K

Erica

@ericavaneee

18 days ago

Joint work across @StanfordHAI, @StanfordEng, and @StanfordGSB with @fangzhao_zhang, @aneeshpappu, @elb4tu , @jose_blanchet, @Susan_Athey, @liujiashuo77, and @james_y_zou. Thanks to @ivanleomk , @osanseviero, @o_lacombe, and @GoogleDeepMind for hosting the Gemma open-model event in SF where we first presented this ❤️🚀!

ericavaneee's tweet photo. Joint work across @StanfordHAI, @StanfordEng, and @StanfordGSB with @fangzhao_zhang, @aneeshpappu, @elb4tu , @jose_blanchet, @Susan_Athey, @liujiashuo77, and @james_y_zou.

Thanks to @ivanleomk , @osanseviero, @o_lacombe, and @GoogleDeepMind for hosting the Gemma open-model event in SF where we first presented this ❤️🚀!

0

10

0

1

974

Erica

@ericavaneee

18 days ago

We built TERMS-Bench, a three-tier benchmark for LLM agents in real-world economic negotiation. No LLM-as-judge, no outcome rubrics: the environment itself is the verifier. 🏆Among frontier models, @AnthropicAI Claude Opus 4.6 #1, @Zai_org GLM 5.1 #2. ✨Surprisingly strong: @GoogleDeepMind @googlegemma Gemma 4 31B — best open-weight, holds up as negotiations get harder. 🔗 https://t.co/XajAyaZRct

22

235

27

107

45K

Erica

@ericavaneee

18 days ago

To our knowledge, this is the first benchmark to bring verifier-based evaluation (the paradigm behind progress in math, code, and DB agents) into a multi-turn social-strategic domain. 💡The payoff: you can see where models break, not just whether they do.

ericavaneee's tweet photo. To our knowledge, this is the first benchmark to bring verifier-based evaluation (the paradigm behind progress in math, code, and DB agents) into a multi-turn social-strategic domain.

💡The payoff: you can see where models break, not just whether they do.

1

5

0

1

1K

Erica

@ericavaneee

18 days ago

Three tiers, increasing in real-world grounding: • Synthetic suite: controlled Bayesian-game environments • Catalog-grounded: real product price data • Procurement chains: stateful multi-agent commercial settings Verifier-based eval at each tier: the environment itself, not an LLM judge, scores the agent.

ericavaneee's tweet photo. Three tiers, increasing in real-world grounding:
• Synthetic suite: controlled Bayesian-game environments
• Catalog-grounded: real product price data
• Procurement chains: stateful multi-agent commercial settings

Verifier-based eval at each tier: the environment itself, not an LLM judge, scores the agent.

2

5

0

2

2K

Erica

@ericavaneee

Last Seen Users on Sotwe

Trends for you

Most Popular Users