Rayan Krishnan

7 days ago

NYT cited our Vibe Code Bench in its coverage of Opus 4.8, signal that real-world model evals are becoming part of how frontier labs are understood. With Ant's S1 filing, would expect benchmarks like this to matter more over time for accountability + market reporting.

RayanKrishnan's tweet photo. NYT cited our Vibe Code Bench in its coverage of Opus 4.8, signal that real-world model evals are becoming part of how frontier labs are understood.

With Ant's S1 filing, would expect benchmarks like this to matter more over time for accountability + market reporting. https://t.co/ts7tfckFY3

0

9

0

2K

10 days ago

"It’s actually one of the existential risks for AI progress in general... What are you hillclimbing on? It is one of the foundational questions that needs to be answered" Was great to talk through the 3.5 release with Logan and where model progress is headed

11 days ago

We are excited to share that @OfficialLoganK joined us on The Bench to discuss Google's new Gemini 3.5 Flash: why it's deliberately more persistent and capable than previous Flash models, how it hit #1 on our FinanceAgent Benchmark taking 82 steps where competitors stopped at 13, and what justifies the price increase. We also get into why AI benchmarks need a paradigm shift, the trade-off of building everything vs staying focused, the Pope, and why Omni might kill the Subway Surfers content era. 0:11:00 – Flash is being rebased for the agent era, not just a cheaper model anymore 0:14:03 – Persistence by design: 82 tool calls vs competitors' 13 0:17:52 – Why pricing went up and how Google thinks about value per token 0:22:55 – Coding performance: from 20th to 10th place in one generation 0:28:28 – Why benchmarks have historically been misleading and what the new era of evaluation looks like 0:29:28 Logan on why Google has the best researchers in the world 0:36:16 – The cost of being Google 0:39:07 – The Pope’s encyclical on AI and whether most people see frontier intelligence as a good thing 0:51:12 – Why Omni is the thing that recently clicked

10

106

13

43

16K

0

1

0

262

11 days ago

Ant continuing to push the ceiling and find headroom on coding. I was blown away seeing such a significant jump on our Vibe Code Benchmark (71% -> 83%) only furthering their lead as the SOTA coding model.

11 days ago

Anthropic just dropped another powerhouse model, Opus 4.8 and it’s the new SOTA on the Vals Index (70.2%) and Vals Multimodal (70.7%). Full results below.

ValsAI's tweet photo. Anthropic just dropped another powerhouse model, Opus 4.8 and it’s the new SOTA on the Vals Index (70.2%) and Vals Multimodal (70.7%). Full results below. https://t.co/HIPy2VNsSE

2

119

11

18

47K

0

2

0

131

RayanKrishnan retweeted

14 days ago

Pitch us a benchmark or eval technique. We'll fund you to build it. We're opening applications for the Vals Fellowship. 3–6 months working on the hardest open problems in AI evaluation, with the resources to actually solve them. What you get: - Unlimited API credits + budget capacity for GPUs and human data - Vals’ evaluation infrastructure - $1,000–2,500 / week stipend - A network of evals researchers across frontier labs and academia Location: Both remote / in-person in SF applications will be considered

ValsAI's tweet photo. Pitch us a benchmark or eval technique. We'll fund you to build it.

We're opening applications for the Vals Fellowship. 3–6 months working on the hardest open problems in AI evaluation, with the resources to actually solve them.

What you get:
- Unlimited API credits + budget capacity for GPUs and human data
- Vals’ evaluation infrastructure
- $1,000–2,500 / week stipend
- A network of evals researchers across frontier labs and academia

Location: Both remote / in-person in SF applications will be considered

23

515

38

861

97K

RayanKrishnan retweeted

24 days ago

Yesterday we had the privilege of hosting @tkalil2050 and the top foundation model labs at our office for an exclusive first look at what we are shipping next. Exciting things to come! DM us if you want to come to the next one.

ValsAI's tweet photo. Yesterday we had the privilege of hosting @tkalil2050 and the top foundation model labs at our office for an exclusive first look at what we are shipping next.

Exciting things to come! DM us if you want to come to the next one. https://t.co/Z7EaEy40eM

0

16

3

1

2K

RayanKrishnan retweeted

26 days ago

Can AI do the job of a financial analyst? We just released V2 of our Finance Agent Benchmark and tested the frontier models. The results are tighter than you'd expect.

ValsAI's tweet photo. Can AI do the job of a financial analyst?

We just released V2 of our Finance Agent Benchmark and tested the frontier models. The results are tighter than you'd expect. https://t.co/uE9fRd05iy

10

146

14

41

84K

RayanKrishnan retweeted

27 days ago

Finance Agent Benchmark v2 is here. Finance is one of the most lucrative applications of AI where much of the busy work could be automated. That’s why we rebuilt our Finance Agent Benchmark to push frontier models even further. We designed V2 to better reflect what financial analysts actually do: refined taxonomy reflecting real workflows, an improved harness with more tools, and jury-based evaluation. The result: no model cracks 52%. Would you trust a financial analyst who’s only correct half the time?

11

93

15

41

11K

about 1 month ago

h/t @sean_t_strong

0

95

about 1 month ago

which one is the consumer ai company again?

1

4

0

336

about 1 month ago

@ArthurMacwaters yes strong agree. open to collaborating on this :)

1

2

1

0

106

RayanKrishnan retweeted

about 1 month ago

After reaching out, we were able to confirm with OpenAI that “tool_choice”: “none” injects an additional steering instruction into the model system prompt, in a way that tools: [] does not. This instruction seemingly hurts the model’s ability to use the Terminus 2 harness effectively, which, despite not using native-tool-calling, is still agentic.

3

56

2

14

14K