Braintrust @Braintrust - Twitter Profile

Pinned Tweet

Braintrust

@braintrust

about 1 month ago

Topics is now GA on all plans. Continuously find the patterns worth investigating across your production traffic.

2

20

4

8

4K

Braintrust

@braintrust

1 day ago

We ran the 48 Group Stage World Cup matchups through six configurations of @p0 research agents and scored every output in Braintrust. Here's what we found → https://t.co/BrOi713GnE

1

15

4

1

3K

braintrust retweeted

Izzy Hurley

@iz_hurley_

1 day ago

In a recent @Braintrust post, I compared GLM 5.2, served by @baseten, with Anthropic’s Opus 4.8 and Sonnet 5 on a long-context code retrieval eval. Building and digging into this eval reminded me of the importance of considering the variance in your data.

1

12

2

4K

Braintrust

@braintrust

1 day ago

@huggingface Read more → https://t.co/Tb0iu7IR1O

0

2

0

133

Braintrust

@braintrust

1 day ago

The answer to "which model is cheapest" depends entirely on whether you're asking about cost per task or cost per success. We took 1,781 real agent traces from @huggingface and used Braintrust to eval them for patterns. The results show that you should pick the cheapest config that clears your quality bar, but that config is different per task family. Open-weight models can manage coding, but gpt-4.1 is still better for conversational support.
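To make the cost-per-task versus cost-per-success distinction concrete, here is a minimal sketch. The config names, prices, and success rates are invented for illustration and are not taken from the Braintrust analysis. It assumes a failed task yields no value, so cost per success is cost per task divided by success rate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Config:
    name: str             # hypothetical config label
    cost_per_task: float  # hypothetical dollars per attempted task
    success_rate: float   # hypothetical fraction of tasks that succeed


def cost_per_success(cfg: Config) -> float:
    """Expected spend per successful task (assumes failures yield no value)."""
    if cfg.success_rate <= 0:
        return float("inf")
    return cfg.cost_per_task / cfg.success_rate


def cheapest_clearing_bar(configs: List[Config], quality_bar: float) -> Optional[Config]:
    """Cheapest config per success among those meeting the quality bar."""
    eligible = [c for c in configs if c.success_rate >= quality_bar]
    if not eligible:
        return None
    return min(eligible, key=cost_per_success)


# Illustrative numbers only.
configs = [
    Config("small-model", cost_per_task=0.01, success_rate=0.25),
    Config("large-model", cost_per_task=0.03, success_rate=0.90),
]

cheapest_per_task = min(configs, key=lambda c: c.cost_per_task)
cheapest_per_success = min(configs, key=cost_per_success)
```

With these made-up numbers the two questions give different answers: the small config is cheapest per task, while the large config is cheapest per success. Running the same comparison separately for each task family is one way to see why the winning config can differ between them.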

4

3

1

515

Braintrust

@braintrust

2 days ago

Game, set, match on the inaugural Agent Open. Thanks to everyone who stopped by and grabbed a paddle. And thanks to our friends and partners who made it possible: @Modal, @Browserbase, @turbopuffer, @p0, @llama_index, and @cursor_ai. See you at the next one.

0

15

1

0

832

Braintrust

@braintrust

3 days ago

@baseten Read the full GLM-5.2 eval → https://t.co/cIXknXKgN2

0

32

5

48

17K

Braintrust

@braintrust

3 days ago

Run the best OSS models in Braintrust, in collaboration with @Baseten. Call GLM-5.2 natively, eval its quality, and observe its behavior in production. Save on inference costs without compromising quality by picking the best model for your agent. Free to try through July 31.

4

55

5

18

9K

Braintrust

@braintrust

3 days ago

@baseten How to use GLM-5.2 in Braintrust → https://t.co/ixUB4Xle17

1

5

0

2

898

Braintrust

@braintrust

4 days ago

Read more → https://t.co/vhSac8HVKg
Apply now → https://t.co/1JE5S7INL9

0

2

0

1

439

Braintrust

@braintrust

4 days ago

The AI teams that ship quality agents put evals and observability in place early. Braintrust for Startups gives early-stage companies production-grade observability infrastructure so they can build with confidence, no matter their size or resources.

1

15

5

6

11K

Braintrust

@braintrust

4 days ago

Watch the workshop → https://t.co/NBPQWz20jB
Get started with Topics → https://t.co/teAoNNlOXJ

0

245

Braintrust

@braintrust

4 days ago

Reading traces one by one doesn't scale. Topics automatically clusters production traces so you can identify patterns, investigate failures, and decide where to focus your eval efforts.

1

4

0

458

Braintrust

@braintrust

5 days ago

Learn more in the Braintrust docs → https://t.co/sgzpTTgvvn

0

109

Braintrust

@braintrust

5 days ago

Run a full chess eval without writing a single line of code using the Braintrust CLI.

- Take a CSV of chess puzzles and make a dataset.
- Write a prompt to solve mate in 2 puzzles, and upload it to the project.
- Then write a scorer that compares the output to the expected answer.

The eval found that GPT‑5 with no reasoning scored about 25% on the chess puzzles, and with low reasoning it scored about 15%.
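The comparison a scorer like this performs can be sketched in plain Python. This is not the Braintrust CLI or SDK; the function names, the normalization rule, and the sample rows are all assumptions made for illustration. It scores each row 1.0 when the model's move matches the expected move and 0.0 otherwise, then averages.

```python
from typing import Dict, List


def normalize_move(move: str) -> str:
    """Hypothetical normalization: trim whitespace and trailing check/mate/annotation marks."""
    return move.strip().rstrip("+#!?")


def exact_match_score(output: str, expected: str) -> float:
    """Return 1.0 if the normalized moves are identical, else 0.0."""
    return 1.0 if normalize_move(output) == normalize_move(expected) else 0.0


def mean_score(rows: List[Dict[str, str]]) -> float:
    """Average exact-match score over rows with 'output' and 'expected' keys."""
    scores = [exact_match_score(r["output"], r["expected"]) for r in rows]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up rows, not data from the eval described above.
rows = [
    {"output": " Qh7# ", "expected": "Qh7#"},
    {"output": "Nf3", "expected": "Rd8+"},
    {"output": "Bxf7+", "expected": "Qg4"},
    {"output": "Kh1", "expected": "Re8#"},
]

accuracy = mean_score(rows)
```

Exact match is a strict choice: it would mark an alternative winning move as wrong, so a real puzzle scorer might instead verify the move against a chess engine or a list of accepted solutions.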

3

13

3

927
