DeepEval

9 days ago

Thrilled to announce @claudeai Opus 4.8 is now on @deepeval v4.0.5, go try it now.

deepeval retweeted

19 days ago

Now with @deepeval's AgentCore integration, you can:

> Instrument AgentCore with 1 line of code.
> Collect AgentCore traces in a Pytest setup
> Run evals using 50+ metrics on your AgentCore agent, with full traceability

Link in comments @awscloud

jeffr_yyy's tweet photo.

deepeval retweeted

24 days ago

Today, we're proud to announce DeepEval 4.0 — the AI evaluation harness for vibe coding agents. Our biggest and boldest release yet. A long thread 🧵 @deepeval :

jeffr_yyy's tweet photo.

deepeval retweeted

2 months ago

We just launched dataset generation on Confident AI — connect Google Drive, SharePoint, Notion, or S3 and generate eval datasets directly from your docs. That's a wrap on Launch Week! 5 days, 5 launches.

confident_ai's tweet photo.

deepeval retweeted

2 months ago

We're launching Auto-Categorize Traces & Threads — Day 4 of our Launch Week! Every production trace gets categorized automatically, so you can detect response drift and see exactly which areas your agent crushes and which ones need work. Full post: https://t.co/n5nNtzvM2f

deepeval retweeted

2 months ago

Day 3 of launch week: auto trace-to-dataset ingestion. Set a rule once — production traces continuously flow into eval datasets and annotation queues. No scripts, no stale data. Post: https://t.co/5jOqvI0WyE

confident_ai's tweet photo.

deepeval retweeted

2 months ago

Confident AI Launch Week Day 2: Scheduled Evals ⏰ Everyone agrees to run evals every few days. Nobody actually does. Now you set a frequency, configure your mappings, and evals run themselves. Link here: https://t.co/8jySn12YBq

confident_ai's tweet photo.

deepeval retweeted

2 months ago

Announcing our Q1 Launch Week! Day 1: Automated Error Analysis. Link here: https://t.co/C29LRkwFE4

4 months ago

@hasantoxr Wow 👀

deepeval retweeted

Hasan Toor

@hasantoxr

4 months ago

🚨BREAKING: Someone just solved LLM testing's biggest problem.

It's called DeepEval and it gives you answer relevancy, hallucination detection, and G-Eval metrics that actually work.

- Run evaluations 100% locally (no data leaves your machine).
- Test agents, RAG systems, and production responses with human-level accuracy.

100% open source.

hasantoxr's tweet photo.

7 months ago

My sister was just released: DeepTeam v1.0, 100% open-source, Apache 2.0 red teaming for LLMs. ⭐ Star on GitHub to stay on top of the latest developments in AI security and safety: https://t.co/FwtIQ5xiSA

deepeval retweeted

8 months ago

@_avichawla Author of @deepeval here, I'm glad you've found our approach of LLM-Arena-as-a-Judge useful :)

deepeval retweeted

Avi Chawla

@_avichawla

8 months ago

Most LLM-powered evals are BROKEN!

These evals can easily mislead you to believe that one model is better than the other, primarily due to the way they are set up.

G-Eval is one popular example.

Here's the core problem with LLM eval techniques and a better alternative to them:

Typical evals like G-Eval assume you’re scoring one output at a time in isolation, without understanding the alternative.

So when prompt A scores 0.72 and prompt B scores 0.74, you still don’t know which one’s actually better.

This is unlike scoring, say, classical ML models, where metrics like accuracy, F1, or RMSE give a clear and objective measure of performance.

There’s no room for subjectivity, and the results are grounded in hard numbers, not opinions.

LLM Arena-as-a-Judge is a new technique that addresses this issue with LLM evals.

In short, instead of assigning scores, you just run A vs. B comparisons and pick the better output.

Just like G-Eval, you can define what “better” means (e.g., more helpful, more concise, more polite), and use any LLM to act as the judge.

LLM Arena-as-a-Judge is actually implemented in @deepeval (open-source with 12k stars), and you can use it in just three steps:

- Create an ArenaTestCase, with a list of “contestants” and their respective LLM interactions.

- Next, define your criteria for comparison using the Arena G-Eval metric, which incorporates the G-Eval algorithm for a comparison use case.

- Finally, run the evaluation and print the scores.

This gives you an accurate head-to-head comparison.

Note that LLM Arena-as-a-Judge can either be referenceless (as shown in the snippet below) or reference-based. If needed, you can specify an expected output as well for the given input test case and specify that in the evaluation parameters.

Why DeepEval?

It's 100% open-source with 12k+ stars and implements everything you need to define metrics, create test cases, and run evals like:

- component-level evals
- multi-turn evals
- LLM Arena-as-a-judge, etc.

Moreover, tracing LLM apps is as simple as adding one Python decorator.

And you can run everything 100% locally.

I have shared the repo in the replies.
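
The snippet from the tweet's photo did not survive as text. As a rough illustration of the A vs. B idea described above, here is a minimal Python sketch. It is not DeepEval's API: `Contestant`, `run_arena`, and `shortest_wins_judge` are hypothetical names invented for this example, and the judge is a deterministic stand-in that prefers the shorter output, where a real setup would ask an LLM to apply the criteria.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Contestant:
    """Hypothetical container: one candidate (a prompt or model) and its output."""
    name: str
    output: str


# A judge receives (criteria, user_input, output_a, output_b) and returns "A" or "B".
Judge = Callable[[str, str, str, str], str]


def shortest_wins_judge(criteria: str, user_input: str,
                        output_a: str, output_b: str) -> str:
    """Stand-in for an LLM judge (assumption): prefers the shorter output.

    It ignores `criteria`; a real judge would send the criteria and both
    outputs to an LLM and parse its verdict.
    """
    return "A" if len(output_a) <= len(output_b) else "B"


def run_arena(user_input: str, contestants: List[Contestant],
              criteria: str, judge: Judge) -> Dict[str, int]:
    """Count head-to-head wins over every pair of contestants."""
    wins = {c.name: 0 for c in contestants}
    for a, b in itertools.combinations(contestants, 2):
        # Judge each pair in both orders so neither side always sits in slot "A".
        for first, second in ((a, b), (b, a)):
            verdict = judge(criteria, user_input, first.output, second.output)
            winner = first if verdict == "A" else second
            wins[winner.name] += 1
    return wins


contestants = [
    Contestant("prompt_a", "Paris."),
    Contestant("prompt_b", "The capital of France is the city of Paris."),
]
wins = run_arena("What is the capital of France?", contestants,
                 "more concise", shortest_wins_judge)
print(max(wins, key=wins.get))
```

The result is a win count per contestant, not an absolute score, which is the point the thread makes: a comparison says which output is preferred, while two isolated scores such as 0.72 and 0.74 do not.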

_avichawla's tweet photo.

8 months ago

🙌 our favorite VC ❤️‍🔥

Vermilion Cliffs Ventures

@vermilionfund

8 months ago

The new Vermilion newsletter is out 🗞️ Inside: 💰 @514hq raises $17m to simplify AI-ready analytics 📈 @deepeval becomes the most adopted LLM eval framework globally 🤝 Google’s Agent Development Kit ships a @CopilotKit integration 👀 Who’s hiring? Check out the new Vermilion Careers page Plus: @ashl3ysm1th's take on startup KPIs, revenge of the acronyms, and more founder lessons from the trail.

deepeval retweeted

Vermilion Cliffs Ventures

@vermilionfund

8 months ago

The new Vermilion newsletter is out 🗞️ Inside: 💰 @514hq raises $17m to simplify AI-ready analytics 📈 @deepeval becomes the most adopted LLM eval framework globally 🤝 Google’s Agent Development Kit ships a @CopilotKit integration 👀 Who’s hiring? Check out the new Vermilion Careers page Plus: @ashl3ysm1th's take on startup KPIs, revenge of the acronyms, and more founder lessons from the trail.

deepeval retweeted

Avi Chawla

@_avichawla

9 months ago

Pytest for LLM Apps is finally here! DeepEval turns LLM evals into a two-line test suite to help you identify the best models, prompts, and architecture for AI workflows (including MCPs). Works with all frameworks like LlamaIndex, CrewAI, etc. 100% open-source with 11k stars!

_avichawla's tweet photo.

deepeval retweeted

Mariano Falcón

@falconius

9 months ago

And now an external tool can be useful: langfuse, @braintrustdata @deepeval @langfuse @helicone_ai

deepeval retweeted

anshuman

@athleticKoder

9 months ago

Companies like @OpenAI, @perplexity_ai and @AnthropicAI already use LLM judges for production evaluation at massive scale. @ragas_io and @deepeval are two evaluation frameworks that I personally find intuitive. [NOT AN AD]

deepeval retweeted

9 months ago

"widespread adoption" @deepeval
