Comet @cometml - Twitter Profile

Comet

@Cometml

14 days ago

@Rajesh7113 Thanks for sharing @Rajesh7113!

Cometml retweeted

Rajesh M

@Rajesh7113

14 days ago

AI agent debugging is a COMPLETE mess right now.

You fix one issue… and another workflow randomly breaks.

You change a prompt. Tool calls start behaving differently.

You improve latency. Accuracy drops somewhere else.

Most teams are basically duct taping evals, traces, prompts, scripts, and observability together hoping nothing explodes.

That’s why the new direction from Comet Opik feels important.

Comet Opik just dropped two features that feel like a HUGE leap for agent workflows:

• Test Suites
• Ollie

1] Test Suites

That “fix one thing, break another” problem? This is the answer.

Every real failure you hit becomes a permanent test case with plain-English rules.

So when you tweak that prompt and tool calls start misbehaving, you catch it BEFORE it ships.

No giant eval dataset to build upfront. And no more arguing whether 0.84 is better than 0.81.

You just get pass/fail on the scenarios that actually matter for your agent.

2] Ollie

And this is the CRAZY part.

A coding agent with full access to:
• your traces
• project history
• agent behavior inside Opik

That latency vs accuracy tradeoff you're constantly fighting? Ollie sees both.

It diagnoses from your real traces, writes the fix in your code, AND generates a regression test so the same tradeoff doesn't bite you twice.

So instead of:

spot issue → switch tools → debug manually → write fix → create test separately → pray

…the entire loop closes inside one platform.

Find the problem. Write the fix. Generate the regression test. All connected.

This is the first time I’ve seen an agent stack that actually feels built for iteration instead of chaos.

The teams with the fastest feedback loops are going to dominate this space.

Try Opik here: https://t.co/QG5weYcKgx

#AIAgents #AgenticAI #GenerativeAI #RAG #EnterpriseAI

Comet

@Cometml

22 days ago

Our Head of Research Doug Blank headed to Boston for his 3rd annual talk at @MITDeepLearning. He took Asimov's laws of robotics & applied them to agentic AI -- proposing his own three laws of AI and sharing how we're thinking about AI safety at Comet. https://t.co/UZB0AkkdCI

Comet

@Cometml

27 days ago

We're hiring across the team 🎉 If you know any rockstars (or are one yourself), we'd love to chat with you! 🔗 https://t.co/AM2KCkbUSM

Cometml retweeted

Paul Iusztin

@pauliusztin_

29 days ago

I just interviewed the former CTO at IBM and Chairperson of NodeJS.

Here's what I learned:

Michael @maximilien spent 12 months shipping production RAG to multiple customers.

In our discussion, he told me that nothing on a leaderboard can predict what works until you evaluate your customers' data.

Which I found interesting because...

Most teams treat RAG like a setup task.

Pick a vector database.
Pick OpenAI embeddings.
Ship it.

Then spend months “vibe-checking” results.

But production RAG doesn’t work like that.

It's more of an iteration loop rather than a setup problem.

Stitch → evaluate → iterate

A real system has multiple moving parts.

You don’t pick one...
You swap and measure each one.

Here’s what that looks like in practice:
1. Build a small eval set from real user questions
2. Build your evaluator (e.g., LLM Judge) against that dataset
3. Align your evaluator with human feedback (before trusting scores)
4. Iterate cheapest-first (retrieval → embeddings → infra)

To make this work, you also need visibility across runs.

This is where tools like Opik by @Cometml come in...

Tracking each experiment so you can compare models, configs, and results over time.

But most teams refuse to do this because it's extremely cumbersome.
• Re-ingestion takes time
• Pipelines break
• Comparisons become unreliable

So people default to benchmarks instead.
But that doesn't mean it's better.

On a real customer dataset (auction listings), Michael @maximilien swapped only the embedding model.

An open-source model ranked #130 on MTEB beat OpenAI:
• +11% quality
• 240x faster re-embedding
• 50% smaller vectors
• $0 cost

Here's the gist...

RAG is not about picking the best tools.
It’s about measuring what works for your data.

Until you do that…

You’re just guessing.

Full interview and breakdown here: https://t.co/MpJ3bYuH8g
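
The four steps listed above (a small eval set, an evaluator, alignment with human labels, and swap-and-measure iteration) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the setup from the interview: the two toy embedders, the sample listings, and the helper names `hit_rate_at_k` and `agreement_rate` are all hypothetical stand-ins for real embedding models, real customer data, and a real judge.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]
Embedder = Callable[[str], Vector]

# Hypothetical corpus and eval set (step 1: questions with known correct documents).
DOCS: Dict[str, str] = {
    "lot-1": "vintage rangefinder camera with 50mm lens",
    "lot-2": "leather camera case in brown",
    "lot-3": "set of three darkroom enlarger lenses",
}
EVAL_SET: List[Tuple[str, str]] = [
    ("rangefinder camera with lens", "lot-1"),
    ("brown leather case", "lot-2"),
    ("enlarger lenses for darkroom", "lot-3"),
]


def embed_words(text: str) -> Vector:
    """Toy embedder A: bag of lowercase words."""
    return dict(Counter(text.lower().split()))


def embed_trigrams(text: str) -> Vector:
    """Toy embedder B: bag of character trigrams."""
    s = text.lower()
    return dict(Counter(s[i:i + 3] for i in range(len(s) - 2)))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(key, 0.0) for key, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hit_rate_at_k(embed: Embedder, docs: Dict[str, str],
                  eval_set: List[Tuple[str, str]], k: int = 1) -> float:
    """Fraction of questions whose expected document is in the top-k results."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for question, expected_id in eval_set:
        q = embed(question)
        ranked = sorted(doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
        hits += expected_id in ranked[:k]
    return hits / len(eval_set)


def agreement_rate(judge_labels: List[bool], human_labels: List[bool]) -> float:
    """Step 3: how often an automated judge agrees with human labels."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Step 4: swap one component (here, the embedder) and measure on the same eval set.
for name, embed in {"words": embed_words, "trigrams": embed_trigrams}.items():
    print(name, hit_rate_at_k(embed, DOCS, EVAL_SET, k=1))
```

Keeping the eval set and the metric fixed while swapping a single component is what makes the scores comparable across runs. In practice each run's configuration and score would be logged to an experiment tracker so the results can be compared over time.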

Comet

@Cometml

about 1 month ago

"Until you evaluate on your data, nothing else matters."

Paul Iusztin

@pauliusztin_

about 1 month ago

I’ve spent the last week interviewing @maximilien, former CTO at IBM and Chairperson of NodeJS Foundation, who has shipped production RAG to multiple customers over the past year. The lesson he kept circling back to is that until you evaluate on your customer’s data, nothing else you do matters.

Production RAG is a loop: stitch your embedding model, chunking, retrieval, vector DB, and judge, then evaluate and iterate until you hit your customer’s metrics. Public benchmarks and the MTEB leaderboard are signals, not verdicts.

On a real customer dataset of Leica auction listings, an open-source sentence-transformer that ranked around #130 on MTEB still beat OpenAI by 11% in quality. It ran 240x faster, produced 50% smaller vectors, and cost $0.

Cometml retweeted

Gideon M

@gidim

about 1 month ago

As your agent matures, something shifts. You stop writing code, and start editing prompts, tweaking params, trying new tools, etc. The tooling for this phase sucks. Today, we’re fixing that. Announcing Agent Configuration + Agent Playground in Opik. 🧵

Cometml retweeted

Gideon M

@gidim

about 1 month ago

Shared by a customer: Ollie just made their Slack bot 52% faster and 98% cheaper. With Test Suites, no regressions either.

Comet

@Cometml

about 1 month ago

We're launching the Agent Playground so you can test your full agent configuration from the UI. Tweak prompts and swap models without touching your code. See how the entire agent responds and only save what works. https://t.co/yhPkl4krHG

Comet

@Cometml

about 1 month ago

Third and final day of "What we've been building" launch week: Agent Playground

Your agent isn't just one prompt. It's a complex system of models and parameters working together.

It's time to have a workflow that treats it as such.

Comet

@Cometml

about 1 month ago

@namd1nh Thanks for sharing!

Comet

@Cometml

about 1 month ago

@python_spaces Thanks for sharing!

Comet

@Cometml

about 1 month ago

@hasantoxr Thanks for sharing!

Comet

@Cometml

about 1 month ago

@itsjasonai Can't wait to see what you build!

Comet

@Cometml

about 1 month ago

It’s his first week in the office so say hi if you see him around 👋 Research preview available in the Opik Cloud. Sign up for early access: https://t.co/b1U4P9nOdF

Comet

@Cometml

about 1 month ago

Second day of "What we've been building" launch week

Meet Ollie 🦉

You may have already seen Ollie around as our mascot. Today he's also joining the team as our new coding assistant.

Comet

@Cometml

about 1 month ago

Ollie lives in the Opik UI with full context of your agent. When you spot a problem, he diagnoses it, writes the fix, ships it to your IDE, and adds a test case so it doesn't come back.

Cometml retweeted

Gideon M

@gidim

about 1 month ago

The big idea with Test Suites is that agents need comprehensive regression tests, built on nuanced assertions and real production traces. This is how you improve your agent for one user without damaging it for 3 others, as explained by @JacquesVerre https://t.co/hyCzfNmNca

Comet

@Cometml

about 1 month ago

Your suite grows as you build. Every failure you catch becomes a test case. Each failed test tells you what needs to be fixed. Available in the open-source instance. Take a first look: https://t.co/HTnIMzhGFI

Comet

@Cometml

about 1 month ago

Day 1 of "What we've been building": Test Suites

Most agent testing feels like a chore because it starts with a blank CSV. You're forced to invent a dataset before you even know how your agent fails.

Comet

@Cometml

about 1 month ago

Test Suites change that. Describe how your agent should behave using rules written in plain English and get clear pass/fail results when you run tests.
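
The idea of plain-English rules with pass/fail results can be illustrated with a small, generic sketch. To be clear, this is not Opik's API: `Case`, `run_suite`, `keyword_judge`, and `toy_agent` are hypothetical names invented for this example, and the keyword judge is a deliberately crude stand-in for the LLM-based judging a real system would likely use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Agent = Callable[[str], str]
Judge = Callable[[str, str], bool]  # (rule, agent_output) -> did the output pass?


@dataclass
class Case:
    """One regression case: an input plus the rules the output must satisfy."""
    name: str
    user_input: str
    rules: List[str]


def keyword_judge(rule: str, output: str) -> bool:
    """Stand-in judge: only understands 'must mention X' and 'must not mention X'."""
    text = output.lower()
    if rule.startswith("must not mention "):
        return rule[len("must not mention "):].lower() not in text
    if rule.startswith("must mention "):
        return rule[len("must mention "):].lower() in text
    raise ValueError(f"unsupported rule: {rule}")


def run_suite(agent: Agent, cases: List[Case], judge: Judge) -> Dict[str, bool]:
    """Run every case; a case passes only if all of its rules pass."""
    results: Dict[str, bool] = {}
    for case in cases:
        output = agent(case.user_input)
        results[case.name] = all(judge(rule, output) for rule in case.rules)
    return results


def toy_agent(question: str) -> str:
    """Hypothetical agent that always gives the same canned answer."""
    return "Your refund was issued to the original payment method."


# Each failure seen in production would be appended here as a new case.
SUITE = [
    Case("refund_status", "Where is my refund?",
         ["must mention refund", "must not mention password"]),
    Case("shipping_status", "Where is my package?",
         ["must mention tracking number"]),
]

RESULTS = run_suite(toy_agent, SUITE, keyword_judge)
for case_name, passed in RESULTS.items():
    print(case_name, "PASS" if passed else "FAIL")
```

Because each case reports a boolean instead of a score, a prompt change that breaks an old scenario shows up as a named failing case, not as a small shift in an aggregate metric.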
