3️⃣ 2️⃣ 1️⃣ Our free course on LLM evaluations for AI product teams starts today!
🎥 7 days of byte-sized videos delivered to your inbox
⭐️ Certificate upon completion
👩‍💻 No coding skills required
👩‍🎓 500+ students have signed up
You can still join the course 👇
https://t.co/Go2bNYJXCR
📌 In case you missed it
How to evaluate an AI agent?
Follow the tutorial as we:
1️⃣ Build an AI agent,
2️⃣ Create a test dataset,
3️⃣ Assess responses and tool choice,
4️⃣ Track the agent’s behaviour.
Watch the tutorial from our LLM evals course: https://t.co/lkoEhBBdGC
A Friday ML use case 📕
📚 From the database of 800 ML & LLM systems: https://t.co/jJoUj6MfFZ
How Uber improves driver availability at airports: an estimated time-to-request model, earnings-per-hour prediction, and driver-deficit forecasting.
https://t.co/c3fwqIduGx
🦾 More AI agents aren’t always better.
Google evaluated 180 agent setups and found multi-agent systems help with parallel tasks but can hurt sequential ones.
The work also proposes a model to predict optimal agentic designs.
https://t.co/ODbRtyGPui
📌 In case you missed it
Let’s test your RAG system!
Follow the tutorial as we:
1️⃣ Build a RAG system,
2️⃣ Generate test data,
3️⃣ Evaluate answers for correctness and faithfulness.
Watch the tutorial from our LLM evals course: https://t.co/HuU5TWk0HZ
A Friday ML use case 📕
📚 From the database of 800 ML & LLM systems: https://t.co/jJoUj6MfFZ
How GoDaddy built Lighthouse, an internal AI analytics platform: prompt engineering framework, model orchestration, solution architecture, and use cases.
https://t.co/fil15hoXPi
(policyNIM oss tool)
preflight command is working. when I provide a coding task, it kicks off a search through indexed policies to determine which rules are relevant for implementation.
@nvidia for embedding w/ @OpenAI + @lancedb for vector storage.
eval command is also working. using @EvidentlyAI for running eval suite.
🚦 Meta’s “Agents Rule of Two”
According to Meta, AI agents should satisfy at most two of these conditions per session to reduce prompt-injection risk:
- Handle untrusted inputs
- Access sensitive data
- Change state / act externally
https://t.co/Zdb6rHtj3i
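A minimal sketch of how the "at most two of three" check could be expressed in code. The class, field, and function names are hypothetical and not taken from Meta's write-up; the sketch only counts which of the three conditions a session enables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionCapabilities:
    # Hypothetical flags, one per condition in the list above.
    handles_untrusted_inputs: bool
    accesses_sensitive_data: bool
    changes_state_or_acts_externally: bool


def count_enabled(caps: SessionCapabilities) -> int:
    """Count how many of the three conditions the session satisfies."""
    return sum(
        [
            caps.handles_untrusted_inputs,
            caps.accesses_sensitive_data,
            caps.changes_state_or_acts_externally,
        ]
    )


def satisfies_rule_of_two(caps: SessionCapabilities) -> bool:
    """Return True when the session enables at most two of the three conditions."""
    return count_enabled(caps) <= 2


# Example: an agent that reads untrusted web pages and private data,
# but cannot act externally, stays within the rule.
reader = SessionCapabilities(True, True, False)
full_access = SessionCapabilities(True, True, True)
```

Here `satisfies_rule_of_two(reader)` is True and `satisfies_rule_of_two(full_access)` is False. A real setup would likely also need a way to decide what counts as "untrusted" or "sensitive", which this sketch leaves out.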
📌 In case you missed it
How do you know if your RAG works?
You need to check:
✅ Can it find the right information?
✅ Is the final answer complete, relevant, and free of hallucinations?
Watch the intro to RAG evaluation from our LLM evals course: https://t.co/e80MQr7ent
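One common way to approach the first check ("Can it find the right information?") is a hit rate over the top k retrieved chunks. This is a sketch under assumptions, not the course's method: the function name, the chunk IDs, and the choice of hit rate as the metric are all illustrative.

```python
from typing import Sequence, Set


def hit_rate_at_k(
    retrieved_ids: Sequence[Sequence[str]],
    relevant_ids: Sequence[Set[str]],
    k: int,
) -> float:
    """Share of queries with at least one relevant chunk among the top k results."""
    if len(retrieved_ids) != len(relevant_ids) or not retrieved_ids:
        raise ValueError("need one non-empty, equally sized entry list per query")
    hits = 0
    for retrieved, relevant in zip(retrieved_ids, relevant_ids):
        # A query counts as a hit if any of its top-k chunks is labeled relevant.
        if any(chunk_id in relevant for chunk_id in retrieved[:k]):
            hits += 1
    return hits / len(retrieved_ids)


# Hypothetical test set: two queries, ranked retrieved chunk IDs, labeled relevant IDs.
retrieved = [["a", "b", "c"], ["d", "e", "f"]]
relevant = [{"b"}, {"z"}]
score = hit_rate_at_k(retrieved, relevant, k=2)
```

With this toy data, `score` is 0.5: the first query finds a relevant chunk in its top 2, the second does not. The second check (answer quality) needs a different kind of evaluation and is not covered here.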
A Friday ML use case 📕
📚 From the database of 800 ML & LLM systems: https://t.co/jJoUj6MfFZ
How DoorDash improves its RecSys using LLMs to bridge behavioral silos in multi-vertical recommendations.
https://t.co/3bSxC7qPTG
💭 Can AI systems introspect?
Anthropic’s new research suggests Claude models can sometimes identify and describe their own internal states.
It’s still unreliable, but marks a step toward more transparent AI reasoning.
https://t.co/hEhV9xBy87
📌 In case you missed it
Can LLMs write engaging tech tweets?
Follow the tutorial as we:
1️⃣ Build a tweet generator,
2️⃣ Score its outputs with custom LLM judges,
3️⃣ Improve the results with prompt iteration.
Watch the tutorial from our LLM evals course: https://t.co/VsNXVdZNc6
A Friday ML use case 📕
📚 From the database of 800 ML & LLM systems: https://t.co/jJoUj6MfFZ
How Shopify transformed its product classification system from basic categorization to an AI-driven framework using Vision Language Models.
https://t.co/6gY2GtTY9v
📚 Context is everything.
OpenAI shares how it built an in-house data agent that answers complex questions in minutes.
It uses 6 layers of context:
- Table metadata
- Human annotations
- Codex enrichment
- Company knowledge
- Memory
- Runtime context
https://t.co/vrjw4XDktt
📌 In case you missed it
Are LLMs good for classification tasks?
We built an LLM-based classifier for a travel support chatbot and compared its performance to a classic ML model.
Watch the tutorial from our LLM evals course: https://t.co/6EayS9lThw
A Friday ML use case 📕
📚 From the database of 800 ML & LLM systems: https://t.co/jJoUj6MfFZ
How Wayfair built Wilma, a customer service agent copilot: workflow, prompt templates, and the copilot’s evolution.
https://t.co/LHLeOsMEFd
🤖 How to develop and deploy chatbots at scale?
DoorDash shares how they created a simulation platform and evaluation flywheel, allowing them to test chatbots with fast feedback loops and without production risk.
https://t.co/BS9bufAiXr
📌 In case you missed it
How to create an LLM judge that aligns with human labels:
- Define criteria
- Create test dataset
- Run evaluation prompt to see if the judge aligns with your labels
- Evaluate the judge
Watch the video from our LLM evals course: https://t.co/d3fe8a8yBY
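The last step, evaluating the judge, can be sketched as comparing judge labels to human labels on the test dataset. The metrics below (raw agreement and Cohen's kappa) are an assumption about what "aligns" could mean in practice, and the function names and labels are hypothetical.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of examples where the judge label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four responses.
human_labels = ["good", "good", "bad", "bad"]
judge_labels = ["good", "bad", "bad", "bad"]
```

On this toy data the agreement rate is 0.75 and Cohen's kappa is 0.5. If the numbers are low, the usual next move is to revise the criteria or the evaluation prompt and rerun.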
A Friday ML use case 📕
📚 From the database of 800 ML & LLM systems: https://t.co/jJoUj6MfFZ
How Wayfair uses AI agents to automatically triage support tickets: agents vs. workflows and a hybrid approach.
https://t.co/pRigeuGbZx
🔎 Scaling catalog attribute extraction with multi-modal LLMs
Instacart shares how it built PARSE, a self-serve multi-modal LLM platform for structured product attribute extraction from text and images at scale 👇
https://t.co/3CKzlFLhlD