🚀 I’m participating in AgentX AgentBeats — the world’s largest open competition focused on agentic AI.
Hosted by Berkeley RDI and connected to ~40,000 learners via the Agentic AI MOOC.
This isn’t just about building better agents — it’s about how we evaluate them. 🧵👇
🧠 Why agentic evaluation matters
As AI systems become more autonomous, classic benchmarks start to break:
• data contamination
• overfitting
• leaderboard gaming
They tell us what scored well — not why systems succeed or fail.
AgentBeats flips the problem.
Evaluation itself becomes agentic:
• runs tasks autonomously
• enforces protocols
• analyzes errors
• produces structured, reproducible reports
Benchmarks become systems, not spreadsheets.
🧬 What I’m building
An autonomous evaluation agent for biomedical NLP, inspired by recent large-scale LLM studies (e.g. Chen et al., Nature Comms 2025).
The idea is simple:
one agent evaluates other agents — rigorously and transparently.
🟢 How it works
A “Green Agent” evaluates competing “Purple Agents” (LLMs or agentic systems):
• orchestrates task execution
• enforces evaluation rules
• measures performance
• generates diagnostic reports
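To make the loop concrete, here is a minimal sketch of what a Green Agent's core could look like. Everything in it is an assumption for illustration: the `Task` and `Report` types, the `evaluate` function, the exact-match scoring, and the `max_chars` protocol rule are hypothetical names, not the AgentBeats API or the actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical: a "purple agent" is modelled as any callable from prompt to answer.
PurpleAgent = Callable[[str], str]


@dataclass
class Task:
    task_id: str
    prompt: str
    gold: str


@dataclass
class Report:
    agent_name: str
    correct: int = 0
    total: int = 0
    errors: List[dict] = field(default_factory=list)

    @property
    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0


def evaluate(agents: Dict[str, PurpleAgent], tasks: List[Task],
             max_chars: int = 200) -> Dict[str, Report]:
    """Run every task against every agent and collect a structured report."""
    reports: Dict[str, Report] = {}
    for name, agent in agents.items():
        report = Report(agent_name=name)
        for task in tasks:
            report.total += 1
            # Orchestrate: call the agent, recording crashes instead of aborting.
            try:
                answer = agent(task.prompt)
            except Exception as exc:
                report.errors.append(
                    {"task": task.task_id, "kind": "crash", "detail": repr(exc)})
                continue
            # Enforce an (assumed) protocol rule: answers must stay short.
            if len(answer) > max_chars:
                report.errors.append(
                    {"task": task.task_id, "kind": "protocol_violation",
                     "detail": "answer too long"})
                continue
            # Measure: case-insensitive exact match, a deliberately simple stand-in.
            if answer.strip().lower() == task.gold.strip().lower():
                report.correct += 1
            else:
                report.errors.append(
                    {"task": task.task_id, "kind": "wrong_answer", "detail": answer})
        reports[name] = report
    return reports
```

Each `Report` keeps the per-task error records next to the aggregate accuracy, so the output is a diagnostic record rather than one number.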
No single score.
Real insight.
Goals:
• reduce contamination & overfitting
• enable fine-grained error analysis
• move from leaderboard scores → system-level understanding
Critical for high-stakes domains like healthcare.
🔬 Why biomedical NLP is hard
• hallucinations can sound plausible — and be dangerous
• ground truth is scattered across papers & databases
• rare diseases = sparse, inconsistent data
Evaluation needs evidence, not vibes.
🤔 A key question
Can generalist LLM agents match or beat specialized biomedical NLP tools?
Historically, domain-specific systems dominated:
• NER
• relation extraction
• evidence synthesis
AgentBeats finally lets us test this properly.
📊 What gets evaluated
Across 6 task types / 12 datasets:
• QA (MedQA, PubMedQA)
• NER (BC5CDR, NCBI Disease)
• multi-label classification
• relation extraction
• text simplification
• dynamic summarization (live PubMed articles)
Dynamic summarization is the real-world-hard one: articles are fetched live and evaluated end-to-end.
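The thread does not say which metric each task uses. As one illustration, NER datasets such as BC5CDR and NCBI Disease are commonly scored with strict entity-level F1, where a prediction counts only if start, end, and type all match. The `Span` type and `entity_f1` function below are hypothetical names for a sketch of that metric, not the project's scoring code.

```python
from typing import Dict, Set, Tuple

# Hypothetical span representation: (start_offset, end_offset, entity_type).
Span = Tuple[int, int, str]


def entity_f1(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Strict entity-level precision, recall, and F1 over sets of spans."""
    tp = len(gold & pred)  # exact matches on offsets and type
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

A prediction that is off by a few characters scores zero under this strict rule, which is one reason a separate boundary-error analysis is useful.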
⚙️ Key capabilities
• fully automated evaluation
• fine-grained error analysis (hallucinations, boundaries, knowledge gaps…)
• 1–5⭐ ratings aligned with production readiness
• side-by-side agent comparisons
• actionable insights — not just scores
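As a rough sketch of what fine-grained error analysis can mean for NER, the function below sorts predictions into exact matches, boundary errors (right type, overlapping but not identical offsets), and hallucinated entities, then counts missed gold entities. The category names and the `categorize_ner_errors` function are assumptions for illustration. Knowledge-gap detection would need more than span comparison and is not covered here.

```python
from typing import Dict, Set, Tuple

# Hypothetical span representation: (start_offset, end_offset, entity_type).
Span = Tuple[int, int, str]


def categorize_ner_errors(gold: Set[Span], pred: Set[Span]) -> Dict[str, int]:
    """Count exact matches, boundary errors, hallucinations, and misses."""
    counts = {"correct": 0, "boundary": 0, "hallucinated": 0, "missed": 0}
    matched_gold: Set[Span] = set()
    for p in pred:
        if p in gold:
            counts["correct"] += 1
            matched_gold.add(p)
            continue
        # Same type and overlapping character range, but not identical offsets.
        overlapping = [g for g in gold
                       if g[2] == p[2] and g[0] < p[1] and p[0] < g[1]]
        if overlapping:
            counts["boundary"] += 1
            matched_gold.update(overlapping)
        else:
            # No gold entity of this type anywhere near the predicted span.
            counts["hallucinated"] += 1
    # Gold entities that no prediction touched at all.
    counts["missed"] = len(gold - matched_gold)
    return counts
```

Counts like these can sit next to the aggregate score in a report, so a reader can tell a system that mostly clips entity boundaries from one that invents entities.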
🔮 Why this matters
AgentBeats points toward a future where:
• evaluation is autonomous
• benchmarks are reproducible & contamination-aware
• AI systems are judged with real rigor
In healthcare, this isn’t optional.
More soon 👀
👉 https://t.co/EeRgZJ4IrC
cc: @BerkeleyRDI
@karanjagtiani04 @iEx_ec hey Karan,
the idea is to restrict access,
so the agent only has access to a TEE tool and never sees the user’s personal information
The tg handle is an example, but it could be medical, financial or intimate information
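A rough sketch of that access pattern, under loud assumptions: the agent holds only an opaque reference, and a tool resolves it to the real contact. `make_private_sender`, `vault`, `deliver`, and `agent_step` are hypothetical names, and a plain Python closure is not a TEE. It only illustrates the interface boundary; in the real setup the isolation would come from the enclave, not from the language.

```python
from typing import Callable, Dict


def make_private_sender(vault: Dict[str, str],
                        deliver: Callable[[str, str], None]
                        ) -> Callable[[str, str], str]:
    """Build the tool. `vault` maps opaque refs to private contacts (assumed)."""
    def send_private(recipient_ref: str, message: str) -> str:
        # The lookup happens inside the tool; the contact is never returned.
        contact = vault[recipient_ref]
        deliver(contact, message)
        return "sent"
    return send_private


def agent_step(send_private: Callable[[str, str], str],
               recipient_ref: str, summary: str) -> str:
    """The agent sees only the opaque ref and the tool's status string."""
    return send_private(recipient_ref, summary)
```

The agent code path never receives the handle itself, only `recipient_ref` and the status string, and the same shape would apply to medical or financial fields.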
⚡️ excited to announce our collaboration with @iEx_ec
bringing TEE privacy and security to ai agents
SOC 2 is just a proof of audit
TEEs are proof of security
here is what we're working on 🧵
our dream is to make privacy a default
and we believe the future of AI agents is
✅ autonomous
✅ privacy-first
✅ built for Web3
thanks to @iEx_ec @arbitrum @ArweaveEco
for pushing the boundaries with us
one example is the @web3privacy newsletter
an agent fetches the newsletter,
chooses the most impactful items
and sends you a summary privately 🤫
you can subscribe here: https://t.co/QhplxU1VAR
ps: this requires some ETH on Arbitrum
1xBuild – Their platform runs on “agent templates.” With Web3Telegram, they’re building a privacy AI newsletter: Telegram updates that stay anonymous, with content archived permanently on Arweave.
https://t.co/GvIY1PH9FQ
I don't use Telegram. All those accounts are fake.
Not against Telegram, but the one feature that killed it for me was that anyone can message you once they know your handle. And I get spammed to the point that my phone lags. Gave this feedback to Pavel directly once too. 🤷‍♂️