Bertrand @bertrandbuild - Twitter Profile

🚀 I’m participating in AgentX AgentBeats, the world’s largest open competition focused on agentic AI. Hosted by Berkeley RDI and connected to ~40,000 learners via the Agentic AI MOOC. This isn’t just about building better agents; it’s about how we evaluate them. 🧵👇

🧠 Why agentic evaluation matters
As AI systems become more autonomous, classic benchmarks start to break:
• data contamination
• overfitting
• leaderboard gaming
They tell us what scored well, not why systems succeed or fail.

AgentBeats flips the problem. Evaluation itself becomes agentic:
• runs tasks autonomously
• enforces protocols
• analyzes errors
• produces structured, reproducible reports
Benchmarks become systems, not spreadsheets.

🧬 What I’m building
An autonomous evaluation agent for biomedical NLP, inspired by recent large-scale LLM studies (e.g. Chen et al., Nature Comms 2025). The idea is simple: one agent evaluates other agents, rigorously and transparently.

🟢 How it works
A “Green Agent” evaluates competing “Purple Agents” (LLMs or agentic systems):
• orchestrates task execution
• enforces evaluation rules
• measures performance
• generates diagnostic reports
No single score. Real insight.

Goals:
• reduce contamination & overfitting
• enable fine-grained error analysis
• move from leaderboard scores → system-level understanding
Critical for high-stakes domains like healthcare.

🔬 Why biomedical NLP is hard
• hallucinations can sound plausible and be dangerous
• ground truth is scattered across papers & databases
• rare diseases = sparse, inconsistent data
Evaluation needs evidence, not vibes.

🤔 A key question
Can generalist LLM agents match or beat specialized biomedical NLP tools? Historically, domain-specific systems dominated:
• NER
• relation extraction
• evidence synthesis
AgentBeats finally lets us test this properly.

📊 What gets evaluated
Across 6 task types / 12 datasets:
• QA (MedQA, PubMedQA)
• NER (BC5CDR, NCBI Disease)
• multi-label classification
• relation extraction
• text simplification
• dynamic summarization (live PubMed articles)
Dynamic summarization is real-world hard: fetched live, evaluated end-to-end.

⚙️ Key capabilities
• fully automated evaluation
• fine-grained error analysis (hallucinations, boundaries, knowledge gaps…)
• 1–5⭐ ratings aligned with production readiness
• side-by-side agent comparisons
• actionable insights, not just scores

🔮 Why this matters
AgentBeats points toward a future where:
• evaluation is autonomous
• benchmarks are reproducible & contamination-aware
• AI systems are judged with real rigor
In healthcare, this isn’t optional.

More soon 👀 👉 https://t.co/EeRgZJ4IrC cc: @BerkeleyRDI
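The "Green Agent evaluates Purple Agents" loop from the thread can be illustrated with a small sketch. This is not the AgentBeats API or the author's implementation: the names `Task`, `Report`, `evaluate`, and the `max_answer_chars` rule are all hypothetical, and exact-match scoring stands in for whatever metrics the real evaluator uses. It only shows the shape of the idea: run tasks, enforce a rule, measure per task type, and keep a diagnostic error log instead of a single score.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in: a "purple agent" is any function from prompt to answer.
PurpleAgent = Callable[[str], str]


@dataclass
class Task:
    task_type: str  # e.g. "qa" or "ner" (illustrative labels)
    prompt: str
    expected: str


@dataclass
class Report:
    agent_name: str
    accuracy_by_type: Dict[str, float] = field(default_factory=dict)
    errors: List[dict] = field(default_factory=list)


def evaluate(agent_name: str, agent: PurpleAgent, tasks: List[Task],
             max_answer_chars: int = 200) -> Report:
    """Toy 'green agent': runs tasks, enforces one rule, logs every failure."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    report = Report(agent_name)
    for task in tasks:
        totals[task.task_type] = totals.get(task.task_type, 0) + 1
        answer = agent(task.prompt)
        # Assumed protocol rule: overly long answers are rejected, not scored.
        if len(answer) > max_answer_chars:
            report.errors.append({"task_type": task.task_type,
                                  "kind": "protocol_violation",
                                  "prompt": task.prompt})
            continue
        # Exact match (case-insensitive) as a placeholder metric.
        if answer.strip().lower() == task.expected.strip().lower():
            correct[task.task_type] = correct.get(task.task_type, 0) + 1
        else:
            report.errors.append({"task_type": task.task_type,
                                  "kind": "wrong_answer",
                                  "prompt": task.prompt,
                                  "got": answer,
                                  "expected": task.expected})
    for task_type, n in totals.items():
        report.accuracy_by_type[task_type] = correct.get(task_type, 0) / n
    return report


# Toy data, invented for this example only.
tasks = [
    Task("qa", "Is aspirin an NSAID? (yes/no)", "yes"),
    Task("qa", "Is insulin an antibiotic? (yes/no)", "no"),
    Task("ner", "Name the disease in: 'Patient has asthma.'", "asthma"),
]


def always_yes(prompt: str) -> str:
    return "yes"


report = evaluate("always-yes", always_yes, tasks)
print(report.accuracy_by_type)
for err in report.errors:
    print(err["task_type"], err["kind"])
```

The per-type breakdown and the error log are the point here: an agent that always answers "yes" can look acceptable on a balanced yes/no set while failing every extraction task, which a single aggregate score would hide.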

0

1

0

2

100

Bertrand

@BertrandBuild

7 months ago

@AndrewKepson @wardpeet @calcsam nice, feel free to share what you're building

1

0

16

Bertrand

@BertrandBuild

7 months ago

trying Mastra right now and wow! so much good stuff ⚡️ I’m hooked, thanks a lot @calcsam 😎

1

6

4

1

1K

Bertrand

@BertrandBuild

7 months ago

@AndrewKepson @wardpeet @calcsam sounds good! working with wordpress right?

1

0

23

Bertrand

@BertrandBuild

8 months ago

@karanjagtiani04 @iEx_ec hey Karan, the idea is to restrict access, so the agent only has access to a TEE tool and never sees the user’s personal information. The tg handle is an example, but it could be medical, financial or intimate information

1

0

53

Bertrand

@BertrandBuild

8 months ago

⚡️ excited to announce our collaboration with @iEx_ec

bringing TEE privacy and security to ai agents

SOC 2 is just a proof of audit
TEEs are proof of security

here is what we're working on 🧵 https://t.co/sZDoHExNUl

6

15

3

1

3K

Bertrand

@BertrandBuild

8 months ago

@TechVVV @iEx_ec thx TechVVV!!

0

1

0

39

Bertrand

@BertrandBuild

8 months ago

our dream is to make privacy a default and we believe the future of AI agents is ✅ autonomous ✅ privacy-first ✅ built for Web3 thanks to @iEx_ec @arbitrum @ArweaveEco for pushing the boundaries with us

0

4

0

1

91

Bertrand

@BertrandBuild

8 months ago

one example is the @web3privacy newsletter: an agent fetches the newsletter, picks the most impactful items and sends you a summary privately 🤫 you can subscribe here: https://t.co/QhplxU1VAR ps: this requires some ETH on Arbitrum

1

3

0

108

Bertrand

@BertrandBuild

8 months ago

@FarzaTV this is so cool man! I’d be happy to contribute if you share the code somewhere

0

1

0

155

Bertrand

@BertrandBuild

8 months ago

@iEx_ec thank you for the shoutout! ⚡️

0

6

BertrandBuild retweeted

iExec RLC @iEx_ec

9 months ago

1xBuild – Their platform runs on “agent templates.” With Web3Telegram, they’re building a privacy AI newsletter: Telegram updates that stay anonymous, with content archived permanently on Arweave. https://t.co/GvIY1PH9FQ

2

3

1

0

200

Bertrand

@BertrandBuild

9 months ago

This is why @iExec is building web3telegram. A privacy tool to protect your handle ⚡️

CZ 🔶 BNB

@cz_binance

9 months ago

I don't use Telegram. All those accounts are fake. Not against Telegram, but the one feature that killed it for me was anyone can message you when they know your handle. And I get spammed to the point my phone lags. Gave this feedback to Pavel directly once too. 🤷‍♂️

2K

8K

537

245

2M

0

1

0

136