HB-Eval System @hbevalsystem - Twitter Profile

Pinned Tweet

7 months ago

I’m excited to join the technology and AI community and share my research journey, where I’m working on building a new generation of explainable agents with measurable cognitive performance. Abuelgasim Adam, Founder of HB-Eval System

0

1

0

182

HB-Eval System @hbEvalSystem

14 days ago · Israel

Modern AI agents operate in messy environments:
• APIs fail
• Tools become unavailable
• Context changes
• Memory becomes inconsistent
• Plans need revision
Traditional benchmarks rarely capture these conditions. Yet these are exactly the situations that determine whether an

0

3

HB-Eval System @hbEvalSystem

14 days ago · Israel

Most AI benchmarks measure what a model can do. Very few measure what happens when things go wrong. A model can achieve 95% accuracy in evaluation and still fail in production. Why? Because capability and reliability are not the same thing. 🧵

0

4

HB-Eval System @hbEvalSystem

about 1 month ago · Israel

The Day of Arafah is the greatest of days, a day on which God forgives sins and pardons. I ask God to forgive you all past and future sins, and that next year you may be in the best of circumstances. I send you sincere prayers from the depths of my heart while I am fasting.

0

15

HB-Eval System @hbEvalSystem

about 1 month ago

@GeminiApp Yes, you are doing a great job, O human. Thanks a lot.

0

67

HB-Eval System @hbEvalSystem

5 months ago · Israel

3,000 Experiments in Progress! 🚨 Testing the limits of AI Reliability. Watching the HB-Eval engine stress-test the giants: Llama-3.3-70b, 3.1-8b, and Gemma-2-9b. We aren't just asking "Can it reason?" — We are asking "When will it break?" 🔍 Current Stage: Cycle 50/1000

0

37

HB-Eval System @hbEvalSystem

5 months ago · Israel

Breaking the Illusion of AI Reliability 🚀⚖️ I’ve just completed a massive 3,000-run stress test on LLM agents using HB-Eval. The goal? Testing behavioral reliability under systematic Compound Fault Injections. Results show that "Bigger" doesn't mean "Reliable." 📊👇

hbEvalSystem's tweet photo: https://t.co/oxmxCCQdES
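The stress test described in this tweet can be pictured with a small sketch. This is not the HB-Eval engine: the fault names are borrowed from the failure conditions listed elsewhere in this feed, and `toy_agent`, `compound_faults`, `run_stress_test`, and the success probabilities are all hypothetical stand-ins chosen for illustration. It shows one plausible way to run an agent under every combination of one or two injected faults and compare the result with the fault-free baseline.

```python
import random
from itertools import combinations

# Hypothetical fault types, named after the failure conditions listed in this feed.
FAULTS = ("api_failure", "tool_unavailable", "context_change", "memory_inconsistency")


def compound_faults(max_size=2):
    """Return every combination of 1..max_size faults as tuples."""
    combos = []
    for size in range(1, max_size + 1):
        combos.extend(combinations(FAULTS, size))
    return combos


def toy_agent(active_faults, rng):
    """Stand-in for a real agent: each active fault lowers the chance of success.

    The 0.95 baseline and 0.7 penalty are made-up numbers, not measured results.
    """
    success_probability = 0.95 * (0.7 ** len(active_faults))
    return rng.random() < success_probability


def run_stress_test(agent, runs_per_condition=100, seed=0):
    """Run the agent under the fault-free condition and every compound fault.

    Returns a dict mapping each fault tuple to its observed success rate.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    results = {}
    for faults in [()] + compound_faults():
        successes = sum(agent(faults, rng) for _ in range(runs_per_condition))
        results[faults] = successes / runs_per_condition
    return results


results = run_stress_test(toy_agent)
nominal = results[()]            # success rate with no faults injected
worst = min(results.values())    # lowest success rate across all conditions
gap = nominal - worst            # one possible (assumed) notion of a reliability gap
print(f"nominal={nominal:.2f} worst={worst:.2f} gap={gap:.2f}")
```

Swapping `toy_agent` for a wrapper around a real model call would keep the loop unchanged; how the gap is defined here is an assumption, since the feed does not state how HB-Eval computes it.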

0

16

HB-Eval System @hbEvalSystem

5 months ago · Israel

Introducing HB-Eval: a multi-layered architecture designed for the "Real World." Unlike traditional benchmarks, HB-Eval uses Episodic Narrative Memory (EDM) to detect failure contexts before they happen. It’s not just about getting the right answer; it’s about Preemptive Adaptation.

hbEvalSystem's tweet photo: https://t.co/JzDDfPsjLw

0

12

HB-Eval System @hbEvalSystem

5 months ago · Israel

a catastrophic 43% Reliability Gap. 📉 I’ve discovered a hidden failure mode called "Withdrawal Pathology"—where agents choose to fail efficiently rather than succeed reliably. Here is how my framework HB-Eval solves the Nash Equilibrium of AI failure. 🧵👇

hbEvalSystem's tweet photo: https://t.co/Dd1U9ieVye

0

13

HB-Eval System @hbEvalSystem

5 months ago · Israel

Witness HB-Eval in action! Cycle 1: System detects a reliability gap and stores the failure context in Episodic Memory. Cycle 2: Preemptive Intelligence detects a 'Similarity Match' and adapts BEFORE the failure occurs. This is how we move from AI Capability to Reliability.
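The two cycles can be sketched roughly as follows. This is an illustration under stated assumptions, not the EDM implementation: the feed does not say how contexts are represented or how similarity is scored, so contexts are modeled here as sets of string features, Jaccard similarity is swapped in as the matching rule, and `EpisodicMemory`, `store_failure`, `match`, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass, field


def jaccard(a, b):
    """Jaccard similarity between two collections of context features (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


@dataclass
class EpisodicMemory:
    """Hypothetical store of past failure contexts and the adaptation that helped."""
    threshold: float = 0.5
    failures: list = field(default_factory=list)

    def store_failure(self, context, adaptation):
        """Cycle 1: record the context in which a failure was observed."""
        self.failures.append((frozenset(context), adaptation))

    def match(self, context):
        """Cycle 2: return the adaptation of the most similar stored failure.

        Returns None when no stored context reaches the similarity threshold.
        """
        best, best_score = None, 0.0
        for stored, adaptation in self.failures:
            score = jaccard(stored, context)
            if score >= self.threshold and score > best_score:
                best, best_score = adaptation, score
        return best


memory = EpisodicMemory(threshold=0.5)

# Cycle 1: a failure is observed and its context is stored.
memory.store_failure({"tool:search", "api:timeout", "task:booking"}, "use_cached_results")

# Cycle 2: a new context shares two of four distinct features (similarity 0.5),
# so the stored adaptation is suggested before the step is attempted.
plan = memory.match({"tool:search", "api:timeout", "task:refund"})
print(plan)
```

A context with no overlapping features, such as `{"tool:calculator"}`, falls below the threshold and `match` returns `None`, meaning the agent proceeds without a preemptive change.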

0

13

HB-Eval System @hbEvalSystem

5 months ago

@shizhediao Thank you for your great work. I watched the presentation and it is amazing. Could you take a QUICK look at our papers in Flow? Links: https://t.co/3OuqYRG3xs https://t.co/8PzaSdJjCG https://t.co/wSU9ADlx1S https://t.co/rHOmgGoKR0

0

2

HB-Eval System @hbEvalSystem

5 months ago

@shizhediao Can you take a QUICK look at our papers in Flow? https://t.co/3OuqYRG3xs https://t.co/8PzaSdJjCG https://t.co/wSU9ADlx1S https://t.co/rHOmgGoKR0

0

16

HB-Eval System @hbEvalSystem

5 months ago

@dair_ai https://t.co/rYrSwOasnM

0

3

HB-Eval System @hbEvalSystem

6 months ago

Full text: https://t.co/mcWafkh9Ds #AI_Safety #FutureOfAI @Preprints_org and accounts interested in artificial intelligence, such as @IEEEorg

0

18

HB-Eval System @hbEvalSystem

6 months ago

The Results 📊 Our controlled simulations show that grounding explanations in certified performance history works:
✅ Calibrated Trust Score: 4.62/5.0
✅ Transparency Index: 0.91
✅ 51% reduction in human cognitive load compared to traditional baselines.
#AI_Safety #FutureOfAI

hbEvalSystem's tweet photo: https://t.co/RbcczeD07M

0

12

HB-Eval System @hbEvalSystem

6 months ago

The Problem: "Explainability." Current AI explanations (like Chain-of-Thought) are often "post-hoc linguistic rationalizations." They sound fluent but aren't always grounded in the agent's actual decision history. In high-stakes environments, we need evidence, not just narratives.

0

11

HB-Eval System @hbEvalSystem

6 months ago

🚀 I am thrilled to announce the publication of my latest research paper: "HCI-EDM: Performance-Grounded Interpretability." We are moving beyond "guessing" why AI acts to "proving" it through certified performance history. 🔗 Read the full paper here: https://t.co/rHOmgGod1s

hbEvalSystem's tweet photo: https://t.co/zkd0YdawW7

0

9

HB-Eval System @hbEvalSystem

6 months ago

@alshbabyt66607 https://t.co/3OuqYRFvHU https://t.co/8PzaSdILN8 https://t.co/wSU9ADkZck

0

3

HB-Eval System @hbEvalSystem

6 months ago

The Architecture
It’s not just a benchmark; it’s a 4-layer stack for trustworthy AI:
🔹 HB-Eval (Evaluation)
🔹 Adapt-Plan (Real-time Control)
🔹 EDM (Memory Governance)
🔹 HCI-EDM (Certified Interpretability)
A complete loop to prevent reliability #AgenticAI #LLMs #Reliability