AI agents are easy to demo. Production is a different problem.
Enter CUGA. 🦉
An open-source agent harness that lets you focus on building instead of plumbing: https://t.co/5jRzs87mOV
Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering tasks where frontier models score below 50%
ITBench-AA’s SRE tasks benchmark model performance on Kubernetes incident response, where models must diagnose live systems by reading logs, tracing dependencies, and identifying root-cause entities across complex infrastructure. The underlying ITBench dataset has been developed by @IBM's Software Innovation Lab, leveraging IBM’s deep expertise in enterprise IT operations
Artificial Analysis has worked closely with IBM over the last 6 months to develop an implementation of the dataset for frontier AI evaluation, beginning with Site Reliability Engineering (SRE) and expanding to Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks over time
ITBench-AA SRE overview:
➤ 59 SRE tasks in total: 40 public tasks and 19 brand new, held-out tasks
➤ Each task provides a Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology. The model must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident
➤ Faults span typical SRE failure modes including infrastructure, service, application, and chaos-injected incidents, such as resource quota exhaustion, rollout failures, connection pool exhaustion, and network partitions
Methodology details:
➤ Agentic harness: each task is solved by the model running in our open-source Stirrup reference harness, with shell access to a sandboxed file system containing the relevant logs and snapshots. 100-turn cap per task, 3 repeats per task
➤ Models submit a list of root-cause entities (Kubernetes Deployments, Services, Pods, etc.) they believe caused the incident. Each submission is compared against a ground-truth set of root causes provided by IBM Research
➤ Scoring uses average precision at full recall: if a model misses any of the ground-truth root causes, it scores 0.0 for that repeat. If it identifies all of them, it is awarded a score equal to its precision - the share of its submitted entities that are actual root causes, i.e. true positives / (true positives + false positives). The headline score is the average across 59 tasks × 3 repeats.
➤ The harness (Stirrup) is held constant across all evaluated models, allowing an apples-to-apples comparison between models.
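The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration based only on the description in this thread, not the actual ITBench-AA or Stirrup scoring code. The function names, the entity strings, and the choice to de-duplicate submissions by treating them as sets are all assumptions.

```python
from statistics import mean


def score_repeat(submitted, ground_truth):
    """Score one repeat: precision if every root cause was found, else 0.0.

    Assumption: entities are compared as exact string identifiers and
    duplicate submissions are collapsed (both lists are treated as sets).
    """
    submitted = set(submitted)
    ground_truth = set(ground_truth)
    # Full recall is required: any missed ground-truth entity scores 0.0.
    if not submitted or not ground_truth <= submitted:
        return 0.0
    true_positives = len(submitted & ground_truth)
    # Precision = true positives / (true positives + false positives).
    return true_positives / len(submitted)


def headline_score(runs):
    """Average the per-repeat scores over all (submitted, ground_truth) pairs.

    In the benchmark as described, this would be 59 tasks x 3 repeats.
    """
    return mean(score_repeat(submitted, truth) for submitted, truth in runs)


# Hypothetical entity names, for illustration only.
truth = ["Deployment/checkout"]
runs = [
    (["Deployment/checkout"], truth),                 # exact match
    (["Deployment/checkout", "Service/cart"], truth), # one false positive
    (["Service/cart"], truth),                        # root cause missed
]
print([score_repeat(s, t) for s, t in runs])
print(headline_score(runs))
```

Under this rule an extra guess lowers the score but a missed root cause zeroes it, so submitting many entities trades precision for a better chance of reaching full recall.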
Key findings:
➤ Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%
➤ All frontier models score below 50%, making ITBench-AA SRE one of the least saturated agentic benchmarks in our suite. For context, frontier models score considerably higher on Terminal-Bench
➤ Turn counts vary nearly 3x and longer trajectories do not translate to higher accuracy. GPT-5.5 (xhigh) averages 31 turns per task at 46%, while Gemini 3.1 Pro Preview averages 83 turns at 30%. Models that over-investigate tend to surface upstream fault-injection mechanisms or co-occurring symptoms as false positives
➤ GLM-5.1 (Reasoning) leads open weights models at 40%, effectively tied with Gemini 3.5 Flash (high). DeepSeek V4 Pro (Reasoning, Max Effort) follows at 38%, with Gemma 4 31B (Reasoning) at 37%, ahead of Gemini 3.1 Pro Preview at 30%
@OrlvndoA We have to start dismantling the infrastructure of Chavismo: bring in a fleet of planes and send all government personnel of Cuban origin back to Cuba, and all the Hezbollah people back to Iran. The nice way first.
@bcherny I assume your tabs are related and that you are probably building improvements for Claude Code. Do you have a way to synchronize work, evaluate improvements from a new capability, and generate tests? How does that loop work?
IBM dropped CUGA, open-source enterprise agent to automate boring tasks 🔥
> given workspace files, it writes and executes code to accomplish any task 🤯
> comes with a ton of tools built for enterprise tasks, supports MCPs
> plug in your favorite LLM 👏
here's a small demo where it retrieves info from a file, calculates revenue by writing code, and drafts an e-mail 🤯
they released code, a blog and a demo 🙌🏻 you can run this locally
It is official: the Venezuelan electoral authority fraudulently announced Maduro’s victory. This is a sham. Proving fraud will be a cakewalk if and when the results are published. This crime cannot be allowed to stand.
@PLLChaos @JNEU_88 Are we all aware of the physics of the game? A larger stick gives you an edge that easily accounts for the 5-10 mph difference. Congrats though! 🎉🥍
@anothercohen In Florida, HOAs send violation letters for insufficient mulch, a dirty roof, empty garbage bins left out overnight, non-blooming plants, and the list goes on…