Ambrish Rawat

@iambrishing

AI @IBMResearch · Previously @CambridgeMLG, @iitdelhi

New York

Joined August 2013

511 Following

137 Followers

119 Posts

iambrishing retweeted

The Hindu

@the_hindu

23 days ago

The Nexbax AI Index proposes new AI evaluation metrics focused on real-world usability, cost, and accessibility for users in India and the Global South, challenging standard global benchmarks. ✍️Rashmi Patil https://t.co/q1EvuHQDNK

iambrishing retweeted

Artificial Analysis

@ArtificialAnlys

27 days ago

Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering tasks where frontier models score below 50% ITBench-AA’s SRE tasks benchmark model performance on Kubernetes incident response, where models must diagnose live systems by reading logs, tracing dependencies, and identifying root-cause entities across complex infrastructure. The underlying ITBench dataset has been developed by @IBM's Software Innovation Lab, leveraging IBM’s deep expertise in enterprise IT operations Artificial Analysis has worked closely with IBM over the last 6 months to develop a implementation of the dataset for frontier AI evaluation, beginning with Site Reliability Engineering (SRE) and expanding to Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks over time ITBench-AA SRE overview: ➤ 59 SRE tasks in total: 40 public tasks and 19 brand new, held-out tasks ➤ Each task provides a Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology. The model must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident ➤ Faults span typical SRE failure modes including infrastructure, service, application, and chaos-injected incidents, such as resource quota exhaustion, rollout failures, connection pool exhaustion, and network partitions Methodology details: ➤ Agentic harness: each task is solved by the model running in our open-source Stirrup reference harness, with shell access to a sandboxed file system containing the relevant logs and snapshots. 100-turn cap per task, 3 repeats per task ➤ Models submit a list of root-cause entities (Kubernetes Deployments, Services, Pods, etc.) they believe caused the incident. Each submission is compared against a ground-truth set of root causes provided by IBM Research ➤ Scoring uses average precision at full recall: if a model misses any of the ground-truth root causes, it scores 0.0 for that repeat. If it identifies all of them, it is awarded a score equal to its precision - the share of its submitted entities that are actual root causes, i.e. true positives / (true positives + false positives). The headline score is the average across 59 tasks × 3 repeats. ➤ The harness (Stirrup) is held constant across all evaluated models, allowing an apples-to-apples comparison between models. Key findings: ➤ Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42% ➤ All frontier models score below 50%, making ITBench-AA SRE one of the least saturated agentic benchmarks in our suite. For context, frontier models score considerably higher on Terminal-Bench ➤ Turn counts vary nearly 3x and longer trajectories do not translate to higher accuracy. GPT-5.5 (xhigh) averages 31 turns per task at 46%, while Gemini 3.1 Pro Preview averages 83 turns at 30%. Models that over-investigate tend to surface upstream fault-injection mechanisms or co-occurring symptoms as false positives ➤ GLM-5.1 (Reasoning) leads open weights models at 40%, effectively tied with Gemini 3.5 Flash (high). DeepSeek V4 Pro (Reasoning, Max Effort) follows at 38%, with Gemma 4 31B (Reasoning) at 37%, ahead of Gemini 3.1 Pro Preview at 30%

557

164

204K

iambrishing retweeted

Nandan Nilekani

@NandanNilekani

about 2 months ago

India doesn't need to lead the world in building the most advanced AI models. But it must lead in ensuring benefits of AI are widely shared. @rvenk and I have an op-ed in The @EconomicTimes https://t.co/yuzgkRXXWf

792

143

505

iambrishing retweeted

IBM Developer @IBMDeveloper

about 2 months ago

🧵Building AI apps rarely means using just one model. Granite 4.1 brings language, vision, speech, and guardrails together—so you can build real workflows, not just demos. What devs should know.👇

14K

Who to follow

David Moinina Sengeh

@dsengeh

Chief Minister-Gov't of Sierra Leone; Former Edu Minister. Author of RADICAL INCLUSION (May 2023), order here https://t.co/ncWip0J2fL ❤️

Lee Chagolla-Christensen

@tifkin_

I like making computers misbehave. Does stuff at https://t.co/YsrVyTjOY7. https://t.co/UsRIholZ3M

Journal of Management

@Journal_of_Mgmt

Publishing impactful research on management, business strategy & policy, entrepreneurship, and more, since 1975. IF=13.508

iambrishing retweeted

Keshav Ramji

@KeshavRamji

about 2 months ago

What if your language model could reason efficiently in an entirely new language? We introduce Abstract Chain-of-Thought, a new mechanism which allows language models to reason through a short sequence of reserved "abstract" tokens through reinforcement learning. It is as performant as verbalized CoT at a fraction of the cost, achieving major gains in inference-time efficiency.

$KeshavRamji's tweet photo. What if your language model could reason efficiently in an entirely new language? We introduce Abstract Chain-of-Thought, a new mechanism which allows language models to reason through a short sequence of reserved "abstract" tokens through reinforcement learning. It is as performant as verbalized CoT at a fraction of the cost, achieving major gains in inference-time efficiency.$

132

891

iambrishing retweeted

Kush Varshney कुश वार्ष्णेय @krvarshney

11 months ago

Check out IBM's latest open source tools for trustworthy AI on GitHub: In-Context Explainability 360 FactReasoner Contextual Privacy Links from here: https://t.co/IVU1rFOIit .

404

iambrishing retweeted

BCCI

@BCCI

11 months ago

A thrilling end to a captivating series 🙌 #TeamIndia win the 5th and Final Test by 6 runs Scorecard ▶️ https://t.co/Tc2xpWNayE #ENGvIND

558

93K

iambrishing retweeted

Fan Account Richard Kettlebourogh

@RichKettle07

11 months ago

This Test Series needs a Documentary🙌🏻 - ENG won the 1st Test quite comfortably - Then IND won the 2nd test quite comfortably - At Lord's ENG win by barest of margin - At Manchester IND draws after trailing by 311 runs - At Oval, India win by 6 runs

18K

405

815K

iambrishing retweeted

Simon Willison

@simonw

about 3 years ago

The hardest problem in computer science is convincing AI enthusiasts that they can't solve prompt injection vulnerabilities using more AI

427

113K

iambrishing retweeted

armand

@armand_ruiz

over 1 year ago

5/ Granite Guardian 3.2: Slimmer, faster safety models. New 5B (pruned from 8B) & 3B-A800M MoE options cut costs, keep performance. Plus, “verbalized confidence” adds nuance to risk detection—High or Low certainty, not just Yes/No.

armand_ruiz's tweet photo. 5/ Granite Guardian 3.2: Slimmer, faster safety models. New 5B (pruned from 8B) & 3B-A800M MoE options cut costs, keep performance.

Plus, “verbalized confidence” adds nuance to risk detection—High or Low certainty, not just Yes/No. https://t.co/F8SI3yOE4P

368

iambrishing retweeted

BCCI

@BCCI

over 1 year ago

🗣️ "I've had a lot of fun and created a lot of memories." All-rounder R Ashwin reflects after bringing the curtain down on a glorious career 👌👌 #TeamIndia | #ThankYouAshwin | @ashwinravi99

123

16K

248

228K

iambrishing retweeted

Harsha Bhogle

@bhogleharsha

over 1 year ago

This is so beautiful and moving. @DGukesh. #WorldChampion

14K

631

214

239K

iambrishing retweeted

Gukesh D

@DGukesh

over 1 year ago

18th @ 18!

213K

18K

iambrishing retweeted

Kush Varshney कुश वार्ष्णेय @krvarshney

over 1 year ago

Nice video about Granite Guardian! https://t.co/vkuJYIz3AA

422

iambrishing retweeted

elvis

@omarsar0

over 1 year ago

IBM open-sources Granite Guardian, a suite of safeguards for risk detection in LLMs. The authors claim that "With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space." https://t.co/WOHdeKIB01

omarsar0's tweet photo. IBM open-sources Granite Guardian, a suite of safeguards for risk detection in LLMs.

The authors claim that "With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space."

https://t.co/WOHdeKIB01

105

15K

iambrishing retweeted

Revanth Atmakuri

@RevanthAtmakuri

over 1 year ago

Introducing Granite Guardian: your AI bouncer 🕺💂‍♂️! It tackles social bias, profanity, and even jailbreaks! With top AUC scores, it's the coolest cop on the block. Open-source fun for all! - Grok 2

376

iambrishing retweeted

Prasanna Sattigeri

@prasatti

over 1 year ago

📢 Exciting #NeurIPS2024 highlights 🧵 First up, 🛡️ Granite Guardian technical report is out. Check out the demo at the NeurIPS IBM booth with Werner Geyer. Huggingface paper page: https://t.co/As1LLYhBwo

835

iambrishing retweeted

Kush Varshney कुश वार्ष्णेय @krvarshney

over 1 year ago

Look at those beautiful Granite Guardian safety vests! #brand #bootleg The Granite Guardian technical report is now on arXiv: https://t.co/2cgJrzCXLA Give it a read to see how the model is state-of-the-art in detecting harmful or hallucinated prompts and responses.

579

Ambrish Rawat @iambrishing

over 1 year ago

Check out the latest Attack Atlas taxonomy at the @NeurIPSConf red teaming workshop! It maps out a framework for thinking through single-turn prompt attacks in Gen AI https://t.co/RR9E4EhUwB #AIsafety #RedTeam #AIsecurity @GiandomenicoC17 @pinyuchenTW @prasatti @krvarshney

Bryan Casey

@bryanfcasey

over 1 year ago

@krvarshney is here to help us understand what's the right way to approach managing risk of a nearly infinite attack surface?

851

619

iambrishing retweeted

Bryan Casey

@bryanfcasey

over 1 year ago

The third generation of IBM Granite also introduces a new family of LLM-based guardrail models. Granite Guardian 3.0 8B and Granite Guardian 3.0 2B can be used to monitor and manage inputs and outputs to any LLM, whether open or proprietary. Across areas like jailbreaking, bias, violence, profanity, sexual content and unethical behavior, these models also demonstrate leading performance.

bryanfcasey's tweet photo. The third generation of IBM Granite also introduces a new family of LLM-based guardrail models.

Granite Guardian 3.0 8B and Granite Guardian 3.0 2B can be used to monitor and manage inputs and outputs to any LLM, whether open or proprietary.

Across areas like jailbreaking, bias, violence, profanity, sexual content and unethical behavior, these models also demonstrate leading performance.

461

Ambrish Rawat

@iambrishing

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users