@bezalelsm אף אחד לא הקים ממשלה עם חמאס. רע״מ היא לא חמאס. רע״מ היא לא ארגון טרור. מנסור עבאס היה ראש המפלגה הערבית הראשון שהכיר בישראל כמדינה יהודית. נתניהו (״שקרן בן שקרן״ לדבריך) דווקא מאוד רצה להקים ממשלה עם רע״מ. אתה אמרת שחמאס הוא נכס. בקרוב תעוף הביתה עם כל ממשלת הטבח. גזען שקרן
🚨 Benchmarks tell us which model is better — but not why it fails.
For developers, this means tedious, manual error analysis. We're bridging that gap.
Meet CLEAR: an open-source tool for actionable error analysis of LLMs.
🧵👇
Evaluating LLM-based Agents
This report has a comprehensive list of methods for evaluating AI Agents.
Don't ignore evals. If done right, they are a game-changer.
Highly recommend it to AI devs. (bookmark it)
🔔 New Paper!
We propose a challenging new benchmark for LLM judges: Evaluating debate speeches.
Are they comparable to humans? Well... it’s debatable. 🤔
https://t.co/u0sd8SrGjj
👇 Here are our findings:
Interested in Agent Evaluation? 🤖
We’re excited to launch our new repo:
“Evaluation of LLM-based Agents: A Reading List” 📚
Browse benchmarks, methods, and frameworks from our recent survey.
👉 Explore & Contribute: https://t.co/nGnL03xCXm
#LLMAgents#AgentEvaluation
Survey on Evaluation of LLM-based Agents 🤖
Our paper is the first to provide a comprehensive overview of LLM-based agent evaluation 📜
Paper: https://t.co/43ByXGXLkQ
New preprint! ✨
Interested in LLM-as-a-Judge?
Want to get the best judge for ranking your system?
our new work is just for you:
"JuStRank: Benchmarking LLM Judges for System Ranking"
🕺💃
https://t.co/7FPgj8FWKh
Say I want to compare system qualities - pick between 2 configurations, or rank a whole bunch of models. I'll use LLM-as-a-judge, right? 🧑🏻⚖️
But how do I know the LLM judge is up to the task? Who is a good judge for ranking systems?
Enter our new paper!✨🧵
https://t.co/RJajBQtdUn
We are excited to announce that the Argument Mining workshop will take place at #ACL2024 in Bangkok, Thailand. For more info see our website at https://t.co/etIRMVxu9h
We are happy to announce two shared tasks for ArgMining 2024: 1) Perspective Argument Retrieval organized by Neele Falk and Andreas Waldis. 2) DialAM-2024 organized by Ramon Ruiz-Dolz, John Lawrence, Ella Schad, and Chris Reed.
Curious to see how can we summarize opinions beyond plain text summaries?
Check out our #ACL2023 paper: From Key Points to Key Point Hierarchy: Structured and Expressive Opinion Summarization
with Lilach Eden, @yoavkantor@RoyBarHaim from @IBMResearch@IBM@biunlp
>>
Want to build a text classifier in a few hours?
Even if you don’t have any:
labeled data
#machineLearning knowledge
programing skills
Label Sleuth https://t.co/ViNd3sQNkT a new open-source no-code system for annotations 🧵 @IBMResearch@NotreDame@StanfordHCI UT Dallas #NLProc
Welcome PrimeQA at #NAACL2022! Replicate the state-of-the-art on multilingual open QA quickly! Here’s a new open-source repo in collab with with @stanfordnlp, @huggingface, @Uni_Stuttgart @NLPIllinois1. Link: https://t.co/R6GbTT3mKq Talk to me or read:
https://t.co/GblOy35JGK 🧵
NAACL 2022 is starting on Sunday! Visit our website https://t.co/7pAzKxHmmZ to learn about the exciting NLP work from IBM Research that will be presented at this conference. @IBMResearch@naaclmeeting#NAACL2022
NAACL'21 main conference is starting today!
Meet our researchers and recruiting team at the @IBMResearch virtual booth: https://t.co/WL1P01vn4v, and learn more about IBM Research's presence at @NAACLHLT, careers and booth schedule at https://t.co/ZR8HW1yaZJ