Topics, cont.: Improving human and automated annotation methods for use in evaluation, including LLM-as-a-judge; generative AI user simulations and synthetic data generation; linguistic models of conversational organization
Come intern with us! The Sociotechnical Alignment Center (STAC) in Microsoft Research NYC is hiring PhD interns for summer 2026. Looking for candidates interested in all aspects of GenAI evaluation. Apply here by Dec 18! https://t.co/HKoIn6dGkJ
Topics of interest: measurement theory from the social sciences; reliability of generative AI systems and generative AI system evaluation; validity of generative AI system evaluation; automated methods for constructing conceptual definitions for evaluation; (cont.)
Remember this @NeurIPSConf workshop paper? We spent the past month writing a newer, better, longer version!!! You can find it online here: https://t.co/JHhOzhTZ3t
Candidates should have research interest and experience in designing ML and/or AI evaluations. Position is in New York City. Apply here: https://t.co/IGTC6dNbD3
Looking for an intern interested in GenAI evaluation design to join us at the Microsoft Research Socio-Technical Alignment Center for spring or summer 2025. We're looking to expand on the research Hanna mentions here:
Dimensions of Generative AI Evaluation Design: https://t.co/9G3Qt7xKaG
TL;DR: We propose a set of dimensions that capture critical choices involved in GenAI evaluation design in order to guide decision-making and provide a structure for comparing different evaluations.
A barrier to faster progress in generative AI is evaluations (evals), particularly of custom AI applications that generate free-form text. Let’s say you have a multi-agent research system that includes a researcher agent and a writer agent. Would adding a fact-checking agent improve the results? If we can’t efficiently evaluate the impact of such changes, it’s hard to know which changes to keep.
For evaluating general-purpose foundation models such as large language models (LLMs), which are trained to respond to a large variety of prompts, we have standardized tests like MMLU (multiple-choice questions that cover 57 disciplines like math, philosophy, and medicine) and HumanEval (testing code generation); the LMSYS Chatbot Arena, which pits two LLMs’ responses against each other and asks a human to judge which response is superior; and large-scale benchmarking like HELM. These evaluation tools took considerable effort to build, and they are invaluable for giving LLM users a sense of different models' relative performance. Nonetheless, they have limitations: For example, leakage of benchmark datasets’ questions and answers into training data is a constant worry, and human preference for certain answers does not mean those answers are more accurate.
In contrast, our current options for evaluating specific applications built using LLMs are far more limited. Here, I see two major types of applications.
- For applications designed to deliver unambiguous, right-or-wrong responses, we have reasonable options. Let’s say we want an LLM to read a resume and extract the candidate's most recent job title, or read a customer email and route it to the right department. We can create a test set that comprises ground-truth labeled examples with the right responses, and measure the percentage of times the LLM generates the right output. The main bottleneck is creating the labeled test set, which is expensive but surmountable.
- But many LLM-based applications generate free-text output with no single right response. For example, if we ask an LLM to summarize customer emails, there’s a multitude of possible good (and bad) responses. The same holds for an agentic system that does web research and writes an article about a topic, or a RAG system for answering questions. It’s impractical to hire an army of human experts to read the LLM’s outputs every time we tweak the algorithm and evaluate whether the answers have improved, so we need an automated way to test the outputs. Thus, many teams use an advanced language model to evaluate outputs. In the customer email summarization example, we might design an evaluation rubric (scoring criteria) for what makes a good summary. Given an email summary generated by our system, we might prompt an advanced LLM to read it and score it according to our rubric. I’ve found that the results of such a procedure, while better than nothing, can also be noisy, sometimes too noisy to reliably tell me if the way I’ve tweaked an algorithm is good or bad.
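The two kinds of evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: `exact_match_accuracy` covers the right-or-wrong case, and `repeated_judge_score` shows one way to quantify judge noise by scoring the same output several times. `stub_judge` is a hypothetical stand-in with made-up scoring logic; in practice it would be replaced by a call to an actual LLM with a rubric prompt.

```python
import random
import statistics


def exact_match_accuracy(predictions, labels):
    """Fraction of predictions that equal the ground-truth label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def stub_judge(summary, rng):
    """Hypothetical judge: returns a noisy 1-5 rubric score.

    Assumption for illustration only: short summaries get a higher base
    score, and the judge's answer wobbles by +/- 1 from call to call.
    """
    base = 4 if len(summary.split()) <= 30 else 2
    return min(5, max(1, base + rng.choice([-1, 0, 1])))


def repeated_judge_score(summary, judge, n_trials=5, seed=0):
    """Score one output n_trials times; return (mean, sample std dev)."""
    rng = random.Random(seed)
    scores = [judge(summary, rng) for _ in range(n_trials)]
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Right-or-wrong task: routing emails to departments.
    preds = ["billing", "support", "sales", "support"]
    gold = ["billing", "support", "billing", "support"]
    print("accuracy:", exact_match_accuracy(preds, gold))

    # Free-text task: the spread across trials hints at how noisy the judge is.
    mean, spread = repeated_judge_score("Customer asks for a refund.", stub_judge)
    print("judge mean:", mean, "std dev:", spread)
```

If the spread across repeated trials is comparable to the score difference between two versions of a system, the judge is probably too noisy to tell them apart without more test examples or more trials.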
The cost of running evals poses an additional challenge. Let’s say you’re using an LLM that costs $10 per million input tokens, and a typical query has 1000 tokens. Each user query therefore costs only $0.01. However, if you iteratively work to improve your algorithm based on 1000 test examples, and if in a single day you evaluate 20 ideas, then your cost will be 20 × 1000 × $0.01 = $200. For many projects I’ve worked on, the development costs were fairly negligible until we started doing evals, whereupon the costs suddenly increased. (If the product turned out to be successful, then costs increased even more at deployment, but that was something we were happy to see!)
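The arithmetic above can be written as a small helper so the numbers are easy to vary. This is a sketch under the same simplifying assumption as the text (only input tokens are priced); the function name and parameters are illustrative.

```python
def eval_cost_usd(price_per_million_tokens, tokens_per_query,
                  num_examples, num_ideas):
    """Dollar cost of evaluating num_ideas variants on num_examples queries.

    Assumes cost is driven only by input tokens, as in the example above.
    """
    cost_per_query = price_per_million_tokens * tokens_per_query / 1_000_000
    return cost_per_query * num_examples * num_ideas


if __name__ == "__main__":
    per_query = eval_cost_usd(10.0, 1000, num_examples=1, num_ideas=1)
    per_day = eval_cost_usd(10.0, 1000, num_examples=1000, num_ideas=20)
    print(f"per query: ${per_query:.2f}")  # $0.01
    print(f"per day:   ${per_day:.2f}")    # $200.00
```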
In addition to the dollar cost, evals also have a significant time cost. Running evals on 1000 examples might take tens of minutes or even hours. Time spent waiting for eval jobs to finish also reduces the speed at which we can experiment and iterate over new ideas. Previously I wrote that fast, inexpensive token generation is critical for agentic workflows. This will also be useful for evals, which involve nested for-loops that iterate over a test set and different model/hyperparameter/prompt choices and therefore consume large numbers of tokens.
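To make the nested-loop point concrete, here is a rough sketch that counts calls, tokens, and sequential wall-clock time for a sweep over model and prompt choices. The model names, prompt names, and the 1.5 seconds-per-call figure are placeholder assumptions, not measurements.

```python
from itertools import product


def sweep_budget(models, prompts, num_examples, tokens_per_example,
                 seconds_per_call):
    """Estimate the size of an eval sweep run one call at a time.

    Every (model, prompt) configuration is run over the whole test set,
    which is the nested for-loop structure described above.
    """
    configs = list(product(models, prompts))
    calls = len(configs) * num_examples
    return {
        "configs": len(configs),
        "calls": calls,
        "tokens": calls * tokens_per_example,
        "sequential_minutes": calls * seconds_per_call / 60,
    }


if __name__ == "__main__":
    budget = sweep_budget(
        models=["model-a", "model-b"],        # hypothetical names
        prompts=["prompt-v1", "prompt-v2", "prompt-v3"],
        num_examples=1000,
        tokens_per_example=1000,
        seconds_per_call=1.5,                 # assumed latency
    )
    for key, value in budget.items():
        print(f"{key}: {value}")
```

Even this small sweep multiplies a 1000-example test set into thousands of calls, which is why faster and cheaper token generation shortens the iteration loop.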
Despite the limitations of today's eval methodologies, I’m optimistic that our community will invent better techniques (maybe involving agentic workflows like reflection?) for getting LLMs to evaluate such output.
If you’re a developer or researcher and have ideas along these lines, I hope you’ll keep working on them and consider open sourcing or publishing your findings!
[Original text: https://t.co/HXtzJH7eP8 ]
People across North America witnessed a solar eclipse on Monday that reminded all in its path of our planet’s place in the cosmos. Here’s what the moon’s shadow looked like from space as it moved across the continent. See more updates: https://t.co/7fLFzH5Se2
Life update! We got a dog!!! Also, I got a new job. Next week, I'll be an applied scientist at Microsoft working on responsible AI, evaluation, and what the cool kids are calling Sociotechnical Alignment 😎 with @hannawallach and others. Oh, and the dog's name is Marlon.