Alex Dow

@diojenez

Dodger fan. Works at the Microsoft Research Socio-Technical Alignment Center. That's STAC. Not SAC.

San Diego, CA

Joined June 2009

534 Following

372 Followers

570 Posts

Alex Dow @diojenez

7 months ago

Topics, cont.: Improving human and automated annotation methods for use in evaluation, including LLM-as-a-judge; generative AI user simulations and synthetic data generation; linguistic models of conversational organization

Alex Dow @diojenez

7 months ago

Come intern with us! The Sociotechnical Alignment Center (STAC) in Microsoft Research NYC is hiring PhD interns for summer 2026. Looking for candidates interested in all aspects of GenAI evaluation. Apply here by Dec 18! https://t.co/HKoIn6dGkJ

313

Alex Dow @diojenez

7 months ago

Topics of interest: measurement theory from the social sciences; reliability of generative AI systems and generative AI system evaluation; validity of generative AI system evaluation; automated methods for constructing conceptual definitions for evaluation; (cont.)

Alex Dow @diojenez

over 1 year ago

We're taking a position on GenAI evaluation: it's actually a social science measurement challenge!

Hanna Wallach (@hannawallach.bsky.social) @hannawallach

over 1 year ago

Remember this @NeurIPSConf workshop paper? We spent the past month writing a newer, better, longer version!!! You can find it online here: https://t.co/JHhOzhTZ3t

170

Alex Dow @diojenez

over 1 year ago

Candidates should have research interest and experience in designing ML and/or AI evaluations. Position is in New York City. Apply here: https://t.co/IGTC6dNbD3

371

Alex Dow @diojenez

over 1 year ago

Looking for an intern interested in GenAI evaluation design to join us at the Microsoft Research Socio-Technical Alignment Center for spring or summer 2025. We're looking to expand on the research Hanna mentions here:

Hanna Wallach (@hannawallach.bsky.social) @hannawallach

over 1 year ago

Dimensions of Generative AI Evaluation Design: https://t.co/9G3Qt7xKaG TL;DR: We propose a set of dimensions that capture critical choices involved in GenAI evaluation design in order to guide decision-making and provide a structure for comparing different evaluations.

10K

diojenez retweeted

FOX Sports: MLB

@MLBONFOX

over 1 year ago

"WHO WANTS A PARADE!?!?"

303

93K

diojenez retweeted

Nicholas A. Christakis

@NAChristakis

almost 2 years ago

“Where in the world are we going to find these angels who will organize society for us?”

447

100

36K

diojenez retweeted

Andrew Ng

@AndrewYNg

about 2 years ago

A barrier to faster progress in generative AI is evaluations (evals), particularly of custom AI applications that generate free-form text. Let’s say you have a multi-agent research system that includes a researcher agent and a writer agent. Would adding a fact-checking agent improve the results? If we can’t efficiently evaluate the impact of such changes, it’s hard to know which changes to keep.

For evaluating general-purpose foundation models such as large language models (LLMs), which are trained to respond to a large variety of prompts, we have standardized tests like MMLU (multiple-choice questions that cover 57 disciplines like math, philosophy, and medicine) and HumanEval (testing code generation); the LMSYS Chatbot arena, which pits two LLMs’ responses against each other and asks a human to judge which response is superior; and large-scale benchmarking like HELM. These evaluation tools took considerable effort to build, and they are invaluable for giving LLM users a sense of different models' relative performance. Nonetheless, they have limitations: For example, leakage of benchmark datasets’ questions and answers into training data is a constant worry, and human preference for certain answers does not mean those answers are more accurate.

In contrast, our current options for evaluating specific applications built using LLMs are far more limited. Here, I see two major types of applications.

- For applications designed to deliver unambiguous, right-or-wrong responses, we have reasonable options. Let’s say we want an LLM to read a resume and extract the candidate's most recent job title, or read a customer email and route it to the right department. We can create a test set that comprises ground-truth labeled examples with the right responses, and measure the percentage of times the LLM generates the right output. The main bottleneck is creating the labeled test set, which is expensive but surmountable.
- But many LLM-based applications generate free-text output with no single right response. For example, if we ask an LLM to summarize customer emails, there’s a multitude of possible good (and bad) responses. The same holds for an agentic system to do web research and write an article about a topic, or a RAG system for answering questions. It’s impractical to hire an army of human experts to read the LLM’s outputs every time we tweak the algorithm and evaluate if the answers have improved; we need an automated way to test the outputs. Thus, many teams use an advanced language model to evaluate outputs. In the customer email summarization example, we might design an evaluation rubric (scoring criteria) for what makes a good summary. Given an email summary generated by our system, we might prompt an advanced LLM to read it and score it according to our rubric. I’ve found that the results of such a procedure, while better than nothing, can also be noisy, sometimes too noisy to reliably tell me if the way I’ve tweaked an algorithm is good or bad.

The cost of running evals poses an additional challenge. Let’s say you’re using an LLM that costs $10 per million input tokens, and a typical query has 1000 tokens. Each user query therefore costs only $0.01. However, if you iteratively work to improve your algorithm based on 1000 test examples, and if in a single day you evaluate 20 ideas, then your cost will be 20*1000*0.01 = $200. For many projects I’ve worked on, the development costs were fairly negligible until we started doing evals, whereupon the costs suddenly increased. (If the product turned out to be successful, then costs increased even more at deployment, but that was something we were happy to see!)

In addition to the dollar cost, evals also have a significant time cost. Running evals on 1000 examples might take tens of minutes or even hours. Time spent waiting for eval jobs to finish also slows down the speed with which we can experiment and iterate over new ideas.

Previously I wrote that fast, inexpensive token generation is critical for agentic workflows. This will also be useful for evals, which involve nested for-loops that iterate over a test set and different model/hyperparameter/prompt choices and therefore consume large numbers of tokens.

Despite the limitations of today's eval methodologies, I’m optimistic that our community will invent better techniques (maybe involving agentic workflows like reflection?) for getting LLMs to evaluate such output. If you’re a developer or researcher and have ideas along these lines, I hope you’ll keep working on them and consider open sourcing or publishing your findings! [Original text: https://t.co/HXtzJH7eP8 ]
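
Two calculations from the post above can be sketched in a few lines of Python: the percentage of right outputs on a labeled test set, and the eval cost arithmetic ($10 per million input tokens, 1000 tokens per query, 1000 test examples, 20 ideas per day). This is only an illustrative sketch, not anything from the original post: the function names `exact_match_accuracy`, `cost_per_query`, and `daily_eval_cost` are hypothetical, and the cost model assumes only input tokens are billed, as in the post's example.

```python
def exact_match_accuracy(predictions, labels):
    """Fraction of predictions that exactly equal the ground-truth label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("test set must not be empty")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def cost_per_query(price_per_million_tokens, tokens_per_query):
    """Dollar cost of one query, assuming only input tokens are billed."""
    return price_per_million_tokens * tokens_per_query / 1_000_000


def daily_eval_cost(price_per_million_tokens, tokens_per_query,
                    num_examples, ideas_per_day):
    """Dollar cost of running the full test set once per idea in a day."""
    # Total tokens consumed by the nested loop: ideas x examples x tokens.
    total_tokens = tokens_per_query * num_examples * ideas_per_day
    return price_per_million_tokens * total_tokens / 1_000_000


if __name__ == "__main__":
    # Hypothetical routing labels, purely for illustration.
    labels = ["billing", "support", "sales", "support"]
    predictions = ["billing", "support", "support", "support"]
    print("accuracy:", exact_match_accuracy(predictions, labels))

    # Figures taken from the post: $10 / 1M tokens, 1000 tokens per query,
    # 1000 test examples, 20 ideas evaluated in a day.
    print("per query:", cost_per_query(10, 1000))
    print("per day:", daily_eval_cost(10, 1000, 1000, 20))
```

With the post's figures, the per-day result reproduces the 20*1000*0.01 calculation. Adding output tokens or judge-model calls would require extending the cost model, which the post does not specify.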

872

156

589

187K

Alex Dow @diojenez

about 2 years ago

Cool

The New York Times

@nytimes

about 2 years ago

People across North America witnessed a solar eclipse on Monday that reminded all in its path of our planet’s place in the cosmos. Here’s what the moon’s shadow looked like from space as it moved across the continent. See more updates: https://t.co/7fLFzH5Se2

324

110

137K

110

Alex Dow @diojenez

about 2 years ago

@EugeniaGiraudy @carlosdiuk I think so!

Alex Dow @diojenez

about 2 years ago

That time Daniel Kahneman beat me at rock, paper, scissors. (With @EugeniaGiraudy and @carlosdiuk looking on.)

750

Alex Dow @diojenez

about 2 years ago

@grinbergnir @hannawallach Thanks! It would be great to get back to ICWSM!

Alex Dow @diojenez

about 2 years ago

Life update! We got a dog!!! Also, I got a new job. Next week, I'll be an applied scientist at Microsoft working on responsible AI, evaluation, and what the cool kids are calling Sociotechnical Alignment 😎 with @hannawallach and others. Oh, and the dog's name is Marlon.