Nari Johnson

@narijohnson

PhD student @mldCMU @scsatcmu. AI + HCI. she/her

Joined March 2020

696 Following

726 Followers

365 Posts

narijohnson retweeted

Cas (Stephen Casper)

@StephenLCasper

8 days ago

Just finished my PhD at @MITCSAIL. In July, I'll start as an assistant professor at the @Harvard @Kennedy_School. I have lots to learn and lots to do. With others (some TBA 👀) at HKS, I'm looking forward to helping academia offer guidance for governing the next chapters of AI.

StephenLCasper's tweet photo. Just finished my PhD at @MITCSAIL. In July, I'll start as an assistant professor at the @Harvard @Kennedy_School. I have lots to learn and lots to do.

With others (some TBA 👀) at HKS, I'm looking forward to helping academia offer guidance for governing the next chapters of AI. https://t.co/UpSU509i5r

806

51K

narijohnson retweeted

Cassidy Laidlaw

@cassidy_laidlaw

13 days ago

We've seen AI models deceive, gaslight, and drive users to psychosis—safety issues that labs didn't anticipate until they caused real harm. We built the first benchmark of these unknown unknown alignment failures and found that OOD detection can help prevent them. 🧵

cassidy_laidlaw's tweet photo. We've seen AI models deceive, gaslight, and drive users to psychosis—safety issues that labs didn't anticipate until they caused real harm. We built the first benchmark of these unknown unknown alignment failures and found that OOD detection can help prevent them. 🧵 https://t.co/3wDFdXlIue

13K

narijohnson retweeted

Cas (Stephen Casper)

@StephenLCasper

11 days ago

Apply by June 7 to work with MATS in the Fall. In the MATS stream that I coordinate, we will work on impact-oriented technical AI governance research, potentially including research on open-weight models, AI safeguards, AI incidents, technically rigorous AI policy, etc.

117

narijohnson retweeted

Naveen Raman @NaveenJRaman

12 days ago

A recent study found an LLM scored 95% on a healthcare benchmark. Deployed with real patients, it dropped to 34%. In our new work, we argue the problem isn't the benchmark, but the implicit assumptions buried in evaluation. Paper: https://t.co/mi445QtJvM 🧵 1/n

NaveenJRaman's tweet photo. A recent study found an LLM scored 95% on a healthcare benchmark. Deployed with real patients, it dropped to 34%.

In our new work, we argue the problem isn't the benchmark, but the implicit assumptions buried in evaluation.

Paper: https://t.co/mi445QtJvM

🧵 1/n https://t.co/96ZbfotWfd

Who to follow

Valerie Chen

@valeriechen_

on the academic job market | phd student @mldcmu @SCSatCMU + intern @OpenHandsDev + co-creator of @CopilotArena

Hua Shen✨

@huashen218

✨Assistant Professor of CS @NYU Shanghai @NYU. Exploring human–AI alignment, safety, and the dynamics of co-evolving systems. 🦋

Luke Guerdan

@lukeguerdan

PhD Student at @SCSatCMU | Researching sociotechnical measurement & evaluation of AI systems.

narijohnson retweeted

Sarah Cen

@cen_sarah

12 days ago

To prove an AI developer or deployer broke the law, you need evidence. But what happens when the evidence needed to prove a claim is hidden inside proprietary models, platform logs, protected databases, or internal documentation? Our paper explores barriers to evidence in AI-related litigation. We study past and ongoing cases + propose a legal test for evidence decisions ⬇️ (1/7)

cen_sarah's tweet photo. To prove an AI developer or deployer broke the law, you need evidence. But what happens when the evidence needed to prove a claim is hidden inside proprietary models, platform logs, protected databases, or internal documentation?

Our paper explores barriers to evidence in AI-related litigation.

We study past and ongoing cases + propose a legal test for evidence decisions ⬇️

(1/7)

narijohnson retweeted

Meryl Ye @merylyemerylye

18 days ago

🚨 New preprint 🚨 We developed a sycophancy taxonomy based on prior literature and surveyed 106 experts. 94% agreed it's a serious problem. But they substantially disagreed about which behaviors actually count as sycophancy. Thread 🧵(1/n)

merylyemerylye's tweet photo. 🚨 New preprint 🚨

We developed a sycophancy taxonomy based on prior literature and surveyed 106 experts.

94% agreed it's a serious problem. But they substantially disagreed about which behaviors actually count as sycophancy.

Thread 🧵(1/n) https://t.co/AeLPjOJ748

12K

narijohnson retweeted

Chhavi Yadav

@chhaviyadav_

19 days ago

We built Safety-Nudges, a lightweight intervention layer, designed to promote awareness & reflection while using LLM chatbots. We chat with AI bots every day now. But there’s a growing problem not talked about enough: People are starting to overtrust AI chatbots. Especially users outside tech. When a chatbot sounds confident, empathetic, and fluent, it’s easy to forget: it can still be wrong, biased, manipulative, or misleading. So here comes Safety-Nudges! https://t.co/psEcus4WF6

narijohnson retweeted

Helen Toner

@hlntnr

about 1 month ago

What's old is new again in AI policy. In 2023 @fiiiiiist and I co-organized 2 workshops on how, if at all, licensing frontier AI could work. Our conclusion then, which IMO still holds: more govt capacity, better testing, more disclosure would all be very valuable. Licensing (or requiring USG "pre-approval") without those is a pretty dicey idea.

hlntnr's tweet photo. What's old is new again in AI policy. In 2023 @fiiiiiist and I co-organized 2 workshops on how, if at all, licensing frontier AI could work.

Our conclusion then, which IMO still holds: more govt capacity, better testing, more disclosure would all be very valuable. Licensing (or requiring USG "pre-approval") without those is a pretty dicey idea.

112

12K

narijohnson retweeted

Joachim Baumann @ ICLR'26

@joabaum

about 1 month ago

Can you boost your AI review scores by asking an LLM to rewrite your paper? Yes! We call it paper laundering Our @icmlconf spotlight paper argues current AI reviewers aren't ready to automate peer review, and outlines what a science of peer review automation should look like🧵👇

joabaum's tweet photo. Can you boost your AI review scores by asking an LLM to rewrite your paper?
Yes! We call it paper laundering
Our @icmlconf spotlight paper argues current AI reviewers aren't ready to automate peer review, and outlines what a science of peer review automation should look like🧵👇

458

244

53K

narijohnson retweeted

Lawrence Jang

@JangLawrenceK

about 1 month ago

AI agents can work pretty well on the web now for short tasks. I wanted to know: could they go longer, on harder tasks? Can an agent plan 2 weddings in different cities and a honeymoon within the same month, or find the most suitable culinary arts school across the US for my post PhD plans? We are releasing Odysseys: a benchmark of 200 long-horizon web agent tasks evaluated on the live internet. All our tasks are inspired from real human data and many take hours to complete. The best frontier model we tested (Claude Opus 4.6) reaches only 44.5% perfect-task success, leaving substantial room for improvement. I donated a couple of my own automation wishes to this benchmark. My favorite contribution was to “Rank the top 10 ACL + Meniscus surgeons in the area” - as this took me a decent amount of time to do myself when I got my own knee fixed. GPT 5.5 was able to do this for me with 4 dollars and 30 minutes!

18K

narijohnson retweeted

Oxford Internet Institute @oiioxford

about 1 month ago

New! Friendlier chatbots make more mistakes. Latest Oxford research published in @Nature tested 5 AI models and 400,000+ responses. Warm models made 10–30% more factual errors and were 40% more likely to agree with users' false beliefs. 1/2

oiioxford's tweet photo. New! Friendlier chatbots make more mistakes. Latest Oxford research published in @Nature tested 5 AI models and 400,000+ responses. Warm models made 10–30% more factual errors and were 40% more likely to agree with users' false beliefs. 1/2 https://t.co/qCeMz6yINs

narijohnson retweeted

Avijit Ghosh

@evijit

about 1 month ago

AI evaluation is becoming its own compute bottleneck. We often talk about the cost of training frontier models, but the cost of evaluating them is starting to matter just as much, especially for agents, scientific ML systems, and training-in-the-loop benchmarks. In our new Evaluating Evaluations post, we look at how evals are crossing a threshold where cost changes who can participate. The Holistic Agent Leaderboard spent about $40K on 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. And once you care about reliability, repeated runs can multiply these costs many times over. This creates a real accountability problem. If only large labs can afford statistically credible evals, independent researchers, auditors, journalists, and public-interest organizations are left with partial visibility into frontier systems. The core issue is that benchmark design is changing. Static benchmarks could often be compressed aggressively while preserving rankings. Agent benchmarks are noisier and scaffold-sensitive. Training-in-the-loop benchmarks are expensive by construction. As evals move closer to real work, they also become harder to make cheap. Some takeaways: → Leaderboards should report cost alongside accuracy. → Reliability should not be treated as optional. → We need reusable eval artifacts! Shared documentation formats, such as Every Eval Ever, can help the field stop paying repeatedly for the same measurements. Read the full post: https://t.co/sArlZMkytF Thanks for the insights @LChoshen , Yifan Mai, and @cgeorgiaw🤗

evijit's tweet photo. AI evaluation is becoming its own compute bottleneck.

We often talk about the cost of training frontier models, but the cost of evaluating them is starting to matter just as much, especially for agents, scientific ML systems, and training-in-the-loop benchmarks.

In our new Evaluating Evaluations post, we look at how evals are crossing a threshold where cost changes who can participate. The Holistic Agent Leaderboard spent about $40K on 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. And once you care about reliability, repeated runs can multiply these costs many times over.

This creates a real accountability problem. If only large labs can afford statistically credible evals, independent researchers, auditors, journalists, and public-interest organizations are left with partial visibility into frontier systems.
The core issue is that benchmark design is changing. Static benchmarks could often be compressed aggressively while preserving rankings. Agent benchmarks are noisier and scaffold-sensitive. Training-in-the-loop benchmarks are expensive by construction. As evals move closer to real work, they also become harder to make cheap.

Some takeaways:

→ Leaderboards should report cost alongside accuracy.
→ Reliability should not be treated as optional.
→ We need reusable eval artifacts! Shared documentation formats, such as Every Eval Ever, can help the field stop paying repeatedly for the same measurements.

Read the full post: https://t.co/sArlZMkytF

Thanks for the insights @LChoshen , Yifan Mai, and @cgeorgiaw🤗

12K

narijohnson retweeted

Séb Krier

@sebkrier

about 2 months ago

I think a lot of arguments in this piece are weak or out of date, and mostly repeating theoretical assumptions developed way before we had the AI systems we build today. Some of the evidence cited is a bit selective too. Some takes: 1. Grok going MechaHitler is not an example of goal misgeneralization or specification gaming, which were issues with different, older RL systems. Nor does it support the article's claim that 'AI won't do what we want (by default)'. This was a result of a bad system prompt, which was easily fixed: https://t.co/PkePMnztoU 2. In general, models are pretty good at instruction following, and even more so over time. The default empirical trend so far is improving steerability and instruction following, so expecting this to get worse as they get more capable requires strong evidence than metaphors or analogies. 3. The examples used in the post don't really support the chimp to human analogy, which is rhetorically vivid but analytically sloppy; plus there are many examples of control exerted over more capable systems, like control over companies. Control is also likely not the sole frame we should use to understand and steer AI. 4. The 'we are growing AIs' meme is not very instructive. It's not true that 'all we can see are the trillions of inscrutable parameters' - there is a lot of research that now give us a much better understanding of how models work: circuit tracing work, sparse feature work, representation engineering etc. Just because models aren't 'pre-programmed' does not imply that 'there is no way to directly specify what behaviour we want an AI system to have.' 5. We don't train AIs to 'optimise for long-term goals', this is not in fact a good description of what model training does. The article compresses too many distinct things (models, instruction tuning, scaffolds etc) into one 'goal-maximizing' or 'scheming' story. See also: https://t.co/m0bjaEp5wU 6. It's not at all obvious that learning self-preservation is a necessary side effect of better capabilities, which is a core assumption in rationalist circles. There are in my view no strong signs pointing towards robust endogenous self-preservation drives, and if anything more signs pointing in the other direction. See also: https://t.co/zYWtpafFnu 7. The cited 'blackmail the engineer' test environment is widely seen as highly flawed and not instructive in any way. Even Anthropic’s own write-up makes clear these were contrived scenarios with no external validity. See also: https://t.co/J5GWHE9ntd and https://t.co/4KQn1eJhpe 8. Reward hacking is a legitimate issue, but not one that implies the kind of loss of control alluded to in the article, nor that some hidden reward has become the system’s deep objective. See also: https://t.co/eDccq0ttSf 9. The 2024 alignment faking paper is also highly stylized, not particularly instructive, and dismissed as not in fact proving deceptive intent as the post implies. The label “alignment faking” imported more intentionality and strategic coherence than the setup warranted. See also: https://t.co/DstxlEnf4w and https://t.co/7orMO8znuq 10. A model to inferring it's being evaluated shows recognizing highly standardised/obvious evaluation environments rather than any deceptive intent. Interpreting eval awareness as such illustrates nicely the underlying assumptions held by some safety researchers. The jump from "models can detect standardized eval contexts" to "models are deceptively scheming" is a unwarranted interpretive leap.

275

215

45K

narijohnson retweeted

Daniel King

@The_DanielKing

about 2 months ago

4/4 We can’t build trustworthy AI procurement without an institution capable of certifying what that even means. Fund CAISI at $100 million. @A1Policy has reached the same figure through a diligent comparative analysis. It’s no fault of NIST or Congress that CAISI got its training wheels on a $10m budget; capabilities grew a lot since 2024. In the next year, AI capabilities will again compound faster than appropriations. The longer this waits, the wider the gap between what CAISI is equipped to deliver and what it actually needs. https://t.co/c2pUYHFJwk

885

narijohnson retweeted

Emmy Liu @_emliu

about 2 months ago

wrote a guide on getting compute grants as a student, something I wish I did more at the beginning of my PhD. It's honestly one of the highest ROI things you can do as a student (we've gotten 100k+ gpu hrs for roughly 2 weeks of work writing). https://t.co/U15nwau88a

200

237K

narijohnson retweeted

Collective Intelligence Project

@collect_intel

about 2 months ago

We are launching a new research fellowship! We are looking for 3 researchers – ideally PhD candidates or early-career postdocs* – who are passionate about AI and democratic governance. https://t.co/vvWmhshuDi *We are open to Masters-level students!

204

244

15K

narijohnson retweeted

Michelle Lam @michelle123lam

about 2 months ago

Most of what I actually need help with, I never think to tell a model. But why is it on me to remember? Our new paper asks: what if AI could proactively specialize to individuals and the tasks they’re carrying out at this very moment? 🧵

260

266

44K

narijohnson retweeted

Amanda Dsouza @amanda_dsouza

about 2 months ago

Rubrics have become the de-facto method of providing rewards for both evals & training in non-verifiable domains. Surprisingly, we found little has been studied on what makes a quality rubric. Rubrics exhibit several error modes, with significant downstream impact. Our rubric failure mode taxonomy (RIFT) is easy to apply, for detecting and ensuring rubric quality. If you're at the ICLR DATA-FM workshop or at @iclr_conf, come say hi!)

561

narijohnson retweeted

Deb Raji @rajiinio

2 months ago

Love this quote from Zoe and think it's so essential to the mental health and AI discourse: these models are PRODUCTS meant to keep you hooked on and dependent and "satisfied" as a customer; not challenged & eventually weaned off, as is the goal in actual clinical therapy!

narijohnson retweeted

Technical AI Governance @ ICML 2026 @taig_icml

2 months ago

📣 Submissions are now OPEN for the 2nd Workshop on Technical AI Governance Research at #ICML2026! 🗓️ Deadline: April 24 (23:59 AOE)

Nari Johnson

@narijohnson

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users