Lisa Dunlap @lisabdunlap - Twitter Profile

Pinned Tweet

6 months ago

🧵Tired of scrolling through your horribly long model traces in VSCode to figure out why your model failed? We made StringSight to fix this: an automated pipeline for analyzing your model outputs at scale. ➡️Demo: https://t.co/FJ4GAxPIkx ➡️Blog: https://t.co/3AyXBFBEmV

3

92

37

38

28K

lisabdunlap retweeted

Zirui "Colin" Wang @zwcolin

3 days ago

👀Humans compare images by looking back and forth. Many open-weight VLMs encode each image independently, and defer comparison to the LM. We introduce SVE: Stateful Visual Encoders for Vision-Language Models, where the visual encoder itself becomes change-aware. 🌐Project: https://t.co/P1ASxE5VBE 📰Paper: https://t.co/XnPbAF3Zr2 💻Code: https://t.co/TEX5T3SLmy 1/n

3

237

36

207

47K

lisabdunlap retweeted

Mihran Miroyan

@mirmiroyan

10 days ago

We release Recon — a new approach to reasoning synthesis for user modeling. The key insight: post-hoc rationalization ≠ reasoning. We propose using action reconstruction as a scoring criterion for synthesized reasoning traces, yielding more causally faithful reasoning and improved downstream action prediction across user modeling tasks. Paper and project page in 🧵

2

44

19

29

10K

Lisa Dunlap

@lisabdunlap

24 days ago

@sh_reya Congrats!!!!!

0

1

0

76

lisabdunlap retweeted

Ziqian Zhong

@fjzzq2002

4 months ago

🔭 We’re releasing Hodoscope: an open-source tool for unsupervised behavior discovery. It lets you visually explore and compare agent behaviors at scale. It helped us discover a novel reward hacking vulnerability in Commit0 - with just a couple minutes of human effort.

28

1K

155

1K

74K

lisabdunlap retweeted

Transluce

@TransluceAI

4 months ago

Why does GPT-5.1 Codex score 6.5% worse than GPT-5 Codex on Terminal-Bench, with the same scaffold? 🧵 GPT-5.1 times out at ~2x the rate of GPT-5. Excluding timeouts, GPT-5.1 wins by 7.2%. We analyzed 256M+ tokens of traces and found this in under an hour. Here’s how 👇

2

75

15

19

10K

Lisa Dunlap

@lisabdunlap

4 months ago

@tymofii Not just more AI-focused; the problems are much more open-ended and less algorithms-y

1

0

22

Lisa Dunlap

@lisabdunlap

4 months ago

Always great to see more work in model diffing!

Elias Kempf @elkmf

4 months ago

New model release? Great. But did the LLM’s behavior change in ways the changelog doesn't mention? We built and evaluated a pipeline to find out! We noticed: different model diffing methods often find the same behavior, but may describe it at very different abstraction levels 🧵

3

84

12

68

21K

1

10

0

3

1K

lisabdunlap retweeted

Terry Kim @thtrkim

4 months ago

I had a fun time writing a deep dive on Diffusion Language Models - with an equation walkthrough and Excalidraw sketches ✏️ In Part 1, I focused on the method: what does “noise” even mean for text, and how do DLMs denoise back into tokens? https://t.co/G8zSWCesB3

2

30

8

13

3K

Lisa Dunlap

@lisabdunlap

4 months ago

@graceluo_ Absolutely incredible work, amazing job Grace!

0

1

0

179

lisabdunlap retweeted

Grace Luo @graceluo_

4 months ago

We trained diffusion models on a billion LLM activations, and we want you to use them! New preprint: Learning a Generative Meta-Model of LLM Activations Joint work with @feng_jiahai, @trevordarrell, @AlecRad, @JacobSteinhardt. More in thread 🧵

31

1K

192

1K

221K

lisabdunlap retweeted

Parth Asawa

@pgasawa

4 months ago

Continual learning from natural language is data-hungry. Can we make it sample-efficient? SIEVE distills natural language context (instructions, feedback, rules, etc.) into model weights using as few as 3 example queries, outperforming prior methods and even in-context learning baselines. (1/n)

9

214

44

133

48K

lisabdunlap retweeted

Arena.ai

@arena

4 months ago

LMArena is now Arena. A name that takes us back to our roots with a powerful mission: to measure and advance the frontier of AI for real-world use. We have grown from a small PhD research project to a platform powered by a global community of millions. This rebrand has been shaped by the people who use it. 👇 Take a look inside the rebrand.

88

1K

108

288

306K

Lisa Dunlap

@lisabdunlap

4 months ago

With the rise of agents comes the need to better evaluate their true visual capabilities. VisGym takes a step in this direction, analyzing which representations work best for models and how they fail. It was such an honor to be a part of such an incredible team!

Zirui "Colin" Wang @zwcolin

4 months ago

🎮 We release VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents (w/ @junyi42 @aomaru_21490) 🌐 With 17 environments across multiple domains, we show systematically the brittleness of VLMs in visual interaction, and what training leads to. 🧵[1/8]

2

180

32

83

40K

0

23

1

5

2K

lisabdunlap retweeted

Haven Feng @ CVPR

@HavenFeng

4 months ago

✨Thinking with Blender~ Meet VIGA: a multimodal agent that autonomously codes 3D/4D blender scenes from any image, with no human, no training! @berkeley_ai #LLMs #Blender #Agent 🧵1/6

72

2K

309

2K

337K

Lisa Dunlap

@lisabdunlap

4 months ago

@yifandotqiao @inferact @simon_mo_ @KaichaoYou @BerkeleySky Congrats!

1

2

0

134

Lisa Dunlap

@lisabdunlap

4 months ago

@woosuk_k @inferact @vllm_project congrats!

0

2

0

228

lisabdunlap retweeted

Angjoo Kanazawa @akanazawa

5 months ago

In an effort to better understand VLMs, we found that they are fragile in surprising ways. Just changing the color of pointing markers (red circle → blue circle) can completely change the results!

4

105

9

43

15K

lisabdunlap retweeted

Laude Institute @LaudeInstitute

5 months ago

This is the kind of work that makes you rethink what leaderboards are actually measuring. If marker color can reorder rankings, are we evaluating vision capability or visual sensitivity to arbitrary details?

0

5

1

1K

lisabdunlap retweeted

Long Lian

@LongTonyLian

5 months ago

Seemingly task-irrelevant details, such as the choice of visual markers, can actually cause large changes in the performance of vision-language models! Check out our work that investigates the fragility of visually prompted benchmarks: https://t.co/VLdn6ITrca

0

9

1

0

1K

Lisa Dunlap

@lisabdunlap

5 months ago

This is an amazing collaboration between @LongTonyLian, @HavenFeng, Jiahao Shu, @XDWang101, @ren_wang1, @trevordarrell, @alsuhr and @akanazawa!

0

9

0

1

545

Lisa Dunlap

@lisabdunlap

5 months ago

🌟NEW PAPER🌟 Do you know that changing a visual marker from red to blue can completely reorder VLM leaderboards? In our most recent work, we explore the fragility of visually prompted benchmarks. https://t.co/Kck6w7Vvf6

6

219

38

92

48K

Lisa Dunlap

@lisabdunlap

5 months ago

We release VPBench along with our analysis framework: Dataset: https://t.co/JBIettzeRc Code: https://t.co/xYOzRmuPj4 arXiv: https://t.co/C9EoVbxzSB

1

7

0

580
