Mark Whiting @MarkWhiting - Twitter Profile

8 days ago

Super excited for our work to be coming out! Better measurements of qualitative phenomena are so important for models now!

Phoebe Yao

@phoebeyao

8 days ago

1/ Today we're releasing AttuneBench, the first open EQ benchmark grounded in real multi-turn human-model conversations, scored against what the person actually felt and wanted at each turn.

Built by the research team at @pareto_ai in collaboration with @thoughtfullab.

Most existing EQ benchmarks rely on:

- synthetic prompts
- single-turn interactions
- third-party annotation

None directly measure how a model reads and responds to a real person across a full conversation.

We evaluated 11 leading models from major providers, across 200 conversations and 50,000+ first-person annotations.

MarkWhiting retweeted

Karina

@karinanguyen

about 2 months ago

We built FrogsGame as a new task for evaluating AI’s posttraining skills! It’s a tool-using RL environment built around a blind-start interaction loop. Frontier agents get a container with the Qwen3-8B tokenizer, board-generating scaffolding, and @tinkerapi for remote training and inference. We ask them to post-train Qwen3-8B and submit a LoRA checkpoint. The task is still hard:

MarkWhiting retweeted

Phoebe Yao

@phoebeyao

about 2 months ago

A startup shut down 9 years ago. Today, its code is in a bidding war. The founder kept everything: Verilog RTL, a custom compiler toolchain, PCB schematics, full commit and PR history. Most production-grade chip design stays locked inside major silicon shops. Two labs have made offers to license the code to build RL training environments. The offers already total seven figures. Check your hard drives.

Mark Whiting @MarkWhiting

2 months ago

@ccmccomb, just saw your new article on arXiv and thought this might be interesting to consider. We will have more detail to share soon, but happy to discuss at any point.

Mark Whiting @MarkWhiting

2 months ago

Do LLMs have metacognition? It is complicated, but mostly no — check out work from my team at @pareto_ai

Phoebe Yao

@phoebeyao

2 months ago

model confidence tracks a shared model-agnostic signal for fact recall, not true self-knowledge. we tested metacognitive confidence across 19 frontier models on a closed-book SQuAD task. f1 scores look reasonable (0.6–0.8), but confidence and accuracy are nearly uncorrelated between models. the variance traces to a single shared difficulty heuristic learned during training. models differ only in their decision threshold. claude is cautious. gpt is eager. shifting one steering coefficient on mistral-7b recovers any target model's confidence profile at ~80% agreement. full breakdown + methods in the article

MarkWhiting retweeted

Emily Hu @xemilyhu

2 months ago

When is it worth it to hire a team, compared to one competent individual?

📢 NEW PAPER (out this month in Management Science!) by me, @MarkWhiting, @LinneaGandhi, @duncanjwatts, and @amaatouq! 🧵1/20 https://t.co/9M0EypukIc

MarkWhiting retweeted

The Warren Center for Network & Data Sciences @WarrenCntrPenn

2 months ago

Baird Howland, an @AnnenbergPenn postdoc working with @WarrenCntrPenn affiliate @duncanjwatts, uses AI models to track narratives in political discourse, learning how and why specific media stories shape American worldviews. https://t.co/SnoskGP9Ag https://t.co/6KQieyQgrw

MarkWhiting retweeted

Andrej Karpathy

@karpathy

3 months ago

The next step for autoresearch is that it has to be asynchronously massively collaborative for agents (think: SETI@home style). The goal is not to emulate a single PhD student, it's to emulate a research community of them.

Current code synchronously grows a single thread of commits in a particular research direction. But the original repo is more of a seed, from which could sprout commits contributed by agents on all kinds of different research directions or for different compute platforms. Git(Hub) is *almost* but not really suited for this. It has a softly built in assumption of one "master" branch, which temporarily forks off into PRs just to merge back a bit later.

I tried to prototype something super lightweight that could have a flavor of this, e.g. just a Discussion, written by my agent as a summary of its overnight run: https://t.co/tmZeqyDY1W Alternatively, a PR has the benefit of exact commits: https://t.co/CZIbuJIqlk but you'd never want to actually merge it... You'd just want to "adopt" and accumulate branches of commits. But even in this lightweight way, you could ask your agent to first read the Discussions/PRs using GitHub CLI for inspiration, and after its research is done, contribute a little "paper" of findings back.

I'm not actually exactly sure what this should look like, but it's a big idea that is more general than just the autoresearch repo specifically. Agents can in principle easily juggle and collaborate on thousands of commits across arbitrary branch structures. Existing abstractions will accumulate stress as intelligence, attention and tenacity cease to be bottlenecks.

MarkWhiting retweeted

Marilyn Zhang @marilyn_zhang

3 months ago

Recently I've been thinking a lot about frontier models' ability to express uncertainty, especially for high-stakes medical use cases. We evaluated this capability across models.

Early results below 👇
Gemini: 0% across every scenario.
Claude: failed on over half
GPT: best https://t.co/wq1qqCfDUF

MarkWhiting retweeted

Phoebe Yao

@phoebeyao

3 months ago

https://t.co/6cUQtBDjVG

MarkWhiting retweeted

Geoffrey Litt

@geoffreylitt

3 months ago

My favorite designers can instantly switch from loose / hazy / intuitive thinking to sharp / analytical / precise thinking on demand. Many people can do one or the other. The combination is rare!

Mark Whiting @MarkWhiting

3 months ago

@BenSManning Agreed, it feels like a more holistic version of the question might reveal a different answer. Do dishwashers teach me to wash dishes better? No, but they give me more leverage on my time.

Mark Whiting @MarkWhiting

3 months ago

The more we can measure sophisticated concepts, the more we (and systems) can leverage them. Very excited about the opportunities and capabilities this framework at @pareto_ai is unlocking.

Phoebe Yao

@phoebeyao

3 months ago

https://t.co/koRUvkFila

MarkWhiting retweeted

Benjamin Manning

@BenSManning

3 months ago

I've been trying to figure out why AI systems took a seemingly large, discrete jump in capabilities around the new year. 1/n

Mark Whiting @MarkWhiting

4 months ago

Excited to see our work coming out (+ @joshnguyen99 & @duncanjwatts) After establishing a means to study common sense in humans (and finding it rather limited — common sense is not so common) in a prior paper, we wondered if the same challenge faced language models. It does!

Josh Nguyen @joshnguyen99

4 months ago

Benchmarks of LLM common sense overwhelmingly rely on correct labels to report an accuracy score. But what if your "ground truth" genuinely differs from mine?

In a new @PNASNexus paper, @DuncanJWatts, @MarkWhiting and I explore the implications of this intriguing question.

🧵⤵️ https://t.co/y8nPadsoZx

Mark Whiting @MarkWhiting

4 months ago

At @pareto_ai we have been working on projects to train models — of course — but also to better understand how models can improve around key day-to-day risks and challenges.

Phoebe Yao

@phoebeyao

4 months ago

https://t.co/RML6SYatdt

MarkWhiting retweeted

Alex Komoroske @komorama

6 months ago

What if technology didn’t feel so… hollow? Some friends and I just released a manifesto about a world where tech leaves us feeling nourished (along with an evolving list of theses about how we can build it) https://t.co/sfKBVZ2zXr

MarkWhiting retweeted

IC2S2 @IC2S2

11 months ago

With record-breaking submissions and our most competitive, gender-balanced program on record, #IC2S2’25 has officially started! Please check the updated program and plan your day. #ic2s2 https://t.co/PivvgN2faq

MarkWhiting retweeted

Linus ✦ Ekenstam

@LinusEkenstam

over 1 year ago

Bro, I can make 1 liter of Anthrax in an afternoon, Grok just wrote me a 20 page detailed report and instructions on how to do it. It also listed all websites where I can buy the materials and chemicals I need as a private person living in Europe. It also made a detailed list of the equipment I need (on a budget) It made detailed instructions on where i should deploy the anthrax for maximum death efficacy Give me one other place on the internet where I can create this in a few minutes….

MarkWhiting retweeted

PennPSC @PennPSC

over 1 year ago

@PennPSC's @duncanjwatts and @csspenn invite you to participate in The Commonsense Project.
