Matt Wilde @MattCWilde - Twitter Profile

about 22 hours ago

Scientific research is fundamental to advancing civilization and helping people globally to solve the most critical problems, from medicine to materials, from brain science to physics, and much beyond. This is only possible when scientists have access to the best tools of the time to conduct scientific research, including having access to AI-based tools.

103

3K

420

318

155K

Matt Wilde

@MattCWilde

2 days ago

That Apple / OpenAI partnership seems to be going great

0

8

Matt Wilde

@MattCWilde

19 days ago

@TheStalwart Exactly. In our research on NewsBench at Forum AI we found that even when the models source well they still play too fast and loose with what those sources say.

0

111

MattCWilde retweeted

Katie Harbath @katieharbath

22 days ago

I’ve been helping @TheForumAI build NewsBench, a benchmark for how frontier AI covers the news that matters. We put the leading models through 3,000+ prompts and scored each one on accuracy, neutrality, & source quality. See where each model landed: https://t.co/unRq44qkuB

1

3

2

1

185

Matt Wilde

@MattCWilde

23 days ago

@mihai673 @ahall_research @ByForumAI We've done some small ablations around this. Once you iterate to a rubric that humans can apply consistently, the frontier models can also generally apply it pretty well. However there's still a modest amount of performance you're leaving on the table if you don't optimize on top

0

1

0

47

MattCWilde retweeted

Jillian Fisher @jrfisher552

23 days ago

Excited to have been part of this work exploring better ways to evaluate AI on hard, contested questions. For consequential topics, grounding evaluation in expert judgment feels especially important. Proud to have contributed and excited to see what comes next with @ByForumAI.

0

4

2

0

412

Matt Wilde

@MattCWilde

23 days ago

Check out way more details in the paper here: https://t.co/ydepwELbkX

0

2

0

31

Matt Wilde

@MattCWilde

23 days ago

@a1zhang's Mismanaged Genius hypothesis asks if poor LLM performance on certain tasks is due to a capability cap or poor utilization. At Forum AI, we've been researching what it would take to improve how LLMs handle high-stakes, subjective domains. We've found that first working to effectively manage a small set of humans unlocks the ability to use LLMs to scale to strong performance.

1

3

1

0

46

MattCWilde retweeted

Andy Hall

@ahall_research

23 days ago

How can we teach AI the right way to handle super contested questions on consequential topics like politics, news, finance, personal health, etc? I've been working with @ByForumAI to develop a way to teach AI models the judgments of some of the world's foremost experts in these areas. I'm thrilled to share our whitepaper detailing the method we've come up with after many months of tinkering and testing. Forum starts by recruiting an incredible cast of world experts of all partisan and ideological stripes---people who bring their own beliefs to bear on hard problems, but who are also capable of intellectual honesty in the face of disagreements. We worked through tons of hard examples with them of how AI models respond to challenging questions, developing and iterating on a rubric that captured their judgments---not on whether the answer was "correct" but on whether it bore the hallmarks of rigor. Did it exhibit neutrality by seriously engaging with all relevant arguments? Did it draw on high-quality information sources? Where there are objective facts to bring to bear, did it report them accurately? Then, the engineers at Forum developed a unique process to take the judgment of these experts and teach it to LLM judges who could apply it at scale. We're able to show that our judges perform considerably better at our task than default LLMs (i.e., if we ask Claude or ChatGPT to simply evaluate the same responses but without our special training). We've put a ton of work into validating this process, far more than I've seen in any other eval company. There is certainly more work to be done, but we now have a process that produces LLM evaluations that do a good job of replicating what our human experts say. Check out way more details in the paper here: https://t.co/TLJPQ2cDR0

1

23

6

15

4K

Matt Wilde

@MattCWilde

about 2 years ago

@zeta_globin Boston

0

2

0

98

Matt Wilde

@MattCWilde

about 2 years ago

@karpathy Cosmic rays are just dropout in prod

0

2

0

18

Matt Wilde

@MattCWilde

over 2 years ago

@eshear Nothing. https://t.co/WiYDnUTWWO

0

14

Matt Wilde

@MattCWilde

over 2 years ago

@sdamico @ImpulseLabs_ This seems awesome. Is there a recommended path for folks who only have room for a range? Would you pair it with a wall oven and some fancy cabinet or something?

0

27