Scientific research is fundamental to advancing civilization and helping people globally to solve the most critical problems, from medicine to materials, from brain science to physics, and much beyond. This is only possible when scientists have access to the best tools of the time to conduct scientific research, including having access to AI-based tools.
@TheStalwart Exactly. In our research on NewsBench at Forum AI we found that even when the models source well they still play too fast and loose with what those sources say.
I’ve been helping @TheForumAI build NewsBench, a benchmark for how frontier AI covers the news that matters.
We put the leading models through 3,000+ prompts and scored each one on accuracy, neutrality, & source quality.
See where each model landed: https://t.co/unRq44qkuB
@mihai673 @ahall_research @ByForumAI We've done some small ablations around this. Once you iterate to a rubric that humans can apply consistently, the frontier models can also generally apply it pretty well. However, there's still a modest amount of performance you're leaving on the table if you don't optimize on top.
Excited to have been part of this work exploring better ways to evaluate AI on hard, contested questions. For consequential topics, grounding evaluation in expert judgment feels especially important. Proud to have contributed and excited to see what comes next with @ByForumAI.
@a1zhang's Mismanaged Genius hypothesis asks if poor LLM performance on certain tasks is due to a capability cap or poor utilization. At Forum AI, we've been researching what it would take to improve how LLMs handle high-stakes, subjective domains. We've found that first working to effectively manage a small set of humans unlocks the ability to use LLMs to scale to strong performance.
How can we teach AI the right way to handle super contested questions on consequential topics like politics, news, finance, personal health, etc.?
I've been working with @ByForumAI to develop a way to teach AI models the judgments of some of the world's foremost experts in these areas. I'm thrilled to share our whitepaper detailing the method we've come up with after many months of tinkering and testing.
Forum starts by recruiting an incredible cast of world experts of all partisan and ideological stripes: people who bring their own beliefs to bear on hard problems, but who are also capable of intellectual honesty in the face of disagreements.
We worked with them through tons of hard examples of how AI models respond to challenging questions, developing and iterating on a rubric that captured their judgments, not on whether the answer was "correct" but on whether it bore the hallmarks of rigor. Did it exhibit neutrality by seriously engaging with all relevant arguments? Did it draw on high-quality information sources? Where there are objective facts to bring to bear, did it report them accurately?
Then, the engineers at Forum developed a unique process to take the judgment of these experts and teach it to LLM judges who could apply it at scale. We're able to show that our judges perform considerably better at our task than default LLMs (i.e., if we ask Claude or ChatGPT to simply evaluate the same responses but without our special training).
We've put a ton of work into validating this process, far more than I've seen in any other eval company. There is certainly more work to be done, but we now have a process that produces LLM evaluations that do a good job of replicating what our human experts say.
Check out way more details in the paper here:
https://t.co/TLJPQ2cDR0
@sdamico @ImpulseLabs_ This seems awesome. Is there a recommended path for folks who only have room for a range? Would you pair it with a wall oven and some fancy cabinet or something?