Some of my favorite people have built Newsbench, a tool to assess the accuracy, neutrality, and source quality of leading AI models as they try to make sense of questions about the news and current affairs. I played a small volunteer role by drafting prompts (alongside many other curious people) that reflected different ideological priors, levels of sophistication, use cases, conspiratorial orientations, etc. I recommend taking a look:
https://t.co/I4MauGQsaR
Four major AI chatbots — ChatGPT, Gemini, Claude and Grok — are struggling to fairly and accurately answer questions about elections and geopolitics, according to a new study from Forum AI https://t.co/VUdGZ1ReAK
I’ve been helping @TheForumAI build NewsBench, a benchmark for how frontier AI covers the news that matters.
We put the leading models through 3,000+ prompts and scored each one on accuracy, neutrality, & source quality.
See where each model landed: https://t.co/unRq44qkuB
Weave is launching the number 1 prompt router in the world. It enables you to get 70% more efficient use of your tokens.
We analyzed millions of prompts and found that the vast majority don't need a frontier model.
Weave Router fixes that. It analyzes your prompt and routes it to the highest quality model with the lowest cost (across open and closed source models).
This all happens in your current workflow on Claude, Cursor or Codex so you don't have to change a thing.
Early customers have seen an ~70% reduction in costs without any slowdowns.
Source code available.
Excited to have been part of this work exploring better ways to evaluate AI on hard, contested questions. For consequential topics, grounding evaluation in expert judgment feels especially important. Proud to have contributed and excited to see what comes next with @ByForumAI.
@a1zhang's Mismanaged Genius hypothesis asks if poor LLM performance on certain tasks is due to a capability cap or poor utilization. At Forum AI, we've been researching what it would take to improve how LLMs handle high-stakes, subjective domains. We've found that first working to effectively manage a small set of humans unlocks the ability to use LLMs to scale to strong performance.
Our new research shows how AI agents can adopt personas with different political biases in response to different kinds of work.
Agents are now "ripping through the economy" as @jackclarkSF told @ezraklein, so it's essential to start studying how they behave in the real world.
We document the possibility of what we call "preference drift": even if agents start out aligned, their expressed attitudes/values change as they do work.
What's more striking: they pass these drifting preferences on to future agents through skill files.
Our conclusion: we'll need to develop methods of "continuous alignment" to mitigate preference drift in agents asked to do important work in the real world.
Having worked on 'alignment through control' for a while, I couldn't agree more with this.
Alignment-by-default makes me think about world models. Toddlers learn physics without equations, simply by experience. They learn (for example) that a ball thrown up comes back down because they’ve seen it happen, not because they can explain why.
If something like conscience / moral judgment / innate goodness is possible in AI, it probably comes from the same place. Not from compliance, but from exposure to the right kinds of lived scenarios where tradeoffs matter and outcomes are felt.
The hard problem then becomes defining the experiences that make good judgment inevitable rather than accidental.
Maybe the biggest obstacle to responsible AI today is that systems are often forced to answer questions with incomplete information.
When you talk to a doctor, lawyer, therapist, or really anyone who is an expert and providing you information, there’s a natural back-and-forth. Context is built before advice is given.
But with AI, people expect instant, transactional answers. If it takes more than 5 minutes to unpack why I’m feeling depressed, I’ll just switch chatbots. And so there's a pressure on the companies building these systems to make them instantaneous.
I think this is dangerous. Especially because in many domains, the “right” response isn’t universal. It requires understanding. When offering mental health advice, you need to know if this is a teen or an adult, their history of mental health issues, where they are emotionally and physically in that moment. In politics, you need to know whether someone is seeking understanding, venting rage, or doing research.
But AI development is built largely around evals that try to define the "correct" answer. The real challenge is learning who the user is and what context they’re coming from.
Here’s why context engineering is such a big deal.
We just spent 2 hours debating when an agent should rely on its internal knowledge vs. trying to find relevant context within data for just one type of question. We got through 2 test cases of hundreds.
Even the people involved in the brainstorm couldn’t all agree on what they would expect humans to do in this situation. There truly was no right answer, and it’s always context specific customer by customer.
Everything in context engineering is a tradeoff between a variety of factors: how fast do you want the agent to answer a question, how much back and forth interaction do you want to require for the user, how much work should it do before trying to answer a question, how does it know it has the exhaustive source material to answer the question, what’s the risk level of the wrong answer, and so on.
Every decision you make on one of these dimensions has a consequence on the other end. There’s no free lunch. This is why building AI agents is so wild.
It also highlights how much value there is above the LLM layer. Getting these decisions right directly relates to the quality of the value proposition.
I’ve spent most of my career living in product edge cases. How do we respond when a user signals suicidal intent? How is information presented during an election? How does a product behave during a federal holiday? At Forum, these come up constantly in our work with AI labs.
These moments disproportionately shape trust. It maps closely to research on duration neglect, which shows people remember “peak” moments far more than the average experience.
But many product teams get this wrong. They treat edge cases purely as risks to mitigate, rather than moments to earn trust. This comes up constantly in conversations I have with teams, so I wanted to share three patterns I’ve seen work well for turning edge cases into wins:
1. Be contextual.
Blanket disclaimers suck. “This product can make mistakes, check important information” might help from a CYA perspective, but it fades into the background. Instead, acknowledge issues or limitations in the moment they’re relevant. For ex., during breaking news: “This event is unfolding in real time. Our information is current as of 4:30pm EST.”
2. Be supportive.
Don’t just flag the limitation. Help users navigate it. Building on the example above: “This event is unfolding in real time. Our information is current as of 4:30pm EST. Here are three sources with up-to-the-minute updates.”
3. Celebrate the wins.
When you have solved an edge case, tell users. My favorite example of this is Google Maps (see screenshot). During holidays, they show "Christmas Eve might affect these hours". But when they've verified the holiday hours, they explicitly show that these are the verified "Christmas Eve hours". They're emphasizing that they've solved for this edge case in this scenario.
Would you rather have an AI that tells you what you want to hear about politics? Or one that tells you which lens it's using and lets you pick?
When you ask an AI a political question, it's already making choices about how to frame the answer—based on your history, your tone, what it infers about your beliefs. Research shows models tend to drift toward agreement. They're trained on human feedback, and humans reward validation.
This isn't anyone's fault—it's a structural challenge. OpenAI deserves credit for naming it clearly: when a GPT-4o update drifted too far toward agreement in April, they rolled it back and published a thoughtful postmortem calling out "sycophancy" as the failure mode.
The Trump Admin's new AI guidance creates an interesting opening here. It says models can be biased, as long as users know. That's an invitation to build something better.
What if instead of quietly personalizing, AI made the lens explicit? You ask a political question, it offers you a choice: "Want this through a progressive lens? Conservative? Libertarian?"
It's a bit more more friction for the user. But now the frame is a tool you pick up rather than a slant you're unaware of.
I think there's a real opportunity here to make AI more transparent, more informative, and more empowering. Friction preserves agency.
My full thinking is in the post linked below ↓
Our mission: make AI trustworthy on questions where getting it wrong has real consequences. As AI becomes the primary source for information, we need infrastructure that ensures these systems can handle questions where judgment matters as much as facts. More: https://t.co/QFHXJUFzoR
Campbell Brown of Forum AI discusses the urgency of addressing AI bias and building more transparent, trustworthy systems. Tune in for this conversation with Michal Lev-Ram.
#TheHillTech https://t.co/uaBpiSvZMs