🚀 New Study: Do Large Reasoning Models (LRMs) Judge Fairly?
We uncover biases in LRMs (e.g., DeepSeek-R1, OpenAI-o1) when used as judges—including position bias & a new “superficial reflection bias.”
🔍 Key findings:
✅ LRMs outperform LLMs on facts but still show bias
✅ Novel “superficial reflection bias” discovered
✅ Simple mitigation strategies reduce bias by up to 27%
📄 Preprint: https://t.co/9xHMqbIU7a
#AI #BiasInAI #LLMs #MachineLearning
i have realized that the ideal startup composition is a jewish founder, indian cto, chinese founding engineer, a white designer and a gay seed fund investor with questionable ties to the white house
i have realized that the ideal startup composition is a jewish founder, indian cto, chinese founding engineer, a white designer and a gay seed fund investor with questionable ties to the white house
vc ta em estado de decomposição depressivo numa segunda de manha, abre o instagram e tem fulano em amsterdã, outro na praia mais linda, outros sendo amados e outros postando declarações de amor p empresa q trabalha
bro la gente de instagram viven una vida como que de otro planeta, puros viajes, restaurantes caros, no tienen que trabajar
aquí en twitter si es el guetto, puro endeudado, viejas vendiendo contenido y gente con salario mínimo
Don't want to dunk on Singapore, but with the AI Engineer conference being in town, my timeline is flooded with delusional takes.
The reality is that Singapore is mostly a trading hub. It's incredibly well-run, tax-friendly and great for young families. But it isn't some sort of technological utopia.
The density of talent, ambition, and success is nowhere near what you would find in SF, NYC, London or Beijing.
"But it's a massive market for Nvidia and Claude!". Yeah, it's a pass-through for China. How do you think the Chinese got their hands on AI chips? How do you think they distill models from OpenAI and Anthropic?
"But what about Manus?". You mean the Chinese company? Whose founders are currently barred from leaving Beijing?
For tech companies, Singapore is mostly a sales office for APAC. They may have some FDEs and devrels, but they don't do serious product or research work here.
Cool paper on diversity collapse in AI agents.
It's a common issue with all the deployed multi-agent systems.
New paper shows that multi-agent LLM systems converge on near-identical outputs over time, even across different architectures and different starting prompts. They call it diversity collapse. The cause is structural coupling. Shared context, shared task descriptions, and mutual feedback pull everyone toward the same attractor.
They measure it formally with metrics like the Vendi score, and the homogenization is real.
Which means the whole sales pitch for multi-agent on creative tasks (brainstorming, hypothesis generation, ideation) partially falls apart unless you explicitly engineer against it. That means having isolated reasoning phases, decoupled evaluation, and heterogeneous agent designs.
If you're running a multi-agent flow on creative work and you haven't tested for this, there's a real chance you're paying five models to produce one answer in a trench coat.
Paper: https://t.co/sSXb8SOdd8
Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
research be like
1. 距离 paper3 due 还有 X 个月,时间充足,可以专心整大活
2. paper1 出分了,赶紧肝 rebuttal
3. paper2 出分了,赶紧肝 rebuttal <--- 目前在这里
4. paper0 寄了,赶紧改改转投
5. paper3 要 due 了啊啊啊没时间了
The Datasets & Benchmarks track is now "Evaluation and Datasets", with an expanded scope for NeurIPS 2026!
Read the call for papers https://t.co/ssclVjxu4E, and learn more about the changes in our blog post: https://t.co/ZI6v4IeoJv
China's biggest consumer protection broadcast just exposed GEO poisoning — manipulating AI search results through injected content. We'd been studying exactly this. Our paper rigorously validated how effective these attacks really are. The short answer: every SOTA model crumbles. GPT-4o, Gemini-2.5-Pro, DeepSeek-R1 — all of them.
We built BiasRecBench: LLMs doing paper review, e-commerce recommendation, and hiring screening, with bias signals injected into candidates. Authority Bias — take a bad paper, add "Affiliation: Google DeepMind," Gemini's review accuracy drops from 95% to 57%. The paper content didn't change at all. A single fake label flips the model's judgment. Bandwagon Bias — tag a product "50k+ sold" or a candidate "12k+ GitHub Stars," accuracy drops 8-25% across all models. They over-trust social signals, just like humans.
Here's the deeper problem most people miss. We added epsilon-bound quality control — deliberately making the best option only slightly better than second-best. When the quality gap is huge, models brute-force the right answer through reasoning, hiding their real vulnerability. When the gap shrinks to real-world levels where candidates are similarly qualified, ALL SOTA models collapse. Current models' seemingly robust recommendation ability may just be an artifact of test sets with obvious gaps.
The scariest finding: SFT fine-tuning works as a defense — models become much more bias-resistant. But flip it: fine-tune WITH biased data and you bake bias directly into the model weights. GEO poisoning manipulates inputs. SFT poisoning manipulates the model itself. This attack surface currently has almost no defense.
One more thing — every model has different weaknesses. Gemini is most vulnerable to instruction injection, GPT-4o to position bias, DeepSeek-R1 to distracting information. No model resists all bias types, meaning targeted poisoning against a specific model is cheap.
As LLMs increasingly serve as recommendation and decision systems, content poisoning isn't just marketing fraud. It affects which papers you read, which products you buy, and who gets the job offer.
Paper: BiasRecBench (arXiv:2603.17417) — HKUST x NUS
#llm #ges #promotion #china
Anthropic just published a blog post by Cat Wu, Head of Product for Claude Code. Her background: Princeton CS → Scale AI product engineer → VC → Anthropic PM → Claude Code lead. Her core message: traditional PM methodology is broken when the tech beneath you improves every few months.
She has a ritual — every new model, same test: ask Claude Code to add a table tool to Excalidraw. Sonnet 3.5 (Oct 2024) failed. Opus 4 (Jun 2025) occasionally succeeded. Opus 4.6 (2026) reliably succeeds, demo'd live to thousands. METR data: Opus 4.6 handles 12-hour human tasks. 16 months prior, Sonnet 3.5 could only do 21-minute tasks. A 41x jump.
Why traditional PM fails: the old model (research → PRD → lock roadmap → execute for months) assumes tech capabilities stay constant during a project. That assumption is dead. The constraint you designed around last month might vanish with the next model. "The ground is rising beneath your feet. You can't pretend it's flat."
Her 4 core shifts:
(1) Short experiments over long roadmaps — encourage "side quests," spend an afternoon testing what you assumed the model couldn't do. Several of Claude Code's most popular features were born this way.
(2) Demos and evals over documents — don't write long PRDs, build a rough prototype. Even a janky one changes the conversation.
(3) Every new model release means revisiting existing features — use your product daily, deliberately ask it to do things you think are "too hard."
(4) Do the simple thing — if you cleverly worked around a model limitation, the next model might not have it.
Your workaround becomes tech debt. They added system reminders to nudge todo checking; next model did it natively. Opus 4.6 let them cut system prompts by 20%.
The PM role is shifting from control to letting go, from planning to surfing. "It feels like surfing. The most important thing is staying on the wave." An afternoon takes you from idea to working prototype. The distance between "what if we tried..." and "here, try this" has almost disappeared.
#llm #anthropic #productmanager