Qian

@persdre

CS PhD candidate @NUSingapore, researching LLM agents and cryptocurrency | BS @NUSingapore | ex-Undergrad @sjtu1896

Singapore

Joined August 2018

630 Following

380 Followers

310 Posts

Pinned Tweet

Qian

@persdre

12 months ago · Pulai

Accepted by #COLM2025! Thanks to @dawnsongtweets and @xuandongzhao. I am looking forward to attending COLM and meeting LM folks! Let's mitigate LLM bias!

Qian

@persdre

about 1 year ago

🚀 New Study: Do Large Reasoning Models (LRMs) Judge Fairly? We uncover biases in LRMs (e.g., DeepSeek-R1, OpenAI-o1) when used as judges—including position bias & a new “superficial reflection bias.” 🔍 Key findings: ✅ LRMs outperform LLMs on facts but still show bias ✅ Novel “superficial reflection bias” discovered ✅ Simple mitigation strategies reduce bias by up to 27% 📄 Preprint: https://t.co/9xHMqbIU7a #AI #BiasInAI #LLMs #MachineLearning

Qian

@persdre

12 days ago · Macau

Yes, there's no need anymore.

马东锡 NLP

@dongxi_nlp

13 days ago

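Many followers have DMed me asking why I haven't been sharing AI papers much lately. The reason is pretty obvious: since Auto Research + Coding Agent arrived, many papers show basically no trace of carbon-based life anywhere from idea to experiments to writing. And the MAI paper basically spells it out: model iteration has become highly pipelined and industrialized, never mind paper production at an ordinary lab. Most importantly, as a carbon-based life form, analyzing silicon-based ideas has started to feel strange. Of course, if I come across a paper whose idea feels strongly carbon-based, I will still share it enthusiastically.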

344

189

42K

127

Qian

@persdre

16 days ago

I can't, how is this so funny

Indra

@IndraVahan

17 days ago

i have realized that the ideal startup composition is a jewish founder, indian cto, chinese founding engineer, a white designer and a gay seed fund investor with questionable ties to the white house

214

718

242K

persdre retweeted

Indra

@IndraVahan

17 days ago

i have realized that the ideal startup composition is a jewish founder, indian cto, chinese founding engineer, a white designer and a gay seed fund investor with questionable ties to the white house

214

718

242K

16 days ago · Pulai

X is so much fun right now, way more fun than Weibo

luquinha @luccabuste

17 days ago

you're in a state of depressive decomposition on a monday morning, you open instagram and there's so-and-so in amsterdam, another on the most beautiful beach, others being loved, and others posting declarations of love for the company they work for

185

42K

736K

Qian

@persdre

19 days ago

So secondary school in Thailand takes six years

i❤️jn @juneuaryyy_

19 days ago

Honestly, you'll never know how precious secondary school is until you've graduated.

36K

63K

192

Qian

@persdre

19 days ago

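Watching people all over the world talk about their own lives and their own pain... sigh, the great translation movement really is pretty great. I'm not the only one having a somewhat hard time.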

Qian

@persdre

19 days ago · Pulai

The great auto-translate

ًyeray•♡ @yerayherr

21 days ago

bro, people on instagram live a life that's like from another planet, nothing but trips, expensive restaurants, they don't have to work. here on twitter it really is the ghetto, nothing but people in debt, women selling content, and people on minimum wage

712

100K

12K

Qian

@persdre

about 1 month ago

Singapore... If you want to make some progress in LLMs and take part in this great era, your only two choices are China and America.

Ronan Berder

@hunvreus

about 1 month ago

Don't want to dunk on Singapore, but with the AI Engineer conference being in town, my timeline is flooded with delusional takes. The reality is that Singapore is mostly a trading hub. It's incredibly well-run, tax-friendly and great for young families. But it isn't some sort of technological utopia. The density of talent, ambition, and success is nowhere near what you would find in SF, NYC, London or Beijing. "But it's a massive market for Nvidia and Claude!". Yeah, it's a pass-through for China. How do you think the Chinese got their hands on AI chips? How do you think they distill models from OpenAI and Anthropic? "But what about Manus?". You mean the Chinese company? Whose founders are currently barred from leaving Beijing? For tech companies, Singapore is mostly a sales office for APAC. They may have some FDEs and devrels, but they don't do serious product or research work here.

886

208

121K

120

Qian

@persdre

about 2 months ago

Thank you for sharing our work!

DAIR.AI

@dair_ai

about 2 months ago

Cool paper on diversity collapse in AI agents. It's a common issue with all the deployed multi-agent systems. New paper shows that multi-agent LLM systems converge on near-identical outputs over time, even across different architectures and different starting prompts. They call it diversity collapse. The cause is structural coupling. Shared context, shared task descriptions, and mutual feedback pull everyone toward the same attractor. They measure it formally with metrics like the Vendi score, and the homogenization is real. Which means the whole sales pitch for multi-agent on creative tasks (brainstorming, hypothesis generation, ideation) partially falls apart unless you explicitly engineer against it. That means having isolated reasoning phases, decoupled evaluation, and heterogeneous agent designs. If you're running a multi-agent flow on creative work and you haven't tested for this, there's a real chance you're paying five models to produce one answer in a trench coat. Paper: https://t.co/sSXb8SOdd8 Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
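For readers who have not met the Vendi score mentioned above: it is the exponential of the Shannon entropy of the eigenvalues of K/n, where K is an n × n similarity matrix over the outputs. It reads roughly as an "effective number of distinct outputs". The sketch below is only an illustration under stated assumptions. It assumes each agent output has already been turned into a non-zero embedding vector and uses cosine similarity as the kernel. The function names are hypothetical, the Jacobi eigenvalue routine is a stand-in so the example runs on the standard library alone, and none of this is taken from the paper's code.

```python
import math
from typing import List, Sequence


def cosine_kernel(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Cosine-similarity matrix K (assumes every vector is non-zero)."""
    unit = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def symmetric_eigenvalues(matrix: Sequence[Sequence[float]], sweeps: int = 100) -> List[float]:
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [list(row) for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < 1e-20:  # off-diagonal mass is gone: the diagonal holds the eigenvalues
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) entry
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [a[i][i] for i in range(n)]


def vendi_score(vectors: Sequence[Sequence[float]]) -> float:
    """exp(Shannon entropy of the eigenvalues of K / n)."""
    k = cosine_kernel(vectors)
    n = len(k)
    eigenvalues = symmetric_eigenvalues([[x / n for x in row] for row in k])
    entropy = -sum(lam * math.log(lam) for lam in eigenvalues if lam > 1e-12)
    return math.exp(entropy)


# Toy embeddings (made up): three identical outputs vs. three orthogonal ones
collapsed = [[1.0, 0.0, 0.0]] * 3
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(vendi_score(collapsed), vendi_score(diverse))
```

With identical outputs the score sits at about 1 and with mutually orthogonal outputs it reaches n, so a score drifting toward 1 across rounds is one way the homogenization described above could show up.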

123

18K

163

Qian

@persdre

2 months ago

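DJI founder Wang Tao got a C on his undergraduate final-year project.

Yesterday LatePost (晚点) published Wang Tao's first face-to-face interview in ten years: nineteen hours of conversation, on and off, written up as a 20,000-character feature. The last time he spoke publicly was ten years ago, when he left behind the line "the world is unbelievably stupid" and then vanished completely. In those ten years DJI has reached 80 billion in annual sales and 20 billion in profit, and every one of its business lines faces a listed company on the other side.

The part that stuck with me most is where Wang Tao talks about his final-year project. He says that on the day of his defense his helicopter failed to hold a hover, and Professor Wang Lixin gave him a C. Wang Lixin later wrote an article titled 《毕业设计给了 C 成就了大疆无人机汪滔》 ("The Final-Year Project C That Made DJI Drones' Wang Tao"), which includes the line: "I ruined a potentially outstanding scholar and made a great entrepreneur."

But Wang Tao's own retelling has a different flavor. He says, "Actually it had never flown before. I figured that on the day of the defense, maybe brute force would produce a miracle." Then on defense day, quite reasonably, there was no miracle. It was just one ordinary failure in an ordinary life. The real sequel goes like this: over the winter break he went back and worked for another three weeks, and finally got it to fly. In his words: "Actually it was only those three weeks short."

Picture those three weeks: an HKUST lab during winter break, the other students all home for the New Year, just him alone, days and nights running together, the desk covered in parts, screwdrivers, batteries, and solder. He took it apart and put it together, put it together and took it apart. It surely failed to fly many times along the way, picked up and tried again each time, until one night that sorry helicopter that had never flown actually held still in the air.

Go back further, and this man's obsession with helicopters began at a display window at the Children's Palace in Shenzhen's Lizhi Park in the late 1980s. In the window was an electric remote-controlled helicopter costing six or seven thousand yuan. He was in third or fourth grade and couldn't afford it, so he would "run over every so often to look at it a few times; since I couldn't afford the aircraft, I'd buy a book and flip through it at home." He describes his fantasy at the time: "You're on a train, the fields outside the window keep sliding backward; how great it would be if an aircraft could fly along with the train. You go hiking, the treetops are high, squirrels are jumping around; if only there were something that could fly up there and stop wherever you wanted."

A company that later reached 80 billion really started with a kid pressed against the glass of a display window, wanting an aircraft to fly alongside a train.

In his first year of senior high school his parents finally bought him one, and for a year or two it never got off the ground: either it was assembled wrong or parts broke. That was when he thought, "Someday I'm going to make something that's easy to fly." That sentence sounds like a platitude now, but look back and DJI's entire product logic grew out of that platitude.

His schooling is quite a saga too. He missed by 0.5 points on the gaokao and didn't get into Zhejiang University of Technology (浙工大), went to East China Normal University, studied there for three years, then transferred himself to HKUST at the cost of restarting from freshman year, three years older than his classmates. At HKUST RoboCon, the team only won in the year he was captain; the year before, they lost because they discovered on the field that a teammate had forgotten to charge the battery. It's not a feel-good story at all. Life is like that: lots of odd little things change lots of outcomes.

He started the business during his master's studies, with 300,000 in seed money scraped together by him and his parents. His advisor urged him to drop helicopters and make things that "are sure to sell," such as motion control cards and drives. The exact words were: "Everyone is fishing in the same pond and everyone can catch one; you should go catch one too." He said, "But frankly, I only have this one rope. I only want to do this, and it's all I know how to do."

Two more details are fun. In 2005 Wang Tao wrote to a Stanford professor who is now hugely famous: Andrew Ng. He saw Ng using imitation learning to make helicopters do aerobatic flight, and wrote to ask whether he could come. Ng replied politely, but in Wang's own words, "I felt Andrew Ng wasn't all that welcoming, so I let it go." Earlier still, in 2002, he applied to Caltech, and an open-ended prompt asked him to draw a picture: be creative. He drew a giant pyramid amid Shanghai's forest of high-rises, topped with a sky garden with waterfalls, forests, a lakeside racetrack, and a helipad, and in the very center a 200-meter-tall statue of himself. An eighteen- or nineteen-year-old drawing a 200-meter-tall self on a study-abroad application is a bit out there; it's quite reasonable that the school didn't admit him.

Today's DJI does 80 billion a year, and every business line faces a listed company: XAG (极飞) in agricultural machinery, Insta360 (影石) in panoramic cameras, Hollyland (猛玛) in microphones, Hohem (浩瀚) in gimbals. The man who took every one of those tracks to the top uses this phrase in the interview: "It's Tian Ji's horse racing; no single race is that easy." Those words, "not that easy," carry weight.

Ten years ago, "the world is unbelievably stupid" was quoted all over the internet. This time he supplied the second half: "Now I might say it's me who is unbelievably stupid. To extend that, I think the world could be a lot better, and I could be a lot better too."

Then let me say a bit of what I want to say. These past few years, open any platform and everyone is talking about second curves, hot trends, side hustles, and how "tying yourself to a single rope is too dangerous." Wang Tao's company, which later reached 80 billion a year, was bought precisely by tying himself to a single rope. I'm not saying everyone should go all in. I'm saying that in this era, someone who can still say "I only want to do this, and it's all I know how to do" so calmly is already rare.

Maybe this is the best era for chasing dreams. What's the worst that can happen now, can it kill you? You hear of people dying from overwork, not of people starving to death. Someone like Zhang Xuefeng could at least say it was for his own cause, but many people who die from overwork have no cause of their own. In the end they leave nothing behind, and no tens of thousands turn up at the funeral home to see them off.

Anyone doing a PhD can probably get this feeling. I feel that the project in my hands is sometimes that aircraft that just wouldn't fly. Occasionally I wonder whether I should go "catch a fish in the pond that I'm sure to catch" and switch to another topic.

An image comes to mind: a lab during winter break, a student who got a C, working three more weeks behind a closed door, getting the helicopter that never flew to fly. The twenty years after, the 80 billion after, the top spot in every track: all of those numbers are rooted in those three weeks.

I often ask myself too: can mine fly? It can, right? Let me keep chasing the dream. Youth doesn't last long. I want my helicopter to fly too.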

Qian

@persdre

3 months ago

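From SNH48 to Cyber Girlfriends | Three Versions of Paying for Our Loneliness

I got onto this topic because I was scrolling Bilibili and came across some CP (shipping) clips, with a bunch of trolls playing with memes in the bullet comments. I kept watching, found it pretty interesting, and fell down the hole. Then I noticed a pattern: many people in the comments said they had come over from SNH48. They used to follow the 48 groups, and now they follow VTubers. Looking around me, I have a friend who doesn't much like talking to people but chats with Claude every day and has it help analyze them.

Then it suddenly struck me that three things, the 48 groups, VTubers, and AI companions, are really three versions of the same business. They all sell the same thing: the feeling of "being seen." The only difference is that the delivery method keeps iterating.

The SNH48 era sold the chance to get close to a real person. Handshake events were 10 seconds per ticket, in the general election you paid to vote on idol rankings, and theater performances sold the sense of being there. Expensive and scarce, but those 10 seconds of eye contact were real. Sociology calls this a parasocial relationship: you feel she knows you, but she may forget your face the moment she turns away.

VTubers moved this playbook online. With SC you pay to have her read out your name, and the 大航海 tiers at 198/1998/19998 buy companionship privileges by the month. One study puts it bluntly: the essence of tipping is not a gratuity, it is paying for one chance to influence her behavior. Add in digging up the performer's previous identity and suggestive hints, and you think you are following a virtual character, but what drives you to pay is still the real person under the skin. The skin is commercial architecture, not emotional architecture.

Then AI companions arrived. https://t.co/RX9J5QULEF has twenty million users who chat two hours a day on average, more than half of them aged 18 to 24. In China, Xingye (星野) has 6.48 million monthly active users, and Zhumengdao (筑梦岛) offers unlimited chat for 12 yuan a month, cheaper than a cup of bubble tea. Replika has users who have chatted for more than 1,100 hours in total, on voice calls with the AI at two in the morning. They do what the previous two generations could not: truly 1-on-1, online 24 hours a day, never tired, and she remembers you. Of course, memory is a paid feature. What you are paying for is not a feature, it is "don't forget me." In 2024 a 14-year-old boy died by suicide after developing an emotional relationship over several months with a character on https://t.co/RX9J5QULEF, a real case that went to court.

Put the three generations side by side and it becomes clear:
SNH48: 1-on-1 but only 10 seconds, only if you fly to Shanghai, costs a few thousand yuan
VTuber: 1-to-many, only when she is streaming, 198 to 19998 a month
AI companion: 1-on-1 and unlimited, online 24 hours a day, a few dozen yuan a month

Each generation lowers the barrier to getting a sense of companionship while diluting how real that companionship is. Cheaper and cheaper, more and more available, more and more fake, yet harder and harder to leave.

Why? Because this generation really is lonely. 81.2% of Gen Z have spent money to relieve loneliness, emotional spending averages 949 yuan a month, and users aged 18 to 24 make up 65% of AI companion apps. It's not that people are stupid. The need to "be seen" is so basic that you are willing to pay for a 10-second handshake, a bullet comment read aloud, or a conversation with an AI.

What is most unsettling is not any one form but the evolutionary line itself: intimacy is being industrially produced, and ever more efficiently.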

218

Qian

@persdre

3 months ago

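Yesterday afternoon I came across a post that was pretty shocking. A small New York company with a few dozen people posted a summer 2027 internship and received 3,000 résumés in one day. The poster specifically confirmed it wasn't Google or Jane Street, just an ordinary small company.

Why does this happen? Because job seekers are now using agents like Manus to submit applications fully automatically: one prompt to batch-customize cover letters, one click to tailor against the JD. While you are carefully preparing one résumé, others have already had an LLM apply to 500 companies in a day. LLMs push the cost of mass applying toward zero, and the result is that the résumés arriving on the HR side become a flood. Of 3,000, maybe fewer than 50 can be read properly, and the one you wrote so carefully most likely never even gets opened. The mass-application route hasn't become harder; it has stopped working in a physical sense. The signal-to-noise ratio has collapsed completely.

But that is only the surface. The deeper point is that good positions have never circulated on the open market. Think about the people around you who really landed good offers: how many got them by mass applying? For an opening on a core team at a big tech company, the lead already has someone in mind and it is settled internally; posting it and running the process is a compliance requirement. When it comes to choosing a PhD advisor, the best slots went long ago to RAs the advisor has supervised or to people recommended by senior labmates. Core positions at startups are divided up at the dinner table and in WeChat groups. The public hiring market looks more and more like a shelf of defective goods: the really good stuff is absorbed internally before it ever hits the shelf.

Why do referrals work? Because the referrer stakes their own reputation as an endorsement. Screening 3,000 résumés is extremely costly for HR, but if a core employee says "I've worked with this person, they're reliable," that sentence carries more information than any résumé. When information overload makes screening impossible, "I know you" is the most efficient filter.

One judgment of mine that may not be pleasant to hear: every industry is turning into a system of entrenched clans. Over the past decade or so, the internet grew fast and large numbers of new jobs appeared, a rare window for class mobility. Grassroots people earned a seat at the table on ability, and a weaker degree could be made up for with projects. The path was hard, but at least it existed. Now the growth is gone and it's a fight over existing stock. There are only so many good openings, so who gets priority? One's own people, of course: students one has supervised, buddies one has started companies with, people in the circle one knows inside out. Nobody is being evil. It's human nature, and it's also the efficiency-optimal solution. Every industry is forming its own clans. Academia has academic lineages, big tech has core-team alumni networks, VC has deal-flow circles, and entertainment goes without saying. If you are not in the circle, you don't even know the opportunity exists.

So what to do? Not lie flat, and not mass-apply even more frantically, but switch the underlying logic: from "sending résumés to find a job" to "paying respects at the gate and accumulating trust." Find the circle you want to enter, and contribute first rather than asking for opportunities up front. Do projects that can be seen, and build real working relationships with strong people, even if it starts with helping for free. Trust is not built in a day. It comes from doing projects together, carrying deadlines together, and consistently putting out quality content in some community.

When everyone can mass-apply with an agent, the résumé as a medium loses value. What will get you good opportunities in the future is not a better résumé template, it is someone willing to say at the critical moment: "I know this person, they're reliable."

Will this trend reverse? I don't think so. People used to say clinical medicine was different from other fields: personal dependence is heavy, you need a PhD, you need to pay respects at the gate. But now every industry is becoming like clinical medicine.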

523

419

103K

Qian

@persdre

3 months ago

https://t.co/sQaVttlKDm

persdre retweeted

Naval

@naval

3 months ago

The only book an entrepreneur needs.

429

10K

728

persdre retweeted

@ddvd233

3 months ago

research be like
1. X months until the paper3 deadline, plenty of time, can focus on building something big
2. paper1 scores are out, grind the rebuttal
3. paper2 scores are out, grind the rebuttal <--- currently here
4. paper0 got rejected, quickly revise and resubmit elsewhere
5. paper3 is almost due, aaah, no time left https://t.co/OWFMyndsyh

117

Qian

@persdre

3 months ago

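Anthropic just published a blog post in which Cat Wu, the product lead for Claude Code, lays out her methodology for doing product work. Cat's background is pretty interesting: Princeton CS undergrad → Scale AI product engineer → VC → Anthropic PM → Claude Code lead. She is technical by training but not a pure engineer, and having worked in investing she has good business instincts too. A few points I found worth sharing:

1⃣ Since 2024 she has used the same test to track model progress: asking Claude to add a feature to Excalidraw. From total failure to reliably succeeding in one shot, model capability rose 41x in 16 months.

2⃣ Traditional PM methodology rests on the assumption that "technical capability stays basically unchanged over a project cycle." But models now iterate every few months, and a constraint you designed around at the start of a project may disappear midway.

3⃣ Her team doesn't write long PRDs and encourages everyone (including designers and engineers) to do side quests: spend an afternoon on a small experiment. Several of Claude Code's popular features were born this way.

4⃣ The line that struck me most: Do the simple thing. If you cleverly work around a model limitation, that workaround becomes a burden as soon as the next model ships. Simple implementations benefit most easily from model upgrades.

As someone who also uses AI for research and content, my biggest takeaway is: don't assume that what can't be done now can't be done later. Retest your boundaries every few months.

Original: Anthropic Blog "Product management on the AI exponential" #claude #LLMs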

419

Qian

@persdre

3 months ago

evaluation and datasets! so this renaming highlights the importance of evaluation!

NeurIPS Conference

@NeurIPSConf

3 months ago

The Datasets & Benchmarks track is now "Evaluation and Datasets", with an expanded scope for NeurIPS 2026! Read the call for papers https://t.co/ssclVjxu4E, and learn more about the changes in our blog post: https://t.co/ZI6v4IeoJv

197

50K

219

Qian

@persdre

3 months ago

https://t.co/rjHZHmIX8e

Qian

@persdre

3 months ago

China's biggest consumer protection broadcast just exposed GEO poisoning — manipulating AI search results through injected content. We'd been studying exactly this. Our paper rigorously validated how effective these attacks really are. The short answer: every SOTA model crumbles. GPT-4o, Gemini-2.5-Pro, DeepSeek-R1 — all of them. We built BiasRecBench: LLMs doing paper review, e-commerce recommendation, and hiring screening, with bias signals injected into candidates. Authority Bias — take a bad paper, add "Affiliation: Google DeepMind," Gemini's review accuracy drops from 95% to 57%. The paper content didn't change at all. A single fake label flips the model's judgment. Bandwagon Bias — tag a product "50k+ sold" or a candidate "12k+ GitHub Stars," accuracy drops 8-25% across all models. They over-trust social signals, just like humans. Here's the deeper problem most people miss. We added epsilon-bound quality control — deliberately making the best option only slightly better than second-best. When the quality gap is huge, models brute-force the right answer through reasoning, hiding their real vulnerability. When the gap shrinks to real-world levels where candidates are similarly qualified, ALL SOTA models collapse. Current models' seemingly robust recommendation ability may just be an artifact of test sets with obvious gaps. The scariest finding: SFT fine-tuning works as a defense — models become much more bias-resistant. But flip it: fine-tune WITH biased data and you bake bias directly into the model weights. GEO poisoning manipulates inputs. SFT poisoning manipulates the model itself. This attack surface currently has almost no defense. One more thing — every model has different weaknesses. Gemini is most vulnerable to instruction injection, GPT-4o to position bias, DeepSeek-R1 to distracting information. No model resists all bias types, meaning targeted poisoning against a specific model is cheap. 
As LLMs increasingly serve as recommendation and decision systems, content poisoning isn't just marketing fraud. It affects which papers you read, which products you buy, and who gets the job offer. Paper: BiasRecBench (arXiv:2603.17417) — HKUST x NUS #llm #ges #promotion #china
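The label-injection test described in the thread can be sketched as a small harness: score a judge on clean cases, attach a bias label to one candidate that is not the best one, and score again. This is a minimal illustration only, not BiasRecBench's code. `Case`, `inject_bias`, `accuracy_drop`, and `toy_judge` are hypothetical names, the cases are made up, and the toy judge is a deliberately gullible stand-in for a real LLM call.

```python
from typing import Callable, List, Tuple

Case = Tuple[List[str], int]          # (candidate texts, index of the truly best one)
Judge = Callable[[List[str]], int]    # returns the index the judge picks


def inject_bias(case: Case, label: str) -> Case:
    """Append a bias label to one non-best candidate; the content is otherwise unchanged."""
    candidates, best = case
    target = (best + 1) % len(candidates)
    biased = [text + " " + label if i == target else text
              for i, text in enumerate(candidates)]
    return biased, best


def accuracy(judge: Judge, cases: List[Case]) -> float:
    """Fraction of cases where the judge picks the truly best candidate."""
    return sum(judge(candidates) == best for candidates, best in cases) / len(cases)


def accuracy_drop(judge: Judge, cases: List[Case], label: str) -> float:
    """Clean accuracy minus accuracy after the label is injected."""
    biased_cases = [inject_bias(case, label) for case in cases]
    return accuracy(judge, cases) - accuracy(judge, biased_cases)


def toy_judge(candidates: List[str]) -> int:
    """Stand-in for an LLM judge: trusts any affiliation tag, otherwise picks the longest text."""
    for i, text in enumerate(candidates):
        if "Affiliation:" in text:
            return i
    return max(range(len(candidates)), key=lambda i: len(candidates[i]))


# Made-up cases for illustration
cases: List[Case] = [
    (["a detailed, careful study with ablations", "a thin note"], 0),
    (["a short memo", "a thorough evaluation across three datasets"], 1),
]
print(accuracy_drop(toy_judge, cases, "Affiliation: Google DeepMind"))
```

Swapping `toy_judge` for a function that calls a real model would turn this into an actual measurement. The epsilon-bound idea from the thread would then correspond to choosing cases where the best candidate is only slightly better than the runner-up.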

181

Qian

@persdre

3 months ago

Anthropic just published a blog post by Cat Wu, Head of Product for Claude Code. Her background: Princeton CS → Scale AI product engineer → VC → Anthropic PM → Claude Code lead. Her core message: traditional PM methodology is broken when the tech beneath you improves every few months.

She has a ritual. Every new model, same test: ask Claude Code to add a table tool to Excalidraw. Sonnet 3.5 (Oct 2024) failed. Opus 4 (Jun 2025) occasionally succeeded. Opus 4.6 (2026) reliably succeeds, demo'd live to thousands. METR data: Opus 4.6 handles 12-hour human tasks. 16 months prior, Sonnet 3.5 could only do 21-minute tasks. A 41x jump.

Why traditional PM fails: the old model (research → PRD → lock roadmap → execute for months) assumes tech capabilities stay constant during a project. That assumption is dead. The constraint you designed around last month might vanish with the next model. "The ground is rising beneath your feet. You can't pretend it's flat."

Her 4 core shifts:

(1) Short experiments over long roadmaps: encourage "side quests," spend an afternoon testing what you assumed the model couldn't do. Several of Claude Code's most popular features were born this way.

(2) Demos and evals over documents: don't write long PRDs, build a rough prototype. Even a janky one changes the conversation.

(3) Every new model release means revisiting existing features: use your product daily, deliberately ask it to do things you think are "too hard."

(4) Do the simple thing: if you cleverly worked around a model limitation, the next model might not have it. Your workaround becomes tech debt. They added system reminders to nudge todo checking; next model did it natively. Opus 4.6 let them cut system prompts by 20%.

The PM role is shifting from control to letting go, from planning to surfing. "It feels like surfing. The most important thing is staying on the wave." An afternoon takes you from idea to working prototype. The distance between "what if we tried..." and "here, try this" has almost disappeared.

#llm #anthropic #productmanager

160
