rao

@raox

Joined February 2009

2K Following

126 Followers

4.5K Posts

raox retweeted

Viking

@vikingmute

19 days ago

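Sharing an article: "How LLMs Actually Work" https://t.co/apHhTvjdiB I believe it was ranked first on Hacker News a few days ago. There are plenty of similar articles, but this one explains deep ideas simply and uses intuitive examples, which makes it a great fit for readers who have some programming experience but have not studied the Transformer in depth. The analogies are apt too, and it was clearly written by a real person, with no AI flavor. I have recently fallen back in love with writing and have written two technical articles, with more to come. One principle of mine: written by a human, never with AI. Writing is a pleasure: organizing your logic, expressing your views. Don't let AI take that pleasure away.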

648

134

83K

raox retweeted

Rohan Paul

@rohanpaul_ai

19 days ago

Great Stanford + MIT + Harvard + Anthropic paper.

Gives a clear training-based reason for why larger models learn abilities smaller models miss.

Says bigger AI models learn rare skills because they forget them less during training, their extra space protects weak learning signals.

The authors say the issue is not just whether a small model could represent the task, but whether training lets it keep that task while many common tasks keep pushing on the same limited parts.

Their core idea is that common tasks take up the model’s neurons first, so rare tasks get overwritten before they appear often enough to build into stable knowledge.

In a crowded data mixture, common patterns get first claim on the model’s internal machinery.

Small models may briefly pick up a rare signal, but the next wave of common-task updates overwrites it before the signal appears again.

They tested this first with controlled toy tasks where they could change how rare and complex each task was, then with OLMo language models from 4M to 4B parameters.

The main result is that bigger models learned low-frequency tasks much better, kept more task features inside their representations, and showed less gradient interference, which means common-task updates disturbed rare-task learning less.

Larger models can remember weak rare signals long enough to turn them into real learned skills.

----

Link – arxiv.org/abs/2605.29548

Title: "Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention"
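
One common way to make "gradient interference" concrete is the cosine similarity between the gradient from a common-task batch and the gradient from a rare-task batch: a negative value means a step that helps the common task pushes against the rare one. The sketch below illustrates that idea only. The post does not give the paper's metric, so the measure, the function name `gradient_cosine`, and the toy vectors are all assumptions.

```python
import math


def gradient_cosine(g_common, g_rare):
    """Cosine similarity between two gradient vectors given as lists of floats.

    Hypothetical helper: negative values indicate conflicting update directions.
    """
    dot = sum(a * b for a, b in zip(g_common, g_rare))
    norm_common = math.sqrt(sum(a * a for a in g_common))
    norm_rare = math.sqrt(sum(b * b for b in g_rare))
    if norm_common == 0.0 or norm_rare == 0.0:
        return 0.0  # no direction to compare
    return dot / (norm_common * norm_rare)


# Toy vectors (made up for illustration, not data from the paper).
# Few shared dimensions: the two tasks are forced to pull on the same weights.
conflicting = gradient_cosine([1.0, 0.0], [-1.0, 0.5])
# More dimensions: the rare task can use a direction the common task ignores.
orthogonal = gradient_cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print(conflicting < 0, orthogonal == 0.0)
```

In this toy picture, extra capacity helps because the rare-task gradient has room to sit in directions the common-task gradient does not touch.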

419

343

27K

raox retweeted

行者达达

@dali51334388

20 days ago

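Matt Van Horn wrote a long post laying out his entire Claude Code workflow. My takeaway after reading it: he moved past "using AI to help write code" long ago. His core idea fits in one sentence: no IDE, just a plan.md and his voice.

His most counterintuitive point is that code is the last thing to appear. The moment an idea comes to mind, his first move is always to generate a plan. Traditional development is 80% writing code and 20% planning; he has flipped that ratio completely. All the thinking is captured in the plan, and execution is handed to the machine.

The workflow is supported by the Compound Engineering plugin from Every. A single plan command dispatches several agents to work in parallel: one reads your codebase, one goes through your record of past pitfalls, and one looks up external best practices. Their results are merged into a plan.md with an acceptance checklist. That file is a save point that never gets lost: if the context breaks, a new session pointed at it can pick up where the last one stopped. He has now accumulated 70 plans and made 263 commits in the past 30 days.

Input is by voice. Speech-to-text used to be held back by transcription accuracy, but once the receiver is an AI that understands context, that bottleneck disappears: you can be vague, go off topic, or start over, and it still guesses right. He even dictated an entire section of the article in a Tesla on FSD while dropping off his kids.

He routinely keeps four to six terminals open at once: one writing a plan, one executing, one researching, one fixing bugs. He switches between windows, and by the time he has made a full round, every task has moved forward a step. One person works like a team. The cost is that the MacBook battery dies in an hour.

What impressed me most is the "compounding of context." He had a meal with someone and recorded the whole conversation. Afterward he handed the transcript to Claude, which did not generate a generic proposal; it cross-referenced the conversation against the company's real codebase and all of his past strategy documents, and produced a complete proposal in one pass. That person later joined full time and is now building that product. Every document you have written and every decision you have made becomes ammunition for the next decision, and the gap widens over time.

The most telling example is the one at the end of the article that has nothing to do with code: at the side of a sports field, he helped another parent put together a Disney trip plan. Voice input of the requirements → fetch the latest community discussions → generate a day-by-day itinerary → deploy it as a web page → automatically set calendar reminders, all done at the sideline.

The real value here is not the tools but the paradigm: people handle thinking and decisions, AI handles research and execution. When the collaboration in between is smooth enough, one person can do far more than you would imagine. And it all starts with a small habit: when you have an idea, write the plan first.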

225

435

59K

raox retweeted

elvis

@omarsar0

20 days ago

// Continual Learning Bench //

One of the research areas with lots of investments is continual learning.

While there are many efforts, there is very little progress in measuring it.

So the big question is, do dedicated memory systems actually make agents learn from experience?

Continual Learning Bench says not yet. Across six expert-validated domains with shared learnable structure, naive in-context learning outperforms systems purpose-built for memory management.

CL-Bench introduces a gain metric that isolates genuine learning from prior capability, then shows agents frequently overfit to immediate observations or fail to reuse knowledge across instances.

If a plain ICL baseline beats your memory architecture, the architecture is adding overhead rather than learning.

Paper: https://t.co/iFd5SZFe3O

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

353

289

25K

raox retweeted

Vikas gupta

@vicky_grok

19 days ago

🚨 Anthropic just showed a 27-minute workshop on how to actually do prompts for Claude. Taught by the people who built it. Free. No registration. No paywall. I've seen $300 courses that don't cover what they teach in the first 8 minutes. Watch it and bookmark it now.

648

118

346K

raox retweeted

elvis

@omarsar0

19 days ago

This was one of the standout AI papers of the week.

(bookmark it)

It tackles a question most self-improving AI agents ignore: is the agent actually discovering anything, or just remixing what it already knows?

How can you tell whether the agent is doing real discovery or just confident retrieval?

The authors give three clean buckets:

- Retrieval is looking something up in a notebook you already have.

- Search is combining tools you already own in new ways.

- Discovery is inventing a new concept that wasn't in your toolkit before.

The issue is that most agents stop at the first two.

The math behind their definition (category theory plus a left Kan extension, if you care) is basically a bookkeeping trick to ask: could the old version of me have produced this result? If yes, it's not discovery. If no, something genuinely new showed up.

They build a Builder/Breaker agent that studies protein mechanics. Over four rounds, the model's fit accuracy actually drops (R² goes from 0.48 to 0.68 to 0.54 to 0.41). At first glance, that looks like a failing agent.

It isn't.

The agent kept taking on harder proteins and rewriting its theory to cover them. Data grew almost 10x while the model code grew only 1.3x. A smaller theory covering a bigger world is exactly what good science looks like.

Why does it matter?

If you optimize for accuracy alone, your self-improving agent will just settle into easy benchmarks and stop. This paper offers a cleaner success signal and asks whether the agent is compressing more of the world into less code over time.

Paper: https://t.co/Vb4TcCb5YD

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
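
The "smaller theory covering a bigger world" signal can be read as a simple ratio: how much the covered data grew per unit of growth in model code. The sketch below does that arithmetic with the post's rounded figures (data "almost 10x", code 1.3x). The function name is hypothetical, and this ratio is only an illustration; the paper's formal criterion is not given in the post.

```python
def coverage_per_code_growth(data_growth, code_growth):
    """Ratio of data growth to code growth; above 1 means more world per line of theory."""
    if code_growth <= 0:
        raise ValueError("code_growth must be positive")
    return data_growth / code_growth


# R² per round, as quoted in the post: an accuracy-only view sees a decline.
r2_by_round = [0.48, 0.68, 0.54, 0.41]
accuracy_fell = r2_by_round[-1] < r2_by_round[0]

# Rounded growth factors from the post ("almost 10x" is treated as 10.0 here).
ratio = coverage_per_code_growth(10.0, 1.3)

print(accuracy_fell)   # True
print(f"{ratio:.1f}x")  # 7.7x
```

Under these rounded numbers, each unit of code covers roughly 7.7 times more data by the final round, even though the fit score went down.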

346

406

41K

raox retweeted

Rohan Paul

@rohanpaul_ai

about 1 month ago

New Google paper: A forecast needs context, not just history.

Some patterns are caused by events, not time. Nexus reframes forecasting as a reasoning problem, where events and numbers have to explain each other.

Nexus argues that forecasting improves when models read the world around the numbers, not just the numbers themselves.

In the Zillow tests, one Claude-based version cut average MAPE by 86.6% versus direct chain-of-thought prompting.

That matters because most time series models are fluent in pattern, but mute about cause.

A housing inventory curve can reflect seasonality, mortgage pressure, migration, layoffs, and local supply, while a stock price can be bent by earnings, regulation, hype, and fear.

Nexus separates those jobs instead of asking one prompt to do everything.

One agent turns messy historical text into a clean event timeline, one reads the broad regime, another tracks local shocks, and a synthesizer reconciles them with calibration from past errors.

The interesting result is not merely that context helps, but that structure helps the language model use context without losing the time series.

The evidence is still narrow: Zillow counts, seven equities, post-cutoff data, and single-run evaluations, so this is not a universal law of forecasting.

But the direction is clear: future forecasters will not only extrapolate curves; they will argue about what made the curve move.

----

Paper Link – arxiv.org/abs/2605.14389

Paper Title: "Nexus: An Agentic Framework for Time Series Forecasting"
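
For readers unfamiliar with the metric, MAPE is the mean absolute percentage error: the average of |actual − forecast| / |actual|, expressed in percent. The sketch below shows the standard definition and how a relative cut in MAPE is computed. The sample series and the 20 → 5 figures are made up for illustration and are not from the paper; that the post's 86.6% is a relative reduction of this kind is an assumption.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent (standard definition)."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


def relative_reduction(baseline, improved):
    """Percent by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical numbers: each forecast is off by 10% of the actual value.
example_mape = mape([100.0, 200.0], [110.0, 180.0])

# Hypothetical: a baseline MAPE of 20 falling to 5 is a 75% relative cut.
example_cut = relative_reduction(20.0, 5.0)

print(round(example_mape, 6), round(example_cut, 6))
```

Note that a large relative cut says nothing about the absolute error level by itself, which is one reason the post's caveat about narrow evidence matters.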

484

393

62K

raox retweeted

Ivanka Trump

@IvankaTrump

about 1 month ago

Most of us spend years trying to change outcomes without examining the internal framework producing them. This article gets to the root by examining and then stripping away the conditioning that keeps you from becoming fully yourself and finding your bliss. Great read @thedankoe !

514

12K

raox retweeted

AI少年

@aehyok

about 2 months ago

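A complete setup for building a personal workflow with Hermes Agent that automatically monitors posts from prominent X/Twitter accounts on a schedule.

160 KOL accounts → RSS fetch → Hermes AI filtering and judgment → copy generation → Discord push

The main technical pieces are:

1. BestBlogs: an open-source project that provides an OPML list of 160 Chinese- and English-language AI figures on X, serving as the information source layer. (BestBlogs automatically pulls articles, podcasts, videos, and posts from 600+ subscription sources every day, so it is not limited to X.)

2. https://t.co/cs59ch8KIB: a free Twitter → RSS conversion service; it cannot fetch retweets or quote posts.

3. TikHub: a paid API service (about ¥0.001 per request) that can fetch retweets and quote posts.

4. The author open-sourced x-intel-monitor, which provides a template workflow that you only need to adapt to your own needs.

Finally, 不滑锅 also stressed: what you need to do is keep tuning Hermes. The collection pipeline is the lever; the writing standard is the direction. If the direction is wrong, the stronger the lever, the greater the damage.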
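
The fetch → filter → push shape of that pipeline can be sketched in a few lines. This is a minimal illustration, not the x-intel-monitor code: the sample feed is inline so nothing is fetched, a keyword check stands in for the Hermes AI judgment step, and all function names are hypothetical. The only external fact relied on is that a Discord webhook accepts a JSON body with a `content` field; the sketch builds that payload but does not send it.

```python
import json
import xml.etree.ElementTree as ET

# Inline sample standing in for a fetched RSS feed (made-up items).
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>example feed</title>
  <item><title>New agent benchmark released</title><link>https://example.com/1</link></item>
  <item><title>What I had for lunch</title><link>https://example.com/2</link></item>
</channel></rss>"""


def parse_items(rss_text):
    """Extract title and link from every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


def is_relevant(item, keywords):
    """Keyword stand-in for the AI filtering step (an assumption, not Hermes)."""
    title = item["title"].lower()
    return any(keyword in title for keyword in keywords)


def to_discord_payload(item):
    """Build the JSON body a Discord webhook expects: a 'content' string."""
    return json.dumps({"content": f"{item['title']}\n{item['link']}"})


def run_pipeline(rss_text, keywords):
    """Parse the feed, keep relevant items, and return webhook payloads."""
    return [
        to_discord_payload(item)
        for item in parse_items(rss_text)
        if is_relevant(item, keywords)
    ]


payloads = run_pipeline(SAMPLE_RSS, ["agent", "benchmark"])
print(payloads)
```

Keeping the filter as its own function is what makes the closing advice actionable: the collection code stays fixed while the judgment step is the part you keep tuning.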

210

352

29K

raox retweeted

Huan

@Huanusa

about 2 months ago

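The cover of The Economist, if you break it down: April to May, a new virus returns to public view; October, war makes a comeback...? The predictions on this cover are a little frightening.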

97K

raox retweeted

DAIR.AI

@dair_ai

about 2 months ago

Pay attention to this one if you build multi-agent systems.

Coordination is as important as prompts or agent architecture.

Multi-agent LLM systems fail in production at rates between 41% and 87%.

The majority of those failures are coordination defects, not base-model capability. Most published comparisons of multi-agent architectures can't even tell you whether the gain came from coordination or from one configuration just having larger context windows.

This new research argues that coordination should be treated as a configurable architectural layer, separable from agent logic and from information access. Then it backs the position with an information-controlled experiment: same LLM, same tools, same prompt template, same per-call output cap. The only thing that varies is coordination structure.

Why it matters: until you control for information access, "multi-agent beats single-agent" doesn't actually mean coordination won. This paper gives you a cleaner methodology for actually testing it, and a vocabulary for reasoning about coordination as architecture instead of plumbing.

Paper: https://t.co/8m0P8kCQ2a

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

264

349

49K

raox retweeted

雪踏乌云

@Pluvio9yte

about 2 months ago

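This remark from Karpathy nails the core logic of using AI tools: you should not be the bottleneck. AI used to be a question-and-answer exchange, with you in the loop acting as dispatcher. The right approach now is to design the system, press go, and then get on with whatever else you need to do.

1. "Arrange things such that they're completely autonomous" = set up Skills, Rules, Hooks, and verification loops so the agent can run the whole process by itself

2. "Maximize token throughput and not be in the loop" = your leverage depends on how well the system is designed, not on how much prompt text you type

3. "I put in very few tokens, a huge amount of stuff happens" = pay the design cost once in exchange for continuous automated output

The old way of working was: write prompt → check result → revise prompt → check result → repeat. Your output ceiling equals your typing speed. Now it should be: design the harness → set up the feedback loop → press go → go do something else. Your output ceiling equals the system's concurrency.

All of this has one prerequisite, though: your engineering foundation has to be solid enough:

1. A good verification loop: when the agent finishes, it can check by itself whether the result is correct

2. Good error recovery: when a run goes off track, it can roll back and retry by itself

3. Good modularity: every stage can be tested and replaced independently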
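
The three prerequisites map onto a small harness: a task, a separate verifier, and a retry loop that records each failure. The sketch below is one minimal way to wire that up under stated assumptions. `run_with_verification` and `flaky_task` are hypothetical names, and the toy task stands in for an agent run; it is not any specific tool's API.

```python
def run_with_verification(task, verify, max_attempts=3):
    """Call task(attempt) until verify(result) passes or attempts run out.

    Returns (result, failures). Raises RuntimeError if every attempt fails.
    The task and the verifier are passed in, so each can be tested or
    replaced independently of the loop.
    """
    failures = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = task(attempt)
        except Exception as exc:  # error recovery: log the failure and retry
            failures.append(f"attempt {attempt}: raised {exc!r}")
            continue
        if verify(result):  # verification loop: accept only checked output
            return result, failures
        failures.append(f"attempt {attempt}: failed verification")
    raise RuntimeError("; ".join(failures))


def flaky_task(attempt):
    """Toy stand-in for an agent run: wrong on the first try, right afterwards."""
    return "ok" if attempt >= 2 else "garbage"


result, failures = run_with_verification(flaky_task, lambda r: r == "ok")
print(result)    # ok
print(failures)  # ['attempt 1: failed verification']
```

The human only sees the final result or the collected failure log, which is what lets them stay out of the loop while it runs.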

13K

raox retweeted

DAIR.AI

@dair_ai

2 months ago

Most AI assistants wait for you to ask.

But a truly useful agent should notice you need help before you say anything.

New research takes a serious shot at building proactive agents that work in real time.

The work introduces PASK with three components: IntentFlow for streaming demand detection, a hybrid memory system (workspace, user, global) for long-term context, and a proactive agent framework that forms a closed loop.

They also release LatentNeeds-Bench, built from real user-consented data refined through thousands of rounds of human editing. IntentFlow scores 84.2 overall, matching Gemini-3-Flash (80.8) while most other models, including GPT-5-Mini (77.2) and Claude-Haiku-4.5 (66.2), struggle badly at this task.

Why does it matter?

The hardest part isn't complex reasoning. It's reliably detecting when a user has an unstated need versus when they don't.

Most models are either too helpful or too silent, but rarely both calibrated. This is one of the first systems to tackle proactive assistance as a real product problem.

Paper: https://t.co/EYIt2pv6fQ

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

149

137

15K

raox retweeted

向阳乔木

@vista8

2 months ago

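After finishing the biography of DeepMind founder Demis Hassabis, I used NotebookLM to extract and organize a list of every book and paper it mentions:

I. Key books and literary works

(1) Science fiction (that inspired Hassabis and others)

1. Ender's Game, by Orson Scott Card. Hassabis discovered this book during his PhD and identified with it strongly; he saw himself as a gifted youngster like Ender, charged with a mission to save humanity.

2. Foundation series, by Isaac Asimov. The protagonist, Hari Seldon, predicts the collapse of the empire and tries to avert it, which inspired Hassabis's idea of using AI to predict and avoid disasters.

3. Culture series, by Iain Banks. It depicts an interstellar society run by AI systems and enjoying extreme material abundance, which convinced Hassabis that AI can coexist peacefully with humans and enrich human experience.

4. Through the Looking-Glass, by Lewis Carroll. In his memory research, Hassabis cited the book's line that a memory that only works backwards is a poor sort of memory.

(2) Science, philosophy, and technology

1. Gödel, Escher, Bach, by Douglas R. Hofstadter. This book had a huge influence on Hassabis, David Silver, and many other AI scientists; it explores the relationship between pattern, intelligence, and consciousness.

2. The Chess Computer Handbook, by David Levy. Hassabis read it at age 12, his first exposure to Claude Shannon's theory of chess programming.

3. The Age of Spiritual Machines, by Ray Kurzweil. Shane Legg was deeply influenced by this book and accepted the view that Moore's law would lead to the "singularity."

4. Artificial General Intelligence, by Ben Goertzel. Legg suggested that Goertzel change the title from Real AI to this one, which established the term AGI.

5. Owning Your Own Shadow, by Robert A. Johnson. Mustafa Suleyman once recommended this book to Hassabis; it explores the dark side of the human self.

6. Inventing the Future, by Nick Srnicek and Alex Williams. Suleyman strongly identified with this left-wing utopian manifesto, which stresses technology's potential to liberate society.

7. The Coming Wave, by Suleyman and Michael Bhaskar. Suleyman discusses the power and risks of technology.

8. Works by Yuval Noah Harari, author of Sapiens: A Brief History of Humankind, and others. The biography notes that Harari also signed the open letter calling for a pause on training large models.

9. Spinoza, by Roger Scruton. Hassabis recommended this book to the biography's author and agrees with Spinoza's view that "understanding nature is a spiritual undertaking."

10. On Intelligence, by Jeff Hawkins. The book once made Yann LeCun skeptical of the AI startups of the time.

II. Key academic papers and research reports

(1) Early foundational papers

1. "Programming a Computer for Playing Chess" (1950). Published by Claude Shannon; it argued that a chess program is a first step toward general-purpose computers.

2. The 1956 Dartmouth conference proposal. Put forward by a group of pioneers who believed that every feature of intelligence can in principle be precisely described and simulated by a machine.

(2) Deep learning and reinforcement learning foundations

1. "A Fast Learning Algorithm for Deep Belief Nets" (2006). Published by Geoffrey Hinton; a milestone in deep learning.

2. "Large-scale Deep Unsupervised Learning using Graphics Processors" (2009). Published by Andrew Ng and colleagues; it demonstrated the importance of GPUs for AI.

3. "Deep Residual Learning for Image Recognition" (ResNet, 2016). Developed by a Microsoft team; AlphaZero also adopted this architecture.

(3) Papers on DeepMind's core achievements

1. "Patients with Hippocampal Amnesia Cannot Imagine New Experiences" (2007). Neuroscience research Hassabis published in the Proceedings of the National Academy of Sciences (PNAS).

2. "Playing Atari with Deep Reinforcement Learning" (DQN). Presented by DeepMind at NIPS and later published on the cover of Nature.

3. "Mastering the Game of Go with Deep Neural Networks and Tree Search" (AlphaGo). Published on the cover of Nature in 2016.

4. "Mastering the Game of Go without Human Knowledge" (AlphaGo Zero). Published in Nature in 2017.

5. "Mastering Chess and Shogi by Self-Play" (AlphaZero). Published in 2017.

6. "Improved Protein Structure Prediction Using Potentials from Deep Learning" (AlphaFold). Published in Nature in 2020.

7. "AlphaStar: Mastering the Real-Time Strategy Game StarCraft II". Published in Nature in 2019.

(4) Large language model and safety papers

1. "Attention Is All You Need" (2017). Published by Google researchers; it invented the Transformer architecture.

2. "Unsupervised Sentiment Neuron" (2017). Published by OpenAI; it found that a model can spontaneously learn to recognize sentiment.

3. "Improving Language Understanding with Unsupervised Learning" (GPT-1, 2018). Published by Alec Radford of OpenAI.

4. "Training Compute-Optimal Large Language Models" (the Chinchilla paper, 2022). It argued that scaling data matters more than model size.

5. "Let's Verify Step by Step" (2023). OpenAI research on fine-tuning chain-of-thought (CoT) reasoning.

6. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025). Published by the DeepSeek team; it showed that pure RL can also give rise to complex reasoning.

7. "Welcome to the Era of Experience" (2025). Co-authored by David Silver and Richard Sutton; it declares that AI is moving from being data-driven to a new stage of learning through experience (RL).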

132

207

27K

raox retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

2 months ago

SciPredict: Can LLMs Predict the Outcomes of Research Experiments in Natural Sciences? "SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies are 14-26% and human expert performance is ≈20%."

120

11K

raox retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

2 months ago

code: https://t.co/fSmpshFbsk abs: https://t.co/k167dNDb6S

raox retweeted

Thomas DiFiore

@ThomasADiFiore

2 months ago

@lukOlejnik Here it is formalized in Lean: https://t.co/W3WVltPU4S

raox retweeted

Lukasz Olejnik

@lukOlejnik

2 months ago

A physicist has written a fascinating, big, beautiful paper. Let’s not be afraid to call it what it is: groundbreaking. For hundreds of years, mathematics had dozens of “basic” functions: sine, cosine, logarithm, square root, exponential. You know these from school. Everyone does. Now it turns out that all of it is one single operator: E(x, y) = exp(x) - ln(y), and the constant 1. Sin, cos, π: everything follows from this neatly, just nest it properly. Nature hid the simplest possible description of reality. And it has just been found. The whole thing is beautiful and remarkable; here the word “groundbreaking” is not a marketing buzzword. For instance, instead of writing π or 3.14, one can now elegantly write E(E(E(1,E(E(1,E(1,E(E(1,E(E(1,E(E(1,E(1,E(E(1,1),1))),1)),E(E(E(E(E(1,E(E(1,E(1,E(E(1,E(E(E(1,E(E(1,E(1,E(E(1,1),1))),1)),E(E(1,E(E(1,E(E(1,E(E(1,1),1)),E(E(E(1,E(E(1,E(1,E(E(1,1),1))),1)),E(1,1)),1))),1)),1)),1)),1))),1)),E(E(E(1,E(E(1,E(1,E(E(1,E(E(1,E(E(1,E(1,E(E(1,1),1))),1)),E(1,1))),1))),1)),1)),1)),1),1),1))),1))),1)),E(E(E(1,E(E(1,E(1,E(E(1,E(E(1,E(E(1,E(1,E(E(1,1),1))),1)),E(1,1))),1))),1)),1)),1)),1) https://t.co/Pv2UUbTEay
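
The two simplest consequences of the stated definition can be checked by hand. Since ln(1) = 0, E(x, 1) = exp(x), so e = E(1, 1). For y > 0, E(1, y) = e − ln(y), hence E(E(1, y), 1) = e^e / y, and E(1, e^e / y) = e − (e − ln(y)) = ln(y). The sketch below verifies those identities numerically. They follow from algebra on the definition quoted in the post; whether the paper uses these exact constructions is an assumption, and the helper names are hypothetical. It works over positive reals only and does not attempt to evaluate the long expression for π.

```python
import math


def E(x, y):
    """The operator as stated in the post: E(x, y) = exp(x) - ln(y), for y > 0."""
    return math.exp(x) - math.log(y)


def exp_via_E(x):
    """exp(x) = E(x, 1), because ln(1) = 0."""
    return E(x, 1.0)


def ln_via_E(y):
    """ln(y) = E(1, E(E(1, y), 1)) for y > 0 (derivation in the lead-in)."""
    return E(1.0, E(E(1.0, y), 1.0))


e_const = E(1.0, 1.0)   # the constant e, built from E and 1 alone
zero = ln_via_E(1.0)    # ln(1), so 0 up to floating-point rounding

print(math.isclose(e_const, math.e))
print(math.isclose(exp_via_E(2.0), math.exp(2.0)))
print(math.isclose(ln_via_E(2.0), math.log(2.0), rel_tol=1e-9))
print(abs(zero) < 1e-12)
```

Each nesting level adds floating-point rounding, so deep expressions like the one in the post are better treated symbolically than evaluated in double precision.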