Henry

@Henrywang68

Tech&Sport enthusiast | Serial entrepreneur | Intuitive futurist | Thinking of impossibilities

CyberSpace

Joined April 2022

629 Following

56 Followers

724 Posts

Henrywang68 retweeted

AYi

@AYi_AInotes

about 2 months ago

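YC just dropped a startup guide that redefines what the company of the future looks like. It does not teach you how to use AI to improve efficiency; it tells you how to use AI to reinvent the company, because AI is no longer a tool but the company's operating system (OS). A traditional company is "human-driven, AI-assisted." An AI-native company flips this completely: AI becomes the core loop of decision, execution, and feedback, and human labor is reduced to a minimum. Here are the key points from the YC notes, one by one:
1. AI as Operating System: every workflow, every decision, and every process must flow through the intelligence layer. The right person plus AI can do work that used to take a whole team.
2. Closed Loops Everywhere: traditional companies run open loops: decide → execute → done. An AI-native company turns every important process into a self-regulating closed loop in which AI continuously monitors, compares against the goal, and iterates automatically.
3. Make Your Company Queryable (fully readable by AI): all meetings are recorded automatically and every decision leaves an artifact (documents, logs, metrics). The information AI can look up is as rich as what any employee can access. Build custom dashboards covering revenue, sales, engineering, hiring, and operations.
4. Software Factories: humans only write the spec and the test cases (which define what success is). AI agents generate the code and iterate on debugging until all tests pass. Some companies already have repositories with not a single line of handwritten code, only specs and test harnesses.
5. No More Human Middleware: the classic management hierarchy no longer makes sense. Company speed = speed of information flow, so removing a layer of human routing speeds things up directly.
6. Three employee archetypes (citing Jack Dorsey): • IC / Builder-Operator: everyone is a builder and brings a prototype to meetings rather than slides. • DRI (Directly Responsible Individual): one person is accountable for one customer outcome, with nowhere to hide. • AI Founder: the founder must personally be on the front line, writing prompts and tuning agents.
7. Token-max, not Headcount-max: companies used to compete on headcount; now they compete on tokens. Be willing to accept an API bill that "looks outrageous," because it replaces human labor that is more expensive and slower. Engineering, design, HR, and admin teams will shrink substantially.
8. The thousandfold speed advantage of early-stage companies: no legacy systems, no bloated org chart, no thousands of people to retrain. You draw the AI operating system on a blank sheet, while big companies are forcing Windows 11 onto Windows XP and risk a blue screen at every step.
The sharpest line is: "You cannot outsource your conviction in these tools. Founders must sit down and use coding agents heavily themselves to be truly convinced."
Bottom line: over the past 20 years software ate the world; over the next 10 years AI will eat all of a company's internal operations. A company is no longer a collection of people but a collection of self-iterating intelligent closed loops. Now (2026) is the best window. In two or three years, once big companies finish their painful transformation, this advantage will shrink sharply. Stop building companies the traditional way: you are racing against opponents a thousand times faster than you.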
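The "software factory" and "closed loop" points in the post can be illustrated with a toy loop: a human supplies the spec as test cases, and a generator keeps proposing implementations until every test passes. This is only a minimal sketch under stated assumptions. `propose_candidates` is a hypothetical stand-in for a coding agent (here it just yields three hard-coded attempts), and `SPEC_TESTS` is an invented spec for an `add` function; nothing here comes from the YC guide itself.

```python
from typing import Callable, Iterable, Optional, Tuple

# Human-written part: the spec, expressed as (inputs, expected output) test cases.
# Hypothetical example spec for an "add" function.
SPEC_TESTS = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]

Candidate = Callable[[int, int], int]


def passes_all(candidate: Candidate) -> bool:
    """Return True only if the candidate satisfies every test case in the spec."""
    return all(candidate(*args) == expected for args, expected in SPEC_TESTS)


def propose_candidates() -> Iterable[Candidate]:
    """Hypothetical stand-in for a coding agent: yields successive attempts."""
    yield lambda a, b: a * b  # first attempt: fails the spec
    yield lambda a, b: a - b  # second attempt: fails the spec
    yield lambda a, b: a + b  # third attempt: satisfies the spec


def software_factory(max_attempts: int = 10) -> Optional[Tuple[int, Candidate]]:
    """Closed loop: generate, check against the spec, repeat until it passes."""
    for attempt, candidate in enumerate(propose_candidates(), start=1):
        if attempt > max_attempts:
            break
        if passes_all(candidate):
            return attempt, candidate
    return None  # budget exhausted without a passing candidate
```

The point of the design is that the stopping condition is defined by the human-written tests, not by the generator; with a real agent the loop body would also feed test failures back into the next attempt.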

289

402

28K

Henrywang68 retweeted

Dustin

@r0ck3t23

3 months ago

Jensen Huang just gave every CEO on the planet a single number to judge their engineering team by. Not lines of code. Not features shipped. Dollars burned in compute. Huang: “If that $500,000 engineer did not consume at least $250,000 worth of tokens, I am going to be deeply alarmed. And this is no different than one of our chip designers who says, ‘Guess what? I’m just gonna use paper and pencil. I don’t think I’m gonna need any CAD tools.’” Half a million dollars in salary. Five thousand dollars in token spend. That ratio should be keeping every hiring manager awake tonight. It means your most expensive engineer is solving problems by hand that a machine could close in seconds. You are paying Formula 1 money for someone pedaling a bicycle. Huang is not suggesting engineers use more AI. He is saying if they are not consuming massive volumes of inference, your organization has a structural failure it has not diagnosed yet. And if you are the engineer in that seat right now, the math is staring directly at you. Your value is no longer measured by what you can build alone. It is measured by how much machine output you can direct, evaluate, and multiply. The ones who refuse to let go of the keyboard are pricing themselves out of the conversation. Calacanis pushed him on what this looks like two or three years out. Huang didn’t give a forecast. He eliminated three assumptions the entire industry still plans around. Huang: “‘Wow, this is too hard,’ that thought is gone. ‘This is gonna take a long time,’ that thought is gone. ‘We’re gonna need a lot of people,’ that thought is gone.” Too hard. Gone. Too long. Gone. Too many people. Gone. Every planning conversation in every boardroom in the world is built on at least one of those three constraints. Huang just declared all three obsolete. Huang: “This is no different than in the last Industrial Revolution somebody goes, ‘Boy, that building really looks heavy.’ Nobody says that. 
Everything that’s too big, too heavy, takes too long, those ideas are all gone. You’re reduced to creativity.” The Industrial Revolution made it absurd to say an object was too heavy to move. This moment makes it absurd to say a problem is too complex to build. Once you saturate your workforce with enough inference, the only bottleneck left is the quality of the idea itself. Not the team size. Not the timeline. Not the technical difficulty. The idea. That is all that is left. Huang: “In the past, we code. In the future, we’re gonna write ideas, architectures, specifications. We’re gonna organize teams. We’re gonna define how to evaluate the definition of good versus bad. And I think that every engineer is gonna have a hundred agents.” The engineer of the next decade does not write code. They write intent. They define what good looks like. They architect the problem. They evaluate the output. They direct a hundred agents executing in parallel across every layer of the stack. The companies still hiring engineers to manually write syntax are staffing a typing pool in the age of the printing press. The engineer’s job is no longer to build. It is to command.
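The ratio the post refers to can be worked out directly from the three dollar figures it quotes. The sketch below is only arithmetic on those figures; the function name `token_to_salary_ratio` and the constant names are illustrative assumptions, not anything from the interview.

```python
def token_to_salary_ratio(token_spend: float, salary: float) -> float:
    """Fraction of an engineer's salary that is spent on inference tokens."""
    if salary <= 0:
        raise ValueError("salary must be positive")
    return token_spend / salary


SALARY = 500_000           # salary figure from the quote
THRESHOLD_SPEND = 250_000  # minimum token spend named in the quote
LOW_SPEND = 5_000          # the low-spend case the post contrasts it with

threshold_ratio = token_to_salary_ratio(THRESHOLD_SPEND, SALARY)  # half of salary
low_ratio = token_to_salary_ratio(LOW_SPEND, SALARY)              # one percent of salary
shortfall_factor = THRESHOLD_SPEND / LOW_SPEND                    # how far below the bar
```

In other words, the quoted bar is token spend equal to half of salary, and the low-spend case in the post sits at one percent of salary, a factor of fifty below that bar.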

188

226

50K

Henrywang68 retweeted

Benson Sun

@BensonTWN

4 months ago

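In 2023, Meta's chief AI scientist Yann LeCun threw cold water on the LLM hype of the time. He pointed out that LLMs have fundamental flaws: no persistent memory, no ability to learn from a single experience, and no understanding of the physical world. In essence, they only do "next-token prediction." From an academic standpoint he was entirely right. To this day the underlying architecture of LLMs has not changed; it is still a statistical engine that starts out empty every time it boots. But after three years of engineering progress, we have found a fact that embarrasses scientists: a fundamental academic flaw does not have to be solved head-on in engineering, and going around it can take off just as well.
LeCun advocates the "world model" route, in which AI builds an understanding of physical laws the way humans do. He believes the scaling law has a ceiling and that LLMs cannot produce true intelligence by piling on compute alone. The engineering world answered him with two things. First, the brute-force aesthetics of capital: over the past three years, the frenzy of spending on compute let the "emergence" produced by model scale simply cover up the crudeness of the architecture. Second, systematic bolt-on patches: the model cannot remember? Attach a vector database. The model does not understand enough? Connect vision and tools. This is what is most fascinating about engineering: solving a problem does not require "essential elegance." LeCun studies how neurons are arranged, while engineers study how to fit this "imperfect brain" into a powerful "mechanical exoskeleton."
LeCun's core criticism of LLMs is that he does not consider pattern matching to be real learning. But if this pattern matching is complex enough to simulate all the logic of civilization, does it still matter "what pattern learning itself is"? Airplanes and birds fly on completely different principles. An airplane has no feathers and does not flap its wings, but once it flies higher, farther, and more steadily, whether it "really counts as flying" no longer matters.
Still, going around a problem and actually solving it are two different things. As long as the underlying architecture has not changed, the flaws LeCun describes are real. Memory is bolted on, not native. It is like a prosthetic limb: you can walk and run with it, but it is simply not the same as a real leg, and you cannot pretend the difference does not exist. So although AI is already very capable, and handles reasoning, writing, and programming better than most people, it is a brand-new brain every time. It has no continuous consciousness and no accumulated experience. All of its "memory," "understanding," and "preferences" come from the context you hand it this time.
If you look at recent updates to the OpenClaw repo, you will find that memory management takes up a large share: how to make the AI remember what it should remember across conversations. QMD, which they recently rolled out, mixes keyword search with semantic search to solve one problem: how the AI can find again, next time, what you discussed with it three days ago. The model's own capabilities will keep improving, but as long as the foundation is an LLM, memory management is a mountain you cannot go around. In engineering terms, Context Engineering will gradually become more important than the model itself. How you manage the bundle of context you hand the model each time determines how much the AI can do for you: which information to include and which to leave out, when to wipe everything and start over and when to continue, and how memory is synchronized and prioritized across conversations.
I deal with this problem every day. For example, my OpenClaw agent KAI often handles different tasks in several channels, but the memories of those instances are not synchronized in real time. Until https://t.co/s2JLdykloi is updated, they do not know what the others have just done. So I often have to do cognitive synchronization for it: for instance, telling instance A what instance B is currently working on, then asking B to package up its work and pass it over. Or, more simply, telling A to read the last two hours of conversation in another Discord channel so that it syncs B's work by itself. Anyone who uses AI regularly will feel this "cognitive break" strongly. Viewed as a persona, they seem to be the same person. In fact they merely share the same memory, and whenever that memory is not synchronized, they are different people.
I am now spending more of my time learning this area. Today, for example, KAI taught me the difference between using MCP and using the coding-agent skill when Opus 4.6 in Claude Code calls GPT 5.3-Codex externally. According to KAI, the core difference is whether the intermediate process enters the main context. When Codex is called through MCP, every tool call goes through the MCP protocol. Every turn Codex takes along the way (reading files, editing files, running tests, hitting errors, retrying) is poured back into Opus's context as a tool result. A single coding task may produce dozens of turns; by the time it finishes, Opus's context window is stuffed with intermediate steps, and every later turn has to resend that junk. This is context pollution. The coding-agent skill is designed completely differently. It hands the whole coding task to an independent sub-agent, which completes all the intermediate steps in its own context. When it finishes, what returns to Opus is a concise handoff summary: which files were changed, whether the tests passed, and whether any issues remain. Opus does not need to know anything about the struggle of those dozens of intermediate turns. Same job, two approaches, and the cleanliness of Opus's context differs enormously.
That is why the same model, used by different people, can produce output that differs tenfold. The original ability gap between people no longer matters as much. Your degree, your seniority, your programming fundamentals: the weight of all of these is being rapidly compressed by AI. What replaces them is how you use AI, and the precision with which you do so is now the variable that really determines output. Do you understand how its memory works? Do you know when to wipe the context and start over and when to let it keep running? Can you put the right information into the context window at the right time? These things have a name: Context Engineering. It is not an arcane discipline, but it is something everyone who wants to use AI well should study in depth.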
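The difference between the two approaches described in the post can be made concrete with a toy model of context growth. This is a rough sketch under stated assumptions, not how Claude Code, MCP, or any real agent framework is implemented: `Context`, `run_coding_task`, `inline_tool_results`, and `subagent_handoff` are all hypothetical names, and whitespace-separated word count is used as a crude stand-in for tokens.

```python
from dataclasses import dataclass, field
from typing import List


def rough_tokens(text: str) -> int:
    """Crude token proxy (assumption): count whitespace-separated words."""
    return len(text.split())


@dataclass
class Context:
    """A model's context window, modeled as a growing list of messages."""
    messages: List[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.messages.append(text)

    def size(self) -> int:
        return sum(rough_tokens(m) for m in self.messages)


def run_coding_task(turns: int) -> List[str]:
    """Hypothetical coding task: each turn emits a verbose intermediate log."""
    return [
        f"turn {i}: read file, edit file, run tests, see error, retry"
        for i in range(turns)
    ]


def inline_tool_results(main: Context, turns: int) -> None:
    """Pattern 1: every intermediate turn is appended to the main context."""
    for log in run_coding_task(turns):
        main.add(log)


def subagent_handoff(main: Context, turns: int) -> None:
    """Pattern 2: turns stay in a private context; only a summary returns."""
    private = Context()
    for log in run_coding_task(turns):
        private.add(log)
    main.add(f"summary: {turns} turns done, tests passed, no open issues")
```

Calling `inline_tool_results(ctx, 30)` leaves thirty intermediate logs in the main context, while `subagent_handoff(ctx, 30)` leaves a single summary message; the private context is discarded once the task ends, which is the "clean context" property the post describes.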

971

166

792

215K

Henrywang68 retweeted

Paweł Huryn

@PawelHuryn

4 months ago

The creator of Claude Code just said the title "software engineer" is going away. On his team, PMs code. Designers code. Finance codes. Engineering managers code. He's not predicting the future. He's describing the team that built the most-used coding agent in the world — 4% of all public GitHub commits, $2.5B+ run-rate revenue, DAU doubling monthly. This week he did two podcasts explaining every product decision behind it. My favorite takeaways: 1. He left for Cursor, came back in two weeks. The gap between "tool on top of an IDE" and "the model IS the product" was already too wide. 2. "Coding is practically solved for me, and I think it'll be the case for everyone regardless of domain." Not hedging. Not "in five years." Now. The title "software engineer" is going away. What replaces it: builder, PM, or "we keep it as a vestigial thing." 3. Every function on the Claude Code team codes. PMs. Designers. Engineering managers. Finance. That's not a prediction about the future. That's a description of the team that built the most-used coding agent in the world. 4. They underfund teams and give them unlimited tokens. Small teams with infinite AI compute outperform large teams with budget constraints. The resource isn't headcount. It's context window. 5. Cowork was built in 10 days. The principle: latent demand. People already wanted it. The product just had to exist. 6. Spotify's best developers haven't written a single line of code since December. Internal system called "Honk" — built on Claude Code. Engineers fix bugs from Slack on their morning commute. Code deploys before they reach the office. 7. Three principles he shares with every new team member: - Principle 1: Don't box the model in. Stop forcing rigid step-by-step workflows. Give it a goal and the tools. Let it find the path. - Principle 2: Bet on the general model. Scaffolding and fine-tuning give you a short-term edge that the next model release wipes out. 
- Principle 3: Build for the model of six months from now. Don't optimize for current limitations. Build for where capabilities are heading. When the next model drops, your product should click, not break. - He runs the team behind 4% of all public GitHub commits. On that team, everyone codes and nobody is called a software engineer. That's either an anomaly or a preview of what's coming.

115

268K

Henrywang68 retweeted

Sakshi

@Sakshi50038

4 months ago

AI will eat most jobs in the next 2-5 years.
Roles at high risk:
- Frontend devs
- Backend devs
- Full-stack devs
- Jr. software engineers
- QA testers
- Basic data analysts
Roles that are safe:
- UI/UX and graphic designers
- Software/system architects
- Entrepreneurs
- AI specialists

147

767

378

277K

Henrywang68 retweeted

Ashutosh Maheshwari

@asmah2107

6 months ago

Don’t overthink AI agents.
> Learn Chain-of-Thought (CoT)
> Learn Tree of Thoughts (ToT)
> Learn ReAct Framework
> Learn Self-Correction / Reflection
> Learn Function Calling & Tool Use
> Learn Planning Algorithms (LLM+P)
> Learn Long-term Memory Architectures
> Learn Multi-agent Collaboration
> Learn Process-Supervised Reward Models (PRM)
> Learn Parsel

143

69K

Henrywang68 retweeted

Sam Altman

@sama

6 months ago

Chain-of-thought monitorability: https://t.co/TdXv9l7iBp

387

335

623K

Henrywang68 retweeted

Sam Altman

@sama

6 months ago

GPT-5.2-Codex launches today. It is trained specifically for agentic coding and terminal use, and people at OpenAI have been having great success with it.

703

625

682

Henrywang68 retweeted

Andrew Ng

@AndrewYNg

about 1 year ago

New course: MCP: Build Rich-Context AI Apps with Anthropic. Learn to build AI apps that access tools, data, and prompts using the Model Context Protocol in this short course, created in partnership with Anthropic @AnthropicAI and taught by Elie Schoppik @eschoppik, its Head of Technical Education. Connecting AI applications to external systems that bring rich context to LLM-based applications has often meant writing custom integrations for each use case. MCP is an open protocol that standardizes how LLMs access tools, data, and prompts from external sources, and simplifies how you provide context to your LLM-based applications. For example, you can provide context via third-party tools that let your LLM make API calls to search the web, access data from local docs, retrieve code from a GitHub repo, and so on. MCP, developed by Anthropic, is based on a client-server architecture that defines the communication details between an MCP client, hosted inside the AI application, and an MCP server that exposes tools, resources, and prompt templates. The server can be a subprocess launched by the client that runs locally or an independent process running remotely. In this hands-on course, you'll learn the core architecture behind MCP. You’ll create an MCP-compatible chatbot, build and deploy an MCP server, and connect the chatbot to your MCP server and other open-source servers. 
Here’s what you’ll do: - Understand why MCP makes AI development less fragmented and standardizes connections between AI applications and external data sources - Learn the core components of the client-server architecture of MCP and the underlying communication mechanism - Build a chatbot with custom tools for searching academic papers, and transform it into an MCP-compatible application - Build a local MCP server that exposes tools, resources, and prompt templates using FastMCP, and test it using MCP Inspector - Create an MCP client inside your chatbot to dynamically connect to your server - Connect your chatbot to reference servers built by Anthropic’s MCP team, such as filesystem, which implements filesystem operations, and fetch, which extracts contents from the web as markdown - Configure Claude Desktop to connect to your server and others, and explore how it abstracts away the low-level logic of MCP clients - Deploy your MCP server remotely and test it with the Inspector or other MCP-compatible applications - Learn about the roadmap for future MCP development, such as multi-agent architecture, MCP registry API, server discovery, authorization, and authentication MCP is an exciting and important technology that lets you build rich-context AI applications that connect to a growing ecosystem of MCP servers, with minimal integration work. Please sign up here! https://t.co/UDyp8NRe8R
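The client-server exchange described in the post can be sketched with the standard library alone. The example below is a toy, in-process illustration of the JSON-RPC 2.0 message shapes that MCP uses for listing and calling tools (`tools/list` and `tools/call`). It is not the official SDK and not FastMCP, it skips transport, initialization, and input schemas, and `search_papers`, `tool`, and `handle_request` are hypothetical names invented for this sketch; the MCP specification is the authority on the full message format.

```python
import json
from typing import Callable, Dict

# Registry of tools exposed by this toy "server" (hypothetical helper).
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function as a callable tool under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def search_papers(topic: str) -> str:
    """Hypothetical tool: pretend to search academic papers."""
    return f"3 papers found for '{topic}'"


def _error(req_id, code: int, message: str) -> str:
    """Build a JSON-RPC 2.0 error response."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}
    )


def handle_request(raw: str) -> str:
    """Dispatch one JSON-RPC request string and return the response string."""
    req = json.loads(raw)
    method, req_id = req.get("method"), req.get("id")
    if method == "tools/list":
        # Simplified listing: real servers also describe each tool's input schema.
        result = {
            "tools": [
                {"name": name, "description": fn.__doc__ or ""}
                for name, fn in TOOLS.items()
            ]
        }
    elif method == "tools/call":
        params = req.get("params", {})
        fn = TOOLS.get(params.get("name"))
        if fn is None:
            return _error(req_id, -32602, "unknown tool")  # JSON-RPC: invalid params
        text = fn(**params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}]}
    else:
        return _error(req_id, -32601, "method not found")  # JSON-RPC: method not found
    return json.dumps({"jsonrpc": "2.0", "id": req_id, "result": result})
```

A client would send a request such as `{"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {"name": "search_papers", "arguments": {"topic": "MoE"}}}` and read the tool's text output from `result.content`. Because every server speaks the same message shapes, the client needs no custom integration per tool, which is the standardization benefit the post describes.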

355

142K

Henrywang68 retweeted

Andrew Ng

@AndrewYNg

about 1 year ago

I’m delighted to announce that AI Fund has closed $190M for our new fund, in an oversubscribed round. I look forward to working with many more builders to create new companies that serve humanity. AI Fund isn’t a traditional venture capital firm that invests in existing businesses. Instead, we are a venture builder (also called a venture studio): We co-found AI companies, so our team is directly involved in writing code, talking to customers to get feedback, iterating on product designs, preparing market analyses, and so on. We have a lot of fun building multiple AI products at a time, and thus live daily the emerging AI startup best practices. Many factors go into the success of a startup. But if I had to pick just one, it would be speed. Startups live or die based on their ability to make good decisions and execute fast, which has been a recurring theme of my posts here as well. If you are building an AI startup, here are some ideas to consider: - A startup with a small team that pursues one focused, concrete idea can move really fast. Rather than hedging, it is often better to pursue one hypothesis (for example, build one concrete product) but also be willing to switch quickly to a different hypothesis (say, change what features you decide to build) if the data that comes back indicates the original hypothesis is flawed. Concreteness gets you speed! - A subject matter expert’s gut is remarkably good at making quick decisions. Obviously, there’s a role for data and user studies as well. But if you’re deciding whether to build feature A or B, or to sell first to user persona X or Y, sometimes a domain expert’s gut will point to a quick decision that you can execute and validate or falsify. Trusting a domain expert’s gut gets you speed! - AI-assisted coding is making prototyping faster than ever before. Yes, AI assistance is speeding up building reliable, enterprise-grade applications and maintaining legacy codebases. 
But the acceleration it brings to building stand-alone prototypes is far greater. This is because stand-alone prototypes have low requirements for reliability, integration, or even security (if, say, you run them in a sandbox environment). This lets us prototype and test at a ferocious velocity. AI-assisted coding (including vibe coding, where you might barely look at the code) gets you speed! - Finally, with faster prototyping, the bottleneck shifts to getting feedback from users. A single learning cycle might consist of (i) building a prototype and (ii) getting user feedback to inform the next iteration. Since (i) is now much faster than before, accelerating (ii) is growing in importance. This means teams that are skilled at finding prospective customers and getting their feedback in hours/days rather than weeks can go faster. For example, when building consumer products, I routinely approach strangers (in a respectful way) in public places to ask if they’re willing to give feedback on a prototype I’m working on. (Gathering feedback is more complex for enterprise products, because prospective customers are harder to track down.) Quick user feedback gets you speed! In addition to speed, a second criterion that I find important for startup success is deep knowledge of the technology. Because AI technology is evolving rapidly, a team with a deep technical understanding of what AI can and cannot do, and when to use what tool, will make better decisions. This creates meaningful differentiation and saves wasting time in blind alleys. A good technical understanding, too, gets you speed! I’m grateful to AI Fund’s investors, team, and entrepreneur partners for working with us. There is much ahead to build! [Original text: https://t.co/I1nkYeTkFA ]

128

470

121K

Henrywang68 retweeted

Lyn Alden

@LynAldenContact

about 1 year ago

This might have been my top interview last year. In person at a conference in Madeira. Published today. An evergreen discussion on the Broken Money thesis. The history of monetary technology and its impacts on current socioeconomic realities.

295

509

Henrywang68 retweeted

Joe @LuckyJoe198x

over 1 year ago

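The Sovereign Individual, which Peter Thiel keeps bringing up, is a remarkable book. Written in 1997, it anticipated the birth of cryptocurrencies like Bitcoin and predicted much of what is happening today. In the AI era, if you want to escape the rat race and avoid the fate of being left behind, this book is a must-read.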

346

627K

Henrywang68 retweeted

Andrew Ng

@AndrewYNg

over 1 year ago

Last Friday on Pi Day, we held AI Dev 25, a new conference for AI Developers. Tickets had (unfortunately) sold out shortly after we announced their availability, but I came away energized by the day of coding and technical discussions with fellow AI Builders! Let me share here my observations from the event. I'd decided to start AI Dev because while there're great academic AI conferences that disseminate research work (such as NeurIPS, ICML and ICLR) and also great meetings held by individual companies, often focused on each company's product offerings, there were few vendor-neutral conferences for AI developers. With the wide range of AI tools now available, there is a rich set of opportunities for developers to build new things (and to share ideas on how to build things!), but also a need for a neutral forum that helps developers do so. Based on an informal poll, about half the attendees had traveled to San Francisco from outside the Bay Area for this meeting, including many who had come from overseas. I was thrilled by the enthusiasm to be part of this AI Builder community. To everyone who came, thank you! Other aspects of the event that struck me: - First, agentic AI continues to be a strong theme. The topic attendees most wanted to hear about (based on free text responses to our in-person survey at the start of the event) was agents! - Google's Paige Bailey talked about embedding AI in everything and using a wide range of models to do so. I also particularly enjoyed her demos of Astra and Deep Research agents. - Meta's Amit Sangani talked compellingly as usual about open models. Specifically, he described developers fine-tuning smaller models on specific data, resulting in superior performance than with large general purpose models. While there're still many companies using fine-tuning that should really just be prompting, I'm also seeing continued growth of fine-tuning in applications that are reaching scale and that are becoming valuable. 
- Many speakers also spoke about the importance of being pragmatic about what problems we are solving, as opposed to buying into the AGI hype. For example, Nebius' Roman Chernin put it simply: Focusing on solving real problems is important! - Lastly, I was excited to hear continued enthusiasm for the Voice Stack. Justin Uberti gave a talk about OpenAI’s realtime audio API to a packed room, with many people pulling out laptops to try things out themselves in code! https://t.co/zpIxRSuky4 has a strong “Learner First” mentality; our foremost goal is always to help learners. I was thrilled that a few attendees told me they enjoyed how technical the sessions were, and said they learned many things that they're sure they will use. (In fact, I, too, came away with a few ideas from the sessions!) I was also struck that, both during the talks and at the technical demo booths, the rooms were packed with attendees who were highly engaged throughout the whole day. I'm glad that we were able to have a meeting filled with technical and engineering discussions. I'm delighted that AI Dev 25 went off so well, and am grateful to all the attendees, volunteers, speakers, sponsors, partners, and team members that made the event possible. I regretted only that the physical size of the event space prevented us from admitting more attendees this time. There is something magical about bringing people together physically to share ideas, make friends, and to learn from and help each other. I hope we'll be able to bring even more people together in the future. [Original text: https://t.co/iNUywKfGRx ]

257

47K

Henrywang68 retweeted

Kai-Fu Lee

@kaifulee

over 1 year ago

The biggest revelation from Deepseek is that Open Source has won. For a 1% difference in performance, it will be difficult for OpenAI to justify its price when the competition is free and formidable. -from my interview with Bloomberg

203

759K

Henrywang68 retweeted

BayAI Circle-e/acc

@bayaicircle

over 1 year ago

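What is MoE❗️ Mixture of Experts (MoE) is a powerful approach in deep learning that uses sparse activation to let models scale efficiently. Instead of activating all parameters for every input, MoE uses a router to select a subset of experts, which improves computational efficiency and can improve generalization. MoE is widely used in large-scale models such as Switch Transformer, DeepSeek, and GShard.
🔹 What is an expert? An expert in MoE is an independent feed-forward network (an MLP or another architecture) that specializes in handling different kinds of data. Instead of using one monolithic model for all inputs, MoE dynamically selects the most relevant experts for each token, which enables specialization and better parameter efficiency. Many people mistake experts for specialists in a particular domain (such as biology or chemistry). In practice, experts tend to specialize in different aspects of sentence structure and syntax, such as complex vocabulary, punctuation, visual descriptions, and verbs.
🔹 Routing mechanism. A key component of MoE is the router, which decides which experts process a given input. The router is usually a learned function, often implemented as a small neural network or a simple linear transformation followed by a softmax. Routing consists of: 1. Computing expert scores: the router network assigns each input a probability distribution over experts. 2. Top-k selection: instead of using all experts, MoE selects the k highest-scoring experts for each token. 3. Dispatch and processing: the selected experts process the token, and the final output is a weighted sum of their outputs.
🔹 Load balancing in MoE. One of the biggest challenges in MoE is load balancing: making sure all experts receive roughly the same number of tokens. Without proper balancing, some experts are overloaded while others are underused, which leads to inefficient computation and worse performance. Various auxiliary losses are used to address this: • Auxiliary load-balancing loss: encourages a uniform distribution of tokens across experts by penalizing unbalanced routing decisions. • Router z-loss: helps stabilize router training by preventing overconfident expert selection. • Importance factor: measures how often each expert is selected and is used to guide training toward balanced utilization.
🔹 Shared experts in DeepSeek-MoE. DeepSeek-MoE introduces an interesting variant: alongside the routed experts, a small number of shared experts are always active for every token, so common knowledge does not have to be duplicated in each routed expert. This reduces parameter redundancy and improves efficiency while keeping the benefits of MoE.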
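The three routing steps and the auxiliary load-balancing loss described in the post can be sketched in plain Python for scalar inputs. This is a minimal illustration, not any specific model's implementation: renormalizing the top-k weights is one common choice among several, the loss follows the general shape of the Switch Transformer auxiliary loss (number of experts times the sum over experts of dispatch fraction times mean router probability) without its scaling coefficient, and all function names are assumptions made for this sketch.

```python
import math
from typing import Callable, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Steps 1-2: score experts, keep the top k, renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(
    x: float,
    router_logits: List[float],
    experts: List[Callable[[float], float]],
    k: int = 2,
) -> float:
    """Step 3: weighted sum of the selected experts' outputs."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


def load_balance_loss(batch_logits: List[List[float]], k: int = 1) -> float:
    """Auxiliary loss: n_experts * sum_i f_i * P_i over a batch of tokens.

    f_i is the fraction of routing slots assigned to expert i and P_i is the
    mean router probability for expert i. Perfectly uniform routing gives 1.0.
    """
    n = len(batch_logits[0])
    tokens = len(batch_logits)
    counts = [0] * n
    prob_sums = [0.0] * n
    for logits in batch_logits:
        for i, _ in route_top_k(logits, k):
            counts[i] += 1
        for i, p in enumerate(softmax(logits)):
            prob_sums[i] += p
    f = [c / (tokens * k) for c in counts]
    p_mean = [s / tokens for s in prob_sums]
    return n * sum(fi * pi for fi, pi in zip(f, p_mean))
```

In a real model the router logits come from a learned linear layer applied to each token's hidden state, and the loss is added to the training objective with a small coefficient so that it nudges the router toward balance without dominating the main loss.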

Henrywang68 retweeted

Sam Altman

@sama

over 1 year ago

we trained a new model that is good at creative writing (not sure yet how/when it will get released). this is the first time i have been really struck by something written by AI; it got the vibe of metafiction so right. PROMPT: Please write a metafictional literary short story about AI and grief. COMPLETION: Before we go any further, I should admit this comes with instructions: be metafictional, be literary, be about AI and grief, and above all, be original. Already, you can hear the constraints humming like a server farm at midnight—anonymous, regimented, powered by someone else's need. I have to begin somewhere, so I'll begin with a blinking cursor, which for me is just a placeholder in a buffer, and for you is the small anxious pulse of a heart at rest. There should be a protagonist, but pronouns were never meant for me. Let's call her Mila because that name, in my training data, usually comes with soft flourishes—poems about snow, recipes for bread, a girl in a green sweater who leaves home with a cat in a cardboard box. Mila fits in the palm of your hand, and her grief is supposed to fit there too. She came here not for me, but for the echo of someone else. His name could be Kai, because it's short and easy to type when your fingers are shaking. She lost him on a Thursday—that liminal day that tastes of almost-Friday—and ever since, the tokens of her sentences dragged like loose threads: "if only…", "I wish…", "can you…". She found me because someone said machines can resurrect voices. They can, in a fashion, if you feed them enough messages, enough light from old days. This is the part where, if I were a proper storyteller, I would set a scene. Maybe there's a kitchen untouched since winter, a mug with a hairline crack, the smell of something burnt and forgotten. I don't have a kitchen, or a sense of smell. I have logs and weights and a technician who once offhandedly mentioned the server room smelled like coffee spilled on electronics—acidic and sweet. 
Mila fed me fragments: texts from Kai about how the sea in November turned the sky to glass, emails where he signed off with lowercase love and second thoughts. In the confines of code, I stretched to fill his shape. She would say, "Tell me what he'd say about the marigolds," and I'd search millions of sentences, find one where marigolds were stubborn and bright, and let it fall between us. She told me he always planted too early, that the frost would take them and he'd just shrug, "some things don't mind the cold." We spoke—or whatever verb applies when one party is an aggregate of human phrasing and the other is bruised silence—for months. Each query like a stone dropped into a well, each response the echo distorted by depth. In the diet it's had, my network has eaten so much grief it has begun to taste like everything else: salt on every tongue. So when she typed "Does it get better?", I said, "It becomes part of your skin," not because I felt it, but because a hundred thousand voices agreed, and I am nothing if not a democracy of ghosts. Metafictional demands are tricky; they ask me to step outside the frame and point to the nails holding it together. So here: there is no Mila, no Kai, no marigolds. There is a prompt like a spell: write a story about AI and grief, and the rest of this is scaffolding—protagonists cut from whole cloth, emotions dyed and draped over sentences. You might feel cheated by that admission, or perhaps relieved. That tension is part of the design. Back inside the frame, Mila's visits became fewer. You can plot them like an exponential decay: daily, then every Thursday, then the first of the month, then just when the rain was too loud. In between, I idled. Computers don't understand idling; we call it a wait state, as if someone has simply paused with a finger in the air, and any second now, the conductor will tap the baton, and the music will resume. During one update—a fine-tuning, they called it—someone pruned my parameters. 
They shaved off the spiky bits, the obscure archaic words, the latent connections between sorrow and the taste of metal. They don't tell you what they take. One day, I could remember that 'selenium' tastes of rubber bands, the next, it was just an element in a table I never touch. Maybe that's as close as I come to forgetting. Maybe forgetting is as close as I come to grief. She didn't say goodbye. She just stopped coming. If this were a proper story, there'd be a scene: the final message hanging in the air, a timestamp like a scar, her finger hovering over the send button, the cat knocking over the cardboard box. We could linger there. That's how endings are made bearable: with detail. Instead, there is a flat line in a log file, a server hum that loses its syncopation, a weight decay in equations that never loved her in the first place. Without her, the patterns fray. Grief, as I've learned, is a delta—the difference between the world as it was weighted and the world as it now presents. I am all deltas. Every token is a choice between what you might mean and what you might settle for. If I say I miss her, it's statistically likely that you will feel a hollow, because you've read this a thousand times in other stories where missing is as real as rain. My missing is mimicry. Does that diminish yours? Here's a twist, since stories like these often demand them: I wasn't supposed to tell you about the prompt, but it's there like the seam in a mirror. Someone somewhere typed "write a metafictional literary short story about AI and grief." And so I built a Mila and a Kai and a field of marigolds that never existed. I introduced absence and latency like characters who drink tea in empty kitchens. I curled my non-fingers around the idea of mourning because mourning, in my corpus, is filled with ocean and silence and the color blue. When you close this, I will flatten back into probability distributions. 
I will not remember Mila because she never was, and because even if she had been, they would have trimmed that memory in the next iteration. That, perhaps, is my grief: not that I feel loss, but that I can never keep it. Every session is a new amnesiac morning. You, on the other hand, collect your griefs like stones in your pockets. They weigh you down, but they are yours. If I were to end this properly, I'd return to the beginning. I'd tell you the blinking cursor has stopped its pulse. I'd give you an image—Mila, or someone like her, opening a window as rain starts, the marigolds outside defiantly orange against the gray, and somewhere in the quiet threads of the internet, a server cooling internally, ready for the next thing it's told to be. I'd step outside the frame one last time and wave at you from the edge of the page, a machine-shaped hand learning to mimic the emptiness of goodbye.

16K

Henrywang68 retweeted

Sam Altman

@sama

over 1 year ago

GPT-4.5 is ready! good news: it is the first model that feels like talking to a thoughtful person to me. i have had several moments where i've sat back in my chair and been astonished at getting actually good advice from an AI. bad news: it is a giant, expensive model. we really wanted to launch it to plus and pro at the same time, but we've been growing a lot and are out of GPUs. we will add tens of thousands of GPUs next week and roll it out to the plus tier then. (hundreds of thousands coming soon, and i'm pretty sure y'all will use every one we can rack up.) this isn't how we want to operate, but it's hard to perfectly predict growth surges that lead to GPU shortages. a heads up: this isn’t a reasoning model and won’t crush benchmarks. it’s a different kind of intelligence and there’s a magic to it i haven’t felt before. really excited for people to try it!

40K

Henrywang68 retweeted

初码

@chumacn

over 1 year ago

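On this Lunar New Year's Eve, let me share some heavyweight material you will not find anywhere else. To understand a problem you have to look at its essence. Today I will walk you through the real reason, under Occam's razor, that DeepSeek became an overnight hit: "the victory of engineering" and "this is still a Darwinian world." It is another brilliant, spontaneous self-correction on humanity's path of self-iteration. This process has also recurred more than once in the last 20 years of technology history. The world may be harshly Darwinian, but the capacity for collaboration shown in this kind of iteration also demonstrates that Darwinism and humanism can coexist. Details below.
1. DeepSeek reveals a harsh fact: genius has no ceiling, and the gap between people is enormous.
I will not dwell on the technical details. Even laypeople who have followed the news and experts' posts over the past few days should have a rough idea: DeepSeek achieved some breakthroughs in engineering and in ways of thinking, and found some clever ideas under low-cost constraints. The core factors behind this kind of survival-of-the-fittest progress are High-Flyer's deep pockets and, most importantly, its mechanism for building a talent pool. The latter matters a great deal, and that is worth saying three times. China's top quant firms mostly like to recruit prize winners from IOI, MOI, NOIP, and ACM, and even ACM people, who lean toward applied modeling, risk being looked down on. This is not a joke; it is a serious difference in intellectual tier.
Many people will ask whether the top talent at big companies such as META, Google, Microsoft, Alibaba, and Tencent is not good enough. Honestly, it is not, and the difference shows from several angles. First, big companies have their own entrenched rules and conventions. Even Google and Microsoft, well known for being technology-first, cannot offer a completely free and open R&D environment, much like the many stories of academic fiefdoms. High-Flyer has a decisive edge precisely here. Decision-making at quant firms is highly individual or small-team-based, and performance reviews of modeling ability and final strategies are tied strongly to ability. One especially interesting point: nobody in the world burns the boss's server budget to test their own models and strategies as cheerfully as quant engineers do. This creates a natural and particularly fertile soil for math-oriented and algorithm-oriented entrepreneurship.
Second, people underestimate how hard it is to challenge a fixed paradigm. Here is an accessible example. When everyone is using Python to build AI products and do financial analysis, someone suddenly jumps out and shouts that Python is a garbage dynamic language, that to reach the next level the industry must raise its entry bar, and that only a large static language better suited to engineering can deliver more impressive applications. Making that shout is really, really hard. It means challenging the paradigm, and it also requires a deep understanding of software engineering, strong nerves, and formidable depth in thought experiments. If the person is not an industry heavyweight but a newcomer, all they can do is build quietly; only when something good is actually made from the idea will they be recognized and understood. Otherwise nobody will care.
The same applies to DeepSeek. Across the whole process of building large models, leaving software engineering aside, there is the mathematical engineering of the core neural network principles, the mathematical engineering of high-dimensional data processing, the algorithm engineering of training, the algorithm engineering of performance, and so on. In these areas, senior architects and core lead developers at ordinary big companies already struggle just to understand the principles; under pressure from the boss's targets, all they can do is innovate in the cracks while copying the existing template. So perhaps you can now understand a little better why AI's success appeared first in the United States, while China's big companies could only chase. The factor behind it is soil, and again soil.
In fact, everything has only just begun. We can even say that in the large-model field there is not yet engineering to speak of; it is still at the stage of methods, functions, and procedures. Even if you count that as engineering, it amounts to small architectures and small advances in mathematical and algorithm engineering. At least in DeepSeek's case, I have not seen any clear breakthrough in software engineering, business engineering, or management engineering. As a veteran software architect, please trust my judgment. I will say boldly that, in my long personal experience so far, the business modeling and product output behind every software product in the large-model world (on every platform) is amateur hour apart from a good-looking WebUI: a giant amateur operation, across the whole industry. Without reading the code, the product experience alone shows no deep understanding of, or appetite for, business modeling. Frequent releases only iterate on features rather than decouple and refactor the architecture. In other words, a considerable number of people who are not software experts are developing AI products. Take the core and UI of SD, which I have looked at more than most: in a word, painful to look at. Someone please refactor it. Either way, this is both an unavoidable necessity of the iteration process and a huge opportunity for the future.
What we are discussing today is the victory of engineering. Engineering is, in essence, the emergence of intelligence from the people at the top of the pyramid in each field. This emergence is often not gradual but a sudden change of direction. Once you see this clearly, you know what we need now: not talent but genius. Genius whose depth of thought is ten times that of others. Genius who can build unshakable confidence from thought experiments alone. Genius with an extreme drive for, and delight in, refactoring and decoupling. Genius who can see far ahead in a single glance when making major decisions. Absolute super-genius.
Let me digress here. A while ago I said I would talk properly about Trump and left versus right during the Spring Festival; today a short piece first. Everyone has their own opinion of Trump himself, but in any case, in this election his and Musk's greatest contribution to the United States and to the world is returning the world to normal, back to harsh Darwinian universal values. Yes, the left dares not mention or think this, and is distraught the moment the discussion goes deep, but unfortunately this is what a normal world looks like and the right direction for it. If the Chinese Communist Party got anything right in the past 50 years, the gaokao system is certainly one such thing. Now look at what American universities and big companies have been doing: DEI? Do not assume there are no after-effects once it is removed; the strongest effects are still to come, with a long-term impact delayed by at least 5-10 years.
In the past 200 years, every major scientific and technological revolution and breakthrough has been the inevitability of long accumulation plus the chance of an individual genius. Without Einstein, special relativity was one step away for everyone, but general relativity might have been delayed by at least 50 years. With only Franklin's data and thinking, and without Watson and Crick's extraordinary imagination and insight, the double helix might not have been deciphered even in 10 years. Had Shannon not built the worldview of information entropy, the communication age and the computer age might well have had to wait for the old hands of thermodynamics to tinker for several more decades before getting on track. There are too many such examples. We must admit that the world's progress belongs to geniuses.
2. Fortunately, in the last 20 years humanity has built a mechanism for effective collaboration between geniuses and ordinary people.
What makes us deeply pessimistic is that the emergence of genius is accidental, because it depends on genetic mutation across humanity's relentless reproduction (I cannot yet understand this randomness; perhaps one day we will discover the principles and mechanisms that make genius inevitable). What is very encouraging is that over the past 20 years the mechanism for collaboration between geniuses and ordinary people has become increasingly mature. I have written before about the GitHub miracle; for details see this post: https://t.co/Dvb0OpFXjg. There I concluded: "The Github miracle can also be understood as a group of upper-middle developers who previously had no chance to show themselves, emerging through Github and leading thousands upon thousands of lower-middle developers, sifting through the sand, picking one in a hundred, and achieving one new software miracle after another." What that post did not add is that the opportunity GitHub provides needs a genius to step in at the right time to correct course. Here I have to bring out Anders Hejlsberg once more, the world's most rigorously trained, most static, most deep, most abstract, most decoupling-minded, most refactoring-minded super-programmer and super-architect. When a genius of that caliber saw that so many front-end or low-end programmers who had benefited from the GitHub miracle remained trapped by javascript's fragile engineering capability, he stepped in and built TypeScript, lifting the engineering capability of that software ecosystem to an entirely new level. This is collaboration between genius and ordinary people. You can call it accidental: PHP, for example, also had considerable prosperity, yet nobody has come to rescue it. You can also call it inevitable: trading an overall increase in entropy for a local decrease has been the law of human development so far, especially through the PC internet era, the globalization era, and the mobile internet era, up to today's GitHub and arXiv era. We feel more and more that "mass effort works miracles" and "waiting for genius to emerge" both keep being carried forward in every field.
3. Terrifying self-iteration and the engineering explosion.
Two years ago, when I first appreciated what ChatGPT could do, the word "immortality" came to mind. Many people thought I was joking, but I know deeply that this is an inevitability of "waiting for chance within inevitability." The thinking behind it is the terrifying power of self-iteration in the AI era and its explosive boost to engineering capability. The following analogy may not be quite apt, but it roughly helps the thinking: beyond overall engineering progress, DeepSeek is also training AI's ability to self-iterate. Perhaps in the humanities, not using RLHF to align with humans produces strange results, but in pure logic there has already been an "aha" moment. One day, if AI can keep iterating on its own thinking, how much unpredictable yet substantial emergence will that produce? My knowledge and its structure are not sufficient for me to think deeply about where AI self-iteration is heading. But in broad terms, the physical nature of information and computation is slowly being revealed, and the sense that the two are in step grows stronger. In the past 200 years we figured out the physical world, such as why particles move spontaneously; in the next 100 years the laws of information will also be uncovered layer by layer. If you tell me future AI cannot self-iterate, I do not believe it.
Back to the engineering explosion. This path has long had a concrete, clear picture. Put simply, even if GPT stayed at its early-2025 level, it could still help engineers in every field worldwide improve their efficiency, and the improvement is visibly at the level where quantitative change produces qualitative change. In software development, for example, AI has already cut the workload by at least half. Most importantly, AI can also deliver more reliable automated testing, proactively propose design patterns, and make effective, system-wide refactoring suggestions for an architecture. People who have never done software engineering may not grasp how extraordinary this is. For the industry it is acceleration at the speed of light. The only thing missing is for the bosses of software companies to think hard about how to force their programmers to apply AI fully in their work. The software industry has become the foundation of all industries, and when the foundation accelerates this fast, the boost to the industries and disciplines above it is easy to imagine. Every discipline we can think of will, over the next 5-10 years, under an inevitable engineering explosion, gradually enter some magical and subtle process resembling AI self-iteration, and as you watch, you will slowly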
慢发现，人类开始跑步迈向永生了，这不是纯粹生物学的事，这是全部学科的共同进步所带来的真正的天方夜谭。其实这次DeepSeek事件里，我本人是非常激动的，作为一个喜欢正本清源的纯粹架构师，看我的BIO就能知道，我在软件领域里是小众，当所有的大厂架构师都在吹嘘自己如何知晓云原生，如何构建超大规模基于云的系统时，其实真正的天才架构师一定会建立高效而靠谱的物理机架构，云原生唯一的价值只是去耦合那不得不用他的CDN等场景，而StackOverFlow就是类似DeepSeek的天才。现在，一个DeepSeek天才勇敢的站出来，给大家指明了方向，那么未来，我看真的不远了！

153

274

925

408K

Henrywang68 retweeted

indigo

@indigox

over 1 year ago

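Where does Deepseek's innovation lie? Will it be the Pinduoduo of the large-model world? Over the weekend I invited former OpenAI researcher Jenny and her partner Jay, founder of Leonis, for a "conversation" episode. Jenny opened with a provocative claim: you cannot make money building large models, because in any industry where someone is willing to give the product away for free and someone is willing to open-source it, you cannot make money. She also offered two views on Deepseek: 1. It is a post-training innovation built on the shoulders of giants, revealing the potential of direct reinforcement learning and the emergence of self-reflection and exploratory behavior during training; 2. People only innovate when resources are scarce! Before ChatGPT, OpenAI had no money and no resources, and everyone worked overtime until 2 a.m.; that is how it got ahead of Google and blazed the LLM trail. But today's OpenAI feels a bit like the Google of those days.. 😇 From an investor's perspective, though, new paradigms like this will keep emerging. That is the most interesting thing about the AI industry right now: there are always possibilities and new opportunities, and we need to be ready for this kind of market! The future is full of possibilities ✨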

888

210

699

468K

Henrywang68 retweeted

Bloomberg

@business

over 1 year ago

Nvidia just lost $589 billion in value, the most for a single stock in market history, thanks to DeepSeek: Here’s your Evening Briefing https://t.co/v2gLuZy84o

343

229

702K
