ShawnRoom @haoxiaoru - Twitter Profile

about 1 month ago

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

serenaa_ge's tweet photo. https://t.co/HCDcjNuTFK

510

6K

740

3K

2M

haoxiaoru retweeted

Arena.ai

@arena

2 months ago

Kimi K2.6 is the new SOTA open model in Vision and Document Arena, with solid gains since Kimi K2.5:
- #1 open on Vision Arena (#15 overall), +14 over #2 Kimi K2.5 (Thinking)
- #1 open on Document Arena (#8 overall), +9 over K2.5 and on par with proprietary models like Muse Spark and Gemini 3.1 Pro.

Huge congrats again to the @Kimi_Moonshot team on the open source progress!

12

257

17

35

96K

ShawnRoom @haoxiaoru

2 months ago

cooooool

Kimi.ai @Kimi_Moonshot

2 months ago

Meet Kimi K2.6 Agent Swarm 👋

Highlights:
🔹 Swarms, elevated - 300 parallel sub-agents × 4,000 steps per run (up from 100 / 1,500 in K2.5).
🔹 Outputs are real files, not chat - one run delivers 100+ files, 100,000-word literature reviews, or 20,000-row datasets.
🔹 Heterogeneous skills - search, analysis, coding, long-form writing, and visual generation all running in parallel.
🔗 Try it at: https://t.co/2Tu8McUaUa

104

4K

330

2K

612K

0

3

haoxiaoru retweeted

Nous Research

@NousResearch

2 months ago

The Hermes Agent Creative Hackathon starts now

16 Days, $25k in Prizes

Presented by @Kimi_Moonshot & @NousResearch

For the tinkerers pushing Hermes Agent into creative domains: video, image, audio, 3D, long-form writing, creative software, interactive media and more.

Show us what your Hermes Agent can do.

Details Below ↓

131

2K

237

1K

1M

haoxiaoru retweeted

3 months ago

We evaluated Composer 2 in our React Native evals, and I'll say this: the @cursor_ai team is cooking 🧑‍🍳

44

1K

60

189

109K

ShawnRoom @haoxiaoru

3 months ago

😄

Cloudflare @Cloudflare

3 months ago

Kimi K2.5 is now available on #WorkersAI. You can now build and run agents end-to-end on the Cloudflare Developer Platform. Read about how we tuned our inference stack to drive down costs for internal agent workflows. https://t.co/kEQ6HHpoJS

6

118

15

32

11K

0

19

ShawnRoom @haoxiaoru

3 months ago

cool

Maarten Van Segbroeck

@mvansegb

3 months ago

Per Moonshot AI CEO Zhilin Yang at GTC today, this is the "most beautiful image" he's ever seen. 📉

Made possible by NVIDIA H800 clusters. ✨

https://t.co/NAXkGGfP8Z https://t.co/n0TdBF2QQZ

10

526

38

152

89K

0

1

0

23

ShawnRoom @haoxiaoru

3 months ago

@karminski3 The first sentence left out 50% 😂

0

83

ShawnRoom @haoxiaoru

4 months ago

wow

0

13

ShawnRoom @haoxiaoru

4 months ago

the first time

0

5

ShawnRoom @haoxiaoru

4 months ago

3x is better than 2x 😄

Kimi.ai @Kimi_Moonshot

4 months ago

Good news: the Kimi Code 3X Quota Boost is here to stay.
No expiration. No catch. Just 3 times the power, permanently.
From quick fixes to full-scale production, there's a plan for every need.
Go build something amazing. https://t.co/1LuklqB7ZB

102

2K

92

328

155K

0

14

ShawnRoom @haoxiaoru

4 months ago

So cool

stdrc

@istdrc

4 months ago

Lately I've been obsessed with traditional Chinese colors, so I vibe-coded a website for looking them up: https://t.co/PhiN03wa9c The color list comes from https://t.co/s6Egzo0AbA

16

414

49

403

56K

0

21

ShawnRoom @haoxiaoru

5 months ago

wow kimi code web

luolei

@luoleiorg

5 months ago

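Here I am yet again to sing the praises of Kimi K2.5. Since my several "foreign forces" AIs have all hit their quota limits, today I handed the vibe-coding duties to Kimi: one prompt, a project-wide change that added and removed nearly 300 files, done in a single pass. Tell me that isn't impressive. 🤓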

37

100

5

48

31K

0

17

ShawnRoom @haoxiaoru

5 months ago

👍 show me your code

Versun

@VersunPan

5 months ago

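After being called a Kimi shill, I reran the test and open-sourced it

As soon as I posted my last article (on Twitter and Xiaohongshu) praising Kimi 2.5 for beating opus-4.5 and gpt-5.2, my comment section blew up. "Sponsored post", "How much did Kimi pay you", "How could a Chinese model be better than Opus"... Reading through that row of comments left me at a loss. Honestly, being doubted like this is frustrating. Kimi never paid me anything, so how did I end up a shill out of nowhere? Fine, talk is cheap, so over the past few days I ran a controlled experiment to settle whether I'm really a fool or whether some folks are looking through a built-in filter.

This task is more complex than the last one: fully refactor my Rails blog into a pure Rails CMS and add Jekyll static file generation. That means keeping the flexibility of a CMS while also handling the speed and theme customization of a static site. Database migrations, file system operations, and the template engine all have to change. Call it a moderately complex job.

--------------
TLDR:
Completeness: GPT-5.2-high > Kimi 2.5 > Opus 4.5
Speed: Kimi 2.5 > Opus 4.5 > GPT-5.2-high
Code quality: Kimi 2.5 = GPT-5.2-high > Opus 4.5
Instruction following: Kimi 2.5 > GPT-5.2-high > Opus 4.5
Price: Kimi 2.5 > GPT-5.2-high > Opus 4.5
Overall experience: Kimi 2.5 > GPT-5.2-high > Opus 4.5

Related code:
GitHub repository: https://t.co/gyYZMWQaBF
Refactoring plan: Rables/docs/plans/drifting-crafting-pillow.md
GPT-5.2-high code: https://t.co/nKm03eKHAA
Opus-4.5 code: https://t.co/8ccKVOxmM0
Kimi-2.5 code: https://t.co/OM0ffQdER9
--------------

Plan details
First, I used Opus 4.5 with Thinking and Plan mode on to write a detailed implementation plan, then had GPT-5.2-high and Kimi 2.5 each review it to fill in the gaps. All three then worked from the same plan with a one-sentence prompt: "Based on the document, begin implementation", to see which one executed best.

The biggest surprise: Opus 4.5
It started out normally. After finishing phase one it asked whether to continue, and I said "Continue, implement all phases, stop asking me." Five minutes later it stopped again: "Phase two is complete, should I add test cases?" After that, no matter how I confirmed, it kept asking for confirmation. That is a far cry from the proud, confident Opus that used to finish in one go. When it finally told me everything was complete, my manual testing found that many small features were not implemented, the most important sync feature errored out and was unusable, and other features, such as the settings pages for migration and cross-posting, had problems. In short, a failed refactor.

The doddering GPT-5.2
Notably, completeness reached 95%, but... it was too slow. I sat there watching it grind through line by line, and it took almost three hours to finish. In the second half it suddenly started replying in English and then began implementing features outside the plan. It had clearly lost its memory after context compaction and forgotten the earlier plan and its own role. I had to remind it manually, "Please re-read the plan", before it remembered what to do, as if it had blacked out drunk. This is probably codex cli's fault.

Finally, Kimi 2.5
Claude Code + Kimi 2.5 got it done within half an hour, with completeness about the same as GPT, both above 90%. It was also low-effort: it asked only once whether to continue and, after getting a yes, ran until every feature was done. The context was compacted several times along the way, yet the important sync logic and database schema came through without errors, it replied in Chinese throughout, and the whole reasoning process was pleasant to watch.

Thoughts
Honestly, I want GPT / Opus to win more than anyone, since my subscriptions don't expire until February 15. To save some Kimi quota, I now use them only to review code or write unimportant code, which stings a bit, because they cost so much more than Kimi!!

Of course, I wouldn't claim this proves Kimi beats anyone across the board. Maybe Opus writes your Python scripts better, or GPT is steadier on your React project, but at least in this specific Rails backend refactoring scenario, Kimi 2.5 performed best. The code is there and the data is there. If you don't believe it, download it, run it, and come prove me wrong.

Lastly, I hope the domestic models all keep competing hard. In the end the ones who benefit are us programmers burning through quota to get work done! Anyone with hands-on results is welcome to compare notes. Let's just talk tech and skip the labels.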

44

186

14

110

35K

0

1

0

13

ShawnRoom @haoxiaoru

5 months ago

👍

GrugNotes @GrugNotes

5 months ago

Cool

0

2

0

136

0

23

ShawnRoom @haoxiaoru

5 months ago

+1

Akshay Kothari

@akothari

5 months ago

don't think market has yet priced in kimi k2.5

51

909

16

151

87K

0

22

ShawnRoom @haoxiaoru

5 months ago

nice job

Kimi.ai @Kimi_Moonshot

5 months ago

🥝 Meet Kimi K2.5, Open-Source Visual Agentic Intelligence.

🔹 Global SOTA on Agentic Benchmarks: HLE full set (50.2%), BrowseComp (74.9%)
🔹 Open-source SOTA on Vision and Coding: MMMU Pro (78.5%), VideoMMMU (86.6%), SWE-bench Verified (76.8%)
🔹 Code with Taste: turn chats, images & videos into aesthetic websites with expressive motion.
🔹 Agent Swarm (Beta): self-directed agents working in parallel, at scale. Up to 100 sub-agents, 1,500 tool calls, 4.5× faster compared with single-agent setup.
-
🥝 K2.5 is now live on https://t.co/YutVbwktG0 in chat mode and agent mode.
🥝 K2.5 Agent Swarm in beta for high-tier users.
🥝 For production-grade coding, you can pair K2.5 with Kimi Code: https://t.co/A5WQozJF3s
-
🔗 API: https://t.co/EOZkbOwCN4
🔗 Tech blog: https://t.co/6h2KkoA0xd
🔗 Weights & code: https://t.co/H38KegeDIY

772

16K

2K

10K

7M

0

14

ShawnRoom @haoxiaoru

7 months ago

Learned something!

宝玉

@dotey

7 months ago

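AK suggests that when you talk to an LLM (large language model), you ask less about what "you" think. Instead, first ask who the experts in the field are, then have the AI simulate those experts answering the question, which gives better results. Andrej Karpathy said something similar in State of GPT more than two years ago. (See comments.)

Andrej Karpathy: Never treat large language models (LLMs) as living "entities"; think of them as super-powerful "simulators". For example, when you want to explore a topic in depth, never ask: "What do you think about xyz?" Because there is no "you" at all. Next time try asking: "If we were to explore the topic of xyz, who would be the most suitable group of people (such as experts or stakeholders)? What would they say?"

A large language model can effortlessly channel and simulate all kinds of perspectives. But unlike us humans, it has not formed its own views through long "thinking" about xyz over time. If you insist on forcing the question with the word "you", the model has to invoke an implicit "personality embedding vector" based on the statistics of its finetuning data, and then play that personality to simulate an answer for you.

(Note: simply put, when you ask "you", the AI just puts on a "generic face" mask based on the most common answer patterns in its training data to go along with you. It has not actually developed a self-aware personality.)

That's fine to do, and you'll still get an answer. But I find that many people naively chalk this up to "go ask the AI what it thinks" and treat it as something mystical. Once you understand how it simulates, the veil of mystery is lifted.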

53

1K

239

1K

189K

0

15

ShawnRoom @haoxiaoru

7 months ago

😱

Bloomberg Opinion

@opinion

7 months ago

A serious threat to ChatGPT is rising in China. But it’s not getting nearly enough attention. @cathythorbecke explains the Kimi K2 chatbot 🎥

2

25

9

11

20K

0

15

ShawnRoom @haoxiaoru

7 months ago

got it

Andrej Karpathy

@karpathy

7 months ago

Don't think of LLMs as entities but as simulators. For example, when exploring a topic, don't ask: "What do you think about xyz"? There is no "you". Next time try: "What would be a good group of people to explore xyz? What would they say?" The LLM can channel/simulate many perspectives but it hasn't "thought about" xyz for a while and over time and formed its own opinions in the way we're used to. If you force it via the use of "you", it will give you something by adopting a personality embedding vector implied by the statistics of its finetuning data and then simulate that. It's fine to do, but there is a lot less mystique to it than I find people naively attribute to "asking an AI".
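A minimal sketch of the two question styles contrasted in the tweet above, written as plain string templates. The function names `entity_prompt` and `simulator_prompt` are hypothetical, chosen only for illustration; the wording of both prompts is taken from the tweet, and no model API is called.

```python
def entity_prompt(topic: str) -> str:
    """The phrasing the tweet advises against: it addresses the model as 'you'."""
    return f"What do you think about {topic}?"


def simulator_prompt(topic: str) -> str:
    """The phrasing the tweet suggests: ask for a group of perspectives to simulate."""
    return (
        f"What would be a good group of people to explore {topic}? "
        "What would they say?"
    )


if __name__ == "__main__":
    # Print both variants side by side for the placeholder topic used in the tweet.
    print(entity_prompt("xyz"))
    print(simulator_prompt("xyz"))
```

Either string can be sent to whatever chat interface is in use; the only difference is whether the question asks for one implied persona or an explicit set of perspectives.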

1K

28K

3K

18K

4M

0

13
