Siqi Zhu

“AI agents will outperform humans at almost all jobs by 2026–2027.” - The forecast is everywhere. So we built the exam to test that claim, on real labor-market aligned work. On the hardest tier, top agents pass 2.6%. Meet Agents' Last Exam (ALE), a rolling benchmark measuring whether agents can actually do real jobs. 🧵👇

306

207

74K

Siqi Zhu @realagi25

2 days ago

@Phoenixyin13 Sign up for Claude Code and Codex

158

realagi25 retweeted

Fv.ik

@efubiku

3 days ago

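There's no need to be too nervous about the gaokao. My experience proves that even if you score 700+, you just end up as an ordinary, greasy, slightly superstitious, run-of-the-mill corporate drone. Where a person ends up ultimately comes down to fate.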

154

11K

realagi25 retweeted

Stephan Rabanser @steverab

6 days ago

Very excited to share that our paper "Towards a Science of AI Agent Reliability" was accepted at ICML 2026! See you in Seoul! 🎉 We just released our camera ready version with three important updates (details below). We also recorded a short video on the paper's contributions. Main changes (full discussion at https://t.co/1a5r1jNFF4): 1️⃣We have added the latest set of frontier models to our evaluation (GPT 5.5, Gemini 3.1 Pro and 3.5 Flash, and Claude Opus 4.7) and find that they are not meaningfully more reliable than previously released models. Agent reliability is still far from being solved. 2️⃣We have updated the definition and measurement of our outcome consistency metric, which contained a typo in the pre-print we initially released. This caused us to under-estimate outcome consistency in our initial set of results. We have updated the paper and our codebase to the corrected metric. Despite this change, our new results show that outcome consistency is still surprisingly low across many reported models. 3️⃣We discovered multiple issues in our HAL Generalist Agent scaffold that we used for our experiments on GAIA. Notably, we discovered multiple instances of answer leakage and agents cheating on our evaluation. This caused us to slightly over-estimate both accuracy and reliability. At the same time, we noticed that the scaffold was overly constrained in terms of permissible software library imports. This caused us to slightly under-estimate both accuracy and reliability. We have done a rigorous audit of the scaffold and have fixed those issues. Overall, we saw that our resulting accuracy and reliability numbers are not meaningfully impacted by this change when compared to our original numbers. 📄Our paper: https://t.co/HAKHzASrOZ 📊Our dashboard: https://t.co/apbtxtsdvz 🎥Short video: https://t.co/uqIourw6C6 Joint work w/ @sayashk, @PKirgis, @khl53182440, @SaitejaUtpala, and @random_walker.

246

255

24K

realagi25 retweeted

Sai Surya Duvvuri

@dvsaisurya

5 days ago

Just read this, nice research. We did something similar long back on LoRA factors. https://t.co/e2dt04Rs4A

24K

realagi25 retweeted

paperpaper

@paperpaper886

5 days ago

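Quite inspiring: the first author and the corresponding author are both undergraduates at Guangdong University of Technology, and the experiments were even run on an old Titan GPU. In the end the paper not only got into CVPR but also won Best Student Paper & Honorable Mention. https://t.co/bExCrhbK6P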

https://t.co/bExCrhbK6P https://t.co/3xFNBbAzlv

538

279

237K

Siqi Zhu @realagi25

7 days ago

@luo_yuehan Calculus, Jianlian Cui

realagi25 retweeted

Datou @Datou

8 days ago

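Microsoft still guards its reputation: it deliberately avoids synthetic data and trains a base model on human data only. Then one becomes three: it trains three expert models in different domains. After that it distills itself, folding the three capabilities back into the base model (getting the weight mix right takes real experience), and runs one more round of reinforcement learning so the distilled model learns to size up each problem and apply the three capabilities flexibly. They really did share a lot of detail; "one begets three, three distill into one" is an interesting idea.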

438

376

91K

realagi25 retweeted

Max Lv

@m0d8ye

10 days ago

What am I talking about when I talk about the garbage AI-written code on GitHub?

realagi25 retweeted

nini

@nini_incrypto_

11 days ago

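Having no resources and no background is sometimes an advantage. Former Bank of America Merrill Lynch executive Raj Malhotra says he prefers to mentor young people who did not attend elite schools. It is not that he looks down on elite schools; having seen so many people on Wall Street, he found that the ones who truly survive are often those with no fallback, who calculate every step carefully. Because they have no background to lean on, they have to think the causal chain one level deeper than everyone else. And that one level of difference cuts the number of competitors you face by 90%. The same logic applies in life: a degree is only the entry threshold; results are the proof. Of course, this is not telling you to look down on elite-school graduates. The real hub of elite schools and institutions lies not in money but in something of a higher dimension……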

realagi25 retweeted

土豆本豆

@Potatoloogs

11 days ago

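Agent development has currently split into two routes. Most discussion focuses on the first, but the second may deserve more attention. Route one: harness-style multi-agent systems. Multiple agents share context, share goals, and are centrally scheduled; in essence this is a workflow engine with an ontology added to make it more flexible. Most multi-agent systems today are essentially LLM orchestration: one large model dispatches multiple sub-roles to carry out complex reasoning. The agents here are more like callable functions, tools with personas, or task nodes. Prompt Engineering, Context Management, Task Routing, Tool Calling, Planning, Memory, Workflow: these remain, in essence, software engineering problems. That is why people who were good at programming have been given a new lease on life. Route two: Protocol-Native Agent Systems. The core change on this route is that every person has their own Personal Agent, or even their own dedicated unmanned company. This is an enormous change. When an agent truly belongs to an individual, its nature changes fundamentally: from task-scoped (a task-level instance) to identity-scoped (an identity-level entity). It has long-term memory, a persistent identity, preferences, resources, permissions, history, a network of relationships, and boundaries of interest; it represents "you". At that point, collaboration between agents can no longer rely on prompts, workflows, or shared context; it can only rely on protocols. When vast numbers of agents exist independently, they must resolve among themselves: identity verification, permission boundaries, trust mechanisms, delegation relationships, negotiation mechanisms, incentive mechanisms, reputation systems, and value exchange. Interaction between agents is no longer an API call; it is more like institutional interaction. The core of the AI world will shift from Prompt Engineering to Protocol Engineering. Protocol as organization: this evolution has three stages. In the traditional internet, protocols are data communication: TCP/IP, HTTP, SMTP, where sender and receiver agree on the format for talking. In blockchains, protocols evolved into state computation: the essence of Ethereum is the whole network jointly executing state-transition rules. In the Agent Society stage, protocols will upgrade further, defining not only communication and computation but also coordination, permissions, incentives, identity, and organizational relationships. Protocols begin to take on the function of "organization". My take may not read as intuitive right now, because the second route has not yet become reality. But think back three years, when someone said "LLMs provide the foundation capability, the middle layer provides atomic capabilities, and users assemble their own handy tools." That felt abstract at the time too; now it looks like the everyday reality of vibe coding. The second route may turn out the same way: look again in half a year and it may well click.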

180

196

19K

Siqi Zhu @realagi25

11 days ago

@failcatcat The era dividend of the post-90s generation

715

realagi25 retweeted

Trajectory

@trajectorylabs

11 days ago

🏹5 Days of Trajectory. Day 3 - An Open Source Training Stack for Continual Learning Building the platform for continual learning requires both partnering with pioneering AI companies, as we showed on Day 2 with Harvey, and working toward frontier research, which we are highlighting today. Continual learning means models that improve hourly from real production use. But with the size of frontier models, this becomes quite difficult. A Qwen-397b would need to spin up and tear down repeatedly across six GPU nodes, and that's valuable time gone. Our contribution is Continual LoRA (C-LoRA): many lightweight adapters running at once on one shared base model. Our insight centers on where the parallelism lives: instead of splitting one giant job across nodes, we load-balance many small jobs over a single base. The result: 2.81x experiment throughput over single-tenant training, with no regression on rewards. We built this together, with @anyscalecompute, @NovaSkyAI, and generous support from @GoogleCloud and @GoogleStartups. We've open-sourced on SkyRL as one of the first multi-LoRA, RL training platforms, so that every team can get to continual learning faster. We’re very excited to see what you build, please reach out!

511

395

93K

realagi25 retweeted

Lun Wang

@lunwang1996

24 days ago

I’ve left Google DeepMind after an amazing chapter. I’m incredibly grateful for the people I worked with, the things we built, and the lessons I learned from taking frontier AI research into production. DeepMind shaped how I think about research, product, evaluation, and what it takes to build AI systems at real scale. As I wrap up this chapter, I wrote down something I’ve been thinking about a lot: evals. We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations. https://t.co/F1lUWxDG2D

200

616K

Siqi Zhu @realagi25

12 days ago

@Barret_China One of my articles discusses this: https://t.co/CQZGjZyMds

534

Siqi Zhu @realagi25

13 days ago

@Stanleysobest Fun fact: it's like this all over the world 😂

445

realagi25 retweeted

Elon Musk

@elonmusk

13 days ago

😂

281K

18K

40M

realagi25 retweeted

Salesforce AI Research

@SFResearch

13 days ago

Can Language Models Remember What They Learn? Introducing Procedural Memory Distillation (PMD): https://t.co/8fcAEPbkE4 PMD turns model attempts into reusable training memory, conditions a self-teacher on it, and distills the guidance into the student's weights.

realagi25 retweeted

Kenton Varda

@KentonVarda

13 days ago

Chinese AI Twitter is obsessing over my wife. How was your day?

117

437

671K
