Jie Liu

@Arnuojo

Opinions are my own // open source // tech trends // She

Joined November 2013

1.4K Following

182 Followers

544 Posts

Jie Liu @Arnuojo

4 days ago

Stage one is about who is smarter. Stage two is about who gets more work done. Stage three is about who holds more permissions.

0x小师妹

@0xshimei

4 days ago

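Why is it that Codex and Claude Code are already locked in a fierce fight, yet Google, the industry leader, still has not shipped a truly comparable product? The reason is that Anthropic and OpenAI can afford to lose, and Google cannot. If Claude Code deletes the wrong file or breaks a project, developers complain for a few days at most. But if a Gemini Agent mistakenly deletes Gmail messages, sends the wrong email, changes a calendar, or mishandles enterprise data, it could affect billions of users and countless enterprise customers. Startups chase speed; Google chases stability, and its scale forces it to weigh these risks. But many people overlook something else: if Google ever brings an Agent to the maturity Codex has today, the permissions it holds would be an almost overwhelming advantage. Claude and GPT are essentially guests: sending email requires authorization. Reading a calendar requires authorization. Driving a browser requires authorization. Accessing cloud services requires authorization too. Gmail, Chrome, Android, Drive, Docs, and Calendar, by contrast, are Google's own turf. Everyone else is knocking on the door; Google holds the key. So the endgame of this war may not be a model war at all. Stage one is about who is smarter. Stage two is about who gets more work done. Stage three is about who holds more permissions. Codex and Claude Code are racing to seize stage two, while Google wants to jump straight to stage three. Only one question remains: how long until it is willing to step on the gas?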

31K

Arnuojo retweeted

@wangyuanzju

24 days ago

https://t.co/erMq4P9014

10K

Arnuojo retweeted

OpenAI

@OpenAI

about 1 month ago

Today we’re launching the OpenAI Deployment Company to help businesses build and deploy AI. It's majority-owned and controlled by OpenAI. It brings together 19 leading investment firms, consultancies, and system integrators to help organizations deploy frontier AI to production for business impact. https://t.co/GnyjGFaLLA

679

11K

Jie Liu @Arnuojo

about 1 month ago

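@mubeitech AI has learned to pick the best weapon for a task's difficulty: on the same task, the high-end version chose Python for efficiency, while the top-tier version went straight to C to push down to the lowest level of computation.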

509

Arnuojo retweeted

Ethan Mollick

@emollick

about 1 month ago

So Mythos was, indeed, not marketing hype. Remember this is a general purpose model that just happens to be good at finding exploits because good models are good at lots of things. Expect similar from OpenAI & Google. And from open models in 8 months. https://t.co/KbhalQYX8R

135

304

585K

Jie Liu @Arnuojo

about 1 month ago

The future of programming won’t be languages easiest for humans. It’ll be languages easiest for agents.

Richard He

@RealRichomie

about 2 months ago

The future of programming won’t be languages easiest for humans. It’ll be languages easiest for agents. We just shipped a Mac app where our engineers didn’t know a single line of Rust (or Tauri) beforehand. Result: ~1/10th the size of a normal Mac app, highly performant. Agents are the new programmers. Languages should optimize for them.

421

Jie Liu @Arnuojo

about 1 month ago

In contrast, with a Just-In-Time Compiled DSL, we can specialize more aggressively without sacrificing iteration pace.

Jie Liu @Arnuojo

about 1 month ago

CuTeDSL offers unrestricted access to hardware primitives, providing us with the level of control we require to achieve peak performance in all edge cases.

Perplexity

@perplexity_ai

about 1 month ago

We’ve developed our own inference engine Runtime-Optimized Serving Engine (ROSE) to serve models ranging from embeddings to trillion-parameter LLMs. With CuTeDSL integrated into our inference engine, Perplexity can build the specialized GPU kernels faster to bring models up to peak performance on NVIDIA Hopper and Blackwell GPUs.

119

353

161K

Jie Liu @Arnuojo

about 1 month ago

With CUDA, opportunities for compile-time specialization were limited, as each additional parameter added an unreasonable cost to compilation times or complexity to the build system.
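
The trade-off described in this thread can be illustrated with a small sketch. It is not CuTeDSL or CUDA code: the parameter names, their values, and the `jit_specialize` function are all hypothetical stand-ins. The point is only that an ahead-of-time build has to cover every parameter combination, while a just-in-time approach compiles the combinations that are actually requested and caches them.

```python
from functools import lru_cache

# Hypothetical tuning parameters for one kernel; names and values are illustrative only.
PARAM_SPACE = {
    "tile_m": (64, 128, 256),
    "tile_n": (64, 128, 256),
    "dtype": ("fp16", "bf16", "fp8"),
    "causal": (False, True),
}


def aot_variant_count(space):
    """Number of kernels an ahead-of-time build must compile to cover every combination."""
    count = 1
    for values in space.values():
        count *= len(values)
    return count


@lru_cache(maxsize=None)
def jit_specialize(tile_m, tile_n, dtype, causal):
    """Stand-in for a JIT compile step: build one variant on first use, then reuse it."""
    return f"kernel<{tile_m}x{tile_n},{dtype},causal={causal}>"


# Configurations actually requested at runtime (also hypothetical).
requests = [(128, 128, "bf16", True), (128, 128, "bf16", True), (64, 256, "fp8", False)]
kernels = [jit_specialize(*r) for r in requests]

aot_total = aot_variant_count(PARAM_SPACE)
jit_total = jit_specialize.cache_info().misses
print(f"AOT variants: {aot_total}, JIT variants compiled: {jit_total}")
```

Each extra parameter multiplies the ahead-of-time count, which is the compile-time cost the post refers to. In the cached approach, adding a parameter costs nothing until a new value is actually used.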

Arnuojo retweeted

meng shao

@shao__meng

about 1 month ago

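OpenAI speeds up the Responses API by 40%: why has the API itself become the bottleneck in the Agent era?

OpenAI recently published an engineering blog post about something easy to overlook: as model inference gets faster, the API framework itself becomes the biggest bottleneck in Agent workflows. Their fix was to add a WebSocket mode to the Responses API, which speeds things up 40% end to end and lets GPT‑5.3‑Codex‑Spark actually deliver 1000 TPS, with peaks of 4000 TPS.
https://t.co/uUnXAYFfsO

How the problem surfaced
A single Codex bug fix involves dozens of Responses API round trips: decide the next action → run a tool on the user's machine → send the result back → run inference again. The time spent in the Agent loop splits into three parts: API service processing, model inference, and client-side tool execution.

When GPU inference was slow, the API overhead was naturally hidden. But the GPT‑5/5.2 era ran at about 65 TPS, while the new Codex‑Spark targets 1000+ TPS on dedicated Cerebras hardware. Inference got an order of magnitude faster, so the API's "fixed cost" was magnified mercilessly. What users wait on is no longer the GPU but the API framework running on the CPU.

Round one: squeezing the slack out of a single request
Starting from 25.11, they did several routine but important things:
· Cached rendered tokens and model configuration in memory, skipping repeated tokenization and network calls across multi-turn conversations
· Cut network hops to intermediate services (such as image processing) and called the inference service directly
· Optimized the safety stack so classifiers flag problematic conversations faster

Result: time to first token (TTFT) improved by about 45%. For Codex‑Spark, that was still not enough.

The real structural problem
Every Codex request was treated as an independent request. Even when most of the conversation was unchanged, the server still reran validation, processing, and context construction over the full history. The longer the conversation, the more expensive the repetition. This is waste at the protocol level, not something tuning one component can fix.

WebSocket: treating an Agent rollout as one "long Response"
They rethought the transport layer: could it hold a long-lived connection, cache reusable state in memory, and send only the deltas? Between WebSockets and gRPC bidirectional streaming they chose WebSockets: simple, developer-friendly, and requiring no change to the existing Responses API input and output structure.

The first prototype was aggressive: it modeled the whole Agent rollout as a single long-running Response.
It works like a hosted tool call: when the model calls web search, the inference loop blocks, waits for the service to return, then continues sampling. In WebSocket mode, local tool calls use the same mechanism, except the "remote service" is the client connected over the WebSocket. The model emits response.done, the client runs the tool and replies with response.append, and the sampling loop unblocks and continues inference.

The effect was immediate: the whole rollout does preprocessing once and postprocessing once, and the tool round trips in between no longer pay the API framework overhead again.

The final trade-off: aggressive design vs. a shape developers already know
The prototype worked well, but it changed the shape of the API, so developers would have had to rewrite their integrations. The released version is a compromise: it keeps the original response.create body and still chains context with previous_response_id, but underneath it maintains a connection-level in-memory cache for the lifetime of the WebSocket connection, including:
· The previous response object
· Historical input and output items
· Tool definitions and namespaces
· Reusable sampling artifacts such as rendered tokens

The concrete optimizations this enables:
· Safety classifiers and request validation process only the new input instead of scanning the full history
· Rendered tokens are appended incrementally, skipping repeated tokenization
· Model routing results are reused across requests
· Non-blocking postprocessing such as billing overlaps with the next request

Final results
· Most Codex traffic has moved to WebSocket mode
· Codex‑Spark holds a steady 1000 TPS, with peaks of 4000 TPS
· Latency dropped by up to 40% after the Vercel AI SDK integration
· Cline multi-file workflows got 39% faster
· OpenAI models on Cursor got up to 30% faster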
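
The connection-level cache idea can be sketched in a few lines. This is a toy model under stated assumptions, not OpenAI's implementation or the Responses API: `ConnectionCache`, `tokenize`, and `stateless_request` are hypothetical names, and whitespace splitting stands in for real tokenization. It only contrasts reprocessing the full history on every request with appending the new items to state kept for the life of one connection.

```python
from dataclasses import dataclass, field


def tokenize(text):
    """Toy tokenizer standing in for the real rendering step."""
    return text.split()


@dataclass
class ConnectionCache:
    """Hypothetical per-connection state: rendered tokens plus a count of processed items."""
    tokens: list = field(default_factory=list)
    items_seen: int = 0
    tokenize_calls: int = 0

    def append(self, new_items):
        """Process only the items added since the previous request on this connection."""
        for item in new_items:
            self.tokens.extend(tokenize(item))
            self.tokenize_calls += 1
        self.items_seen += len(new_items)
        return self.tokens


def stateless_request(history):
    """Baseline: every request re-tokenizes the full history."""
    tokens, calls = [], 0
    for item in history:
        tokens.extend(tokenize(item))
        calls += 1
    return tokens, calls


turns = [["fix the bug"], ["tool: read file"], ["tool: run tests"]]

# Stateless: request k re-processes all k turns seen so far.
history, stateless_calls = [], 0
for turn in turns:
    history.extend(turn)
    stateless_tokens, calls = stateless_request(history)
    stateless_calls += calls

# Cached: each request processes only the new turn.
cache = ConnectionCache()
for turn in turns:
    cached_tokens = cache.append(turn)

print(stateless_calls, cache.tokenize_calls)
```

Both paths end with the same token list, but the stateless path does work proportional to the square of the number of turns while the cached path stays linear, which is why the gap grows with conversation length.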

Arnuojo retweeted

Maggie Appleton @Mappletons

about 2 months ago

Got to talk at @aiDotEngineer conf last week about the need for collaborative AI engineering. All our current coding agents are single player. We're trying to scale up individual productivity, but creating tons of alignment problems in the process. We have no good tools for...

274

178

27K

Arnuojo retweeted

meng @meng59739449

about 2 months ago

1) A16 node delayed until H2 2027. This confirms what I said before about high defect rates and chip hot-spot reliability problems; only Nvidia AI GPUs use it.
2) A13 node (no-BSPDN version + 2nd-gen GAAFET)
3) A12 node (BSPDN version + 2nd-gen GAAFET)
4) N2U ??? 18A-U ???
5) A10 node uses High-NA EUV

11K

Arnuojo retweeted

Akshay 🚀

@akshay_pachaar

about 2 months ago

Kimi K2.6 raises the bar for open-source models.

Moonshot released it yesterday, and for the first time, an open-weight model holds its ground against Claude Opus 4.6 on the benchmarks that matter for agentic work. It also costs a fraction of the price.

𝗧𝗵𝗲 𝗽𝗿𝗶𝗰𝗶𝗻𝗴
Kimi K2.6 runs at $0.95 per million input tokens and $4 per million output tokens. Claude Opus 4.6 runs at $5 and $25. With cache hits, the gap widens. K2.6 drops to $0.16 per million on cached inputs. Opus 4.6 drops to $0.50. That's roughly 5-6x cheaper across the board, before and after caching.

𝗧𝗵𝗲 𝗯𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝘀
K2.6 leads Opus 4.6 on four of the six head-to-head comparisons Moonshot published:
- SWE-bench Pro: 58.6 vs 53.4 (agentic coding)
- HLE with tools: 54.0 vs 53.0 (agentic reasoning)
- DeepSearchQA: 92.5 vs 91.3 (deep research)
- LiveCodeBench: 89.6 vs 88.8
Opus 4.6 still wins on SWE-bench Multilingual and BrowseComp, but the gap is under a point in both.

𝗧𝗵𝗲 𝗽𝗮𝗿𝘁 𝘁𝗵𝗮𝘁 𝗮𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗺𝗮𝘁𝘁𝗲𝗿𝘀
Benchmarks are the easy story. The harder and more interesting story is long-horizon execution.

K2.6 ran a single autonomous task for over 12 hours, making 4,000+ tool calls, to port and optimize inference for a small LLM in Zig, a language most models barely touch. It ended up running around 20% faster than LM Studio on the same hardware.

Separately, it refactored an 8-year-old financial matching engine across 13 hours, delivering a 133% peak throughput gain.

This is the capability gap that usually separates frontier closed models from open ones. K2.6 closes a meaningful chunk of it. You get weights you can actually deploy, a Modified MIT license, 5-6x lower inference cost, and performance that no longer forces you to compromise on agentic workloads.

The moat around frontier labs is shrinking fast.

Read more: https://t.co/ye6bkXBYTD
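
The price ratios quoted in the post can be checked with a quick calculation. The figures below are only the per-million-token prices stated in the post itself; the dictionary keys are illustrative names.

```python
# Prices in USD per million tokens, as quoted in the post above.
prices = {
    "input": {"k2_6": 0.95, "opus_4_6": 5.00},
    "output": {"k2_6": 4.00, "opus_4_6": 25.00},
    "cached_input": {"k2_6": 0.16, "opus_4_6": 0.50},
}

# How many times more expensive Opus 4.6 is than K2.6 for each price category.
ratios = {kind: p["opus_4_6"] / p["k2_6"] for kind, p in prices.items()}

for kind, ratio in ratios.items():
    print(f"{kind}: {ratio:.2f}x")
```

On these numbers the ratio is roughly 5.3x for input and 6.25x for output, which matches the "5-6x" figure. For cached input it comes out near 3.1x, so by these quoted prices the relative gap is smaller with cache hits, not larger.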

919

143

356

89K

Jie Liu @Arnuojo

about 2 months ago

AI-first process

Peter Pang

@intuitiveml

about 2 months ago

https://t.co/oWr2pTk4Hu

157

566

10K

Arnuojo retweeted

Piotr Nawrot

@p_nawrot

2 months ago

⚡🔧 PyTorch inference optimization just got a lot simpler

Introducing AITune, NVIDIA's new library that automatically finds the fastest inference backend for any PyTorch model. It covers TensorRT, Torch Inductor, TorchAO and more, benchmarks all of them on your model and hardware, and picks the winner. No guessing, no manual tuning.

The production path (Ahead-of-time): AITune profiles all backends, validates correctness automatically, and serializes the best one as an .ait artifact: compile once, zero warmup on every redeploy. Something torch.compile alone doesn't give you. Pipelines are also supported, and each submodule gets tuned independently.

The fast path (Just-in-time): set env variable, run your script unchanged. No code changes, no setup. AITune auto-discovers modules and optimizes them. Good for quick exploration before committing to AOT.

Not competing with vLLM or TRT-LLM. It fills the gap for everything else: diffusion, CV, speech, embeddings. Works on any PyTorch model.

Check it out: https://t.co/WHrGcWEp1e
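
The "benchmark every backend, validate, pick the winner" loop described above can be sketched generically. This does not use AITune, TensorRT, or Torch Inductor, and none of the names below come from those libraries: the "backends" are plain Python functions and `pick_backend` is a hypothetical helper. It only shows the selection pattern of rejecting candidates that fail a correctness check and timing the rest.

```python
import time


def reference(xs):
    """Reference implementation used to validate candidates."""
    return [x * x for x in xs]


# Hypothetical "backends": plain Python callables, not real inference backends.
def backend_map(xs):
    return list(map(lambda x: x * x, xs))


def backend_pow(xs):
    return [x ** 2 for x in xs]


def backend_broken(xs):
    return [x + x for x in xs]  # Deliberately wrong; validation should reject it.


def pick_backend(candidates, sample, repeats=5):
    """Validate each candidate against the reference, time the valid ones, return the fastest."""
    expected = reference(sample)
    timings = {}
    for name, fn in candidates.items():
        if fn(sample) != expected:
            continue  # Skip backends that fail the correctness check.
        start = time.perf_counter()
        for _ in range(repeats):
            fn(sample)
        timings[name] = (time.perf_counter() - start) / repeats
    best = min(timings, key=timings.get)
    return best, timings


candidates = {"map": backend_map, "pow": backend_pow, "broken": backend_broken}
best, timings = pick_backend(candidates, sample=list(range(1000)))
print(best, sorted(timings))
```

Which valid candidate wins depends on the machine, so the sketch makes no claim about the result. The part that is fixed is that the incorrect candidate never reaches the timing stage.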

386

308

31K

Jie Liu @Arnuojo

2 months ago

Open-source software never stops. It only accelerates.

NVIDIA

@nvidia

2 months ago

Open-source software never stops. It only accelerates. Dynamo, @sgl_project, TensorRT LLM, and @vllm_project are constantly optimized by a vast ecosystem of developers building on top of the NVIDIA platform. The result: your token output keeps improving and token cost keeps decreasing on the same hardware resources while your developer velocity stays at its peak. Build on the foundation continuously optimized by the world’s best developers. ⚡ 🔗 https://t.co/eH2xhsw8mG

304

65K

Jie Liu @Arnuojo

2 months ago

@ivysage_ @nvidia @sgl_project @vllm_project it's that the entire open source ecosystem optimizes for your platform first and everything else second

Jie Liu @Arnuojo

2 months ago

@KKaWSB A narrative style that reduces complex technology to an "emotional experience".

Arnuojo retweeted

Epoch AI

@EpochAIResearch

2 months ago

Compute may be the most important input to AI. So who owns the world’s AI compute? Introducing our new AI Chip Owners explorer, showing our analysis of how leading AI chips are distributed among hyperscalers and other major players, broken down by chip type over time.

629

119

257

255K

Arnuojo retweeted

Chubby♨️

@kimmonismus

2 months ago

Google had the foresight to develop TPUs back in 2012. Today, they have by far the most compute. In the long term, Google is in one of the best starting positions: a solid revenue and product base, compute, and above all: distribution.

900

172

68K
