Zihao Ye

@ye_combinator

Seattle

Joined October 2017

637 Following

2.1K Followers

258 Posts

ye_combinator retweeted

Thien Tran @gaunernst

3 days ago

Beating CuBLAS was not a goal, but it came out pretty good. I think this is more useful as a concise and hackable "template", rather than being the fastest kernel: bring ur own epilogues, roll a megakernel, ask Codex to fork it. Just like when I first learned Triton. https://t.co/hvsNUAuG8N

2

52

3

20

2K

ye_combinator retweeted

7 days ago

Miles + multi-adapter LoRA training, presented by @Osmosis_AI!

4

65

9

33

11K

ye_combinator retweeted

Ruihang Lai @ruihanglai

8 days ago

Two moments every ML researcher knows. You get onto a new cluster, and week one goes to fitting the framework to your setup, not training. A new architecture lands, and trying it means hacking through a gigantic codebase to stay compatible with the pipeline. What you want to change is small. The code you wade through to change isn't.

This experience is likely not alone, and many researchers we’ve talked to run into similar issues. A year of this on CMU's FLAME cluster left us with one question: what if a framework were built for an agent to adapt and evolve, not just for humans to maintain?

So we introduce PithTrain: a compact, agent-native MoE training system, now ~11K lines of Python, on four principles:

- Compact: fits in one context window
- Python-native: readable tracebacks, no compiled-extension rebuilds
- No implicit indirection: direct calls, each model in its own file
- Agent skills: in-repo playbooks for recurring tasks

Then we measured the thing nobody measures. Same agent, same tasks, only the framework underneath changes: on PithTrain it finishes with up to 62% fewer turns and 64% less GPU time than production frameworks, while training just as fast.

We call this second axis agent-task efficiency, and we believe it deserves to sit alongside training throughput as a metric worth optimizing. Excited to see what people build with it.

Built with amazing collaborators @haok1402, Haozhan Tang, Akaash Parthasarathy, @Zichun_Yu, @junrushao, Todd Mowry, @XiongChenyan and @tqchenml.

Blog: https://t.co/byOKPs9rGQ
Code: https://t.co/AH5ZbwYluV
Paper: https://t.co/hkmDGx9Hc6

2

169

37

135

21K

Zihao Ye @ye_combinator

9 days ago

@madebyollin @CVPR best for your next journey!

0

1

0

0

545

ye_combinator retweeted

14 days ago

Flash-KMeans was only the beginning. Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for fast, predictable, agent-ready classical ML operators. Up to 26× on KMeans, 19× on KNN, 40× on HDBSCAN, 208× on TruncatedSVD, 47× on PCA, 147× on exact t-SNE, and 49× on MultinomialNB over state-of-the-art (cuML). Blog: https://t.co/P31SGl0cyT Code: https://t.co/9nkO2hmeOl

47

2K

237

2K

865K

ye_combinator retweeted

Luis Ceze @luisceze

11 days ago

That was a great talk @marksaroufim — started MLsys on an exciting note!

0

3

1

0

1K

ye_combinator retweeted

@YifeiZuoX

11 days ago

For me, the coolest finding is that Muon optimizer is crucial for Parallax to move beyond Softmax Attention.

Lesson: don't evaluate new architectures solely under AdamW, you'll miss the good ones.

paper: https://t.co/yAqClXrJUz
code: https://t.co/D4pgIr1wM7

For the origin of Parallax, check out the LLA paper at ICLR 2026:
paper: https://t.co/85OzoOQlnF
code: https://t.co/eqMYZ0U6qO

6

353

45

274

77K

ye_combinator retweeted

13 days ago

Build your first game with Gemini 3.5 Flash. Translate everyday objects directly into interactive, digital experiences without complex 3D modeling. Start with a Nano Banana prompt, turn your image into a game in Canvas, and refine your vision for optimal gameplay.

0

2K

155

536

24M

ye_combinator retweeted

Lequn Chen @abcdabcd987

13 days ago

We are obsessed with performance and low level details. Great work by my teammate @xyzw_io . Chat with us if this interests you.

1

25

2

7

3K

ye_combinator retweeted

Yoav Gelberg @yoav_gelberg

15 days ago

Excited about this new work. As KV compaction becomes increasingly important, we ask whether it’s worth adapting the model itself to perform better under compaction. Turns out, it can really matter.

3

155

28

105

22K

ye_combinator retweeted

19 days ago

After some mathematical rewrite, turns out all of transformer is a series of gemm + epilogue. Given a few optimized primitives, LLMs (and novice humans) can write speed-of-light kernels for all transformer ops!
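To make the "gemm + epilogue" framing concrete, here is a minimal pure-Python sketch. It is not the kernel or the rewrite the tweet refers to: the `gemm` helper, its `epilogue` callback signature, and the bias + GELU example are all assumptions chosen for illustration. It only shows the idea that an elementwise step can be applied to each output element as the matmul produces it.

```python
import math


def gemm(a, b, epilogue=None):
    """Compute C = epilogue(A @ B) for row-major lists of lists.

    The (hypothetical) epilogue callback receives the accumulated value and
    its output coordinates, so it runs per element as results are produced
    rather than in a separate pass over C.
    """
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            out[i][j] = epilogue(acc, i, j) if epilogue else acc
    return out


def gelu(x):
    """Exact (erf-based) GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Made-up toy inputs for illustration only.
x = [[1.0, 2.0], [3.0, 4.0]]
w = [[0.5, -1.0], [0.25, 1.0]]
bias = [0.1, -0.2]

# An MLP up-projection, GELU(x @ w + bias), written as one GEMM plus epilogue.
fused = gemm(x, w, epilogue=lambda acc, i, j: gelu(acc + bias[j]))

# Reference: plain GEMM followed by a separate elementwise pass.
plain = gemm(x, w)
unfused = [[gelu(v + bias[j]) for j, v in enumerate(row)] for row in plain]
```

Both paths give the same values; on a GPU the fused form would avoid writing the intermediate matrix to memory and reading it back.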

18

1K

128

946

132K

ye_combinator retweeted

Charles 🎉 Frye @charles_irl

18 days ago

rolling up to MLSys '26 to meet with @ye_combinator and the winners of our B200 kernel perf competition

quick trip, so i packed a single bag, just my essentials https://t.co/0mQWtjDXgk

3

54

2

2

13K

ye_combinator retweeted

Edward Z. Yang @ezyang

19 days ago

Pro-tip: using CUDA graphs and annoyed that all the kernels have no labels in your profiles? Get a nightly that has mark_kernels context manager: https://t.co/IF3sZhb4S2 (thanks Natalia and Shangdi for implementing!) You need 13.1 driver, but user mode driver is enough

1

123

11

43

15K

ye_combinator retweeted

@marksaroufim

20 days ago

It was an honor to give the keynote at MLSys
Covered how AI systems have evolved, why AI is needed to improve them, why results have disappointed, why the future looks amazing, and why I’m working on this at Core Auto
Recording should be out soon, in the meantime slides https://t.co/5pbyUHTAVC

15

446

44

298

66K

ye_combinator retweeted

20 days ago

#MLSys26 NVIDIA is hosting an ice cream social on Friday https://t.co/jFZGXxqtbe after the competition session, looking forward to seeing folks there

2

37

2

2

10K

ye_combinator retweeted

Edward Z. Yang @ezyang

about 1 month ago

Historically, we used https://t.co/uMrrnBoSPF for this. I think this forum is still good for things that are much more about discussion. But actually, one big impetus for devlogs as an SSG website https://t.co/wDTjRlq1jK was to make the posts more easily accessible to LLM agents

1

18

3

5

2K

ye_combinator retweeted

about 1 month ago

Super happy to see my UCSD colleague help port our DFlash to TPU. Big speedup too!! More to come.

10

113

9

17

14K

ye_combinator retweeted

@nrehiew_

about 1 month ago

This likely means OpenAI does interleaved stages of SFT-RL-SFT-RL rather than the simpler SFT-RL-done pipeline we see with open models https://t.co/FA4XQAVzHn

12

467

22

248

42K

ye_combinator retweeted

AI Dance @AI_Whisper_X

about 1 month ago

An interesting piece of research.

Closed-source labs are tight-lipped about model size, but they can't really hide what a model "knows". And what a model knows is precisely an indicator of its parameter count.
The core logic: reasoning ability can be compressed into a small model through distillation, but factual knowledge cannot. How many obscure facts a model remembers is tied directly to its parameter count.

Zhihu blogger Li Bojie (李博杰) wrote a short paper on this and built a dataset called IKP (Incompressible Knowledge Probes): 1,400 questions across 7 rarity tiers, run on 188 models from 27 vendors, scoring only factual accuracy.

On 89 open-source models with public parameter counts, the fit of accuracy vs. log(parameter count) has R²=0.917, basically a straight line. Project the closed-source models onto it and you get size estimates:

GPT-5.5 ≈ 9T
Claude Opus 4.7 ≈ 4T
GPT-5.4 ≈ 2.2T
Claude Sonnet 4.6 ≈ 1.7T
Gemini 2.5 Pro ≈ 1.2T
(90% confidence interval: 0.3-3x the size)

Two other findings are also fairly counterintuitive:
First, citation count and h-index do not predict whether a researcher is recognized by frontier models. For two people with similar citation counts, the model's answers can be completely different. What it remembers is influential work, not the number of papers.
Second, factual capacity is not compressed over time. Across 96 open-source models spanning 3 years, the IKP time coefficient is statistically zero (p<10⁻¹⁵), directly rejecting the +0.0117/month decay predicted by the Densing Law. Benchmarks are saturating, but factual capacity keeps expanding with parameter count.

Source: Zhihu blogger Li Bojie
Will be removed on request in case of infringement.
https://t.co/Bt5CiMGc5M
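The estimation idea described in this post (fit factual accuracy against the log of parameter count on open models, then invert the line for a closed model) can be sketched roughly as below. This is only an illustration under stated assumptions: the data points are synthetic and made up, the helper names `fit_line` and `estimate_params` are hypothetical, and the paper's actual dataset, scoring, and fitting procedure may differ.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot


def estimate_params(accuracy, a, b):
    """Invert the fitted line: log10(params) = (accuracy - a) / b."""
    return 10 ** ((accuracy - a) / b)


# Synthetic (made-up) open models: (parameter count, factual accuracy).
open_models = [(1e9, 0.10), (7e9, 0.21), (7e10, 0.33), (4e11, 0.42), (1e12, 0.47)]
xs = [math.log10(params) for params, _ in open_models]
ys = [accuracy for _, accuracy in open_models]
a, b, r2 = fit_line(xs, ys)

# Hypothetical closed model that scores 0.50 on the same probe.
estimate = estimate_params(0.50, a, b)
print(f"slope={b:.3f} r2={r2:.3f} estimated params ~ {estimate:.2e}")
```

Because the estimate is an exponential of the inverted fit, small errors in measured accuracy turn into multiplicative errors in size, which is consistent with the wide 0.3-3x interval quoted in the post.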

52

1K

158

758

206K

ye_combinator retweeted

Edward Z. Yang @ezyang

about 1 month ago

A quick survey of PyTorch APIs that allow for controlling precision in a fine-grained way. 🧵

1

121

8

91

12K
