Haohui Mai

@wheat9

OS hacker+GPU optimization

Bay area

Joined October 2009

453 Following

247 Followers

1.9K Posts

Haohui Mai

@wheat9

about 1 month ago

@tianyin_xu @chinqrw Is he interested in doing infra work on inference? I know a couple fun places in both China and US

wheat9 retweeted

Tianyin Xu

@tianyin_xu

about 1 month ago

RT appreciated. Anyone looking for an excellent Linux kernel developer? Ruowen (@chinqrw) is one of the best. He is on the market due to the shutdown of Red Hat China. He's mainly looking in China, but is also open to jobs elsewhere. He co-leads the Rex project (https://t.co/B2CZSwiJyq) with @Jinghao_J, which they started at UIUC. He also has extensive experience working on Red Hat's kernel-QE. I worked with Ruowen when he was my TA for CS 423 and on the Rex project. He is great!

Haohui Mai

@wheat9

about 1 month ago

LLMs can write GPU kernels, but they still struggle to make them assembly-fast. Real-world performance requires complex, tightly coupled optimizations across the whole kernel. ARGUS is the first agentic framework to achieve assembly-fast performance on real-world GPU kernels. On AMD MI300X, it reaches 99–104% of hand-optimized assembly throughput on GEMM, FlashAttention, and fused MoE, while running 2–1543× faster than existing agentic systems. ARGUS makes these global properties explicit through data-flow invariants. These invariants specify what should match at key program points, such as ensuring tensor core instructions see consistent matrix operands despite changes to swizzled memory layouts, tiling, and pipelining. That gives both the compiler and the LLM dense guidance beyond sparse unit tests, verified at compile time with abstract interpretation and SMT solving. https://t.co/xpEtZyVk7I

Haohui Mai

@wheat9

about 2 months ago

It turns out that the only reliable network connection on a plane is UDP. TCP over UDP saves the day!

about 2 months ago

@SeTriones long live the try!

wheat9 retweeted

Yiying Zhang

@yiying__zhang

2 months ago

I wrote a post-mortem article on how glitches in an AI paper-writing assistant tool in the last 30 minutes caused my group to miss the SOSP deadline for work we had spent more than a year on. https://t.co/9ZkBQTGdaR

Haohui Mai

@wheat9

about 2 months ago

@SeTriones Kidney!

wheat9 retweeted

NeurIPS Conference

@NeurIPSConf

2 months ago

We want to speak directly to the concern many of you have expressed, and we owe you a clear explanation of what happened, why it happened, and where we stand now. We understand this situation caused genuine alarm and we take that seriously. In preparing the NeurIPS 2026 handbook, we included a link to a US government sanctions tool that covers a significantly broader set of restrictions than those NeurIPS is actually required to follow. This error was due to miscommunication between the NeurIPS Foundation and our legal team; there was never an intention to restrict participation beyond our mandatory compliance obligations. The responsibility for that error is ours as an organization, and we deeply apologize for the alarm and impact this miscommunication had on our community. We have updated the link and clarified the text of our policy, which is consistent with that of ACM and IEEE, as well as other international conferences and NeurIPS in the past. As in previous years, NeurIPS welcomes submissions from all compliant institutions and individuals. We want to reiterate that NeurIPS is a community-driven event, created by and for the community, and strives to be inclusive. The NeurIPS 2026 organizing committee was particularly saddened to learn of this institutional miscommunication. The organizing committee has taken on the responsibility of running the conference this year with the goal of fostering open communication, knowledge sharing, and global scientific discourse. We thank the community for bringing this issue to our attention and working with us through this situation.

wheat9 retweeted

LaurieWired

@lauriewired

4 months ago

If you’re a CS/EE student, write your thesis on JIT compilation of eBPF for NVMe controllers. There’s huge career alpha in computational storage; the standards are *just* starting to exist (TP4091). https://t.co/69r3PxIHIE

Haohui Mai

@wheat9

3 months ago

@SeTriones @shao__meng 🦞？

wheat9 retweeted

i Expose Racists & Pedos

@SeeRacists

4 months ago

HEARTBREAKING: Ex-PhD student Brendt Christensen found GUILTY of posing as cop, luring, abducting, R*ping & d*capitating Chinese scholar Yingying Zhang in his apartment in 2017. Her dism*mbered remains STILL missing. Never forget Yingying’s story.

wheat9 retweeted

Zhijian Liu

@zhijianliu_

4 months ago

The paper is now available: https://t.co/7hMxYpLxDt More updates coming soon!

wheat9 retweeted

Aakash Gupta

@aakashgupta

4 months ago

Sounds incredible until you read the fine print. The compiler generates less efficient code than GCC with all optimizations disabled. It doesn’t have its own assembler or linker. It can’t produce a 16-bit x86 code generator. And Carlini himself says it has “nearly reached the limits of Opus’s abilities.” New features and bugfixes kept breaking existing functionality. So what did $20,000 and two weeks actually buy? A compiler that passes 99% of GCC’s torture tests but can’t match the output quality of a tool that’s had 37 years of human engineering. That’s the constraint nobody’s pricing in. The real story is in the cost curve, not the capability demo. $20,000 for 100,000 lines means $0.20 per line of generated code. A senior compiler engineer costs roughly $150/hour. At maybe 50 polished lines per hour for something this complex, that’s $3/line. AI just did it at 15x cheaper, and it will only get cheaper from here. But the code isn’t equivalent. The AI version needs a human to finish the assembler, fix the linker, optimize the output, and prevent regressions. Those are the hardest 20% of the problem, and they represent 80% of the engineering value. Anthropic built the demo. Shipping the product still requires humans. This tells you exactly where we are in the autonomous software timeline. AI can now produce impressive first drafts of complex systems at trivial cost. Turning those drafts into production software still requires the judgment that costs $300K+ per year in compiler engineer salary. The gap between “compiles the Linux kernel” and “replaces GCC” is measured in decades of accumulated engineering wisdom that no model has internalized yet. The companies that understand this will use agent teams to generate the 80% and hire engineers to finish the 20%. The companies that don’t will ship $20,000 compilers that produce slower code than a free tool from 1987.

Haohui Mai

@wheat9

4 months ago

@Yuchenj_UW My experience is that Codex seems to have better world knowledge, which makes it more effective at triaging and debugging. Claude Code excels at day-to-day software engineering tasks that need more automation.

Haohui Mai

@wheat9

5 months ago

@HotAisle For dense models, NVFP4 works out of the box (Petit). We are adding MoE support these days. Stay tuned.

Haohui Mai

@wheat9

5 months ago

@SeTriones We've got to step up the grind

wheat9 retweeted

Jeff Dean

@JeffDean

6 months ago

Performance Hints: Over the years, my colleague Sanjay Ghemawat and I have done a fair bit of diving into performance tuning of various pieces of code. We wrote an internal Performance Hints document a couple of years ago as a way of identifying some general principles and we've recently published a version of it externally. We'd love any feedback you might have! Read the full doc at: https://t.co/jej95g236P