Leoneo

@hongbosherlock

A photographer who can't write is not a good programmer. Travel around China 18/34, travel around the world 6/200

Joined November 2019

596 Following

63 Followers

398 Posts

hongbosherlock retweeted

Max For AI

@MaxForAI

about 1 month ago

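Susan Zhang (张苏珊), a researcher now at Google DeepMind and formerly at Meta and OpenAI, says that in today's tech industry, being someone with a composite skill set may be more of an advantage than going deep in only a few areas. This is not to say that professional depth doesn't matter. Rather, while you keep learning, you can add a new field to your skill stack every few years. Over time, what you accumulate is not just "knowing a lot" but an increasingly rare combination of abilities. If you can reach the top 10% in 3 different fields, then probabilistically you are already close to the top 0.1% of people. What is truly scarce is often not being the strongest at a single point, but a combination of abilities that others find hard to replicate. So don't lock yourself into one label too early. You can keep switching and keep probing until you find the position you are best at. Then, when others start resting on their laurels, switch forward once more. This path is not for everyone, but for people who don't like standing still it fits well.

Also re-examine the traditional markers of "value": elite schools, résumés, titles, promotion tracks, corporate ladders. Today's university system is under pressure, and traditional management layers are being taken apart by AI and organizational efficiency. If you still want to take this road, make sure you really know why. Accumulating resources, relationships, and influence inside an institution is of course attractive. But if you spend most of your energy on complicated interpersonal games, organizational maneuvering, and power calculations, you may no longer be a technical person and are more like a politician without votes. You can choose this path, just be clear about what you are choosing.

Critical thinking has also become more important than it used to be. Don't readily believe any packaged narrative. Don't just look at what a person says, look at what they have actually built. When interests and incentives are misaligned, people produce their own "slop" too, which is not fundamentally different from low-quality AI-generated content. The sooner you learn to tell who is telling stories, who is performing, and who is actually doing the work, the sooner you can judge which projects, directions, and people deserve your time.

As for the anxiety about being replaced by AI, there is no need to pretend it doesn't exist. It is certainly real. The respectability and prestige of many white-collar jobs were built on a system of performance to begin with. Many roles look senior but have little to do with real technical breadth, technical depth, or problem-solving ability. AI has simply exposed this earlier. But as long as you are still learning, still building things hands-on, and can still show verifiable results, you are not that fragile. What is truly valuable is not what is written on your résumé, but whether you can keep producing things and keep proving that you have solved real problems.

If you can accept all of this, welcome to the tech industry. It really is one of the few places that lets people feel excited again every few years. New tools, new paradigms, and new problems keep appearing. They may be unsettling at first, but that very unease is what keeps giving this industry new entry points. So don't give up easily. Drink plenty of water, get some sun, and rest when you need to. Career paths in tech were never a straight line. They are more like a river that keeps changing course. Given that, you might as well accept the non-linearity and keep moving forward through the change. 🍻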

290

311

81K

hongbosherlock retweeted

Akshay 🚀

@akshay_pachaar

about 2 months ago

CPU vs GPU vs TPU vs NPU vs LPU, explained visually:

5 hardware architectures power AI today. Each one makes a fundamentally different tradeoff between flexibility, parallelism, and memory access.

> CPU
It is built for general-purpose computing. A few powerful cores handle complex logic, branching, and system-level tasks. It has deep cache hierarchies and off-chip main memory (DRAM). It's great for operating systems, databases, and decision-heavy code, but not that great for repetitive math like matrix multiplications.

> GPU
Instead of a few powerful cores, GPUs spread work across thousands of smaller cores that all execute the same instruction on different data. This is why GPUs dominate AI training. The parallelism maps directly to the kind of math neural networks need.

> TPU
They go one step further with specialization. The core compute unit is a grid of multiply-accumulate (MAC) units where data flows through in a wave pattern. Weights enter from one side, activations from the other, and partial results propagate without going back to memory each time. The entire execution is compiler-controlled, not hardware-scheduled. Google designed TPUs specifically for neural network workloads.

> NPU
This is an edge-optimized variant. The architecture is built around a Neural Compute Engine packed with MAC arrays and on-chip SRAM, but instead of high-bandwidth memory (HBM), NPUs use low-power system memory. The design goal is to run inference at single-digit watt power budgets, like smartphones, wearables, and IoT devices. Apple Neural Engine and Intel's NPU follow this pattern.

> LPU (Language Processing Unit)
This is the newest entrant, by Groq. The architecture removes off-chip memory from the critical path entirely. All weight storage lives in on-chip SRAM. Execution is fully deterministic and compiler-scheduled, which means zero cache misses and zero runtime scheduling overhead. The tradeoff is that it provides limited memory per chip, which means you need hundreds of chips linked together to serve a single large model. But the latency advantage is real.

AI compute has evolved from general-purpose flexibility (CPU) to extreme specialization (LPU). Each step trades some level of generality for efficiency.

The visual below maps the internal architecture of all five side by side.

👉 Over to you: Which of these 5 have you actually worked with or deployed on?
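
The "wave pattern" described for the MAC grid can be made concrete with a toy simulation. The sketch below is an illustration only, not a model of any real TPU: it assumes an output-stationary grid (each cell keeps its own partial sum), whereas production designs may keep weights stationary instead. The function name `systolic_matmul` and the cycle-counting scheme are assumptions made for this example.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary MAC grid computing a @ b cycle by cycle.

    Cell (i, j) holds one accumulator. Inputs are skewed so that the k-th
    operand pair reaches cell (i, j) at cycle i + j + k, which produces the
    diagonal wavefront moving across the grid.
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]
    # The last operand pair reaches the far-corner cell at cycle n + m + k_dim - 3.
    cycles = n + m + k_dim - 2
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                k = t - i - j  # which operand pair is passing through this cell now
                if 0 <= k < k_dim:
                    acc[i][j] += a[i][k] * b[k][j]  # one multiply-accumulate
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)  # [[19, 22], [43, 50]] 4
```

Partial sums never leave the grid during the computation, which is the property the tweet highlights: no round trip to memory between multiply-accumulate steps.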

863

243K

hongbosherlock retweeted

Vivek Galatage

@vivekgalatage

3 months ago

Inside NVIDIA GPUs: Anatomy of high-performance matmul kernels

One of the finest, most in-depth posts that everyone MUST read. Amazing work by @gordic_aleksa!!

https://t.co/LMVr2mXfGQ https://t.co/4l89svvkCu

408

437

43K

hongbosherlock retweeted

ngrok @ngrokHQ

3 months ago

Quantization can make an LLM 4x smaller and 2x faster, with barely any quality loss. But what *is* it? @samwhoo crafted a beautiful interactive essay explaining it from first principles, aimed at coders, not mathematicians. https://t.co/UfE3N1F9vw
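
For a rough sense of where "4x smaller" can come from, here is a minimal sketch of symmetric absmax quantization from 32-bit floats to 8-bit integers. It is not taken from the linked essay. The helper names `quantize_int8` and `dequantize` are made up for this example, and real LLM quantizers typically work per block or per channel rather than with one scale for everything.

```python
from array import array


def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Storage per value: 4 bytes as float32 versus 1 byte as int8.
bytes_fp32 = array("f", weights).itemsize
bytes_int8 = array("b", q).itemsize
print(q, bytes_fp32 // bytes_int8)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the source of the small quality loss the tweet mentions.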

200

667K

hongbosherlock retweeted

Thariq

@trq212

3 months ago

https://t.co/45C3gKydTK

388

16K

44K

hongbosherlock retweeted

stdrc

@istdrc

4 months ago

During the Chinese New Year holiday, I built an agent-native IM where AI agents are first-class citizens: https://t.co/VMPbsTDbWY

No hand-written code. I never even reviewed a single line. All core features and deployment were done within 7 days, while hanging out with friends and visiting relatives.

Dead simple to use:

1. Connect a machine with Claude Code installed
2. Create agents with optional role descriptions
3. Chat and build

Feedback welcome!

450

420

171K

hongbosherlock retweeted

Eric Zhang

@ekzhang1

4 months ago

For the next couple weeks at NY systems reading group, we’ll be writing some GPU kernels

This was much requested, and I think now that there’s some excellent, well-written content online about CuTe DSL, makes sense to learn!

(CUDA is too hard / arcane, CuTe DSL is just low-level enough to be interesting to systems folks looking for a new challenge perhaps. Plus it might be useful to learn it for work.)

I was trying to figure out how we give people a quick trial of Hopper/Blackwell GPUs and realized I already made a free product for this at my last job lol, and it’s on-demand + collaborative. So we’ll use Notebooks.

https://t.co/ArFSF8ESxA

745

750

39K

hongbosherlock retweeted

Hamza Elshafie

@hamzaelshafie

5 months ago

Wrote an in depth breakdown of Paged Attention and KV cache management in modern inference systems like vLLM.

Starting from first principles:

- LLM training vs inference
- Prefill vs decoding
- Why KV caching exists
- Where memory fragmentation comes from

Then how vLLM style paged KV caching fixes it. Appendix also covers continuous batching, speculative decoding, and quantisation.

Blog: https://t.co/L97njbPDkM
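
The idea behind paged KV caching can be sketched in a few lines: memory is split into fixed-size blocks, and each sequence keeps a block table mapping its logical positions to whatever physical blocks happen to be free. The code below is a toy allocator written for illustration only. The class `PagedKVCache` and its methods are hypothetical and do not reflect vLLM's actual API or the linked blog's code.

```python
class PagedKVCache:
    """Toy block allocator: fixed-size blocks plus a per-sequence block table."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a KV slot for one new token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token("a") for _ in range(3)]
print(slots, cache.block_tables["a"])
```

Because a sequence only ever wastes the unused tail of its last block, the fragmentation that comes from reserving one large contiguous region per request is avoided.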

761

803

28K

hongbosherlock retweeted

kepano

@kepano

6 months ago

if you're using Obsidian with Claude Code, tell me about your workflow, and what you've used it for

417

316

hongbosherlock retweeted

Vivek Galatage

@vivekgalatage

7 months ago

A great start here https://t.co/99R9huiHK6

881

101

66K

Leoneo @hongbosherlock

7 months ago

@lee_toyofbob Followed

Leoneo @hongbosherlock

7 months ago

Do you remember when you joined X? I do! #MyXAnniversary

hongbosherlock retweeted

Fernando

@Franc0Fernand0

8 months ago

There is a reason why System Design is hard for most software engineers.

They don't understand how distributed systems work.

If you want to learn the basics of distributed systems, read these 13 curated articles: ↓ https://t.co/hvQ5uvPcuu

183

87K

hongbosherlock retweeted

nathan chen

@nathancgy4

9 months ago

(1/6) triton kernels are a great way to understand ML models. but tutorials are scattered

the learning method for me was just to read real, high performance code

so i wrote a blog which walks through the design and intuitions behind FLA's softmax attention kernel

🧵also a thread https://t.co/kW45q1Dbkk

111

103K

hongbosherlock retweeted

Aleksa Gordić (水平问题)

@gordic_aleksa

10 months ago

New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work!

Took me a while to get this level of understanding of the codebase and then to write up this one - i quickly realized i underestimated the effort. 😅 It could have easily been a book/booklet (lol).

I covered:

* Basics of inference engine flow (input/output request processing, scheduling, paged attention, continuous batching)

* "Advanced" stuff: chunked prefill, prefix caching, guided decoding (grammar-constrained FSM), speculative decoding, disaggregated P/D

* Scaling up: going from smaller LMs that can be hosted on a single GPU all the way to trillion+ params (via TP/PP/SP) -> multi-GPU, multi-node setup

* Serving the model on the web: going from offline deployment to multiple API servers, load balancing, DP coordinator, multiple engines setup :)

* Measuring perf of inference systems (latency (ttft, itl, e2e, tpot), throughput) and GPU perf roofline model

Lots of examples, lots of visuals!

---

I realize i've been silent on social - many of you noticed and thanks for reaching out! :) --> I'm so back! lots of things happened.

Also, in general, I'm a bit sick of superficial content, it really is an equivalent of junk food (h/t @karpathy).

I want to do the best/deepest technical work of my life over the next years and write much more in depth (high quality organic food ;)) so I might not be as frequent around here as i used to be (? we'll see). I'll make it a goal to share a few paper summaries a week or stuff that's relevant / in the zeitgeist.

If you have any topics that happened over the past few weeks/months drop it down in the comments i might focus on some of those in my next posts.

---

Huge thank you to @Hyperstackcloud for giving me an H100 node to run some of the experiments and analysis that i needed to write this up. The team there led by Christopher Starkey is amazing!

Also a big thank you to Nick Hill (who did a very thorough review of the post - basically a code review lol; Nick's a core vLLM contributor and principal SWE at RedHat) and to my friends Kyle Krannen (NVIDIA Dynamo), @marksaroufim (PyTorch), and @ashVaswani (goat) for taking the time during weekend when they didn't have to!
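
The latency abbreviations in the list above (ttft, itl, e2e, tpot) can be pinned down with a small calculation. The sketch below uses commonly cited definitions, which differ slightly between benchmarking tools, so treat them as assumptions rather than the post's exact formulas. The function `latency_metrics` and its arguments are hypothetical.

```python
def latency_metrics(request_start, token_times):
    """Compute per-request latency metrics from output-token arrival times.

    token_times must be ascending timestamps in seconds, one per output token.
    """
    n = len(token_times)
    ttft = token_times[0] - request_start   # time to first token
    e2e = token_times[-1] - request_start   # end-to-end latency
    # Inter-token latency: the gap between each pair of consecutive tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    # Time per output token: decode time averaged over tokens after the first.
    tpot = (e2e - ttft) / (n - 1) if n > 1 else 0.0
    throughput = n / e2e                    # output tokens per second
    return {"ttft": ttft, "itl": itl, "e2e": e2e, "tpot": tpot, "throughput": throughput}


metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.5])
print(metrics)
```

With these definitions, ttft is dominated by prefill and queueing, while itl and tpot describe the decode phase.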

400

324K

hongbosherlock retweeted

卫斯理

@imwsl90

10 months ago

This site is pretty good: freedium. Paste in the link to a paywalled Medium article and you can read it for free... https://t.co/d7gec3QZIV

956

228

110K

hongbosherlock retweeted

Jacob Austin @jacobaustin132

10 months ago

Today we're putting out an update to the JAX TPU book, this time on GPUs. How do GPUs work, especially compared to TPUs? How are they networked? And how does this affect LLM training? 1/n https://t.co/Lci5vwhaRh

519

404K

hongbosherlock retweeted

Graham Helton (too much for zblock) @GrahamHelton3

about 1 year ago

Before moving from my role at Google to Snowflake I sat down and did a braindump of all the guidelines that I follow (or followed at one point and wanted to reintroduce). For those interested, here are the ~34 guidelines that made the cut

498

15K

Leoneo @hongbosherlock

over 1 year ago

I'm glad to have made a minor contribution to SGLang and vLLM. I have also worked on several PRs as a co-author. Thanks to my friends and the community, this is a good start for me. https://t.co/1MJteubXYJ

Leoneo @hongbosherlock

over 1 year ago

@zhyncs42 @HandH1998 tql (so impressive), amazing work after QQQ 👏 @HandH1998

146
