Nadeem Wani 🐻‍❄️

@nhwaani

Salafi I Software Engineer. Making & breaking things at scale. Cycling 🚵‍♂️, books 📚 🐾🐱

Srinagar Kashmir

Joined May 2018

1.6K Following

158 Followers

3.1K Posts

Pinned Tweet

Nadeem Wani 🐻‍❄️ @nhwaani

about 5 years ago

“Every man has two lives, and the second starts when he realizes he has just one” — Confucius.

Nadeem Wani 🐻‍❄️ @nhwaani

5 days ago

@alexocheema @exolabs @nvidia Let's code now 😛 🧑‍💻 @alexocheema

154

nhwaani retweeted

antirez @antirez

8 days ago

Why am I taking this DwarfStar thing so seriously? Nothing like this has happened to me since the Redis days. I believe strongly in local inference, as a safety net. But there is more: I enjoy doing this stuff. So at 50 you are still not wise enough to avoid doing new stuff. So lame.

809

38K

nhwaani retweeted

antirez @antirez

9 days ago

In the DwarfStar GitHub you can now find the "streaming" branch. If you have a MacBook M<something> with 64GB or 32GB, please test the generation speed, and the prefill speed with /read README.md, and report back. Thanks. P.S. It is alpha quality code until it gets tested more.

182

26K

Nadeem Wani 🐻‍❄️ @nhwaani

9 days ago

@lennysan And how much is garbage?

nhwaani retweeted

antirez @antirez

10 days ago

DeepSeek v4 PRO running via SSD streaming on my 128GB MacBook M5 Max. 1.6 trillion parameters.

118

825

264K

nhwaani retweeted

Alex Cheema

@alexocheema

12 days ago

A year ago at GTC, Jensen brought out a DGX Spark in one hand and a MacBook in the other.

Yesterday, at GTC Taipei, Jensen brought out NVIDIA's new RTX Spark laptop in both hands.

This is the start of a new era of personal computing - the personal AI era.

In the new era, there are two competing platforms:
- @apple with macOS / MLX
- @nvidia with Windows / CUDA

Everyone will have an always-on personal agent that runs locally, constantly looking out for you, working for you proactively, monitoring the internet and talking to other agents. This will be a personal AI agent you own, that's private, that's aligned with you (not OpenAI or Anthropic). @karpathy calls it personal computing v2.

Let's set the scene for the new era of personal computing by diving into the one thing that will matter the most - the hardware.

The best hardware for local AI isn't what's running in a data center. It's a radically different problem. Here's a breakdown of the 3 most important things:

1. Memory.
LLMs are big. To run a model locally, you need to fit the entire model into memory. Apple (with Apple Silicon) and NVIDIA (with DGX Spark + RTX Spark) have both moved towards unified memory, which puts all the memory on one chip - leveraging cheaper LPDDR5X memory - useful for making more memory accessible to the GPU. The alternative competing architecture is a disaggregated CPU/GPU architecture - which is what the DGX Station uses. It has a large pool of slow LPDDR5X CPU memory (496GB @ 396GB/s), and a small pool of high-speed HBM3e GPU memory (252GB @ 7.1TB/s). It has a high bandwidth link (900GB/s) between the CPU memory and GPU memory, enabling fast disaggregated inference e.g. Attention on GPU, FFN on CPU. This enables running really large models like Kimi K2.6 (1T parameters) by offloading experts from CPU memory to GPU memory as they are needed. You could imagine something like this in a smaller form factor.
Hardware today:
- Apple M5 Max MacBook Pro: 128GB unified memory.
- NVIDIA DGX Spark / RTX Spark: 128GB unified memory.
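
The "fit the entire model into memory" point above can be checked with a rough calculation. This is a minimal sketch, not anything from the thread: it counts weights only, and the `headroom` fraction (memory left over for KV cache, OS and apps) and the 4-bit quantization used for the 1T-parameter case are hypothetical assumptions.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8.0


def fits_in_memory(params_billions: float, bits_per_weight: float,
                   memory_gb: float, headroom: float = 0.75) -> bool:
    """True if the weights fit in `headroom` of total memory.

    `headroom` is a hypothetical allowance for KV cache, OS and apps.
    """
    return weight_footprint_gb(params_billions, bits_per_weight) <= memory_gb * headroom


if __name__ == "__main__":
    # A 35B-parameter model at 8 bits per weight on a 128GB machine.
    print(weight_footprint_gb(35, 8), fits_in_memory(35, 8, 128))
    # A 1T-parameter model at an assumed 4 bits per weight on the same machine.
    print(weight_footprint_gb(1000, 4), fits_in_memory(1000, 4, 128))
```

Under these assumptions the 1T-parameter case is several times larger than 128GB, which is why offloading or streaming schemes come up for models of that size.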

2. Memory bandwidth.
In a data center, multiple users' requests can be batched together, which amortizes the cost of moving model weights into memory across many requests, pushing arithmetic intensity up into compute-bound territory - meaning FLOPS matter a lot. Locally, everything runs at low batch size, which means low arithmetic intensity, i.e. memory bound - so FLOPS don't matter. What matters is memory bandwidth. High memory bandwidth -> fast TPS. Low memory bandwidth -> slow TPS.
Hardware today:
- Apple M5 Max MacBook Pro: 617GB/s memory bandwidth.
- NVIDIA DGX Spark: 273GB/s memory bandwidth.
- NVIDIA RTX Spark: TBC.
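
The "memory bound" argument above implies a simple ceiling: at batch size 1, each generated token has to read the active weights once, so tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below works that out for the two bandwidth figures listed. The 3B active parameters at 8 bits per weight is an illustrative assumption, and the result is an upper bound that ignores KV cache reads and other overheads, not a measured speed.

```python
def decode_tps_upper_bound(bandwidth_gb_s: float,
                           active_params_billions: float,
                           bits_per_weight: float) -> float:
    """Bandwidth-bound ceiling on decode tokens/s at batch size 1.

    Assumes every active weight is read from memory once per token.
    """
    gb_read_per_token = active_params_billions * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Bandwidth figures from the list above; model shape is an assumption.
    for name, bw in [("M5 Max MacBook Pro", 617), ("DGX Spark", 273)]:
        tps = decode_tps_upper_bound(bw, active_params_billions=3, bits_per_weight=8)
        print(f"{name}: at most {tps:.0f} tok/s")
```

Whatever model shape is plugged in, the ratio between the two machines stays at 617/273, roughly 2.3x, because the bound scales linearly with bandwidth.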

3. Power.
In a data center, we talk about megawatts. Locally, we talk about watts. Laptops have limited battery life. The best laptop batteries have a capacity of ~100Wh. LLM inference on a MacBook Pro consumes ~140W, meaning battery life with a persistent personal agent is less than an hour. This is unusable. The game will become how long you can run a useful agent on a laptop battery. Apple and NVIDIA will compete on how long an agent can run on battery - this will become the new battery life metric. This could be where an NPU or NPU/GPU hybrid really shines. Apple ANE has about 10x better power efficiency than the GPU on Apple Silicon (but has ~4-5x less memory bandwidth, with about the same FLOPS as the GPU). There will be an entire design space of how to build energy efficient agents - this will involve co-optimizing the harness, models, and inference engines together.
Hardware today:
- Apple M5 Max MacBook Pro: Consumes 140W, battery capacity ~100Wh
- NVIDIA DGX Spark: Rated for 240W, consumes 140W. No battery (direct PSU).
- NVIDIA RTX Spark: TBC.
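
The "less than an hour" figure above follows directly from capacity divided by draw. A minimal sketch of that arithmetic, assuming a constant draw for the whole discharge; the 8-hour target in the second function is a hypothetical example, not a number from the thread.

```python
def battery_runtime_minutes(capacity_wh: float, draw_w: float) -> float:
    """Runtime in minutes for a battery drained at a constant power draw."""
    return capacity_wh / draw_w * 60.0


def max_average_draw_w(capacity_wh: float, target_hours: float) -> float:
    """Average draw that would make the battery last `target_hours`."""
    return capacity_wh / target_hours


if __name__ == "__main__":
    # ~100Wh battery at ~140W sustained inference, as quoted above.
    print(f"{battery_runtime_minutes(100, 140):.0f} minutes")
    # Power budget for a hypothetical 8-hour always-on agent.
    print(f"{max_average_draw_w(100, 8):.1f} W average")
```

Read the other way round, the second function gives the power budget an always-on agent would have to stay within, which is the gap the efficiency argument above is about.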

The hardware battle will be fierce, and I expect a move towards co-design, i.e. hardware designed *with* personal agent workloads. On top of this, models are improving, we're getting more intelligence per bit/watt, and open-source harnesses like @NousResearch Hermes / OpenClaw are improving rapidly. Within the next 2 years, we'll inevitably have unmetered, private Opus-4.8 / GPT-5.5 level intelligence running locally on a future version of a MacBook or RTX Spark. I like this future a lot better than the one where OpenAI / Anthropic control the intelligence layer of the internet and can rent-seek on intelligence.

Beyond this, NVIDIA is ahead on general AI ecosystem, i.e. the CUDA moat. Apple is ahead on local AI ecosystem, i.e. models quantized/rightsized for MacBooks, native macOS apps, and ease of setup. We'll see how this might change as the new RTX Spark also brings full native CUDA to Windows-on-Arm laptops for the first time, potentially closing the gap.

There are many other factors I haven't mentioned here, but I believe I've covered the timeless, most important things for the new era of personal computing.

519

460

108K

Nadeem Wani 🐻‍❄️ @nhwaani

12 days ago

@davidsenra @IvankaTrump @naval We need one with @naval x @davidsenra

511

Nadeem Wani 🐻‍❄️ @nhwaani

14 days ago

@ivanfioravanti Will 8-bit also fit well on a 128GB M5 Max, @ivanfioravanti?

Nadeem Wani 🐻‍❄️ @nhwaani

15 days ago

@antirez Qwen3.6 35B A3B 8-bit seems to do really well too, @antirez, on the M5 Max 128GB. cc @ivanfioravanti

153

Nadeem Wani 🐻‍❄️ @nhwaani

15 days ago

@antirez DS4 is the stronger coding model in my runs too.

Nadeem Wani 🐻‍❄️ @nhwaani

15 days ago

@antirez My local M5 Max benchmarks line up with this. DS V4 Flash q2-imatrix scored 21/26, DS V4 Flash 20/26, while Step 3.7 Flash IQ4_XS scored 12/26. Step did pass the agentic patch tasks, but had much worse prompt-output reliability and was slower overall.

133

nhwaani retweeted

antirez @antirez

15 days ago

Based on the above tests, DeepSeek is overall a more capable model for coding tasks. It delivers better code, fixes bugs iteratively more efficiently, and does all that using fewer tokens. And fewer tool calls, which is also remarkable.

146

10K

nhwaani retweeted

Mitchell Hashimoto

@mitchellh

16 days ago

I've got an agent in a loop optimizing a renderer with the goal to minimize frame times (and tests to measure). It got times down from 88ms to 2ms and allocations down from ~150K to 500. Sounds good, right? Wrong. This is exactly why agent psychosis is a big fucking problem.

As an experiment, I rewrote the Ghostty core render state in Go, with access to identically laid out data structures as Ghostty and the exact same validation tests. I made a purposely naive renderer (simple, correct, but slow). 88ms per frame with 150,000 allocations (horrendous, lol)!

I then kickstarted a Ralph loop to bring the frame times down. I told it it can't modify input data structures or the public API or tests (they're correct), but it can do anything else it wants. It got to work. It has worked for about 4 hours. I've spent around $350 on this experiment so far.

The results?
- 88ms => 1.5ms
- 150K allocs => ~500 allocs

Incredible right? Nope. My hand-written renderer I ported has frame times (same benchmark) of ~20us (0.020ms) and 0 allocations in the update path.

This is the problem with psychosis and lacking systems understanding. If you don't understand the system, you're going to accept that this is an incredible result. If you understand the system, you'll see better solutions immediately and can do roughly 75x better on throughput. The people who blindly trust agent output are in the former camp. They're sheeple, overdrinking from a fountain of mediocrity.

Standard disclaimer: I use AI all the time. I like AI. The point I'm making is to not blindly accept results. Think. Analyze. Learn.

308

979

791K
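
The "roughly 75x" in the post above comes from comparing the agent's result with the hand-written renderer, not with the naive baseline. A small sketch of that arithmetic using only the frame times quoted in the post, expressed in microseconds to keep the division exact:

```python
def speedup(slow_us: float, fast_us: float) -> float:
    """How many times faster `fast_us` is than `slow_us` (same units)."""
    return slow_us / fast_us


if __name__ == "__main__":
    naive_us = 88_000   # naive renderer: 88ms per frame
    agent_us = 1_500    # after the agent loop: 1.5ms per frame
    hand_us = 20        # hand-written port: ~20us per frame

    print(f"agent vs naive: {speedup(naive_us, agent_us):.1f}x")
    print(f"hand-written vs agent: {speedup(agent_us, hand_us):.0f}x")
    print(f"hand-written vs naive: {speedup(naive_us, hand_us):.0f}x")
```

The first ratio looks impressive on its own; the second is the one the post argues you only notice if you know what the system is capable of.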

nhwaani retweeted

antirez @antirez

17 days ago

English video: DwarfStar distributed inference demo, theory and implementation: https://t.co/yjtJtolbce

154

23K

Nadeem Wani 🐻‍❄️ @nhwaani

17 days ago

@antirez Great work. Just want to add that exo has something on this (https://t.co/l1kW1jE7au). Could we leverage their ecosystem and add support for accessing ds4 there?

292

Nadeem Wani 🐻‍❄️ @nhwaani

17 days ago

@alexocheema @antirez @ivanfioravanti Which one do you find better on 128GB for your use cases?

Nadeem Wani 🐻‍❄️ @nhwaani

18 days ago

@alexocheema @antirez to get more audience

Nadeem Wani 🐻‍❄️ @nhwaani

17 days ago

Results:
- Agent JSONL: 14.7 → 30.5 tok/s
- Code boilerplate: 13.5 → 34.6 tok/s

Safer opt-in. @ivanfioravanti @antirez https://t.co/XEMTkdzJ3a

Nadeem Wani 🐻‍❄️ @nhwaani

17 days ago

Tuned ds4 suffix decoding defaults. Old default over-drafted: guessed long token spans, verifier accepted short prefixes → ~2× slowdown on M5 Max. New default uses short drafts.
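
The over-drafting effect described above can be illustrated with a toy cost model of speculative decoding. This is a sketch under stated assumptions, not ds4's actual code or measurements: each drafted token is accepted independently with probability `p`, drafting itself is treated as free, and verifying a draft of length `k` costs `1 + c * k` normal decode steps. The values chosen for `p` and `c` are hypothetical.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens produced per verify step.

    Counts the accepted prefix of the draft plus the one token the
    verifier always yields: sum of p**i for i in 0..k.
    """
    return sum(p ** i for i in range(k + 1))


def relative_speed(p: float, k: int, c: float) -> float:
    """Throughput relative to plain decoding (1.0 means no change).

    `c` is a hypothetical extra verification cost per drafted token,
    measured in units of one normal decode step.
    """
    return expected_tokens_per_step(p, k) / (1.0 + c * k)


if __name__ == "__main__":
    p, c = 0.3, 0.1  # illustrative acceptance rate and per-token verify cost
    for k in (2, 4, 8, 16):
        print(f"draft length {k:2d}: {relative_speed(p, k, c):.2f}x")
```

With a low acceptance rate, the expected number of accepted tokens barely grows with draft length while the verification cost keeps rising, so long drafts fall below 1.0x and short drafts stay above it. That is the shape of the trade-off the tweet describes, though the real numbers depend on the workload.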

Nadeem Wani 🐻‍❄️ @nhwaani

17 days ago

@mitsuhiko @mitsuhiko

Nadeem Wani 🐻‍❄️ @nhwaani

17 days ago

@mitsuhiko There is already power option support; maybe try that. https://t.co/oXfXUMyDlE

777
