Winkle

@robbwinkle

Building AI for the real world. token maxxing. prev: martech, enterprise consulting, @Techstars

Columbus, OH

Joined January 2009

883 Following

292 Followers

1.8K Posts

robbwinkle retweeted

Alex Cheema

@alexocheema

3 days ago

A year ago at GTC, Jensen brought out a DGX Spark in one hand and a MacBook in the other.

Yesterday, at GTC Taipei, Jensen brought out NVIDIA's new RTX Spark laptop in both hands.

This is the start of a new era of personal computing - the personal AI era.

In the new era, there are two competing platforms:
- @apple with macOS / MLX
- @nvidia with Windows / CUDA

Everyone will have an always-on personal agent that runs locally, constantly looking out for you, working for you proactively, monitoring the internet and talking to other agents. This will be a personal AI agent you own, that's private, that's aligned with you (not OpenAI or Anthropic). @karpathy calls it personal computing v2.

Let's set the scene for the new era of personal computing by diving into the one thing that will matter the most - the hardware.

The best hardware for local AI isn't what's running in a data center. It's a radically different problem. Here's a breakdown of the 3 most important things:

1. Memory.
LLMs are big. To run a model locally, you need to fit the entire model into memory. Apple (with Apple Silicon) and NVIDIA (with DGX Spark + RTX Spark) have both moved towards unified memory, which puts all the memory on one chip - leveraging cheaper LPDDR5X memory - useful for making more memory accessible to the GPU. The alternative competing architecture is a disaggregated CPU/GPU architecture - which is what the DGX Station uses. It has a large pool of slow LPDDR5X CPU memory (496GB @ 396GB/s), and a small pool of high-speed HBM3e GPU memory (252GB @ 7.1TB/s). It has a high bandwidth link (900GB/s) between the CPU memory and GPU memory, enabling fast disaggregated inference e.g. Attention on GPU, FFN on CPU. This enables running really large models like Kimi K2.6 (1T parameters) by offloading experts from CPU memory to GPU memory as they are needed. You could imagine something like this in a smaller form factor.
Hardware today:
- Apple M5 Max MacBook Pro: 128GB unified memory.
- NVIDIA DGX Spark / RTX Spark: 128GB unified memory.
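
A rough way to sanity-check the "fit the entire model into memory" point is to estimate weight size from parameter count and quantization width. This is only a sketch: it counts weights alone (no KV cache or runtime overhead), and the `usable_fraction` headroom value is an assumption, not a figure from the post.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, usable_fraction: float = 0.75) -> bool:
    """True if the weights fit in the share of memory assumed usable for them.

    usable_fraction is a hypothetical headroom factor for the OS, KV cache, etc.
    """
    return weight_footprint_gb(params_billion, bits_per_weight) <= memory_gb * usable_fraction


# A 1T-parameter model at 4 bits per weight needs about 500 GB for weights alone.
one_trillion_4bit_gb = weight_footprint_gb(1000, 4)
fits_in_128gb = fits(1000, 4, memory_gb=128)            # unified-memory laptop class
fits_in_station = fits(1000, 4, memory_gb=496 + 252)    # DGX Station CPU + GPU pools
```

Under these assumptions a 1T-parameter model does not fit in 128GB but does fit across the DGX Station's combined memory pools, which matches the post's point about offloading experts.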

2. Memory bandwidth.
In a data center, multiple users' requests can be batched together, which amortizes the cost of moving model weights into memory across many requests, pushing up arithmetic intensity to compute bound territory - meaning FLOPS matter a lot. Locally, everything runs at low batch size, which is low arithmetic intensity, i.e. memory bound - so FLOPS don't matter. What matters is memory bandwidth. High memory bandwidth -> fast TPS. Low memory bandwidth -> slow TPS.
Hardware today:
- Apple M5 Max MacBook Pro: 617GB/s memory bandwidth.
- NVIDIA DGX Spark: 273GB/s memory bandwidth.
- NVIDIA RTX Spark: TBC.
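
The bandwidth argument can be made concrete with a back-of-the-envelope bound: at batch size 1, every generated token has to stream the active weights from memory once, so tokens per second cannot exceed bandwidth divided by weight bytes. The sketch below assumes a dense model, roughly 2 FLOPs per parameter per token, and a made-up `peak_tflops` value used purely for illustration; none of these numbers come from the post except the two bandwidth figures.

```python
def memory_bound_tps(bandwidth_gb_s: float, weight_gb: float, batch: int = 1) -> float:
    """Upper bound on total tokens/s when weight streaming is the bottleneck.

    One pass over the weights serves every request in the batch.
    """
    return batch * bandwidth_gb_s / weight_gb


def compute_bound_tps(peak_tflops: float, params_billion: float) -> float:
    """Upper bound on total tokens/s from compute, assuming ~2 FLOPs per parameter per token."""
    return peak_tflops * 1e12 / (2 * params_billion * 1e9)


def roofline_tps(bandwidth_gb_s, weight_gb, batch, peak_tflops, params_billion):
    """Whichever bound is lower wins."""
    return min(memory_bound_tps(bandwidth_gb_s, weight_gb, batch),
               compute_bound_tps(peak_tflops, params_billion))


# Hypothetical 70B dense model at 4 bits per weight: about 35 GB of weights.
m5_max_tps = memory_bound_tps(617, 35)      # about 17.6 tokens/s at batch 1
dgx_spark_tps = memory_bound_tps(273, 35)   # about 7.8 tokens/s at batch 1

# With an assumed 100 TFLOPS, batch 1 is memory bound; batch 1000 is compute bound.
local_tps = roofline_tps(273, 35, batch=1, peak_tflops=100, params_billion=70)
datacenter_style_tps = roofline_tps(273, 35, batch=1000, peak_tflops=100, params_billion=70)
```

At batch 1 the compute ceiling is irrelevant, which is the sense in which FLOPS "don't matter" locally; at large batch the ceiling flips to compute.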

3. Power.
In a data center, we talk about megawatts. Locally, we talk about watts. Laptops have limited battery life. The best laptop batteries have a capacity of ~100Wh. LLM inference on a MacBook Pro consumes ~140W, meaning battery life with a persistent personal agent is less than an hour. This is unusable. The game will become how long you can run a useful agent on a laptop battery. Apple and NVIDIA will compete on how long an agent can run on battery - this will become the new battery life metric. This could be where an NPU or NPU/GPU hybrid really shines. Apple ANE has about 10x better power efficiency than the GPU on Apple Silicon (but has ~4-5x less memory bandwidth, with about the same FLOPS as the GPU). There will be an entire design space of how to build energy efficient agents - this will involve co-optimizing the harness, models, and inference engines together.
Hardware today:
- Apple M5 Max MacBook Pro: Consumes 140W, battery capacity ~100Wh.
- NVIDIA DGX Spark: Rated for 240W, consumes 140W. No battery (direct PSU).
- NVIDIA RTX Spark: TBC.
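
The "less than an hour" figure follows directly from capacity divided by draw. The sketch below reproduces that arithmetic and adds a duty-cycle variant to show why agent design matters; the 10W idle draw and 10% duty cycle are hypothetical values chosen for illustration, not measurements.

```python
def battery_hours(capacity_wh: float, draw_w: float) -> float:
    """Runtime in hours for a given battery capacity and constant power draw."""
    return capacity_wh / draw_w


def average_draw_w(active_w: float, idle_w: float, duty_cycle: float) -> float:
    """Average draw when inference runs only for a fraction of the time."""
    return duty_cycle * active_w + (1 - duty_cycle) * idle_w


# Figures from the post: ~100Wh battery, ~140W during inference.
always_on_minutes = battery_hours(100, 140) * 60    # roughly 43 minutes

# Hypothetical agent that only runs inference 10% of the time, idling at 10W.
duty_cycled_hours = battery_hours(100, average_draw_w(140, 10, 0.1))
```

Under those assumed numbers the same battery lasts a bit over four hours, which is the kind of gain the harness, model, and inference engine would have to deliver together.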

The hardware battle will be fierce, and I expect a move towards co-design, i.e. hardware designed *with* personal agent workloads. On top of this, models are improving, we're getting more intelligence per bit/watt, and open-source harnesses like @NousResearch Hermes / OpenClaw are improving rapidly. Within the next 2 years, we'll inevitably have unmetered, private Opus-4.8 / GPT-5.5 level intelligence running locally on a future version of a MacBook or RTX Spark. I like this future a lot better than the one where OpenAI / Anthropic control the intelligence layer of the internet and can rent-seek on intelligence.

Beyond this, NVIDIA is ahead on general AI ecosystem, i.e. the CUDA moat. Apple is ahead on local AI ecosystem, i.e. models quantized/rightsized for MacBooks, native macOS apps, and ease of setup. We'll see how this might change as the new RTX Spark also brings full native CUDA to Windows-on-Arm laptops for the first time, potentially closing the gap.

There are many other factors I haven't mentioned here, but I believe I've covered the timeless, most important things for the new era of personal computing.

502

460

106K

robbwinkle retweeted

dan

@irl_danB

3 days ago

> What do you use subagents for?

many things, but my favorite:

the good old fan-out-fan-in

and I think there is more to this than "you can parallelize token spraying" (which... is fun, but... careful)

rather, the more important fan-out pattern is one in which each branch (subagent) accumulates experience, and then the fan-in synthesizes this experience into condensed learnings

what do I mean by experience?

it's conventional wisdom by now that the models do better when they have back-pressure to spew their tokens against

the thinking goes: if you're using agents to one-shot something, chances are it may be wrong on the first try. but if you give them some back-pressure--say, tests that they can run against the real world and whose results they can observe--their outputs converge on something more accurate

and it's not just parallelizing unit tests... any experiential "theory meets reality" observation of the world rolls up into this category. the one that emerges most often in my own usage is parallel research and synthesis

so it's not only interesting to parallelize work to just generate more tokens, it's interesting to parallelize work because you can accumulate experience faster

the fan-out-fan-in is an efficient empirical learning pattern

imagine splitting yourself to parallelize your lived experience into a sort of multiverse reality all of which you remember after your shards re-converge with learnings in tow
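
A minimal sketch of the fan-out-fan-in shape described above, with plain functions standing in for subagents. Everything here is hypothetical scaffolding (`run_branch`, `fan_out_fan_in`, the `Experience` record); the point is only that each branch runs against real checks (the back-pressure) and reports what it observed, and the fan-in condenses those observations.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Experience:
    branch: str
    passed: bool
    observations: list[str]


def run_branch(name, candidate, checks) -> Experience:
    """One subagent stand-in: try a candidate against real checks and record what happened."""
    observations = []
    for args, expected in checks:
        got = candidate(*args)
        if got != expected:
            observations.append(f"{name}: f{args} gave {got}, expected {expected}")
    return Experience(name, not observations, observations)


def fan_out_fan_in(candidates, checks):
    # Fan-out: every branch gathers its own experience in parallel.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_branch, n, c, checks) for n, c in candidates.items()]
        experiences = [f.result() for f in futures]
    # Fan-in: condense what the branches learned.
    survivors = [e.branch for e in experiences if e.passed]
    learnings = [o for e in experiences for o in e.observations]
    return survivors, learnings


# Toy task: which candidate implements absolute value?
candidates = {"abs_builtin": abs, "identity": lambda x: x, "negate": lambda x: -x}
checks = [((3,), 3), ((-2,), 2)]
survivors, learnings = fan_out_fan_in(candidates, checks)
```

In a real setup each branch would be an agent session and the fan-in would be a synthesis step rather than a list comprehension; the failed branches still contribute, because their observations are part of what gets remembered.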

333

269

25K

robbwinkle retweeted

Anthropic

@AnthropicAI

4 days ago

Anthropic has confidentially submitted a draft S-1 registration statement to the Securities and Exchange Commission. Pending completion of SEC review, this gives us the option to pursue an initial public offering. Read more: https://t.co/onGZAhRLvD

980

22K

20M

Winkle

@robbwinkle

7 days ago

Guillotine is the new load-bearing

Winkle

@robbwinkle

7 days ago

@irl_danB I finally had a chance to try the dynamic workflows. Agreed. It is definitely RLM-shaped.

robbwinkle retweeted

dan

@irl_danB

7 days ago

if you've been using OpenProse, you now have a bunch of dynamic workflows saved as code that lean on best practice classical engineering principles to build composable scalable dynamic workflows

and all your programs got better and faster and cheaper for free

model: opus-4.8 harness: claude code + /workflow is rapidly approaching Prose Completeness

robbwinkle retweeted

mass

@Memetic_Theory

8 days ago

We present empirical evidence of the first general economic scaling law beyond language data. We are incredibly excited to publish it, and definitively say: Recursive Self-Improvement is a Portfolio Optimization Problem https://t.co/edRoJLiIxW

518

593

101K

robbwinkle retweeted

alex zhang

@a1zhang

8 days ago

In case you're curious about why dynamic workflows are so powerful and the future, read the RLM paper! Opus 4.8 + dynamic workflows in Claude Code is perhaps the first instance of a frontier model seriously trained to be an RLM.

I suspect within a year they'll just become the standard for nearly all coding agent interactions.

169

293K

robbwinkle retweeted

alex zhang

@a1zhang

8 days ago

I think it's becoming clearer that programmatic sub-agent calling is the way to go over the legacy tool-calling format (which I've been pushing for since RLMs came out)! I do wonder though if the generated "workflow" looks more eager or compiled (a design decision I've also been unsure about, because it affects how these models are trained to act); dynamic seems to imply the former but the example they give in the blog makes it kind of unclear. either way, scaling the flexibility of subagent deployment without polluting the context of the main Claude Code instance is gonna be huge

robbwinkle retweeted

Ethan Mollick

@emollick

8 days ago

There is a lot being written about the stylistic tells of AI writing (em-dashes, etc.) but this paper looks at AI narrative tells

Fascinating differences between AI & human narrative, and asking AI to write in different styles doesn't do much to change it https://t.co/azkRHz34NQ https://t.co/oTxSGBNYYE

121

586

392K

robbwinkle retweeted

Joe Schmidt IV

@joeschmidtiv

9 days ago

https://t.co/i7teOUTWgZ

206

robbwinkle retweeted

Wyatt Benno

@wyatt_benno

10 days ago

LLM will help us make all software secure, with formal verification? Right!?

On my last post, a lot of people reached out to me about how "they" love "Lean".. and how it works for 'them' & their LLM workflow, that's great!

But for the rest of 'people', how the hell do we match 'our intent' with the LLM output of formal verification in the first place?

How do normal people know if software is FV or AI slop?

Stay tuned for paper drops and insightful discussion, but for now let's look into how we match intent with spec today 👇

robbwinkle retweeted

Siddhartha Saxena

@siddsax

12 days ago

Anthropic onboarding day: Michael Scott introducing Karpathy like he just signed Wemby in free agency.

393

18K

robbwinkle retweeted

dan

@irl_danB

11 days ago

later this week I'll be excited to release a new kind of harness we've been building

it's one suited particularly well to the OpenProse paradigms of:
* declaring intent in structured language
* pushing all further decision making into the agent

I found that with almost all of the OpenProse systems that we'd been writing, the missing piece was something like an event-based architecture to add dependencies across runs, one that looked something like a data-flow graph between OpenProse Responsibilities

we built a prototype of this in our private cloud before deciding to pull the guts of it out into a relatively simple sdk, which we'll be open sourcing

we've started calling it a Reactor harness, because it draws on elements of React to be efficient at keeping a composed world-model up to date with respect to events and upstream changes

this harness is arguably an RLM, or if not a fully generalized RLM, draws on many RLM-class principles in its architecture, whereby many layers of ReAct sessions are themselves being managed by meta-sessions, where context state is passed--as in the original OpenProse release--via pointer rather than in-context, and where every instantiation of an agent (reactor session or fulfillment session) is in an environment where it has access to a shell that includes the full system state as a git repository that it can write code to interact with

in the same way that React manages dependencies between declarative components, re-rendering only those that need update, the Reactor Harness manages dependencies between "OpenProse Responsibilities" (declared ideal world model state)

in keeping with the current OpenProse tenets, the re-render is still a question that we push into the intelligent agents, so it's more of an "intelligent memoization".

in further keeping with our tenets, all intent authoring/specification still happens at the OpenProse Framework layer: declared markdown contracts and (optional, imperative) ProseScript to dictate agent behavior inside of any given session

we're taking ideas for benchmarks for long-running systems like this, though we're also designing some of our own in absence of others, stay tuned for results

initially I was going to wait until we have benchmarks to release this, but I think it's worth just getting a rough version out to collect feedback before declaring some hill-climbed numbers that don't pass the community vibe check, so please vibe check and send feedback

bookmark this tweet and grok will kindly put the release tweet in your feed later this week. and if you want to try an early-waste-of-time release candidate, send me a DM

(h/t @raw_works and @josemontesdeoca for major contributions here)
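
To make the React analogy concrete, here is a small sketch of dependency-driven re-evaluation in which only nodes downstream of a change are revisited and a `should_rerender` callback gets the final say. This is not the Reactor harness or its SDK, which had not been released at the time of the post; the `Graph` class, its method names, and the callback are all hypothetical, and in the harness described above that callback's role would be played by an agent rather than a function.

```python
class Graph:
    """Toy dependency graph: sources hold values, derived nodes recompute from deps."""

    def __init__(self):
        self.order = []     # insertion order; dependencies must be added first
        self.deps = {}
        self.compute = {}
        self.value = {}

    def source(self, name, value):
        self.order.append(name)
        self.deps[name] = ()
        self.value[name] = value

    def derived(self, name, deps, compute):
        self.order.append(name)
        self.deps[name] = tuple(deps)
        self.compute[name] = compute
        self.value[name] = compute(*(self.value[d] for d in deps))

    def update(self, name, value, should_rerender=lambda node: True):
        """Change a source and re-evaluate only affected nodes; returns those re-evaluated."""
        self.value[name] = value
        dirty, rerendered = {name}, []
        for node in self.order:
            if not any(d in dirty for d in self.deps[node]):
                continue    # nothing upstream changed: memoized value stands
            if not should_rerender(node):
                continue    # the decision-maker chose to keep the old value
            new = self.compute[node](*(self.value[d] for d in self.deps[node]))
            rerendered.append(node)
            if new != self.value[node]:
                self.value[node] = new
                dirty.add(node)
        return rerendered


g = Graph()
g.source("a", 1)
g.source("b", 2)
g.derived("sum", ["a", "b"], lambda a, b: a + b)
g.derived("label", ["b"], lambda b: f"b={b}")
g.derived("report", ["sum"], lambda s: s * 10)
touched = g.update("a", 5)   # "label" depends only on "b", so it is left alone
```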

114

157

13K

Winkle

@robbwinkle

13 days ago

+1 on using property-based testing. Hypothesis for Python. fast-check for TypeScript.

Uncle Bob Martin

@unclebobmartin

13 days ago

My brain is reeling with the implications. I keep having these revelations and I'm beginning to wonder when they will stop. It turns out that property testing is yet another hardening technique that the agents can profitably engage. Agents can determine whether a function is appropriate for property testing, and can specify the range and domain of those tests. They can implement them quickly, run them, and fix any detected issues. I just found two production bugs this way. Property testing is going to be part of my normal practice, along with CRAP analysis, function mutation, acceptance test mutation, DRY analysis, etc.
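
For readers who have not seen property testing, here is a stdlib-only sketch of the idea: state a property that should hold for all inputs, generate many random inputs, and report the first counterexample. It deliberately does not use Hypothesis, which layers input strategies and counterexample shrinking on top of this basic loop; `check_property`, `random_ab_string`, and the run-length functions are hypothetical names for illustration.

```python
import random
from itertools import groupby


def rle_encode(s: str) -> list[tuple[str, int]]:
    """Run-length encode: 'aab' -> [('a', 2), ('b', 1)]."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in pairs)


def random_ab_string(rng: random.Random) -> str:
    """Generator for the input domain: strings over 'ab' up to length 12."""
    return "".join(rng.choice("ab") for _ in range(rng.randint(0, 12)))


def check_property(prop, gen, trials: int = 500, seed: int = 0):
    """Return the first generated input that violates prop, or None if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        value = gen(rng)
        if not prop(value):
            return value
    return None


# Property that should hold: decoding an encoding returns the original string.
roundtrip_failure = check_property(lambda s: rle_decode(rle_encode(s)) == s,
                                   random_ab_string)

# A deliberately wrong property: one run per distinct character (false for 'aba').
wrong_claim_failure = check_property(lambda s: len(rle_encode(s)) == len(set(s)),
                                     random_ab_string)
```

The first check finds no counterexample and the second does, which is the workflow described above: specify the domain, state the property, run it, and fix whatever it turns up.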

846

869

182K

robbwinkle retweeted

Ahmad

@TheAhmadOsman

14 days ago

https://t.co/uRDNr16FHA

650

263K

Winkle

@robbwinkle

13 days ago

Go deep on this Q&A and the podcast if you want to know what works and what doesn’t for accelerating your team with agents, even in a regulated industry.

Darragh Curran

@darraghcurran

14 days ago

https://t.co/Amcmzn0V8j

18K

robbwinkle retweeted

Jake Messer

@jakemesser_

14 days ago

Glass has become an integral part of my daily process at Ramp.

robbwinkle retweeted

Wyatt Benno

@wyatt_benno

17 days ago

Once again, I think Lean is overkill for doing checks on things like simple smart contracts. If you are doing novel math… or writing a cryptography paper, Lean is the way to go. But if you are checking insufficient funds, has authorization, etc., these are SAT problems where SMT works better.

Moreover you can take SMT solvers and wrap them in ZK, making verification succinct, non-interactive, foldable, and suitable for on-chain verification. This is non-trivial for Lean proofs.

Lean has interaction that people argue is good for LLMs. I think it’s this very complexity that makes it worse for taking natural language and one-shot outputting specs. With automated reasoning over SMT we already have 99%+ soundness on this task, i.e. you can already take NL and convert it to SMT with little battle testing. Lean would require a lot more interaction.

Lastly, the verification-aware programming languages and platforms all already use SMT. So if you want to take those specs and convert them into formally verified code, for Solidity, Rust, Go, C# and many others, you already have a good start with SMT tooling.

The default “I like Lean, it works well for math and is powerful” does not mean it’s the best tool for vericoding. It is one option for sure! And it will help secure complex cryptography at the maths level, but at the “I want to create a simple formally verified program” level it’s overkill.

robbwinkle retweeted

Wyatt Benno

@wyatt_benno

16 days ago

“General provers (like Lean) over SMT tooling” for formal vericoding of software is chasing the hype... You will hear Lean for vericoding x1000 times after Vitalik’s post, but the data says different. How well do LLMs do when generating formally verified code with different tools? Dafny (SMT) 82%, Verus 44%, Lean 27% (Bursuc et al., 2509.22908). The gap is automation, not rigor. Lean is the destination and dream, while SMT is most of the road there. Another note: since this study some SMT tooling has gotten to 99%+ with minimal human battle testing. For many small programs you can take natural language given to an automated reasoning model and one-shot produce formally verified code. There are benefits to all approaches. Be wary of the hype around one approach! And just wait until you hear about how we make this all succinctly verifiable ⚡️
