Formal organizational structures are a useful way to think about the challenges of agents.
They provide a template for thinking about how work gets delegated up and down between smart expensive agents & cheaper weaker ones, as well as between narrow specialists & generalists.
The most important weird thing about LLMs is that they are so general. A bigger LLM that is better at coding is also better at ideation & ethical advice & medicine & math. This isn’t true of everything, jaggedness again (see fiction writing!), but it is remarkably true.
As engineering, product, design, DS, etc. melt into a new kind of role, I was reflecting on what roles might look like in the future. For example, when I look at the Claude Code team I see what I think are five archetypes:
1. Prototyper: comes up with brand new ideas; churns out many ideas, most of which don't ship
2. Builder: quickly turns a prototype/idea into production-grade product/infra
3. Sweeper: cleans up the UI, simplifies the code and system, unships, optimizes performance
4. Grower: takes a product that has been built and iterates on it to improve Product-Market Fit
5. Maintainer: owns a mature system to make it secure, reliable, fast, and efficient as it scales
Many people span 2 roles, and sometimes 3. I also notice that these roles are not really tied to job function -- e.g., across Anthropic, some designers match category 1, some 2, some 3; same for engineers, PM, DS.
A healthy team needs a mix of these, depending on the product:
- A product that is new and pre-PMF needs people that are strong at 1+2+3
- A product that is growing and has found PMF needs 2+3+4 and some 5
- A product that has strong PMF needs 3+4+5 and some 2
Maybe product roles of the future will look more like this, and less like the domain-specific roles of today?
I stole this idea and now use it with every single employee.
It’s the best illustration I’ve seen of teaching someone to be high agency.
It says there are 5 levels of work:
Level 1: “There is a problem.”
Level 2: “There is a problem, and I’ve found some causes.”
Level 3: “Here’s the problem, here are some possible causes, and here are some possible solutions.”
Level 4: “Here’s the problem, here’s what I think caused it, here are some possible solutions, and here’s the one I think we should pick.”
Level 5: “I identified a problem, figured out what caused it, researched how to fix it, and I fixed it. Just wanted to keep you in the loop.”
Using this framework, here’s what I say to every new employee…
You will live at Level 4 from Day 1 and as we build trust you will rise to Level 5.
Being high agency doesn’t just mean tackling problems in this way. It means your entire way of working should be oriented to being a Level 4+ employee.
Plz feel free to steal it as well.
And ty @stephsmithio for the framework!
Satya Nadella says the next AI interface is not chat.
It is the control room for 100 agents working overnight under your delegated authority.
"Coding has worked so well that we now have to rebuild the IDE."
"Oh my god, I have these 100 agent sessions. The cognitive load it transfers back to me as a human is so excessive that now I need a new UI."
"The chat as the only artifact is also impossible. So that's why we need a canvas."
Then he gives the enterprise version:
"All through the night there was a bunch of stuff that all these autopilots that I have working on my behalf with my delegated authority did."
And the operator question becomes:
"What did you do? Did I do this work?"
The real bottleneck is not whether agents can generate more tasks.
It is whether humans get the interface, memory, permissions, and audit trail to safely manage 100 delegated workers.
The Claude Code team has been shipping with Claude Tag internally all year.
It now writes 65% of our product team's code, including most of what built Claude Tag itself.
Here are a few ways we use it every day: 🧵
https://t.co/7PLrW06TvH
If you're on your way to building a billion dollar company that involves a web app, here are some of my notes on architecting the frontend.
if you don't do this, it's probably fine, but one day you'll hire someone to fix it, and truly that person could be doing some other higher value thing if you make some key optimizations on day 1
you don't even have to learn anything you're gonna tell your agents to do it anyways!
okay here it goes:
- Make your server code generate an OpenAPI spec, which then generates all the relevant client side code. Never do this by hand. Typing backend types instead of generating them should be banned
- You need to make a decision on how the client talks to the backend. rest/graphql works in which case please just use tanstack query. other libraries will look similar but tanstack query truly is goated.
- if you want linear style sync setups or offline mode, think about this HARD and architect it from day 1. Bolting this on later is so tedious.
- People like using plain react router but things have gotten a lot better since then. Try their new framework mode or just even use tanstack router. Use route data loaders.
- If you store a lot of state in query params, make that a first class citizen and make sure it's type safe. use nuqs or tanstack router.
- Most apps just need a single state management situation for server state and that's it. If you have other bespoke needs, i quite like zustand and xstate/store.
- If you have a super interactive app where things come in and out of view, there's a lot of frontend state to maintain, music is playing and what not, lock in and learn xstate. Trust me if you wanna keep ur sanity, you need to model ur frontend as a state machine otherwise you're gonna be deep in useEffect hell
- React compiler is here my friends, the days of useMemo and useCallback are gone. Update your priors accordingly
- Tailwind is easy and fun but makes it really hard to maintain a large app with consistent styling. You need an "agent-first design system/component library" but maybe this is a rant for another day
- Don't be afraid to hack your routing library to fit your needs more closely. A lot of apps have "drawers" to show additional info. You should 100% be able to say "here's a route, make it a drawer" and everything should be handled from there.
- Managing loading and error states using isPending and isError is madness. Lean into Suspense and ErrorBoundary.
- Figuring out a blessed path for websockets and SSE on day 1 i think will pay dividends in the long term if you're building anything AI related.
- If you're building a SPA, don't use next.js. it literally makes no sense. Why would you do this.
- Definitely deploy on Cloudflare or vercel. There are other services but trust, they have weird missing features.
- Assuming you build something people want, the next job is to build the factory so it can efficiently build the thing. Act accordingly.
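On the state machine point above, here is a minimal sketch in plain TypeScript of what "model ur frontend as a state machine" can mean. It does not use xstate, whose API looks different; the music player, its four states, and the event names are all made up for illustration.

```typescript
// Hypothetical player states and events, invented for this sketch.
type PlayerState = "idle" | "loading" | "playing" | "paused";
type PlayerEvent = "PLAY" | "LOADED" | "PAUSE" | "STOP";

// Every legal transition lives in one table; anything not listed is ignored.
const transitions: Record<PlayerState, Partial<Record<PlayerEvent, PlayerState>>> = {
  idle: { PLAY: "loading" },
  loading: { LOADED: "playing", STOP: "idle" },
  playing: { PAUSE: "paused", STOP: "idle" },
  paused: { PLAY: "playing", STOP: "idle" },
};

// Pure function: returns the next state, or the current one if the event is not allowed here.
function transition(state: PlayerState, event: PlayerEvent): PlayerState {
  return transitions[state][event] ?? state;
}

// Replay a sequence of events from a starting state.
function run(events: PlayerEvent[], start: PlayerState = "idle"): PlayerState {
  return events.reduce(transition, start);
}
```

Because impossible combinations (say, pausing a track that is still loading) simply have no entry in the table, the UI cannot drift into them, which is the main thing a pile of useEffect flags fails to guarantee.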
I was trying to explain to a friend today how Claude Fable and Claude Mythos work and why they are so interesting
Here's what I came up with: Imagine you come over to my house and accidentally leave your car unlocked in the driveway
No big deal, right? Nobody drains your bank account because you forgot to lock your car
But a patient thief sees this as a starting line. He sneaks into your car while we're here having coffee and he doesn't take anything obvious. Instead he finds your gym fob in the cupholder and quietly clones it
That fob opens your locker at the gym the next time you go, and inside he can get your house key
The house key gets him through your front door and on the counter is your work badge
The badge gets him into your office after hours
In your desk drawer is your cafeteria card. The cafeteria card is tied to the building payment system. I don't know why but just follow me. The payment system is linked to payroll. And payroll is linked to the company bank account.
You get the idea: Your unlocked car door became a path to the company bank account through a chain of small little exploits
Every single step was somewhat worthless on its own. The gym fob couldn't touch the bank. The cafeteria card couldn't touch payroll, until it could. The trick was never any one hack. It was seeing how they all connected
That is what Mythos is shockingly good at
It finds a pile of small, boring flaws that everyone ignored because none of them matter alone, and it chains them together in ways nobody thought to try. One bug might only let it peek at a sliver of computer memory. Another might only let it scribble a single byte somewhere on your disk. These things are useless apart. But chained together they get more interesting.
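For the programmers reading: the chaining idea can be sketched as a path search over "what I hold lets me get what". This is only a toy in TypeScript built from the car story above, not a description of how Mythos works internally; the `Step` type and `findChain` function are hypothetical names.

```typescript
// Each "flaw" turns one thing you already hold into one new thing.
// The entries mirror the car-to-bank story; they are illustrative, not real exploits.
type Step = { from: string; to: string };

const steps: Step[] = [
  { from: "unlocked car", to: "gym fob" },
  { from: "gym fob", to: "house key" },
  { from: "house key", to: "work badge" },
  { from: "work badge", to: "cafeteria card" },
  { from: "cafeteria card", to: "payment system" },
  { from: "payment system", to: "payroll" },
  { from: "payroll", to: "bank account" },
];

// Breadth-first search: returns the shortest chain from start to goal, or null if none exists.
function findChain(all: Step[], start: string, goal: string): string[] | null {
  const cameFrom = new Map<string, string | null>([[start, null]]);
  const queue: string[] = [start];
  while (queue.length > 0) {
    const current = queue.shift()!;
    if (current === goal) {
      // Walk the back-pointers to rebuild the chain in order.
      const path: string[] = [];
      let node: string | null = current;
      while (node !== null) {
        path.unshift(node);
        node = cameFrom.get(node) ?? null;
      }
      return path;
    }
    for (const step of all) {
      if (step.from === current && !cameFrom.has(step.to)) {
        cameFrom.set(step.to, current);
        queue.push(step.to);
      }
    }
  }
  return null;
}
```

Calling `findChain(steps, "unlocked car", "bank account")` walks all eight stops of the story, and removing any single step from the list makes it return `null`: no one link reaches the bank, but every link is needed.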
But now extrapolate and give the car burglar one more superpower: He never gets tired, and he can be in a thousand places at once. A normal burglar checks one house at a time. But Mythos checks every door on every street, all night long
That's it! How'd I do?
I think this explains why Mythos and Fable are so interesting. Fable can't do security stuff but it does have the ability to be a long-running agent that can chain together tasks to accomplish big goals
It’s a story as old as time: in 20 years you’ll be a compilation of the books you’ve read, the people you’ve met, and the experiences you’ve had. Human experience should be part of a constant AI feedback loop, therefore we need human input & learning at a scale we haven’t seen.
15 AI related accounts you should follow on Twitter:
1. @karpathy
His tweets create the LLM narratives that you see on LinkedIn 2 months later.
2. @fchollet
posts thoughtful research on intelligence, benchmarks, and AI limitations. Keras creator + ARC-AGI
3. @ylecun
Yann LeCun is a deep learning pioneer & Meta Chief AI Scientist; big-picture research takes and critiques (and drama).
4. @AndrewYNg
Andrew Ng is an AI education legend; practical ML advice, courses, and real-world implementation. Creator of DeepLearning.AI
5. @rasbt
Sebastian Raschka posts on practical ML/LLM implementations, "build from scratch" tutorials, and books.
6. @dair_ai
Weekly ML/AI paper threads and accessible research explainers (high-signal for staying current).
7. @lilianweng
Lilian Weng is ex-OpenAI and her Lil'Log-style threads are good. In-depth LLM research breakdowns.
8. @jeremyphoward
posts interesting takes on AI/crypto news, and works on democratizing practical deep learning and accessible education.
9. @simonw
Simon posts practical LLM tools, takes, experiments, prompting, and engineering breakdowns. Django co-creator
10. @_akhaliq
Curates the latest arXiv papers, model releases, and open-source AI drops.
11. @ID_AA_Carmack
AGI/low-level optimization takes that make you think about the problem.
12. @gwern
Really high-quality long-form AI research notes and essays.
13. @goodside
LLM evaluation, prompting research, and real capabilities testing
14. @drfeifei
Computer vision pioneer; human-centered AI and spatial intelligence research
15. @demishassabis
Been following his work for 9 years. Demis is my hope against Google usurping power with AI. Demis is Google DeepMind's CEO
Let me know who I missed guys
Separate reports by the publicity firm Edelman and Pew Research show that Americans, and more broadly large parts of Europe and the western world, do not trust AI and are not excited about it. (Links in original text, below.) Despite the AI community’s optimism about the tremendous benefits AI will bring, we should take this seriously and not dismiss it. The public’s concerns about AI can be a significant drag on progress, and we can do a lot to address them.
According to Edelman’s survey, in the U.S., 49% of people reject the growing use of AI, and 17% embrace it. In China, 10% reject it and 54% embrace it. Pew’s data also shows many other nations much more enthusiastic than the U.S. about AI adoption.
Positive sentiment toward AI is a huge national advantage. On the other hand, widespread distrust of AI means:
- Individuals will be slow to adopt it. For example, Edelman’s data shows that, in the U.S., those who rarely use AI cite Trust (70%) more than lack of Motivation and Access (55%) or Intimidation by the technology (12%) as an issue.
- Valuable projects that need societal support will be stymied. For example, local protests in Indiana brought down Google’s plan to build a data center there. Hampering construction of data centers will hurt AI’s growth. Communities do have concerns about data centers beyond the general dislike of AI; I will address this in a later letter.
- Populist anger against AI raises the risk that laws will be passed that hamper AI development.
To be clear, all of us working in AI should look carefully at both the benefits and harmful effects of AI (such as deepfakes polluting social media and biased or inaccurate AI outputs misleading users), speak truthfully about both benefits and harms, and work to ameliorate problems even as we work to grow the benefits. But hype about AI’s danger has done real damage to trust in our field. Much of this hype has come from leading AI companies that aim to make their technology seem extraordinarily powerful by, say, comparing it to nuclear weapons. Unfortunately, a significant fraction of the public has taken this seriously and thinks AI could bring about the end of the world. The AI community has to stop self-inflicting these wounds and work to win back society’s trust.
Where do we go from here?
First, to win people’s trust, we have a lot of work ahead to make sure AI broadly benefits everyone. “Higher productivity” is often viewed by general audiences as a codeword for “my boss will make more money,” or worse, layoffs. As amazing as ChatGPT is, we still have a lot of work to do to build applications that make an even bigger positive impact on people’s lives. I believe providing training to people will be a key piece of the puzzle. https://t.co/zpIxRSuky4 will continue to lead the charge on AI training, but we will need more than this.
Second, we have to be genuinely worthy of trust. This means every one of us has to avoid hyping things up or fear mongering, despite the occasional temptation to do so for publicity or to lobby governments to pass laws that stymie competing products (such as open source).
I hope our community can also call out journalism that spreads hype. For example, Nirit Weiss-Blatt wrote a remarkable article about how 60 Minutes’ coverage of an Anthropic study in which Claude, threatened with being shut down, resorted to “blackmail,” was highly misleading. The study carried out a red-teaming exercise in which skilled researchers, after a lot of determined work, finally pushed an AI system into a corner so it demonstrated “blackmailing” behavior. Unfortunately, news reports distorted this and led many to think the “blackmail” behavior occurred naturally rather than only because skilled researchers engineered it to happen. The reports left many with a wildly exaggerated picture of how often AI actually “schemes.” Red-teaming exercises are important to test vulnerabilities of systems, but this particular piece of hype, which was widely circulated, will hurt AI for a long time.
Living in Silicon Valley, I realize I live in a bubble of AI enthusiasts, which is great for exchanging ideas and encouraging each other to build! At the same time, I recognize that AI does have problems, and the AI community needs to address them. I frequently speak with people from many different walks of life. I’ve spoken with artists concerned about AI devaluing their work, college seniors worried about the tough job market and whether AI is exacerbating their challenges, and parents worried about their kids being addicted to, and receiving harmful advice from, chatbots.
I don’t know how to solve all of these problems, but I will work hard to solve as many as I can. And I hope you will too. It will only be through all of us doing this that we can win back society’s trust.
[Original text, with links: https://t.co/oi29S8uu6C ]
Super energized about this! With our App Builder and Workflow agents, you can now build apps and automate workflows in minutes, right in M365 Copilot chat. Here's an example.
My pleasure to come on Dwarkesh last week, I thought the questions and conversation were really good.
I re-watched the pod just now too. First of all, yes I know, and I'm sorry that I speak so fast :). It's to my detriment because sometimes my speaking thread out-executes my thinking thread, so I think I botched a few explanations due to that, and sometimes I was also nervous that I'm going too much on a tangent or too deep into something relatively spurious. Anyway, a few notes/pointers:
AGI timelines. My comments on AGI timelines look to be the most trending part of the early response. The "decade of agents" is a reference to this earlier tweet https://t.co/NiSn6jftqq Basically my AI timelines are about 5-10X pessimistic w.r.t. what you'll find in your neighborhood SF AI house party or on your twitter timeline, but still quite optimistic w.r.t. a rising tide of AI deniers and skeptics. The apparent conflict is not: imo we simultaneously 1) saw a huge amount of progress in recent years with LLMs while 2) there is still a lot of work remaining (grunt work, integration work, sensors and actuators to the physical world, societal work, safety and security work (jailbreaks, poisoning, etc.)) and also research to get done before we have an entity that you'd prefer to hire over a person for an arbitrary job in the world. I think that overall, 10 years should otherwise be a very bullish timeline for AGI, it's only in contrast to present hype that it doesn't feel that way.
Animals vs Ghosts. My earlier writeup on Sutton's podcast https://t.co/rSp1noyGBr . I am suspicious that there is a single simple algorithm you can let loose on the world and it learns everything from scratch. If someone builds such a thing, I will be wrong and it will be the most incredible breakthrough in AI. In my mind, animals are not an example of this at all - they are prepackaged with a ton of intelligence by evolution and the learning they do is quite minimal overall (example: Zebra at birth). Putting our engineering hats on, we're not going to redo evolution. But with LLMs we have stumbled on an alternative approach to "prepackage" a ton of intelligence in a neural network - not by evolution, but by predicting the next token over the internet. This approach leads to a different kind of entity in the intelligence space. Distinct from animals, more like ghosts or spirits. But we can (and should) make them more animal like over time and in some ways that's what a lot of frontier work is about.
On RL. I've critiqued RL a few times already, e.g. https://t.co/mYrMFVdVDW . First, you're "sucking supervision through a straw", so I think the signal/flop is very bad. RL is also very noisy because a completion might have lots of errors that might get encouraged (if you happen to stumble to the right answer), and conversely brilliant insight tokens that might get discouraged (if you happen to screw up later). Process supervision and LLM judges have issues too. I think we'll see alternative learning paradigms. I am long "agentic interaction" but short "reinforcement learning" https://t.co/2L7FiaoKsw. I've seen a number of papers pop up recently that are imo barking up the right tree along the lines of what I called "system prompt learning" https://t.co/df5mJDdN3C , but I think there is also a gap between ideas on arxiv and actual, at scale implementation at an LLM frontier lab that works in a general way. I am overall quite optimistic that we'll see good progress on this dimension of remaining work quite soon, and e.g. I'd even say ChatGPT memory and so on are primordial deployed examples of new learning paradigms.
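A toy illustration of the noise point, in TypeScript. It assumes the simplest outcome-only setup, where every step of a completion is weighted by the single final reward (no baseline, no per-step value estimates). The `StepRecord` type and the `correct` flag are hypothetical: in real training nobody hands you per-step ground truth, which is exactly the problem.

```typescript
// One step of a hypothetical completion; `correct` is ground truth we would not have in practice.
type StepRecord = { text: string; correct: boolean };

// Outcome-only credit: every step is weighted by the same scalar, the final reward.
function outcomeCredit(
  steps: StepRecord[],
  finalReward: number,
): { step: StepRecord; credit: number }[] {
  return steps.map((step) => ({ step, credit: finalReward }));
}

// Count the steps whose credit has the wrong sign relative to their actual quality.
function miscreditedSteps(steps: StepRecord[], finalReward: number): number {
  return outcomeCredit(steps, finalReward).filter(
    ({ step, credit }) => credit > 0 !== step.correct,
  ).length;
}

// Two wrong steps, then a lucky correct answer.
const luckyCompletion: StepRecord[] = [
  { text: "misread the question", correct: false },
  { text: "arithmetic slip", correct: false },
  { text: "stumbles onto the right answer", correct: true },
];

// Two good steps, then a late mistake.
const unluckyCompletion: StepRecord[] = [
  { text: "key insight", correct: true },
  { text: "clean derivation", correct: true },
  { text: "copies the final number wrong", correct: false },
];
```

With a final reward of +1, the lucky completion has both of its wrong steps encouraged; with -1, the unlucky one has both of its good steps discouraged. One scalar per completion is the "straw".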
Cognitive core. My earlier post on "cognitive core": https://t.co/q2s1ihGy0T , the idea of stripping down LLMs, of making it harder for them to memorize, or actively stripping away their memory, to make them better at generalization. Otherwise they lean too hard on what they've memorized. Humans can't memorize so easily, which now looks more like a feature than a bug by contrast. Maybe the inability to memorize is a kind of regularization. Also my post from a while back on how the trend in model size is "backwards" and why "the models have to first get larger before they can get smaller" https://t.co/6k0FZRGXsb
Time travel to Yann LeCun 1989. This is the post that I did a very hasty/bad job of describing on the pod: https://t.co/fQgqaXPyp6 . Basically - how much could you improve Yann LeCun's results with the knowledge of 33 years of algorithmic progress? How constrained were the results by each of algorithms, data, and compute? Case study thereof.
nanochat. My end-to-end implementation of the ChatGPT training/inference pipeline (the bare essentials) https://t.co/SIetgyoKWN
On LLM agents. My critique of the industry is more in overshooting the tooling w.r.t. present capability. I live in what I view as an intermediate world where I want to collaborate with LLMs and where our pros/cons are matched up. The industry lives in a future where fully autonomous entities collaborate in parallel to write all the code and humans are useless. For example, I don't want an Agent that goes off for 20 minutes and comes back with 1,000 lines of code. I certainly don't feel ready to supervise a team of 10 of them. I'd like to go in chunks that I can keep in my head, where an LLM explains the code that it is writing. I'd like it to prove to me that what it did is correct, I want it to pull the API docs and show me that it used things correctly. I want it to make fewer assumptions and ask/collaborate with me when not sure about something. I want to learn along the way and become better as a programmer, not just get served mountains of code that I'm told works. I just think the tools should be more realistic w.r.t. their capability and how they fit into the industry today, and I fear that if this isn't done well we might end up with mountains of slop accumulating across software, and an increase in vulnerabilities, security breaches, etc. https://t.co/8556ESSpyY
Job automation. How the radiologists are doing great https://t.co/FVUI872dkD and what jobs are more susceptible to automation and why.
Physics. Children should learn physics in early education not because they go on to do physics, but because it is the subject that best boots up a brain. Physicists are the intellectual embryonic stem cell https://t.co/p72Elk8lPV I have a longer post that has been half-written in my drafts for ~a year, which I hope to finish soon.
Thanks again Dwarkesh for having me over!