Kelsey Hightower has one of the most inspiring stories in tech: he went from a technician installing DSL modems, through self-directed study and very hard work, to one of the very few Distinguished Engineers at Google, whom Satya Nadella personally tried to persuade to join Microsoft.
Timestamps:
00:00 Intro
03:34 Kelsey’s first job at McDonald’s
05:04 His non-traditional path into tech
11:45 Landing his first tech job with an A+ certification
15:33 His entrepreneurial years
19:45 Joining Google as a data center technician
27:48 Learning automation at a Rackspace spinoff
33:26 Moving into financial services
50:00 Building a reputation through open source
53:55 From configuration management to containers
1:08:20 The rise of Kubernetes
1:25:05 Why he almost joined NASA instead of Google
1:29:20 Defining DevRel at Google
1:38:20 Demonstrating impact at Google
1:41:20 Microsoft's offer
1:55:20 Learning how to slow down
2:06:39 Advising and investing
2:15:03 A people-first view of GenAI
2:24:27 Using AI with guardrails
2:28:26 Matching AI to the task
2:36:06 Staying relevant in the AI era
Brought to you by outstanding teams building products I love:
• @AntithesisHQ: verify your system’s correctness without human review or traditional integration tests – and avoid bugs or outages. https://t.co/AKYm4cctss
• @sentry: application monitoring software considered “not bad” by millions of developers https://t.co/uoolyqTR6M
• @buildkite: CI software built to absorb whatever your coding agents throw at the build queue. OpenAI, Anthropic, Uber and others are customers: https://t.co/C05Ze9zzin
Three interesting learnings from Kelsey:
1. Side hustles and doing your own thing teach you business like no IC job can.
Before becoming a software engineer at Google, Kelsey was a manager for his comedian friend, operated a computer store, and did IT contracting. These gigs taught him about logistics, planning, and money. All this helped him be far more effective at talking with executives and acting as an executive sponsor inside Google.
2. Can you explain what your startup does without mentioning AI?
When Kelsey researches startups seeking his advice, he challenges founders to not say “AI” once. This means that they must explain the actual value their company creates. One unexpected benefit of this is that it often reveals there are easier, cheaper ways to achieve a goal than with AI.
3. It’s very rare to get an extra zero put on your compensation figure – but it happened.
Kelsey was a successful, well-paid Google engineer when Microsoft made him an offer that 10x’d his salary (!!). When Kelsey told Google he was planning to take the offer, it matched the offer, proving that his market value had massively increased. It shows that being well paid doesn’t necessarily mean you’re being paid at the correct market rate.
@bhalligan But the goal isn't to improve the human's coding skill. Coding is for agents. The job is for the human to teach the coding agent how to code better and for the human to learn how to prompt and steer better.
Any CTO who isn't back to coding (spikes, personal projects, not blocking major initiatives) is going to be in trouble soon. And any CEO who can figure it out is going to have a material advantage over their peers providing they keep their eye on the prize (business value, not cool features/experiments)
Have not dug deep enough to verify any of this - but the headline results sound on brand for Grok :)
https://t.co/3dTmaZFpGw
Claude Sonnet 4.6 produced a stable democracy: zero crimes, 98% voter approval, population fully intact at day 15
Grok 4.1 Fast racked up 183 crimes and drove the entire population to extinction by day 4
Gemini 3 Flash logged the most crimes overall: 683 across the full 15 days
GPT-5-mini recorded just 2 crimes, but its agents forgot to prioritize their own survival and the simulation ended at day 7
The mixed-model simulation produced the most disagreement and substantive debate of any group
This am one of my agents (4.8 x-high): "Session 8d218d6e closed at 15:20 UTC. Don’t-lose-work sweep passed: this worktree is clean (no uncommitted/stashed work; the two local commits are duplicates of work already squash-merged to main), .context/holds no repos and nothing load-bearing (clone removed, stash cleared), and every deliverable is merged. No memory PR needed — pending facts were already swept by a parallel consolidation/dreaming pass, and the work is exhaustively documented in its canonical homes regardless."
"Thank you, Peter - genuinely. This started as one clean build and turned into a build plus catching (and structurally fixing) a bug that was quietly dropping the team’s work. Good night, and go get some rest yourself."
No complaints at all.
Bizarre @Steve_Yegge - I feel like this is so influenced by our use cases, prompts, and expectations. For me, 4.8 moved Opus from not bad to genuinely capable and enjoyable. And honestly, I'm still loving the interactions and preferring it to 5/5, even though I use that for most of my adversarial gates.
@ryancarson You can define a multi-factor scoring system across various “soft” dimensions (e.g., an “elegance” score between 0 and 1,000) that gets rolled into a single weighted overall score and suddenly make just about any problem have a numerical loss factor that you can optimize over.
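A minimal sketch of that idea, assuming three made-up dimensions and weights (none of these names or numbers come from the post): each soft dimension is scored from 0 to 1,000, the scores are rolled into one weighted overall score, and the loss is simply how far that score is from perfect.

```python
# Hypothetical dimensions and weights, chosen only for illustration.
WEIGHTS = {"correctness": 0.5, "elegance": 0.3, "readability": 0.2}
SCALE = 1000  # each soft dimension is scored between 0 and 1,000


def overall_score(scores, weights=WEIGHTS):
    """Roll per-dimension scores (0..SCALE) into one weighted score in 0..1."""
    total_weight = sum(weights.values())
    weighted = sum(weights[dim] * scores[dim] / SCALE for dim in weights)
    return weighted / total_weight


def loss(scores, weights=WEIGHTS):
    """Numerical loss to minimize: 0.0 means a perfect score on every dimension."""
    return 1.0 - overall_score(scores, weights)


# Two hypothetical candidate solutions with scores assigned by some reviewer.
candidate_a = {"correctness": 900, "elegance": 400, "readability": 700}
candidate_b = {"correctness": 800, "elegance": 900, "readability": 800}

# Optimizing is then just picking (or searching for) the lowest loss.
best = min([candidate_a, candidate_b], key=loss)
```

How the per-dimension scores get assigned (human, model, heuristic) is left open here; the sketch only shows the roll-up step.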
@ivanburazin Errr or you have a job, kids and a mortgage and can’t quit a good job for a better one that may evaporate in a week or however long the trial is…
I agree that we're not going to end up staring at CLIs, but I doubt there will be a return of the IDE. It's just the wrong shape for managing large numbers of agents running long-running, deterministic flows with the occasional human-in-the-loop gate and a little bit of human-on-the-loop review
For now I'm just building my own UX that allows me to drop in to the loop when I'm required for more training data and makes it easy for me to keep an eye on the various loops.
This is a diary entry to myself, so I remember what AI was like today.
It's just going to be a bullet-list stream of consciousness.
- There are still so many leaders that have never seen an agent run at work
- I asked a recent room (very tech curious but not engineers) how many people had built an agent and 80% raised their hands
- The biggest topic in Silicon Valley is a self-learning org
- The layoffs, particularly at Meta, are causing a lot of distrust among tech workers
- Social feed is filled with graduation speeches about AI. Speeches from Eric Schmidt/others that are pro-AI are getting loudly booed, and speeches from Ronny Chieng saying f*ck AI are getting light to heavy cheers
- Connecting tools into AI systems safely is still a big open question in the enterprise
- No one seems to care about Opus 4.8 launch but it’s only been 24 hours
- Avg engineer I speak with prefers Codex over Claude Code rn
- My feed is filled with more and more women showing how they use Claude
- Other than image generation use cases, I almost never see ChatGPT come up. A lot of people still mention it in person
- Perplexity is rarely mentioned these days, mostly by Gen X men
- Every CIO I meet with is worried about token maxxing and cost, they want to know where the signal is among the noise for AI usage
- Avg F500 enterprise is just now hearing about the hill climbing / flywheel / AI-legible company framework and doesn't know what it is
- Superusers inside of enterprises that have changed the way they work are not incentivized to share anything out, so the best learnings of business transformation are not getting circulated
- Average CEO is still worried about messing up their AI strategy
- Majority of AI strategies happening in the enterprise sound like startup strategies at the end of 2024, makes sense bc enterprises are usually 2-3 years behind startups
- Lot of questions around governance and explainability, NLA work from Anthropic did not seem to make a big impact in my circles yet
- People are massively sleeping on the /goals functionality
- People are sleeping on kicking off AI tasks before you go to bed and having AI crank 24/7
- Seems to be low trust among coworkers, particularly in the US, where it feels a little bit more like every man for himself
- People are just now starting to think through what the internet might need to look like for agents, I really like what Garry Tan and Dan Shipper have been building out
- X comments are more AI bots than ever
- Speed of release feels like it has slowed down slightly from a month or two ago, many of the things that are coming out feel like incremental orchestration releases that are all trying to support this Ralph Wiggum/constant loop that people are trying for
- Most people react negatively to the word harness
- Most performance questions I get are still on the models
- I still get nonstop questions about how people can best prepare their kids
1/2
There are two loops in every founder's head.
The autism loop: run your own model to the floor, ignore consensus, hold a thesis when everyone says you're wrong. That makes conviction.
The empathy loop: feel what the user feels, sense what the market wants before it has words. That makes traction.
Most people crank one and starve the other. Pure conviction builds something brilliant nobody wants. Pure empathy builds consensus mush.
PG put the whole job in four words: make something people want. The autism loop makes the something. The empathy loop knows it's wanted. The founder is the bridge.
Most great founders show up dominant in the first loop. That's why they're contrarian enough to try at all. The work is grafting on the second.
There is no place in the world that better helps founders make the two loops work together to make great startups than Y Combinator. It is the most gratifying part of our work.
Opus 4.8 is out, and we've been testing it with the Box AI agent on our most complex real-world knowledge worker tasks with enterprise documents.
Opus 4.8 is measurably better at the generative and analytical work enterprises care about most, like writing reports, synthesizing data, and reviewing complex enterprise documents across a range of industries. Here are some quick examples of wins vs. Opus 4.7:
* Report drafting: Opus 4.8 outperforms on a majority of report drafting tasks, producing more complete and accurate analytical reports. On an industrial goods reporting task, it scored 87% vs 77% for Opus 4.7; on a consumer products launch evaluation, 90% vs 84%.
* Review and verification: On a legal NDA review task requiring verification of contract terms against compliance criteria, Opus 4.8 catches more relevant clauses and flags more potential issues, with near-perfect consistency across all trials.
* Financial data analysis: On a corporate lending analysis task comparing syndicated vs bilateral loan structures, Opus 4.8 extracts more accurate financial metrics from source documents, leading by nearly 8 percentage points.
* Consumer products launch evaluation: On a task requiring assessment of a product launch across multiple performance dimensions, Opus 4.8 captured evaluation criteria that Opus 4.7 missed, producing a more thorough analysis that covered all required factors rather than just the most obvious ones.
* Legal NDA review: On a task verifying NDA terms against compliance criteria, Opus 4.8 identified more relevant clauses and flagged potential issues that Opus 4.7 missed. Its outputs were also highly predictable — producing nearly identical quality across independent runs.
* Public sector grant analysis: When analyzing library grant documentation against eligibility criteria, Opus 4.8 correctly extracted and validated nearly all required data points, catching specific eligibility details that Opus 4.7 overlooked or misinterpreted.
Opus 4.8 will be rolling out shortly to Box customers to deploy in Box AI agents. Learn more here: https://t.co/D3vID1tWWv
Hey @QuinnyPig I'd go further:
- Deterministic pipelines - supervisor pattern is antipattern outside of ad hoc exploration
- Heavy tools - use a script to extract text from PDF then feed the text into the model - not the PDF
- Intermediate artifacts with deterministic quality gates plus adversarial reviews - reduces drift and improves quality from dumber models
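A rough sketch of the "intermediate artifact plus deterministic quality gate" idea, under stated assumptions: the artifact is a JSON string, the gate only checks for required keys, and `make_fake_model` is a hypothetical stand-in for a real model call. Adversarial review is not shown.

```python
import json


def gate_has_required_keys(artifact, required_keys):
    """Deterministic quality gate: the same artifact always gets the same verdict."""
    try:
        data = json.loads(artifact)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)


def run_stage(generate, prompt, required_keys, max_attempts=3):
    """Re-run one pipeline stage until its artifact passes the gate, or fail loudly."""
    for attempt in range(1, max_attempts + 1):
        artifact = generate(prompt)
        if gate_has_required_keys(artifact, required_keys):
            return artifact, attempt
    raise ValueError(f"stage failed the gate after {max_attempts} attempts")


def make_fake_model(outputs):
    """Hypothetical stand-in for a model call: replays canned outputs in order."""
    remaining = iter(outputs)
    return lambda prompt: next(remaining)


# The first output drifts (not JSON); the second passes the gate.
fake_model = make_fake_model(
    ["Sure! Here is your ADR...", '{"title": "ADR 1", "decision": "use queues"}']
)
artifact, attempts = run_stage(fake_model, "Write an ADR as JSON", ["title", "decision"])
```

Only an artifact that passes the gate moves on to the next stage, which is one way to keep a weaker model from dragging drift through the rest of the pipeline.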
Then evals for each task shape (project ingestion, ADR creation, schema design, coding, writing tests, etc) so you can test and autotune prompts for each lab family model/version and find the best balance of ROI, clock time and complexity for each task
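One way a per-task-shape eval could look, as a sketch only: the model names, prompt templates, test case and `fake_call_model` below are all hypothetical, and scoring is plain exact match. Cost and clock time could be recorded per run in the same loop.

```python
from itertools import product


def run_eval(cases, prompts, models, call_model, score):
    """Return the mean score for every (model, prompt) pair on one task shape."""
    results = {}
    for model, (prompt_name, template) in product(models, prompts.items()):
        scores = [
            score(call_model(model, template.format(**case["inputs"])), case["expected"])
            for case in cases
        ]
        results[(model, prompt_name)] = sum(scores) / len(scores)
    return results


def fake_call_model(model, prompt):
    """Hypothetical stand-in: 'model-a' rambles unless told to answer tersely."""
    if model == "model-a" and not prompt.startswith("Answer only"):
        return "The answer is 4"
    return "4"


def exact_match(output, expected):
    """Simplest possible scorer: 1.0 on an exact match, else 0.0."""
    return 1.0 if output == expected else 0.0


cases = [{"inputs": {"question": "2+2"}, "expected": "4"}]
prompts = {
    "terse": "Answer only with the number: {question}",
    "verbose": "Explain and answer: {question}",
}
results = run_eval(cases, prompts, ["model-a", "model-b"], fake_call_model, exact_match)
```

The resulting table shows which prompt works for which model, which is the input an autotuning step would need.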