10+ years SRE building AlertKick. Notes on running real systems at scale - Kafka, eBPF, Temporal, AWS Bedrock & AI, and systems engineering glue between them.
Kafka lessons I learned the hard way running it in production.
Not the tutorial stuff. The things that matter once you're past "hello world" and actually depending on it 🧵
@shreyas I see the same failure in incident postmortems: the clever-sounding root cause is never the real one, it just makes everyone feel smart. Writing in boring language forces you down to what actually broke.
Linux load average is not CPU usage. This trips up everyone exactly once.
Load is the number of processes either running, waiting to run, OR stuck in uninterruptible sleep, which usually means disk IO. So a load of 8 on an 8-core box with low CPU isn't a fully utilised machine, it's IO bound and probably suffering.
If load is high but CPU is low, stop looking at compute and start looking at disk.
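A minimal sketch of that rule of thumb, assuming you already have the 1-minute load (the first field of /proc/loadavg), the core count, and CPU busy and iowait fractions from something like /proc/stat. The `LoadSample` type, the `diagnose` function, and all three thresholds are made up for illustration, not tuned values:

```python
from dataclasses import dataclass


@dataclass
class LoadSample:
    load1: float     # 1-minute load average
    cores: int       # number of logical CPUs
    cpu_busy: float  # fraction of CPU time in user + system, 0.0 to 1.0
    iowait: float    # fraction of CPU time in iowait, 0.0 to 1.0


def diagnose(s: LoadSample,
             load_ratio: float = 1.0,       # assumed: load per core that counts as "high"
             busy_threshold: float = 0.7,   # assumed: CPU fraction that counts as "busy"
             iowait_threshold: float = 0.2  # assumed: iowait fraction that counts as "high"
             ) -> str:
    """Classify a high-load box as CPU bound or IO bound."""
    if s.load1 / s.cores < load_ratio:
        return "ok"
    if s.cpu_busy >= busy_threshold:
        return "cpu-bound"
    if s.iowait >= iowait_threshold:
        return "io-bound"
    # High load, idle CPU, low iowait: needs a closer look (e.g. other D-state waits).
    return "unclear"


# Load of 8 on an 8-core box, CPU mostly idle, lots of iowait.
print(diagnose(LoadSample(load1=8.0, cores=8, cpu_busy=0.15, iowait=0.40)))
```

The ordering is the point: check CPU first, and only call it IO bound when load is high and the CPU is not the thing that is busy.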
@RobHoffman_ Totally agree, the trap is that perfectionism feels like high standards but it's usually just fear wearing a nicer outfit. Shipping something and fixing in public beats polishing something nobody's seen.
This applies cleanly to building software.
The best products I've used were clearly made by someone who wanted to use them. The worst were made for a market. You can feel the difference in the first five minutes.
Build the house you'd want to live in. The audience comes second.
Rick Rubin’s House on the Mountain test:
Create according to your own taste, not for applause, critics, algorithms, or market demand.
“Imagine going to live on a mountaintop by yourself, forever. You build a home that no one will ever visit. Still, you invest the time and effort to shape the space in which you’ll spend your days. The wood, the plates, the pillows—all magnificent. Curated to your taste.”
“This is the essence of great art. We create our art so we may inhabit it ourselves.”
“I'm willing to go to extremes to make the thing that I want to inhabit and it's not for anyone else. it's just for me.”
Reading about companies burning millions on AI tokens, some blowing through their whole annual budget in months. And of course the predictable next move: internal dashboards measuring "which team is burning the most tokens," presented as AI adoption.
It's the wrong metric. Measures consumption, not creation. And in a year of AI-related layoffs, it tells every employee one thing - use tokens or look obsolete.
So they go tokenmaxxing to look busy. Creating code nobody asked for, docs nobody reads, markdown files that exist to prove an AI was used. And then those files get fed back into the next session as context, so now we're burning more tokens to process the slop the last session created.
Jensen Huang says he'd be alarmed if his engineers weren't burning hundreds of thousands in tokens. Uber's own COO has quietly admitted there's no link between tokenmaxxing and shipping anything useful.
Measure what's built, not what's burned.
@tibo_maker That's right, time away is actually good. I always have more energy and a more focused mind after a break. Also, seeing new things and places does something to your brain and kicks you out of your default mode.
@nishantmodak Very cool 😎. Yes, Slack has most of the context. I've solved so many issues just by searching Slack history, and with Slack org search it's even more helpful. I sometimes find changes from channels I'm not even in. 😀
Next thing I'm building into the AlertKick Slack app: agentic chat. Not "ask the bot a question" but actually letting it do things on your behalf during an incident.
The pattern I want: you @ the bot in the alert thread and say "what changed in the last hour" or "silence this for 30 minutes" or "show me the metric that triggered this."
No leaving Slack. No clicking through dashboards. The bot has tools and uses them.
The wider thesis: incident response is mostly context-gathering, and most of that context-gathering is the same five queries every time.
If a tool-using agent can do that work in the alert thread, the on-call engineer gets to focus on the decision, not the digging.
More soon.
The interesting design problem is permissions and scope. An agent that can silence alerts, run queries, or pull logs is useful.
The same agent without guardrails is a great way to mute a real outage by accident.
Trying to get the trust model right before the feature lands.
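One way the guardrail idea could look, as a rough sketch and not AlertKick's actual design. Everything here is hypothetical: the tool names, the `ToolCall` type, the `authorize` function, and the 60-minute cap. The assumption is that read-only tools run freely, while anything that mutates alert state is limited to on-call users, bounded in duration, and still needs a human confirmation:

```python
from dataclasses import dataclass

# Hypothetical tool names, split by blast radius.
READ_ONLY_TOOLS = {"recent_changes", "show_metric", "pull_logs"}
MUTATING_TOOLS = {"silence_alert"}
MAX_SILENCE_MINUTES = 60  # assumed cap so a silence can never be open-ended


@dataclass(frozen=True)
class ToolCall:
    tool: str
    requested_by: str  # user who @-mentioned the bot
    minutes: int = 0   # only meaningful for silence_alert


def authorize(call: ToolCall, on_call_users: set[str]) -> str:
    """Return "allow", "confirm" (ask the human first), or "deny"."""
    if call.tool in READ_ONLY_TOOLS:
        return "allow"
    if call.tool in MUTATING_TOOLS:
        if call.requested_by not in on_call_users:
            return "deny"
        if not 0 < call.minutes <= MAX_SILENCE_MINUTES:
            return "deny"
        return "confirm"
    # Unknown tools are denied by default.
    return "deny"


on_call = {"alice"}
print(authorize(ToolCall("recent_changes", "bob"), on_call))                # allow
print(authorize(ToolCall("silence_alert", "alice", minutes=30), on_call))   # confirm
print(authorize(ToolCall("silence_alert", "alice", minutes=600), on_call))  # deny
```

Deny-by-default plus a confirm step on mutations is the conservative choice: the worst case becomes an extra click, not a muted outage.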
AlertKick now has a Slack app.
Alerts post to your channel with full context, ack and resolve work from the buttons, and the channel updates as alerts close.
Built so the on-call engineer never has to leave Slack to triage.
Small but it was the missing piece for actually using it day-to-day. https://t.co/576PLpjv6z