This month's Detect covers:
- Lessons from an LLM-triggered outage
- How to get more out of OpenTelemetry with Common Reliability Enumerations (CREs)
- Tips for avoiding application migration pain
- Quick hits on recent Google & Cloudflare incidents
Link below
🚨 Dropping soon: Two new open source projects to spot bugs, misconfigs & other pitfalls — before they bite. Host: Kelsey Hightower + @snowboardvstree.
🧠 No pitch. No slides. Just code, Q&A, and instant repo access. ✅ Tricks you can use today. Be first in. #reliability
🚨 April issue of Detect is out — and it’s a banger.
✅ OpenAI 429s (spoiler: not just traffic)
✅ Cloudflare’s rollout gone wrong
✅ Memory lies devs still believe
✅ A smarter take on availability from Riot Games
For the engineers on the hook when things break.
Link below 👇
featuring the incredible work of:
Lorin Hochstein, Brendan Humphreys, Canva, Lawrence Abrams, Ahmet Alp Balkan, Ankush Menat, Rustunit, josson paul, Dmitry Pogrebnoy, Rustam Kovhaev, Sooter Saalu, Rebecca Weng, Sean Madden, sidm0, Lucas Pardue, Evan Rittenhouse, Magnus Groß, Kyle Wiggers
🚨 Outages, debugging, and scaling chaos! 🚨
OpenAI and Canva faced downtime, a Kubernetes add-on showed its quirks, and debugging Rust proved to be no walk in the park.
Our latest newsletter is out. https://t.co/BhBjKYmqvH
#Kubernetes #debug #SRE #reliability #problemdetection
Proactively searching for these anti-patterns is hard enough. Doing something about them before they explode is even harder. This is why I like community-driven problem detection: you can build on the community's experience as well as your own to help make the case for action.
🎙️ Missed our live podcast with Denis Bakhvalov, Intel's performance ninja? 🥷🏻
🚀 CPU trends shaping app performance
🔍 How compilers optimize modern software
📈 Benchmarking best practices
⚖️ Performance vs. scalability tradeoffs
Link in the comments.
#Performance #SRE #CPU
🙌 Honored to have @niallm as the first guest on our podcast. We discuss "Problem Detection & Management".
Co-author of Site Reliability Engineering: How @google Runs Production Systems; CEO/co-founder of @StanzaSystems
Host: Prequel CTO @snowboardvstree.
Full episode below. #sre
🚨 September's newsletter is live! 🚨 Packed with real-world debugging stories, ⚠️ major incidents at OpenAI, Anthropic, HubSpot, and Google, a hidden bug in Kafka, and more...
Link below
Featuring: @danslimmon @fntlnz @srecon @bschiett @Swizec @andreabergia @jirfag